2 RMT Working Group Brian Whetten
3 Internet Engineering Task Force Consultant
4 Internet Draft Dah Ming Chiu
5 Document: draft-ietf-rmt-bb-track-02.txt Miriam Kadansky
6 November 2002 Sun Microsystems
7 Expires May 2003 Seok Joo Koh
8 ETRI
9 Gursel Taskale
10 TIBCO

12 Reliable Multicast Transport Building Block for TRACK
13

15 Status of this Memo

17 This document is an Internet-Draft and is in full conformance with
18 all provisions of Section 10 of RFC2026.

20 Internet-Drafts are working documents of the Internet Engineering
21 Task Force (IETF), its areas, and its working groups. Note that other
22 groups may also distribute working documents as Internet-Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."

28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt
30 The list of Internet-Draft Shadow Directories can be accessed at
31 http://www.ietf.org/shadow.html.

33 Abstract

35 This document describes the TRACK Building Block. It contains
36 functions relating to positive acknowledgments and hierarchical tree
37 construction and maintenance. It is primarily meant to be used as
38 part of the TRACK Protocol Instantiation. It is also designed to be
39 useful as part of overlay multicast systems that wish to offer
40 efficient confirmed delivery of multicast messages.

42 Conventions used in this document

44 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
45 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
46 document are to be interpreted as described in RFC-2119.

48 1.
Introduction

49 One of the protocol instantiations the RMT WG is chartered to create
50 is a TRee-based ACKnowledgement protocol (TRACK). Rather than create
51 a set of monolithic protocol specifications, the RMT WG has chosen to
52 break the reliable multicast protocols into Building Blocks (BB) and
53 Protocol Instantiations (PI). A Building Block is a specification of
54 the algorithms of a single component, with an abstract interface to
55 other BBs and PIs. A PI combines a set of BBs, adds in the
56 additional required functionality not specified in any BB, and
57 specifies the specific instantiation of the protocol. For more
58 information, see the Reliable Multicast Transport Building Blocks and
59 Reliable Multicast Design Space documents [2][3].

61 As specified in [2], there are two primary reliability requirements
62 for a transport protocol: ensuring goodput, and confirming delivery
63 to the Sender. The NORM and ALC PIs are responsible solely for
64 ensuring goodput. TRACK is designed to offer application level
65 confirmed delivery, aggregation of control traffic and Receiver
66 statistics, local recovery, automatic tree building, and enhanced
67 flow and congestion control.

69 Whereas the NORM and ALC PIs run only over other building blocks, the
70 TRACK PI has a more difficult integration task. To run in
71 conjunction with NORM, it must either re-implement the functionality
72 of the NORM PI or integrate directly with the NORM PI. In addition,
73 in order to have reasonable commercial applicability, TRACK needs to
74 be able to run over other protocols in addition to NORM. To meet
75 both of these challenges, the TRACK PI is designed to integrate with
76 other transport layer protocols, including NORM, PGM [20], ALC [19],
77 UDP, or an overlay multicast system. In order to accomplish this,
78 there can be multiple TRACK PIs, one for each transport protocol it
79 is specified to integrate with.
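Because the TRACK BB is meant to layer over any of these Data Channel
Protocols, an implementation would typically hide them behind a common
interface. The following Python sketch is purely illustrative; the
class and method names are inventions of this example and are not
defined by any TRACK PI:

```python
from abc import ABC, abstractmethod

class DataChannel(ABC):
    """Hypothetical interface a TRACK implementation could program
    against so it stays independent of the underlying Data Channel
    Protocol (NORM, ALC, PGM, UDP, or an overlay multicast system)."""

    @abstractmethod
    def send(self, seq, payload):
        """Multicast one sequenced Data message to the session."""

    @abstractmethod
    def set_receive_callback(self, callback):
        """Register a handler invoked as callback(seq, payload)."""

class LoopbackChannel(DataChannel):
    """Trivial in-process stand-in, useful only for testing."""

    def __init__(self):
        self._callback = None

    def send(self, seq, payload):
        if self._callback is not None:
            self._callback(seq, payload)

    def set_receive_callback(self, callback):
        self._callback = callback
```

A TRACK PI would supply one concrete binding per supported protocol,
mirroring the one-PI-per-Data-Channel-Protocol structure described
above.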
The vast majority of the protocol
80 functionality exists in this document, the TRACK BB, which in turn
81 references the automatic tree building block [16]. For more details
82 on the specific functionality of TRACK, please see the reference
83 TRACK PI [21].

85 TRACK is organized around a Data Channel and a Control Channel. The
86 Data Channel is responsible for multicasting data from the Sender to all
87 other nodes in a TRACK session. In order to integrate with NORM and
88 other goodput-ensuring transport protocols, these protocols are used
89 as the Data Channel for a given Data Session. This Data Channel MAY
90 also provide congestion control. Otherwise, congestion control MUST
91 be provided by the TRACK PI, by using TFMCC or another
92 approved congestion control building block.

94 This document describes the TRACK Building Block. It contains
95 functions relating to positive acknowledgments and hierarchical tree
96 construction and maintenance. While named as a building block, this
97 document describes more functionality than the PI documents. With
98 the exception of congestion control, almost all of the functionality
99 is encapsulated in this document or the BBs it references. The TRACK
100 PIs are then primarily responsible for instantiating packet formats
101 in conjunction with the other transport protocol each uses as its Data
102 Channel.

104 The TRACK BB assumes that there is an Automatic Tree Building BB [16]
105 which provides the list of parents (known as Service Nodes within the
106 Tree BB) each node should join to. If Receivers are used that may
107 also serve as Repair Heads, the TRACK BB assumes the Auto Tree BB is
108 also responsible for selecting the role of each Receiver as either
109 Receiver or Repair Head. However, the TRACK BB may specify that a
110 particular node may not operate as a Repair Head.
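The division of labor with the Auto Tree BB can be pictured as a small
query interface: the tree BB reports a node's role and an ordered list
of candidate parents, and the TRACK side binds to the first acceptable
one. This sketch is illustrative only; the names below are invented
for the example and are defined by neither BB:

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    RECEIVER = "receiver"
    REPAIR_HEAD = "repair_head"

@dataclass
class TreeAdvice:
    """What the Auto Tree BB is assumed to hand each node: its role
    and its candidate parents (Service Nodes), best candidate first."""
    role: Role
    parents: list

def choose_parent(advice, refused=()):
    """Return the first candidate parent that has not refused (or
    failed) a Bind; None means the node cannot join the Control Tree.
    A node that must never act as a Repair Head would simply carry
    Role.RECEIVER regardless of what the tree BB suggests."""
    for parent in advice.parents:
        if parent not in refused:
            return parent
    return None
```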
112 The TRACK BB also assumes that a separate session advertisement
113 protocol notifies the Receivers as to when to join a session, the
114 data multicast address for the session, and the control parameters
115 for the session. This functionality MAY be provided in a TRACK PI
116 document.

118 The TRACK BB provides the following detailed functionality.

120 - Hierarchical Session Creation and Maintenance. This set of
121 functionality is responsible for creating and maintaining (but not
122 configuring) a hierarchical tree of Repair Heads and Receivers.
123 - Bind. When a child knows the parent it wishes to join to for
124 a given Data Session, it binds to that parent.
125 - Unbind. When a child wishes to leave a Data Session, either
126 because the session is over or because the application is
127 finished with the session, it initiates an unbind operation
128 with its parent.
129 - Eject. A parent can also force a child to unbind. This
130 happens if the parent needs to leave the session, if the child
131 is not behaving correctly, or if the parent wants to move the
132 child to another parent as part of tree configuration
133 maintenance.
134 - Fault Detection. In order to verify liveness, parents and
135 children send regular heartbeat messages between themselves.
136 The Sender also sends regular null data messages to the group,
137 if it has no data to send.
138 - Fault Recovery. When a child detects that its parent is no
139 longer reachable, it may switch to another parent. When a
140 parent detects that one of its children is no longer
141 reachable, it removes that child from its membership list and
142 reports this up the tree to the Sender of the Data Session.
143 - Distributed Membership. Each Parent is responsible for
144 maintaining a local list of the children attached to it.

146 - Data Sessions. This functionality is responsible for the reliable,
147 ordered transmission of a set of data messages, which together
148 constitute a Data Session.
These are initially transmitted using
149 another transport protocol, the Data Channel Protocol, which has
150 primary responsibility for ensuring goodput and congestion control.
151 - Data Transmission. The Sender takes sequenced data messages
152 from the application, and passes them to the Data Channel
153 Protocol for multicast transmission. It delays passing them
154 to the Data Channel Protocol if it is presently flow
155 controlled.
156 - Flow Control and Buffer Management. Receivers and Repair
157 Heads MAY maintain a set of buffers that are at least as large
158 as the Sender's transmission window. The Receivers pass their
159 reception status up to the Sender as part of their TRACK
160 messages. This MAY be used to advance the buffer windows at
161 each node and limit the Sender's window advancement to the
162 speed of the slowest Receiver.
163 - Retransmission Requests. While primary responsibility for
164 goodput rests with the Data Channel Protocol, Receivers MAY
165 request retransmission of lost messages from their parents.
166 - Local Recovery. Repair Heads keep track of retransmission
167 requests from their children, and provide repairs to them. If
168 a Repair Head cannot fulfill a retransmission request, it
169 forwards it up the tree.
170 - End of Stream. When a Data Session is completed, this is
171 signaled as an End of Stream condition.

173 - TRACK Generation and Aggregation. This set of functionality is
174 responsible for periodically generating TRACK messages from all
175 Receivers and aggregating them at Repair Heads. These messages
176 provide updated flow control window information, roundtrip time
177 measurements, and congestion control statistics. They OPTIONALLY
178 acknowledge receipt of data, OPTIONALLY report missing messages,
179 and OPTIONALLY provide group statistics. The algorithms include:
180 - TRACK Timing.
In order to avoid ACK implosion, the Receivers
181 and Repair Heads use timing algorithms to control the speed at
182 which TRACK messages are sent.
183 - TRACK Aggregation. In order to provide the highest levels of
184 scalability and reliability, interior tree nodes provide
185 aggregation of control traffic flowing up the tree. The
186 aggregated feedback information includes that used for
187 end-to-end confirmed delivery, flow control, congestion control, and
188 group membership monitoring and management.
189 - Statistics Request. A Sender may prompt Receivers to generate
190 and report a set of statistics back to the Sender. These
191 statistics are self-describing data types, and may be defined
192 by either the TRACK PI or the application.

194 - Statistics Aggregation. In addition to the predefined
195 aggregation types, aggregation of self-describing data may
196 also be performed on Receiver statistics flowing up the tree.

198 - Application Level Confirmed Delivery. Senders can issue requests
199 for application level confirmation of data up to a given message.
200 Receivers reply to this request, and the confirmations are reliably
201 forwarded up the tree.

203 - Distributed RTT Calculations. One of the primary challenges of
204 congestion control is efficient RTT calculations. TRACK provides
205 two methods to perform these calculations.
206 - Sender Per-Message RTT Calculations. On demand, a Sender
207 stamps outgoing messages with a timestamp. As each TRACK is
208 passed up the tree, the amount of dally time spent waiting at
209 each node is accumulated. The lowest measurements are passed
210 up the tree, and the dally time is subtracted from the
211 original measurement.
212 - Local Per-Level RTT Calculations. Each parent measures the
213 local RTT to each of its children as part of the keep-alive
214 messages used for failure detection.

216 2.
Applicability Statement 218 The primary objective of TRACK is to provide additional functionality 219 in conjunction with a receiver reliable protocol. These functions 220 MAY include application layer reliability, enhanced congestion 221 control, flow control, statistics reporting, local recovery, and 222 automatic tree building. It is designed to do this while still 223 offering scalability in the range of 10,000s of Receivers per Data 224 Session. The primary corresponding design tradeoffs are additional 225 complexity, and lower isolation of nodes in the face of network and 226 host failures. 228 There is a fundamental tradeoff between reliability and real-time 229 performance in the face of failures. There are two primary types of 230 single layer reliability that have been proposed to deal with this: 231 Sender reliable and Receiver reliable delivery. Sender reliable 232 delivery is similar to TCP, where the Sender knows the identity of 233 the Receivers in a Data Session, and is notified when any of them 234 fails to receive all the data messages. Receiver reliable delivery 235 limits knowledge of group membership and failures to only the actual 236 Receivers. Senders do not have any knowledge of the membership of a 237 group, and do not require Receivers to explicitly join or leave a 238 Data Session. Receiver reliable protocols scale better in the face 239 of networks that have frequent failures, and have very high isolation 240 of failures between Receivers. This TRACK BB provides Sender 241 reliable delivery, typically in conjunction with a Receiver reliable 242 system. 244 This BB is specified according to the guidelines in [21]. It 245 specifies all communication between entities in terms of messages, 246 rather than packets. A message is an abstract communication unit, 247 which may be part of, or all of, a given packet. 
It does not have a
247 specific format, although it does contain a list of fields, some of
248 which may be optional, and some of which may have fixed lengths
249 associated with them. It is up to each protocol instantiation to
250 combine the set of messages in this BB with those in other
251 components, and create the actual set of packet formats that will be
252 used.

254 As mentioned in the introduction, this BB assumes the existence of a
255 separate Auto Tree Configuration BB. It also assumes that Data
256 Sessions are advertised to all Receivers as part of an external BB or
257 other component.

259 Except where noted, this applicability statement is applicable both
260 to the TRACK BB and to the TRACK PIs.

262 2.1 Application Types

264 TRACK is designed to support a wide range of applications that
265 require one to many bulk data transfer and application layer
266 confirmed delivery. Examples of applications that fit into the
267 one-to-many data dissemination model are: real time financial news and
268 market data distribution, electronic software distribution, audio
269 video streaming, distance learning, software updates and server
270 replication.

272 Historically, financial applications have had the most stringent
273 reliability requirements, while audio video streaming has had the
274 least stringent. For applications that do not require this level of
275 reliability, or that demand the lowest levels of latency and the
276 highest levels of failure isolation, TRACK may be less applicable.

278 TRACK is designed to work in conjunction with a receiver reliable
279 protocol such as NORM, to allow applications to select this tradeoff
280 on a dynamic basis.

282 2.2 Network Infrastructure

284 TRACK is designed to work over almost all multicast and broadcast
285 capable network infrastructures. It is specifically designed to be
286 able to support both asymmetrical and single source multicast
287 environments.
290 Asymmetric networks with very low upbound bandwidth and a very low
291 loss Data Channel may be better served solely through NACK-based
292 protocols, particularly if high reliability is not required. Some
293 satellite networks are a good example.

295 Networks that have very high loss rates, and regularly experience
296 partial network partitions, router flapping, or other persistent
297 faults, may be better served through NACK-only protocols. Some
298 wireless networks fall into this category.

300 2.3 Private and Public Networks

302 TRACK is designed to work in private networks, controlled networks,
303 and in the public Internet. A controlled network typically has a
304 single administrative domain, has more homogeneous network bandwidth,
305 and is more easily managed and controlled. These networks have the
306 fewest barriers to IP multicast deployment and the most immediate
307 need for reliable multicast services. Deployment in the Internet
308 requires a protocol to span multiple administrative domains, over
309 vastly heterogeneous networks.

311 2.4 Manual vs. Automatic Controls

313 Some networks can take advantage of manual or centralized tools for
314 configuring and controlling the usage of a reliable multicast group.
315 In the public Internet the tools have to span multiple administrative
316 domains where policies may be inconsistent. Hence, it is preferable
317 to design tools that are fully distributed and automatic. To address
318 these requirements, TRACK provides automatic configuration, but can
319 also support manual configuration options.

321 2.5 Heterogeneous Networks

323 While the majority of controlled networks are symmetrical and support
324 many-to-many multicast, in designing a protocol for the Internet, we
325 must deal with virtually all major network types. These include
326 asymmetrical networks, satellite networks, networks where only a
327 single node may send to a multicast group, and wireless networks.
328 TRACK takes this into account by not requiring any many-to-many 329 multicast services. TRACK does not assume that the topology used for 330 sending control messages has any congruence to the topology of the 331 multicast address used for sending data messages. 333 2.6 Use of Network Infrastructure 335 TRACK is designed to run in either single level or hierarchical 336 configurations. In a single level, there is no need for specialized 337 network infrastructure. In hierarchical configurations, special 338 nodes called Repair Heads are defined, which may run either as part 339 of a distributed application, or as part of dedicated server 340 software. TRACK does not specifically support or require Generic 341 Router Assist or other router level assist. 343 2.7 Deployment Constraints 344 The two primary tradeoffs TRACK has, for the functionality it 345 provides, are additional complexity, and decreased failure isolation. 346 Hence, if target applications are to be deployed in networks with 347 high rates of persistent failures, and isolation of failed Receivers 348 from affecting other Receivers is of high importance, TRACK may not 349 be appropriate. Similarly, if simplicity is paramount, TRACK may not 350 be appropriate. 352 2.8 Target Scalability 354 The target scalability of TRACK is tens of thousands of simultaneous 355 Receivers per Data Session. Dedicated Repair Heads are targeted to 356 be able to support thousands of simultaneous Data Sessions. 358 2.9 Known Failure Modes 360 If a hierarchical Control Tree is misconfigured, so that loop-free, 361 contiguous connection is not provided, failure will occur. This 362 failure is designed to occur gracefully, at the initialization of a 363 Data Session. 365 If the configuration parameters on control traffic are poorly chosen 366 on an asymmetrical network, where there is much less control channel 367 bandwidth available than data channel bandwidth, there may be a very 368 high rate of control traffic. 
This control traffic is not
369 dynamically congestion controlled like the data traffic, and so could
370 potentially cause congestion collapse.

372 This potential control channel overload could be exacerbated by an
373 application that makes overly heavy use of the application level
374 confirmation or statistics gathering functions.

376 2.10 Potential Conflicts With Other Components

378 None are known at this time.

380 3. Architecture Definition

382 3.1 TRACK Entities

384 3.1.1 Node Types
385 TRACK divides the operation of the protocol into three major
386 entities: Sender, Receiver, and Repair Head. The Repair Head
387 corresponds to the Service Node described in the Tree Building draft.
388 It is assumed that Senders and Receivers typically run as part of an
389 application on an end host client. Repair Heads MAY be components in
390 the network infrastructure, managed by different network managers as
391 part of different administrative domains, or MAY run on an end host
392 client, in which case they function as both Receivers and Repair
393 Heads. Absent any automatic tree configuration, it is assumed
394 that the Infrastructure Repair Heads have relatively static
395 configurations, which consist of a list of nearby possible Repair
396 Heads. Senders and Receivers, on the other hand, are transient
397 entities, which typically only exist for the duration of a single
398 Data Session. In addition to these core components, applications that
399 use TRACK are expected to interface with other services that reside
400 in other network entities, such as multicast address allocation,
401 session advertisement, network management consoles, DHCP, DNS,
402 overlay networking, application level multicast, and multicast key
403 management.

405 3.1.2 Multicast Group Address

407 A Multicast Group Address is a logical address that is used to
408 address a set of TRACK nodes.
It is RECOMMENDED that this address consist of a pair:
409 an IP multicast address and a UDP port number. In this
410 case, it may optionally have a Time To Live (TTL) value, although
411 this value MUST only be used for providing a global scope to a Data
412 Session, and not for scoping of local retransmissions. Data Multicast
413 Addresses are Multicast Group Addresses.

415 TRACK MAY be used with an overlay multicast or application layer
416 multicast system. In this case, a Multicast Group Address MAY have a
417 different format. The TRACK PI is responsible for specifying the
418 format of a Multicast Group Address.

420 3.1.3 Data Session

422 A Data Session is the unit of reliable delivery of TRACK. It
423 consists of a sequence of sequentially numbered Data messages, which
424 are sent by a single Sender over a single Data Multicast Address.
425 They are delivered reliably, with acknowledgements and
426 retransmissions occurring over the Control Tree. A Data Session ID
427 uniquely identifies it. A given Data Session is received by a set of
428 zero or more Receivers, and a set of zero or more Repair Heads. One
429 or more Data Sessions MAY share the same Data Multicast Address
430 (although this is NOT RECOMMENDED). Each TRACK node can
431 simultaneously participate in multiple Data Sessions. A Receiver
432 MUST join all the Data Multicast Addresses and Control Trees
433 corresponding to the Data Sessions it wishes to receive.

435 3.1.4 Data Channel

437 A Data Session is multicast over a Data Channel. The Data Channel is
438 responsible for efficiently delivering the Data messages to the
439 members of a Data Session, and providing statistical reliability
440 guarantees on this delivery. It does this by employing a Data
441 Channel Protocol, such as NORM, ALC, PGM, or Overlay Multicast.
442 TRACK is then responsible for providing application level, Sender 443 based reliability, by confirming delivery to all Receivers, and 444 optionally retransmitting lost messages that did not get correctly 445 delivered by the Data Channel. A common scenario would be to use 446 TRACK to provide application level confirmation of delivery, and 447 recover from persistent failures in the network that are beyond the 448 scope of the Data Channel Protocol. 450 3.1.5 Data Channel Protocol 452 This is the transport protocol used by a TRACK PI to ensure goodput 453 and statistical reliability on a Data Channel. 455 3.1.6 Data Multicast Address 457 This is the Multicast Group Address used by the Data Channel 458 Protocol, to efficiently deliver Data messages to all Receivers and 459 Repair Heads. All Data Multicast Addresses used by TRACK are assumed 460 to be unidirectional and only support a single Sender. 462 3.1.7 Control Tree 464 A Control Tree is a hierarchical communication path used to send 465 control information from a set of Receivers, through zero or more 466 Repair Heads (RHs), to a Sender. Information from lower nodes is 467 aggregated as the information is relayed to higher nodes closer to 468 the Sender. Each Data Session uses a Control Tree. It is acceptable 469 to have a degenerate Control Tree with no Repair Heads, which 470 connects all of the Receivers directly to the Sender. 472 Each RH in the Control Tree uses a separate Local Control Channel for 473 communicating with its children. It is RECOMMENDED that each Local 474 Control Channel correspond to a separate Multicast Group Address. 475 Optionally, these RH multicast addresses MAY be the same as the Data 476 Multicast Address. 478 3.1.8 Local Control Channel 479 A Local Control Channel is a unidirectional multicast path from a 480 Repair Head or Sender to its children. It uses a Multicast Group 481 Address for this communication. 
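Heartbeat messages on a Local Control Channel are what let a child
decide that its parent is no longer reachable (the Fault Detection and
Fault Recovery functions listed in Section 1). A minimal child-side
sketch follows; the heartbeat interval and miss threshold are
illustrative assumptions, since this BB leaves the actual values to
configuration:

```python
class HeartbeatMonitor:
    """Child-side liveness check for its parent on the Local Control
    Channel. The interval and miss threshold here are illustrative
    assumptions, not values mandated by this building block."""

    def __init__(self, interval=5.0, max_missed=3):
        self.interval = interval        # expected seconds between Heartbeats
        self.max_missed = max_missed    # misses tolerated before fail-over
        self.last_heard = None

    def on_heartbeat(self, now):
        """Record receipt of a Heartbeat from the parent."""
        self.last_heard = now

    def parent_failed(self, now):
        """True once the parent has been silent for longer than
        max_missed consecutive Heartbeat intervals; the child would
        then switch to another parent (Fault Recovery)."""
        if self.last_heard is None:
            return False  # not yet bound, so no baseline to judge by
        return (now - self.last_heard) > self.interval * self.max_missed
```

A parent would run the symmetric check per child, unbinding silent
children and reporting the change up the tree.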
483 3.1.9 Host ID

485 With the widespread deployment of network address translators,
486 creating a short globally unique ID for a host is a challenge. By
487 default, TRACK uses a 48 bit long Host ID field, filled with the
488 low-order 48 bits of the MD5 signature of the DNS name of the source. To
489 match the goodput-ensuring protocol that a TRACK
490 PI uses as its Data Channel Protocol, the PI MAY redefine the length and
491 contents of this identifier.

493 3.1.10 Data Session ID

495 A Data Session ID is a globally unique identifier for a Data Session.
496 It may either be selected by the Data Channel Protocol (e.g., NORM) or
497 by TRACK. By default, it is the Host ID of the
498 Sender combined with the 16 bit port number used for the Data
499 Session at the Sender. This identifier is included in every TRACK
500 message.

502 3.1.11 Child ID

504 All members in a TRACK Data Session, besides the Sender, are
505 identified by the combination of their Host ID, and the port number
506 from which they send IP packets to their parent.

508 3.1.12 Message Sequence Numbers

510 A Message Sequence Number is a 32 bit number in the range from 1
511 through 2^32 - 1, which is used to specify the sequential order of a
512 Data message in a Data Stream. A Sender node assigns consecutive
513 Sequence Numbers to the Data messages provided by the Sender
514 application. By default, zero is reserved to indicate that the Data
515 Session has not yet started. A TRACK PI MAY redefine this. Message
516 Sequence Numbers may wrap around, and so Sequence Number arithmetic
517 MUST be used to compare any two Sequence Numbers.

519 3.1.13 Data Queue

521 A Data Queue is a buffer, maintained by a Sender or a Repair Head,
522 for transmission and retransmission of the Data messages provided by
523 the Sender application. New Data messages are added to the Data
524 Queue as they arrive from the sending application, up to a specified
525 buffer limit.
The admission rate of messages to the network is 526 controlled by the flow and congestion control algorithms. Once a 527 message has been received by the Receivers of a Data Stream, it may 528 be deleted from the buffer. 530 At the Sender, a TRACK PI may integrate the Data Queue with the 531 buffer used by the Data Channel Protocol. 533 3.2 Basic Operation of the Protocol 535 For each Data Session, TRACK provides sequenced, reliable delivery of 536 data from a single Sender to up to tens of thousands of Receivers. A 537 TRACK Data Session consists of a network that has exactly one Sender 538 node, zero or more Receiver nodes and zero or more Repair Heads. 540 The figure below illustrates a TRACK Data Session with multiple 541 Repair Heads. 543 Before a Data Session starts, a session advertisement MUST be 544 received by all members of the Data Session, notifying them to join 545 the group, and providing the appropriate configuration information for the Data 546 Session. This MAY be provided directly by the application, by an 547 external service, or by the TRACK PI. 549 A Sender joins the Control Tree and a Data Channel Protocol. It 550 multicasts Data messages on the Data Multicast Address, using the 551 Data Channel Protocol. All of the nodes in the session subscribe to 552 the Data Multicast Address and join the Data Channel Protocol. 554 There is no assumption of congruence between the topology of the Data 555 Multicast Address and the topology of the Control Tree. 557 -------> SD (Sender node)----->| 558 ^^^ | 559 / | \ Control | 560 TRACKs / | \ Tree | 561 / | \ | 562 / | \ (Repair | 563 / | \ Head | 564 / | \ nodes) v 565 RH RH RH <------------| 566 ^^ ^^^ ^^ | Data 567 / | / | \ | \ | Channel 568 / | / | \ | \ | 569 / | / | \ | \ v 570 R R R R R R R <--------- 571 (Receiver Nodes) 573 A Receiver joins the appropriate Data Channel Protocol, and the Data 574 Multicast Address used by that protocol, in order to receive Data.
A 575 Receiver periodically informs its parent about the messages that it 576 has received by unicasting a TRACK message to the parent. It MAY 577 also request retransmission of lost messages in this TRACK. Each 578 parent node aggregates the TRACKs from its child nodes and (if it is 579 not the Sender) unicasts a single aggregated TRACK to its parent. 581 The Sender and each Repair Head have a multicast Local Control 582 Channel to their children. This is used for transmitting Heartbeat 583 messages that inform their child nodes that the parent node is still 584 functioning. This channel is also used to perform local 585 retransmission of lost Data messages to just these children. TRACK 586 MUST still provide correct operation even if multicast addresses are 587 reused across multiple Data Sessions or multiple Local Control 588 Channels. It is NOT RECOMMENDED to use the same multicast address 589 for multiple Local Control Channels serving any given Data Session. 591 The communication path forms a loop from the Sender to the Receivers, 592 through the Repair Heads back to the Sender. Original data (ODATA), 593 Retransmission (RDATA) and NullData messages regularly exercise the 594 downward data direction. Heartbeat messages exercise the downward 595 control direction. TRACK messages regularly exercise the Control 596 Tree in the upward direction. This combination constantly checks 597 that all of the nodes in the tree are still functioning correctly, 598 and initiates fault recovery when required. 600 This hierarchical infrastructure allows TRACK to provide a number of 601 functions in a scaleable way. Application level confirmation of 602 delivery and statistics aggregation both operate in a request-reply 603 mode. A sender issues a request for application level confirmation 604 or statistics reporting, and the receivers report back the 605 appropriate information in their TRACK messages. 
This information is 606 aggregated by the Repair Heads, and passed back up to the Sender. 607 Since TRACK messages are not delivered with the reliability of data 608 messages, Receivers and Repair Heads transmit this information 609 redundantly. 611 TRACK also gathers control information that is useful for improving 612 the performance of flow and congestion control algorithms, including 613 scaleable round trip time measurements. 615 Normally, goodput is ensured by lower level protocols, such as the 616 NACKs and FEC algorithms in NORM and PGM. However, TRACKs MAY also 617 include optional retransmission requests, in the form of selective 618 bitmaps indicating which messages need to be retransmitted. The RH 619 is then responsible for retransmitting these messages on the Local 620 Control Channel to its children. 622 3.3 Component Relationships 623 TRACK is primarily designed to run in conjunction with another 624 transport protocol that is responsible for ensuring goodput. It is 625 RECOMMENDED that this Data Channel Protocol also be responsible for 626 congestion control, although the TRACK PI MAY provide this congestion 627 control function instead, and MAY pass the congestion control 628 statistics it collects to the Data Channel Protocol, in order to 629 enhance the performance of the congestion control algorithms. 631 The primary Data Channel Protocol that TRACK is designed to work with 632 is NORM. In this case, the NORM PI is responsible for interfacing 633 with the NACK BB, the FEC BB, the Generic Router Assist BB, and the 634 appropriate congestion control BB. 636 TRACK then adds additional functionality that complements this 637 receiver-reliable protocol, such as application level confirmed 638 delivery, retransmission in the face of persistent failures, 639 statistics aggregation, and collection of extra information for 640 congestion control. 642 The TRACK BB is responsible for specifying all of the TRACK-specific 643 functionality.
It interfaces with the Automatic Tree Building Block. 644 The TRACK PI is then responsible for instantiating a complete 645 protocol that includes all of the other components. It is expected 646 that there will be multiple TRACK PIs, one for each Data Channel 647 Protocol that it is specified to work with. 649 The following figure illustrates this, for the case where NORM is the 650 Data Channel Protocol. 652 +----------+ 653 | | 654 | TRACK | 655 | PI | 656 | | 657 +----------+ 658 / \ 659 / \ 660 / \ 661 +---------+ +---------+ 662 | | | | 663 | TRACK | | NORM | Data Channel 664 | BB | | PI | Protocol 665 | | | | 666 +---------+ +---------+ 667 | | 668 | | 669 | | 670 +---------+ +-----------------------+ 671 | | | | 672 | Tree | | FEC, CC, GRA, NACK | 673 | BB | | Building Blocks | 674 | | | | 675 +---------+ +-----------------------+ 677 For more details on integration, please see the example TRACK PI over 678 UDP [17]. 680 4. TRACK Functionality 682 4.1 Hierarchical Session Creation and Maintenance 684 4.1.1 Overview of Tree Configuration 686 Before a Data Session starts reliably delivering data, the tree for 687 the Data Session needs to be created. This process binds each 688 Receiver to either a Repair Head or the Sender, and binds the 689 participating Repair Heads into a loop-free tree structure with the 690 Sender as the root of the tree. This process requires tree 691 configuration knowledge, which can be provided with some combination 692 of manual and/or automatic configuration. The algorithms for 693 automatic tree configuration are part of the Automatic Tree 694 Configuration BB. They return to each node the address of the parent 695 it should bind to, as well as zero or more backup parents to use if 696 the primary parent fails. 
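Since the Automatic Tree Configuration BB returns to each node a primary parent plus zero or more backup parents, a node can simply walk this list until some parent accepts its Bind. The following non-normative Python sketch illustrates this fallback behavior; the names join_tree and bind are illustrative only and are not defined by this specification.

```python
def join_tree(parent_candidates, bind):
    """Try the primary parent, then each backup in order (non-normative).

    parent_candidates: the list returned by automatic tree
    configuration, primary parent first.
    bind: a callable that attempts the Bind operation against one
    candidate and returns True on BindConfirm, False otherwise."""
    for parent in parent_candidates:
        if bind(parent):
            return parent
    # No candidate accepted us; the node cannot join this tree.
    raise RuntimeError("no parent candidate accepted the Bind")
```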
698 In addition to receiving the tree configuration information, the 699 Receivers all receive a Session Advertisement message from the 700 Senders, informing them of the Data Multicast Address and other 701 session configuration information. This advertisement may contain 702 other relevant session information such as whether or not Repair 703 Heads should be used, whether manual or automatic tree configuration 704 should be used, the time at which the session will start, and other 705 protocol settings. This advertisement is created as part of either 706 the TRACK PI or as part of an external service. In this way, the 707 Sender enforces a set of uniform session configuration parameters on 708 all members of the session. 710 As described in the automatic tree configuration BB, the general 711 algorithm for a given node in tree creation is as follows. 712 1) Get advertisement that a session is starting 713 2) Get a list of neighbor candidates using the getSNs Tree BB 714 interface, and OPTIONALLY contact them 715 3) Select best neighbor as parent in a loop free manner 716 4) Bind to parent 717 5) Optionally, later rebind to another parent 718 When a child finishes step 4, it is up to automatic tree 719 configuration to, if necessary, continue building the tree in order 720 to connect the node back to the Sender. After the session is 721 created, children can unbind from their parents and bind again to new 722 parents. This happens when faults occur, or as part of a tree 723 optimization process. Steps 1 through 3 are external to the TRACK 724 BB. Step 4 is performed as part of session creation. Step 5 is 725 performed as part of session maintenance in conjunction with 726 automatic tree building, as either an Unbind or Eject, combined with 727 another Bind operation. 729 Once steps 1 through 3 are completed, Receivers join the Data 730 Multicast Address, and attempt to Bind to either the Sender or a 731 local Repair Head. 
A Receiver will attempt to bind to the first node 732 in the tree configuration list returned by step 3, and if this fails, 733 it will move to the next one. A Receiver only binds to a single 734 Repair Head or Sender at a time, for each Data Session. 736 The automatic tree building BB ensures that the tree is formed 737 without loops. As part of this, when a Repair Head has a Receiver 738 attempt to Bind to it for a given Data Session, it may not at first 739 be able to accept the connection, until it is able to join the tree 740 itself. Because of this, a Receiver will sometimes have to 741 repeatedly attempt to Bind to a given parent before succeeding. 743 Once the Sender initiates tree building, it is also free to start 744 sending Data messages on the Data Multicast Address. Repair Heads 745 and Receivers may start receiving these messages, but may not request 746 retransmission or deliver data to the application until they receive 747 confirmation that they have successfully bound to the tree. 749 4.1.2 Bind 751 4.1.2.1 Input Parameters 753 In order to join a Data Session and Bind to the tree, the following 754 nodes need the following parameters. 756 A Repair Head requires the following parameters. 758 - Session: the unique identifier for the Data Session to join, 759 received from the session advertisement algorithm specified in the 760 PI. 762 - ParentAddress: the address and port of the parent node to which 763 the node should connect, received from the Auto Tree BB. 765 - UDPListenPort: the number of the port on which the node will 766 listen for its children's control messages. This parameter is 767 configured by the application. 769 - RepairAddr: the multicast address, UDP port, and TTL on which this 770 node sends control messages to its children. This parameter is 771 configured by the application. 773 A Sender requires the above parameters, except for the ParentAddress.
774 A Receiver requires the above parameters, except for the 775 UDPListenPort and RepairAddr. 777 4.1.2.2 Bind Algorithm 779 A Bind operation happens when a child wishes to join a parent in the 780 distribution tree for a given Data Session. The Receivers initiate 781 the first Bind operations with their parents, which then cause recursive 782 binding by each parent, up to the Sender. Each Receiver sends a 783 separate BindRequest message for each of the streams that it would 784 like to join. At the discretion of the PI, multiple BindRequest 785 messages may be bundled together in a single message. 787 A node sends a BindRequest message to its automatically selected or 788 manually configured parent node. The parent node sends either a 789 BindConfirm message or a BindReject message. Reception of a 790 BindConfirm message terminates the algorithm successfully, while 791 receipt of a BindReject message causes the node to either retry the 792 same parent or restart the Bind algorithm with its next parent 793 candidate (depending on the BindReject reason code), or if it has 794 none, to declare a REJECTED_BY_PARENT error. Once the node is 795 accepted by a Repair Head, it informs the Tree BB using the setSN 796 interface. 798 Reliability is achieved through the use of a standard request- 799 response protocol. At the beginning of the algorithm, the child 800 initializes TimeMaxBindResponse to the constant 801 TIMEOUT_PARENT_RESPONSE and initializes NumBindResponseFailures to 0. 802 Every time it sends a BindRequest message, it waits 803 TimeMaxBindResponse for a response from the parent node. If no 804 response is received, the node doubles its value for 805 TimeMaxBindResponse, but limits TimeMaxBindResponse to be no larger 806 than MAX_TIMEOUT_PARENT_RESPONSE. It also 807 increments NumBindResponseFailures, and retransmits the BindRequest 808 message. If NumBindResponseFailures reaches NUM_MAX_PARENT_ATTEMPTS, 809 it reports a PARENT_UNREACHABLE error.
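The request-response reliability described above reduces to a retransmit loop with a doubling, bounded timeout. The following non-normative Python sketch illustrates it; the numeric constant values are illustrative only, and wait_for_response stands in for whatever timeout mechanism a PI provides.

```python
# Constant values below are illustrative; a PI defines the real ones.
TIMEOUT_PARENT_RESPONSE = 1.0        # initial wait, in seconds
MAX_TIMEOUT_PARENT_RESPONSE = 16.0   # upper bound on the backoff
NUM_MAX_PARENT_ATTEMPTS = 5

def bind_to_parent(send_bind_request, wait_for_response):
    """Bind reliability loop per section 4.1.2.2 (non-normative).

    send_bind_request: transmits one BindRequest message.
    wait_for_response(timeout): returns the parent's response
    (BindConfirm or BindReject), or None on timeout."""
    time_max_bind_response = TIMEOUT_PARENT_RESPONSE
    num_bind_response_failures = 0
    while True:
        send_bind_request()
        response = wait_for_response(time_max_bind_response)
        if response is not None:
            return response
        # No response: double the timeout (bounded), count the failure,
        # and retransmit the BindRequest.
        time_max_bind_response = min(time_max_bind_response * 2,
                                     MAX_TIMEOUT_PARENT_RESPONSE)
        num_bind_response_failures += 1
        if num_bind_response_failures >= NUM_MAX_PARENT_ATTEMPTS:
            raise TimeoutError("PARENT_UNREACHABLE")
```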
811 When a parent receives a BindRequest message, it first consults the 812 automatic tree building BB for approval (using the acceptChild Tree 813 BB interface), for instance to ensure that accepting the BindRequest 814 will not cause a loop in the tree. Then the parent checks to be sure 815 that it does not have more than MaxChildren children already bound to 816 it for this session. If it can accept the child, it sends back a 817 BindConfirm message. Otherwise, it sends the node a BindReject 818 message. Then the parent checks to see if it is already a member of 819 this Data Session. If it is not yet a member of this session, it 820 attempts to join the tree itself. 822 The BindConfirm message contains the lowest Sequence Number that the 823 Repair Head has available. If this number is 0, then the Repair Head 824 has all of the data available from the start of the session. 825 Otherwise, the requesting node is attempting a late join, and can 826 only use this Repair Head if late join was allowed by the PI. If 827 late join is not allowed, the node may try another Repair Head, or 828 give up. 830 Similarly, if a failure recovery occurs, when a node tries to bind to 831 a new Repair Head, it must follow the same rules as for a late join. 832 See Fault Recovery, below. 834 4.1.3 Unbind 836 A child may decide to leave a Data Session for the following reasons. 837 1) It detects that the Data Session is finished. 2) The application 838 requests to leave the Data Session. 3) It is not able to keep up 839 with the data rate of the Data Session. When any of these conditions 840 occurs, it initiates an Unbind process. 842 An Unbind is, like the Bind function, a simple request-reply 843 protocol. Unlike the Bind function, it only has a single response, 844 UnbindConfirm. With this exception, the Unbind operation uses the 845 same state variables and reliability algorithms as the Bind function. 
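The parent-side checks described in section 4.1.2.2 (the acceptChild Tree BB consultation, the MaxChildren limit, and the lowest-available-Sequence-Number report used for late-join detection) can be sketched non-normatively as follows. The reason codes shown are hypothetical; this document does not enumerate BindReject reason codes.

```python
def handle_bind_request(child, accept_child, num_children, max_children,
                        lowest_seqnum_available):
    """Parent-side handling of a BindRequest (non-normative sketch).

    accept_child: the acceptChild interface of the automatic tree
    building BB (e.g. refuses children that would create a loop)."""
    if not accept_child(child):
        return ("BindReject", "TREE_BB_REFUSED")      # hypothetical code
    if num_children >= max_children:
        return ("BindReject", "TOO_MANY_CHILDREN")    # hypothetical code
    # A lowest available Sequence Number of 0 means all data since the
    # start of the session is available; anything higher means the
    # child would be performing a late join.
    return ("BindConfirm", lowest_seqnum_available)
```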
847 When a child receives an UnbindConfirm message from its parent, it 848 reports a LEFT_DATA_SESSION_GRACEFULLY event. If it does not receive 849 this message after NUM_MAX_PARENT_ATTEMPTS, then it reports a 850 LEFT_DATA_SESSION_ABNORMALLY event. Unbinds are reported to the Tree 851 BB using the lostSN interface. 853 4.1.4 Eject 855 A parent may decide to remove one or more of its children from a data 856 stream for the following reasons. 1) The parent needs to leave the 857 group due to application reasons. 2) The Repair Head detects an 858 unrecoverable failure with either its parent or the Sender. 3) The 859 parent detects that the child is not able to keep up with the speed 860 of the data stream. 4) The parent is not able to handle the load of 861 its children and needs some of them to move to another parent. In 862 the first two cases, the parent needs to multicast the advertisement 863 of the termination of one or more Data Sessions to all of its 864 children. In the latter two cases, it needs to send one or more 865 unicast notifications to one or more of its children. 867 Consequently, an Eject can be done either with a repeated multicast 868 advertisement message to all children, or a set of unicast request- 869 reply messages to the subset of children that it needs to go to. 871 For the multicast version of Eject, the parent sends a multicast 872 UnbindRequest message to all of its children for a given Data 873 Session, on its Local Multicast Channel. It is only necessary to 874 provide statistical reliability on this message, since children will 875 detect the parent's failure even if the message is not received. 876 Therefore, the UnbindRequest message is sent 877 FAILURE_DETECTION_REDUNDANCY times. 879 For the unicast version of Eject, the parent sends a unicast 880 UnbindRequest message to each of the children being ejected. Each of them responds 881 with an EjectConfirm. Reliability is ensured through the same 882 request-reply mechanism as the Bind operation.
884 Ejections are reported to the Tree BB using the removeChild 885 interface. 887 4.1.5 Fault Detection 889 There are three cases where fault detection is needed. 1) Detection 890 (by a child) that a parent has failed. 2) Detection (by a parent) 891 that a child has failed. 3) Detection (by either a Repair Head or 892 Receiver) that a Sender has failed. 894 In order to be scaleable and efficient, fault detection is primarily 895 accomplished by periodic keep-alive messages, combined with the 896 existing TRACK messages. Nodes expect to see keep-alive messages 897 every set period of time. If more than a fixed number of periods go 898 by, and no keep-alive messages of a given type are received, the node 899 declares a preliminary failure. The detecting node may then ping the 900 potentially failed node before declaring it failed, or it can just 901 declare it failed. 903 Failures are detected through three keep-alive messages: Heartbeat, 904 TRACK, and NullData. The Heartbeat message is multicast periodically 905 from a parent to its children on its Local Control Channel. NullData 906 messages are multicast by a Sender on the Data Control Channel when 907 it has no data to send. TRACK messages are generated periodically, 908 even if no data is being sent to a Data Session, as described in 909 section 7.2. 911 Heartbeat messages are multicast every HeartbeatPeriod seconds, from 912 a parent to its children. Every time that a parent sends a 913 Retransmission message or a Heartbeat message (as well as at 914 initialization time), it resets a timer for HeartbeatPeriod seconds. 915 If the timer goes off, a Heartbeat is sent. The HeartbeatPeriod is 916 dynamically computed as follows: 918 interval = AckWindow / MessageRate 920 HeartbeatPeriod = 2 * interval 922 Global configuration parameters ConstantHeartbeatPeriod and 923 MinimumHeartbeatPeriod can be used to either set HeartbeatPeriod to a 924 constant, or give HeartbeatPeriod a lower bound, globally.
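The HeartbeatPeriod computation above, including the ConstantHeartbeatPeriod override and the MinimumHeartbeatPeriod lower bound, can be expressed as the following non-normative sketch:

```python
def heartbeat_period(ack_window, message_rate,
                     constant_heartbeat_period=None,
                     minimum_heartbeat_period=0.0):
    """HeartbeatPeriod per section 4.1.5 (non-normative):

        interval = AckWindow / MessageRate
        HeartbeatPeriod = 2 * interval

    ConstantHeartbeatPeriod, if configured, replaces the computation;
    MinimumHeartbeatPeriod is a global lower bound."""
    if constant_heartbeat_period is not None:
        return constant_heartbeat_period
    interval = ack_window / message_rate
    return max(2.0 * interval, minimum_heartbeat_period)
```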
926 Similarly, a NullData message is multicast by the Sender to all Data 927 Session members, every NULL_DATA_PERIOD. The NullData timer is set 928 to NULL_DATA_PERIOD, and is reset every time that a Data or NullData 929 message is sent by the Sender. 931 The key parameter for failure detection is the global tree parameter 932 FAILURE_DETECTION_REDUNDANCY. The higher the value for this 933 parameter, the more keep-alive messages must be missed before a 934 failure is declared. 936 A major goal of failure detection is for children to detect parent 937 failures fast enough that there is a high probability they can rejoin 938 the stream at another parent, before flow control has advanced the 939 buffer window to a point where the child cannot recover all lost 940 messages in the stream. In order to attempt to do this, children 941 detect a failure of a parent if FAILURE_DETECTION_REDUNDANCY * 942 HeartbeatPeriod time goes by without any heartbeats. As part of 943 buffer window advancement, all parents MAY choose to buffer all 944 messages for a minimum of FAILURE_DETECTION_REDUNDANCY * 2 * 945 HeartbeatPeriod seconds, which gives children a period of time to 946 find a new parent before the buffers are freed. Children report 947 parent failures to the Tree BB using the lostSN interface. 949 A parent detects a preliminary failure of one of its children if it 950 does not receive any TRACK messages from that child in 951 FAILURE_DETECTION_REDUNDANCY * TrackTimeout seconds (see discussion 952 of how TrackTimeout is computed below). Because a failed child can 953 slow down the group's progress, it is very important that a parent 954 resolve the child's status quickly. Once a parent declares a 955 preliminary failure of a child, it issues a set of up to 956 FAILURE_DETECTION_REDUNDANCY Heartbeat messages that are unicast (or 957 multicast) to the failed Receiver(s).
These messages are spaced 958 apart by 2*LocalRTT, where LocalRTT is the round trip time that has 959 been measured to the child in question (see below for description of 960 how LocalRTT is measured). These Heartbeat messages contain a 961 ChildrenList field that contains the children who are requested to 962 send a TRACK immediately. 964 Whenever a child receives a Heartbeat message where the child is 965 identified in the ChildrenList field, it immediately sends a TRACK to 966 its parent. If a parent does not receive a TRACK message from a 967 child after waiting a period of 2*LocalRTT after the last Heartbeat 968 message to that child, it declares the child failed, and removes it 969 from the parent's child membership list. It informs the Tree BB using 970 the removeChild interface. 972 A child or a Repair Head detects the failure of a Sender if it does 973 not receive a Data or NullData message from a Sender in 974 FAILURE_DETECTION_REDUNDANCY * NULL_DATA_PERIOD. 976 Note that the more Receivers there are in a tree, and the higher the 977 loss rate, the larger FAILURE_DETECTION_REDUNDANCY must be, in order 978 to give the same probability that erroneous failures won't be 979 declared.
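The timeout rules above reduce to simple comparisons against elapsed time. A non-normative sketch follows; the FAILURE_DETECTION_REDUNDANCY value shown is illustrative, since it is a global tree parameter set by configuration.

```python
FAILURE_DETECTION_REDUNDANCY = 3   # global tree parameter; value illustrative

def parent_failed(now, last_heartbeat_time, heartbeat_period):
    """A child declares its parent failed when
    FAILURE_DETECTION_REDUNDANCY * HeartbeatPeriod elapses with no
    Heartbeat (section 4.1.5)."""
    return (now - last_heartbeat_time >
            FAILURE_DETECTION_REDUNDANCY * heartbeat_period)

def child_preliminary_failure(now, last_track_time, track_timeout):
    """A parent declares a preliminary child failure when
    FAILURE_DETECTION_REDUNDANCY * TrackTimeout elapses with no TRACK;
    it then sends Heartbeats naming the child in ChildrenList."""
    return (now - last_track_time >
            FAILURE_DETECTION_REDUNDANCY * track_timeout)
```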
994 4.1.7 Fault Recovery 996 The Fault Recovery algorithms require a list of one or more addresses 997 of alternate parents that can be bound to, and that still provide 998 loop free operation. 1000 If a child detects the failure of its parent, it then re-runs the 1001 Bind operation to a new parent candidate, in order to rejoin the 1002 tree. A node may perform a late join, i.e. binding with a Repair 1003 Head which cannot provide all the necessary repair data, only if 1004 allowed by the PI. 1006 4.1.8 Distributed Membership. 1008 Each Repair Head is responsible for maintaining a set of state 1009 variables on the status of its children. Unlike the Generic Router 1010 Assist, this is hard state, which is only removed when a child leaves 1011 that Repair Head gracefully, or after the Repair Head detects that a 1012 child has failed. These variables MUST include, but are not 1013 necessarily limited to, the following: 1014 - ChildID. This is the two byte identifier assigned to the Child by 1015 the Repair Head. This uniquely identifies this Child to this 1016 Repair Head, but has no meaning outside that scope. 1017 - GlobalChildIdentifier. This is the globally unique identifier for 1018 this Child. 1019 - ChildRTT. This is the weighted average of the local RTT to this 1020 Child. 1021 - LastTRACK. This is the contents of the last TRACK message sent 1022 from this Child, if any, not including options. 1023 - LastApplicationLevelConfirmation. This is the contents of the last 1024 Application Level Confirmation sent from this Child, if any. 1025 - LastStatistics. This is the contents of the last Statistics 1026 message sent from this Child, if any. 1027 - ChildLiveness. This is a set of variables that keep track of the 1028 liveness of each child.
This includes the last time a TRACK 1029 message was received from this child, as well as the number of 1030 Heartbeat messages that have been directed at it, and the time at 1031 which the last Heartbeat message was sent to the child. Please see 1032 Fault Detection, above, for more details. 1034 4.2 Data Sessions. 1036 4.2.1 Data Transmission and Retransmission 1038 Data is multicast by a Sender on the Data Multicast Address via the 1039 Data Channel Protocol. The Data Channel Protocol is responsible for 1040 taking care of as many retransmissions as possible, and for ensuring 1041 the goodput of the Data Session. TRACK is then responsible for 1042 providing OPTIONAL flow control and application level reliability. 1043 The mechanics of an application level confirmation of delivery are 1044 handled by TRACK, including keeping track of the distributed 1045 membership list of receivers and aggregating acknowledgements up the 1046 Control Tree. Please see below for more details on flow control and 1047 application level confirmation. 1049 A common scenario for handling recovery of lost messages is to allow 1050 the Data Channel Protocol to provide statistical reliability, and 1051 then allow TRACK to provide retransmissions for more persistent 1052 failure cases, such as if a Receiver is not able to receive any Data 1053 messages for a few minutes. 1055 Retransmissions of data messages may be multicast by the Sender on 1056 the Data Multicast Address or be multicast on a Local Control Channel 1057 by a Repair Head. 1059 A Repair Head joins all of the Data Multicast Addresses that any of 1060 its descendants have joined. A Repair Head is responsible for 1061 receiving and buffering all data messages using the reliability 1062 semantics configured for a stream. As a simple to implement option, 1063 a Repair Head MAY also function as a Receiver, and pass these data 1064 messages to an attached application. 
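The per-child hard state a Repair Head maintains (section 4.1.8) might be represented as follows. This is a non-normative sketch: field types and the choice of a dataclass are illustrative, not part of this specification.

```python
from dataclasses import dataclass

@dataclass
class ChildState:
    """Per-child hard state kept by a Repair Head (section 4.1.8).
    Field names follow the variables listed in the text."""
    child_id: int                    # two-byte ID, meaningful only to this RH
    global_child_identifier: bytes   # globally unique Child ID
    child_rtt: float = 0.0           # weighted average of the local RTT
    last_track: bytes = b""          # last TRACK message, minus options
    last_app_confirmation: bytes = b""  # last Application Level Confirmation
    last_statistics: bytes = b""     # last Statistics message
    # ChildLiveness bookkeeping (see Fault Detection, section 4.1.5):
    last_track_time: float = 0.0     # when the last TRACK arrived
    heartbeats_sent: int = 0         # Heartbeats directed at this child
    last_heartbeat_time: float = 0.0 # when the last such Heartbeat was sent
```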
1066 For additional fault tolerance, a Receiver MAY subscribe to the 1067 multicast address associated with the Local Control Channel of one or 1068 more Repair Heads in addition to the multicast address of its parent. 1069 In this case it does not bind to this Repair Head or Sender, but will 1070 process Retransmission messages sent to this address. If the 1071 Receiver's Repair Head fails and it transfers to another Repair Head, 1072 this minimizes the number of data messages it needs to recover after 1073 binding to the new Repair Head. 1075 4.2.2 Local Retransmission 1077 If a Repair Head or Sender determines from its child nodes' TRACK 1078 messages that a Data message was missed, the Repair Head retransmits 1079 the Data message. The Repair Head or Sender multicasts the 1080 Retransmission message on its multicast Local Control Channel. In 1081 the event that a Repair Head receives a retransmission and knows that 1082 its children need this repair, it re-multicasts the retransmission to 1083 its children. 1085 The scope of retransmission (the multicast TTL) is considered part of 1086 the Control Channel's multicast address, and is derived during tree 1087 configuration. 1089 A Repair Head maintains the following state for each of its children, 1090 for the purpose of providing repair service to the local group: 1092 - HighestConsecutivelyReceived. A Sequence Number indicating that all 1093 Data messages up to this number (inclusive) have been received 1094 by a given child. 1096 - MissingMessages. A data structure to keep track of the reception 1097 status of the Data messages with Sequence Number higher than 1098 HighestConsecutivelyReceived. 1100 The minimum HighestConsecutivelyReceived value of all its children is 1101 kept as the variable LocalStable. 1103 A Repair Head also maintains a retransmission buffer. The size of the 1104 retransmission buffer MUST be greater than the maximum value of a 1105 Sender's transmission window.
The retransmission buffer MUST keep all 1106 the Data messages received by the Repair Head with Sequence Number 1107 higher than LocalStable, and optionally some messages with Sequence 1108 Number lower than LocalStable if there is room (beyond the maximum 1109 value of the Sender's transmission window). The latter messages are kept 1110 in the retransmission buffer in case a Receiver from another group 1111 loses its parent and needs to join this group. 1113 As TRACK messages are received, the Repair Head updates the above 1114 state variables. 1116 To perform local repair, a Repair Head implements a retransmission 1117 queue with memory. Each lost message is entered into the 1118 retransmission queue in increasing order according to its Sequence 1119 Number. If the same Data message has already been retransmitted 1120 recently (recognized due to the queue's memory) it is delayed by the 1121 local group RTT (see roundtrip time measurement) before 1122 retransmission. 1124 Retransmissions MUST NOT be sent at a faster rate than the current 1125 TransmissionRate advertised by the Sender. 1127 4.2.3 Flow and Rate Control 1129 TRACK offers the ability to limit the rate of Data traffic, through 1130 both flow control and rate limits. 1132 When a Receiver sends a TRACK to its parent, the HighestAllowed field 1133 provides information on the status of the Receiver's flow control 1134 window. The value of HighestAllowed is computed as follows: 1136 HighestAllowed = seqnum + ReceiverWindow 1138 where seqnum is the highest Sequence Number of consecutively received 1139 data messages at the Receiver. The size of the ReceiverWindow may 1140 either be based on a parameter local to the Receiver or be a global 1141 parameter. 1143 If flow control is enabled for a given Data Session, then a Sender 1144 MUST NOT send any Data messages to the Data Channel Protocol that are 1145 higher than the current value for HighestAllowed that it has.
On 1146 startup, HighestAllowed is initialized to ReceiverWindow. 1148 In addition, the Sender application MAY provide minimum and maximum 1149 rate limits. Unless overridden by the Data Channel Protocol, a 1150 Sender will not offer Data messages to the Data Channel Protocol at 1151 lower than MinimumDataRate (except possibly during short periods of 1152 time when certain slow Receivers are being ejected), or higher than 1153 MaximumDataRate. If a Receiver is not able to keep up with the 1154 minimum rate for a period of time, it SHOULD leave the group 1155 promptly. Receivers that leave the group MAY attempt to rejoin the 1156 group at a later time, but SHOULD NOT attempt an immediate 1157 reconnection. 1159 4.2.4 Reliability Window 1161 The Sender and each Repair Head maintain a window of messages for 1162 possible retransmission. As messages are acknowledged by all of its 1163 children, they are released from the parent's retransmission buffer, 1164 as described in 4.2.2. In addition, there are two global parameters 1165 that can affect when a parent releases a data message from the 1166 retransmission buffer -- MinHoldTime and MaxHoldTime. 1168 MinHoldTime specifies a minimum length of time a message must be held 1169 for retransmission from when it was received. This parameter is 1170 useful to handle scenarios where one or more children have been 1171 disconnected from their parent, and have to reconnect to another. 1172 If, for example, MinHoldTime is set to FAILURE_DETECTION_REDUNDANCY * 1173 2 * ConstantHeartbeatPeriod, then there is a high likelihood that any 1174 child will be able to recover any lost messages after reconnecting to 1175 another parent. 1177 The Sender continually advertises to the members of the Data Session 1178 both edges of its retransmission window. The higher value is the 1179 SeqNum field in each Data or NullData message, which specifies the 1180 highest Sequence Number of any data message sent.
The trailing edge of the window is advertised in the HighestReleased
field. This specifies the largest Sequence Number of any message
sent that has subsequently been released from the Sender's
retransmission window. If both values are the same, then the window
is presently empty. Zero is not a legitimate value for a data
Sequence Number, so if either field has a value of zero, then no
messages have yet reached that state. All Sequence Number fields
use Sequence Number arithmetic so that a Data Session can continue
after exhausting the Sequence Number space.

When a member of a Data Session receives an advertisement of a new
HighestReleased value, it stores this value and is no longer allowed
to ask for retransmission of any messages up to and including the
HighestReleased value. If it has any outstanding missing messages
that are less than or equal to HighestReleased, it MAY move forward
and continue delivering the next data messages in the stream. It
also SHOULD report an error for the messages that are no longer
recoverable.

MaxHoldTime specifies the maximum length of time a message may be
held for retransmission. This parameter is set at the Sender, which
uses it to set the HighestReleased field in data message headers.
It is particularly useful for real-time, semi-reliable streams such
as live video, where retransmissions are only useful for up to a few
seconds. When combined with Unordered delivery semantics and
application-level jitter control at the Receivers, this provides
Time Bounded Reliability. MaxHoldTime MUST always be larger than
MinHoldTime.

4.2.5 Ordering Semantics

TRACK offers two flavors of ordering semantics: Ordered or
Unordered. One of these is selected on a per-session basis as part
of the Session Configuration Parameters.
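The Sequence Number arithmetic used throughout the protocol (Section
4.2.4) can be sketched as follows. The 32-bit field width is an
assumption of this sketch, in the style of RFC 1982; this BB does
not fix the field size:

```python
# Hedged sketch of Sequence Number (serial) arithmetic for a 32-bit
# space: "a is newer than b" holds when the forward distance from b
# to a is less than half the space. The 32-bit width is an assumed
# example; a TRACK PI may define a different field size.

SEQ_BITS = 32
SEQ_MOD = 1 << SEQ_BITS
HALF = 1 << (SEQ_BITS - 1)

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number a is logically greater than b."""
    return 0 < ((a - b) % SEQ_MOD) < HALF

def window_empty(seqnum: int, highest_released: int) -> bool:
    """Both window edges equal means the window is presently empty."""
    return seqnum == highest_released
```

With this comparison, a Data Session can continue past the top of
the Sequence Number space: a small sequence number just after
wraparound still compares as newer than a very large one just
before it.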
Unordered service provides a reliable stream of messages, without
duplicates, and delivers them to the application in the order
received. This allows the lowest-latency delivery for
time-sensitive applications. It may also be used by applications
that wish to provide their own jitter control.

Ordered service provides TCP semantics on delivery. All messages
are delivered in the order sent, without duplicates.

4.2.6 Retransmission Requests

A Receiver detects that it has missed one or more Data messages by
gaps in the sequence numbers of received messages. Each Receiver
keeps track of HighestSequenceNumber, the highest sequence number
known for a Data Session, as observed from Data, RData, and NullData
messages. Any sequence numbers between HighestReleased and
HighestSequenceNumber that have not been received are assumed to be
missing.

When a Receiver detects missing messages, it MAY send a request for
retransmission, if local retransmission is enabled. It does this by
sending a Retransmission Request message. The timing of this
request is described below.

4.2.7 End of Stream

When an application signals that a Data Session is complete, the
Sender advertises this to its children by setting the End of Session
option on the last Data message in the Data Session, as well as on
all subsequent retransmissions of that Data message and on all
subsequent NullData messages.

The Sender SHOULD NOT leave the Data Session until the TRACK reports
indicate that all group members have left the Data Session, or it
has waited a period of at least FAILURE_DETECTION_REDUNDANCY *
TrackTimeout seconds.

4.3 Control Traffic Generation and Aggregation

One of the largest challenges for scalable reliable multicast
protocols has been that of controlling the potential explosion of
control traffic.
There is a fundamental tradeoff between the latency with which
losses can be detected and repaired and the amount of control
traffic generated by the protocol.

TRACK messages are the primary form of control traffic in this BB.
They are sent from Receivers and Repair Heads to their parents.
TRACK messages may be sent for the following purposes:
- to request retransmission of messages
- to advance the Sender's transmission window for flow control
  purposes
- to deliver application-level confirmation of data reception
- to propagate other relevant feedback information up through the
  session (such as RTT and loss reports, for congestion control)

4.3.1 TRACK Generation with the Rotating TRACK Algorithm

Each Receiver sends a TRACK message to its parent once per AckWindow
of data messages received. A Receiver uses an offset from the
boundary of each AckWindow to send its TRACK, in order to reduce
burstiness of control traffic at the parents. Each parent has a
maximum number of children, MaxChildren. When a child binds to the
parent, the parent assigns a locally unique ChildID to that child,
between 0 and MaxChildren-1.

Each child in a tree generates a TRACK message at least once every
AckWindow of data messages, when the most recent data message's
Sequence Number, modulo AckWindow, is equal to its ChildID. If the
message that would have triggered a given TRACK for a given node is
missed, the node will generate the TRACK as soon as it learns that
it has missed the message, typically through receipt of a
higher-numbered data message.

Together, AckWindow and MaxChildren determine the maximum ratio of
control messages to data messages seen by each parent, given a
constant load of data messages.

In each data message, the Sender advertises the current MessageRate
(measured in messages per second) at which it is sending data.
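The rotating trigger just described can be sketched as follows; this
is an illustrative fragment only, with AckWindow defaulted to the
recommended value of 32:

```python
# Hedged sketch of the rotating TRACK trigger: the child holding
# identifier child_id sends a TRACK when the newest data message's
# Sequence Number, modulo AckWindow, equals child_id. This staggers
# the TRACKs of different children across each AckWindow of data.

def track_due(seqnum: int, child_id: int, ack_window: int = 32) -> bool:
    """True when receipt of message seqnum should trigger a TRACK."""
    return seqnum % ack_window == child_id
```

For example, with AckWindow 32, the child with ChildID 5 sends its
TRACKs on messages 5, 37, 69, and so on, while the child with
ChildID 6 sends one message later.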
This rate is generated by the congestion control algorithms in use
at the Sender.

At the time a node sends a regular TRACK, it also computes a
TRACKTimeout value:

   interval = AckWindow / MessageRate

   TRACKTimeout = 2 * interval

If no TRACKs are sent within the TRACKTimeout interval, a TRACK is
generated, and TRACKTimeout is increased by a factor of 2, up to a
value of MAX_TRACK_TIMEOUT.

This timer mechanism is used by a Receiver to ensure timely repair
of lost messages and regular feedback propagation up the tree, even
when the Sender is not sending data continuously. This mechanism
complements the AckWindow-based regular TRACK generation mechanism.

4.3.2 TRACK Aggregation

There are many reasons for providing feedback from all the Receivers
to the Sender in an aggregated form. The major ones are listed
below:

1) End-to-end delivery confirmation. This confirmation tells the
Sender that all the Receivers (in the entire tree) have received
data messages up to a certain Sequence Number. It is carried in an
Application Level Confirmation message.

2) Flow control. The aggregated information is carried in the
HighestAllowed field. It tells the Sender the highest Sequence
Number that all the Receivers (in the entire tree) are prepared to
receive.

3) Congestion control feedback. Information about the state of the
tree can be passed up to help control the congestion control
algorithms for the group.

4) Counting current membership in the group. This information is
carried in the SubTreeCount field. It lets the Sender know the
number of Receivers currently connected to the repair tree.

5) Measuring the round-trip time from the Sender to the "worst"
Receiver.

A Repair Head maintains state for each child.
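A minimal sketch of the TRACKTimeout computation and its doubling,
assuming the recommended MAX_TRACK_TIMEOUT of 5 seconds (the
function names are illustrative):

```python
# Hedged sketch of the TRACK timeout: start at twice the expected
# time to receive one AckWindow of data, double on each expiry with
# no TRACK sent, and cap at MAX_TRACK_TIMEOUT (recommended value 5
# seconds in Section 6.2).

MAX_TRACK_TIMEOUT = 5.0  # seconds

def initial_track_timeout(ack_window: int, message_rate: float) -> float:
    """TRACKTimeout computed at the time a regular TRACK is sent."""
    interval = ack_window / message_rate
    return min(2.0 * interval, MAX_TRACK_TIMEOUT)

def backoff(timeout: float) -> float:
    """Next timeout after an expiry during which no TRACK was sent."""
    return min(timeout * 2.0, MAX_TRACK_TIMEOUT)
```

For example, with AckWindow 32 and a MessageRate of 64 messages per
second, the initial TRACKTimeout is 1 second, doubling toward the
5-second cap while the Sender is idle.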
Each time a TRACK from a child is received, the corresponding state
for that child is updated based on the information in the TRACK
message. When a Repair Head sends a TRACK message to its parent,
the following fields of its TRACK message are derived from the
aggregation of the corresponding state for its children. The
following rules describe how the aggregation is performed:

- WorstLossRate. Take the maximum value of the WorstLossRate from
  all Children.
- SubTreeCount. Take the sum of the SubTreeCount from all Children.
- HighestAllowed. Take the minimum of the HighestAllowed value from
  all Children.
- WorstEdgeThroughput. Take the minimum value of the
  WorstEdgeThroughput field from all Children.
- UnicastCost. Take the sum of the UnicastCost from all Children.
- MulticastCost. Take the sum of the MulticastCost from all
  Children.
- SenderDallyTime. Take the minimum value, over all of the
  Children, of (child's reported SenderDallyTime + child's local
  dally time).
- FailureCount. Take the sum of the FailureCount from all Children.
- FailureList. Concatenate the FailureList fields from all
  Children, up to a maximum list size of MaximumFailureListSize.

Note that the SenderTimeStamp, ParentTimestamp, and ParentDallyTime
fields are not aggregated. The Sender derives the round-trip time
to the worst Receiver by doing its local aggregation for
SenderDallyTime and then computing:

   RTT = currentTime - SenderTimeStamp - SenderDallyTime

Application level confirmations (ALCs) are handled as follows. For
a set of ALC requests from Receivers, the ones with the highest
value for HighConfirmationSequenceNumber are considered, and all
others are discarded.

For the ConfirmationStatus field, the following rules apply. Note
that a ConfirmationStatus of SomeReceiversAcknowledge can correspond
to a ConfirmationCount of zero.
   If all children report AllReceiversAcknowledge Then
      ConfirmationStatus = AllReceiversAcknowledge
   Else If at least one child reports (ListOfFailures OR
           FailuresExceedMaximumListSize) Then
      If the count of all reported failures >
            MaximumFailureListSize Then
         ConfirmationStatus = FailuresExceedMaximumListSize
      Else
         ConfirmationStatus = ListOfFailures
   Else
      ConfirmationStatus = SomeReceiversAcknowledge

The ConfirmationCount field is equal to the sum of the
ConfirmationCount for the aggregated ALC reports of all Children.
The PendingCount field is equal to the sum of the PendingCount
fields of all Children. The FailureList field is the concatenation
of the FailureList fields of all aggregated ALC reports of all
Children, up to a maximum length of MaximumFailureListSize.

In addition to these fields with fixed aggregation rules, TRACK
supports a set of user-defined aggregation statistics. These
statistics are self-describing in terms of their data type and
aggregation method. Statistics reports are numbered, and only the
most recent statistics report request is aggregated to the Sender.
Statistics are aggregated over the set of Child statistics reports
that have been received with that number. Aggregation methods
include minimum, maximum, sum, product, and concatenation.

4.3.3 Statistics Reporting

A Sender can request a list of aggregated statistics from all
Receivers in the group. There is a set of predefined statistics,
such as loss rate and average throughput. There is also the
capacity to request a set of other TRACK statistics, as well as
application-defined statistics.

The format of each statistic is self-describing, in terms of data
type, size, and aggregation method. A Sender reliably sends out a
statistics request by attaching it as an option to a Data message.
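The ConfirmationStatus aggregation rules of 4.3.2 can be restated as
a runnable sketch; the string encodings and list representation are
assumptions for illustration only, not a wire format:

```python
# Hedged sketch of ALC ConfirmationStatus aggregation at a Repair
# Head. Status strings mirror the field values in this section; how
# a TRACK PI actually encodes them on the wire is not assumed here.

ALL = "AllReceiversAcknowledge"
SOME = "SomeReceiversAcknowledge"
LIST = "ListOfFailures"
EXCEED = "FailuresExceedMaximumListSize"

def aggregate_status(child_statuses: list[str],
                     total_failures: int,
                     max_failure_list_size: int = 800) -> str:
    """Combine the ALC statuses reported by all children."""
    if all(s == ALL for s in child_statuses):
        return ALL
    if any(s in (LIST, EXCEED) for s in child_statuses):
        if total_failures > max_failure_list_size:
            return EXCEED
        return LIST
    return SOME
```

Note how the sketch preserves the rule that SomeReceiversAcknowledge
is the fallback: any mix of acknowledgements and pending children,
with no reported failures, aggregates to SomeReceiversAcknowledge.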
When a Receiver gets a request for a statistic, it fills in the data
fields and forwards the report up the tree in the next TRACK
message. Since TRACKs are not reliable, multiple copies are sent,
in a total of NumReplies consecutive TRACK messages from each
Receiver. Each statistics report is aggregated according to the
method described in the statistic, and the result is delivered to
the Sender.

Most aggregation options have a fixed length no matter how many
Receivers there are. The one exception is concatenation, which
creates a list of values from some or all Receivers, up to a length
of MaximumStatisticsListSize entries. It is NOT RECOMMENDED to use
this to create group-wide lists, unless the group's size is
carefully controlled.

4.4 Application Level Confirmed Delivery

Flow control and the reliability window are concerned with goodput:
with delivering data so that there is a high probability that it is
delivered at all Receivers. However, neither mechanism provides
explicit confirmation to the Sender as to the list of recipients of
each message. Application level confirmed delivery allows
applications to determine the set of receiving applications that
have received a given set of data messages.

There are three primary factors that determine the reliability
semantics of a message: the Sender's knowledge of the Receiver list,
the application-level actions that must be performed in order to
consider a message delivered, and the response to persistent failure
conditions at Receivers. For example, an extremely strong
distributed guarantee would consist of the following. First, the
full Receiver membership list is known at the Sender and verified to
make sure no Receivers have left the group. Second, the application
at each Receiver must write the Data to persistent store before it
can be acknowledged.
Third, Receivers are given a very long period of time (say, one
hour) to recover all lost Data messages before they are ejected from
the Data Session. In the meantime, transmission of Data messages is
flow controlled by the slowest Receivers.

A weaker form of reliability would include the following. First,
the Sender gets a count of Receivers, and otherwise depends on the
distributed group membership algorithms to maintain the membership
list. Second, Data messages are considered reliably delivered as
soon as the application receives the Data from TRACK. Third,
retransmissions are limited to only 30 seconds, and Receivers must
choose to leave the Data Session or continue with missing Data
messages if a failure takes longer than this period to recover from.

TRACK provides the functionality to easily implement a wide range of
application level confirmation semantics, based on how these three
items are configured. It is the application's responsibility to
select the configuration it desires for a given Data Session.

4.4.1 Application Level Confirmation Mechanisms

The primary mechanism for application level confirmation (ALC) of
delivery is the ALC report. To check for ALC of delivery, a Sender
issues an Application Level Confirmation Request by attaching this
message as an option to a Data message and reliably transmitting it
to all Receivers. Each ALC Request includes a specified level of
reliability, a reply redundancy factor, and the range of Data
message sequence numbers that the ALC Confirmation covers.

When a Receiver gets an ALC Request, it checks to see if the
application has delivered the specified range of Data messages,
including both the Low Confirmation Sequence Number and the High
Confirmation Sequence Number.
When it sends the next TRACK out, it sets the ConfirmationStatus
field to either SomeReceiversAcknowledge if it is still pending
confirmation, AllReceiversAcknowledge if it has application level
confirmation, ListOfFailures if it has a failure and
MaximumFailureListSize > 0, or FailuresExceedMaximumListSize
otherwise. It also sets the ConfirmCount to 1 if it has a
confirmation, and the PendingCount to 1 if it is still pending. If
the Immediate ACK bit is set in the ALC Request, the Receiver
generates an ACK immediately.

One example of how an application can implicitly signal confirmation
of delivery is through the freeing of buffers passed to it by the
transport. The API could specify that whenever an application has
freed a buffer containing one or more data messages, then these
messages are considered acknowledged by the application.
Alternatively, the application could be required to explicitly
acknowledge each message.

4.5 Distributed RTT Calculations

This TRACK BB provides two algorithms for distributed RTT
calculations -- LocalRTT measurements and SenderRTT measurements.
LocalRTT measurements are only between a parent and its children.
SenderRTT measurements are end-to-end RTT measurements, measuring
the RTT to the worst Receiver as selected by the congestion control
algorithms.

The SenderRTT is useful for congestion control. It can be used to
set the data rate based on the TCP response function, which is being
proposed for the congestion control building blocks.

The LocalRTT can be used to (a) quickly detect faulty children (as
described under fault detection) or (b) avoid sending unnecessary
retransmissions (as described in the local repair algorithm).

In the case of LocalRTT measurements, a parent initiates measurement
by including a ParentTimestamp field in a Heartbeat message sent to
its children.
When a child receives a Heartbeat message with this field set, it
notes the time of receipt using its local system clock and stores
this with the message as HeartbeatReceiveTime. When the child next
generates a TRACK, just before sending it, it measures its system
clock again as TRACKSendTime and calculates the LocalDallyTime:

   LocalDallyTime = TRACKSendTime - HeartbeatReceiveTime

The child includes this value, along with the ParentTimestamp field,
as fields in the next TRACK message sent. Every Heartbeat message
that is multicast to all children SHOULD include a ParentTimestamp
field.

The SenderRTT algorithm is similar. A Sender initiates the process
by including a SenderTimestamp field in a data message. When a
Receiver gets a message with this field set, it keeps track of the
DataReceiveTime for that message, and when it generates the next
TRACK message, it includes the SenderTimestamp and SenderDallyTime
values. These values are aggregated by Repair Heads, as described
above.

Each node only keeps track of the most recent values for
{SenderTimestamp, DataReceiveTime} and {ParentTimestamp,
HeartbeatReceiveTime}, replacing any older values whenever a new
message is received with these values set. As long as it has
non-zero values to report, each node sends up both a
{SenderTimestamp, SenderDallyTime} and a {ParentTimestamp,
LocalDallyTime} set of fields in each TRACK message generated.

Unless redefined by the TRACK PI, these RTT measurements are
averaged using an exponentially weighted moving average, where the
first RTT measurement, RTT_measurement, initializes the average
RTT_average, and each successive measurement is then averaged in
according to the following formula. The RECOMMENDED value for alpha
is 1/8.
   RTT_average = alpha * RTT_measurement + (1 - alpha) * RTT_average

4.6 SNMP Support

The Repair Heads and the Sender are designed to interact with SNMP
management tools. This allows network managers to easily monitor
and control the sessions being transmitted. SNMP MIBs for TRACK
nodes MAY be defined in a separate document. SNMP support is
OPTIONAL for Receiver nodes, but is RECOMMENDED for all other nodes.

4.7 Late Join Semantics

TRACK offers three flavors of late join support:

a) No Recovery
   A Receiver binds to a Repair Head after the session has started
   and agrees to the reliability service starting from the Sequence
   Number in the current data message received from the Sender.

b) Continuation
   This semantic is used when a Receiver has lost its Repair Head
   and needs to re-affiliate. In this case, the Receiver must
   indicate the oldest Sequence Number it needs to repair in order
   to continue the reliability service it had from the previous
   Repair Head. The binding occurs if this is possible.

c) No Late Join
   For some applications, it is important that a Receiver receives
   either all data or no data (e.g., software distribution). In
   this case option (c) is used.

These are specified by the LateJoinSemantics session parameter and
enforced by a Parent when a Child attempts to bind to it.

5. Message Types

The following table summarizes the messages and their fields used by
the TRACK BB. All messages contain the session identifier. For
more details, please see the sample TRACK PI [17].

+--------------------------------------------------------------------+
 Message        From     To        Mcast?
                                           Fields
+--------------------------------------------------------------------+
 BindRequest    Child    Parent    no      Scope, Level, Role,
                                           Rejoin,
                                           BindSequenceNumber,
                                           SubTreeCount
+--------------------------------------------------------------------+
 BindConfirm    Parent   Child     no      RepairAddr,
                                           BindSequenceNumber,
                                           LowestRepairAvailable,
                                           Level, ChildIndex, Role
+--------------------------------------------------------------------+
 BindReject     Parent   Child     no      Reason,
                                           BindSequenceNumber
+--------------------------------------------------------------------+
 UnbindRequest  Child    Parent    no      Reason, ChildIndex
+--------------------------------------------------------------------+
 UnbindConfirm  Parent   Child     no
+--------------------------------------------------------------------+
 EjectRequest   Parent   Child     either  Reason, AlternateParent
+--------------------------------------------------------------------+
 EjectConfirm   Child    Parent    no
+--------------------------------------------------------------------+
 Heartbeat      Parent   Child     either  Level, ParentTimestamp,
                                           ChildrenList, SeqNum,
                                           HighestReleased
+--------------------------------------------------------------------+
 NullData,      Sender   all       yes     SenderTimeStamp,
 OData                                     DataLength,
                                           HighestReleased, SeqNum,
                                           EndOfStream,
                                           TransmissionRate
+--------------------------------------------------------------------+
 RData          Parent   Child     yes     SenderTimeStamp,
                                           DataLength,
                                           HighestReleased, SeqNum,
                                           EndOfStream,
                                           TransmissionRate
+--------------------------------------------------------------------+
 Track          Child    Parent    no      BitMask, SubTreeCount,
                                           Slowest, HighestAllowed,
                                           ParentThere,
                                           ParentTimeStamp,
                                           ParentDallyTime,
                                           SenderTimeStamp,
                                           SenderDallyTime,
                                           CongestionControl,
                                           FailureList
+--------------------------------------------------------------------+
 ALCRequest     Sender   Receiver  yes     Immediate, Reliability,
                                           NumReplies, SeqNumRange
+--------------------------------------------------------------------+
 ALCReply       Child    Parent    yes     SeqNumRange,
                                           ConfirmStatus,
                                           ConfirmCount,
                                           PendingCount,
                                           FailedChildren
+--------------------------------------------------------------------+
 StatsRequest   Sender   Receiver  yes     Immediate, StatsSeqNum,
                                           NumReplies, StatsList
+--------------------------------------------------------------------+
 StatsReply     Child    Parent    yes     StatsSeqNum, StatsList
+--------------------------------------------------------------------+

The various fields of the messages are described as follows:

- BindSequenceNumber: This is a monotonically increasing sequence
  number for each bind request from a given Receiver for a given
  Data Session.

- Scope: an integer to indicate how far a repair message travels.
  This is optional.

- Rejoin: a flag indicating whether this Receiver was previously a
  member of this Data Session.

- Level: an integer that indicates the level in the repair tree.
  This value is used to keep loops from forming in the tree, in
  addition to indicating the distance from the Sender. Any changes
  in a node's level are passed down to the Tree BB using the
  treeLevelUpdate interface.

- Role: This indicates if the bind requestor is a Receiver or a
  Repair Head.

- SubTreeCount: This is an integer indicating the current number of
  Receivers below the node.

- RepairAddr: This field in the BindConfirm message is used to tell
  the Receiver which multicast address the Repair Head will be
  sending retransmissions on. If this field is null, then the
  Receiver should expect retransmissions to be sent on the Sender's
  data multicast address.

- AlternateParent: This is an optional field that specifies another
  parent a Child may attempt to bind to.
- SeqNum: an integer indicating the Sequence Number of a data
  message within a given Data Session. For a Heartbeat, it is the
  highest sequence number the parent knows about.

- ChildIndex: This is an integer the Repair Head assigns to a
  particular child. The child Receiver uses this value to implement
  the rotating TRACK generation algorithm.

- LowestRepairAvailable: This is the lowest sequence number that a
  Repair Head will provide repairs for.

- Reason: a code indicating the reason for the BindReject,
  UnbindRequest, or EjectRequest message.

- ParentTimestamp: This field is included in Heartbeat messages to
  signal the need to do a local RTT measurement from a parent. It
  is the time when the parent sent the message.

- ChildrenList: This field contains the identifiers for a list of
  children. As part of the keepalive message, this field, together
  with the SeqNum field, is used to urge the listed Receivers to
  send a TRACK (for the provided SeqNum). The Repair Head sending
  this must have been missing the regular TRACKs from these children
  for an extended period of time.

- SenderTimestamp: This field is included in Data messages to signal
  the need to do a round-trip time measurement from the Sender,
  through the tree, and back to the Sender. It is the time
  (measured by the Sender's local clock) when it sent the message.

- ApplicationSynch: a Sequence Number signaling a request for
  confirmed delivery by the application.

- EndOfStream: indicates that this message is the end of the data
  for this session.

- TransmissionRate: This field is used by the Sender to tell the
  Receivers its sending rate, in messages per second. It is part of
  the Data and NullData messages.

- HighestReleased: This field contains a Sequence Number
  corresponding to the trailing edge of the Sender's retransmission
  window.
  It is used (as part of the Data, NullData, or retransmission
  headers) to inform the Receivers that they should no longer
  attempt to recover messages with a smaller (or equal) Sequence
  Number.

- HighestAllowed: a Sequence Number, used for flow control from the
  Receivers. It signals the highest Sequence Number the Sender is
  allowed to send that will not overrun the Receivers' buffer pools.

- BitMask: an array of 1s and 0s. Together with a Sequence Number,
  it is used to indicate lost data messages. If the i-th element is
  a 1, it indicates that message SeqNum+i is lost.

- Slowest: This field characterizes the slowest Receiver in the
  subtree beneath (and including) the node sending the TRACK. It is
  used to provide information for the congestion control BB.

- SenderDallyTime: This field is associated with a SenderTimestamp
  field. It contains the sum of the waiting time that should be
  subtracted from the RTT measurement at the Sender.

- ParentDallyTime: This is the same as the SenderDallyTime, but is
  associated with a ParentTimestamp instead of a SenderTimestamp.

- DataLength: This is the length of the Data payload.

- CongestionControl: This includes any additional congestion control
  variables for aggregation, such as WorstLossRate,
  WorstEdgeThroughput, UnicastCost, and MulticastCost.

- ApplicationConfirms: This is the SeqNum value for which delivery
  has been confirmed by all children at or below this parent.

- Immediate: If set to 1, a Receiver should immediately send a TRACK
  on receipt of this packet.

- FailedChildren: This is a list of all children that have recently
  been dropped from the repair tree.

- Reliability: The level of reliability required in order to
  consider the set of data packets reliably delivered.
- NumReplies: The number of consecutive TRACK messages that should
  be sent with this message attached.

- SeqNumRange: The set of data messages that the ALC request applies
  to.

- ConfirmStatus: The acknowledgement status of the Receivers in the
  subtree up to the node that sends this message.

- ConfirmCount: The number of Receivers in the subtree up to the
  node that sends this message that have acknowledged the ALC
  request.

- PendingCount: The number of Receivers in this subtree that are
  still pending in their decision as to acknowledging this ALC
  request.

- StatsSeqNum: The number of this request for statistics.

- StatsList: The list of statistics to be filled in by Receivers and
  aggregated by the control tree.

6. Global Configuration Variables, Constants, and Reason Codes

6.1 Global Configuration Variables

These are variables that control the Data Session and are advertised
to all participants. Some of them MAY instead be configured as
constants.

- TimeMaxBindResponse: the time, in seconds, to wait for a response
  to a BindRequest. The initial value is TIMEOUT_PARENT_RESPONSE
  (recommended value is 3). The maximum value is
  MAX_TIMEOUT_PARENT_RESPONSE.

- MaxChildren: The maximum number of children a Repair Head is
  allowed to handle. Recommended value: 32.

- ConstantHeartbeatPeriod: Instead of dynamically calculating the
  HeartbeatPeriod, a constant period may be used. Recommended
  value: 3 seconds.

- MinimumHeartbeatPeriod: The minimum value for the dynamically
  calculated HeartbeatPeriod. Recommended value: 1 second.

- MinHoldTime: The minimum amount of time a Repair Head holds on to
  data messages.

- MaxHoldTime: The maximum amount of time a Repair Head holds on to
  data messages.

- AckWindow: The number of messages seen before a Receiver issues an
  acknowledgement. Recommended value: 32.
- LateJoinSemantics: The options available to a Receiver who wishes
  to join a Data Session that is already in progress.

- MaximumFailureListSize: The maximum number of entries that can be
  in a failure list. This MUST be small enough that the FailureList
  does not ever cause a TRACK to exceed the size of a maximum UDP
  packet. Recommended value: 800.

- MaximumStatisticsListSize: The maximum number of entries that can
  be in a statistics list. This MUST be small enough that the
  StatsList does not ever cause a TRACK to exceed the size of a
  maximum UDP packet. Recommended value: 100.

- MaximumDataRate: The maximum admission rate for data messages from
  the application to the Data Channel Protocol.

- MinimumDataRate: The minimum admission rate for data messages from
  the application to the Data Channel Protocol.

6.2 Constants

- NUM_MAX_PARENT_ATTEMPTS: The number of times to try to bind to a
  Repair Head before declaring a PARENT_UNREACHABLE error.
  Recommended value is 5.

- TIMEOUT_PARENT_RESPONSE: The minimum value, in seconds, between
  attempts to contact a parent. Recommended value is 1 second.

- MAX_TIMEOUT_PARENT_RESPONSE: The maximum value, in seconds,
  between attempts to contact a parent. Recommended value is 16.

- NULL_DATA_PERIOD: The time, in seconds, between transmissions of
  NullData messages. Recommended value is 1.

- FAILURE_DETECTION_REDUNDANCY: The number of times a message is
  sent without receiving a response before declaring an error.
  Recommended value is 3.

- MAX_TRACK_TIMEOUT: The maximum value for TRACKTimeout.
  Recommended value is 5 seconds.

- TRANSMISSION_REDUNDANCY: The number of times a failure
  notification is redundantly sent up the tree in a TRACK message.
  Recommended value is 3.
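For illustration, the recommended values above can be gathered into
a single configuration sketch; the dataclass shape and the
min_hold_time helper (using the example formula from 4.2.4) are
assumptions of this sketch, not part of the BB:

```python
# Hedged sketch: the recommended defaults of Sections 6.1 and 6.2
# collected in one place. A real TRACK PI defines how these values
# are negotiated and advertised; this is only an illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrackDefaults:
    max_children: int = 32
    constant_heartbeat_period: float = 3.0   # seconds
    minimum_heartbeat_period: float = 1.0    # seconds
    ack_window: int = 32                     # messages per TRACK
    maximum_failure_list_size: int = 800
    maximum_statistics_list_size: int = 100
    num_max_parent_attempts: int = 5
    timeout_parent_response: float = 1.0     # seconds
    max_timeout_parent_response: float = 16.0
    null_data_period: float = 1.0            # seconds
    failure_detection_redundancy: int = 3
    max_track_timeout: float = 5.0           # seconds
    transmission_redundancy: int = 3

    def min_hold_time(self) -> float:
        """Example MinHoldTime from 4.2.4:
        FAILURE_DETECTION_REDUNDANCY * 2 * ConstantHeartbeatPeriod."""
        return (self.failure_detection_redundancy
                * 2 * self.constant_heartbeat_period)
```

With these defaults, the example MinHoldTime of 4.2.4 works out to
3 * 2 * 3 = 18 seconds.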
6.3 Reason Codes

- BindReject reason codes:
  - LOOP_DETECTED
  - MAX_CHILDREN_EXCEEDED

- UnbindRequest reason codes:
  - SESSION_DONE
  - APPLICATION_REQUEST
  - RECEIVER_TOO_SLOW

- EjectRequest reason codes:
  - PARENT_LEAVING
  - PARENT_FAILURE
  - CHILD_TOO_SLOW
  - PARENT_OVERLOADED

7. Security

As specified in [12], the primary security requirement for a TRACK
protocol is protection of the transport infrastructure. This is
accomplished through the use of lightweight group authentication of
the control and, optionally, the data messages sent to the group.
These algorithms use IPsec and shared symmetric keys. For TRACK,
[12] recommends that there be one shared key for the Data Session and
one for each Local Control Channel. These keys are distributed
through a separate key manager component, which may be either
centralized or distributed. Each member of the group is responsible
for contacting the key manager, establishing a pair-wise security
association with the key manager, and obtaining the appropriate keys.

The exact algorithms for this BB are presently the subject of
research within the IRTF Secure Multicast Group (SMuG) and of
standardization within the Multicast Security working group.

8. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP
9, RFC 2026, October 1996.

[2] Whetten, B., et al., "Reliable Multicast Transport Building
Blocks for One-to-Many Bulk-Data Transfer", RFC 3048, January 2001.

[3] Handley, M., et al., "The Reliable Multicast Design Space for
Bulk Data Transfer", RFC 2887, August 2000.

[4] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.

[5] Whetten, B., Taskale, G., "Overview of the Reliable Multicast
Transport Protocol II (RMTP-II)",
IEEE Networking, Special Issue on Multicast, February 2000.

[6] Nonnenmacher, J., Biersack, E., "Reliable Multicast: Where to use
Forward Error Correction", Proc. 5th Workshop on Protocols for High
Speed Networks, Sophia Antipolis, France, October 1996.

[7] Nonnenmacher, J., et al., "Parity-Based Loss Recovery for
Reliable Multicast Transmission", Proc. ACM SIGCOMM '97, Cannes,
France, September 1997.

[8] Rizzo, L., "Effective erasure codes for reliable computer
communications protocols", DEIT Technical Report LR-970115.

[9] Nonnenmacher, J., Biersack, E., "Optimal Multicast Feedback",
Proc. IEEE INFOCOM 1998, March 1998.

[10] Whetten, B., Conlan, J., "A Rate Based Congestion Control Scheme
for Reliable Multicast", GlobalCast Communications Technical White
Paper, November 1998. http://www.talarian.com/rmtp-ii

[11] Padhye, J., et al., "Modeling TCP Throughput: A Simple Model and
its Empirical Validation", University of Massachusetts Technical
Report CMPSCI TR 98-008.

[12] Hardjono, T., Whetten, B., "Security Requirements for TRACK",
draft-ietf-rmt-pi-track-security-00.txt, June 2000. Work in Progress.

[13] Golestani, J., "Fundamental Observations on Multicast Congestion
Control in the Internet", Bell Labs, Lucent Technologies, paper
presented at the July 1998 RMRG meeting.

[14] Kadansky, M., Chiu, D., Wesley, J., Provino, J., "Tree-based
Reliable Multicast (TRAM)", draft-kadansky-tram-02.txt. Work in
Progress.

[15] Whetten, B., Basavaiah, M., Paul, S., Montgomery, T., "RMTP-II
Specification", draft-whetten-rmtp-ii-00.txt, April 8, 1998. Work in
Progress.

[16] Kadansky, M., Chiu, D. M., Whetten, B., Levine, B. N., Taskale,
G., Cain, B., Thaler, D., Koh, S. J., "Reliable Multicast Transport
Building Block: Tree Auto-Configuration",
draft-ietf-rmt-bb-tree-config-02.txt, March 2, 2001. Work in
Progress.
[17] Whetten, B., et al., "TRACK Protocol Instantiation Over UDP",
draft-ietf-rmt-track-pi-udp-00.txt, November 2002. Work in Progress.

[18] Adamson, B., et al., "NACK Oriented Reliable Multicast Protocol
(NORM)", draft-ietf-rmt-pi-norm-02.txt, July 2001. Work in Progress.

[19] Vicisano, L., et al., "Asynchronous Layered Coding - A Scalable
Reliable Multicast Protocol", draft-ietf-rmt-pi-alc-02.txt, July
2001. Work in Progress.

[20] Speakman, T., et al., "Pragmatic General Multicast (PGM)",
draft-speakman-pgm-spec-06.txt, February 2001. Work in Progress.

[21] Kermode, R., Vicisano, L., "Author Guidelines for RMT Building
Blocks and Protocol Instantiation Documents", RFC 3269.

10. Acknowledgements

We would like to thank the following people: Sanjoy Paul, Seok Joo
Koh, Supratik Bhattacharyya, Joe Wesley, and Joe Provino.

11. Authors' Addresses

Brian Whetten
890 Sea Island Lane
Foster City, CA 94404
b2@whetten.net

Dah Ming Chiu
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803
dahming.chiu@sun.com

Miriam Kadansky
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803
miriam.kadansky@sun.com

Seok Joo Koh
sjkoh@pec.etri.re.kr

Gursel Taskale
TIBCO Corporation
gursel@tibco.com

Full Copyright Statement

Copyright (C) The Internet Society (2000). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as
by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the
purpose of developing Internet standards in which case the procedures
for copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.