idnits 2.17.1 draft-ietf-rmt-design-space-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 20 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 21 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 37 instances of lines with control characters in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (Mar 2000) is 8805 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'LVS99' is mentioned on line 679, but not defined == Unused Reference: 'KCW98' is defined on line 854, but no explicit reference was found in the text == Unused Reference: 'RV97' is defined on line 875, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'BC94' -- Possible downref: Non-RFC (?) normative reference: ref. 'BKKKLZ95' -- Possible downref: Non-RFC (?) normative reference: ref. 'BLMR98' -- Possible downref: Non-RFC (?) normative reference: ref. 'CDT99' -- Possible downref: Non-RFC (?) normative reference: ref. 'FLST98' -- Possible downref: Non-RFC (?) normative reference: ref. 'FJM95' -- Possible downref: Non-RFC (?) normative reference: ref. 'Ha99' -- Possible downref: Non-RFC (?) normative reference: ref. 'HC97' -- Possible downref: Non-RFC (?) normative reference: ref. 'KCW98' -- Possible downref: Non-RFC (?) normative reference: ref. 'LMSSS97' -- Possible downref: Non-RFC (?) normative reference: ref. 'PPV98' -- Possible downref: Non-RFC (?) normative reference: ref. 'Ri97' -- Possible downref: Non-RFC (?) normative reference: ref. 'RV97' -- Possible downref: Non-RFC (?) normative reference: ref. 'RVC98' -- Possible downref: Non-RFC (?) normative reference: ref. 'RMWT98' -- Possible downref: Non-RFC (?) normative reference: ref. 'WHA98' -- Possible downref: Non-RFC (?) normative reference: ref. 'WKM94' -- Possible downref: Non-RFC (?) normative reference: ref. 'WGL97' Summary: 8 errors (**), 0 flaws (~~), 7 warnings (==), 20 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force RMT WG 3 INTERNET-DRAFT M. Handley, ACIRI 4 draft-ietf-rmt-design-space-01 B. Whetten, Talarian 5 R. Kermode, Motorola 6 S. Floyd, ACIRI 7 L. Vicisano, Cisco 8 M. Luby, Digital Fountain 9 10th Mar 2000 10 Expires: Sep 2000 12 The Reliable Multicast Design Space for Bulk Data Transfer 14 Status of this Memo 16 This document is an Internet-Draft and is in full conformance with all 17 provisions of Section 10 of RFC2026. 19 Internet-Drafts are working documents of the Internet Engineering Task 20 Force (IETF), its areas, and its working groups. Note that other groups 21 may also distribute working documents as Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet- Drafts as reference material 26 or to cite them other than as "work in progress." 28 To view the list Internet-Draft Shadow Directories, see 29 http://www.ietf.org/shadow.html. 31 Copyright Notice 33 Copyright (C) The Internet Society (2000). All Rights Reserved. 35 Abstract 37 The design space for reliable multicast is rich, with many possible 38 solutions having been devised. However, application requirements serve 39 to constrain this design space to a relatively small solution space. 40 This document provides an overview of the design space and the ways in 41 which application constraints affect possible solutions. 43 1. Introduction 45 The term ``general purpose reliable multicast protocol'' is something of 46 an oxymoron. Different applications have different requirements of a 47 reliable multicast protocol, and these requirements constrain the design 48 space in ways that two applications with differing requirements often 49 cannot share a single solution. There are however many successful reli- 50 able multicast protocol designs that serve more special purpose require- 51 ments well. 53 In this document we attempt to review the design space for reliable mul- 54 ticast protocols intended for bulk data transfer. The term bulk data 55 transfer should be taken as having broad meaning - the main limitations 56 are that the data stream is continuous and long lived - constraints nec- 57 essary for the forms of congestion control we currently understand. The 58 purpose of this review is to gather together an overview of the field 59 and to make explicit the constraints imposed by particular mechanisms. 60 The aim is to provide guidance to the standardization process for proto- 61 cols and protocol building blocks. In doing this, we cluster potential 62 solutions into a number of loose categories - real protocols may be com- 63 posed of mechanisms from more than one of these clusters. 65 The main constraint on solutions is imposed by the need to scale to 66 large receiver sets. For small receiver sets the design space is much 67 less restricted. 69 2. Application Constraints 71 Application requirements for reliable multicast (RM) are as broad and 72 varied as the applications themselves. However, there are a set of 73 requirements that significantly affect the design of an RM protocol. A 74 brief list includes: 76 o Does the application need to know that everyone received the data? 78 o Does the application need to constrain differences between 79 receivers? 81 o Does the application need to scale to large numbers of receivers? 83 o Does the application need to be totally reliable? 85 o Does the application need ordered data? 87 o Does the application need to provide low-delay delivery? 88 o Does the application need to provide time-bounded delivery? 90 o Does the application need many interacting senders? 92 o Is the application data flow intermittent? 94 o Does the application need to work in the public Internet? 96 o Does the application need to work without a return path (e.g. 97 satellite)? 99 o Does the application need to provide secure delivery? 101 In the context of standardizing bulk data transfer protocols, we can 102 rule out applications with multiple interacting senders and intermittent 103 data flows. It is not that these applications are unimportant, but that 104 we do not yet have effective congestion control for such applications. 106 2.1. Did everyone receive the data? 108 In many applications a logically defined unit or units of data is to be 109 delivered to multiple clients, e.g., a file or a set of files, a soft- 110 ware package, a stock quote or package of stock quotes, an event notifi- 111 cation, a set of slides, a frame or block from a video. An application 112 data unit (ADU) is defined to be a logically separable unit of data that 113 is useful to the application. In some cases, an application data unit 114 may be short enough to fit into a single packet (e.g., an event notifi- 115 cation or a stock quote), whereas in other cases an application data 116 unit may be much longer than a packet (e.g., a software package). 118 A protocol may optionally provide delivery confirmation to ensure reli- 119 able delivery, i.e., a mechanism for receivers to inform the sender when 120 data has been delivered. There are two types of confirmation, at the 121 application data unit level and at the packet level. Application data 122 unit confirmation is useful at the application level, e.g., to inform 123 the application about receiver progress and to decide when to stop send- 124 ing packets about a particular application data unit. Packet confirma- 125 tion is useful at the transport level, e.g., to inform the transport 126 level when it can release buffer space being used for storing packets 127 for which delivery has been confirmed. 129 Some applications have a strong requirement for confirmation that all 130 the receivers got an ADU, or if not, to be informed of which specific 131 receivers failed to receive the entire ADU. Examples include applica- 132 tions where receivers pay for data, and reliable file-system replica- 133 tion. Other applications do not have such a requirement. An example is 134 the distribution of free software. 136 If the application does need to know that every receiver got the ADU, 137 then a positive acknowledgment must be received from every receiver, 138 although it may be possible to aggregate these acknowledgments. If the 139 application needs to know precisely which receivers failed to get the 140 ADU, additional constraints are placed on acknowledgment aggregation. 142 It should be noted that different mechanisms can be used for ADU-level 143 confirmation and packet-level confirmation in the same application. For 144 example, an ADU-level confirmation mechanism using positive acknowledg- 145 ments may sit on top of a packet-level NACK or FEC-based transport. 146 Typically this only makes sense when ADUs are significantly larger than 147 a single packet. 149 2.2. Constraining differences 151 Some applications need to constrain differences between receivers so 152 that the data reception characteristics for all receivers falls within 153 some range. An example is a stock price feed, where it is unacceptable 154 for a receiver to suffer delivery that is delayed significantly more 155 than any other receiver. 157 This requirement is difficult to satisfy without harming performance. 158 Typically solutions involve not sending more than a limited amount of 159 new data until positive acknowledgments have been received from all the 160 receivers. Such a solution does not cope with network and end-system 161 failures well. 163 2.3. Receiver Set Scaling 165 There are many applications for RM that do not need to scale to large 166 numbers of receivers. For such applications, a range of solutions may 167 be available that are not available for applications where scaling to 168 large receiver sets is a requirement. 170 A protocol must achieve good throughput of application data units to 171 receivers. This means that most data that is delivered to receivers is 172 useful in recovering the application data unit that they are trying to 173 receive. A protocol must also provide good congestion control to fairly 174 share the available network resources between all applications. 175 Receiver set scaling is one of the most important constraints in meeting 176 these requirements, because it strictly limits the mechanisms that can 177 be used to achieve these requirements to those that will efficiently 178 scale to a large receiver population. Acknowledgement packets have been 179 employed by many systems to achieve these goals, but it is important to 180 understand the strength and limitations of different ways of using such 181 packets. 183 In a very small system, it may be acceptable to have the receivers 184 acknowledge every packet. This approach provides the sender with the 185 maximum amount of information about reception conditions at all the 186 receivers, information that can be used both to achieve good throughput 187 and to achieve congestion control. 189 For larger systems, such ``flat ACK'' schemes cause acknowledge implo- 190 sions at the sender. Attempts have been made to reduce this problem by 191 sending aggregate ACKs infrequently [RMWT98, BC94], but it is very dif- 192 ficult to incorporate effective congestion control into such protocols 193 because of the spareceness of feedback. 195 Using negative acknowledgments (NACKs) instead of ACKs reduces this 196 problem to one of NACK implosion (only from the receivers missing the 197 packets), and because the sender really only needs to know that at least 198 one receiver is missing data in order to achieve good throughput, vari- 199 ous NACK suppression mechanisms can be applied. 201 An alternative to NACKs is ACK aggregation, which can be done by arrang- 202 ing the receivers into a logical tree, so that each leaf sends ACKs to 203 its parent which aggregates them, and passes them on up the tree. Tree- 204 based protocols scale well, but tree formation can be problematic. 206 Other ACK topologies such as rings are also possible, but are often more 207 difficult to form and maintain than trees are. An alternative strategy 208 is to add mechanisms to routers so that they can help out in achieving 209 good throughput or in reducing the cost of achieving good throughput. 211 All these solutions improve receiver set scaling, but they all have lim- 212 its of one form or another. One class of solutions scales to an infi- 213 nite number of receivers by having no feedback channel whatsoever in 214 order to achieve good throughput. These open-loop solutions take the 215 initial data and encode it using an FEC-style mechanism. This encoded 216 data is transmitted in a continuous stream. Receivers then join the 217 session and receive packets until they have sufficient packets to decode 218 the original data, at which point they leave the session. 220 Thus, it is clear that the intended scale of the session constrains the 221 possible solutions. All solutions will work for very small sessions, 222 but as the intended receive set increases, the range of possible solu- 223 tions that can be deployed safely decreases. 225 It should also be noted that hybrids of these mechanisms are possible, 226 and that using one mechanism at the packet-level and a different (typi- 227 cally higher overhead) solution at the ADU level may also scale reason- 228 ably if the ADUs are large compared to packets. 230 2.4. Total vs Semi-reliable 232 Many applications require delivery of application data units to be 233 totally reliable; if any of the application data unit is missing, none 234 of the received portion of the application data unit is useful. File 235 transfer applications are a good example of applications requiring total 236 reliability. 238 However, some applications do not need total reliability. An example is 239 audio broadcasting, where missing packets reduce the quality of the 240 received audio but do not render it unusable. Such applications can 241 sometimes get by without any additional reliability over native IP reli- 242 ability, but often having a semi-reliable multicast protocol is desir- 243 able. 245 2.5. Time-bounded Delivery 247 Many applications just require data to be delivered to the receivers as 248 fast as possible. They have no absolute deadline for delivery. 250 However, some applications have hard delivery constraints - if the data 251 does not arrive at the receiver by a certain time, there is no point in 252 delivering it at all. Such time-boundedness may be as a result of real- 253 time constraints such as with audio or video streaming, or as the result 254 of new data superseding old data. In both cases, the requirement is for 255 the application to have a greater degree of control over precisely what 256 the application sends at which time than might be required with applica- 257 tions such as file transfer. 259 Time-bounded delivery usually also implies a semi-reliable protocol, but 260 the converse does not necessarily hold. 262 3. Network Constraints 264 The properties of the network in which the application is being deployed 265 may themselves constrain the reliable multicast design space. 267 3.1. Internet vs Intranet 269 In principle the Internet and intranets are the same. In practice how- 270 ever, the fact that an intranet is under one administration might allow 271 for solutions to be configured that can not easily be done in the public 272 Internet. Thus, if the data is of very high value, it might be appro- 273 priate to enhance the routers to provide assistance to a reliable multi- 274 cast transport protocol. In the public Internet, it is less likely that 275 the additional expense required to support this state in the routers 276 would be acceptable. 278 3.2. Return Path 280 In principle, when feedback is required from receivers, this feedback 281 can be multicast or unicast. Multicast feedback has advantages, espe- 282 cially in NACK-based protocols where it is valuable for NACK suppres- 283 sion. However, it is not clear at this time whether all ISPs will allow 284 all members of a session to send to that session. If multicast feedback 285 is not allowed, then unicast feedback can almost always be substituted, 286 although often at the expense of additional messages and mechanisms. 288 Some networks may not allow any form of feedback however. The primary 289 example of this occurs with satellite broadcasts where the back channel 290 may be very narrow or even non-existent. For such networks the solution 291 space is very constrained - only FEC-based encodings have any real 292 chance of working. If the receivers are direct satellite receivers, 293 then no congestion control is needed, but it is dangerous to make such 294 assumptions because it is possible for a satellite hop to feed down- 295 stream networks. Thus, congestion control still needs to be considered 296 with solutions that do not have a return path. 298 3.3. Network Assistance 300 A reliable multicast protocol must involve mechanisms running in end 301 hosts, and must involve routers forwarding multicast packets. However 302 under some circumstances, it is possible to rely on some additional 303 degree of assistance from network elements. Broadly speaking we can 304 cluster RM protocols into four classes depending on the degree of sup- 305 port received from other network elements. 307 No Additional Support 308 The routers merely forward packets, and only the sender and 309 receivers have any reliable multicast protocol state. 311 Layered Approaches 312 Data is split across multiple multicast groups. Receivers join 313 appropriate groups to receive only the traffic they require. This 314 may in some cases require fast join or leave functionality from the 315 routers, and may require more forwarding state in the routers. 317 Server-based Approaches 318 Additional nodes are used to assist with data delivery or feedback 319 aggregation. These additional nodes might not be normal senders or 320 receivers, and may be present on the distribution or feedback tree 321 only to provide assistance to the reliable multicast protocol. They 322 would not otherwise receive the multicast traffic. 324 Router-based Approaches 325 With router-based approaches, routers on the normal data distribu- 326 tion tree from the sender to the receivers assist in the delivery of 327 data or feedback aggregation or suppression. As routers can 328 directly influence multicast routing, they have more control over 329 which traffic goes to which group members than server-based 330 approaches. However routers do not normally have a large amount of 331 spare memory or processing power, which restricts how much function- 332 ality can be placed in the routers. In addition, router code is 333 normally more difficult to upgrade than application code, so router- 334 based approaches need to be very general as they are more difficult 335 to deploy and to change. 337 4. Good Throughput Mechanisms 339 Two main concerns that a RM protocol must address are congestion control 340 and good throughput. Packet loss plays a major role with respect to 341 both concerns. The primary symptom of congestion in many networks is 342 packet loss. The primary obstacle that must be overcome to achieve good 343 throughput is packet loss. Thus, measuring and reacting to packet loss 344 is crucial to address both concerns. RM solutions that address these 345 concerns can be roughly categorized as using one or more of the follow- 346 ing techniques: 348 o Data packet acknowledgment. 350 o Negative acknowledgment of missing data packets. 352 o Redundancy allowing not all packets to be received. 354 These techniques themselves can be usefully subdivided, so that we can 355 examine the parts of the requirement space in which each mechanism can 356 be deployed. In this section, we focus on using these mechanisms for 357 achieving good throughput, and in the next section we focus on using 358 these mechanisms for congestion control. 360 4.1. ACK-based Mechanisms 362 The simplest ACK-based mechanism involves every receiver sending an ACK 363 packet for every data packet it receives and resending packets that are 364 lost by any receiver. Such mechanisms are limited to very small 365 receiver groups by the implosion of ACKs received at the sender, and for 366 this reason they are impractical for most applications. 368 Putting multiple ACKs into a single data packet [RMWT98] reduces the 369 implosion problem by a constant amount, allowing slightly larger 370 receiver groups. However a limit is soon reached whereby feedback to 371 the sender is too infrequent for sender-based congestion control mecha- 372 nisms to work reliably. 374 Arranging the receivers into a ring [WKM94] whereby an ``ACK-token'' is 375 passed around the ring prevents the implosion problem for data. However 376 ring creation and maintenance may itself be problematic. Also if ring 377 creation does not take into account network topology (something which is 378 difficult to achieve in practice), then the number of ACK packets cross- 379 ing the network backbone for each data packet sent may increase O(n) 380 with the number of receivers. 382 4.1.1. Tree-based ACK Mechanisms 384 Arranging the receivers into a tree [MWB+98, KCW98] whereby receivers 385 generate ACKs to a parent node, which aggregates those ACKs to its par- 386 ent in turn, is both more robust and more easily configured than a ring. 387 The ACK-tree is typically only used for ACK-aggregation - data packets 388 are multicast from the sender to the receivers as normal. Trees are 389 easier to construct than rings because more local information can be 390 used in their construction. Also they can be more fault tolerant than 391 rings because node failures only affect a subset of receivers, each of 392 which can easily and locally decide to by-pass its parent and report 393 directly to the node one level higher in the tree. With good ACK-tree 394 formation, tree-based ACK mechanisms have the potential to be one of the 395 most scalable RM solutions. 397 To be simple to deploy, tree-based protocols must be self-organizing - 398 the receivers must form the tree themselves using local information in a 399 scalable manner. Such mechanisms are possible, but are not trivial. 400 The main scaling limitations of tree-based protocols therefore come from 401 the tree formation and maintenance mechanisms rather than from the use 402 of ACKs. Without such a scalable and automatic tree-formation mecha- 403 nism, tree-based protocols must rely on manual configuration, which sig- 404 nificantly limits their applicability (often to intranets) and (due to 405 the complexity of configuration) their scalability. 407 Orthogonal to the issue of tree formation is the issue of subtree 408 retransmission. With appropriate router mechanisms, or the use of mul- 409 tiple multicast groups, it is possible to allow the intermediate tree 410 nodes to retransmit missing data to the nodes below them in the tree 411 rather than relying on the original sender to retransmit the data. This 412 relies on there being a good correlation at the point of the intermedi- 413 ate node between the ACK tree and the actual data tree, as well as there 414 being a mechanism to constrain the retransmission to the subtree. A 415 good automatic tree formation mechanism combined with the use of admin- 416 istrative scoped multicast groups might provide such a solution. Without 417 such tree formation mechanisms, subtree retransmission is difficult to 418 deploy in large groups in the public internet. This could also be 419 solved by the use of transport-level router mechanisms to assist or per- 420 form retransmission, although existing router mechanisms [FLST98] sup- 421 port NACK-based rather than ACK-based protocols. 423 Another important issue is the nature of the aggregation performed at 424 interior nodes on the ACK-tree. Such nodes could: 426 1. aggregate ACKs by sending a single ACK when all their children have 427 ACKed, 429 2. aggregate ACKs by listing all the children that have ACKed, 431 3. send an aggregated ACK with a NACK-like exception list. 433 For data packets, 1. is clearly more scalable, and should be preferred. 434 However if the sender needs to know exactly which receivers received the 435 data, 2. and 3. provide this information. Fortunately, there is usually 436 no need to do this on a per-packet basis, but rather on a per-ADU basis. 437 Doing 1. on a per packet basis, and 3. on a per ADU basis is the most 438 scalable solution for applications that need this information, and suf- 439 fers virtually no disadvantage compared to the other solutions used on a 440 per-packet basis. 442 4.2. NACK-based mechanisms 444 Instead of sending an ACK for every data packet received, receivers can 445 send a negative acknowledgment (NACK) for every data packet they dis- 446 cover they did not receive. This has a number of advantages over ACK- 447 based mechanisms: 449 o The sender no longer needs to know exactly how many receivers there 450 are. This removes the topology-building phase needed for ring- or 451 tree-style ACK-based algorithms. 453 o Fault-tolerance is made somewhat simpler by making receivers respon- 454 sible for reliability. 456 o Sender state can be significantly reduced because the sender does 457 not need to keep track of the receivers state. 459 o Only a single NACK is needed from any receiver to indicate a packet 460 that is missing by any number of receivers. Thus NACK suppression 461 is possible. 463 The disadvantages are that it is more difficult for the sender to know 464 that it can free transmission buffers, and that additional session level 465 mechanisms are needed if the sender really needs to know if a particular 466 receiver actually received all the data. However for many applications, 467 neither of these is an issue. 469 4.2.1. NACK Suppression 471 The key differences between NACK-based protocols is in how NACK-suppres- 472 sion is performed. The goal is for only one NACK to reach the sender 473 (or a node that can resend the missing data) as soon as possible after 474 the loss is first noticed, and for only one copy of the missing data to 475 be received by those nodes needing retransmission. 477 Different mechanisms come close to satisfying these goals in different 478 ways. 480 o SRM [FJM95] uses random timers weighted by the round trip time 481 between the sender and each node missing the data. This is effec- 482 tive, but requires computing the RTT to each receiver before sup- 483 pression works properly. 485 o NTE [HC97] uses a sender-triggered mechanism based on random keys 486 and sliding masks. This does not require random timers, and works 487 for very large sessions, but makes it difficult to provide the con- 488 stant low-level stream of feedback needed to perform congestion con- 489 trol. 491 o AAP [Ha99] uses exponentially distributed random timers and is 492 effective for large sessions without needing to compute the RTT to 493 each receiver. 495 o PGM [FLST98] and LMS [PPV98] use additional mechanisms in routers to 496 suppress duplicate NACKs. In the case of PGM, router assistance 497 suppliments SRM-stype random timers and localizes the suppression so 498 that the whole group does not need suppressing. 500 The most general of these mechanisms is probably exponentially weighted 501 random timers. Although SRM style timers can reduce feedback delay, 502 they are harder to use correctly in situations where all the RTTs are 503 not known, or where the number of respondees is unknown. In contrast, 504 exponentially weighted random timers work well across a large range of 505 session sizes with good worst case delay characteristics. 507 Either form of random timer based mechanism can be supplemented by 508 router-support where it is available. Sender triggered NACK mechanisms 509 (e.g. [HC97]) are more difficult to integrate with router-based support 510 mechanisms. 512 4.3. Replication 514 Some RM protocols can be designed so as to not need explicit reliability 515 mechanisms except in comparatively rare cases. An example is in a mul- 516 ticast game, where the position of a moving object is continuously mul- 517 ticast. This positional stream does not require additional reliability 518 because a new position superseding the old one will be sent before any 519 retransmission could take place. However, when the moving object inter- 520 acts with other objects or stops moving, then an explicit reliability 521 mechanism is required to reliably send the interaction information or 522 last position. 524 It is not just games that can be built in this manner - the NTE shared 525 text editor[HC97] uses just such a mechanism with changes to a line of 526 text. For every change the whole line is sent, and so long as the user 527 keeps typing no explicit reliability mechanism is needed. The major 528 advantage of replication is that it is not susceptible to spatially 529 uncorrelated packet loss. With a traditional ACK or NACK based proto- 530 col, the probability of any particular packet being received by all the 531 receivers in a large group can be very low. This leads to high retrans- 532 mission rates. In contrast, replicated streams do not suffer as the 533 size of the receiver group increases - different receivers lose differ- 534 ent packets, but this does not increase network traffic. 536 4.4. Packet-level Forward Error Correction 538 Forward Error Correction (FEC) is a well known technique for protecting 539 data against corruption. For reliable multicast it is most useful in 540 the form of erasure codes. 542 The simplest form of packet-level FEC is to take a group of packets that 543 is to be sent, and to XOR the packets together to form a newpacket which 544 is also sent. If there were three original packets plus the XOR packet 545 sent, then if a receiver is missing any one of the original data pack- 546 ets, but receives the XOR packet, then it can reproduce the missing 547 original packet. 549 More general erasure codes exist [BKKKLZ95], [Ri97], [LMSSS97] that 550 allow the generation of n encoding packets from k original data packets. 551 In such cases, so long as at least k of the n encoding packets are 552 received, then the k original data packets can be reproduced. 554 To apply FEC the sender groups data packets into rounds, and encoding 555 packets are produced based on all the data packets in a round. A round 556 may consist of all data packets in an entire application data unit in 557 some cases, whereas in other cases it may consist of a group of data 558 packets that make up only a small portion of an application data unit. 560 Using erasure codes to repair packet loss is a significant improvement 561 over simple retransmission because the dependency on which packets have 562 been lost is removed. Thus, the amount of repair traffic required to 563 repair spatially uncorrelated packet loss is considerably lessened. 565 We can divide packet-level FEC schemes into two categories: pro-active 566 FEC and reactive FEC. The difference between the two is that for pro- 567 active FEC the sender decides a priori how many encoding packets to send 568 for each round of data packets, whereas for reactive FEC the sender ini- 569 tially transmits only the original data packets for each round. Then, 570 the sender uses feedback from the receivers to compute how many packets 571 were lost by the receiver that experienced the most loss in each round, 572 and then only that number of additional encoding packets are sent for 573 that round. These encoding packets will then also serve to repair loss 574 at the other receivers that are missing fewer packets.The receivers 575 report via ACKs or NACKs how many packets are missing from each round. 576 With NACKs, only the receiver missing the most packets need send a NACK 577 for this round, so this is used to weight the random timers in the NACK 578 calculation. 580 Proactive and reactive FEC can be combined, e.g., a certain amount of 581 proactive FEC can be sent for each round and if there are receivers that 582 experience more loss than can be overcome by this for some rounds then 583 they can request and receive additional encoding packets for these 584 rounds. 586 FEC is very effective at reducing the repair traffic for packet loss. 587 However, it requires that the data to be sent to be grouped into rounds, 588 which can add to end-to-end latency. For bulk-data applications this is 589 typically not a problem, but this may be an issue for interactive appli- 590 cations where replication may be a better solution. 592 4.5. Layered FEC 594 An alternative use of packet level FEC is possible when data is spread 595 across several multicast groups [RVC98], [BLMR98]. In such cases, the 596 original k data packets are used to generate n encoding packets, where n 597 is much larger than k. The n encoded packets are then striped across 598 multiple multicast groups. When a receiver wishes to receive the origi- 599 nal data it joins one or more of the multicast groups, and receives the 600 encoding packets. Once it has received k different encoding packets, 601 the receiver can then leave all the multicast groups and reconstruct the 602 original data. 604 The primary importance of such a layering is that it allows different 605 receivers to be able to receive the traffic at different rates according 606 to the available capacity. Such schemes do not require any form of 607 feedback from the receivers to the sender to ensure good throughput, and 608 therefore the need for good throughput does not constrain the size of 609 the receiver set. However, to perform adequate network congestion con- 610 trol using receiver joins and leaves in this manner may require coordi- 611 nation between members that are behind the same congested link from the 612 sender. As described in the next section, [RVC98] suggests such a lay- 613 ered congestion control scheme. 615 5. Congestion Control Mechanisms 617 The basic delivery model of the Internet is best-effort service. No 618 guarantees are given as to throughput, delay or packet loss. End-sys- 619 tems are expected to be adaptive, and to reduce their transmission rate 620 to a level appropriate for the congestion state of the network. 621 Although increasingly the Internet will start to support reserved band- 622 width and differentiated service classes for specialist applications, 623 unless an end-system knows explicitly that it has reserved bandwidth, it 624 must still perform congestion control. 626 Broadly speaking, there are five classes of single-sender multicast con- 627 gestion control solution: 629 o Sender-controlled, one group. 631 A single multicast group is used for data distribution. Feedback 632 from the group members is used to control the rate of this group. 633 The goal is to transmit at a rate dictated by the slowest receiver. 635 o Sender-controlled, multiple groups. 637 One initial multicast group is adaptively subdivided into multiple 638 subgroups with subdivisions centered on congestion points in the 639 network. Application-level relays buffer data from a group nearer 640 the original sender, and retransmit it at a slower rate into a group 641 further from the original sender. In this way, different receivers 642 can receiver the data at different rates. Sender-based congestion 643 control takes place between the members of a subgroup and their 644 relay. 646 o Receiver-controlled, one group. 648 A single multicast group is used for data distribution. The 649 receivers determine if the sender is transmitting too rapidly for 650 the current congestion state of the network, and they leave the 651 group if this is the case. 653 o Receiver-controlled, layered organization. 655 A layered approach for how to combine this scheme with a congestion 656 control protocol that requires no receiver feedback is described in 657 [RVC98]. The sender stripes data across multiple multicast groups 658 simultaneously. Receivers join and leave these layered groups 659 depending on their measurements of the congestion state of the net- 660 work, so that the amount of data being received is always appropri- 661 ate. However, this scheme relies on receivers to join and leave the 662 different multicast groups in a coordinated fashion behind a bottle- 663 neck link, and it has not yet been completely confirmed that this 664 approach will scale in practice to the Internet. As a result, more 665 work on this congestion control mechanism would be beneficial. 667 o Router-based congestion control. 669 It is possible to add additional mechanisms to multicast routers to 670 assist in multicast congestion control. Such mechanisms could 671 include: 673 o Conditional joins (a multicast join that specifies a loss rate 674 above which it is acceptable for the router to reject the join). 676 o Router filtering of traffic that exceeds a reasonable rate. 677 This may include mechanisms for filtering traffic at different 678 points in the network at different rates depending on local con- 679 gestion conditions [LVS99]. 681 o Fair queuing schemes combined with end-to-end adaptation. 683 Router-based schemes generally require more state in network routers 684 than has traditionally been acceptable for backbone routers. Thus, 685 in the near-term, such schemes are only likely to be applicable for 686 intranet solutions. 688 For reliable multicast protocols, it is important to consider congestion 689 control at the same time as reliability is being considered. The same 690 mechanisms that are used to provide reliability will sometimes be used 691 to provide congestion control. 693 In the case of receiver-based congestion control, open-loop delivery 694 using FEC is the likely choice for achieving good throughput for bulk- 695 data transfer. This is because open-loop delivery requires no feedback 696 from receivers, and thus it is a perfect match with a receiver-based 697 congestion-control mechanism that operates without feedback from 698 receivers. 700 6. Security Considerations 702 Generally speaking, security considerations have relatively little 703 effect on constraining the design space for reliable multicast proto- 704 cols. The primary issues constraining the design space are all related 705 to receiver-set scaling. For authentication of the source and of data 706 integrity, receiver-set scaling is not a significant issue. However, 707 for data encryption, key distribution and particularly re-keying may be 708 significantly affected by receiver-set scaling. Tree and graph based 709 re-keying solutions[WHA98,WGL97] would appear to be appropriate solu- 710 tions to these problems. It is not clear however that such re-keying 711 solutions need to directly affect the design of the data distribution 712 part of a reliable multicast protocol. 714 The primary question to consider for the security of reliable multicast 715 protocols is the role of third-parties. If nodes other than the origi- 716 nal source of the data are allowed to send or resend data packets, then 717 the security model for the protocol must take this into account. In 718 particular, it must be clear whether such third parties are trusted or 719 untrusted. A requirement for trusted third parties can make protocols 720 difficult to deploy on the Internet. 722 Untrusted third parties (such as receivers that retransmit the data) may 723 be used so long as the data authentication mechanisms take this into 724 account. Typically this means that the original sender digitally signs 725 and timestamps the data, and that the third parties resend this signed 726 timestamped payload unmodified. 728 Unlike unicast protocols, denial-of-service attacks on multicast trans- 729 port state are easy if the protocol design does not take such attacks 730 into account. This is because any receiver can join the session, and 731 can then produce feedback that influences the progress of a session 732 involving many other receivers. Hence protection against denial-of-ser- 733 vice attacks on reliable multicast protocols must be carefully consid- 734 ered. A receiver that requests retransmission of every packet, or that 735 refuses to acknowledge packets in an ACK-based protocol can potentially 736 bring a reliable multicast session to a standstill. Senders must have 737 appropriate policy to deal with such conditions, and if necessary, evict 738 the receiver from the group. A single receiver masquerading as a large 739 number of receivers may still be an issue under such circumstances with 740 protocols that support NACK-like functionality. Providing unique 741 ``keys'' to each NACKer when they first NACK using a unicast response 742 might potentially prevent such attacks. 744 Denial-of-service attacks caused by traffic flooding are however some- 745 what easier to protect against than with unicast. Unwanted senders can 746 simply be pruned from the distribution tree using the mechanisms imple- 747 mented in IGMP v3[CDT99]. 749 7. Conclusions 751 In this document we present an overview of the design space for reliable 752 multicast within the context of one-to-many bulk-data transfer. Other 753 flavors of multicast application are not considered in this document, 754 and hence the overview given should not be considered inclusive of the 755 design space for protocols that fall outside the context of one-to-many 756 bulk-data transfer. During the course of this overview, we have reaf- 757 firmed the notion that the process of reliable multicast protocol design 758 is affected by a number of factors that render the generation of a "one 759 size fits all solution" moot. These factors are then described to show 760 how an application's needs serve to constrain the set of available tech- 761 niques that may be used to create a reliable multicast protocol. We 762 examined a number of basic techniques and to show how well they can meet 763 the needs of certain types of applications. 765 This document is intended to provide guidance to the IETF community 766 regarding the standardization of reliable multicast protocols for bulk- 767 data transfer. Given the degree to which application requirements con- 768 strain reliable multicast solutions, and the diverse set of applications 769 that need to be supported, it should be clear that any standardization 770 work should take great pains to be future-proof. This would seem to 771 imply not standardizing complete reliable multicast transport protocols 772 in one pass, but rather examining the degree to which such protocols are 773 separable into functional building blocks, and standardizing these 774 blocks separately to the maxmimum degree that makes sense. Such an 775 approach allows for protocol evolution, and allows applications with new 776 constraints to be supported with maximal reuse of existing and tested 777 mechanisms. 779 8. Acknowledgments 781 This document represents an overview of the reliable multicast design 782 space. The ideas presented are not those of the authors, but are col- 783 lected from the varied presentations and discussions in the IRTF Reli- 784 able Multicast Research Group. Although they are too numerous to list 785 here, we thank everyone who has participated in these discussions for 786 their contributions. 788 9. Author's Addresses 790 Mark Handley, Sally Floyd 791 ATT Center for Internet Research at ICSI, 792 International Computer Science Institute, 793 1947 Center Street, Suite 600, 794 Berkeley, CA 94704, USA 795 mjh@aciri.org, floyd@aciri.org 797 Brian Whetten 798 Talarian Corporation, 799 333 Distel Circle, 800 Los Altos, CA 94022, USA 801 whetten@talarian.com 803 Roger Kermode 804 Motorola Australian Research Centre 805 Level 3, 12 Lord St, 806 Botany NSW 2019, 807 Australia. 808 ark008@email.mot.com 810 Lorenzo Vicisano 811 Cisco Systems, 812 170 West Tasman Dr. 813 San Jose, CA 95134, USA 814 lorenzo@cisco.com 816 Michael Luby 817 Digital Fountain, Inc. and ICSI 818 luby@dfountain.com, luby@icsi.berkeley.edu 820 10. References 822 [BC94] K. Birman, T. Clark. ``Performance of the Isis Distributed Com- 823 puting Toolkit.'' Technical Report TR-94-1432, Dept. of Computer 824 Science, Cornell University. 826 [BKKKLZ95] J. Bloemer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, D. 827 Zuckerman, ``An XOR-based Erasure Resilient Coding Scheme'', ICSI 828 Technical Report No. TR-95-048, August 1995. 830 [BLMR98] J. Byers, M. Luby, M. Mitzenmacher, A. Rege, ``A Digital Foun- 831 tain Approach to Reliable Distribution of Bulk Data'', Proc ACM 832 SIGCOMM 98. 834 [CDT99] B. Cain, S. Deering, A. Thyagarajan, ``Internet Group Management 835 Protocol, Version 3'', Internet Draft, Work-in-progress, Feb 1999. 837 [FLST98] D. Farinacci, S. Lin, T. Speakman, and A. Tweedly, ``PGM reli- 838 able transport protocol specification,'' Internet Draft, Internet 839 Engineering Task Force, Aug. 1998. Work in progress. 841 [FJM95] S. Floyd, V. Jacobson, S. McCanne, ``A Reliable Multicast Frame- 842 work for Light-weight Sessions and Application Level Framing'', 843 Proc ACM SIGCOMM 95, Aug 1995 pp. 342-356. 845 [Ha99] M. Handley, ``Multicast address allocation protocol (AAP),'' 846 Internet Draft, Internet Engineering Task Force, Jun 1999. Work in 847 progress. 849 [HC97] M. Handley and J. Crowcroft, ``Network text editor (NTE) a scal- 850 able shared text editor for MBone,'' ACM Computer Communication 851 Review, vol. 27, pp. 197-208, Oct. 1997. ACM SIGCOMM'97, Sept. 852 1997. 854 [KCW98] M. Kadansky, D. Chiu, and J. Wesley, ``Tree-based reliable mul- 855 ticast (TRAM),'' Internet Draft, Internet Engineering Task Force, 856 Nov. 1998. Work in progress. 858 [LMSSS97] M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, V. Ste- 859 mann, ``Practical Loss-Resilient Codes'', Proc ACM Symposium on 860 Theory of Computing, 1997. 862 [MWB+98] T. Montgomery, B. Whetten, M. Basavaiah, S. Paul, N. Rastogi, 863 J. Conlan, and T. Yeh, ``THE RMTP-II PROTOCOL,'' Internet Draft, 864 Internet Engineering Task Force, Apr. 1998. Work in progress 866 [PPV98] C. Papadopoulos, G. Parulkar, and G. Varghese, ``An error con- 867 trol scheme for large-scale multicast applications,'' in Proceed- 868 ings of the Conference on Computer Communications (IEEE Infocom), 869 (San Francisco, California), p. 1188, March/April 1998. 871 [Ri97] L. Rizzo, ``Effective erasure codes for reliable computer commu- 872 nication protocols,'' ACM Computer Communication Review, vol. 27, 873 pp. 24-36, Apr. 1997. 875 [RV97] L. Rizzo, L. Vicisano, ``A Reliable Multicast data Distribution 876 Protocol based on software FEC techniques'', Proc. of The Fourth 877 IEEE Workshop on the Architecture and Implementation of High Per- 878 formance Communication Systems (HPCS'97), Sani Beach, Chalkidiki, 879 Greece June 23-25, 1997. 881 [RVC98] L. Rizzo, L. Vicisano, J. Crowcroft, ``The RLC multicast conges- 882 tion control algorithm'', submitted to IEEE Network - special issue 883 multicast. 885 [RMWT98] K. Robertson, K. Miller, M. White, and A. Tweedly, ``StarBurst 886 multicast file transfer protocol (MFTP) specification,'' Internet 887 Draft, Internet Engineering Task Force, Apr. 1998. Work in 888 progress. 890 [WHA98] D. Wallner, E. Hardler, R. Agee, ``Key Management for Multi- 891 cast: Issues and Architectures'', Internet Draft, Work- 892 in-progress, Sept 1998. 894 [WKM94] Brian Whetten, Simon Kaplan, and Todd Montgomery, ``A high per- 895 formance totally ordered multicast protocol,'' research memorandum, 896 Aug. 1994. 898 [WGL97] C.K. Wong, M. Gouda, S. Lam, ``Secure Group Communications Using 899 Key Graphs,'' Technical Report TR 97-23, Department of Computer 900 Sciences, The University of Texas at Austin, July 1997. 902 11. Full Copyright Notice 904 Copyright (C) The Internet Society (2000). All Rights Reserved. 906 This document and translations of it may be copied and furnished to oth- 907 ers, and derivative works that comment on or otherwise explain it or 908 assist in its implementation may be prepared, copied, published and dis- 909 tributed, in whole or in part, without restriction of any kind, provided 910 that the above copyright notice and this paragraph are included on all 911 such copies and derivative works. However, this document itself may not 912 be modified in any way, such as by removing the copyright notice or ref- 913 erences to the Internet Society or other Internet organizations, except 914 as needed for the purpose of developing Internet standards in which case 915 the procedures for copyrights defined in the Internet languages other 916 than English. 918 The limited permissions granted above are perpetual and will not be 919 revoked by the Internet Society or its successors or assigns. 921 This document and the information contained herein is provided on an "AS 922 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK 923 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 924 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT 925 INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FIT- 926 NESS FOR A PARTICULAR PURPOSE."