idnits 2.17.1 

draft-ietf-rmt-design-space-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 20
     longer pages, the longest (page 2) being 60 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 21 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 37 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Mar 2000) is 8805 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'LVS99' is mentioned on line 679, but not defined

  == Unused Reference: 'KCW98' is defined on line 854, but no explicit
     reference was found in the text

  == Unused Reference: 'RV97' is defined on line 875, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'BC94'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'BKKKLZ95'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'BLMR98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'CDT99'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'FLST98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'FJM95'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Ha99'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HC97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'KCW98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'LMSSS97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PPV98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Ri97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'RV97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'RVC98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'RMWT98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'WHA98'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'WKM94'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'WGL97'


     Summary: 8 errors (**), 0 flaws (~~), 7 warnings (==), 20 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force 				     RMT WG
3	INTERNET-DRAFT						  M. Handley, ACIRI
4	draft-ietf-rmt-design-space-01			       B. Whetten, Talarian
5							       R. Kermode, Motorola
6								    S. Floyd, ACIRI
7								 L. Vicisano, Cisco
8							  M. Luby, Digital Fountain
9								      10th Mar 2000
10								  Expires: Sep 2000

12	       The Reliable Multicast Design Space for Bulk Data Transfer

14	Status of this Memo

16	This document is an Internet-Draft and is in full conformance with all
17	provisions of Section 10 of RFC2026.

19	Internet-Drafts are working documents of the Internet Engineering Task
20	Force (IETF), its areas, and its working groups.  Note that other groups
21	may also distribute working documents as Internet-Drafts.

23	Internet-Drafts are draft documents valid for a maximum of six months
24	and may be updated, replaced, or obsoleted by other documents at any
25	time.  It is inappropriate to use Internet- Drafts as reference material
26	or to cite them other than as "work in progress."

28	To view the list Internet-Draft Shadow Directories, see
29	http://www.ietf.org/shadow.html.

31	Copyright Notice

33	Copyright (C) The Internet Society (2000).  All Rights Reserved.

35	Abstract

37	The design space for reliable multicast is rich, with many possible
38	solutions having been devised.	However, application requirements serve
39	to constrain this design space to a relatively small solution space.
40	This document provides an overview of the design space and the ways in
41	which application constraints affect possible solutions.

43	1.  Introduction

45	The term ``general purpose reliable multicast protocol'' is something of
46	an oxymoron.  Different applications have different requirements of a
47	reliable multicast protocol, and these requirements constrain the design
48	space in ways that two applications with differing requirements often
49	cannot share a single solution.  There are however many successful reli-
50	able multicast protocol designs that serve more special purpose require-
51	ments well.

53	In this document we attempt to review the design space for reliable mul-
54	ticast protocols intended for bulk data transfer.  The term bulk data
55	transfer should be taken as having broad meaning - the main limitations
56	are that the data stream is continuous and long lived - constraints nec-
57	essary for the forms of congestion control we currently understand.  The
58	purpose of this review is to gather together an overview of the field
59	and to make explicit the constraints imposed by particular mechanisms.
60	The aim is to provide guidance to the standardization process for proto-
61	cols and protocol building blocks.  In doing this, we cluster potential
62	solutions into a number of loose categories - real protocols may be com-
63	posed of mechanisms from more than one of these clusters.

65	The main constraint on solutions is imposed by the need to scale to
66	large receiver sets.  For small receiver sets the design space is much
67	less restricted.

69	2.  Application Constraints

71	Application requirements for reliable multicast (RM) are as broad and
72	varied as the applications themselves.	However, there are a set of
73	requirements that significantly affect the design of an RM protocol.  A
74	brief list includes:

76	o    Does the application need to know that everyone received the data?

78	o    Does the application need to constrain differences between
79	     receivers?

81	o    Does the application need to scale to large numbers of receivers?

83	o    Does the application need to be totally reliable?

85	o    Does the application need ordered data?

87	o    Does the application need to provide low-delay delivery?
88	o    Does the application need to provide time-bounded delivery?

90	o    Does the application need many interacting senders?

92	o    Is the application data flow intermittent?

94	o    Does the application need to work in the public Internet?

96	o    Does the application need to work without a return path (e.g.
97	     satellite)?

99	o    Does the application need to provide secure delivery?

101	In the context of standardizing bulk data transfer protocols, we can
102	rule out applications with multiple interacting senders and intermittent
103	data flows.  It is not that these applications are unimportant, but that
104	we do not yet have effective congestion control for such applications.

106	2.1.  Did everyone receive the data?

108	In many applications a logically defined unit or units of data is to be
109	delivered to multiple clients, e.g., a file or a set of files, a soft-
110	ware package, a stock quote or package of stock quotes, an event notifi-
111	cation, a set of slides, a frame or block from a video.  An application
112	data unit (ADU) is defined to be a logically separable unit of data that
113	is useful to the application. In some cases, an application data unit
114	may be short enough to fit into a single packet (e.g., an event notifi-
115	cation or a stock quote), whereas in other cases an application data
116	unit may be much longer than a packet (e.g., a software package).

118	A protocol may optionally provide delivery confirmation to ensure reli-
119	able delivery, i.e., a mechanism for receivers to inform the sender when
120	data has been delivered.  There are two types of confirmation, at the
121	application data unit level and at the packet level. Application data
122	unit confirmation is useful at the application level, e.g., to inform
123	the application about receiver progress and to decide when to stop send-
124	ing packets about a particular application data unit.  Packet confirma-
125	tion is useful at the transport level, e.g., to inform the transport
126	level when it can release buffer space being used for storing packets
127	for which delivery has been confirmed.

129	Some applications have a strong requirement for confirmation that all
130	the receivers got an ADU, or if not, to be informed of which specific
131	receivers failed to receive the entire ADU. Examples include applica-
132	tions where receivers pay for data, and reliable file-system replica-
133	tion.  Other applications do not have such a requirement.  An example is
134	the distribution of free software.

136	If the application does need to know that every receiver got the ADU,
137	then a positive acknowledgment must be received from every receiver,
138	although it may be possible to aggregate these acknowledgments.  If the
139	application needs to know precisely which receivers failed to get the
140	ADU, additional constraints are placed on acknowledgment aggregation.

142	It should be noted that different mechanisms can be used for ADU-level
143	confirmation and packet-level confirmation in the same application.  For
144	example, an ADU-level confirmation mechanism using positive acknowledg-
145	ments may sit on top of a packet-level NACK or FEC-based transport.
146	Typically this only makes sense when ADUs are significantly larger than
147	a single packet.

149	2.2.  Constraining differences

151	Some applications need to constrain differences between receivers so
152	that the data reception characteristics for all receivers falls within
153	some range.  An example is a stock price feed, where it is unacceptable
154	for a receiver to suffer delivery that is delayed significantly more
155	than any other receiver.

157	This requirement is difficult to satisfy without harming performance.
158	Typically solutions involve not sending more than a limited amount of
159	new data until positive acknowledgments have been received from all the
160	receivers.  Such a solution does not cope with network and end-system
161	failures well.

163	2.3.  Receiver Set Scaling

165	There are many applications for RM that do not need to scale to large
166	numbers of receivers.  For such applications, a range of solutions may
167	be available that are not available for applications where scaling to
168	large receiver sets is a requirement.

170	A protocol must achieve good throughput of application data units to
171	receivers.  This means that most data that is delivered to receivers is
172	useful in recovering the application data unit that they are trying to
173	receive. A protocol must also provide good congestion control to fairly
174	share the available network resources between all applications.
175	Receiver set scaling is one of the most important constraints in meeting
176	these requirements, because it strictly limits the mechanisms that can
177	be used to achieve these requirements to those that will efficiently
178	scale to a large receiver population.  Acknowledgement packets have been
179	employed by many systems to achieve these goals, but it is important to
180	understand the strength and limitations of different ways of using such
181	packets.

183	In a very small system, it may be acceptable to have the receivers
184	acknowledge every packet.  This approach provides the sender with the
185	maximum amount of information about reception conditions at all the
186	receivers, information that can be used both to achieve good throughput
187	and to achieve congestion control.

189	For larger systems, such ``flat ACK'' schemes cause acknowledge implo-
190	sions at the sender.  Attempts have been made to reduce this problem by
191	sending aggregate ACKs infrequently [RMWT98, BC94], but it is very dif-
192	ficult to incorporate effective congestion control into such protocols
193	because of the spareceness of feedback.

195	Using negative acknowledgments (NACKs) instead of ACKs reduces this
196	problem to one of NACK implosion (only from the receivers missing the
197	packets), and because the sender really only needs to know that at least
198	one receiver is missing data in order to achieve good throughput, vari-
199	ous NACK suppression mechanisms can be applied.

201	An alternative to NACKs is ACK aggregation, which can be done by arrang-
202	ing the receivers into a logical tree, so that each leaf sends ACKs to
203	its parent which aggregates them, and passes them on up the tree.  Tree-
204	based protocols scale well, but tree formation can be problematic.

206	Other ACK topologies such as rings are also possible, but are often more
207	difficult to form and maintain than trees are.	An alternative strategy
208	is to add mechanisms to routers so that they can help out in achieving
209	good throughput or in reducing the cost of achieving good throughput.

211	All these solutions improve receiver set scaling, but they all have lim-
212	its of one form or another.  One class of solutions scales to an infi-
213	nite number of receivers by having no feedback channel whatsoever in
214	order to achieve good throughput.  These open-loop solutions take the
215	initial data and encode it using an FEC-style mechanism.  This encoded
216	data is transmitted in a continuous stream.  Receivers then join the
217	session and receive packets until they have sufficient packets to decode
218	the original data, at which point they leave the session.

220	Thus, it is clear that the intended scale of the session constrains the
221	possible solutions.  All solutions will work for very small sessions,
222	but as the intended receive set increases, the range of possible solu-
223	tions that can be deployed safely decreases.

225	It should also be noted that hybrids of these mechanisms are possible,
226	and that using one mechanism at the packet-level and a different (typi-
227	cally higher overhead) solution at the ADU level may also scale reason-
228	ably if the ADUs are large compared to packets.

230	2.4.  Total vs Semi-reliable

232	Many applications require delivery of application data units to be
233	totally reliable; if any of the application data unit is missing, none
234	of the received portion of the application data unit is useful.  File
235	transfer applications are a good example of applications requiring total
236	reliability.

238	However, some applications do not need total reliability.  An example is
239	audio broadcasting, where missing packets reduce the quality of the
240	received audio but do not render it unusable.  Such applications can
241	sometimes get by without any additional reliability over native IP reli-
242	ability, but often having a semi-reliable multicast protocol is desir-
243	able.

245	2.5.  Time-bounded Delivery

247	Many applications just require data to be delivered to the receivers as
248	fast as possible.  They have no absolute deadline for delivery.

250	However, some applications have hard delivery constraints - if the data
251	does not arrive at the receiver by a certain time, there is no point in
252	delivering it at all.  Such time-boundedness may be as a result of real-
253	time constraints such as with audio or video streaming, or as the result
254	of new data superseding old data.  In both cases, the requirement is for
255	the application to have a greater degree of control over precisely what
256	the application sends at which time than might be required with applica-
257	tions such as file transfer.

259	Time-bounded delivery usually also implies a semi-reliable protocol, but
260	the converse does not necessarily hold.

262	3.  Network Constraints

264	The properties of the network in which the application is being deployed
265	may themselves constrain the reliable multicast design space.

267	3.1.  Internet vs Intranet

269	In principle the Internet and intranets are the same.  In practice how-
270	ever, the fact that an intranet is under one administration might allow
271	for solutions to be configured that can not easily be done in the public
272	Internet.  Thus, if the data is of very high value, it might be appro-
273	priate to enhance the routers to provide assistance to a reliable multi-
274	cast transport protocol.  In the public Internet, it is less likely that
275	the additional expense required to support this state in the routers
276	would be acceptable.

278	3.2.  Return Path

280	In principle, when feedback is required from receivers, this feedback
281	can be multicast or unicast.  Multicast feedback has advantages, espe-
282	cially in NACK-based protocols where it is valuable for NACK suppres-
283	sion.  However, it is not clear at this time whether all ISPs will allow
284	all members of a session to send to that session.  If multicast feedback
285	is not allowed, then unicast feedback can almost always be substituted,
286	although often at the expense of additional messages and mechanisms.

288	Some networks may not allow any form of feedback however.  The primary
289	example of this occurs with satellite broadcasts where the back channel
290	may be very narrow or even non-existent.  For such networks the solution
291	space is very constrained - only FEC-based encodings have any real
292	chance of working.  If the receivers are direct satellite receivers,
293	then no congestion control is needed, but it is dangerous to make such
294	assumptions because it is possible for a satellite hop to feed down-
295	stream networks.  Thus, congestion control still needs to be considered
296	with solutions that do not have a return path.

298	3.3.  Network Assistance

300	A reliable multicast protocol must involve mechanisms running in end
301	hosts, and must involve routers forwarding multicast packets.  However
302	under some circumstances, it is possible to rely on some additional
303	degree of assistance from network elements.  Broadly speaking we can
304	cluster RM protocols into four classes depending on the degree of sup-
305	port received from other network elements.

307	No Additional Support
308	    The routers merely forward packets, and only the sender and
309	    receivers have any reliable multicast protocol state.

311	Layered Approaches
312	    Data is split across multiple multicast groups.  Receivers join
313	    appropriate groups to receive only the traffic they require.  This
314	    may in some cases require fast join or leave functionality from the
315	    routers, and may require more forwarding state in the routers.

317	Server-based Approaches
318	    Additional nodes are used to assist with data delivery or feedback
319	    aggregation.  These additional nodes might not be normal senders or
320	    receivers, and may be present on the distribution or feedback tree
321	    only to provide assistance to the reliable multicast protocol.  They
322	    would not otherwise receive the multicast traffic.

324	Router-based Approaches
325	    With router-based approaches, routers on the normal data distribu-
326	    tion tree from the sender to the receivers assist in the delivery of
327	    data or feedback aggregation or suppression.  As routers can
328	    directly influence multicast routing, they have more control over
329	    which traffic goes to which group members than server-based
330	    approaches.  However routers do not normally have a large amount of
331	    spare memory or processing power, which restricts how much function-
332	    ality can be placed in the routers.  In addition, router code is
333	    normally more difficult to upgrade than application code, so router-
334	    based approaches need to be very general as they are more difficult
335	    to deploy and to change.

337	4.  Good Throughput Mechanisms

339	Two main concerns that a RM protocol must address are congestion control
340	and good throughput.  Packet loss plays a major role with respect to
341	both concerns.	The primary symptom of congestion in many networks is
342	packet loss. The primary obstacle that must be overcome to achieve good
343	throughput is packet loss.  Thus, measuring and reacting to packet loss
344	is crucial to address both concerns. RM solutions that address these
345	concerns can be roughly categorized as using one or more of the follow-
346	ing techniques:

348	o   Data packet acknowledgment.

350	o   Negative acknowledgment of missing data packets.

352	o   Redundancy allowing not all packets to be received.

354	These techniques themselves can be usefully subdivided, so that we can
355	examine the parts of the requirement space in which each mechanism can
356	be deployed.  In this section, we focus on using these mechanisms for
357	achieving good throughput, and in the next section we focus on using
358	these mechanisms for congestion control.

360	4.1.  ACK-based Mechanisms

362	The simplest ACK-based mechanism involves every receiver sending an ACK
363	packet for every data packet it receives and resending packets that are
364	lost by any receiver.  Such mechanisms are limited to very small
365	receiver groups by the implosion of ACKs received at the sender, and for
366	this reason they are impractical for most applications.

368	Putting multiple ACKs into a single data packet [RMWT98] reduces the
369	implosion problem by a constant amount, allowing slightly larger
370	receiver groups.  However a limit is soon reached whereby feedback to
371	the sender is too infrequent for sender-based congestion control mecha-
372	nisms to work reliably.

374	Arranging the receivers into a ring [WKM94] whereby an ``ACK-token'' is
375	passed around the ring prevents the implosion problem for data.  However
376	ring creation and maintenance may itself be problematic.  Also if ring
377	creation does not take into account network topology (something which is
378	difficult to achieve in practice), then the number of ACK packets cross-
379	ing the network backbone for each data packet sent may increase O(n)
380	with the number of receivers.

382	4.1.1.	Tree-based ACK Mechanisms

384	Arranging the receivers into a tree [MWB+98, KCW98] whereby receivers
385	generate ACKs to a parent node, which aggregates those ACKs to its par-
386	ent in turn, is both more robust and more easily configured than a ring.
387	The ACK-tree is typically only used for ACK-aggregation - data packets
388	are multicast from the sender to the receivers as normal.  Trees are
389	easier to construct than rings because more local information can be
390	used in their construction.  Also they can be more fault tolerant than
391	rings because node failures only affect a subset of receivers, each of
392	which can easily and locally decide to by-pass its parent and report
393	directly to the node one level higher in the tree.  With good ACK-tree
394	formation, tree-based ACK mechanisms have the potential to be one of the
395	most scalable RM solutions.

397	To be simple to deploy, tree-based protocols must be self-organizing -
398	the receivers must form the tree themselves using local information in a
399	scalable manner.  Such mechanisms are possible, but are not trivial.
400	The main scaling limitations of tree-based protocols therefore come from
401	the tree formation and maintenance mechanisms rather than from the use
402	of ACKs.  Without such a scalable and automatic tree-formation mecha-
403	nism, tree-based protocols must rely on manual configuration, which sig-
404	nificantly limits their applicability (often to intranets) and (due to
405	the complexity of configuration) their scalability.

407	Orthogonal to the issue of tree formation is the issue of subtree
408	retransmission.  With appropriate router mechanisms, or the use of mul-
409	tiple multicast groups, it is possible to allow the intermediate tree
410	nodes to retransmit missing data to the nodes below them in the tree
411	rather than relying on the original sender to retransmit the data.  This
412	relies on there being a good correlation at the point of the intermedi-
413	ate node between the ACK tree and the actual data tree, as well as there
414	being a mechanism to constrain the retransmission to the subtree.  A
415	good automatic tree formation mechanism combined with the use of admin-
416	istrative scoped multicast groups might provide such a solution. Without
417	such tree formation mechanisms, subtree retransmission is difficult to
418	deploy in large groups in the public internet.	This could also be
419	solved by the use of transport-level router mechanisms to assist or per-
420	form retransmission, although existing router mechanisms [FLST98] sup-
421	port NACK-based rather than ACK-based protocols.

423	Another important issue is the nature of the aggregation performed at
424	interior nodes on the ACK-tree.  Such nodes could:

426	1.  aggregate ACKs by sending a single ACK when all their children have
427	    ACKed,

429	2.  aggregate ACKs by listing all the children that have ACKed,

431	3.  send an aggregated ACK with a NACK-like exception list.

433	For data packets, 1. is clearly more scalable, and should be preferred.
434	However if the sender needs to know exactly which receivers received the
435	data, 2. and 3. provide this information.  Fortunately, there is usually
436	no need to do this on a per-packet basis, but rather on a per-ADU basis.
437	Doing 1. on a per packet basis, and 3. on a per ADU basis is the most
438	scalable solution for applications that need this information, and suf-
439	fers virtually no disadvantage compared to the other solutions used on a
440	per-packet basis.

442	4.2.  NACK-based mechanisms

444	Instead of sending an ACK for every data packet received, receivers can
445	send a negative acknowledgment (NACK) for every data packet they dis-
446	cover they did not receive.  This has a number of advantages over ACK-
447	based mechanisms:

449	o   The sender no longer needs to know exactly how many receivers there
450	    are.  This removes the topology-building phase needed for ring- or
451	    tree-style ACK-based algorithms.

453	o   Fault-tolerance is made somewhat simpler by making receivers respon-
454	    sible for reliability.

456	o   Sender state can be significantly reduced because the sender does
457	    not need to keep track of the receivers state.

459	o   Only a single NACK is needed from any receiver to indicate a packet
460	    that is missing by any number of receivers.  Thus NACK suppression
461	    is possible.

463	The disadvantages are that it is more difficult for the sender to know
464	that it can free transmission buffers, and that additional session level
465	mechanisms are needed if the sender really needs to know if a particular
466	receiver actually received all the data.  However for many applications,
467	neither of these is an issue.

469	4.2.1.	NACK Suppression

471	The key differences between NACK-based protocols is in how NACK-suppres-
472	sion is performed.  The goal is for only one NACK to reach the sender
473	(or a node that can resend the missing data) as soon as possible after
474	the loss is first noticed, and for only one copy of the missing data to
475	be received by those nodes needing retransmission.

477	Different mechanisms come close to satisfying these goals in different
478	ways.

480	o   SRM [FJM95] uses random timers weighted by the round trip time
481	    between the sender and each node missing the data.	This is effec-
482	    tive, but requires computing the RTT to each receiver before sup-
483	    pression works properly.

485	o   NTE [HC97] uses a sender-triggered mechanism based on random keys
486	    and sliding masks.	This does not require random timers, and works
487	    for very large sessions, but makes it difficult to provide the con-
488	    stant low-level stream of feedback needed to perform congestion con-
489	    trol.

491	o   AAP [Ha99] uses exponentially distributed random timers and is
492	    effective for large sessions without needing to compute the RTT to
493	    each receiver.

495	o   PGM [FLST98] and LMS [PPV98] use additional mechanisms in routers to
496	    suppress duplicate NACKs.  In the case of PGM, router assistance
497	    suppliments SRM-stype random timers and localizes the suppression so
498	    that the whole group does not need suppressing.

500	The most general of these mechanisms is probably exponentially weighted
501	random timers.	Although SRM style timers can reduce feedback delay,
502	they are harder to use correctly in situations where all the RTTs are
503	not known, or where the number of respondees is unknown.  In contrast,
504	exponentially weighted random timers work well across a large range of
505	session sizes with good worst case delay characteristics.

507	Either form of random timer based mechanism can be supplemented by
508	router-support where it is available.  Sender triggered NACK mechanisms
509	(e.g. [HC97]) are more difficult to integrate with router-based support
510	mechanisms.

512	4.3.  Replication

514	Some RM protocols can be designed so as to not need explicit reliability
515	mechanisms except in comparatively rare cases.	An example is in a mul-
516	ticast game, where the position of a moving object is continuously mul-
517	ticast.  This positional stream does not require additional reliability
518	because a new position superseding the old one will be sent before any
519	retransmission could take place.  However, when the moving object inter-
520	acts with other objects or stops moving, then an explicit reliability
521	mechanism is required to reliably send the interaction information or
522	last position.

524	It is not just games that can be built in this manner - the NTE shared
525	text editor[HC97] uses just such a mechanism with changes to a line of
526	text.  For every change the whole line is sent, and so long as the user
527	keeps typing no explicit reliability mechanism is needed.  The major
528	advantage of replication is that it is not susceptible to spatially
529	uncorrelated packet loss.  With a traditional ACK or NACK based proto-
530	col, the probability of any particular packet being received by all the
531	receivers in a large group can be very low.  This leads to high retrans-
532	mission rates.	In contrast, replicated streams do not suffer as the
533	size of the receiver group increases - different receivers lose differ-
534	ent packets, but this does not increase network traffic.

536	4.4.  Packet-level Forward Error Correction

538	Forward Error Correction (FEC) is a well known technique for protecting
539	data against corruption.  For reliable multicast it is most useful in
540	the form of erasure codes.

542	The simplest form of packet-level FEC is to take a group of packets that
543	is to be sent, and to XOR the packets together to form a newpacket which
544	is also sent.  If there were three original packets plus the XOR packet
545	sent, then if a receiver is missing any one of the original data pack-
546	ets, but receives the XOR packet, then it can reproduce the missing
547	original packet.

549	More general erasure codes exist [BKKKLZ95], [Ri97], [LMSSS97] that
550	allow the generation of n encoding packets from k original data packets.
551	In such cases, so long as at least k of the n encoding packets are
552	received, then the k original data packets can be reproduced.

554	To apply FEC the sender groups data packets into rounds, and encoding
555	packets are produced based on all the data packets in a round. A round
556	may consist of all data packets in an entire application data unit in
557	some cases, whereas in other cases it may consist of a group of data
558	packets that make up only a small portion of an application data unit.

560	Using erasure codes to repair packet loss is a significant improvement
561	over simple retransmission because the dependency on which packets have
562	been lost is removed.  Thus, the amount of repair traffic required to
563	repair spatially uncorrelated packet loss is considerably lessened.

565	We can divide packet-level FEC schemes into two categories: pro-active
566	FEC and reactive FEC.  The difference between the two is that for pro-
567	active FEC the sender decides a priori how many encoding packets to send
568	for each round of data packets, whereas for reactive FEC the sender ini-
569	tially transmits only the original data packets for each round.  Then,
570	the sender uses feedback from the receivers to compute how many packets
571	were lost by the receiver that experienced the most loss in each round,
572	and then only that number of additional encoding packets are sent for
573	that round.  These encoding packets will then also serve to repair loss
574	at the other receivers that are missing fewer packets.The receivers
575	report via ACKs or NACKs how many packets are missing from each round.
576	With NACKs, only the receiver missing the most packets need send a NACK
577	for this round, so this is used to weight the random timers in the NACK
578	calculation.

580	Proactive and reactive FEC can be combined, e.g., a certain amount of
581	proactive FEC can be sent for each round and if there are receivers that
582	experience more loss than can be overcome by this for some rounds then
583	they can request and receive additional encoding packets for these
584	rounds.

586	FEC is very effective at reducing the repair traffic for packet loss.
587	However, it requires that the data to be sent to be grouped into rounds,
588	which can add to end-to-end latency.  For bulk-data applications this is
589	typically not a problem, but this may be an issue for interactive appli-
590	cations where replication may be a better solution.

592	4.5.  Layered FEC

594	An alternative use of packet level FEC is possible when data is spread
595	across several multicast groups [RVC98], [BLMR98].  In such cases, the
596	original k data packets are used to generate n encoding packets, where n
597	is much larger than k.	The n encoded packets are then striped across
598	multiple multicast groups.  When a receiver wishes to receive the origi-
599	nal data it joins one or more of the multicast groups, and receives the
600	encoding packets.  Once it has received k different encoding packets,
601	the receiver can then leave all the multicast groups and reconstruct the
602	original data.

604	The primary importance of such a layering is that it allows different
605	receivers to be able to receive the traffic at different rates according
606	to the available capacity.  Such schemes do not require any form of
607	feedback from the receivers to the sender to ensure good throughput, and
608	therefore the need for good throughput does not constrain the size of
609	the receiver set.  However, to perform adequate network congestion con-
610	trol using receiver joins and leaves in this manner may require coordi-
611	nation between members that are behind the same congested link from the
612	sender.  As described in the next section, [RVC98] suggests such a lay-
613	ered congestion control scheme.

615	5.  Congestion Control Mechanisms

617	The basic delivery model of the Internet is best-effort service.  No
618	guarantees are given as to throughput, delay or packet loss.  End-sys-
619	tems are expected to be adaptive, and to reduce their transmission rate
620	to a level appropriate for the congestion state of the network.
621	Although increasingly the Internet will start to support reserved band-
622	width and differentiated service classes for specialist applications,
623	unless an end-system knows explicitly that it has reserved bandwidth, it
624	must still perform congestion control.

626	Broadly speaking, there are five classes of single-sender multicast con-
627	gestion control solution:

629	o   Sender-controlled, one group.

631	    A single multicast group is used for data distribution.  Feedback
632	    from the group members is used to control the rate of this group.
633	    The goal is to transmit at a rate dictated by the slowest receiver.

635	o   Sender-controlled, multiple groups.

637	    One initial multicast group is adaptively subdivided into multiple
638	    subgroups with subdivisions centered on congestion points in the
639	    network.  Application-level relays buffer data from a group nearer
640	    the original sender, and retransmit it at a slower rate into a group
641	    further from the original sender.  In this way, different receivers
642	    can receiver the data at different rates.  Sender-based congestion
643	    control takes place between the members of a subgroup and their
644	    relay.

646	o   Receiver-controlled, one group.

648	    A single multicast group is used for data distribution.  The
649	    receivers determine if the sender is transmitting too rapidly for
650	    the current congestion state of the network, and they leave the
651	    group if this is the case.

653	o   Receiver-controlled, layered organization.

655	    A layered approach for how to combine this scheme with a congestion
656	    control protocol that requires no receiver feedback is described in
657	    [RVC98].  The sender stripes data across multiple multicast groups
658	    simultaneously.  Receivers join and leave these layered groups
659	    depending on their measurements of the congestion state of the net-
660	    work, so that the amount of data being received is always appropri-
661	    ate. However, this scheme relies on receivers to join and leave the
662	    different multicast groups in a coordinated fashion behind a bottle-
663	    neck link, and it has not yet been completely confirmed that this
664	    approach will scale in practice to the Internet.  As a result, more
665	    work on this congestion control mechanism would be beneficial.

667	o   Router-based congestion control.

669	    It is possible to add additional mechanisms to multicast routers to
670	    assist in multicast congestion control.  Such mechanisms could
671	    include:

673	    o	Conditional joins (a multicast join that specifies a loss rate
674		above which it is acceptable for the router to reject the join).

676	    o	Router filtering of traffic that exceeds a reasonable rate.
677		This may include mechanisms for filtering traffic at different
678		points in the network at different rates depending on local con-
679		gestion conditions [LVS99].

681	    o	Fair queuing schemes combined with end-to-end adaptation.

683	    Router-based schemes generally require more state in network routers
684	    than has traditionally been acceptable for backbone routers.  Thus,
685	    in the near-term, such schemes are only likely to be applicable for
686	    intranet solutions.

688	For reliable multicast protocols, it is important to consider congestion
689	control at the same time as reliability is being considered.  The same
690	mechanisms that are used to provide reliability will sometimes be used
691	to provide congestion control.

693	In the case of receiver-based congestion control, open-loop delivery
694	using FEC is the likely choice for achieving good throughput for bulk-
695	data transfer.	This is because open-loop delivery requires no feedback
696	from receivers, and thus it is a perfect match with a receiver-based
697	congestion-control mechanism that operates without feedback from
698	receivers.

700	6.  Security Considerations

702	Generally speaking, security considerations have relatively little
703	effect on constraining the design space for reliable multicast proto-
704	cols.  The primary issues constraining the design space are all related
705	to receiver-set scaling.  For authentication of the source and of data
706	integrity, receiver-set scaling is not a significant issue.  However,
707	for data encryption, key distribution and particularly re-keying may be
708	significantly affected by receiver-set scaling.  Tree and graph based
709	re-keying solutions[WHA98,WGL97] would appear to be appropriate solu-
710	tions to these problems.  It is not clear however that such re-keying
711	solutions need to directly affect the design of the data distribution
712	part of a reliable multicast protocol.

714	The primary question to consider for the security of reliable multicast
715	protocols is the role of third-parties.  If nodes other than the origi-
716	nal source of the data are allowed to send or resend data packets, then
717	the security model for the protocol must take this into account.  In
718	particular, it must be clear whether such third parties are trusted or
719	untrusted.  A requirement for trusted third parties can make protocols
720	difficult to deploy on the Internet.

722	Untrusted third parties (such as receivers that retransmit the data) may
723	be used so long as the data authentication mechanisms take this into
724	account.  Typically this means that the original sender digitally signs
725	and timestamps the data, and that the third parties resend this signed
726	timestamped payload unmodified.

728	Unlike unicast protocols, denial-of-service attacks on multicast trans-
729	port state are easy if the protocol design does not take such attacks
730	into account.  This is because any receiver can join the session, and
731	can then produce feedback that influences the progress of a session
732	involving many other receivers.  Hence protection against denial-of-ser-
733	vice attacks on reliable multicast protocols must be carefully consid-
734	ered.  A receiver that requests retransmission of every packet, or that
735	refuses to acknowledge packets in an ACK-based protocol can potentially
736	bring a reliable multicast session to a standstill.  Senders must have
737	appropriate policy to deal with such conditions, and if necessary, evict
738	the receiver from the group.  A single receiver masquerading as a large
739	number of receivers may still be an issue under such circumstances with
740	protocols that support NACK-like functionality.  Providing unique
741	``keys'' to each NACKer when they first NACK using a unicast response
742	might potentially prevent such attacks.

744	Denial-of-service attacks caused by traffic flooding are however some-
745	what easier to protect against than with unicast.  Unwanted senders can
746	simply be pruned from the distribution tree using the mechanisms imple-
747	mented in IGMP v3[CDT99].

749	7.  Conclusions

751	In this document we present an overview of the design space for reliable
752	multicast within the context of one-to-many bulk-data transfer. Other
753	flavors of multicast application are not considered in this document,
754	and hence the overview given should not be considered inclusive of the
755	design space for protocols that fall outside the context of one-to-many
756	bulk-data transfer. During the course of this overview, we have reaf-
757	firmed the notion that the process of reliable multicast protocol design
758	is affected by a number of factors that render the generation of a "one
759	size fits all solution" moot. These factors are then described to show
760	how an application's needs serve to constrain the set of available tech-
761	niques that may be used to create a reliable multicast protocol. We
762	examined a number of basic techniques and to show how well they can meet
763	the needs of certain types of applications.

765	This document is intended to provide guidance to the IETF community
766	regarding the standardization of reliable multicast protocols for bulk-
767	data transfer.	Given the degree to which application requirements con-
768	strain reliable multicast solutions, and the diverse set of applications
769	that need to be supported, it should be clear that any standardization
770	work should take great pains to be future-proof.  This would seem to
771	imply not standardizing complete reliable multicast transport protocols
772	in one pass, but rather examining the degree to which such protocols are
773	separable into functional building blocks, and standardizing these
774	blocks separately to the maxmimum degree that makes sense.  Such an
775	approach allows for protocol evolution, and allows applications with new
776	constraints to be supported with maximal reuse of existing and tested
777	mechanisms.

779	8.  Acknowledgments

781	This document represents an overview of the reliable multicast design
782	space.	The ideas presented are not those of the authors, but are col-
783	lected from the varied presentations and discussions in the IRTF Reli-
784	able Multicast Research Group.	Although they are too numerous to list
785	here, we thank everyone who has participated in these discussions for
786	their contributions.

788	9.  Author's Addresses

790	Mark Handley, Sally Floyd
791	ATT Center for Internet Research at ICSI,
792	International Computer Science Institute,
793	1947 Center Street, Suite 600,
794	Berkeley, CA 94704, USA
795	mjh@aciri.org, floyd@aciri.org

797	Brian Whetten
798	Talarian Corporation,
799	333 Distel Circle,
800	Los Altos, CA 94022, USA
801	whetten@talarian.com

803	Roger Kermode
804	Motorola Australian Research Centre
805	Level 3, 12 Lord St,
806	Botany	NSW  2019,
807	Australia.
808	ark008@email.mot.com

810	Lorenzo Vicisano
811	Cisco Systems,
812	170 West Tasman Dr.
813	San Jose, CA 95134, USA
814	lorenzo@cisco.com

816	Michael Luby
817	Digital Fountain, Inc. and ICSI
818	luby@dfountain.com, luby@icsi.berkeley.edu

820	10.  References

822	[BC94] K. Birman, T. Clark.  ``Performance of the Isis Distributed Com-
823	     puting Toolkit.''	Technical Report TR-94-1432, Dept. of Computer
824	     Science, Cornell University.

826	[BKKKLZ95] J. Bloemer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, D.
827	     Zuckerman, ``An XOR-based Erasure Resilient Coding Scheme'', ICSI
828	     Technical Report No. TR-95-048, August 1995.

830	[BLMR98] J. Byers, M. Luby, M. Mitzenmacher, A. Rege, ``A Digital Foun-
831	     tain Approach to Reliable Distribution of Bulk Data'', Proc ACM
832	     SIGCOMM 98.

834	[CDT99] B. Cain, S. Deering, A. Thyagarajan, ``Internet Group Management
835	     Protocol, Version 3'', Internet Draft, Work-in-progress, Feb 1999.

837	[FLST98] D. Farinacci, S. Lin, T. Speakman, and A. Tweedly, ``PGM reli-
838	     able transport protocol specification,'' Internet Draft, Internet
839	     Engineering Task Force, Aug. 1998. Work in progress.

841	[FJM95] S. Floyd, V. Jacobson, S. McCanne, ``A Reliable Multicast Frame-
842	     work for Light-weight Sessions and Application Level Framing'',
843	     Proc ACM SIGCOMM 95, Aug 1995 pp. 342-356.

845	[Ha99] M. Handley, ``Multicast address allocation protocol (AAP),''
846	     Internet Draft, Internet Engineering Task Force, Jun 1999. Work in
847	     progress.

849	[HC97] M. Handley and J. Crowcroft, ``Network text editor (NTE) a scal-
850	     able shared text editor for MBone,'' ACM Computer Communication
851	     Review, vol. 27, pp. 197-208, Oct. 1997. ACM SIGCOMM'97, Sept.
852	     1997.

854	[KCW98] M. Kadansky, D. Chiu, and J. Wesley, ``Tree-based reliable mul-
855	     ticast (TRAM),'' Internet Draft, Internet Engineering Task Force,
856	     Nov. 1998. Work in progress.

858	[LMSSS97] M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, V. Ste-
859	     mann, ``Practical Loss-Resilient Codes'', Proc ACM Symposium on
860	     Theory of Computing, 1997.

862	[MWB+98] T. Montgomery, B. Whetten, M. Basavaiah, S. Paul, N. Rastogi,
863	     J. Conlan, and T. Yeh, ``THE RMTP-II PROTOCOL,'' Internet Draft,
864	     Internet Engineering Task Force, Apr. 1998. Work in progress

866	[PPV98] C. Papadopoulos, G. Parulkar, and G. Varghese, ``An error con-
867	     trol scheme for large-scale multicast applications,'' in Proceed-
868	     ings of the Conference on Computer Communications (IEEE Infocom),
869	     (San Francisco, California), p. 1188, March/April 1998.

871	[Ri97] L. Rizzo, ``Effective erasure codes for reliable computer commu-
872	     nication protocols,'' ACM Computer Communication Review, vol.  27,
873	     pp. 24-36, Apr. 1997.

875	[RV97] L. Rizzo, L. Vicisano, ``A Reliable Multicast data Distribution
876	     Protocol based on software FEC techniques'', Proc. of The Fourth
877	     IEEE Workshop on the Architecture and Implementation of High Per-
878	     formance Communication Systems (HPCS'97), Sani Beach, Chalkidiki,
879	     Greece June 23-25, 1997.

881	[RVC98] L. Rizzo, L. Vicisano, J. Crowcroft, ``The RLC multicast conges-
882	     tion control algorithm'', submitted to IEEE Network - special issue
883	     multicast.

885	[RMWT98] K. Robertson, K. Miller, M. White, and A. Tweedly, ``StarBurst
886	     multicast file transfer protocol (MFTP) specification,'' Internet
887	     Draft, Internet Engineering Task Force, Apr. 1998. Work in
888	     progress.

890	[WHA98] D. Wallner, E. Hardler, R. Agee, ``Key Management for	Multi-
891	     cast:  Issues     and  Architectures'',  Internet	  Draft, Work-
892	     in-progress, Sept 1998.

894	[WKM94] Brian Whetten, Simon Kaplan, and Todd Montgomery, ``A high per-
895	     formance totally ordered multicast protocol,'' research memorandum,
896	     Aug. 1994.

898	[WGL97] C.K. Wong, M. Gouda, S. Lam, ``Secure Group Communications Using
899	     Key Graphs,'' Technical Report TR 97-23, Department of Computer
900	     Sciences, The University of Texas at Austin, July 1997.

902	11.  Full Copyright Notice

904	Copyright (C) The Internet Society (2000).  All Rights Reserved.

906	This document and translations of it may be copied and furnished to oth-
907	ers, and derivative works that comment on or otherwise explain it or
908	assist in its implementation may be prepared, copied, published and dis-
909	tributed, in whole or in part, without restriction of any kind, provided
910	that the above copyright notice and this paragraph are included on all
911	such copies and derivative works. However, this document itself may not
912	be modified in any way, such as by removing the copyright notice or ref-
913	erences to the Internet Society or other Internet organizations, except
914	as needed for the purpose of developing Internet standards in which case
915	the procedures for copyrights defined in the Internet languages other
916	than English.

918	The limited permissions granted above are perpetual and will not be
919	revoked by the Internet Society or its successors or assigns.

921	This document and the information contained herein is provided on an "AS
922	IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
923	FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
924	LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT
925	INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FIT-
926	NESS FOR A PARTICULAR PURPOSE."