idnits 2.17.1 

draft-lei-samrg-dmmp-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 1288.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1299.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1306.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1312.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 6 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  == Line 1212 has weird spacing: '...cs, and  relat...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 22, 2008) is 5901 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: '21' is defined on line 1229, but no explicit
     reference was found in the text

  == Unused Reference: '22' is defined on line 1233, but no explicit
     reference was found in the text

  == Unused Reference: '23' is defined on line 1236, but no explicit
     reference was found in the text

  == Unused Reference: '24' is defined on line 1239, but no explicit
     reference was found in the text

  == Unused Reference: '25' is defined on line 1242, but no explicit
     reference was found in the text

  ** Downref: Normative reference to an Informational RFC: RFC 3569 (ref. '1')


     Summary: 3 errors (**), 0 flaws (~~), 9 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                             J. Lei
3	Internet-Draft                                                     X. Fu
4	Expires: August 25, 2008                                      D. Hogrefe
5	                                                        Univ. Goettingen
6	                                                       February 22, 2008

8	          DMMP: Dynamic Mesh-based Overlay Multicast Protocol
9	                      draft-lei-samrg-dmmp-03.txt

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on August 25, 2008.

36	Copyright Notice

38	   Copyright (C) The IETF Trust (2008).

40	Abstract

42	   This document describes a Dynamic Mesh-based overlay Multicast
43	   Protocol (DMMP) framework to support multicast data delivery
44	   applications without relying on classic IP multicast, including
45	   multicast group management, overlay hierarchy establishment,
46	   multicast tree construction and data forwarding scheme from a source
47	   to a number of receivers.  The DMMP framework builds on control plane
48	   functions which dynamically manage an overlay core and a multicast
49	   tree.  The key idea is a number of end hosts self-organize into an
50	   overlay mesh, and dynamically maintain such a mesh.  Based on the
51	   constructed mesh, some core-based clusters are formed with capacity-
52	   aware trees inside.  Then, a multicast tree consisting of DMMP-aware
53	   end hosts (and/or specific routers) is built on the top of the
54	   overlay core for the efficient delivery of the multicast data.

56	Table of Contents

58	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
59	   2.  Features of DMMP . . . . . . . . . . . . . . . . . . . . . . .  4
60	   3.  Terminology and Abbreviations  . . . . . . . . . . . . . . . .  6
61	   4.  DMMP Overview  . . . . . . . . . . . . . . . . . . . . . . . .  8
62	     4.1.  Control plane in DMMP  . . . . . . . . . . . . . . . . . .  8
63	     4.2.  Data plane in DMMP . . . . . . . . . . . . . . . . . . . . 11
64	   5.  DMMP Messages  . . . . . . . . . . . . . . . . . . . . . . . . 13
65	   6.  DMMP: Protocol Details . . . . . . . . . . . . . . . . . . . . 15
66	     6.1.  Initialization . . . . . . . . . . . . . . . . . . . . . . 16
67	     6.2.  Super Node Selection . . . . . . . . . . . . . . . . . . . 17
68	     6.3.  Member Join  . . . . . . . . . . . . . . . . . . . . . . . 18
69	     6.4.  Refresh Information  . . . . . . . . . . . . . . . . . . . 19
70	     6.5.  Member Leave . . . . . . . . . . . . . . . . . . . . . . . 19
71	     6.6.  Data Delivery Control  . . . . . . . . . . . . . . . . . . 20
72	     6.7.  Failure Recovery . . . . . . . . . . . . . . . . . . . . . 20
73	     6.8.  Self-improvement . . . . . . . . . . . . . . . . . . . . . 22
74	   7.  Metrics specification  . . . . . . . . . . . . . . . . . . . . 24
75	   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 26
76	   9.  Open Issues  . . . . . . . . . . . . . . . . . . . . . . . . . 26
77	   10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 26
78	   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 26
79	   12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26
80	     12.1. Normative References . . . . . . . . . . . . . . . . . . . 26
81	     12.2. Informative References . . . . . . . . . . . . . . . . . . 27
82	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 29
83	   Intellectual Property and Copyright Statements . . . . . . . . . . 30

85	1.  Introduction

87	   Over the recent years, a lot of research efforts have been focusing
88	   on moving multicast support out of the network core, since the
89	   deployment of network layer multicast has been obtructed by both
90	   technical and opertional issues [3], [4].  To solve these issues of
91	   IP multicast, various application level multicast approaches have
92	   been proposed, which can be largely summarized into two categories,
93	   namely, application layer multicast (ALM) and overlay multicast (OM).
94	   As a matter of fact, network layer multicast requires changes in IP
95	   routers, while ALM and OM approaches rely on network unicast and does
96	   not need network layer infrastructure support from intermediate nodes
97	   (e.g. router).

99	   In ALM approach, end hosts form a virtual network, and multicast
100	   delivery structures are constructed on top of such a virtual overlay.
101	   A basic ALM approach is to form and maintain an overlay for data
102	   transmission, where all end hosts in a multicast session are involved
103	   without considering the heterogeneities of them, e.g. computation
104	   power, bandwidth and access possibilities.  For instance, all end
105	   hosts join the full mesh construction of ESM (Narada) [5] and
106	   multiple connections exist between any two nodes.  The main advantage
107	   of constructing such a mesh is easy implementation and being
108	   relatively stable.  However, ESM's sole dependence on the mesh
109	   structure results in that it could only be applied well into a small
110	   or medium-sized group [6].  NICE [7], in contrast, introduces a
111	   hierarchical management scheme to create a scalable ALM overlay.
112	   This hierarchical design simplifies the membership management of the
113	   application layer multicast and makes it scale better than the full
114	   mesh-based structure.  Nevertheless, the joining procedure in NICE
115	   causes a very high control overhead, which not only limits the
116	   scalability of deployment, but also is likely vulnerable to single
117	   node failures (e.g. possible failures caused by the node at the
118	   highest layer of hierarchy).  As described above, ALM approaches
119	   address some practical/-deployment issues in network layer multicast
120	   but there is a general concern about its efficiency and scalability.

122	   Observing the weaknesses from ALM approaches, an alternative
123	   approach, i.e., overlay multicast or OM, by using a kind of
124	   "infrastructure-based" solution, has been proposed to improve
125	   multicast efficiency and maximize the resource usage (e.g,
126	   bandwidth).  Proposals of such an approach include OMNI [8] and TOMA
127	   [9].  The design issues of OM can be summarized in the following two
128	   aspects: On one hand, OM approaches employ some fixed or long-term
129	   infrastructure-based nodes to simplify membership management and
130	   multicast tree construction.  This advantage can become a weakness
131	   since the assumption of these fixed nodes in the infrastructure
132	   limits the extensibility and flexibility of deployment.  For example,
133	   the infrastructure must be re-established based on other long-term
134	   measurements before constructing a new multicast trees to adapt to
135	   the requirements imposed by a different metric.  On the other hand,
136	   TOMA and OMNI need dedicated infrastructure deployment and costly
137	   servers, which could not be adaptive to dynamic network changes and
138	   group member changes such as new members join.  Therefore, it is
139	   relatively difficult to implement them into the current Internet
140	   environment although they are proposed to provide multicast support
141	   for group communication applications.  Obviously, to develop a
142	   practical, efficient and resilient multicast framework is the
143	   essential way towards wide deployment of multicasting services.

145	   Additionally, the explosive growth of multimedia services and
146	   applications over Internet necessitates streaming media to a large
147	   population of users.  However, with current media streaming
148	   technology, it's hard to develop a comprehensive on-demand media
149	   streaming system due to the following two challenges [10].  First,
150	   the total number of concurrent clients the system can support is
151	   limited by the resources of the streaming supplier.  Second, current
152	   media streaming proposals usually have limitations in reliability and
153	   scalability.  The reliability concern arises from the fact that only
154	   one entity is responsible for all clients.  The scalability issue
155	   arises from the fact that adding internet-scale potential users
156	   requires the commensurate amount of resources to the supplying
157	   server.  Meanwhile, aforementioned proposals could not explicitly
158	   support real-time media streaming applications in a large scale.

160	   Motivated by above studies, in this draft we present a new overlay
161	   multicast framework which manages a dynamic mesh-based overlay core
162	   and only involves participating end hosts without relying on the
163	   availability of the OM-aware infrastructure nodes, while providing
164	   certain degree of efficiency, reliability and resilience.

166	2.  Features of DMMP

168	   DMMP is organized into a two-level hierarchy and the mechanisms of
169	   DMMP are introduced to dynamically manage and maintain the hierarchy.
170	   The key idea behind DMMP is to let a few end hosts selected and self-
171	   organize into an overlay mesh during the multicast initialization
172	   phrase and also when group member changes, and dynamically maintain
173	   such a mesh.  Although routers may also be manually designated (e.g.
174	   by ISPs) to construct the mesh, this document initially discusses the
175	   approach via end hosts.  Specifically, there are three design issues
176	   to be addressed in DMMP:

178	   o  Host heterogeneity - Previous research has shown that a large
179	      proportion of free-riders (i.e. group member who can only receive
180	      data from incoming sessions) may exist in the network [11], [12].
181	      DMMP considers the heterogeneous capacities of group members by
182	      evaluating their available bandwidth during the runtime.  In the
183	      DMMP framework,it is possible that only a small number of high-
184	      capacity end hosts are selected to construct the overlay mesh when
185	      there are a large proportion of free-riders [13].  This operation
186	      may help maximizing the usage of available bandwidth for the
187	      overlay tree.
188	   o  Scalability - Scalability is one of the main problems to be solved
189	      in the multicast applications.  In DMMP, each end-host may act as
190	      a potential server for other clients and the number of possible
191	      servers increases at the same rate as the end host clients.  As
192	      peer to peer (p2p) technologies have been deployed to support
193	      various services over the Internet, it is possible that more end
194	      hosts resources are available in the network.  Once a node joins
195	      the DMMP multicast session, additional resources are available to
196	      the whole system.  Therefore, the DMMP system is scalable as it
197	      can potentially support a number of clients.
198	   o  Delay optimization - DMMP considers end-to-end delay for end
199	      hosts.  When forming the local cluster, non-super nodes select the
200	      nearest super node in term of e2e delay measurement.  This aspect
201	      is essential while supporting delay sensitive applications (e.g.
202	      media streaming).  Furthermore, high-capacity nodes are given high
203	      priority to stay at the higher level of the tree when constructing
204	      the overlay multicast tree.  In return, it allows to produce the
205	      tree as short as possible and hence the overall delay could be
206	      reduced.
207	   o  Resilience to dynamic changes - DMMP considers the transient
208	      nature of end hosts and tries to prevent incapable or short-lived
209	      nodes from staying close to the center of the multicast tree.
210	      Consequently, the DMMP overlay structure is relatively stable and
211	      resilient to dynamic network changes.  Moreover, network situation
212	      changes (such as multicast members joining/leaving) within a local
213	      cluster will not have any impact on other clusters.  The failure
214	      of a single node may result in a transient instability in a small
215	      subset of participants, but it will not cause a catastrophe in the
216	      whole overlay.

218	   In order to overcome the two challenges in Section 1, media streaming
219	   task in DMMP is accomplished through the following two phases: (1) An
220	   on-demand overlay core(or mesh) is established to achieve the
221	   optimized performance; (2) Based on the structured mesh, several
222	   clusters are formed to connect with selected mesh members, namely,
223	   super nodes.  Here, DMMP applies the concept of locality (e.g.,
224	   clusters) into the group management so that it can dramatically
225	   reduce the control overhead and the complexity of the overlay
226	   maintenance.  Basically, a source-based DMMP architecture consists of
227	   a sender, several receivers, one or many Rendezvous Points (RPs) and
228	   Domain Name Systems (DNSs).  Compared with existing application level
229	   multicast approaches, DMMP is designed to be more stable, efficient
230	   and applicable to support large-scale groups without relying on
231	   predetermined intermediate nodes in the network and potentially get
232	   better performance.  Note that DMMP currently considers source-
233	   specific multicast [1], any-source multicast is left for future
234	   studies.

236	3.  Terminology and Abbreviations

238	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
239	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
240	   "OPTIONAL", in this document are to be interpreted as described in
241	   BCP 14, RFC 2119 [2].  Other terminologies and abbreviations used in
242	   this document are used as follows:

244	   o  Overlay Multicast - A multicast data delivery scheme depending on
245	      end hosts to form an overlay core for message control and a
246	      multicast tree for data delivery.
247	   o  Rendezvous Point (RP) - A designated node for a multicast group,
248	      which assists managing group members and stores some required
249	      information (e.g. performance required).
250	   o  Source - The multicast service sender.  It could be a video stored
251	      server or some video distributed servers in one service domain,
252	      which delivers the data traffic to the source-based multicast
253	      group members; DMMP in this document only provides the source-
254	      specific mechanism to realize the single source-based overlay
255	      multicast.
256	   o  Super nodes - Some end hosts are chosen to manage the multicast
257	      group and relay data from the mesh to receivers within clusters.
258	      Currently, only end hosts can serve as super nodes; future version
259	      of this document may specify the case when some routers (e.g.,
260	      first-hop routers) are used as super nodes.
261	   o  Receivers - Multicast group members who want to receive the data
262	      from the source.
263	   o  Mesh - An overlay core, which is responsible for group member
264	      management and multicast tree configuration.
265	   o  Clusters - Based on each super node, end hosts organize themselves
266	      into a core-based multicast tree. [14].
267	   o  Stress - A tree metric that counts the number of identical packets
268	      sent by the protocol over a single link or a single node.
269	   o  Out-degree - Available connections, namely, the available number
270	      of connections that a node can establish.

272	   o  Uptime - The time duration from a node joining in a multicast
273	      session to its leaving the multicast session.

275	   Within each cluster, there are some terminologies and abbreviations
276	   which are used as follows:

278	   o  Parent - the direct upstream node of a node is called the parent
279	      of that node, e.g. end host 1.2 is the parent of end host 1.2.1 in
280	      Figure 1.
281	   o  Parent level nodes (PLN)- nodes (exclusive parent) at the same
282	      level as the parent of some node, e.g. 1.1 and 1.3 are parent
283	      level nodes of 1.2.1 in Figure 1.
284	   o  Child - the direct downstream node of some node, e.g. in Figure 1,
285	      1.2.1 is the child of 1.2.
286	   o  Children level nodes (CLN)- nodes (exclusive children) at the same
287	      level as children of some node, e.g. 1.1.1, 1.1.2, 1.3.1 are
288	      children level nodes of 1.2 in Figure 1.
289	   o  Siblings - nodes at the same level of a node are called siblings,
290	      e.g. in Figure 1, 1.1.1, 1.1.2, 1.2.2, 1.2.3 and 1.3.1 are
291	      siblings of 1.2.1.

293	                 /-------------------------------\
294	                /              1.1.1    1.1.1.1  |
295	               /     1.1    / End host- End host |
296	              /     End host   1.1.2             |
297	             /     /        \ End host           |
298	            /     /            1.2.1             |
299	           /     /         / End host    1.2.2.1 |
300	          /     /         /             End host |
301	         /     /   1.2   /    1.2.2   /          |
302	   Super node--- End host -- End host            |
303	         \     \         \            \ End host |
304	          \     \         \   1.2.3      1.2.2.2 |
305	           \     \         \ End host            |
306	            \     \                              |
307	             \     \ 1.3        1.3.1            |
308	              \     End host - End host          |
309	               \                                 |
310	                \--------------------------------/
311	                            Cluster

313	               Figure 1: An example of local Cluster

315	4.  DMMP Overview

317	   The DMMP framework consists of two types of functionalities: control
318	   plane and data plane.  The control plane composes one overlay mesh
319	   and some core-based clusters.  Data plane is, then, built on the top
320	   of the structured control plane.  Meanwhile, the source entity and a
321	   set of super nodes form the overlay mesh, in which each super node
322	   supervises one cluster.  Although the Tree-based overlays are
323	   regarded as the most efficient approach for data distribution in a
324	   stable network, they are not effective for dynamic scenarios since
325	   pure tree structure has difficulties to meet both high bandwidth and
326	   high reliability requirements.  The first difficulty is the delivered
327	   service quality to downstream end hosts is limited by the minimum
328	   bandwidth among the upstream connections from the source; for
329	   instance, a member in the OMNI tree receives data only from its
330	   upstream node and the data reception rate of this member can not be
331	   greater than its upstream node.  It is even more difficult in the
332	   core network where each MSN should be responsible for data delivery
333	   to a whole cluster.  For the second requirement, a tree is less
334	   robust than a mesh because a single node failure or a loop can
335	   partition the tree and disable communications among the members.  To
336	   increase the throughput of the whole overlay network, an DMMP-aware
337	   mesh as one of multiple forwarding solutions is introduced in this
338	   draft.  Such a mesh will allow the reception of packets from multiple
339	   nodes other than from only the upstream node.

341	4.1.  Control plane in DMMP

343	   In this source-specific overlay multicast protocol, the combination
344	   of available bandwidth [15] and uptime represents the capacity of
345	   each node [13].  For media streaming system, available bandwidth
346	   resources possessed by a multicast group may be insufficient during
347	   runtime [16].  It is reason why DMMP needs to consider the degree
348	   bound in streaming applications, which can be easily observed from
349	   the available bandwidth.  For example, on the assumption that the bit
350	   rate of media is B and the outbound bandwidth of an end host i is
351	   b(i), the total number of connections it can establish is b(i)/B
352	   which is also the maximum degree of the end host.  Moreover, the
353	   usage of available bandwidth in overlay routing has become possible,
354	   based on recent advances in available bandwidth measurement
355	   techniques and tools [17], [18].  Obviously, if an application has
356	   additional requirements on end-to-end delay or loss rate, these
357	   metrics can be jointly considered during the overlay hierarchy
358	   construction.

360	   o  Step 1: After initialization, RP will calculate the out-degree of
361	      each host and distribute them into two categories: leaf nodes
362	      (whose out-degree is less than 2) and non-leaf nodes.  If one's
363	      out-degree is less than 2, the end host can act as leaf node
364	      because it can only receive data from the incoming connection.
365	   o  Step 2: The information of two-category nodes are respectively
366	      stored at the RP.  Meanwhile, all non-leaf nodes are placed in the
367	      order of out-degree and reported to the source.  On receiving the
368	      list of ordered non-leaf nodes, the source selects an application-
369	      specific number of them as super nodes.  Those nodes have higher
370	      capacities as defined in Section 6.2, and are used to manage the
371	      multicast group.  In the initialization stage, nodes with higher
372	      bandwidth support will be selected as super nodes since the
373	      current uptime is zero.  The capacities of each super node are
374	      also stored at the source and the RP.
375	   o  Step 3: After being selected, the super nodes organize themselves
376	      into a mesh rooted at the source.  The issue of overlay mesh
377	      contruction is mostly motivated from [5].

379	   DMMP considers the heterogeneous capacities of group members by
380	   evaluating their available bandwidth during runtime so that high-
381	   capacity nodes (i.e., super nodes) which are able to and willing to
382	   make more contributions to the network are expected to get better
383	   performance.  This will maximize the usage of available bandwidth for
384	   the overlay tree.  To further improve the overlay mesh performance,
385	   DMMP allows dynamically adding and deleting links within the mesh,
386	   which will be clarified in a future version of this draft.  Based on
387	   selected super nodes, some core-based clusters will be formed to
388	   connect with the mesh.

390	   o  Step 1: After constructing the overlay mesh, the next step is to
391	      form core-based clusters.  Each non-super node will firstly
392	      consult its local cache for super node candidates.  If there is no
393	      suitable candidate, it queries the RP immediately.  Then, the
394	      requestor caches these newly received candidates, from which it
395	      selects the best one based on e2e latency measurements.  If there
396	      are multiple super nodes which can provide similar e2e latency for
397	      the node, one of them with higher out-degree will be chosen.
398	   o  Step 2: Those non-super nodes sharing the same super node will
399	      form a local cluster.  The cluster formation is initiated by the
400	      super node which answers for informing the RP and contacting the
401	      source.  Generally, certain number (due to the super node's
402	      available bandwidth) of end hosts with higher capacity will be
403	      selected as its immedidate children.  This operation guarantees
404	      that the multicast tree within each cluster meets the bandwidth
405	      need of media streaming applications.
406	   o  Step 3: Afterwards, direct children of super nodes choose some
407	      nodes with higher capacities (i.e. out-degree, e2e latency) as
408	      their children.  This selection method will expedite the
409	      convergence of the tree and alleviate the average latency in some
410	      senses.

412	   The iteration will continue until all cluster members join the tree,
413	   and accordingly the control hierarchy is constructed for the
414	   multicast group .  For the sake of resilience, each node in the local
415	   cluster should keep some information of its relatives in the local
416	   cache.  In this draft, these entities (parent, PLN, child, CLN and
417	   siblings, see in Section 2) are denoted as relatives.

419	   Figure 2 illustrates a basic example for the overlay construction.
420	   Assuming that it is the first time to construct the overlay
421	   hierarchy, then:

423	   o  Step 1: When obtaining the list of non-leaf members from the RP,
424	      the source selects six of them as super nodes.  After selection,
425	      super nodes self-organize into a mesh rooted at the source.
426	   o  Step 2: Each non-super node chooses the best super node based on
427	      the e2e latency measurements.  Those non-super nodes sharing the
428	      same super node will then form a local cluster.
429	   o  Step 3: According to the super node's capacity, no more than K end
430	      hosts with larger capacities are selected as its immediate
431	      children.  Here, three end hosts, namely, 1.1, 1.2 and 1.3 are
432	      chosen as the immediate children of the super node.
433	   o  Step 4: Once the capacity of the super node is exhausted, it
434	      responds to new requestors with its immediate children and an
435	      indication of rejection.  For example, upon the receipt of
436	      rejections and a list of candidate parents (i.e. 1.1, 1.2 and 1.3)
437	      from the super node, end hosts 1.1.1, 1.1.2 and 1.3.1 send Join
438	      request to them.  In this case, requestors with higher out-degree
439	      are likely to be selected as the children.  If there are multiple
440	      acceptances, the end host will select the one which is "near" in
441	      terms of the e2e latency.  For example, 1.1.1 and 1.1.2 accept the
442	      request and join the cluster as children of 1.1.  Then 1.1 will
443	      update the associated information to 1.2, 1.3, and as well as to
444	      the super node after its children selection.
445	   o  Step 5: If there are still some nodes left, they will receive a
446	      rejection and need to re-send Join request to the children at the
447	      lower level, i.e. 1.1.1, 1.1.2 and 1.3.1.  When all receivers
448	      confirm their positions in the cluster, the control plane is
449	      initially constructed.

451	                                Communication
452	                          Source <========> Rendezvous Point
453	                            |      Channel
454	                            |
455	                            |                      ---------------\
456	                            v                     /         1.1.1 |
457	              /---------Super Node---------\     /  1.1   End host|
458	              |         /      \   \       |    / End host/       |
459	              |        /        \   \      |   / /        \ 1.1.2 |
460	              v       /          \   \     v  / /   1.2   End host|
461	   Cluster -> Super Node          \  Super Node -- End host       |
462	              |       \            \    /  |  \ \   1.3    1.3.1  |
463	              |        \            \  /   |   \ End host-End host|
464	              |         \            \/    |    \                 |
465	              v          \           /\    v     \----------------/
466	   Cluster -> Super Node  \         /Super Node <- Cluster
467	              |       \    \       /   /   |
468	              |        \    \     /   /    |
469	              |         \    \   /   /     |
470	              \-----------Super Node ------/
471	                              |
472	                              |
473	                           Cluster

475	                    Figure 2: Control Hierarchy in DMMP

477	   After constructing the control hierarchy, only super nodes need to
478	   keep the full knowledge among themselves, while non-super nodes only
479	   need to keep the knowledge of a small part of the group within each
480	   cluster.  This will help dramatically reducing the control overhead
481	   of the whole multicast tree compared with that each member keeps the
482	   full knowledge of the entire group whilst providing certain
483	   resilience.  Therefore, the overlay hierarchy of DMMP incurs a low
484	   delay penalty, makes effective use of network resources, and
485	   considers the heterogeneities of end hosts during multicast sessions.

487	4.2.  Data plane in DMMP

489	   Corresponding to the constructed control plane, the data plane can be
490	   divided into two parts.  The first part, which is within the overlay
491	   mesh, is built through the reverse shortest path between each super
492	   node and the source, identical to DVMRP [19].  For example, a super
493	   node A receives the packet from the source through its neighbor B
494	   only if B is the next hop on the shortest path from A to the source.
495	   In this case, super node A will replicate and forward the packets
496	   directly to super node B. Likewise, B will replicate and forward the
497	   packet to the next super node as well.  When receiving the data, the
498	   super node will replicate and forward the data to its direct children
499	   in the cluster.  Therefore, data can be delivered throughout the
500	   whole cluster.

502	   The second part is data delivery within core-based clusters.  In each
503	   cluster, data are firstly forwarded from the super node to its direct
504	   children.  Then, the receivers will replicate the data and forward
505	   them to its children at the lower level.  The whole iteration
506	   terminates with success if all receivers within each cluster receive
507	   the data.

509	   Figure 3 depicts an example for data delivery within a DMMP-aware
510	   cluster.  Here, a respective data channel between two entities is
511	   established by exploiting the existing protocol stacks such as UDP/IP
512	   or TCP/IP.  The data channels utilize IP unicast according to the
513	   underlying IP transport scheme.  As shown in the figure, data are
514	   firstly replicated into three copies, respectively delivered from the
515	   super node to its direct children 1.1, 1.2 and 1.3 using unicast.
516	   Similarly, 1.1 replicates copies of the data according to the number
517	   of their children (e.g. two copies), sending to 1.1.1 and 1.1.2.  In
518	   the next iteration, the receiver will similarly make copies and
519	   deliver to its children.

521	                 /-----------------------------\
522	                /              1.1.1           |
523	               /     1.1    / End host         |
524	              /     End host   1.1.2           |
525	             /     /        \ End host         |
526	            /     /            1.2.1           |
527	           /     /         / End host          |
528	          /     /         /                    |
529	        -----> /   1.2   /    1.2.2    1.2.2.1 |
530	   Data ----->  End host -- End host - End host|
531	        -----> \         \                     |
532	          \     \         \   1.2.3            |
533	           \     \         \ End host          |
534	            \     \                            |
535	             \     \ 1.3        1.3.1          |
536	              \     End host - End host        |
537	               \                               |
538	                \                              |
539	                 \-----------------------------/
540	                         Cluster

542	                    Figure 3: Data delivery in DMMP

544	5.  DMMP Messages

546	   DMMP handles tasks related to overlay hierarchy management, multicast
547	   tree configuration and maintenance.  It uses a common format to carry
548	   both data and control packets as shown in Figure 4.

550	   0               7                       15                    31
551	   +---------------+-----------------------+----------------------+
552	   |    version    |     Tree version      |       Option         |
553	   +---------------+-----------------------+----------------------+
554	   |                        DMMP session ID                       |
555	   +--------------------------------------------------------------+
556	   |                             Source ID                        |
557	   +--------------------------------------------------------------+
558	   |                          Sequence Number                     |
559	   +--------------------------------------------------------------+
560	   |                               Reserved                       |
561	   +--------------------------------------------------------------+
562	   |                           Payload Data length                |
563	   +--------------------------------------------------------------+

565	                    Figure 4: DMMP Packet Header Format

567	   Session ID and Source ID are generated by the RP and guaranteed to be
568	   collision free.  The tree version field is used to prevent loops and
569	   partitions from the multicast tree.  Since the multicast session tree
570	   is initialized and controlled by the RP, a loop free topology may be
571	   generated.  Moreover, since tree update messages are independently
572	   disseminated to all group members, there is possibility that some
573	   messages might be lost and received out-of-order by different group
574	   members.  These members may act on Refresh message with updating
575	   capacities.  All these events could cause loops and tree partition.
576	   In order to avoid these failures, the RP will assign a monotonically
577	   increasing version number to each newly generated multicast tree.

579	   The Option fields in the header defines various types of operation
580	   messages.  Currently, there are seven pairs of control messages as
581	   shown in Table 1.  Each pair of control messages will be exchanged
582	   between the DMMP-aware entities in a request-and-response way.

584	   o  Subscription Request and Response - Group members get the address
585	      of RP from Domain Name Systems (DNSs).
586	   o  Ping_RP Request and Response - During bootstrapping, each member
587	      of the group gets a list of available super nodes from the RP,
588	      containing at least one active node.
589	   o  Join Request and Response - A newly joining member sends request
590	      in order to join the multicast session and gets corresponding
591	      information from active group members.

593	   o  Status Request and Report - To request the status reports from
594	      neighbors or relatives, and accordingly to send reports to them.
595	   o  Probe Request and Response - To probe whether the target node is
596	      still active or not.
597	   o  Inactive Report and Response - To inform the other group members
598	      that the target node is inactive.
599	   o  Refresh Request and Response - To maintain the overlay hierarchy,
600	      they are used to periodically update the capacities (such as
601	      uptime, out-degree) of group members.

603	   To adapt to dynamic network changes, each end host maintains the
604	   overlay core by periodically updating its capacities.  For example,
605	   it periodically exchanges Refresh message with its neighbors.  If
606	   node A cannot receive this message from node B within the Refresh
607	   timer, node A will send a Probe request to node B. If there is still
608	   no Response returned, node B will be confirmed to be inactive by a
609	   certain time.  Then the Status report, indicating node B being
610	   inactive, will be used to inform the rest of group members.

612	   Table 1 lists the DMMP messages according to the associated DMMP
613	   operational phases.

615	                             Table 1: DMMP Messages
616	            +-----------------+------------+------------+------------+
617	            |    Messages     |  Operation |      From  |    To      |
618	            +-----------------+------------+------------+------------+
619	            | Subscription Rq | Initializ- |Group Member| DNS server |
620	            +-----------------+  ation     +------------+------------+
621	            | Subscription Res|            | DNS server |Group Member|
622	            +-----------------+------------+------------+------------+
623	            | Ping_RP Request | Bootstrap  |Group Member|    RP      |
624	            +-----------------+            +------------+------------+
625	            | Ping_RP Response|            |    RP      |Group Member|
626	            +-----------------+------------+------------+------------+
627	            | Join Request    |   Member   |  New Host  | Group Mem. |
628	            +-----------------+            +------------+------------+
629	            | Join Response   |   Join     | Group Mem. |  New Host  |
630	            +-----------------+------------+------------+------------+
631	            | Status Request  |   Cluster  |Group Member|Group Member|
632	            +-----------------+   Member   +------------+------------+
633	            | Status Report   | Monitoring |Group Member|Group Member|
634	            +-----------------+------------+------------+------------+
635	            | Probe Request   |   Probe    |Group Member|Group Member|
636	            +-----------------+            +------------+------------+
637	            | Probe Response  |   Members  |Group Member|Group Member|
638	            +-----------------+------------+------------+------------+
639	            | Inactive Rep.   |   Member   |Leaving Node|Group Member|
640	            +-----------------+            +------------+------------+
641	            | Inactive Report |    Leave   |Group Member|Leaving Node|
642	            +-----------------+------------+------------+------------+
643	            |Refresh Request  |   Update   |Group Member|Group Member|
644	            +-----------------+            +------------+------------+
645	            |Refresh Response |Information |Group Member|Group Member|
646	            +-----------------+------------+------------+------------+
647	                  Legend:    SN : Super Node
648	                         Rq/Req.: Request

650	   Although TCP provides a generic protocol for a guaranteed, in-order
651	   delivery of stream-based messages, this reliability comes at a price
652	   in performance.  Besides, the communication pattern in DMMP is
653	   strictly in a request-response mode, and most messages have a small
654	   fixed maximum size.  Thus, it is preferred to encapsulate all DMMP
655	   messages over UDP to provide the required delivery guarantees without
656	   extra network burdens.

658	6.  DMMP: Protocol Details

660	   All DMMP-aware nodes and the source are assumed to be able to know
661	   the address of the RP.  Also, once a source node starts, a direct
662	   communication channel will be established between the source node and
663	   the RP, so that the necessary information like the capacities of
664	   group members and active super nodes could be exchanged between them
665	   during the lifetime of an overlay multicast session.  Furthermore, a
666	   DNS namespace is required to maintain the RP information for a
667	   specified multicast group.

669	   Before initialization, each group member and the source send out
670	   Subscription request containing the specified group name and domain
671	   name, to the DNS for the address of associated Rendezvous Point (RP).
672	   Since RP does not participate in the data forwarding, the location of
673	   RP has no significant impact on the performance of data distribution.
674	   If there is no existing RP which is serving for this multicast group,
675	   DNS will allocate a new one for this multicast group based on
676	   application requirements.  Otherwise, the RP's address will be sent
677	   back.  It is possible that multiple RPs serve for the same multicast
678	   group, e.g. for the purpose of load balancing and fault tolerance.
679	   For simplicity, this document initially considers the case where
680	   there is only one RP involved.

682	6.1.  Initialization

684	   During the initialization phase, certain application related software
685	   has been distributed to the prospective DMMP-aware entities within
686	   the DMMP control and data transport models.

688	   Before the multicast session starts, the source and RP must be ready
689	   to give response to requests from DMMP-aware end hosts.  The source
690	   and RP will take no further reactions to any DMMP requests once the
691	   session stops.  The active session time should be the period from the
692	   service starts until it stops.  Then the out-of-band channel between
693	   the RP and source should be active during this active session so that
694	   the source can monitor the current status of memberships.  However,
695	   the detailed mechanism for implementing this out-of-band
696	   bootstrapping is out of the scope of this document.

698	   Moreover, session-related information should be obtained before the
699	   session starts and all prospective group members use out-of-band
700	   bootstrapping mechanism to get necessary information, for instance,
701	   Group ID and location of RP including the port number serving for
702	   certain sessions before the application begins.  Then DMMP-aware
703	   entities can start receiving data after they join the overlay
704	   hierarchy.

706	6.2.  Super Node Selection

708	   During the mesh construction, the selection of the super nodes is to
709	   ensure that a newly joining member is able to quickly find its
710	   appropriate position in the multicast tree using a very small number
711	   of queries such as Join request.  As the super node selection is the
712	   first step towards mesh establishment, this section gives more
713	   details.

715	   To select super nodes for better performance while maintaining
716	   scalability, the following distribution requirements need to be taken
717	   into account [20].
718	   o  Connections: Super nodes have relatively higher capacties and are
719	      expected to be strong to perform additional tasks such as
720	      resources control, load balance and fault tolerance.
721	   o  Number: To be more efficient, the number of active super nodes is
722	      no more than one hundred, otherwise it may cause high control
723	      overhead and high stress [5].  Assuming that each super node can
724	      manage, in average, hundreds of cluster members, it is sufficient
725	      to support totally more than thousands of end hosts.  Otherwise,
726	      multiple sources will be deployed into media streaming by multiple
727	      overlay multicast sessions.  To balance the tradeoff between the
728	      efficiency and the reliability, it is reasonable to select an
729	      application-specific number of super nodes to construct the
730	      overlay mesh.  For example, in the case of 110 end hosts, it may
731	      need 10 super nodes if the required ratio is 10%.  In this case,
732	      10 end hosts with higher capacity will be chosen as super nodes.
733	   o  Downstream: To adapt to bandwidth requirements, super nodes should
734	      not serve more than K non-super nodes as its immediate children,
735	      where K is respectively determined by the available out-degree of
736	      each super node and service specifications.

738	   In order to deal with factors from a large-scale and dynamic network
739	   environment, the following three conditions are outlined in addition
740	   to above requirements.

742	   o  Stable: One cause for current multimedia streaming serivces which
743	      cannot guarantee required QoS mainly from unstable network status.
744	      Thus, super nodes should be relatively stable because, otherwise,
745	      its cluster members are easily partitioned from the tree.
746	   o  Resilience: Super nodes are responsible for detecting dynamic
747	      changes and for handling them quickly, e.g. one super node leaves
748	      the group ungracefully, which should be detected by at least one
749	      of other active nodes.  Then, a new super node should be quickly
750	      selected to replace the leaving super node in the position.  The
751	      time for detection and recovery process is also constrained by
752	      certain service requirements.

754	   o  Security: Super nodes should be fundamentally invulnerable to
755	      attacks; otherwise, they will easily disrupt the multicast service
756	      by forwarding wrong messages or failing to accept information.

758	6.3.  Member Join

760	   DMMP is resilient to dynamic network changes, for examples, events
761	   like a member joining/leaving.

763	   The newcomer first checks its local cache for super node candidates.
764	   If there are no suitable candidates in the cache, it requests the RP
765	   for the addresses of the source and super node candidates.  After
766	   receiving the source address and a list of active super nodes, it
767	   caches their capacities, i.e. uptime, out-degree.  Then, it will
768	   measure the e2e latency between them and itself, and sends the Join
769	   Request message to some super node which can provide smaller e2e
770	   latency.

772	   Upon receiving Join Request from a newcomer, the super node will
773	   check its current out-degree.  If it is possible to accept the
774	   newcomer to as its immediate child, the super node will respond with
775	   an indication of acceptance.  In this case, the information about
776	   newly joining nodes will only be propagated to the existing children
777	   of this super node since no child of the new member exists.  When the
778	   super node cannot accept it as its immediate child, it will redirect
779	   this Join Request message to its active children with the largest
780	   available out-degree.  If one of them responses to the super node,
781	   this response will be relayed to the new member.  If there are more
782	   potential parents, the new member selects the one with smallest tree
783	   depth as its parent.  If there are multiple potential parents at the
784	   same depth, it chooses the best one in terms of their uptime.  Once
785	   finding the appropriate parent, the new member starts data delivery.
786	   At this time, the information about the new member will be propagated
787	   from the parent to its PLN, siblings and the super node.  The process
788	   will be terminated until the new member finds its position and
789	   accordingly updates its related information at corresponding nodes.

791	   After joining the tree, the newcomers who have higher capacities
792	   could "climb" from the bottom to a higher level after some switch
793	   stages.  For example, a newcomer at the lower level could switch with
794	   its parent if its capacity exceeds (over a predefined threshold) the
795	   current parent.  The appropriate threshold will be defined to avoid
796	   unnessary switching since if the child has a smaller bandwidth
797	   support, it will be ultimately placed below the high capacity parent.

799	   In case the newcomer fails to find an appropriate position in any
800	   cluster to meet application requirements, it can sell itself as a
801	   potential super node and report its own capacities to the RP.

803	   Regarding its capacity and the current number of super nodes, it
804	   could be entitled as a super node.  In this way, end hosts have more
805	   flexibility to get optimal services, which will be specified in a
806	   future version of this document.

808	6.4.  Refresh Information

810	   In DMMP, each member is responsible for maintaining the overlay
811	   hierarchy, by periodically sending Refresh message.  The Refresh
812	   mechanism in the overlay mesh has a little difference from that in
813	   the local clusters.  To efficiently manage the overlay hierarchy,
814	   both active and passive models are utilized in DMMP.  Within each
815	   cluster, end host starts to exchange Refresh message with its PLNs,
816	   siblings and CLNs once it joins the cluster.  In addition that each
817	   member has to periodically update its information, members in the
818	   local cluster are able to request refresh message from their
819	   relatives, e.g., PNLs or CNLs.  This operation guarantees the
820	   reliability and the stability of the overlay hierarchy.

822	   For the Refresh message in the mesh, each super node sends its
823	   current information to all mesh members including the source.  Once
824	   receiving updated information, the source will correspondingly update
825	   the information at the RP.  If one mesh member stops receiving
826	   Refresh message from another beyond the Mesh_Refresh_Timer, it
827	   assumes this neighbor to be either inactive or leaving.  In order to
828	   confirm the status, it may initiate a Probe message as stated in
829	   Section 5.

831	6.5.  Member Leave

833	   In most cases, two situations for a member leaving the group, either
834	   gracefully or ungracefully, are distinguished from each other.  In
835	   each cluster, the leaving member should at least send an Inactive
836	   Request to its parent or one of its children.  After receiving the
837	   confirmation, it can leave the group gracefully.  Then the notified
838	   node will propagate this Inactive message to its relatives so that
839	   they can update their service membership tables.  In the second case,
840	   the Inactive status will be detected by periodically exchanging
841	   Refresh messages.  If any member within the cluster, say p, fails to
842	   receive a Refresh Report message from one of its required relatives,
843	   say q, within the refresh timeout Refresh_Timer, then p sends a
844	   redundant Probe Request message to q.  If there is still no Probe
845	   Response message returned, p assumes q to be inactive and propagates
846	   this Status_Inactive message throughout the whole cluster.
847	   Afterwards, one of it children with relatively high capacity will
848	   replace its place, and other children will accordingly change their
849	   positions.  Nevertheless, ungraceful leaving may cause the crash of
850	   whole multicast tree.  DMMP is able to handle different situations by
851	   detecting the failures and recovering quickly from them as shown in
852	   Section 6.7.

854	   Compared with the handling in the local cluster, the operation is
855	   even tougher in the core mesh since all its cluster members are
856	   partitioned from the tree.  When there is no end hosts connecting to
857	   the leaving super node, no further changes to the overlay are
858	   required.  Before a super node gracefully leaves the group, it must
859	   recommend a replacement leader for the cluster it owns and inform
860	   other super nodes in the overlay mesh before leaving.  To detect
861	   unannounced leaves, DMMP relies on the periodic Refresh message
862	   exchanges.  If the failed peer happens to be a super node, the
863	   overlay hierarchy has to be repaired, which will be depicted in
864	   Section 6.7.

866	6.6.  Data Delivery Control

868	   After the multicast tree configuration, the new member will ask its
869	   immediate parent to send the data.  Generally, parent nodes will
870	   delete data in their local cache after they have forwarded them to
871	   their children.  If the parent still holds the data, the new member
872	   can quickly get the data from it.  If the parent has not received
873	   data yet, either it waits until the parent forwards the data after
874	   receiving, or it directly inquires the super node to deliver the
875	   data.  The former option is preferred in DMMP as its overhead is
876	   likely lower than the latter one.  Upon receiving the data, the new
877	   member will firstly forward them to its parent if its parent hasn't
878	   received the data yet.  If the parent has deleted the data, it will
879	   then ask its siblings to forward data directly.  In order to
880	   alleviate the redundant data transmission, the new member needs to
881	   wait for a certain while before it asks for the data from its
882	   siblings.

884	   Different from joining as a cluster member, the new member may join
885	   the multicast session as a super node.  In this case, it will firstly
886	   ask its neighbors in the overlay mesh to send the data.  In addition,
887	   it may query the data from its children when they are already in its
888	   local cluster.  If any of them receives the data, this new super node
889	   will also get the data.  A third possibility is to let the new member
890	   directly ask the source to send the data.  To alleviate the control
891	   overhead, it is recommended that this new node waits for a certain
892	   while until one of its mesh neighbors receives the data.

894	6.7.  Failure Recovery

896	   Maintenance of such a multicast tree in DMMP faces a key problem that
897	   non-leaf nodes in the tree are end hosts who are more likely to fail
898	   than routers and may join/leave the tree at will.  It does not happen
899	   in IP multicast since non-leaf nodes in the delivery tree are routers
900	   which do not leave the multicast tree without notification.  Thus,
901	   one challenge in DMMP is to reconstruct the overlay multicast tree
902	   after a node's departure.  If one non-leaf node leaves the group
903	   ungracefully, its downstream nodes will be inevitably affected.  Two
904	   possible means can alleviate such impacts: one is to reduce the
905	   possibility of failures; the other is to reduce the number of
906	   possible affected nodes.  In practice, however, the first way might
907	   be very difficult since end hosts may leave the group at will.  For
908	   the second option, DMMP proposes a proactive mechanism by
909	   periodically pushing high-capacity nodes to higher levels of the tree
910	   by capacity comparison mechanism.  Detailed mechanisms are described
911	   in Section 6.8.  Thereby, it is very likely that long-lived super
912	   nodes and their immediate children form a stable and efficient
913	   cluster core after a certain time.  The longer a node remains in the
914	   multicast session, the more it becomes attaching to other long-lived
915	   nodes with similar uptime.  In other words, higher capacity nodes
916	   form a well-connected core with relatively more bandwidth support and
917	   being more stable, whereas peers with less available bandwidth and
918	   shorter uptime will be placed out of the core as possible.

920	   In order to improve the performance of DMMP, especially when there
921	   are high packet losses or host failures, a reactive recovery
922	   technique is also implemented after failure detection.  Recovery from
923	   failures regarding a member crash is similar to handling a member
924	   leaves.  The difference is that surviving members usually do not
925	   receive prior notification of a crash.  Thus, Refresh message is
926	   periodically exchanged between each member and its neighbors.  It is
927	   even more difficult for super nodes to maintain the cluster since all
928	   of its local members are partitioned from the multicast tree and can
929	   not receive the multicast data until it is repaired.

931	   In the local cluster, each immediate child of the super node must
932	   find a backup parent in advance, either the source or a group member.
933	   Once the super node leaves the group, its children try to contact
934	   with their alternative parents to re-join the multicast tree.  This
935	   approach can facilitate the recovery process and strengthen the
936	   reliability of the overlay hierarchy.  In addition, cluster members
937	   periodically estimate their relatives (i.e.  PNLs, CNLs) within the
938	   cluster and evaluates the number of losses that it shares with these
939	   nodes.  While in the core mesh, each super node maintains state
940	   information about all other mesh members, no additional discovery of
941	   nodes is necessary.  Using this mechanism, packet delivery ratios can
942	   be increased with a high probability.  To handle different scenarios
943	   of failures, more mechanisms need to be defined in the near future.

945	6.8.  Self-improvement

947	   We suggest the self-improvement mechanisms and hence a cluster member
948	   having a higher capacity than its parent could be promoted.
949	   Basically, the children and their parents can swap their positions.
950	   After promotion, the former child becomes the new parent and its
951	   former parent becomes the current child.  However, there are some
952	   aspects impact the mechanism:
953	   o  the number of nodes involved in the promotion: the more nodes
954	      change their positions, the less stable the overlay tree becomes.
955	   o  the reliability of participated nodes: the promoted node may leave
956	      ungracefully, and its children will be partitioned from the
957	      existing tree.
958	   o  complixity of re-construction: after promotion, the re-
959	      construction of the existing tree may be complicated since the
960	      promoted child may have not enough bandwidth to accept all
961	      exisiting end hosts as its children.

963	   For simplicity, we describe the idea by taking the following example.
964	   In Figure 5, suppose that node 1.1.2 has much higher capacity than
965	   its parent 1.1 based on the capacity comparison.  Then, node 1.1.2
966	   sends a promotion request to node 1.1.  After a certain proof, node
967	   1.1 acknowledges the request and sends back a status report which
968	   contains the address of node s.  Here, it is necessary that node 1.1
969	   waits till node s has received the breakup request.  Otherwise, the
970	   join request from 1.1.2 may arrive earlier, which will cause a loop
971	   in the overlay tree.

973	   Then, node 1.1 breaks the connections with node s and 1.1.2.
974	   However, node 1.1 keeps node s as its backup parent in case node
975	   1.1.2 is leaving or unreachable.  Moreover, node s considers node 1.1
976	   as it temporary child.  At the same time, node 1.1.2 contacts node s
977	   and notifies node 1.1 to be its child.  Once node 1.1 receives the
978	   notification and rejoins the tree as the child of node 1.1.2, it may
979	   break the connection with node 1.1.1 if node 1.1.2 still has
980	   available capacity.  In the following example, node 1.1.2 can support
981	   at least three children.  Therefore, after the first swap, the node
982	   1.1.1 requests to join as one child of node 1.1.2.  The message flow
983	   of the promotion example is shown in Figure 6.

985	   Above mechanism can be tolerant to dynamic changes.  Considering the
986	   scenario that node 1.1 and 1.1.1 can still quickly rejoin the tree
987	   even if node 1.1.1 leaves ungracefully.  The optimization and
988	   extension of the mechanism will be studied in the near future.

990	                        End host
991	   Former:            /   1.1.1
992	                     /
993	                    /
994	                   /
995	   End host--End host--End host--End host
996	      s      1.1       1.1.2      1.1.2.1

998	                        ||
999	                        ||
1000	                        ||
1001	                        \/

1003	                        End host
1004	   Later:              /  1.1
1005	                      /
1006	                     /
1007	   End host--End host--End host
1008	       s      1.1.2  \ 1.1.2.1
1009	                      \
1010	                       \
1011	                        End host
1012	                         1.1.1

1014	           Figure 5: An example of self-improvement

1016	   End host        End host        End host        End host
1017	   1.1.2             1.1              s              1.1.1
1018	     |promotion req. |                |               |
1019	     |=============> |  break req.    |               |
1020	     |               |==============> |               |
1021	     |               |  break resp.   |               |
1022	     |promotion ACK  |<============== |               |
1023	     |<============= |                |               |
1024	     |         Join Request           |               |
1025	     |==============================> |               |
1026	     |         Join Response          |               |
1027	     |<============================== |               |
1028	     |  Join Notif.  |                |               |
1029	     |=============> | Break Confirm  |               |
1030	     |               |<==============>|               |
1031	     |               |            Join Notif.         |
1032	     |               | ==============================>|
1033	     |               | Join Request   |               |
1034	     |<===============================================|
1035	     |               | Join Response  |               |
1036	     |==============================================> |
1037	     |               |      Promotion Confirm         |
1038	     |               |<============================== |

1040	           Figure 6: Message flow of self-improvement example

1042	   Moreover, nodes at the bottom level are either transient nodes, leaf
1043	   nodes or new comers.  The newcomers who have higher capacities could
1044	   "climb" from the bottom to a higher level after some switching
1045	   stages.  For example, a newcomer at the lower level could switch with
1046	   its parent if its capacity exceeds (over a predefined threshold) the
1047	   current parent.  Nevertheless, an appropriate threshold will be
1048	   defined to avoid unnecessary switching since if the child has a
1049	   smaller bandwidth support, it will be ultimately placed below the
1050	   parent.  The main goal of doing this is to reducing the impacts of
1051	   frequent changes in the overlay so that only a small part of the
1052	   overlay multicast tree will be affected and needs to be re-
1053	   constructed after dynamic changes.

1055	7.  Metrics specification

1057	   End host based overlay multicast is more susceptive to dynamic
1058	   network changes since end hosts may join or leave the group at will.
1059	   It would be even harder for DMMP to manage and maintain such an
1060	   overlay mesh because super nodes may leave the group ungracefully as
1061	   well.  To address the instability of mesh, uptime is chosen as an
1062	   assisted criterion to strengthen its maintenance.  Once an end host
1063	   joins in the overlay multicast tree, its uptime starts to calculate
1064	   from 0 until its leaving.  Besides, there are several metrics used in
1065	   DMMP as the criteria of the capacity, which will be specified in this
1066	   section.

1068	                      Table 2: Capacity specification
1069	   +-------------+-------------------------------------+
1070	   |    Metric   |             Operation               |
1071	   +-------------+-------------------------------------+
1072	   |             | Differentiation non-leaf nodes from |
1073	   |             |           leaf nodes                |
1074	   | Out-degree  +-------------------------------------+
1075	   |             |        Super nodes selection        |
1076	   |             +-------------------------------------+
1077	   |             |   Tree construction within clusters |
1078	   |             +-------------------------------------+
1079	   |             |     New member joins the group      |
1080	   |             +-------------------------------------+
1081	   |             |     Failure recovery mechanism      |
1082	   |             +-------------------------------------+
1083	   |             |     Self-improving mechanism        |
1084	   +-------------+-------------------------------------+
1085	   |             |    Non-super nodes attach to super  |
1086	   | E2E latency |       nodes to form clusters        |
1087	   |             +-------------------------------------+
1088	   |             |     New member joins the group      |
1089	   +-------------+-------------------------------------+
1090	   |             |     New member joins the group      |
1091	   |    Uptime   +-------------------------------------+
1092	   |             |     Self-improving mechanism        |
1093	   |             +-------------------------------------+
1094	   |             |     Failure recovery mechanism      |
1095	   +-------------+-------------------------------------+

1097	   As shown in Table 2, out-degree is the main criterion to select the
1098	   super nodes from end hosts.  Then, non-super nodes select one super
1099	   node which locates "near" to it based on the estimated e2e latency.
1100	   During the tree construction within clusters, nodes with higher out-
1101	   degree are likely to join in the tree at the high level.  Regarding
1102	   new members joining procedure, out-degree, e2e latency and uptime are
1103	   all taken into considerations.  To keep the stability of the overlay
1104	   hierarchy, out-degree and uptime are chosen as comparison metric to
1105	   self-improve the overlay multicast tree.  Furthermore, out-degree and
1106	   uptime are regarded as the main selection criterion for alternative
1107	   node during the failure recovery.

1109	8.  Security Considerations

1111	   Security should be considered during the super nodes selection.  In
1112	   DMMP, the current preference is to use an authority center that
1113	   qualifies the trust level of end hosts.  Only when the end host
1114	   obtains a security certificate from the authority center, it can be
1115	   selected as a super node.

1117	   Within each cluster, Cluster key, Group key and Private key are
1118	   proposed as the security scheme to manage the cluster members.
1119	   Details and discussions related to other security issues are to be
1120	   explored in a future version of this document.

1122	9.  Open Issues

1124	   DMMP framework will study necessary extensions or open issues, where
1125	   security, large-scale efficiency, end-to-end quality-of-service (QoS)
1126	   provisioning will be likely related.  Moreover, initial DMMP does not
1127	   include support for NATs and firewalls since they impose fundemental
1128	   restrictions on pair-wise connectivity of hosts on the overlay.

1130	10.  Contributors

1132	   Ruediger Geib, Xiaodong Yang and David Weiss contributed great
1133	   efforts to this document.

1135	11.  Acknowledgements

1137	   We thank Nicolai Leymann, John Buford, Yuji-UG-Imai and Yangwoo Ko
1138	   for their helpful suggestions.

1140	12.  References

1142	12.1.  Normative References

1144	   [1]   Bhattacharyya, S., "An Overview of Source-Specific Multicast
1145	         (SSM)", RFC 3569, July 2003.

1147	   [2]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
1148	         Levels", BCP 14, RFC 2119, March 1997.

1150	12.2.  Informative References

1152	   [3]   Almeroth, K., "The evolution of multicast: From the MBone to
1153	         inter-domain multicast to  Internet2 deployment", IEEE
1154	         Network 2000, Jan./Feb 2000.

1156	   [4]   Diot, C., Levine, B., Lyles, J., and D. Balensiefen,
1157	         "Deployment issues for IP multicast service and architecture",
1158	         IEEE Network 2000, Jan. 2000.

1160	   [5]   Chu, Y., Rao, S., and et. al, "A Case for End System
1161	         Multicast", IEEE JSAC Vol. 20, No. 8, October 2002.

1163	   [6]   Banerjee, S. and B. Bhattacharjee, "Analysis of the NICE
1164	         Application Layer Multicast Protocol", UMIACS Technical
1165	         report TR 2002-60 and CS-TR 4380, June 2002.

1167	   [7]   Banerjee, S., Bhattacharjee, B., and et. al, "Scalable
1168	         Application Layer Multicast", SIGCOMM 2002, August 2002.

1170	   [8]   Banerjee, S., Kommareddy, C., and et. al, "OMNI: An Efficient
1171	         Infrastructure for Real-time Applications", Computer
1172	         Networks, Special Issue on Overlay  Distribution Structures and
1173	         their Applications, Vol. 50, No. 6, April 2006.

1175	   [9]   Lao, L., Cui, J., and et. al, "TOMA: A Viable Solution for
1176	         Large-scale Multicast Service Support", IFIP Networking 2005,
1177	         May 2005.

1179	   [10]  Hefeeda, M. and B. Bhargava, "On-Demand Media Streaming Over
1180	         the Internet", FTDCS'03, The Ninth IEEE Workshop on Future
1181	         Trends of Distributed Computing Systems, ftdcs, p. 279,
1182	         April 2003.

1184	   [11]  Saroiu, S., Gummadi, P., and S. Gribble, "A Measurement Study
1185	         of Peer-to-Peer File Sharing Systems", MMCN'02 Proceedings of
1186	         Multimedia Computing and  Networking, July 2002.

1188	   [12]  Sripanidkulchai, K., Maggs, B., and H. Zhang, "An analysis of
1189	         live streaming workloads on the Internet", SIGCOMM
1190	         IMC Proceedings of the 4th ACM SIGCOMM IMC, Oct. 2004.

1192	   [13]  Lei, J., Fu, X., and D. Hogrefe, "DMMP: A New Dynamic Mesh-
1193	         based Overlay Multicast Protocol  Framework", accepted in the
1194	         2007 IEEE Consumer Communications and  Networking
1195	         Conference, Workshop on Peer-to-Peer Multicasting (P2PM 2007),
1196	         Jan. 2007.

1198	   [14]  Ballardie, A. and J. Crowcroft, "Core based trees (CBT)", ACM
1199	         SIGCOMM Computer Communication Review, October 1993.

1201	   [15]  Hu, N. and P. Steenkiste, "Evaluation and Characterization of
1202	         Available Bandwidth Probing  Techniques", IEEE JSAC Special
1203	         Issue in Internet and WWW  Measurement, Vol. 21, No. 6,
1204	         August 2003.

1206	   [16]  Tan, G., Jarvis, S., and D. Spooner, "Improving Fault
1207	         Resilience of Overlay Multicast for Media Streaming",
1208	         DSN'06 IEEE International Conference on Dependable Systems and
1209	         Networks, June 2006.

1211	   [17]  Jain, M. and C. Dovrolis, "End-to-end available bandwidth:
1212	         measurement methodology, dynamics, and  relation with tcp
1213	         throughput", Transaction'03 IEEE/ACM Transaction on Networking
1214	         11  (4), August 2003.

1216	   [18]  Strauss, J., Katabi, D., and F. Kaashoek, "A measurement of
1217	         study of available bandwidth estimation tools",
1218	         IMC'03 Proceedings of ACM SIGCOMM Conference on  Internet
1219	         Measurement, Oct. 2003.

1221	   [19]  Deering, S., "Multicasting routing in internetworks and
1222	         extended LANs", ACM SIGCOMM 1988, August 1988.

1224	   [20]  Lo, V., Zhou, D., and et. al, "Scalable Supernode Selection in
1225	         Peer-to-Peer Overlay Networks", HOT-P2P'05 Proceedings of
1226	         International Workshop on  Hot Topics in Peer-to-Peer Systems
1227	         2005, July 2005.

1229	   [21]  Handler, G. and P. Mirchandani, "Location on Networks: Theory
1230	         and Algorithms", The MIT Press Cambridge Massachusetts,
1231	         Mar. 1979.

1233	   [22]  Lynch, A., "Distributed Algorithms", Morgan Kaufmann San
1234	         Francisco, April 1997.

1236	   [23]  Zhi, L. and P. Mohapatra, "HostCast: A New Overlay Multicasting
1237	         Protocol", IEEE ICC'03 Proceedings of IEEE ICC 2003, June 2003.

1239	   [24]  Banerjee, S., Lee, S., and et. al, "Resilient Multicast using
1240	         Overlays", ACM SIGMETRICS 2003, June 2003.

1242	   [25]  Paul, S. and K. Sabnani, "Reliable multicast transport protocol
1243	         (RTMP)", IEEE JSAC,  Vol. 15, No. 3, April 1997.

1245	Authors' Addresses

1247	   Jun Lei
1248	   University of Goettingen
1249	   Institute for Informatics
1250	   Lotzestr. 16-18
1251	   Goettingen  37083
1252	   Germany

1254	   Email: lei@cs.uni-goettingen.de

1256	   Xiaoming Fu
1257	   University of Goettingen
1258	   Institute for Informatics
1259	   Lotzestr. 16-18
1260	   Goettingen  37083
1261	   Germany

1263	   Email: fu@cs.uni-goettingen.de

1265	   Dieter Hogrefe
1266	   University of Goettingen
1267	   Institute for Informatics
1268	   Lotzestr. 16-18
1269	   Goettingen  37083
1270	   Germany

1272	   Email: hogrefe@cs.uni-goettingen.de

1274	Full Copyright Statement

1276	   Copyright (C) The IETF Trust (2008).

1278	   This document is subject to the rights, licenses and restrictions
1279	   contained in BCP 78, and except as set forth therein, the authors
1280	   retain all their rights.

1282	   This document and the information contained herein are provided on an
1283	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1284	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
1285	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
1286	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
1287	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1288	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1290	Intellectual Property

1292	   The IETF takes no position regarding the validity or scope of any
1293	   Intellectual Property Rights or other rights that might be claimed to
1294	   pertain to the implementation or use of the technology described in
1295	   this document or the extent to which any license under such rights
1296	   might or might not be available; nor does it represent that it has
1297	   made any independent effort to identify any such rights.  Information
1298	   on the procedures with respect to rights in RFC documents can be
1299	   found in BCP 78 and BCP 79.

1301	   Copies of IPR disclosures made to the IETF Secretariat and any
1302	   assurances of licenses to be made available, or the result of an
1303	   attempt made to obtain a general license or permission for the use of
1304	   such proprietary rights by implementers or users of this
1305	   specification can be obtained from the IETF on-line IPR repository at
1306	   http://www.ietf.org/ipr.

1308	   The IETF invites any interested party to bring to its attention any
1309	   copyrights, patents or patent applications, or other proprietary
1310	   rights that may cover technology that may be required to implement
1311	   this standard.  Please address the information to the IETF at
1312	   ietf-ipr@ietf.org.

1314	Acknowledgment

1316	   Funding for the RFC Editor function is provided by the IETF
1317	   Administrative Support Activity (IASA).