idnits 2.17.1 draft-lei-samrg-dmmp-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 1288. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1299. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1306. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1312. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) == There are 6 instances of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == Line 1212 has weird spacing: '...cs, and relat...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (February 22, 2008) is 5901 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: '21' is defined on line 1229, but no explicit reference was found in the text == Unused Reference: '22' is defined on line 1233, but no explicit reference was found in the text == Unused Reference: '23' is defined on line 1236, but no explicit reference was found in the text == Unused Reference: '24' is defined on line 1239, but no explicit reference was found in the text == Unused Reference: '25' is defined on line 1242, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 3569 (ref. '1') Summary: 3 errors (**), 0 flaws (~~), 9 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Lei 3 Internet-Draft X. Fu 4 Expires: August 25, 2008 D. Hogrefe 5 Univ. Goettingen 6 February 22, 2008 8 DMMP: Dynamic Mesh-based Overlay Multicast Protocol 9 draft-lei-samrg-dmmp-03.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on August 25, 2008. 36 Copyright Notice 38 Copyright (C) The IETF Trust (2008). 40 Abstract 42 This document describes a Dynamic Mesh-based overlay Multicast 43 Protocol (DMMP) framework to support multicast data delivery 44 applications without relying on classic IP multicast, including 45 multicast group management, overlay hierarchy establishment, 46 multicast tree construction and data forwarding scheme from a source 47 to a number of receivers. The DMMP framework builds on control plane 48 functions which dynamically manage an overlay core and a multicast 49 tree. The key idea is a number of end hosts self-organize into an 50 overlay mesh, and dynamically maintain such a mesh. Based on the 51 constructed mesh, some core-based clusters are formed with capacity- 52 aware trees inside. Then, a multicast tree consisting of DMMP-aware 53 end hosts (and/or specific routers) is built on the top of the 54 overlay core for the efficient delivery of the multicast data. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Features of DMMP . . . . . . . . . . . . . . . . . . . . . . . 4 60 3. Terminology and Abbreviations . . . . . . . . . . . . . . . . 6 61 4. DMMP Overview . . . . . . . . . . . . . . . . . . . . . . . . 8 62 4.1. Control plane in DMMP . . . . . . . . . . . . . . . . . . 8 63 4.2. Data plane in DMMP . . . . . . . . . . . . . . . . . . . . 11 64 5. DMMP Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 65 6. DMMP: Protocol Details . . . . . . . . . . . . . . . . . . . . 15 66 6.1. Initialization . . . . . . . . . . . . . . . . . . . . . . 16 67 6.2. Super Node Selection . . . . . . . . . . . . . . . . . . . 17 68 6.3. Member Join . . . . . . . . . . . . . . . . . . . . . . . 18 69 6.4. Refresh Information . . . . . . . . . . . . . . . . . . . 19 70 6.5. Member Leave . . . . . . . . . . . . . . . . . . . . . . . 19 71 6.6. Data Delivery Control . . . . . . . . . . . . . . . . . . 20 72 6.7. Failure Recovery . . . . . . . . . . . . . . . . . . . . . 20 73 6.8. Self-improvement . . . . . . . . . . . . . . . . . . . . . 22 74 7. Metrics specification . . . . . . . . . . . . . . . . . . . . 24 75 8. Security Considerations . . . . . . . . . . . . . . . . . . . 26 76 9. Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . 26 77 10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 26 78 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 26 79 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26 80 12.1. Normative References . . . . . . . . . . . . . . . . . . . 26 81 12.2. Informative References . . . . . . . . . . . . . . . . . . 27 82 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 29 83 Intellectual Property and Copyright Statements . . . . . . . . . . 30 85 1. Introduction 87 Over the recent years, a lot of research efforts have been focusing 88 on moving multicast support out of the network core, since the 89 deployment of network layer multicast has been obtructed by both 90 technical and opertional issues [3], [4]. To solve these issues of 91 IP multicast, various application level multicast approaches have 92 been proposed, which can be largely summarized into two categories, 93 namely, application layer multicast (ALM) and overlay multicast (OM). 94 As a matter of fact, network layer multicast requires changes in IP 95 routers, while ALM and OM approaches rely on network unicast and does 96 not need network layer infrastructure support from intermediate nodes 97 (e.g. router). 99 In ALM approach, end hosts form a virtual network, and multicast 100 delivery structures are constructed on top of such a virtual overlay. 101 A basic ALM approach is to form and maintain an overlay for data 102 transmission, where all end hosts in a multicast session are involved 103 without considering the heterogeneities of them, e.g. computation 104 power, bandwidth and access possibilities. For instance, all end 105 hosts join the full mesh construction of ESM (Narada) [5] and 106 multiple connections exist between any two nodes. The main advantage 107 of constructing such a mesh is easy implementation and being 108 relatively stable. However, ESM's sole dependence on the mesh 109 structure results in that it could only be applied well into a small 110 or medium-sized group [6]. NICE [7], in contrast, introduces a 111 hierarchical management scheme to create a scalable ALM overlay. 112 This hierarchical design simplifies the membership management of the 113 application layer multicast and makes it scale better than the full 114 mesh-based structure. Nevertheless, the joining procedure in NICE 115 causes a very high control overhead, which not only limits the 116 scalability of deployment, but also is likely vulnerable to single 117 node failures (e.g. possible failures caused by the node at the 118 highest layer of hierarchy). As described above, ALM approaches 119 address some practical/-deployment issues in network layer multicast 120 but there is a general concern about its efficiency and scalability. 122 Observing the weaknesses from ALM approaches, an alternative 123 approach, i.e., overlay multicast or OM, by using a kind of 124 "infrastructure-based" solution, has been proposed to improve 125 multicast efficiency and maximize the resource usage (e.g, 126 bandwidth). Proposals of such an approach include OMNI [8] and TOMA 127 [9]. The design issues of OM can be summarized in the following two 128 aspects: On one hand, OM approaches employ some fixed or long-term 129 infrastructure-based nodes to simplify membership management and 130 multicast tree construction. This advantage can become a weakness 131 since the assumption of these fixed nodes in the infrastructure 132 limits the extensibility and flexibility of deployment. For example, 133 the infrastructure must be re-established based on other long-term 134 measurements before constructing a new multicast trees to adapt to 135 the requirements imposed by a different metric. On the other hand, 136 TOMA and OMNI need dedicated infrastructure deployment and costly 137 servers, which could not be adaptive to dynamic network changes and 138 group member changes such as new members join. Therefore, it is 139 relatively difficult to implement them into the current Internet 140 environment although they are proposed to provide multicast support 141 for group communication applications. Obviously, to develop a 142 practical, efficient and resilient multicast framework is the 143 essential way towards wide deployment of multicasting services. 145 Additionally, the explosive growth of multimedia services and 146 applications over Internet necessitates streaming media to a large 147 population of users. However, with current media streaming 148 technology, it's hard to develop a comprehensive on-demand media 149 streaming system due to the following two challenges [10]. First, 150 the total number of concurrent clients the system can support is 151 limited by the resources of the streaming supplier. Second, current 152 media streaming proposals usually have limitations in reliability and 153 scalability. The reliability concern arises from the fact that only 154 one entity is responsible for all clients. The scalability issue 155 arises from the fact that adding internet-scale potential users 156 requires the commensurate amount of resources to the supplying 157 server. Meanwhile, aforementioned proposals could not explicitly 158 support real-time media streaming applications in a large scale. 160 Motivated by above studies, in this draft we present a new overlay 161 multicast framework which manages a dynamic mesh-based overlay core 162 and only involves participating end hosts without relying on the 163 availability of the OM-aware infrastructure nodes, while providing 164 certain degree of efficiency, reliability and resilience. 166 2. Features of DMMP 168 DMMP is organized into a two-level hierarchy and the mechanisms of 169 DMMP are introduced to dynamically manage and maintain the hierarchy. 170 The key idea behind DMMP is to let a few end hosts selected and self- 171 organize into an overlay mesh during the multicast initialization 172 phrase and also when group member changes, and dynamically maintain 173 such a mesh. Although routers may also be manually designated (e.g. 174 by ISPs) to construct the mesh, this document initially discusses the 175 approach via end hosts. Specifically, there are three design issues 176 to be addressed in DMMP: 178 o Host heterogeneity - Previous research has shown that a large 179 proportion of free-riders (i.e. group member who can only receive 180 data from incoming sessions) may exist in the network [11], [12]. 181 DMMP considers the heterogeneous capacities of group members by 182 evaluating their available bandwidth during the runtime. In the 183 DMMP framework,it is possible that only a small number of high- 184 capacity end hosts are selected to construct the overlay mesh when 185 there are a large proportion of free-riders [13]. This operation 186 may help maximizing the usage of available bandwidth for the 187 overlay tree. 188 o Scalability - Scalability is one of the main problems to be solved 189 in the multicast applications. In DMMP, each end-host may act as 190 a potential server for other clients and the number of possible 191 servers increases at the same rate as the end host clients. As 192 peer to peer (p2p) technologies have been deployed to support 193 various services over the Internet, it is possible that more end 194 hosts resources are available in the network. Once a node joins 195 the DMMP multicast session, additional resources are available to 196 the whole system. Therefore, the DMMP system is scalable as it 197 can potentially support a number of clients. 198 o Delay optimization - DMMP considers end-to-end delay for end 199 hosts. When forming the local cluster, non-super nodes select the 200 nearest super node in term of e2e delay measurement. This aspect 201 is essential while supporting delay sensitive applications (e.g. 202 media streaming). Furthermore, high-capacity nodes are given high 203 priority to stay at the higher level of the tree when constructing 204 the overlay multicast tree. In return, it allows to produce the 205 tree as short as possible and hence the overall delay could be 206 reduced. 207 o Resilience to dynamic changes - DMMP considers the transient 208 nature of end hosts and tries to prevent incapable or short-lived 209 nodes from staying close to the center of the multicast tree. 210 Consequently, the DMMP overlay structure is relatively stable and 211 resilient to dynamic network changes. Moreover, network situation 212 changes (such as multicast members joining/leaving) within a local 213 cluster will not have any impact on other clusters. The failure 214 of a single node may result in a transient instability in a small 215 subset of participants, but it will not cause a catastrophe in the 216 whole overlay. 218 In order to overcome the two challenges in Section 1, media streaming 219 task in DMMP is accomplished through the following two phases: (1) An 220 on-demand overlay core(or mesh) is established to achieve the 221 optimized performance; (2) Based on the structured mesh, several 222 clusters are formed to connect with selected mesh members, namely, 223 super nodes. Here, DMMP applies the concept of locality (e.g., 224 clusters) into the group management so that it can dramatically 225 reduce the control overhead and the complexity of the overlay 226 maintenance. Basically, a source-based DMMP architecture consists of 227 a sender, several receivers, one or many Rendezvous Points (RPs) and 228 Domain Name Systems (DNSs). Compared with existing application level 229 multicast approaches, DMMP is designed to be more stable, efficient 230 and applicable to support large-scale groups without relying on 231 predetermined intermediate nodes in the network and potentially get 232 better performance. Note that DMMP currently considers source- 233 specific multicast [1], any-source multicast is left for future 234 studies. 236 3. Terminology and Abbreviations 238 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 239 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 240 "OPTIONAL", in this document are to be interpreted as described in 241 BCP 14, RFC 2119 [2]. Other terminologies and abbreviations used in 242 this document are used as follows: 244 o Overlay Multicast - A multicast data delivery scheme depending on 245 end hosts to form an overlay core for message control and a 246 multicast tree for data delivery. 247 o Rendezvous Point (RP) - A designated node for a multicast group, 248 which assists managing group members and stores some required 249 information (e.g. performance required). 250 o Source - The multicast service sender. It could be a video stored 251 server or some video distributed servers in one service domain, 252 which delivers the data traffic to the source-based multicast 253 group members; DMMP in this document only provides the source- 254 specific mechanism to realize the single source-based overlay 255 multicast. 256 o Super nodes - Some end hosts are chosen to manage the multicast 257 group and relay data from the mesh to receivers within clusters. 258 Currently, only end hosts can serve as super nodes; future version 259 of this document may specify the case when some routers (e.g., 260 first-hop routers) are used as super nodes. 261 o Receivers - Multicast group members who want to receive the data 262 from the source. 263 o Mesh - An overlay core, which is responsible for group member 264 management and multicast tree configuration. 265 o Clusters - Based on each super node, end hosts organize themselves 266 into a core-based multicast tree. [14]. 267 o Stress - A tree metric that counts the number of identical packets 268 sent by the protocol over a single link or a single node. 269 o Out-degree - Available connections, namely, the available number 270 of connections that a node can establish. 272 o Uptime - The time duration from a node joining in a multicast 273 session to its leaving the multicast session. 275 Within each cluster, there are some terminologies and abbreviations 276 which are used as follows: 278 o Parent - the direct upstream node of a node is called the parent 279 of that node, e.g. end host 1.2 is the parent of end host 1.2.1 in 280 Figure 1. 281 o Parent level nodes (PLN)- nodes (exclusive parent) at the same 282 level as the parent of some node, e.g. 1.1 and 1.3 are parent 283 level nodes of 1.2.1 in Figure 1. 284 o Child - the direct downstream node of some node, e.g. in Figure 1, 285 1.2.1 is the child of 1.2. 286 o Children level nodes (CLN)- nodes (exclusive children) at the same 287 level as children of some node, e.g. 1.1.1, 1.1.2, 1.3.1 are 288 children level nodes of 1.2 in Figure 1. 289 o Siblings - nodes at the same level of a node are called siblings, 290 e.g. in Figure 1, 1.1.1, 1.1.2, 1.2.2, 1.2.3 and 1.3.1 are 291 siblings of 1.2.1. 293 /-------------------------------\ 294 / 1.1.1 1.1.1.1 | 295 / 1.1 / End host- End host | 296 / End host 1.1.2 | 297 / / \ End host | 298 / / 1.2.1 | 299 / / / End host 1.2.2.1 | 300 / / / End host | 301 / / 1.2 / 1.2.2 / | 302 Super node--- End host -- End host | 303 \ \ \ \ End host | 304 \ \ \ 1.2.3 1.2.2.2 | 305 \ \ \ End host | 306 \ \ | 307 \ \ 1.3 1.3.1 | 308 \ End host - End host | 309 \ | 310 \--------------------------------/ 311 Cluster 313 Figure 1: An example of local Cluster 315 4. DMMP Overview 317 The DMMP framework consists of two types of functionalities: control 318 plane and data plane. The control plane composes one overlay mesh 319 and some core-based clusters. Data plane is, then, built on the top 320 of the structured control plane. Meanwhile, the source entity and a 321 set of super nodes form the overlay mesh, in which each super node 322 supervises one cluster. Although the Tree-based overlays are 323 regarded as the most efficient approach for data distribution in a 324 stable network, they are not effective for dynamic scenarios since 325 pure tree structure has difficulties to meet both high bandwidth and 326 high reliability requirements. The first difficulty is the delivered 327 service quality to downstream end hosts is limited by the minimum 328 bandwidth among the upstream connections from the source; for 329 instance, a member in the OMNI tree receives data only from its 330 upstream node and the data reception rate of this member can not be 331 greater than its upstream node. It is even more difficult in the 332 core network where each MSN should be responsible for data delivery 333 to a whole cluster. For the second requirement, a tree is less 334 robust than a mesh because a single node failure or a loop can 335 partition the tree and disable communications among the members. To 336 increase the throughput of the whole overlay network, an DMMP-aware 337 mesh as one of multiple forwarding solutions is introduced in this 338 draft. Such a mesh will allow the reception of packets from multiple 339 nodes other than from only the upstream node. 341 4.1. Control plane in DMMP 343 In this source-specific overlay multicast protocol, the combination 344 of available bandwidth [15] and uptime represents the capacity of 345 each node [13]. For media streaming system, available bandwidth 346 resources possessed by a multicast group may be insufficient during 347 runtime [16]. It is reason why DMMP needs to consider the degree 348 bound in streaming applications, which can be easily observed from 349 the available bandwidth. For example, on the assumption that the bit 350 rate of media is B and the outbound bandwidth of an end host i is 351 b(i), the total number of connections it can establish is b(i)/B 352 which is also the maximum degree of the end host. Moreover, the 353 usage of available bandwidth in overlay routing has become possible, 354 based on recent advances in available bandwidth measurement 355 techniques and tools [17], [18]. Obviously, if an application has 356 additional requirements on end-to-end delay or loss rate, these 357 metrics can be jointly considered during the overlay hierarchy 358 construction. 360 o Step 1: After initialization, RP will calculate the out-degree of 361 each host and distribute them into two categories: leaf nodes 362 (whose out-degree is less than 2) and non-leaf nodes. If one's 363 out-degree is less than 2, the end host can act as leaf node 364 because it can only receive data from the incoming connection. 365 o Step 2: The information of two-category nodes are respectively 366 stored at the RP. Meanwhile, all non-leaf nodes are placed in the 367 order of out-degree and reported to the source. On receiving the 368 list of ordered non-leaf nodes, the source selects an application- 369 specific number of them as super nodes. Those nodes have higher 370 capacities as defined in Section 6.2, and are used to manage the 371 multicast group. In the initialization stage, nodes with higher 372 bandwidth support will be selected as super nodes since the 373 current uptime is zero. The capacities of each super node are 374 also stored at the source and the RP. 375 o Step 3: After being selected, the super nodes organize themselves 376 into a mesh rooted at the source. The issue of overlay mesh 377 contruction is mostly motivated from [5]. 379 DMMP considers the heterogeneous capacities of group members by 380 evaluating their available bandwidth during runtime so that high- 381 capacity nodes (i.e., super nodes) which are able to and willing to 382 make more contributions to the network are expected to get better 383 performance. This will maximize the usage of available bandwidth for 384 the overlay tree. To further improve the overlay mesh performance, 385 DMMP allows dynamically adding and deleting links within the mesh, 386 which will be clarified in a future version of this draft. Based on 387 selected super nodes, some core-based clusters will be formed to 388 connect with the mesh. 390 o Step 1: After constructing the overlay mesh, the next step is to 391 form core-based clusters. Each non-super node will firstly 392 consult its local cache for super node candidates. If there is no 393 suitable candidate, it queries the RP immediately. Then, the 394 requestor caches these newly received candidates, from which it 395 selects the best one based on e2e latency measurements. If there 396 are multiple super nodes which can provide similar e2e latency for 397 the node, one of them with higher out-degree will be chosen. 398 o Step 2: Those non-super nodes sharing the same super node will 399 form a local cluster. The cluster formation is initiated by the 400 super node which answers for informing the RP and contacting the 401 source. Generally, certain number (due to the super node's 402 available bandwidth) of end hosts with higher capacity will be 403 selected as its immedidate children. This operation guarantees 404 that the multicast tree within each cluster meets the bandwidth 405 need of media streaming applications. 406 o Step 3: Afterwards, direct children of super nodes choose some 407 nodes with higher capacities (i.e. out-degree, e2e latency) as 408 their children. This selection method will expedite the 409 convergence of the tree and alleviate the average latency in some 410 senses. 412 The iteration will continue until all cluster members join the tree, 413 and accordingly the control hierarchy is constructed for the 414 multicast group . For the sake of resilience, each node in the local 415 cluster should keep some information of its relatives in the local 416 cache. In this draft, these entities (parent, PLN, child, CLN and 417 siblings, see in Section 2) are denoted as relatives. 419 Figure 2 illustrates a basic example for the overlay construction. 420 Assuming that it is the first time to construct the overlay 421 hierarchy, then: 423 o Step 1: When obtaining the list of non-leaf members from the RP, 424 the source selects six of them as super nodes. After selection, 425 super nodes self-organize into a mesh rooted at the source. 426 o Step 2: Each non-super node chooses the best super node based on 427 the e2e latency measurements. Those non-super nodes sharing the 428 same super node will then form a local cluster. 429 o Step 3: According to the super node's capacity, no more than K end 430 hosts with larger capacities are selected as its immediate 431 children. Here, three end hosts, namely, 1.1, 1.2 and 1.3 are 432 chosen as the immediate children of the super node. 433 o Step 4: Once the capacity of the super node is exhausted, it 434 responds to new requestors with its immediate children and an 435 indication of rejection. For example, upon the receipt of 436 rejections and a list of candidate parents (i.e. 1.1, 1.2 and 1.3) 437 from the super node, end hosts 1.1.1, 1.1.2 and 1.3.1 send Join 438 request to them. In this case, requestors with higher out-degree 439 are likely to be selected as the children. If there are multiple 440 acceptances, the end host will select the one which is "near" in 441 terms of the e2e latency. For example, 1.1.1 and 1.1.2 accept the 442 request and join the cluster as children of 1.1. Then 1.1 will 443 update the associated information to 1.2, 1.3, and as well as to 444 the super node after its children selection. 445 o Step 5: If there are still some nodes left, they will receive a 446 rejection and need to re-send Join request to the children at the 447 lower level, i.e. 1.1.1, 1.1.2 and 1.3.1. When all receivers 448 confirm their positions in the cluster, the control plane is 449 initially constructed. 451 Communication 452 Source <========> Rendezvous Point 453 | Channel 454 | 455 | ---------------\ 456 v / 1.1.1 | 457 /---------Super Node---------\ / 1.1 End host| 458 | / \ \ | / End host/ | 459 | / \ \ | / / \ 1.1.2 | 460 v / \ \ v / / 1.2 End host| 461 Cluster -> Super Node \ Super Node -- End host | 462 | \ \ / | \ \ 1.3 1.3.1 | 463 | \ \ / | \ End host-End host| 464 | \ \/ | \ | 465 v \ /\ v \----------------/ 466 Cluster -> Super Node \ /Super Node <- Cluster 467 | \ \ / / | 468 | \ \ / / | 469 | \ \ / / | 470 \-----------Super Node ------/ 471 | 472 | 473 Cluster 475 Figure 2: Control Hierarchy in DMMP 477 After constructing the control hierarchy, only super nodes need to 478 keep the full knowledge among themselves, while non-super nodes only 479 need to keep the knowledge of a small part of the group within each 480 cluster. This will help dramatically reducing the control overhead 481 of the whole multicast tree compared with that each member keeps the 482 full knowledge of the entire group whilst providing certain 483 resilience. Therefore, the overlay hierarchy of DMMP incurs a low 484 delay penalty, makes effective use of network resources, and 485 considers the heterogeneities of end hosts during multicast sessions. 487 4.2. Data plane in DMMP 489 Corresponding to the constructed control plane, the data plane can be 490 divided into two parts. The first part, which is within the overlay 491 mesh, is built through the reverse shortest path between each super 492 node and the source, identical to DVMRP [19]. For example, a super 493 node A receives the packet from the source through its neighbor B 494 only if B is the next hop on the shortest path from A to the source. 495 In this case, super node A will replicate and forward the packets 496 directly to super node B. Likewise, B will replicate and forward the 497 packet to the next super node as well. When receiving the data, the 498 super node will replicate and forward the data to its direct children 499 in the cluster. Therefore, data can be delivered throughout the 500 whole cluster. 502 The second part is data delivery within core-based clusters. In each 503 cluster, data are firstly forwarded from the super node to its direct 504 children. Then, the receivers will replicate the data and forward 505 them to its children at the lower level. The whole iteration 506 terminates with success if all receivers within each cluster receive 507 the data. 509 Figure 3 depicts an example for data delivery within a DMMP-aware 510 cluster. Here, a respective data channel between two entities is 511 established by exploiting the existing protocol stacks such as UDP/IP 512 or TCP/IP. The data channels utilize IP unicast according to the 513 underlying IP transport scheme. As shown in the figure, data are 514 firstly replicated into three copies, respectively delivered from the 515 super node to its direct children 1.1, 1.2 and 1.3 using unicast. 516 Similarly, 1.1 replicates copies of the data according to the number 517 of their children (e.g. two copies), sending to 1.1.1 and 1.1.2. In 518 the next iteration, the receiver will similarly make copies and 519 deliver to its children. 521 /-----------------------------\ 522 / 1.1.1 | 523 / 1.1 / End host | 524 / End host 1.1.2 | 525 / / \ End host | 526 / / 1.2.1 | 527 / / / End host | 528 / / / | 529 -----> / 1.2 / 1.2.2 1.2.2.1 | 530 Data -----> End host -- End host - End host| 531 -----> \ \ | 532 \ \ \ 1.2.3 | 533 \ \ \ End host | 534 \ \ | 535 \ \ 1.3 1.3.1 | 536 \ End host - End host | 537 \ | 538 \ | 539 \-----------------------------/ 540 Cluster 542 Figure 3: Data delivery in DMMP 544 5. DMMP Messages 546 DMMP handles tasks related to overlay hierarchy management, multicast 547 tree configuration and maintenance. It uses a common format to carry 548 both data and control packets as shown in Figure 4. 550 0 7 15 31 551 +---------------+-----------------------+----------------------+ 552 | version | Tree version | Option | 553 +---------------+-----------------------+----------------------+ 554 | DMMP session ID | 555 +--------------------------------------------------------------+ 556 | Source ID | 557 +--------------------------------------------------------------+ 558 | Sequence Number | 559 +--------------------------------------------------------------+ 560 | Reserved | 561 +--------------------------------------------------------------+ 562 | Payload Data length | 563 +--------------------------------------------------------------+ 565 Figure 4: DMMP Packet Header Format 567 Session ID and Source ID are generated by the RP and guaranteed to be 568 collision free. The tree version field is used to prevent loops and 569 partitions from the multicast tree. Since the multicast session tree 570 is initialized and controlled by the RP, a loop free topology may be 571 generated. Moreover, since tree update messages are independently 572 disseminated to all group members, there is possibility that some 573 messages might be lost and received out-of-order by different group 574 members. These members may act on Refresh message with updating 575 capacities. All these events could cause loops and tree partition. 576 In order to avoid these failures, the RP will assign a monotonically 577 increasing version number to each newly generated multicast tree. 579 The Option fields in the header defines various types of operation 580 messages. Currently, there are seven pairs of control messages as 581 shown in Table 1. Each pair of control messages will be exchanged 582 between the DMMP-aware entities in a request-and-response way. 584 o Subscription Request and Response - Group members get the address 585 of RP from Domain Name Systems (DNSs). 586 o Ping_RP Request and Response - During bootstrapping, each member 587 of the group gets a list of available super nodes from the RP, 588 containing at least one active node. 589 o Join Request and Response - A newly joining member sends request 590 in order to join the multicast session and gets corresponding 591 information from active group members. 593 o Status Request and Report - To request the status reports from 594 neighbors or relatives, and accordingly to send reports to them. 595 o Probe Request and Response - To probe whether the target node is 596 still active or not. 597 o Inactive Report and Response - To inform the other group members 598 that the target node is inactive. 599 o Refresh Request and Response - To maintain the overlay hierarchy, 600 they are used to periodically update the capacities (such as 601 uptime, out-degree) of group members. 603 To adapt to dynamic network changes, each end host maintains the 604 overlay core by periodically updating its capacities. For example, 605 it periodically exchanges Refresh message with its neighbors. If 606 node A cannot receive this message from node B within the Refresh 607 timer, node A will send a Probe request to node B. If there is still 608 no Response returned, node B will be confirmed to be inactive by a 609 certain time. Then the Status report, indicating node B being 610 inactive, will be used to inform the rest of group members. 612 Table 1 lists the DMMP messages according to the associated DMMP 613 operational phases. 615 Table 1: DMMP Messages 616 +-----------------+------------+------------+------------+ 617 | Messages | Operation | From | To | 618 +-----------------+------------+------------+------------+ 619 | Subscription Rq | Initializ- |Group Member| DNS server | 620 +-----------------+ ation +------------+------------+ 621 | Subscription Res| | DNS server |Group Member| 622 +-----------------+------------+------------+------------+ 623 | Ping_RP Request | Bootstrap |Group Member| RP | 624 +-----------------+ +------------+------------+ 625 | Ping_RP Response| | RP |Group Member| 626 +-----------------+------------+------------+------------+ 627 | Join Request | Member | New Host | Group Mem. | 628 +-----------------+ +------------+------------+ 629 | Join Response | Join | Group Mem. | New Host | 630 +-----------------+------------+------------+------------+ 631 | Status Request | Cluster |Group Member|Group Member| 632 +-----------------+ Member +------------+------------+ 633 | Status Report | Monitoring |Group Member|Group Member| 634 +-----------------+------------+------------+------------+ 635 | Probe Request | Probe |Group Member|Group Member| 636 +-----------------+ +------------+------------+ 637 | Probe Response | Members |Group Member|Group Member| 638 +-----------------+------------+------------+------------+ 639 | Inactive Rep. | Member |Leaving Node|Group Member| 640 +-----------------+ +------------+------------+ 641 | Inactive Report | Leave |Group Member|Leaving Node| 642 +-----------------+------------+------------+------------+ 643 |Refresh Request | Update |Group Member|Group Member| 644 +-----------------+ +------------+------------+ 645 |Refresh Response |Information |Group Member|Group Member| 646 +-----------------+------------+------------+------------+ 647 Legend: SN : Super Node 648 Rq/Req.: Request 650 Although TCP provides a generic protocol for a guaranteed, in-order 651 delivery of stream-based messages, this reliability comes at a price 652 in performance. Besides, the communication pattern in DMMP is 653 strictly in a request-response mode, and most messages have a small 654 fixed maximum size. Thus, it is preferred to encapsulate all DMMP 655 messages over UDP to provide the required delivery guarantees without 656 extra network burdens. 658 6. DMMP: Protocol Details 660 All DMMP-aware nodes and the source are assumed to be able to know 661 the address of the RP. Also, once a source node starts, a direct 662 communication channel will be established between the source node and 663 the RP, so that the necessary information like the capacities of 664 group members and active super nodes could be exchanged between them 665 during the lifetime of an overlay multicast session. Furthermore, a 666 DNS namespace is required to maintain the RP information for a 667 specified multicast group. 669 Before initialization, each group member and the source send out 670 Subscription request containing the specified group name and domain 671 name, to the DNS for the address of associated Rendezvous Point (RP). 672 Since RP does not participate in the data forwarding, the location of 673 RP has no significant impact on the performance of data distribution. 674 If there is no existing RP which is serving for this multicast group, 675 DNS will allocate a new one for this multicast group based on 676 application requirements. Otherwise, the RP's address will be sent 677 back. It is possible that multiple RPs serve for the same multicast 678 group, e.g. for the purpose of load balancing and fault tolerance. 679 For simplicity, this document initially considers the case where 680 there is only one RP involved. 682 6.1. Initialization 684 During the initialization phase, certain application related software 685 has been distributed to the prospective DMMP-aware entities within 686 the DMMP control and data transport models. 688 Before the multicast session starts, the source and RP must be ready 689 to give response to requests from DMMP-aware end hosts. The source 690 and RP will take no further reactions to any DMMP requests once the 691 session stops. The active session time should be the period from the 692 service starts until it stops. Then the out-of-band channel between 693 the RP and source should be active during this active session so that 694 the source can monitor the current status of memberships. However, 695 the detailed mechanism for implementing this out-of-band 696 bootstrapping is out of the scope of this document. 698 Moreover, session-related information should be obtained before the 699 session starts and all prospective group members use out-of-band 700 bootstrapping mechanism to get necessary information, for instance, 701 Group ID and location of RP including the port number serving for 702 certain sessions before the application begins. Then DMMP-aware 703 entities can start receiving data after they join the overlay 704 hierarchy. 706 6.2. Super Node Selection 708 During the mesh construction, the selection of the super nodes is to 709 ensure that a newly joining member is able to quickly find its 710 appropriate position in the multicast tree using a very small number 711 of queries such as Join request. As the super node selection is the 712 first step towards mesh establishment, this section gives more 713 details. 715 To select super nodes for better performance while maintaining 716 scalability, the following distribution requirements need to be taken 717 into account [20]. 718 o Connections: Super nodes have relatively higher capacties and are 719 expected to be strong to perform additional tasks such as 720 resources control, load balance and fault tolerance. 721 o Number: To be more efficient, the number of active super nodes is 722 no more than one hundred, otherwise it may cause high control 723 overhead and high stress [5]. Assuming that each super node can 724 manage, in average, hundreds of cluster members, it is sufficient 725 to support totally more than thousands of end hosts. Otherwise, 726 multiple sources will be deployed into media streaming by multiple 727 overlay multicast sessions. To balance the tradeoff between the 728 efficiency and the reliability, it is reasonable to select an 729 application-specific number of super nodes to construct the 730 overlay mesh. For example, in the case of 110 end hosts, it may 731 need 10 super nodes if the required ratio is 10%. In this case, 732 10 end hosts with higher capacity will be chosen as super nodes. 733 o Downstream: To adapt to bandwidth requirements, super nodes should 734 not serve more than K non-super nodes as its immediate children, 735 where K is respectively determined by the available out-degree of 736 each super node and service specifications. 738 In order to deal with factors from a large-scale and dynamic network 739 environment, the following three conditions are outlined in addition 740 to above requirements. 742 o Stable: One cause for current multimedia streaming serivces which 743 cannot guarantee required QoS mainly from unstable network status. 744 Thus, super nodes should be relatively stable because, otherwise, 745 its cluster members are easily partitioned from the tree. 746 o Resilience: Super nodes are responsible for detecting dynamic 747 changes and for handling them quickly, e.g. one super node leaves 748 the group ungracefully, which should be detected by at least one 749 of other active nodes. Then, a new super node should be quickly 750 selected to replace the leaving super node in the position. The 751 time for detection and recovery process is also constrained by 752 certain service requirements. 754 o Security: Super nodes should be fundamentally invulnerable to 755 attacks; otherwise, they will easily disrupt the multicast service 756 by forwarding wrong messages or failing to accept information. 758 6.3. Member Join 760 DMMP is resilient to dynamic network changes, for examples, events 761 like a member joining/leaving. 763 The newcomer first checks its local cache for super node candidates. 764 If there are no suitable candidates in the cache, it requests the RP 765 for the addresses of the source and super node candidates. After 766 receiving the source address and a list of active super nodes, it 767 caches their capacities, i.e. uptime, out-degree. Then, it will 768 measure the e2e latency between them and itself, and sends the Join 769 Request message to some super node which can provide smaller e2e 770 latency. 772 Upon receiving Join Request from a newcomer, the super node will 773 check its current out-degree. If it is possible to accept the 774 newcomer to as its immediate child, the super node will respond with 775 an indication of acceptance. In this case, the information about 776 newly joining nodes will only be propagated to the existing children 777 of this super node since no child of the new member exists. When the 778 super node cannot accept it as its immediate child, it will redirect 779 this Join Request message to its active children with the largest 780 available out-degree. If one of them responses to the super node, 781 this response will be relayed to the new member. If there are more 782 potential parents, the new member selects the one with smallest tree 783 depth as its parent. If there are multiple potential parents at the 784 same depth, it chooses the best one in terms of their uptime. Once 785 finding the appropriate parent, the new member starts data delivery. 786 At this time, the information about the new member will be propagated 787 from the parent to its PLN, siblings and the super node. The process 788 will be terminated until the new member finds its position and 789 accordingly updates its related information at corresponding nodes. 791 After joining the tree, the newcomers who have higher capacities 792 could "climb" from the bottom to a higher level after some switch 793 stages. For example, a newcomer at the lower level could switch with 794 its parent if its capacity exceeds (over a predefined threshold) the 795 current parent. The appropriate threshold will be defined to avoid 796 unnessary switching since if the child has a smaller bandwidth 797 support, it will be ultimately placed below the high capacity parent. 799 In case the newcomer fails to find an appropriate position in any 800 cluster to meet application requirements, it can sell itself as a 801 potential super node and report its own capacities to the RP. 803 Regarding its capacity and the current number of super nodes, it 804 could be entitled as a super node. In this way, end hosts have more 805 flexibility to get optimal services, which will be specified in a 806 future version of this document. 808 6.4. Refresh Information 810 In DMMP, each member is responsible for maintaining the overlay 811 hierarchy, by periodically sending Refresh message. The Refresh 812 mechanism in the overlay mesh has a little difference from that in 813 the local clusters. To efficiently manage the overlay hierarchy, 814 both active and passive models are utilized in DMMP. Within each 815 cluster, end host starts to exchange Refresh message with its PLNs, 816 siblings and CLNs once it joins the cluster. In addition that each 817 member has to periodically update its information, members in the 818 local cluster are able to request refresh message from their 819 relatives, e.g., PNLs or CNLs. This operation guarantees the 820 reliability and the stability of the overlay hierarchy. 822 For the Refresh message in the mesh, each super node sends its 823 current information to all mesh members including the source. Once 824 receiving updated information, the source will correspondingly update 825 the information at the RP. If one mesh member stops receiving 826 Refresh message from another beyond the Mesh_Refresh_Timer, it 827 assumes this neighbor to be either inactive or leaving. In order to 828 confirm the status, it may initiate a Probe message as stated in 829 Section 5. 831 6.5. Member Leave 833 In most cases, two situations for a member leaving the group, either 834 gracefully or ungracefully, are distinguished from each other. In 835 each cluster, the leaving member should at least send an Inactive 836 Request to its parent or one of its children. After receiving the 837 confirmation, it can leave the group gracefully. Then the notified 838 node will propagate this Inactive message to its relatives so that 839 they can update their service membership tables. In the second case, 840 the Inactive status will be detected by periodically exchanging 841 Refresh messages. If any member within the cluster, say p, fails to 842 receive a Refresh Report message from one of its required relatives, 843 say q, within the refresh timeout Refresh_Timer, then p sends a 844 redundant Probe Request message to q. If there is still no Probe 845 Response message returned, p assumes q to be inactive and propagates 846 this Status_Inactive message throughout the whole cluster. 847 Afterwards, one of it children with relatively high capacity will 848 replace its place, and other children will accordingly change their 849 positions. Nevertheless, ungraceful leaving may cause the crash of 850 whole multicast tree. DMMP is able to handle different situations by 851 detecting the failures and recovering quickly from them as shown in 852 Section 6.7. 854 Compared with the handling in the local cluster, the operation is 855 even tougher in the core mesh since all its cluster members are 856 partitioned from the tree. When there is no end hosts connecting to 857 the leaving super node, no further changes to the overlay are 858 required. Before a super node gracefully leaves the group, it must 859 recommend a replacement leader for the cluster it owns and inform 860 other super nodes in the overlay mesh before leaving. To detect 861 unannounced leaves, DMMP relies on the periodic Refresh message 862 exchanges. If the failed peer happens to be a super node, the 863 overlay hierarchy has to be repaired, which will be depicted in 864 Section 6.7. 866 6.6. Data Delivery Control 868 After the multicast tree configuration, the new member will ask its 869 immediate parent to send the data. Generally, parent nodes will 870 delete data in their local cache after they have forwarded them to 871 their children. If the parent still holds the data, the new member 872 can quickly get the data from it. If the parent has not received 873 data yet, either it waits until the parent forwards the data after 874 receiving, or it directly inquires the super node to deliver the 875 data. The former option is preferred in DMMP as its overhead is 876 likely lower than the latter one. Upon receiving the data, the new 877 member will firstly forward them to its parent if its parent hasn't 878 received the data yet. If the parent has deleted the data, it will 879 then ask its siblings to forward data directly. In order to 880 alleviate the redundant data transmission, the new member needs to 881 wait for a certain while before it asks for the data from its 882 siblings. 884 Different from joining as a cluster member, the new member may join 885 the multicast session as a super node. In this case, it will firstly 886 ask its neighbors in the overlay mesh to send the data. In addition, 887 it may query the data from its children when they are already in its 888 local cluster. If any of them receives the data, this new super node 889 will also get the data. A third possibility is to let the new member 890 directly ask the source to send the data. To alleviate the control 891 overhead, it is recommended that this new node waits for a certain 892 while until one of its mesh neighbors receives the data. 894 6.7. Failure Recovery 896 Maintenance of such a multicast tree in DMMP faces a key problem that 897 non-leaf nodes in the tree are end hosts who are more likely to fail 898 than routers and may join/leave the tree at will. It does not happen 899 in IP multicast since non-leaf nodes in the delivery tree are routers 900 which do not leave the multicast tree without notification. Thus, 901 one challenge in DMMP is to reconstruct the overlay multicast tree 902 after a node's departure. If one non-leaf node leaves the group 903 ungracefully, its downstream nodes will be inevitably affected. Two 904 possible means can alleviate such impacts: one is to reduce the 905 possibility of failures; the other is to reduce the number of 906 possible affected nodes. In practice, however, the first way might 907 be very difficult since end hosts may leave the group at will. For 908 the second option, DMMP proposes a proactive mechanism by 909 periodically pushing high-capacity nodes to higher levels of the tree 910 by capacity comparison mechanism. Detailed mechanisms are described 911 in Section 6.8. Thereby, it is very likely that long-lived super 912 nodes and their immediate children form a stable and efficient 913 cluster core after a certain time. The longer a node remains in the 914 multicast session, the more it becomes attaching to other long-lived 915 nodes with similar uptime. In other words, higher capacity nodes 916 form a well-connected core with relatively more bandwidth support and 917 being more stable, whereas peers with less available bandwidth and 918 shorter uptime will be placed out of the core as possible. 920 In order to improve the performance of DMMP, especially when there 921 are high packet losses or host failures, a reactive recovery 922 technique is also implemented after failure detection. Recovery from 923 failures regarding a member crash is similar to handling a member 924 leaves. The difference is that surviving members usually do not 925 receive prior notification of a crash. Thus, Refresh message is 926 periodically exchanged between each member and its neighbors. It is 927 even more difficult for super nodes to maintain the cluster since all 928 of its local members are partitioned from the multicast tree and can 929 not receive the multicast data until it is repaired. 931 In the local cluster, each immediate child of the super node must 932 find a backup parent in advance, either the source or a group member. 933 Once the super node leaves the group, its children try to contact 934 with their alternative parents to re-join the multicast tree. This 935 approach can facilitate the recovery process and strengthen the 936 reliability of the overlay hierarchy. In addition, cluster members 937 periodically estimate their relatives (i.e. PNLs, CNLs) within the 938 cluster and evaluates the number of losses that it shares with these 939 nodes. While in the core mesh, each super node maintains state 940 information about all other mesh members, no additional discovery of 941 nodes is necessary. Using this mechanism, packet delivery ratios can 942 be increased with a high probability. To handle different scenarios 943 of failures, more mechanisms need to be defined in the near future. 945 6.8. Self-improvement 947 We suggest the self-improvement mechanisms and hence a cluster member 948 having a higher capacity than its parent could be promoted. 949 Basically, the children and their parents can swap their positions. 950 After promotion, the former child becomes the new parent and its 951 former parent becomes the current child. However, there are some 952 aspects impact the mechanism: 953 o the number of nodes involved in the promotion: the more nodes 954 change their positions, the less stable the overlay tree becomes. 955 o the reliability of participated nodes: the promoted node may leave 956 ungracefully, and its children will be partitioned from the 957 existing tree. 958 o complixity of re-construction: after promotion, the re- 959 construction of the existing tree may be complicated since the 960 promoted child may have not enough bandwidth to accept all 961 exisiting end hosts as its children. 963 For simplicity, we describe the idea by taking the following example. 964 In Figure 5, suppose that node 1.1.2 has much higher capacity than 965 its parent 1.1 based on the capacity comparison. Then, node 1.1.2 966 sends a promotion request to node 1.1. After a certain proof, node 967 1.1 acknowledges the request and sends back a status report which 968 contains the address of node s. Here, it is necessary that node 1.1 969 waits till node s has received the breakup request. Otherwise, the 970 join request from 1.1.2 may arrive earlier, which will cause a loop 971 in the overlay tree. 973 Then, node 1.1 breaks the connections with node s and 1.1.2. 974 However, node 1.1 keeps node s as its backup parent in case node 975 1.1.2 is leaving or unreachable. Moreover, node s considers node 1.1 976 as it temporary child. At the same time, node 1.1.2 contacts node s 977 and notifies node 1.1 to be its child. Once node 1.1 receives the 978 notification and rejoins the tree as the child of node 1.1.2, it may 979 break the connection with node 1.1.1 if node 1.1.2 still has 980 available capacity. In the following example, node 1.1.2 can support 981 at least three children. Therefore, after the first swap, the node 982 1.1.1 requests to join as one child of node 1.1.2. The message flow 983 of the promotion example is shown in Figure 6. 985 Above mechanism can be tolerant to dynamic changes. Considering the 986 scenario that node 1.1 and 1.1.1 can still quickly rejoin the tree 987 even if node 1.1.1 leaves ungracefully. The optimization and 988 extension of the mechanism will be studied in the near future. 990 End host 991 Former: / 1.1.1 992 / 993 / 994 / 995 End host--End host--End host--End host 996 s 1.1 1.1.2 1.1.2.1 998 || 999 || 1000 || 1001 \/ 1003 End host 1004 Later: / 1.1 1005 / 1006 / 1007 End host--End host--End host 1008 s 1.1.2 \ 1.1.2.1 1009 \ 1010 \ 1011 End host 1012 1.1.1 1014 Figure 5: An example of self-improvement 1016 End host End host End host End host 1017 1.1.2 1.1 s 1.1.1 1018 |promotion req. | | | 1019 |=============> | break req. | | 1020 | |==============> | | 1021 | | break resp. | | 1022 |promotion ACK |<============== | | 1023 |<============= | | | 1024 | Join Request | | 1025 |==============================> | | 1026 | Join Response | | 1027 |<============================== | | 1028 | Join Notif. | | | 1029 |=============> | Break Confirm | | 1030 | |<==============>| | 1031 | | Join Notif. | 1032 | | ==============================>| 1033 | | Join Request | | 1034 |<===============================================| 1035 | | Join Response | | 1036 |==============================================> | 1037 | | Promotion Confirm | 1038 | |<============================== | 1040 Figure 6: Message flow of self-improvement example 1042 Moreover, nodes at the bottom level are either transient nodes, leaf 1043 nodes or new comers. The newcomers who have higher capacities could 1044 "climb" from the bottom to a higher level after some switching 1045 stages. For example, a newcomer at the lower level could switch with 1046 its parent if its capacity exceeds (over a predefined threshold) the 1047 current parent. Nevertheless, an appropriate threshold will be 1048 defined to avoid unnecessary switching since if the child has a 1049 smaller bandwidth support, it will be ultimately placed below the 1050 parent. The main goal of doing this is to reducing the impacts of 1051 frequent changes in the overlay so that only a small part of the 1052 overlay multicast tree will be affected and needs to be re- 1053 constructed after dynamic changes. 1055 7. Metrics specification 1057 End host based overlay multicast is more susceptive to dynamic 1058 network changes since end hosts may join or leave the group at will. 1059 It would be even harder for DMMP to manage and maintain such an 1060 overlay mesh because super nodes may leave the group ungracefully as 1061 well. To address the instability of mesh, uptime is chosen as an 1062 assisted criterion to strengthen its maintenance. Once an end host 1063 joins in the overlay multicast tree, its uptime starts to calculate 1064 from 0 until its leaving. Besides, there are several metrics used in 1065 DMMP as the criteria of the capacity, which will be specified in this 1066 section. 1068 Table 2: Capacity specification 1069 +-------------+-------------------------------------+ 1070 | Metric | Operation | 1071 +-------------+-------------------------------------+ 1072 | | Differentiation non-leaf nodes from | 1073 | | leaf nodes | 1074 | Out-degree +-------------------------------------+ 1075 | | Super nodes selection | 1076 | +-------------------------------------+ 1077 | | Tree construction within clusters | 1078 | +-------------------------------------+ 1079 | | New member joins the group | 1080 | +-------------------------------------+ 1081 | | Failure recovery mechanism | 1082 | +-------------------------------------+ 1083 | | Self-improving mechanism | 1084 +-------------+-------------------------------------+ 1085 | | Non-super nodes attach to super | 1086 | E2E latency | nodes to form clusters | 1087 | +-------------------------------------+ 1088 | | New member joins the group | 1089 +-------------+-------------------------------------+ 1090 | | New member joins the group | 1091 | Uptime +-------------------------------------+ 1092 | | Self-improving mechanism | 1093 | +-------------------------------------+ 1094 | | Failure recovery mechanism | 1095 +-------------+-------------------------------------+ 1097 As shown in Table 2, out-degree is the main criterion to select the 1098 super nodes from end hosts. Then, non-super nodes select one super 1099 node which locates "near" to it based on the estimated e2e latency. 1100 During the tree construction within clusters, nodes with higher out- 1101 degree are likely to join in the tree at the high level. Regarding 1102 new members joining procedure, out-degree, e2e latency and uptime are 1103 all taken into considerations. To keep the stability of the overlay 1104 hierarchy, out-degree and uptime are chosen as comparison metric to 1105 self-improve the overlay multicast tree. Furthermore, out-degree and 1106 uptime are regarded as the main selection criterion for alternative 1107 node during the failure recovery. 1109 8. Security Considerations 1111 Security should be considered during the super nodes selection. In 1112 DMMP, the current preference is to use an authority center that 1113 qualifies the trust level of end hosts. Only when the end host 1114 obtains a security certificate from the authority center, it can be 1115 selected as a super node. 1117 Within each cluster, Cluster key, Group key and Private key are 1118 proposed as the security scheme to manage the cluster members. 1119 Details and discussions related to other security issues are to be 1120 explored in a future version of this document. 1122 9. Open Issues 1124 DMMP framework will study necessary extensions or open issues, where 1125 security, large-scale efficiency, end-to-end quality-of-service (QoS) 1126 provisioning will be likely related. Moreover, initial DMMP does not 1127 include support for NATs and firewalls since they impose fundemental 1128 restrictions on pair-wise connectivity of hosts on the overlay. 1130 10. Contributors 1132 Ruediger Geib, Xiaodong Yang and David Weiss contributed great 1133 efforts to this document. 1135 11. Acknowledgements 1137 We thank Nicolai Leymann, John Buford, Yuji-UG-Imai and Yangwoo Ko 1138 for their helpful suggestions. 1140 12. References 1142 12.1. Normative References 1144 [1] Bhattacharyya, S., "An Overview of Source-Specific Multicast 1145 (SSM)", RFC 3569, July 2003. 1147 [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement 1148 Levels", BCP 14, RFC 2119, March 1997. 1150 12.2. Informative References 1152 [3] Almeroth, K., "The evolution of multicast: From the MBone to 1153 inter-domain multicast to Internet2 deployment", IEEE 1154 Network 2000, Jan./Feb 2000. 1156 [4] Diot, C., Levine, B., Lyles, J., and D. Balensiefen, 1157 "Deployment issues for IP multicast service and architecture", 1158 IEEE Network 2000, Jan. 2000. 1160 [5] Chu, Y., Rao, S., and et. al, "A Case for End System 1161 Multicast", IEEE JSAC Vol. 20, No. 8, October 2002. 1163 [6] Banerjee, S. and B. Bhattacharjee, "Analysis of the NICE 1164 Application Layer Multicast Protocol", UMIACS Technical 1165 report TR 2002-60 and CS-TR 4380, June 2002. 1167 [7] Banerjee, S., Bhattacharjee, B., and et. al, "Scalable 1168 Application Layer Multicast", SIGCOMM 2002, August 2002. 1170 [8] Banerjee, S., Kommareddy, C., and et. al, "OMNI: An Efficient 1171 Infrastructure for Real-time Applications", Computer 1172 Networks, Special Issue on Overlay Distribution Structures and 1173 their Applications, Vol. 50, No. 6, April 2006. 1175 [9] Lao, L., Cui, J., and et. al, "TOMA: A Viable Solution for 1176 Large-scale Multicast Service Support", IFIP Networking 2005, 1177 May 2005. 1179 [10] Hefeeda, M. and B. Bhargava, "On-Demand Media Streaming Over 1180 the Internet", FTDCS'03, The Ninth IEEE Workshop on Future 1181 Trends of Distributed Computing Systems, ftdcs, p. 279, 1182 April 2003. 1184 [11] Saroiu, S., Gummadi, P., and S. Gribble, "A Measurement Study 1185 of Peer-to-Peer File Sharing Systems", MMCN'02 Proceedings of 1186 Multimedia Computing and Networking, July 2002. 1188 [12] Sripanidkulchai, K., Maggs, B., and H. Zhang, "An analysis of 1189 live streaming workloads on the Internet", SIGCOMM 1190 IMC Proceedings of the 4th ACM SIGCOMM IMC, Oct. 2004. 1192 [13] Lei, J., Fu, X., and D. Hogrefe, "DMMP: A New Dynamic Mesh- 1193 based Overlay Multicast Protocol Framework", accepted in the 1194 2007 IEEE Consumer Communications and Networking 1195 Conference, Workshop on Peer-to-Peer Multicasting (P2PM 2007), 1196 Jan. 2007. 1198 [14] Ballardie, A. and J. Crowcroft, "Core based trees (CBT)", ACM 1199 SIGCOMM Computer Communication Review, October 1993. 1201 [15] Hu, N. and P. Steenkiste, "Evaluation and Characterization of 1202 Available Bandwidth Probing Techniques", IEEE JSAC Special 1203 Issue in Internet and WWW Measurement, Vol. 21, No. 6, 1204 August 2003. 1206 [16] Tan, G., Jarvis, S., and D. Spooner, "Improving Fault 1207 Resilience of Overlay Multicast for Media Streaming", 1208 DSN'06 IEEE International Conference on Dependable Systems and 1209 Networks, June 2006. 1211 [17] Jain, M. and C. Dovrolis, "End-to-end available bandwidth: 1212 measurement methodology, dynamics, and relation with tcp 1213 throughput", Transaction'03 IEEE/ACM Transaction on Networking 1214 11 (4), August 2003. 1216 [18] Strauss, J., Katabi, D., and F. Kaashoek, "A measurement of 1217 study of available bandwidth estimation tools", 1218 IMC'03 Proceedings of ACM SIGCOMM Conference on Internet 1219 Measurement, Oct. 2003. 1221 [19] Deering, S., "Multicasting routing in internetworks and 1222 extended LANs", ACM SIGCOMM 1988, August 1988. 1224 [20] Lo, V., Zhou, D., and et. al, "Scalable Supernode Selection in 1225 Peer-to-Peer Overlay Networks", HOT-P2P'05 Proceedings of 1226 International Workshop on Hot Topics in Peer-to-Peer Systems 1227 2005, July 2005. 1229 [21] Handler, G. and P. Mirchandani, "Location on Networks: Theory 1230 and Algorithms", The MIT Press Cambridge Massachusetts, 1231 Mar. 1979. 1233 [22] Lynch, A., "Distributed Algorithms", Morgan Kaufmann San 1234 Francisco, April 1997. 1236 [23] Zhi, L. and P. Mohapatra, "HostCast: A New Overlay Multicasting 1237 Protocol", IEEE ICC'03 Proceedings of IEEE ICC 2003, June 2003. 1239 [24] Banerjee, S., Lee, S., and et. al, "Resilient Multicast using 1240 Overlays", ACM SIGMETRICS 2003, June 2003. 1242 [25] Paul, S. and K. Sabnani, "Reliable multicast transport protocol 1243 (RTMP)", IEEE JSAC, Vol. 15, No. 3, April 1997. 1245 Authors' Addresses 1247 Jun Lei 1248 University of Goettingen 1249 Institute for Informatics 1250 Lotzestr. 16-18 1251 Goettingen 37083 1252 Germany 1254 Email: lei@cs.uni-goettingen.de 1256 Xiaoming Fu 1257 University of Goettingen 1258 Institute for Informatics 1259 Lotzestr. 16-18 1260 Goettingen 37083 1261 Germany 1263 Email: fu@cs.uni-goettingen.de 1265 Dieter Hogrefe 1266 University of Goettingen 1267 Institute for Informatics 1268 Lotzestr. 16-18 1269 Goettingen 37083 1270 Germany 1272 Email: hogrefe@cs.uni-goettingen.de 1274 Full Copyright Statement 1276 Copyright (C) The IETF Trust (2008). 1278 This document is subject to the rights, licenses and restrictions 1279 contained in BCP 78, and except as set forth therein, the authors 1280 retain all their rights. 1282 This document and the information contained herein are provided on an 1283 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 1284 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 1285 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 1286 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 1287 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 1288 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1290 Intellectual Property 1292 The IETF takes no position regarding the validity or scope of any 1293 Intellectual Property Rights or other rights that might be claimed to 1294 pertain to the implementation or use of the technology described in 1295 this document or the extent to which any license under such rights 1296 might or might not be available; nor does it represent that it has 1297 made any independent effort to identify any such rights. Information 1298 on the procedures with respect to rights in RFC documents can be 1299 found in BCP 78 and BCP 79. 1301 Copies of IPR disclosures made to the IETF Secretariat and any 1302 assurances of licenses to be made available, or the result of an 1303 attempt made to obtain a general license or permission for the use of 1304 such proprietary rights by implementers or users of this 1305 specification can be obtained from the IETF on-line IPR repository at 1306 http://www.ietf.org/ipr. 1308 The IETF invites any interested party to bring to its attention any 1309 copyrights, patents or patent applications, or other proprietary 1310 rights that may cover technology that may be required to implement 1311 this standard. Please address the information to the IETF at 1312 ietf-ipr@ietf.org. 1314 Acknowledgment 1316 Funding for the RFC Editor function is provided by the IETF 1317 Administrative Support Activity (IASA).