Internet Engineering Task Force                               R. Perlman
INTERNET DRAFT                                          Sun Microsystems
February 1999                                                    C-Y Lee
                                                         Nortel Networks
                                                            A. Ballardie
                                                     Research Consultant
                                                            J. Crowcroft
                                                                     UCL
                                                                 Z. Wang
                                                     Lucent Technologies
                                                               T. Maufer
                                                        3Com Corporation
                                                                 C. Diot
                                                                  Sprint
                                                                 J. Thoo
                                                         Nortel Networks
                                                                M. Green
                                                          @Home Networks

     Simple Multicast: A Design for Simple, Low-Overhead Multicast
                 draft-perlman-simple-multicast-02.txt

Status of this memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

Abstract

This paper describes a design for multicast that is simple to
understand and imposes low enough overhead on routers that a single
scheme can work both within and between domains. It also eliminates
the need for coordinated multicast address allocation across the
Internet. It is not very different from the tree-based schemes CBT,
PIM-SM, and BGMP. Essentially all of the mechanisms to support this
have already been implemented in the other designs. The contribution
of this protocol is in what is NOT required to be implemented.

The main idea for simplifying multicast is to consider the identity
of a group to be the 8-byte combination of a "core node" C and the
multicast address M. The identity of the group is carried in join
messages and data messages. M no longer has to be unique across the
Internet. It only has to be unique per C. The other idea, which is
independent of the first, is to build a bi-directional tree (as is
done in CBT and BGMP) instead of building per-source trees from each
sender. This reduces the state necessary in routers to support
multicast.

Changes from revision 1
- use a Simple Multicast (SM) header instead of a new IP option

- modified branch creation and deletion to avoid loops

- added tree splicing mechanism

- added multicast scoping

- allow both IGMP and host SM Join

- added sender only joins

- third party independence

- layer 2 filtering

- host API and kernel changes

1.0 Introduction

IP Multicast has been around for over a decade, and several multicast
protocols have been developed over the years. However, the solutions
are either difficult to understand or expensive to deploy or both. In
particular, we believe that multicast address allocation protocols
are too complex and BGMP in combination with MASC will not scale
easily.

In this paper, we present a design we call Simple Multicast that
reduces the complexity and overhead of multicast. It is not really
"yet another multicast protocol". Instead, it is more like a subset
of other protocols, with one variation: the identifier of a group
consists of both C (the core) and M (the multicast address). This
eliminates the need to have unique multicast addresses and to
coordinate multicast addresses across the Internet.

1.1 Previous Work

DVMRP was the first multicast routing protocol proposed. It uses a
simple mechanism of flooding and pruning.

The scalability issues with DVMRP led to the development of CBT. In
CBT, a multicast group is formed by choosing a distinguished node,
the "core", and having all members join by sending special join
messages towards the core. The routers along the path keep state
about which ports are in the group. If a router along the path of the
join already has state about that group, the join does not proceed
further. Instead the router just "grafts" the new limb onto the tree.
The result is a tree of shortest paths from the core, with only the
routers along the path knowing anything about that group.
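
The grafting behaviour described above can be illustrated with a
minimal sketch. It is not taken from CBT or from this draft: the
Router class, the handle_join method, and the port and group names
are all hypothetical, and the group key is left opaque (in CBT it
would be M alone, in Simple Multicast the pair (C,M)).

```python
class Router:
    """Hypothetical per-router tree state: group key -> set of ports."""

    def __init__(self, is_core=False):
        self.is_core = is_core
        self.ports = {}

    def handle_join(self, group, arrival_port, port_toward_core):
        """Record the join; return the port to forward it on, or None."""
        on_tree = group in self.ports
        # The port the join arrived on always becomes part of the tree.
        self.ports.setdefault(group, set()).add(arrival_port)
        if on_tree or self.is_core:
            return None  # graft here: the join goes no further
        # First join for this group: remember the upstream port and
        # pass the join along towards the core.
        self.ports[group].add(port_toward_core)
        return port_toward_core


r = Router()
first = r.handle_join("G", "p1", "up")    # forwarded towards the core
second = r.handle_join("G", "p2", "up")   # grafted, not forwarded
```

Only the first join for a group travels beyond this router; later
joins merely add a port, which is why state exists only along the
paths that joins have actually taken.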

In PIM-SM, each node could independently decide whether the volume of
traffic from a particular source is worth switching from a shared
tree to a per-source tree. Thus, there are two possible trees for
traffic from a particular source for group M: the shared tree and the
source tree. To prevent loops, the shared tree had to be
unidirectional, i.e., to send to the shared tree, the data has to be
encapsulated and unicast to the core.

The other issue that makes current protocols complex is the necessity
for routers to be able to figure out the location of the core based
solely on the multicast address M. In PIM-SM, this resulted in a
protocol whereby "core-capable" routers are continuously advertised.
All routers keep track of the current set of live core-capable
routers, and there is a hashing function to map a multicast address
to one of the set of core-capable routers. This advertisement
protocol is confined to within a domain because it was recognized
that this mechanism would not scale to the entire Internet.

For inter-domain multicast, a set of new protocols has been proposed.
The MASC protocol deals with hierarchical block allocation of Class D
address space. Essentially, it creates a prefix structure in
multicast address space in a way similar to unicast address space.
Because of the limited multicast address space, the allocation has to
be dynamic. MASC contains mechanisms for collision detection and
de-allocation. Once a block of multicast addresses is allocated, and
no collision is detected for a period of time, the address block is
then given to MAAS servers for actual assignment to multicast groups.
The address block has to be propagated through BGP+ so that routers
throughout the Internet can know the mapping of multicast addresses
to cores, even in other domains.
BGMP then uses this information to know the direction in which a join
to multicast address M should be sent.

1.2 Overview of Simple Multicast

The Simple Multicast proposal tries to reduce or eliminate some of
the complexity and overhead of multicast by taking a slightly
different approach. The basic idea in Simple Multicast is that a
multicast group is created by choosing:

- a distinguished node C known as the "core"

- a multicast address M

The multicast group is then identified by the pair (C,M) rather than
just M as in conventional IP multicast. Note that the address M does
not have to be unique across the Internet now. Instead, only the pair
(C,M) has to be unique. That means that every node C in the Internet
can assign the full 28 bits worth of multicast addresses.

In Simple Multicast, multicast address allocation and core placement
(i.e., choosing a multicast address M and a core C for a multicast
group) are taken out of the basic multicast protocol. End systems may
find out about the multicast address M and the core C for a group
through one of several possible mechanisms including email
announcement, web advertising, SDR, DNS lookup, etc. Both SM-aware
endnodes and SM-aware routers must recognize the combination of (C,M)
as the identity of the group.

Once the end systems have M and C, they join the group by sending a
special join message towards the core C, creating state in the
routers along the path until the join packet hits the core or a
router that is already on the tree for this multicast group. This
creates a branch in the bi-directional distribution tree for the
group. The current IGMP mechanism for joining groups is fine,
provided that both C and M appear in the IGMP reply. Until IGMP is
modified to support this, the join message itself can be sent from
the end system.
If both C and M appear in the join message, then the first hop router
can initiate the join.

To enable incremental deployment of Simple Multicast, we provide a
mechanism for the join message to traverse non-SM aware routers.
(See Section 2.2, Joining a Group.)

The multicast tree formed is bi-directional, meaning that traffic can
be injected from any point. The core is just another node in the
tree. The data packet contains both C and M, and routers look up the
group based on the combination (C,M).

Data packets would need to carry both C and M. There have been a few
suggestions on how this may be done: 1) Define a new IP option and
specify both C and M in it. 2) Define a new protocol and specify the
new protocol in the 'protocol' field of the IPv4 header. Encapsulate
the payload inside this new protocol. This new protocol header will
contain both C and M. 3) Map (C,M) to a unique class-D address on
the data-link. The destination address of the data packet would be
re-written to a unique class-D address before being forwarded on that
data-link.

Although option processing in general is more expensive, in this case
the option processing is merely forwarding packets by looking at an
extra IP address in the option field. In contrast, other IP options
such as LSR, SSR and Router Alert are more involved. Hence, from a
purely technical point of view, the first and second approaches can
both be implemented in hardware and there is no significant
difference between them. However, due to current hardware
implementation convention, option processing is more likely to be
done in software. As a result, we have opted to use the SM header
instead.

The third approach does not require data packets or join messages to
carry the core address. SM nodes obtain the unique class-D address
which maps to a group (C,M) from a special node(s) on the data-link.
This approach is appealing because it allows SM applications to join
a group by joining a class-D address just like conventional IP
multicast. On the other hand, it also introduces concerns not unlike
those of label switching, e.g. vulnerability to loops, ensuring the
uniqueness of addresses at all times, ensuring all nodes on the LAN
use the same address for a group at all times, and address recycling,
among others. In this approach, if a unique address on the data-link
is not available for use, data cannot be forwarded. In contrast, if a
packet cannot be label switched, it can be routed. We are
investigating the feasibility of this approach.

The SM header will carry both C and M. The reason for carrying both C
and M in the SM header instead of carrying at least one of them in
the destination address is to allow SM aware routers to co-exist with
non-SM aware routers. The destination address in the IP packet is set
to a reserved multicast address, ALL-SM-NODES, when sending to
networks with SM aware routers. This ensures that non-SM routers
will not forward SM multicast data packets. When the packet must hop
over non-SM routers, the IP destination address is set to the next
SM-aware router in the path.

A nice feature of Simple Multicast is that, since both C and M are in
the SM header, the destination address in the IP packet can be
replaced with the tunnel endpoint address, and packets can be
'tunneled' with very little work. Instead of having to add and delete
IP headers (if the packet is encapsulated IPIP), the only work is to
write the tunnel endpoint address into the destination address of the
IP header.

1.3 Why Simple Multicast

We now discuss some of the advantages of Simple Multicast.

- One protocol is all that is needed.
Currently, we need to deal with two sets of multicast protocols in
order to support multicast in the Internet: DVMRP, PIM-DM, PIM-SM,
CBT, etc. for intra-domain multicast and MASC, MAAS and BGMP for
inter-domain support. The beauty of the Simple Multicast proposal is
that only one multicast protocol is needed for both intra-domain and
inter-domain. This is possible because Simple Multicast is designed
to be scalable.

- Scalability. Simple Multicast is scalable to the global Internet.
This scalability is achieved by using a trivial multicast address
allocation scheme, decoupling core selection and discovery from the
multicast protocol, and using bi-directional trees. If core discovery
is decoupled from multicast routing protocols such as PIM-SM or CBT,
these protocols would not have to use the bootstrap mechanism to
discover and select cores, a mechanism generally considered to be not
scalable.

- Trivial multicast address allocation. IP Multicast address
allocation is still an unresolved problem. Dynamically allocating
addresses such that addresses are allocated in aggregatable blocks,
while ensuring low probability of address collision (non-uniqueness),
is non-trivial. In Simple Multicast, since (C,M) is the identifier
for a multicast group, address assignment becomes totally trivial,
since addresses only have to be unique per core. Each core can have
the full 28 bit space (over 200 million addresses) so we have
virtually unlimited multicast addresses. Each core can allocate these
addresses independently without Internet-wide coordination.

- Cost effective and efficient delivery trees. It takes less state
in routers to support a group with n senders with a single shared
tree than with n per-sender trees. A bi-directional shared tree is as
cost effective for delivery of traffic from source S, even if S is
not the core, as a per-source tree rooted at S.
The bi-directional shared tree is much more efficient for delivery of
traffic from non-core source S than a unidirectional tree where the
data from S must be tunneled to the core before being multicast.

Bi-directional trees are more robust. In a unidirectional tree, the
core is needed for relaying packets from all senders. If the core is
down, the tree is gone. For a bi-directional tree, the core does not
hold any particular significance. The core is just another node in
the tree. If the core is down, the tree is merely partitioned and may
still be used for traffic delivery if the application chooses to do
so.

- Incremental deployment. Simple Multicast routers may be deployed
alongside unicast routers and other multicast routers. Traffic is
effectively tunneled (although the actual mechanism used is more
efficient than tunnels) through routers which do not support Simple
Multicast. Therefore a network manager may incrementally add Simple
Multicast routers as multicast users spread in the network.

2.0 The Design

In this section, we describe the design of Simple Multicast and its
basic operations in detail.

2.1 Creating a Multicast Group

To create a group, one needs to select a core address and a multicast
address.

Typically most applications consist of a single high-volume source.
For those applications, the core should be the source. For others,
any node close to any member of the group would be a logical choice
for core. Because the tree-building strategy (like BGMP) uses a
single exit point from a domain or any region separated from the rest
of the Internet through expensive links, the traffic pattern
resembles individual trees within domains hooked together with
inter-domain paths.
In other words, if S is in your domain, then you will receive traffic
from S through a path internal to your domain even if the core of the
group is outside the domain. Therefore, even if most of the members
of the group are in Europe, and one member of the group is in
Australia, and the Australian is chosen as the core, the tree will
still be a very good tree. Traffic between the Europeans would be
multicast through the tree confined within Europe, even though the
core was in Australia.

As the multicast addresses only need to be unique per core, each core
has over 200 million multicast addresses for allocation. Once the
core is chosen, some very simple mechanisms can be used to generate
the multicast address for the chosen core, for example, querying the
core for an address or random generation as it is done in SDR (the
collision rate will be significantly lower). Some permanent mapping
of "well-known" addresses for popular groups is also feasible.

2.2 Joining a Group

To join a group, one first has to find the core address C and
multicast address M. It is appropriate to have a variety of
mechanisms. A web page advertising a "singles chat group" might
advertise its (C,M) on its web page. Or a provider of some other sort
of service, like stock quotes, might advertise on a web page.
Ideally, clicking on the web page would cause M and C to be
downloaded to the client machine, which would then join the group.
Another mechanism, for instance when arranging a private conference,
might be to be told about M and C via the telephone, or via email.
Yet another mechanism is to have the group (together with a name or a
description) advertised in a directory such as SDR.

If IGMP is extended to support SM, the host sends a membership report
for group (C,M). The SM DR is responsible for forwarding the join off
the LAN.
This message is sent towards the core, creating state in the routers
along the path, so that each router knows which ports are in the
group (C,M).

If there are no SM routers on the LAN, a host may send an SM Join
itself. The destination IP address of the join message is set to the
core IP address. If a non-SM router on the LAN receives the join
message, it will forward it to the core. Data will be tunneled to
this endnode by an upstream SM router. As there could potentially be
multiple tunnels to the LAN, host SM Join should only be used when
there is no local SM support, as may be the case during initial
deployment or when there are too few local members to justify a
network upgrade. If the next hop towards the core on the LAN is an
SM router, and if it is not an SM DR itself, it will redirect the
join to the SM DR. In this case, if data is tunneled from upstream,
it will be tunneled to the SM router that forwards the join off the
LAN, instead of the endnode. [Note: This approach provides a
migration path whereby as more SM routers are deployed on the LAN,
fewer tunnels are used. It also allows the co-existence of IGMP (with
or without SM support) and host SM Join during the migration
process.]

If a router receives a join for multicast group (C,M), and it already
has state for (C,M), then it merely adds that port to its set of
ports for (C,M) and does not forward the join further. The result is
a tree of shortest paths from the core to each member. Each router
on the tree has a database of (C,M, {ports}) that tells it, for group
(C,M), the ports that data should be forwarded to.

The join message is sent with the Router Alert option. Since the join
message has C as the destination address, if an intermediate router
is not SM aware, it will just forward the join towards the core.
When the join message reaches an SM-aware router R2, it looks at the
IP source address of the join message, say R1. If R1 is a neighbor,
R2 adds the port from which the join was received to its list of
ports for (C,M). If R1 is not a neighbor, R2 will send a join-ack to
R1. If R2 is not a neighbor, R1 adds the 'tunnel port' to R2 as its
'parent port' for (C,M). If R2 is a neighbor, R1 just adds the port
as its parent port for (C,M), since the packet will not need to be
tunneled to get to R2.

A non-member sender may join the group as a sender-only (cf. the
uni-directional join in CBT). The sender will be on-tree and thus
will be sending keep-alives and receiving heartbeat messages, and
hence will be aware of core liveness. Data will not be forwarded to a
sender-only branch.

2.3 Transmitting to multicast group (C,M)

A sender who is a member of the group sends an IP packet with C and
M in the SM header. The destination IP address is set to
ALL-SM-NODES. This ensures non-SM aware nodes will ignore the packet.
Only SM aware routers will forward the packet.

A router that receives an SM packet looks up (C,M) in its forwarding
table. If it knows about (C,M), it checks if the port it received the
packet on is in its database. If not, it drops the packet. If so, it
forwards the packet onto all the other ports listed in its database
for (C,M). If the outgoing port is a tunnel port, the destination
address of the IP header is replaced by the tunnel endpoint, and the
packet will therefore travel across routers that are not SM-aware. At
the other end of the tunnel, the SM-aware router will replace the
destination address with ALL-SM-NODES, or with another tunnel
endpoint's address, depending on whether the packet is being
forwarded on a "real port" or a "tunnel port".
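
The forwarding rule above can be sketched as follows. This is an
illustration under stated assumptions, not a specified algorithm: the
function name forward, the table layout, the port names, and the
tunnel_endpoints mapping are hypothetical; ALL_SM_NODES is a
placeholder string because this document does not give the value of
the reserved address; and dropping a packet whose (C,M) is unknown is
an assumption made only to keep the sketch short.

```python
ALL_SM_NODES = "ALL-SM-NODES"  # placeholder for the reserved address


def forward(table, core, mcast, in_port, tunnel_endpoints):
    """Return (out_port, ip_destination) pairs for one SM data packet.

    table maps (C, M) to the set of on-tree ports; tunnel_endpoints
    maps a tunnel port to the address of the SM router at its far end.
    """
    ports = table.get((core, mcast))
    if ports is None or in_port not in ports:
        return []  # not on the tree for this arrival port: drop
    copies = []
    for port in sorted(ports - {in_port}):
        # Real ports keep ALL-SM-NODES; tunnel ports get the endpoint
        # address written into the IP destination field.
        copies.append((port, tunnel_endpoints.get(port, ALL_SM_NODES)))
    return copies


table = {("10.0.0.1", "224.1.1.1"): {"eth0", "eth1", "tun0"}}
tunnels = {"tun0": "192.0.2.7"}
copies = forward(table, "10.0.0.1", "224.1.1.1", "eth0", tunnels)
```

Note that the only per-packet work for a tunnel port is choosing a
different destination address; no header is added or removed.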

If you are not a member of the group but want to transmit to the
group, you place C into the IP destination address, and put C and M
in the SM header. The packet might travel all the way to the core,
but if it instead hits an SM-aware router R with state about (C,M)
before it gets to the core, R will inject the packet into the tree.
A sender-only member may transmit like a member, but will not be
receiving any packets for this group.

2.4 Inter-domain Multicast

Simple Multicast works both for intra-domain and inter-domain
multicast. Because the join message of Simple Multicast carries the
core IP address, and unicast routing already knows how to reach any
IP address, the join message will be delivered based on the unicast
forwarding table.

2.4.1 Incongruent unicast and multicast topologies

Where the unicast and multicast topologies are incongruent, BGP-4+
[MBGP] allows a network provider to specify the path on which it
would accept multicast traffic independent of the path unicast
traffic would traverse. In the figure below, AS1 may have a peering
agreement with AS2 to forward its unicast traffic, but a peering
agreement with AS3 to forward multicast traffic. A join from AS1
towards any cores in AS4 would be sent via AS3. A finer granularity
of policy may specify certain network or core ranges that AS3 would
carry traffic for.

                 AS2
                *   *
               *     *
            AS1       AS4
               *     *
                *   *
                 AS3

The join message to C should be routed towards the exit router
specified by BGP-4+, for delivery of multicast traffic outside of the
domain.

2.4.2 "3rd Party" Independence

For the case in which SM is used both within and between domains,
joins from different parts of the domain might only converge (merge)
outside the domain.
It is not desirable for a domain to depend on another, "3rd party",
domain for the distribution of internally sourced traffic to other
internal receivers. It is therefore necessary to ensure that joins
from different internal receivers merge at a common point inside the
domain.

BGP-4 operates on border routers (BRs) of transit domains, and
ensures that all BRs know which of them acts as egress for a
particular unicast prefix. In some transit domains, the elected
egress router injects external route information internally, and
therefore internal routers know in which direction to forward packets
destined to a particular unicast prefix. In other cases, and in stub
domains, external route information is not injected inside the
domain. Nevertheless, the BRs of these domains know for which unicast
prefix(es) each of them is acting as egress. Thus, domain BR routing
knowledge ensures that joins originated inside a domain converge at a
common point inside the domain.

This principle can be applied recursively across multiple levels of
routing hierarchy.

2.5 Failure Recovery

The situations to detect are:

- branch unused

- loop

- path to core broken or changed

- core dead or unreachable

All of the tree building schemes (CBT, PIM-SM, BGMP) need to solve
these problems, and there is no need to do anything radically new.
The only extra mechanism we've introduced is for loop detection.
Since packets can quickly proliferate in a multicast loop, it is
desirable to detect a loop as soon as it forms. Since SM uses an SM
header, we can make use of a flag that will enable us to detect a
loop on a data packet.

The other mechanisms we specify are similar to those already in place
for PIM, CBT, and BGMP.

2.5.1 Unused Branch

A branch must be kept alive with a "keep-alive" message.
If R receives at least one keep-alive message from a child in tree
(C,M), R sends a keep-alive to its parent port for (C,M). If no
keep-alive is received for some amount of time (at least a few
keep-alive intervals) from some child port for (C,M), that port is
removed from the list of ports. If there are no more child ports,
then R stops sending keep-alives, or as an optimization "unjoins"
from its parent.

2.5.2 Loop

It would be easy to detect a loop if we could assume that any data
packet for which TTL became zero implied there was a loop.
Unfortunately, some applications do an "expanding ring search" or a
traceroute in which packets are launched with very small TTLs. It
would be wrong to conclude there was a loop when the TTL on those
packets expired.

We use a flag in the SM header to mark a packet that would indicate a
loop if its TTL reached 0. An application launching a packet with a
low TTL would not set that flag. SM routers do not need to look at
the flag except on packets for which TTL expires.

Loops can also be detected on keep-alive and heartbeat messages
(which are sent outwards from the core; see the next section). The
keep-alive message indicates "hops from furthest leaf". A router
collects keep-alives from its child ports and transmits a keep-alive
that is one hop more than the maximum "hops" it receives in any
keep-alive from a child.

The heartbeat is like a keep-alive, but from the parent. Likewise it
carries a "distance from the core". In either case (heartbeat or
keep-alive), if the distance gets too great, a loop is suspected, the
port is removed from the tree, and the child rejoins to the core.

2.5.3 Path to core broken or changed

A parent transmits a "heartbeat" message to its children at regular
intervals. The heartbeat indicates whether the core is known to be
alive.
A parent continues sending heartbeat messages even if it stops
receiving "core-alive" heartbeats from its parent. In this way a
subtree will continue functioning even if the core is dead. And if
the core is not dead, the parent can simply rejoin without causing
disruption to the nodes below it in the tree, where feasible.

If unicast routing indicates the path to the core has changed, R
rejoins to the core, again, without disrupting the subtree below it,
where feasible.

To prevent loops from forming, the parent would rejoin the core using
a special join to splice the sub-trees. This splice message must be
forwarded all the way to the core, creating state where there is no
existing state. The core will acknowledge the splice message.

If the splice message hits a downstream router, it will be forwarded
until it reaches the router that originated this splice message. At
this point, the router would realize that it cannot splice the
sub-trees without causing loops. Depending on the application
requirement, which is conveyed to routers from the core via heartbeat
messages, the router could either flush the sub-tree and let leaf
routers or hosts rejoin, or, if the application so desires, allow the
sub-trees to continue functioning separately while attempting to
splice the sub-trees again when the unicast route to the core
changes. The latter makes more sense when there is a network
partition and the core is not reachable.

Since the heartbeat message is generated at regular intervals even if
a heartbeat is not received from the parent, a very long tree does
not suffer from delay variance that might cause nodes very far from
the core to incorrectly assume the tree was broken.

2.5.4 Core dead or unreachable

When the core transmits a heartbeat message it sets the "core alive"
flag.
If a router has received a heartbeat message from its parent
with the "core alive" flag set recently enough (3 heartbeat
intervals), then it sets the "core alive" flag in its heartbeat
messages to its children.

If it stops receiving heartbeats with "core alive", it prunes itself
from the old parent and rejoins the core (by sending a splice
message).

The only purpose of knowing whether the core is alive or not is for
applications to decide, if there are multiple trees for a group,
which tree they should transmit on (see the next section).

2.5.5 Multiple Trees for Reliability

The core should be selected to be a node that is reliable. However,
if a group will be long-lived and there is the worry that the core
might die, a simple mechanism is to create multiple trees (C1, M1)
and (C2, M2) for this group. All members join both groups. They can
transmit on either group. If a "core alive" heartbeat is received
only on group (C1, M1), that is the group that should be transmitted
to.

For applications for which instantaneous switchover is more important
than overhead, senders should transmit on both trees.

2.6 Access Control

We accomplish access control by allowing the core for the group to be
configured with the set of allowed senders. The core can put the
access rules into the heartbeat message. The heartbeat message
contains a list of address prefixes of authorized senders and
unauthorized senders. If the rules do not fit into the heartbeat, or
the core for privacy reasons does not want to advertise in advance
all the allowed senders, it can specify that no senders other than
itself are allowed. In that case, all senders must tunnel packets to
the core and the core will forward them. Once a sender gets
permission to send, and is known to have data to send, the core can
add that sender's address to the heartbeat message.
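As an illustration of how an SM router might evaluate the prefix
lists carried in a heartbeat, here is a minimal Python sketch. The
function names and the precedence rule (an exclude match always
rejects; a non-empty include list admits only matching senders) are
assumptions made for this example; the draft does not define how the
two lists combine.

```python
import ipaddress


def _matches(addr, prefixes):
    """Return True if addr falls inside any prefix in the list."""
    return any(addr in ipaddress.ip_network(p) for p in prefixes)


def sender_allowed(source, include=(), exclude=()):
    """Decide whether a packet from `source` may be forwarded.

    Assumed policy (not specified by the draft):
      1. a match in the exclude list always rejects the sender;
      2. if an include list is present, only matching senders pass;
      3. with no include list, every non-excluded sender passes.
    """
    addr = ipaddress.ip_address(source)
    if _matches(addr, exclude):
        return False
    if include:
        return _matches(addr, include)
    return True


# "No senders other than the core": the include list holds only the
# core's own address (192.0.2.1 is a documentation address).
core_only = ["192.0.2.1/32"]
print(sender_allowed("192.0.2.1", include=core_only))     # core itself
print(sender_allowed("198.51.100.7", include=core_only))  # must tunnel
```

Under this policy a sender that is refused would tunnel its packets
to the core, which forwards them onto the tree.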

For example, if there is some sort of authentication that must be
done in order to get permission, the core initially disallows all
senders, but then when S1 gets permission, it gets added to the list
in the heartbeat message.

Since the heartbeat message gives the access rules, all SM routers
will refuse to forward a packet from a sender disallowed by the
access rules.

Border/Access routers may also have an additional Access Control List
locally. For instance, such a router may have a list of sender
prefixes/addresses allowed to transmit multicast data. Multicast
traffic whose source address matches these prefixes/addresses will
not be filtered. The Include/Exclude Senders List from the core will
prevent these senders from sending to a group that they are not
permitted to send to.

2.7 Dynamically forming more trees

In some cases dynamically formed auxiliary trees make sense,
especially in the inter-domain case, where policy might prohibit
packets from A to D from transiting domain B. With a core in domain
B, or just due to the shared tree that happened to get formed,
packets from senders in A to receivers in D might traverse domain B.
One simple method of solving the problem is to have A unicast to the
core, and have the core send the multicast. B is still acting as a
transit domain between A and D, but it doesn't know it.

Another solution takes inspiration from the PIM-SM concept of using
the shared tree to find out about per-source trees. The way it works
is that the sender in domain A, say X, sends a message to the core C
telling it that it would like to create a "spin-off" group, (X,M').
Then the core C, in the heartbeat messages for group (C,M), advertises
the spin-off trees that members of (C,M) should also join. The
spin-off tree would, like the original tree, be kept robust through
keep-alives.

Although this does allow creation of multiple trees to support a
single group, this is less expensive than the PIM-SM scheme because
it does not always create a tree for every sender. It only does so
when necessary, and does not need a totally separate tree for each
sender. It only needs one per domain in which there are sources (and
only when the shared tree doesn't work because of transit policy
problems).

2.8 Multicast Scoping

A multicast group address can be scoped such that packets matching
the group address are not forwarded outside the defined region. Two
commonly used scopes are the link-local scope and the global scope;
neither requires configuration. Routers simply do not forward
packets addressed to the statically assigned link-local scope range
(224.0.0.0/24).

The third type of scoping requires network administrators to
configure the perimeter (boundary routers) of the scoped region. This
is called administratively scoped or local scope. At present, this is
achieved by configuring multicast border routers (M-BRs) on a scope
boundary with a boundary scope address range, the so-called
Administratively Scoped address range. Multicast traffic flows which
are to be confined within a range must use a class-D address which is
within the range. M-BRs are an impermeable boundary to any multicast
packet with a class-D destination address that falls within any of
its configured Administratively Scoped address ranges.

It is perfectly feasible for SM to use exactly the same mechanism for
achieving multicast scoping. However, multicast scoping as it is
currently defined requires a significant amount of configuration, as
well as co-ordination of the address space for defining scope
boundary ranges. Any mis-configurations can lead to multicast
packets "leaking" across boundaries they should not.
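
The conventional boundary check described above can be sketched in a
few lines of Python. This is only an illustration: the function name
and the example ranges are hypothetical, and the ranges are drawn
from 239.0.0.0/8, the administratively scoped space defined in RFC
2365. The last call shows how a missing range lets a packet "leak".

```python
import ipaddress

# Statically assigned link-local scope; never forwarded by routers.
LINK_LOCAL = ipaddress.ip_network("224.0.0.0/24")


def boundary_forwards(dest, scoped_ranges):
    """Return True if a boundary router would forward a packet to dest.

    scoped_ranges is the list of Administratively Scoped ranges
    configured on this boundary router (hypothetical configuration).
    """
    addr = ipaddress.ip_address(dest)
    if addr in LINK_LOCAL:
        return False
    return not any(addr in ipaddress.ip_network(r) for r in scoped_ranges)


configured = ["239.1.0.0/16"]
print(boundary_forwards("224.0.0.5", configured))   # link-local: blocked
print(boundary_forwards("239.1.2.3", configured))   # scoped: blocked
print(boundary_forwards("239.2.0.1", configured))   # range not configured: leaks
```

Every boundary router on the perimeter has to carry the same list,
which is the configuration burden the following sections aim to
remove.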

Multicast scope boundary configurations must conform to certain
rules, such as the rule that boundaries must be completely contained
within one another (the terms "nesting" or "convex" are often used).
The MZAP protocol [MZAP] is implemented on M-BRs to detect
inconsistent administratively scoped boundary configurations. As such
it is essentially a network management tool; it does not correct
mis-configurations.

In SM, the group address (C,M) is scoped according to the unicast
core address C. The advantage of this compared to Administratively
Scoped IP Multicast [RFC2365] is that there is no requirement for
these scoped addresses to be dynamically assigned (via AAP or MAAS)
or announced in the scoped regions (MZAP).

2.8.1 Multicast Scoping using unicast boundaries and scope mask

SM has the unique ability to take advantage of the unicast routing
system boundaries (e.g. subnet, area, AS, AS-Confederation etc.) and
use these as "natural" boundaries for multicast traffic, obviating
the need for the configuration of explicit multicast boundaries.
Furthermore, one group identifier (C, M) can be used with multiple
scopes. It works as follows: assume a (C, M) group identifier is to
be used for scopes A and B, with A nested inside B. A and B are
natural unicast routing boundaries, e.g. an area and an AS. A unicast
routing system boundary is implicitly identified by a router
aggregating routing information before propagating it over outgoing
interfaces; this is achieved by shortening a prefix mask. For
example, routing information inside boundary A has an associated mask
of 24 bits. The boundary router between A and B reduces this to 16
bits before propagating it inside B.

Now, if an SM data packet carried a "scope mask(len)" in the SM
header, the data packet would not pass beyond any unicast routing
system boundary that itself propagates a shorter mask in the unicast
route updates it sends. The general rule is: an SM data packet
carrying a "scope mask(len)" is only forwarded over those interfaces
that aggregate unicast routing information using a mask which is of
equal length or longer than that specified in the SM data packet
header.

                    |
          (c) /16   |   (d) /12
                    |
            --------+-------
          (a) /8    |   (b) /20
                    |
                    |

The figure above illustrates a router with 4 interfaces, a, b, c, d,
each of which is aggregating routes with the respective prefix
length. If an SM data packet arrives on interface (b) carrying a
"scope mask(len)" of 12, it is forwarded only over interfaces (c) and
(d): interface (a) aggregates with a mask of 8, which is shorter than
12.

2.8.2 Multicast Scoping using private network boundaries

A multicast session can be scoped within a private network if the
core address belongs to the private address space and is not
translated to any global address. In this case the boundary routers
can be the filtering or NAT devices at the edge of the network. Since
NAT devices can scope the addresses, the SM data packet itself does
not have to carry the scope mask in the SM header.

Note that for administrative scoping purposes, the function in the
NAT device which is of interest here is the filtering and address
space separation function, not the address translation function. A
public node will not be able to join a private core if the private
core address is not mapped to any global address. As a result, no
data packets for this scoped core will be forwarded out of the NAT
device.

If the boundary routers are NAT devices, there is no requirement for
the NAT devices to be SM-enabled (i.e. to know how to translate
SM-specific packets) for the purpose of scoping SM groups.
If the NAT is not SM-enabled, the join message will be filtered
according to the core (IP destination) address and hence forwarding
state for (C,M) will only be created in the defined scope. If the NAT
device is SM-enabled, data packets can be filtered based on the core
address C or the source address. In the case of SM dense mode,
C=255.255.255.255. If the NAT device is not SM-enabled, the packets
will be filtered because the IP destination address is
255.255.255.255. Hence SM dense-mode traffic is scoped by default,
i.e. no dense-mode data packets will be forwarded across any
boundary. If the NAT device is SM-enabled, a dense-mode data packet
is scoped according to its IP source address. The source address is
scoped in the same manner as the core address.

If two scoped regions intersect topologically, then the address space
in the overlapped region cannot be used by the outer scope, as stated
in RFC2365. This applies here as well, i.e. a scoped group address
cannot have its core address in the address space of the overlapped
region, to avoid the problem of the same (C,M) belonging to different
scopes at the intersecting boundary. This implies that a core address
C, scoped within scope X, where scope X is inside scope Y, should be
unique within scopes X and Y, and no core within scope Y should have
that same address C. Further, any other addresses scoped within X
should not be visible to scope Y; all addresses scoped within Y are
visible to scope X. This address separation is already maintained by
NAT devices.

2.8.3 Multicast Scoping in IPv6

In IPv6, if a core address is a site-local scope address, then the
corresponding (C,*) will be site-local scope as well.

2.9 Additional Features

We are investigating the following additional features, which are not
available in other multicast protocols:

- the ability to select dense-mode.
Currently there are routers that implement dense mode and routers
that implement sparse mode, and typically a domain will implement
either sparse or dense mode. There is no way to choose, per
application, which type of tree is more appropriate.

There are cases in which dense mode makes more sense for an
application. For example, dense mode is more appropriate if the
receivers are so densely distributed that there is very little
optimization gained by creating a tree. Dense mode is also
appropriate when the volume of data is sufficiently low that
optimizing its delivery is not worth the overhead of creating and
maintaining a tree.

With SM we use the convention of core=FF:FF:FF:FF to indicate the
packet should be sent via dense-mode. For such packets no tree is
formed and routers merely forward the packet using reverse path
forwarding. As in DVMRP, states (S,M), where S is the source IP
address, are created for dense mode groups.

Routers find out whether their neighbors support SM, and other
characteristics of their neighbors, through Hello messages. A dense
mode SM-packet should only be sent to SM-aware neighbors. As with
DVMRP, tunnels can be configured between SM-aware nodes to enable a
wider range for delivery of dense-mode SM packets.

- the ability to join a set of groups. The join message contains (C,
M, mask). That facilitates having content parameterized by M. For
instance, if the set of groups (C,*) is for stock information,
certain bits in M can encode industry, country, etc. To receive
information about all stocks, join (C,*). To receive some subset,
join a more specific (M, mask) for core C.

2.10 SM Issues

2.10.1 Host API and Kernel Changes

The SM architecture requires changes to the host Application
Programming Interface (API) and kernel.
A host may join a group using either an SM Join, where the host sends
joins in the same way as an SM router, or IGMP extended to carry the
core address as well as a class-D address. As noted before, a host SM
Join should only be used where appropriate, e.g. when there is no
local SM support.

Taking the BSD Sockets API as an example, joining a group is achieved
using a system call; the data structure passed with the system call
as an argument only supports the specification of a class-D address
and an interface (IP) address. For SM this data structure needs
modifying to include a core address element, which can be
concatenated with the class-D address to form SM's 8-byte group
identifier. The kernel SM software, or IGMP software, can then make
use of this information to generate an SM join message, or IGMP
Report, respectively.

Similarly, when data is sent to a group, the data structure passed to
the send system call must include a core address. The kernel SM
software can then place this core address in the SM header. When an
SM packet (identified by the IP protocol field) is received, the
kernel SM software is invoked and the SM header is removed before the
packet is passed to the upper layer.

2.10.1.1 Extending IGMP

While not necessary, we propose using TLV encoding in IGMP Membership
Report messages. It is anticipated that IGMP will be extended for
various purposes in future. The use of TLVs will facilitate that.

In addition to the class-D address, a field called the extended
address field, for lack of a better term, is defined to carry the
additional address required by IGMPv3, Express, SM and Distributed
Core Multicast (DCM).
The IGMP Membership Report message is encoded as follows:

   Type       Value
   Classic:   S,G    (if IGMPv3 with source specific joins)
   Express:   S,E
   Simple:    C,M
   DCM:       (S),G  where S is a list of channels

Hence the extended address field carries:

   i)   the source address for classical IP multicast (IGMPv3 with
        source specific joins)
   ii)  the source address for Express
   iii) the core address for SM
   iv)  the pointer to a list of channels for DCM.

Extending IGMP is perfectly feasible - it has been done before in
upgrading from IGMPv1 to IGMPv2, and changes will be required for
IGMPv3 if it gains wider acceptance. The kernel modifications
required to support SM are mainly to handle the additional address
field. The host API change itself requires only the addition of two
parameters. We do not, therefore, consider host changes as barriers
to SM deployment.

2.10.2 Layer 2 Filtering

In conventional IP multicast, each class-D address could be mapped to
a distinct MAC address if 28 bits were available at the MAC layer for
mapping. However, since only 23 bits of the MAC address are used for
mapping, 32 IP multicast addresses could potentially be mapped to one
MAC layer address. Hence higher layer filtering of multicast packets
is required.

If the low-order 4 bytes of the SM group identifier (the class-D
address) are similarly mapped, there is the potential for each of a
subnet's hosts to join different SM groups, with their group-ids
differing only in the core address portion of the group-id. In this
worst-case scenario, packets transmitted to one group will be
received by hosts belonging to all other SM groups on the subnet; a
group's packets only become distinguishable at the hosts' network
layers. In a more realistic case we might reasonably expect only a
small percentage of a subnet's hosts to receive packets
unnecessarily.
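
The 23-bit mapping and the resulting collisions can be made concrete
with a short Python sketch. The 01:00:5e prefix is the standard
IPv4 multicast MAC prefix (RFC 1112) and is not stated in this
document; the function names and the example addresses are
illustrative assumptions only.

```python
import ipaddress


def multicast_mac(group):
    """Map a class-D address to its Ethernet multicast MAC address.

    Only the low-order 23 bits of the group address are copied into
    the MAC address, so 2**(28 - 23) = 32 groups share each MAC.
    """
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a class-D address")
    mac = 0x01005E000000 | (int(addr) & 0x7FFFFF)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


def sm_group_mac(core, m):
    """MAC for SM group (core, m) when only M is mapped (core ignored)."""
    return multicast_mac(m)


# Two different class-D addresses that collide at layer 2.
print(multicast_mac("224.1.1.1"), multicast_mac("225.129.1.1"))

# Two SM groups that differ only in the core address also collide.
print(sm_group_mac("192.0.2.1", "224.1.1.1") ==
      sm_group_mac("198.51.100.9", "224.1.1.1"))
```

Because the core address plays no part in this mapping, any number of
SM groups sharing the same M land on one MAC address, which is the
worst case described above.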

One possible way to reduce the amount of filtering at the network
layer would be to statically map the core address to a multicast
layer 2 address, if we assume groups associated with a core are
likely to be related. This would still potentially incur higher layer
filtering of undesired groups, but only those hosts subscribed to
group(s) associated with a particular core would be affected.

The problem of mapping a larger-than-usual network identifier to a
layer 2 address is not unique to SM - the problem manifests itself in
IPv6 and EXPRESS.

One possible way of guaranteeing layer-2 multicast destination
address uniqueness would be to have special node(s) map a unique
layer 2 address to the group-id. Before a node could send, receive or
forward data, it would have to obtain the layer 2 address. IGMP can
be extended for this purpose.

Another possible solution is to have hardware filter based on a group
address at a specific offset and of a specific length. The NIC would
be snooping the IP header, but software should be able to program it
to filter addresses at the desired offset.

3.0 Packet formats

This section describes all the packet formats. Simple Multicast could
be implemented as very small modifications to PIM, CBT, or BGMP.

The packet types are:

- data packet

- join-request

- join-ack

- keep-alive (sent by child to parent)

- heartbeat (sent by parent to child)

- flush-tree (sent by parent to child after a loop is detected, to
clear out state from the looped tree as quickly as possible and cause
the subtree to be reformed)

For all control packets (JOIN-REQUEST, JOIN-ACK, KEEP-ALIVE,
HEARTBEAT, FLUSH-TREE), the "Protocol" field in the IPv4 header is
set to SM (a new protocol value).

3.1 SM-'tunnels'

Upstream (towards the core) or downstream SM routers may not be
immediate neighbors, if there are non-SM routers on the path between
them.
In a traditional tunnel between R1 and R2, R1 must add an
extra IP header, and R2 must delete the header. SM gets the same
functionality without adding and deleting headers. Instead, all that
is needed is to overwrite the destination address in the IP header
with the address of the "tunnel" endpoint. The reason this can be
done is that the information necessary for SM-routers to route the
packet (namely C and M) is contained in the SM header.

JOIN-REQUESTs and JOIN-ACKs allow tunnel-endpoints to learn of each
other. The state for a "tunnel" consists of the IP address of the
endpoint, and the number of actual IP hops in the tunnel. The count
of the tunnel's hops is kept because SM counts the length of the
tree, so that senders can know what to set as the TTL in data
packets.

3.2 Data Packet Header

IP Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |  Protocol =   |         Header Checksum       |
|               |  PROTO_SM     |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

SM Header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Core Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Multicast Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|L|                    Reserved Flag bits                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This SM header includes C, M and the loop detect flag, where
C=FF:FF:FF:FF indicates the packet should be delivered dense-mode.

The 'L' bit in the flags, if set, indicates that the TTL for this
packet should never reach 0 (see Section 2.5.2, Loop).

The IP Destination address is ALL-SM-NODES except in the following
cases:

- when a non-member sender transmits the packet, the destination is
set to the core address. The purpose of this is to enable the packet
to be unicast until it hits a node that is SM-aware, at which point
the packet is multicast along the tree from the point at which it
entered the tree. Note that if the non-member sender has joined the
group as a 'sender-only' (cf. uni-directional join in CBT), then the
destination address in the data packet is either ALL-SM-NODES or the
tunnel endpoint (as described below).

- when the packet is transmitted on a tunnel port, in which case the
destination address is set to the IP address of the tunnel endpoint.

Note that at Layer 2, the MAC address is mapped from the Multicast
Address M of the group (C,M), not from ALL-SM-NODES.

3.2 JOIN-REQUEST

The following control packet header fields are as defined in CBT:
addr_len, checksum, Payload Length and # of options.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |type=1 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Join Originating Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The destination IP address in the IP header is the Core Address. The
JOIN-REQUEST is sent with the Router Alert Option.

The Multicast address and corresponding mask (M,m) may appear
multiple times. The total length of these fields is specified in the
"addr_len" field of the common control header.

The JOIN-REQUEST may contain the following options:

- Originating TTL. This field is set to the TTL in the IP header of
this JOIN-REQUEST packet. The receiving SM router ignores this
option unless the control packet is from an SM router that is not an
immediate neighbor. The value in this field is used to calculate the
number of hops in a 'tunnel' = Originating TTL - TTL in the IP
header for this packet. The value derived is placed in "# of hops in
tunnel from you to me" in the JOIN-ACK message.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |        Originating TTL        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- Sender-Only. The join is only successful if the sender is on the
Include Senders List or NOT in the Exclude Senders List. The sender
is attached to the tree as per the uni-directional Join in CBT.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |       2       |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.3 JOIN-ACK

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |type=2 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |     # of hops in 'tunnel'     |
|               |               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Join Originating Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The destination IP address in the IP header is the downstream IP
source address of the JOIN-REQUEST. The JOIN-ACK is sent with the
Router Alert Option.

The Multicast address and corresponding mask (M,m) may appear
multiple times.
The total length of these fields is specified in the
"addr_len" field.

The field "# of hops in tunnel from you to me" is ignored unless the
control packet is from an SM router that is not an immediate
neighbor. The value in this field is saved as state for this tunnel
port.

The options from the JOIN-REQUEST are copied into the JOIN-ACK, with
the exception of the "Originating TTL" option. The Originating TTL is
set to the TTL in the IP header of this JOIN-ACK packet.

3.4 KEEP-ALIVE

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |type=3 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 KEEP-ALIVE Originating Router                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The keep-alive message is sent from a child to a parent (towards the
core), and is sent only if a keep-alive has been received recently
from a child. The destination IP address in the IP header is
ALL-SM-NODES or the tunnel endpoint address.

A single keep-alive can serve as many groups as fit into the list in
the packet.

(M,m) may appear multiple times. The total length of these fields is
specified in the "addr_len" field.

The KEEP-ALIVE may contain the following options:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |      10       |I|     reserved flag bits      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Include/Exclude Sender Prefix                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Include/Exclude Sender Mask                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- Include/Exclude Senders List that upstream routers should filter.
This option may appear multiple times. The 'I' bit is set if this is
an include sender list, and is zero if this is an exclude sender
list.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |      10       |           hop count           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Prune Time           |     # of hops in 'tunnel'     |
|                               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- KEEP-ALIVE Option. This option should appear the same number of
times as the address set (C,M,mask). Each occurrence corresponds to,
and applies to, one address set (C,M,mask).

The fields in this option are:

- hop count: the number of hops to the furthest leaf for (C,M,mask).
The hop count is incremented at every SM hop. In addition, when the
KEEP-ALIVE is received from a tunnel port, hop count = hop count +
number of hops in 'tunnel'.

- Prune Time for (C,M,mask): the time after which, if no KEEP-ALIVE
is received for group (C,M,mask), the parent should prune off this
branch.

- 'Originating TTL'. This is as described in JOIN-REQUEST.
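
The hop count arithmetic for keep-alives can be sketched as follows
in Python. The function names, the representation of a child report
as a (hop_count, tunnel_hops) pair, and the loop threshold of 32 are
assumptions for illustration; the draft only says a loop is suspected
when the distance "gets too great".

```python
def tunnel_hops(originating_ttl, received_ttl):
    """IP hops crossed in a 'tunnel': Originating TTL option minus the
    TTL found in the IP header of the received control packet."""
    return originating_ttl - received_ttl


def outgoing_hop_count(child_reports):
    """Hop count to place in the keep-alive sent to the parent.

    child_reports is a non-empty list of (hop_count, tunnel_hops)
    pairs, one per child port; tunnel_hops is 0 for an immediate
    neighbor. The result is one more than the largest adjusted value.
    """
    return 1 + max(hops + tunnel for hops, tunnel in child_reports)


def loop_suspected(hop_count, limit=32):
    """Assumed threshold check; the limit is not given by the draft."""
    return hop_count > limit


reports = [(0, 0), (2, 0), (1, tunnel_hops(64, 61))]
print(outgoing_hop_count(reports))                  # 1 + max(0, 2, 4)
print(loop_suspected(outgoing_hop_count(reports)))
```

A router with no child reports sends no keep-alive at all, so the
sketch does not handle an empty list.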

3.5 HEARTBEAT

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |type=4 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 HEARTBEAT Originating Router                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The heartbeat is sent by a parent to a child. It is sent periodically
regardless of whether a heartbeat is received from its own parent.
The destination IP address is set to ALL-SM-NODES or the tunnel
endpoint address.

The HEARTBEAT may contain the following additional options:

- Include/Exclude Senders List. This is the list of allowed/prohibited
senders to the group. The format of this option is the same as the
KEEP-ALIVE Include/Exclude Senders List, although it serves a
different purpose here.

- spin-off groups (Ci,Mi). One or more spin-off groups (Ci,Mi) may be
specified.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |   #Groupsx8   |      reserved flag bits       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Core Address Ci                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Multicast Address Mi                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- HEARTBEAT Option. This option should appear the same number of
times as the address set (C,M,mask). Each occurrence corresponds to,
and applies to, one address set (C,M,mask).

The fields in this option are:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |       6       |         core distance         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Time To Shutdown        |     # of hops in 'tunnel'     |
|                               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|A|                          reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- core distance: the number of hops to the core for (C,M,mask). The
core distance is incremented at every SM hop. In addition, when the
HEARTBEAT is received from a tunnel port, core distance = core
distance + number of hops in 'tunnel'.

- Time To Shutdown: the time left before the group should be closed
down (all 'ones' indicates the group should not be torn down).

- The 'A' bit, if set, indicates the core is alive or reachable.

- 'Originating TTL'. This is as described in JOIN-ACK.
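
The way a router might derive the 'A' bit and core distance for its
own heartbeats can be sketched in Python as below. The class and
field names are hypothetical, timestamps are plain numbers supplied
by the caller, and the initial distance of 0 before any parent
heartbeat arrives is a placeholder assumption. The window of 3
heartbeat intervals comes from Section 2.5.4.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    core_alive: bool     # the 'A' bit
    core_distance: int   # hops from the core


class HeartbeatState:
    """Per-(C,M) heartbeat state kept by a non-core SM router."""

    def __init__(self, interval, alive_intervals=3):
        self.interval = interval
        self.alive_intervals = alive_intervals
        self.last_alive = None      # time of last heartbeat with 'A' set
        self.parent_distance = 0    # placeholder until a heartbeat arrives

    def on_parent_heartbeat(self, hb, now, tunnel_hops=0):
        """Record a heartbeat from the parent (tunnel_hops is 0 for an
        immediate neighbor)."""
        self.parent_distance = hb.core_distance + tunnel_hops
        if hb.core_alive:
            self.last_alive = now

    def next_heartbeat(self, now):
        """Build the heartbeat to send to children; it is sent on every
        interval even if the parent has gone silent."""
        window = self.alive_intervals * self.interval
        alive = self.last_alive is not None and now - self.last_alive <= window
        return Heartbeat(alive, self.parent_distance + 1)


state = HeartbeatState(interval=10)
state.on_parent_heartbeat(Heartbeat(True, 2), now=0)
print(state.next_heartbeat(now=25))   # still inside the 3-interval window
print(state.next_heartbeat(now=31))   # window expired, 'A' bit cleared
```

Keeping the send timer independent of the receive path is what lets a
subtree continue to function when the core is dead or unreachable.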

3.6 FLUSH-TREE

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | vers  | type=5|   addr len    |           checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Payload Length | # of options  |           reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 HEARTBEAT Originating Router                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        core address C                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Multicast address M                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Multicast address mask m                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  option type  |  option len   |        option value...        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The destination IP address is set to ALL-SM-NODES or the tunnel
   endpoint address.

   The Multicast address and corresponding mask (M,m) may appear
   multiple times. The total length of these fields is specified in the
   "addr_len" field of the common control header.

   No options are currently defined.

4 Acknowledgments

   Many people have contributed ideas to this proposal, including
   Harald Alvestrand, Joel Halpern and Fred Baker. Because SM builds on
   previous work in IP Multicast, the authors are grateful to everyone
   who has contributed to the development of IP Multicast. We would
   like to thank all members of IDMR, in particular Dino Farinacci,
   Mark Handley, Brad Cain, Dave Thaler, Russ White and Ken Carlberg,
   whose helpful comments have improved this proposal. Others who have
   provided helpful technical information include Matthew Yuen and
   Patrick Lee.

References

   DNS Based RP Placement scheme
      Dino Farinacci's presentation in the MBONED WG, 40th IETF
      Meeting

   Static Multicast, Internet-Draft, March 1998
      M. Ohta, J. Crowcroft

   Express
      IDMR Mailing List discussion

   CBT, Core Based Tree Multicast Routing
      Ballardie, Cain, Zhang

   PIM-SM, Protocol independent multicast-sparse mode Specification,
   RFC 2117, June 1997
      Estrin, Farinacci, Helmy, Thaler, Deering, Handley, Jacobson,
      Liu, Sharma, and Wei

   BGMP, Border Gateway Multicast Protocol Specification
      Thaler, Estrin, Meyers

   MASC, Multicast Address Set Claim Protocol
      Estrin, Handley, Kumar, Thaler

   IGMP, Internet Group Management Protocol, Version 3
      Cain, Deering, Thyagarajan

   "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995
      Y. Rekhter and T. Li

   "Multiprotocol Extensions for BGP-4", RFC 2283, February 1998
      Bates, T., Chandra, R., Katz, D., and Y. Rekhter

   "The IP Network Address Translator (NAT)", RFC 1631, May 1994
      Egevang, K., Francis, P.

   "Administratively Scoped IP Multicast", RFC 2365, July 1998
      Meyer, D.

   Distributed Core Multicast
      L. Blazevic, J-Y. Le Boudec

   OGMP
      ftp://cs.ucl.ac.uk/darpa/ogmp.ps.gz

Authors' Addresses

   Radia Perlman
   Sun Microsystems Laboratories
   2 Elizabeth Drive
   Chelmsford, MA 01824
   Radia.Perlman@sun.com

   Cheng-Yin Lee
   Nortel Networks
   PO Box 3511, Station C
   Ottawa, ON K1Y 4H7, Canada
   leecy@nortel.com

   Tony Ballardie
   Research Consultant
   aballardie@acm.org

   Jon Crowcroft
   Department of Computer Science
   University College London
   Gower Street
   London, WC1E 6BT, UK
   J.Crowcroft@cs.ucl.ac.uk

   Zheng Wang
   Bell Labs Lucent Technologies
   101 Crawfords Corner Road
   Holmdel NJ 07733
   zhwang@bell-labs.com

   Thomas Maufer
   3Com Corporation
   5400 Bayfront Plaza
   Santa Clara, CA 95052
   maufer@3com.com

   Christophe Diot
   Sprint ATL
   1 Adrian Court
   Burlingame CA 94010
   USA
   cdiot@sprintlabs.com

   Joseph Thoo
   Nortel Networks
   PO Box 3511, Station C
   Ottawa, ON K1Y 4H7, Canada
   jthoo@nortel.com

   Mark Green
   @Home Networks
   markg@corp.home.net