Internet Engineering Task Force                             R. Perlman
INTERNET DRAFT                                        Sun Microsystems
October 1999                                                   C-Y Lee
                                                       Nortel Networks
                                                          A. Ballardie
                                                   Research Consultant
                                                          J. Crowcroft
                                                                   UCL
                                                               Z. Wang
                                                   Lucent Technologies
                                                             T. Maufer
                                                      3Com Corporation
                                                               C. Diot
                                                                Sprint
                                                               J. Thoo
                                                       Nortel Networks
                                                              M. Green
                                                        @Home Networks

    Simple Multicast: A Design for Simple, Low-Overhead Multicast

                draft-perlman-simple-multicast-03.txt

Status of this memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as
"work in progress."

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

Abstract

This paper describes a design for multicast that is simple to
understand and low enough in overhead for routers that a single
scheme can work both within and between domains. It also eliminates
the need for coordinated multicast address allocation across the
Internet. It is not very different from the tree-based schemes CBT,
PIM-SM, and BGMP. Essentially all of the mechanisms to support this
have already been implemented in the other designs. The contribution
of this protocol is in what is NOT required to be implemented.

The main idea for simplifying multicast is to consider the identity
of a group to be the 8-byte combination of a "core node" C, and the
multicast address M. The identity of the group is carried in join
messages and data messages. M no longer has to be unique across the
Internet. It only has to be unique per C. The other idea, which is
independent of the first, is to build a bi-directional tree (as is
done in CBT and BGMP) instead of building per-source trees from each
sender. This reduces the state necessary in routers to support
multicast.

Changes from revision 1
- use a Simple Multicast (SM) header instead of a new IP option

- modified branch creation and deletion to avoid loops

- added tree splicing mechanism

- added multicast scoping

- allow both IGMP and host SM Join

- added sender only joins

- third party independence

- layer 2 filtering

- host API and kernel changes

1.0 Introduction

IP Multicast has been around for over a decade, and several multicast
protocols have been developed over the years. However, the solutions
are either difficult to understand or expensive to deploy or both. In
particular, we believe that multicast address allocation protocols
are too complex and BGMP in combination with MASC will not scale
easily.

In this paper, we present a design we call Simple Multicast that
reduces the complexity and overhead of multicast. It is not really
"yet another multicast protocol". Instead, it is more like a subset
of other protocols, with one variation: the identifier of a group
consists of both C (the core) and M (the multicast address).
This eliminates the need to have unique multicast addresses and to
coordinate multicast addresses across the Internet.

1.1 Previous Work

DVMRP was the first multicast routing protocol proposed. It uses a
simple mechanism of flooding and pruning.

The scalability issues with DVMRP led to the development of CBT. In
CBT, a multicast group is formed by choosing a distinguished node,
the "core", and having all members join by sending special join
messages towards the core. The routers along the path keep state
about which ports are in the group. If a router along the path of the
join already has state about that group, the join does not proceed
further. Instead the router just "grafts" the new limb onto the tree.
The result is a tree of shortest paths from the core, with only the
routers along the path knowing anything about that group.
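
The grafting behavior described above can be sketched as follows. This
is a minimal illustration under stated assumptions, not CBT's actual
message processing: the function name send_join, the representation of
a path as (router, arrival port) pairs, and the router and port labels
are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set, Tuple


def send_join(path: List[Tuple[str, str]],
              on_tree: Dict[str, Set[str]]) -> List[str]:
    """Walk a join hop by hop towards the core.

    path    -- (router, port the join arrives on) pairs, first hop first,
               core last (hypothetical representation)
    on_tree -- per-router set of ports that are in the group
    Returns the routers that actually processed the join.
    """
    processed = []
    for router, in_port in path:
        processed.append(router)
        had_state = router in on_tree
        # Record the arrival port as being in the group.
        on_tree.setdefault(router, set()).add(in_port)
        if had_state:
            # The router is already on the tree: graft and stop here.
            break
    return processed


state: Dict[str, Set[str]] = {}
first = send_join([("r1", "p1"), ("r2", "p1"), ("core", "p1")], state)
second = send_join([("r4", "p1"), ("r2", "p2"), ("core", "p1")], state)
```

In this example the first join travels all the way to the core, while
the second stops at r2, which already holds state for the group and
simply adds the new port.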

In PIM-SM, each node can independently decide whether the volume of
traffic from a particular source is worth switching from a shared
tree to a per-source tree. Thus, there are two possible trees for
traffic from a particular source for group M: the shared tree and the
source tree. To prevent loops, the shared tree has to be
unidirectional, i.e., to send to the shared tree, the data has to be
encapsulated and unicast to the core.

The other issue that makes current protocols complex is the necessity
for routers to be able to figure out the location of the core based
solely on the multicast address M. In PIM-SM, this resulted in a
protocol whereby "core-capable" routers are continuously advertised.
All routers keep track of the current set of live core-capable
routers, and there is a hashing function to map a multicast
address to one of the set of core-capable routers. This advertisement
protocol is confined to within a domain because it was recognized
that this mechanism would not scale to the entire Internet.

For inter-domain multicast, a set of new protocols has been proposed.
The MASC protocol deals with hierarchical block allocation of Class D
address space. Essentially, it creates a prefix structure in
multicast address space in a way similar to unicast address space.
Because of the limited multicast address space, the allocation has to
be dynamic. MASC contains mechanisms for collision detection and
de-allocation. Once a block of multicast addresses is allocated, and
no collision is detected for a period of time, the address block is
then given to MAAS servers for actual assignment to multicast groups.
The address block has to be propagated through BGP+ so that routers
throughout the Internet can know the mapping of multicast addresses
to cores, even in other domains.
BGMP then uses this information to know the direction in which a join
to multicast address M should be sent.

1.2 Overview of Simple Multicast

The Simple Multicast proposal tries to reduce or eliminate some of
the complexity and overhead of multicast by taking a slightly
different approach. The basic idea in Simple Multicast is that a
multicast group is created by choosing:

- a distinguished node C known as the "core"

- a multicast address M

The multicast group is then identified by the pair (C,M) rather than
just M as in conventional IP multicast. Note that the address M does
not have to be unique across the Internet now. Instead, only the pair
(C,M) has to be unique. That means that every node C in the Internet
can assign the full 28 bits' worth of multicast addresses.

In Simple Multicast, multicast address allocation and core placement
(i.e., choosing a multicast address M and a core C for a multicast
group) are taken out of the basic multicast protocol. End systems may
find out about the multicast address M and the core C for a group
through one of several possible mechanisms including email
announcement, web advertising, SDR, DNS lookup, etc. Both SM-aware
endnodes and SM-aware routers must recognize the combination of (C,M)
as the identity of the group.

Once the end systems have M and C, they then join the group by
sending a special join message towards the core C, creating state in
the routers along the path until the join packet hits the core or a
router that is already on the tree for this multicast group. This
creates a branch in the bi-directional distribution tree for the
group. The current IGMP mechanism for joining groups is fine,
provided that both C and M appear in the IGMP reply. Until IGMP is
modified to support this, the join message itself can be sent from
the end system.
If both C and M appear in the join message, then the first hop router
can initiate the join.

To enable incremental deployment of Simple Multicast, we provide a
mechanism for the join message to traverse non-SM-aware routers. (See
Joining a Group).

The multicast tree formed is bi-directional, meaning that traffic can
be injected from any point. The core is just another node in the
tree. The data packet contains both C and M, and routers look up the
group based on the combination (C,M).

Data packets would need to carry both C and M. There have been a few
suggestions on how this may be done:

1) Define a new IP option and specify both C and M in it.

2) Define a new protocol and specify the new protocol in the
'protocol' field of the IPv4 header. Encapsulate the payload inside
this new protocol. This new protocol header will contain both C and M.

3) Map (C,M) to a unique class-D address on the data-link. The
destination address of the data packet would be re-written to a
unique class-D address before being forwarded on that data-link.

Although option processing in general is more expensive, in this case
the option processing is merely forwarding packets by looking at an
extra IP address in the option field. In contrast, other IP options
such as LSR, SSR and Router Alert are more involved. Hence, from a
purely technical point of view, the first and second approaches can
be implemented in hardware and there is no significant difference
between these two approaches. However, due to current hardware
implementation convention, option processing is more likely done in
software. As a result, we have opted to use the SM header instead.

The third approach does not require data packets or join messages to
carry the core address. SM nodes obtain the unique class-D address
which maps to a group (C,M) from a special node(s) on the data-link.
This approach is appealing because it allows SM applications to join
a group by joining a class-D address just like conventional IP
multicast. On the other hand, it also introduces concerns not unlike
label switching, e.g. vulnerability to loops, ensuring the uniqueness
of addresses at all times, ensuring all nodes on the LAN use the same
address for a group at all times and address recycling, among others.
In this approach, if a unique address on the data-link is not
available for use, data cannot be forwarded. In contrast, if a packet
cannot be label switched, it can be routed. We are investigating the
feasibility of this approach.

The SM header will carry both C and M. The reason for carrying both C
and M in the SM header instead of carrying at least one of them in the
destination address is to allow SM-aware routers to co-exist with
non-SM-aware routers. The destination address in the IP packet is set
to a reserved multicast address, ALL-SM-NODES, when sending to
networks with SM-aware routers. This ensures that non-SM routers
will not forward SM multicast data packets. When the packet must hop
over non-SM routers, the IP destination address is set to the next
SM-aware router in the path.

A nice feature of Simple Multicast is that, since both C and M are in
the SM header, the destination address in the IP packet can be
replaced with the tunnel endpoint address, and packets can be
'tunneled' with very little work. Instead of having to add and delete
IP headers (if the packet is encapsulated IPIP), the only work is to
write the tunnel endpoint address into the destination address of the
IP header.

1.3 Why Simple Multicast

We now discuss some of the advantages of Simple Multicast.

- One protocol is all that is needed.
Currently, we need to deal with two sets of multicast protocols in
order to support multicast in the Internet: DVMRP, PIM-DM, PIM-SM,
CBT, etc. for intra-domain multicast, and MASC, MAAS and BGMP for
inter-domain support. The beauty of the Simple Multicast proposal is
that only one multicast protocol is needed for both intra-domain and
inter-domain. This is possible because Simple Multicast is designed
to be scalable.

- Scalability. Simple Multicast is scalable to the global Internet.
This scalability is achieved by using a trivial multicast address
allocation scheme, decoupling core selection and discovery from the
multicast protocol and using bi-directional trees. If core discovery
is decoupled from multicast routing protocols such as PIM-SM or CBT,
these protocols would not have to use the bootstrap mechanism to
discover and select cores, a mechanism generally considered to be not
scalable.

- Trivial multicast address allocation. IP Multicast address
allocation is still an unresolved problem. Dynamically allocating
addresses such that addresses are allocated in aggregatable blocks,
while ensuring low probability of address collision (non-uniqueness),
is non-trivial. In Simple Multicast, since (C,M) is the identifier
for a multicast group, address assignment becomes totally trivial,
since addresses only have to be unique per core. Each core can have
the full 28-bit space (over 200 million addresses), so we have
virtually unlimited multicast addresses. Each core can allocate these
addresses independently without Internet-wide coordination.

- Cost effective and efficient delivery trees. It takes less state
in routers to support a group with n senders with a single shared
tree than with n per-sender trees. A bi-directional shared tree is as
cost effective for delivery of traffic from source S, even if S is
not the core, as a per-source tree rooted at S.
The bi-directional shared tree is much more efficient for delivery of
traffic from non-core source S than a unidirectional tree where the
data from S must be tunneled to the core before being multicast.

Bi-directional trees are more robust. In a unidirectional tree, the
core is needed for relaying packets from all senders. If the core is
down, the tree is gone. For a bi-directional tree, the core does not
hold any particular significance. The core is just another node in
the tree. If the core is down, the tree is merely partitioned and may
still be used for traffic delivery if the application chooses to do
so.

- Incremental deployment. Simple Multicast routers may be deployed
alongside unicast routers and other multicast routers. Traffic is
effectively tunneled (although the actual mechanism used is more
efficient than tunnels) through routers which do not support Simple
Multicast. Therefore a network manager may incrementally add Simple
Multicast routers as multicast users spread in the network.

2.0 The Design

In this section, we describe the design of Simple Multicast and its
basic operations in detail.

2.1 Creating a Multicast Group

To create a group, one needs to select a core address and a multicast
address.

Typically, most applications consist of a single high-volume source.
For those applications, the core should be the source. For others,
any node close to any member of the group would be a logical choice
for core. Because the tree-building strategy (like BGMP) uses a
single exit point from a domain or any region separated from the rest
of the Internet through expensive links, the traffic pattern
resembles individual trees within domains hooked together with
inter-domain paths.
In other words, if S is in your domain, then you will receive traffic
from S through a path internal to your domain even if the core of the
group is outside the domain. Therefore, even if most of the members
of the group are in Europe, and one member of the group is in
Australia, and the Australian is chosen as the core, the tree will
still be a very good tree. Traffic between the Europeans would be
multicast through the tree confined within Europe, even though the
core was in Australia.

As the multicast addresses only need to be unique per core, each core
has over 200 million multicast addresses for allocation. Once the
core is chosen, some very simple mechanisms can be used to generate
the multicast address for the chosen core, for example, querying the
core for an address or random generation as is done in SDR (the
collision rate will be significantly lower). Some permanent mapping
of "well-known" addresses for popular groups is also feasible.

2.2 Joining a Group

To join a group, one first has to find the core address C and
multicast address M. It is appropriate to have a variety of
mechanisms. A web page advertising a "singles chat group" might
advertise its (C,M) on that page. Or a provider of some other sort
of service, like stock quotes, might advertise on a web page.
Ideally, clicking on the web page would cause M and C to be
downloaded to the client machine, which would then join the group.
Another mechanism, for instance when arranging a private conference,
might be to be told about M and C via the telephone, or via email.
Yet another mechanism is to have the group (together with a name or a
description) advertised in a directory such as SDR.

If IGMP is extended to support SM, the host sends a membership report
for group (C,M). The SM DR is responsible for forwarding the join off
the LAN.
This message is sent towards the core, creating state in the routers
along the path, so that each router knows which ports are in the
group (C,M).

If there are no SM routers on the LAN, a host may send an SM Join
itself. The destination IP address of the join message is set to the
core IP address. If a non-SM router on the LAN receives the join
message, it will forward it to the core. Data will be tunneled to
this endnode by an upstream SM router. As there could be potentially
multiple tunnels to the LAN, host SM Join should only be used when
there is no local SM support, as may be the case during initial
deployment or when there are too few local members to justify a
network upgrade. If the next hop towards the core on the LAN is an
SM router, and if it is not an SM DR itself, it will redirect the
join to the SM DR. In this case, if data is tunneled from upstream,
it will be tunneled to the SM router that forwards the join off the
LAN, instead of the endnode. [Note: This approach provides a
migration path whereby as more SM routers are deployed on the LAN,
fewer tunnels are used. It also allows the co-existence of IGMP (with
or without SM support) and host SM Join during the migration
process.]

If a router receives a join for multicast group (C,M), and it
already has state for (C,M), then it merely adds that port to its set
of ports for (C,M) and does not forward the join further. The result
is a tree of shortest paths from the core to each member. Each
router on the tree has a database of (C,M, {ports}) that tells it,
for group (C,M), the ports that data should be forwarded to.

The join message is sent with the Router Alert option. Since the join
message has C as the destination address, if an intermediate router
is not SM-aware, it will just forward the join towards the core.
When the join message reaches an SM-aware router R2, R2 looks at the
IP source address of the join message, say R1. If R1 is a neighbor, R2
adds the port from which the join was received to its list of ports
for (C,M). If R1 is not a neighbor, R2 will send a join-ack to R1. If
R2 is not a neighbor, R1 adds the 'tunnel port' to R2 as its 'parent
port' for (C,M). If R2 is a neighbor, R1 just adds the port as its
parent port for (C,M), since the packet will not need to be tunneled
to get to R2.

A non-member sender may join the group as a sender-only (cf.
uni-directional join in CBT). The sender will be on-tree and thus will
be sending keep-alives and receiving heartbeat messages, and hence
will be aware of core liveness. Data will not be forwarded to a
sender-only branch.

2.3 Transmitting to multicast group (C,M)

A sender who is a member of the group sends an IP packet with C and
M in the SM header. The destination IP address is set to
ALL-SM-NODES. This ensures non-SM-aware nodes will ignore the packet.
Only SM-aware routers will forward the packet.

A router that receives an SM packet looks up (C,M) in its forwarding
table. If it knows about (C,M), it checks if the port it received the
packet on is in its database. If not, it drops the packet. If so, it
forwards the packet onto all the other ports listed in its database
for (C,M). If the outgoing port is a tunnel port, the destination
address of the IP header is replaced by the tunnel endpoint, and the
packet will therefore travel across routers that are not SM-aware. At
the other end of the tunnel, the SM-aware router will replace the
destination address with ALL-SM-NODES, or with another tunnel
endpoint's address, depending on whether the packet is being
forwarded on a "real port" or a "tunnel port".
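
The forwarding rule above can be sketched in a few lines. This is an
illustrative model under stated assumptions, not the SM header
processing itself: the class and field names (Port, SMPacket,
SMRouter, tunnel_endpoint) are hypothetical, ALL_SM_NODES is a
placeholder string because this section does not give the reserved
address, and dropping packets for an unknown (C,M) is an assumption
the text does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

ALL_SM_NODES = "ALL-SM-NODES"  # placeholder, not a real address value


@dataclass(frozen=True)
class Port:
    name: str
    tunnel_endpoint: Optional[str] = None  # set only for tunnel ports


@dataclass
class SMPacket:
    ip_dst: str          # destination address in the IP header
    core: str            # C, carried in the SM header
    group: str           # M, carried in the SM header
    payload: bytes = b""


@dataclass
class SMRouter:
    # (C, M) -> ports that are on the tree for that group
    groups: Dict[Tuple[str, str], Set[Port]] = field(default_factory=dict)

    def forward(self, pkt: SMPacket, in_port: Port) -> List[Tuple[Port, SMPacket]]:
        ports = self.groups.get((pkt.core, pkt.group))
        # Assumption: no state for (C,M) means the packet is dropped.
        # A packet arriving on a port that is not in the database is dropped.
        if ports is None or in_port not in ports:
            return []
        out = []
        for port in sorted(ports - {in_port}, key=lambda p: p.name):
            # Tunnel port: rewrite the IP destination to the tunnel endpoint.
            # Real port: (re)set it to ALL-SM-NODES.
            dst = port.tunnel_endpoint or ALL_SM_NODES
            out.append((port, SMPacket(dst, pkt.core, pkt.group, pkt.payload)))
        return out


router = SMRouter()
a, b = Port("a"), Port("b")
t = Port("t", tunnel_endpoint="192.0.2.9")
router.groups[("10.0.0.1", "224.1.1.1")] = {a, b, t}
sent = router.forward(SMPacket(ALL_SM_NODES, "10.0.0.1", "224.1.1.1"), a)
```

Note that only the IP destination address changes per outgoing port;
C and M stay in the SM header, which is what lets a tunnel be
expressed as a destination rewrite rather than an extra IP header.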

If you are not a member of the group but want to transmit to the
group, you place C into the IP destination address, and put C and M
in the SM header. The packet might travel all the way to the core,
but if it instead hits an SM-aware router R with state about (C,M)
before it gets to the core, R will inject the packet into the tree.
A sender-only member may transmit like a member, but will not be
receiving any packets for this group.

2.4 Inter-domain Multicast

Simple Multicast works both for intra-domain and inter-domain
multicast. Because the join message of Simple Multicast carries the
core IP address, and unicast routing already knows how to reach any
IP address, the join message will be delivered based on the unicast
forwarding table.

2.4.1 Incongruent unicast and multicast topologies

Where the unicast and multicast topologies are incongruent, BGP-4+
[MBGP] allows a network provider to specify the path on which it
would accept multicast traffic, independent of the path unicast
traffic would traverse. In the figure below, AS1 may have a peering
agreement with AS2 to forward its unicast traffic, but a peering
agreement with AS3 to forward multicast traffic. A join from AS1
towards any core in AS4 would be sent via AS3. A finer granularity of
policy may specify certain network or core ranges that AS3 would
carry traffic for.

              AS2
             *   *
            *     *
         AS1       AS4
            *     *
             *   *
              AS3

The join message to C should be routed towards the exit router
specified by BGP-4+, for delivery of multicast traffic outside of the
domain.

2.4.2 "3rd Party" Independence

For the case in which SM is used both within and between domains,
joins from different parts of the domain might only converge (merge)
outside the domain.
It is not desirable for a domain to depend on another, "3rd party",
domain for the distribution of internally sourced traffic to other
internal receivers. It is therefore necessary to ensure that joins
from different internal receivers merge at a common point inside the
domain.

BGP-4 operates on border routers (BRs) of transit domains, and
ensures that all BRs know which of them acts as egress for a
particular unicast prefix. In some transit domains, the elected
egress router injects external route information internally, and
internal routers therefore know in which direction to forward packets
destined to a particular unicast prefix. In other cases, and in stub
domains, external route information is not injected inside the
domain. Nevertheless, the BRs of these domains know for which unicast
prefix(es) each of them is acting as egress. Thus, domain BR routing
knowledge ensures that joins originated inside a domain converge at a
common point inside the domain.

This principle can be applied recursively across multiple levels of
routing hierarchy.

2.5 Failure Recovery

The situations to detect are:

- branch unused

- loop

- path to core broken or changed

- core dead or unreachable

All of the tree-building schemes (CBT, PIM-SM, BGMP) need to solve
these problems, and there is no need to do anything radically new.
The only extra mechanism we've introduced is for loop detection.
Since packets can quickly proliferate in a multicast loop, it is
desirable to detect a loop as soon as it forms. Since SM uses an SM
header, we can make use of a flag that will enable us to detect a
loop on a data packet.

The other mechanisms we specify are similar to those already in place
for PIM, CBT, and BGMP.

2.5.1 Unused Branch

A branch must be kept alive with a "keep-alive" message.
If R receives at least one keep-alive message from a child in tree
(C,M), R sends a keep-alive to its parent port for (C,M). If no
keep-alive is received for some amount of time (at least a few
keep-alive intervals) from some child port for (C,M), that port is
removed from the list of ports. If there are no more child ports,
then R stops sending keep-alives, or as an optimization "unjoins"
from its parent.

2.5.2 Loop

It would be easy to detect a loop if we could assume that any data
packet for which TTL became zero implied there was a loop.
Unfortunately, some applications do an "expanding ring search" or a
traceroute in which packets are launched with very small TTLs. It
would be wrong to conclude there was a loop when the TTL on those
packets expired.

We use a flag in the SM header to mark a packet that would
indicate a loop if its TTL reached 0. An application launching a
packet with a low TTL would not set that flag. SM routers do not need
to look at the flag except on packets for which TTL expires.

Loops can also be detected on keep-alive and heartbeat messages
(which are sent outwards from the core; see next section). The
keep-alive message indicates "hops from furthest leaf". A router
collects keep-alives from its child ports and transmits a keep-alive
that is one hop more than the maximum "hops" it receives in any
keep-alive from a child.

The heartbeat is like a keep-alive, but from the parent. Likewise it
carries a "distance from the core". In either case (heartbeat or
keep-alive), if the distance gets too great, a loop is suspected and
the port is removed from the tree and the child rejoins to the core.

2.5.3 Path to core broken or changed

A parent transmits a "heartbeat" message to its children at regular
intervals. The heartbeat indicates whether the core is known to be
alive.
A parent continues sending heartbeat messages even if it stops
receiving "core-alive" heartbeats from its parent. In this way a
subtree will continue functioning even if the core is dead. And if
the core is not dead, the parent can simply rejoin without causing
disruption to the nodes below it in the tree, where feasible.

If unicast routing indicates the path to the core has changed, R
rejoins to the core, again, without disrupting the subtree below it,
where feasible.

To prevent loops from forming, the parent would rejoin the core using
a special join to splice the sub-trees. This splice message must be
forwarded all the way to the core, creating state where there is no
existing state. The core will acknowledge the splice message.

If the splice message hits a downstream router, it will be forwarded
until it reaches the router that originated this splice message. At
this point, the router would realize that it cannot splice the
sub-trees without causing loops. Depending on the application
requirement, which is conveyed to routers from the core via heartbeat
messages, the router could either flush the sub-tree and let leaf
routers or hosts rejoin, or, if the application so desires, allow the
sub-trees to continue functioning separately while attempting to
splice the sub-trees again when the unicast route to the core
changes. The latter makes more sense when there is a network
partition, and the core is not reachable. The decision to flush the
sub-tree or rejoin the core can be based on information such as the
depth of the sub-tree and distance to core. This information may be
obtained from the keep-alive and heartbeat messages.
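
The two outcomes of a splice attempt can be illustrated with a small
simulation. It is only a sketch under stated assumptions: the function
name send_splice and the two dictionaries are hypothetical, and the
rule that a router with existing state passes the splice to its tree
parent (while a router without state creates state and follows
unicast routing) is one reading of the text above, not a specified
procedure.

```python
from typing import Dict, List, Tuple


def send_splice(origin: str, core: str,
                next_hop: Dict[str, str],
                tree_parent: Dict[str, str],
                max_hops: int = 64) -> Tuple[str, List[str]]:
    """Follow a splice message from `origin` towards `core`.

    next_hop    -- unicast next hop towards the core, per router
    tree_parent -- current tree parent of routers that already hold
                   state for the group (excluding `origin` itself)
    Returns (outcome, routers where new state was created).
    """
    created: List[str] = []
    current = next_hop[origin]
    for _ in range(max_hops):
        if current == core:
            return "acked", created      # the core acknowledges the splice
        if current == origin:
            return "loop", created       # splice came back: cannot splice
        if current in tree_parent:
            # Existing state: keep forwarding along the tree.
            current = tree_parent[current]
        else:
            # No state yet: create it and follow unicast routing.
            created.append(current)
            current = next_hop[current]
    return "hop-limit", created


# R has lost its parent; D1 and D2 are downstream of R on the tree.
subtree = {"D1": "R", "D2": "D1"}
ok = send_splice("R", "C", {"R": "X", "X": "C"}, subtree)
looped = send_splice("R", "C", {"R": "D2"}, subtree)
```

In the first call the new unicast path runs through X, which has no
state, so the splice reaches the core. In the second, the new path
enters R's own sub-tree at D2 and the message travels back up to R,
which is the case where R must either flush the sub-tree or keep it
running separately.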

Since the heartbeat message is generated at regular intervals even if
a heartbeat is not received from the parent, a very long tree does
not suffer from delay variance that might cause nodes very far from
the core to incorrectly assume the tree was broken.

2.5.4 Core dead or unreachable

When the core transmits a heartbeat message it sets the "core alive"
flag. If a router has received a heartbeat message from its parent
with the "core alive" flag set recently enough (within 3 heartbeat
intervals), then it sets the "core alive" flag in its heartbeat
messages to its children.

If it stops receiving heartbeats with "core alive" set, it prunes
itself from the old parent and rejoins the core (by sending a splice
message).

The only purpose of knowing whether the core is alive or not is for
applications to decide, if there are multiple trees for a group,
which tree they should transmit on (see next section).

2.5.5 Multiple Trees for Reliability

The core should be selected to be a node that is reliable. However,
if a group will be long-lived and there is the worry that the core
might die, a simple mechanism is to create multiple trees (C1, M1)
and (C2, M2) for this group. All members join both groups. They can
transmit on either group. If a "core alive" heartbeat is received
only on group (C1, M1), that is the group that should be transmitted
to.

For applications for which instantaneous switchover is more important
than overhead, senders should transmit on both trees.

2.6 Access Control

We accomplish access control by allowing the core for the group to be
configured with the set of allowed senders. The core can put the
access rules into the heartbeat message. The heartbeat message
contains a list of address prefixes of authorized and unauthorized
senders.
If the rules do not fit into the heartbeat, or the core for privacy
reasons does not want to advertise in advance all the allowed
senders, it can specify that no sender other than itself is allowed.
In that case, all senders must tunnel packets to the core and the
core will forward them. Once a sender gets permission to send, and is
known to have data to send, the core can add that sender's address to
the heartbeat message.

For example, if there is some sort of authentication that must be
done in order to get permission, the core initially disallows all
senders, but then when S1 gets permission, it gets added to the list
in the heartbeat message.

Since the heartbeat message gives the access rules, all SM routers
will refuse to forward a packet from a sender disallowed by the
access rules.

Border/Access routers may also have an additional Access Control List
locally. For instance, such a router may have a list of sender
prefixes/addresses allowed to transmit multicast data. Multicast
traffic with a source address matching these prefixes/addresses will
not be filtered locally. The Include/Exclude Senders List from the
core will still prevent these senders from sending to a group that
they are not permitted to send to.

2.7 Dynamically forming more trees

In some cases dynamically formed auxiliary trees make sense,
especially in the inter-domain case, where policy might prohibit
packets from A to D from transiting domain B. With a core in domain B,
or just due to the shared tree that happened to get formed, packets
from senders in A to receivers in D might traverse domain B. One
simple method of solving the problem is to have A unicast to the core,
and have the core send the multicast. B is still acting as a transit
domain between A and D, but it doesn't know it.

Another solution takes inspiration from the PIM-SM concept of using
the shared tree to find out about per-source trees.
The way it works is that the sender in domain A, say X, sends a
message to the core C telling it that it would like to create a
"spin-off" group, (X,M'). Then the core C, in the heartbeat messages
for group (C,M), advertises the spin-off trees that members of (C,M)
should also join. The spin-off tree would, like the original tree, be
kept robust through keep-alives.

Although this does allow creation of multiple trees to support a
single group, it is less expensive than the PIM-SM scheme because it
does not always create a tree for every sender. It does so only when
necessary, and does not need a totally separate tree for each sender.
It needs only one per domain in which there are sources (and only
when the shared tree doesn't work because of transit policy
problems).

2.8 Multicast Scoping

A multicast group address can be scoped such that packets matching
the group address are not forwarded outside the defined region. Two
commonly used scopes are the link-local scope and the global scope;
neither requires configuration. Routers simply do not forward
packets addressed to the statically assigned link-local scope range
(224.0.0.0/24).

The third type of scoping requires network administrators to
configure the perimeter (boundary routers) of the scoped region. This
is called administrative scoping or local scope. At present, this is
achieved by configuring multicast border routers (M-BRs) on a scope
boundary with a boundary scope address range, the so-called
Administratively Scoped address range. Multicast traffic flows which
are to be confined within a region must use a class-D address which
is within the range. An M-BR is an impermeable boundary to any
multicast packet with a class-D destination address that falls within
any of its configured Administratively Scoped address ranges.
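
The M-BR boundary check described above can be illustrated with a
short Python sketch. The function name may_cross_boundary is
hypothetical, and the configured range used in the example
(239.255.0.0/16, the IPv4 Local Scope of RFC 2365) is chosen only for
illustration; a real M-BR would use whatever ranges its administrator
configured.

```python
# Illustrative sketch only; the function name and example range are
# assumptions, not part of SM or of any router implementation.
import ipaddress


def may_cross_boundary(dst, scoped_ranges):
    """Return True if a packet to class-D address dst may leave the region.

    scoped_ranges is the list of Administratively Scoped ranges
    configured on this boundary router, written as CIDR strings.
    """
    addr = ipaddress.ip_address(dst)
    return not any(addr in ipaddress.ip_network(r) for r in scoped_ranges)


if __name__ == "__main__":
    configured = ["239.255.0.0/16"]
    print(may_cross_boundary("239.255.1.1", configured))    # scoped: blocked
    print(may_cross_boundary("224.2.127.254", configured))  # not scoped
```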

It is perfectly feasible for SM to use exactly the same mechanism for
achieving multicast scoping. However, multicast scoping as it is
currently defined requires a significant amount of configuration, as
well as co-ordination of the address space for defining scope
boundary ranges. Any mis-configuration can lead to multicast packets
"leaking" across boundaries they should not cross.

Multicast scope boundary configurations must conform to certain
rules, such as the rule that boundaries must be completely contained
within one another (the terms "nesting" and "convex" are often used).
The MZAP protocol [MZAP] is implemented on M-BRs to detect
inconsistent administratively scoped boundary configurations. As such
it is essentially a network management tool; it does not correct
mis-configurations.

In SM, the group address (C,M) is scoped according to the unicast
core address C. The advantage of this compared to Administratively
Scoped IP Multicast [RFC2365] is that there is no requirement for
these scoped addresses to be dynamically assigned (via AAP or MAAS) or
announced in the scoped regions (MZAP).

2.8.1 Multicast Scoping using unicast boundaries and scope mask

SM has the unique ability to take advantage of the unicast routing
system boundaries (e.g. subnet, area, AS, AS-Confederation etc.) and
use these as "natural" boundaries for multicast traffic, obviating
the need for the configuration of explicit multicast boundaries.
Furthermore, one group identifier (C, M) can be used with multiple
scopes. It works as follows: assume a (C, M) group identifier is to
be used for scopes A and B, with A nested inside B. A and B are
natural unicast routing boundaries, e.g. an area and an AS.
A unicast routing system boundary is implicitly identified by a
router aggregating routing information before propagating it over
outgoing interfaces; this is achieved by shortening a prefix mask. For
example, routing information inside boundary A has an associated mask
of 24 bits. The boundary router between A and B reduces this to 16
bits before propagating it inside B.

Now, if an SM data packet carries a "scope mask(len)" in the SM
header, the data packet will not pass beyond any unicast routing
system boundary that itself propagates a shorter mask in the unicast
route updates it sends. The general rule is: an SM data packet
carrying a "scope mask(len)" is forwarded only over those interfaces
that aggregate unicast routing information using a mask which is of
equal length or longer than that specified in the SM data packet
header.

                    |
        (c) /16     |     (d) /12
                    |
            --------+-------
        (a) /8      |     (b) /20
                    |
                    |

The figure above illustrates a router with 4 interfaces, a, b, c, d,
each of which aggregates routes with the respective prefix length. If
an SM data packet arrives on interface (b) carrying a "scope
mask(len)" of 12, it is forwarded only over interfaces (c) and (d),
since /16 and /12 are at least as long as 12 while /8 is shorter.

2.8.2 Multicast Scoping using private network boundaries

A multicast session can be scoped within a private network if the
core address belongs to the private address space and is not
translated to any global address. In this case the boundary routers
can be the filtering or NAT devices at the edge of the network. Since
NAT devices can scope the addresses, the SM data packet itself does
not have to carry the scope mask in the SM header.

Note that for administrative scoping purposes, the function in the
NAT device which is of interest here is the filtering and address
space separation function, not the address translation function.
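
The scope mask forwarding rule of Section 2.8.1 can be sketched as
follows, using the four-interface router from the figure in that
section. The function name scoped_out_interfaces and the dictionary
representation of the router are hypothetical and serve only to work
through the example.

```python
# Illustrative sketch only; the function name and data layout are
# assumptions made for this worked example.
def scoped_out_interfaces(aggregate_len, in_if, scope_mask_len):
    """Interfaces over which a scoped SM data packet is forwarded.

    aggregate_len maps an interface name to the prefix length that the
    router uses when aggregating unicast routes on that interface. The
    packet is never sent back over the interface it arrived on.
    """
    return sorted(
        name
        for name, prefix_len in aggregate_len.items()
        if name != in_if and prefix_len >= scope_mask_len
    )


if __name__ == "__main__":
    router = {"a": 8, "b": 20, "c": 16, "d": 12}
    print(scoped_out_interfaces(router, in_if="b", scope_mask_len=12))
    # prints ['c', 'd']
```

Interface (a) is excluded because its /8 aggregate is shorter than the
scope mask length of 12 carried in the packet.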
A public node will not be able to join a private core if the private
core address is not mapped to any global address. As a result, no
data packets for this scoped core will be forwarded out of the NAT
device.

If the boundary routers are NAT devices, there is no requirement for
the NAT devices to be SM-enabled (i.e. to know how to translate
SM-specific packets) for the purpose of scoping SM groups. If the NAT
is not SM-enabled, the join message will be filtered according to the
core (IP destination) address and hence forwarding state for (C,M)
will only be created in the defined scope. If the NAT device is
SM-enabled, data packets can be filtered based on the core address C
or the source address. In the case of SM dense mode,
C=255.255.255.255. If the NAT device is not SM-enabled, since the IP
destination address=255.255.255.255, the packets will be filtered.
Hence SM dense-mode traffic is scoped by default, i.e. no dense-mode
data packets will be forwarded across any boundary. If the NAT device
is SM-enabled, a dense-mode data packet is scoped according to its IP
source address. The source address is scoped in the same manner as the
core address.

If two scoped regions intersect topologically, then the address space
in the overlapped region cannot be used by the outer scope, as stated
in RFC2365. This applies here as well, i.e. a scoped group address
cannot have its core address in the address space of the overlapped
region, to avoid the problem of the same (C,M) belonging to different
scopes at the intersecting boundary. This implies that a core address
C, scoped within scope X, where scope X is inside scope Y, should be
unique within scopes X and Y, and no core within scope Y should have
that same address C. Further, any other addresses scoped within X
should not be visible to scope Y; all addresses scoped within Y are
visible to scope X.
This address separation is already maintained by NAT devices.

2.8.3 Multicast Scoping in IPv6

In IPv6, if a core address is a site-local scope address, then the
corresponding (C,*) will be site-local scope as well.

2.9 Additional Features

We are investigating the following additional features, which are not
available in other multicast protocols:

- the ability to select dense-mode. Currently there are routers that
implement dense mode and routers that implement sparse mode, and
typically a domain will implement either sparse or dense mode. There
is no way to choose, per application, which type of tree is more
appropriate.

There are cases in which dense mode makes more sense for an
application. For example, dense mode is more appropriate if the
receivers are so densely distributed that very little optimization is
gained by creating a tree. Dense mode is also appropriate when the
volume of data is sufficiently low that optimizing its delivery is
not worth the overhead of creating and maintaining a tree.

With SM we use the convention of core=FF:FF:FF:FF to indicate that
the packet should be sent via dense-mode. For such packets no tree is
formed and routers merely forward the packet using reverse path
forwarding. As in DVMRP, states (S,M), where S is the source IP
address, are created for dense mode groups.

Routers find out whether their neighbors support SM, and other
characteristics of their neighbors, through Hello messages. A dense
mode SM-packet should only be sent to SM-aware neighbors. As with
DVMRP, tunnels can be configured between SM-aware nodes to enable a
wider range for delivery of dense-mode SM packets.

- the ability to join a set of groups. The join message contains (C,
M, mask). That facilitates having content parameterized by M.
For instance, if the set of groups (C,*) is for stock information,
certain bits in M can encode industry, country, etc. To receive
information about all stocks, join (C,*). To receive some subset,
join a more specific (M, mask) for core C.

2.10 SM Issues

2.10.1 Host API and Kernel Changes

The SM architecture requires changes to the host Application
Programming Interface (API) and kernel. A host may join a group using
either an SM Join, in which the host sends joins in the same way as
an SM router, or IGMP extended to carry the core address as well as a
class-D address. As noted before, a host SM Join should only be used
where appropriate, e.g. when there is no local SM support.

Taking the BSD Sockets API as an example, joining a group is achieved
using a system call; the data structure passed with the system call
as an argument only supports the specification of a class-D address
and interface (IP) address. For SM this data structure needs
modifying to include a core address element, which can be
concatenated with the class-D address to form SM's 8 byte group
identifier. The kernel SM software, or IGMP software, can then make
use of this information to generate an SM join message, or IGMP
Report, respectively.

Similarly, when data is sent to a group, the data structure passed to
the send system call must include a core address. The kernel SM
software can then place this core address in the SM header. When an
SM packet (identified by the IP protocol field) is received, the
kernel SM software is invoked and the SM header is removed before the
packet is sent to the upper layer.

2.10.1.1 Extending IGMP

While not necessary, we propose using TLV (type-length-value)
encoding in IGMP Membership Report messages. It is anticipated that
IGMP will be extended for various purposes in future. The use of TLV
will facilitate that.
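
The 8 byte group identifier of Section 2.10.1 and the (C, M, mask)
set join of Section 2.9 can be sketched in Python as follows. The
function names group_id and in_joined_set are hypothetical, and the
interpretation of the mask as a bitwise mask applied to M is an
assumption; the draft does not spell out the comparison.

```python
# Illustrative sketch only; names and the bitwise-mask interpretation
# are assumptions.
import ipaddress


def group_id(core, mcast):
    """Concatenate core address C and class-D address M into 8 bytes."""
    return (ipaddress.IPv4Address(core).packed
            + ipaddress.IPv4Address(mcast).packed)


def in_joined_set(mcast, joined_m, mask):
    """True if class-D address mcast is covered by a join for (M, mask).

    A mask of 0.0.0.0 corresponds to joining (C,*).
    """
    m = int(ipaddress.IPv4Address(mcast))
    j = int(ipaddress.IPv4Address(joined_m))
    k = int(ipaddress.IPv4Address(mask))
    return (m & k) == (j & k)


if __name__ == "__main__":
    print(group_id("192.0.2.1", "224.1.2.3").hex())
    print(in_joined_set("224.1.2.3", "224.1.0.0", "255.255.0.0"))
```

A modified join data structure would carry the same two addresses as
separate fields; the concatenation is shown only to make the 8 byte
size concrete.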

In addition to the class-D address, a field called the extended
address field, for lack of a better term, is defined to carry the
additional address required by IGMPv3, Express, SM and Distributed
Core Multicast (DCM). The IGMP Membership Report message is encoded
as follows:

   Type       Value
   Classic:   S,G (if IGMPv3 with source specific joins)
   Express:   S,E
   Simple:    C,M
   DCM:       (S),G where S is a list of channels

Hence the extended address field carries:
   i)   the source address for classical IP multicast (IGMPv3 with
        source specific joins)
   ii)  the source address for Express
   iii) the core address for SM
   iv)  the pointer to a list of channels for DCM.

Extending IGMP is perfectly feasible - it has been done before in
upgrading from IGMPv1 to IGMPv2, and changes will be required for
IGMPv3 if it gains wider acceptance. The kernel modifications
required to support SM are mainly to handle the additional address
field. The host API change itself requires only the addition of two
parameters. We do not, therefore, consider host changes as barriers
to SM deployment.

2.10.2 Layer 2 Filtering

In conventional IP multicast, each class-D address could be mapped to
a distinct MAC address if 28 bits were available at the MAC layer for
mapping. However, since only 23 bits of the MAC address are used for
mapping, 32 IP multicast addresses could potentially be mapped to one
MAC layer address. Hence higher layer filtering of multicast packets
is required.

If the low-order 4 bytes of the SM group identifier, i.e. the class-D
address, are similarly mapped, there is the potential for each of a
subnet's hosts to join different SM groups, with their group-ids
differing only in the core address portion of the group-id.
In this worst-case scenario the transmission of packets to one group
will be received by hosts belonging to all other SM groups on the
subnet; a group's packets only become distinguishable at the hosts'
network layers. In a more realistic case we might reasonably expect
only a small percentage of a subnet's hosts to receive packets
unnecessarily.

One possible way to reduce the amount of filtering at the network
layer would be to statically map the core address to a multicast
layer 2 address, if we assume groups associated with a core are likely
to be related. This would still potentially incur higher layer
filtering of undesired groups, but only those hosts subscribed to
group(s) associated with a particular core would be affected.

The problem of mapping a larger-than-usual network identifier to a
layer 2 address is not unique to SM - the problem manifests itself in
IPv6 and EXPRESS.

One possible way of guaranteeing layer-2 multicast destination
address uniqueness would be to have special node(s) map a unique
layer 2 address to each group-id. Before a node could send, receive or
forward data, it would have to obtain the layer 2 address. IGMP can be
extended for this purpose.

Another possible solution is to have hardware filter based on a group
address at a specific offset and of a specific length. The NIC would
be snooping the IP header, but software should be able to program it
to filter addresses at the desired offset.

3.0 Packet formats

This section describes all the packet formats. Simple Multicast could
be implemented as very small modifications to PIM, CBT, or BGMP.

The packet types are:

- data packet

- join-request

- join-ack

- keep-alive (sent by child to parent)

- heartbeat (sent by parent to child)

- flush-tree (sent by parent to child after a loop is detected, to
clear out state from the looped tree as quickly as possible and cause
the subtree to be reformed)

For all control packets (JOIN-REQUEST, JOIN-ACK, KEEP-ALIVE,
HEARTBEAT, FLUSH-TREE), the "Protocol" field in the IPv4 header is
set to SM (a new protocol value).

3.1 SM-'tunnels'

Upstream (towards the core) or downstream SM routers may not be
immediate neighbors, if there are non-SM routers on the path between
them. In a traditional tunnel between R1 and R2, R1 must add an
extra IP header, and R2 must delete the header. SM gets the same
functionality without adding and deleting headers. Instead all that
is needed is to overwrite the destination address in the IP header
with the address of the "tunnel" endpoint. The reason this can be done
is that the information necessary for SM-routers to route the packet
(namely C and M) is contained in the SM header.

JOIN-REQUESTs and JOIN-ACKs allow tunnel-endpoints to learn of each
other. The state for a "tunnel" consists of the IP address of the
endpoint, and the number of actual IP hops in the tunnel. The hop
count is kept because SM counts the length of the tree, so that
senders can know what to set as the TTL in data packets.
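
The "tunnel" state and the destination rewrite described in Section
3.1 can be sketched as follows. The class and function names are
hypothetical, and a packet is modelled as a plain dictionary purely
for illustration. The hop computation uses the rule given with the
Originating TTL option in the JOIN-REQUEST section (Originating TTL
minus the TTL in the received IP header).

```python
# Illustrative sketch only; all names and the dictionary packet model
# are assumptions.
from dataclasses import dataclass


@dataclass
class Tunnel:
    endpoint: str  # IP address of the SM router at the far end
    hops: int      # number of actual IP hops between the two SM routers


def tunnel_hops(originating_ttl, received_ttl):
    """Hops in the 'tunnel' = Originating TTL - TTL in the IP header."""
    return originating_ttl - received_ttl


def rewrite_for_tunnel(packet, tunnel):
    """Overwrite only the IP destination; C and M stay in the SM header."""
    out = dict(packet)
    out["dst"] = tunnel.endpoint
    return out


if __name__ == "__main__":
    t = Tunnel(endpoint="198.51.100.9", hops=tunnel_hops(64, 61))
    pkt = {"src": "192.0.2.5", "dst": "ALL-SM-NODES",
           "core": "192.0.2.1", "group": "224.1.2.3"}
    print(t.hops, rewrite_for_tunnel(pkt, t)["dst"])
```

No header is added or removed, which is the point of the SM approach:
the far endpoint still finds C and M in the SM header.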

3.2 Data Packet Header

IP Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |  Protocol =   |         Header Checksum       |
|               |  IPPROTO_SM   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

SM Header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Core Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Multicast Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Protocol=egUDP |   Core Mask   |                             |L|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The SM header includes C, M, and the loop-detect flag. C=FF:FF:FF:FF
indicates that the packet should be delivered in dense mode.

The 'L' bit in the flags, if set, indicates that the TTL for this
packet should never reach 0 (see Section 2.5.2, Loop).

The IP Destination address is ALL-SM-NODES except in the following
cases:

- when a non-member sender transmits the packet, the destination is
set to the core address. The purpose of this is to enable the packet
to be unicast until it hits a node that is SM-aware, at which point
the packet is multicast along the tree from the point at which it
entered the tree.
Note that if the non-member sender has joined the group as a
'sender-only' (c.f. uni-directional join in CBT), then the destination
address in the data packet is either ALL-SM-NODES or the tunnel
endpoint (as described below).

- when the packet is transmitted on a tunnel port, in which case the
destination address is set to the IP address of the tunnel endpoint.

Note that at Layer 2, the MAC address is derived from the Multicast
Address M of the group (C,M), not from ALL-SM-NODES.

3.2 JOIN-REQUEST

The following control packet header fields are as defined in CBT:
addr_len, checksum, Payload Length and # of options.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  vers |type=1 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Join Originating Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The destination IP address in the IP header is the Core Address. The
JOIN-REQUEST is sent with the Router Alert Option.

The Multicast address and corresponding mask (M,m) may appear
multiple times. The total length of these fields is specified in the
"addr_len" field of the common control header.

The JOIN-REQUEST may contain the following options:

- Originating TTL. This field is set to the TTL in the IP header of
this JOIN-REQUEST packet.
The receiving SM router ignores this option unless the control packet
is from an SM router that is not an immediate neighbor. The value in
this field is used to calculate the number of hops in a 'tunnel' =
Originating TTL - TTL in the IP header for this packet. The value
derived is placed in "# of hops in tunnel from you to me" in the
JOIN-ACK message.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |        Originating TTL        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- Sender-Only. The join is successful only if the sender is on the
Include Senders List or NOT on the Exclude Senders List. The sender is
attached to the tree as per the uni-directional Join in CBT.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |       2       |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.3 JOIN-ACK

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  vers |type=2 |   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |     # of hops in 'tunnel'     |
|               |               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Join Originating Router                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The destination IP address in the IP header is the IP source address
of the JOIN-REQUEST, i.e. the downstream router. The JOIN-ACK is sent
with the Router Alert Option.

The Multicast address and corresponding mask (M,m) may appear
multiple times. The total length of these fields is specified in the
"addr_len" field.

The field "# of hops in tunnel from you to me" is ignored unless the
control packet is from an SM router that is not an immediate neighbor.
The value in this field is saved as state for this tunnel port.

The options from the JOIN-REQUEST are copied into the JOIN-ACK, with
the exception of the "Originating TTL" option. The Originating TTL is
set to the TTL in the IP header of this JOIN-ACK packet.

3.4 KEEP-ALIVE

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  vers | type=3|   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 KEEP-ALIVE Originating Router                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The keep-alive message is sent from a child to a parent (towards the
core), and is sent only if a keep-alive has been received recently
from a child.
The destination IP address in the IP header is ALL-SM-NODES or the
tunnel endpoint address.

A single keep-alive can serve as many groups as fit into the list in
the packet.

(M,m) may appear multiple times. The total length of these fields is
specified in the "addr_len" field.

The KEEP-ALIVE may contain the following options:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |      10       |I|     reserved flag bits      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Include/Exclude Sender Prefix                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Include/Exclude Sender Mask                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- Include/Exclude Senders List that upstream routers should filter.
This option may appear multiple times. The 'I' bit is set if this is
an include sender list, and is zero if this is an exclude sender
list.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |      10       |           hop count           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Prune Time           |     # of hops in 'tunnel'     |
|                               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- KEEP-ALIVE Option. This option should appear the same number of
times as the address set (C,M,mask). Each occurrence corresponds to,
and applies to, one address set (C,M,mask).

The fields in this option are:

- hop count: the number of hops to the furthest leaf for (C,M,mask).
The hop count is incremented at every SM hop. In addition, when the
KEEP-ALIVE is received from a tunnel port, hop count = hop count +
number of hops in 'tunnel'.

- Prune Time for (C,M,mask): the time after which, if no KEEP-ALIVE is
received for group (C,M,mask), the parent should prune off this
branch.

- 'Originating TTL'. This is as described in JOIN-REQUEST.

3.5 HEARTBEAT

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  vers | type=4|   addr len    |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Payload Length | # of options  |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 HEARTBEAT Originating Router                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        core address C                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Multicast address M                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Multicast address mask m                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  option type  |  option len   |        option value...        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The heartbeat is sent by a parent to a child. It is sent periodically
regardless of whether a heartbeat is received from its own parent. The
destination IP address is set to ALL-SM-NODES or the tunnel endpoint
address.

The HEARTBEAT may contain the following additional options:

- Include/Exclude Senders List. This is the list of allowed/prohibited
senders to the group. The format of this option is the same as that
of the KEEP-ALIVE Include/Exclude Senders List, although it serves a
different purpose here.

- spin-off groups (Ci,Mi). One or more spin-off groups (Ci,Mi) may be
specified.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |  #Groups x 8  |      reserved flag bits       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Core Address Ci                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Multicast Address Mi                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- HEARTBEAT Option. This option should appear the same number of
times as the address set (C,M,mask). Each occurrence corresponds to,
and applies to, one address set (C,M,mask).

The fields in this option are:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       2       |       6       |         core distance         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Time To Shutdown        |     # of hops in 'tunnel'     |
|                               |        from you to me         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|A|                          reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

- core distance: the number of hops to the core for (C,M,mask). The
core distance is incremented at every SM hop. In addition, when the
HEARTBEAT is received from a tunnel port, core distance = core
distance + number of hops in 'tunnel'.

- Time To Shutdown: the time left before the group should be closed
down (all 'ones' indicates the group should not be torn down).

- The 'A' bit, if set, indicates the core is alive or reachable.

- 'Originating TTL'. This is as described in JOIN-ACK.
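
The control packets above share the same fixed fields ahead of their
options. A minimal Python sketch of decoding those fields is given
below. The field widths (4-bit vers and type, 8-bit addr len, 16-bit
checksum, 8-bit Payload Length and # of options, 16-bit final field)
are an assumption read off the diagrams and the CBT common control
header; the names FIXED, TYPE_NAMES and parse_control_header are
hypothetical. Repeated (M,m) pairs and options are not handled.

```python
# Illustrative sketch only; field widths and all names are assumptions.
import ipaddress
import struct

TYPE_NAMES = {1: "JOIN-REQUEST", 2: "JOIN-ACK", 3: "KEEP-ALIVE",
              4: "HEARTBEAT", 5: "FLUSH-TREE"}

# vers/type, addr len, checksum, payload len, # options, last 16 bits,
# originating router, core C, multicast M, mask m (24 bytes in total).
FIXED = struct.Struct("!BBHBBH4s4s4s4s")


def parse_control_header(data):
    """Decode the fixed part of an SM control packet into a dict."""
    (vers_type, addr_len, checksum, payload_len, n_options, last16,
     router, core, group, mask) = FIXED.unpack_from(data)
    ptype = vers_type & 0x0F
    return {
        "vers": vers_type >> 4,
        "type": TYPE_NAMES.get(ptype, ptype),
        "addr_len": addr_len,
        "checksum": checksum,
        "payload_len": payload_len,
        "n_options": n_options,
        # Reserved, except in JOIN-ACK where it holds the tunnel hops.
        "reserved_or_hops": last16,
        "router": str(ipaddress.IPv4Address(router)),
        "core": str(ipaddress.IPv4Address(core)),
        "group": str(ipaddress.IPv4Address(group)),
        "mask": str(ipaddress.IPv4Address(mask)),
    }


if __name__ == "__main__":
    sample = FIXED.pack(
        (1 << 4) | 4, 8, 0, 0, 0, 0,
        ipaddress.IPv4Address("192.0.2.7").packed,
        ipaddress.IPv4Address("192.0.2.1").packed,
        ipaddress.IPv4Address("224.1.2.3").packed,
        ipaddress.IPv4Address("255.255.255.255").packed,
    )
    print(parse_control_header(sample))
```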
3.6 FLUSH-TREE

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | vers  | type=5|   addr len    |           checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Payload Length | # of options  |           reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 HEARTBEAT Originating Router                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        core address C                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Multicast address M                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Multicast address mask m                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  option type  |  option len   |        option value...        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The destination IP address is set to ALL-SM-NODES or the tunnel
   endpoint address.

   The Multicast address and corresponding mask (M,m) may appear
   multiple times. The total length of these fields is specified in the
   "addr_len" field of the common control header.

   No options are currently defined.

4 Acknowledgments

   Many people have contributed ideas to this proposal, including
   Harald Alvestrand, Joel Halpern, and Fred Baker. Because SM is based
   on previous work in IP Multicast, the authors are grateful to
   everyone who has contributed to the development of IP Multicast. We
   would like to thank all members of IDMR, in particular Dino
   Farinacci, Mark Handley, Brad Cain, Dave Thaler, Russ White, and Ken
   Carlberg, whose helpful comments have improved this proposal. Others
   who have provided helpful technical information include Matthew Yuen
   and Patrick Lee.
References

   DNS Based RP Placement scheme,
   Dino Farinacci's presentation in the MBONED WG, 40th IETF Meeting

   Static Multicast, Internet-Draft, March 1998
   M. Ohta, J. Crowcroft

   Express,
   IDMR Mailing List discussion

   CBT, Core Based Tree Multicast Routing,
   Ballardie, Cain, Zhang

   PIM-SM, Protocol Independent Multicast-Sparse Mode Specification,
   RFC 2117, June 1997
   Estrin, Farinacci, Helmy, Thaler, Deering, Handley,
   Jacobson, Liu, Sharma, and Wei

   BGMP, Border Gateway Multicast Protocol Specification,
   Thaler, Estrin, Meyer

   MASC, Multicast Address Set Claim Protocol,
   Estrin, Handley, Kumar, Thaler

   IGMP, Internet Group Management Protocol, Version 3,
   Cain, Deering, Thyagarajan

   "A Border Gateway Protocol 4 (BGP-4)",
   RFC 1771, March 1995
   Rekhter, Y., and T. Li

   "Multiprotocol Extensions for BGP-4",
   RFC 2283, February 1998
   Bates, T., Chandra, R., Katz, D., and Y. Rekhter

   "The IP Network Address Translator (NAT)",
   RFC 1631, May 1994
   Egevang, K., and P. Francis

   "Administratively Scoped IP Multicast",
   RFC 2365, July 1998
   Meyer, D.

   Distributed Core Multicast,
   L. Blazevic, J-Y. Le Boudec

   OGMP,
   ftp://cs.ucl.ac.uk/darpa/ogmp.ps.gz

Authors' Addresses

   Radia Perlman
   Sun Microsystems Laboratories
   2 Elizabeth Drive
   Chelmsford, MA 01824
   Radia.Perlman@sun.com

   Cheng-Yin Lee
   Nortel Networks
   PO Box 3511, Station C
   Ottawa, ON K1Y 4H7, Canada
   leecy@nortel.com

   Tony Ballardie
   Research Consultant
   aballardie@acm.org

   Jon Crowcroft
   Department of Computer Science
   University College London
   Gower Street
   London, WC1E 6BT, UK
   J.Crowcroft@cs.ucl.ac.uk

   Zheng Wang
   Bell Labs Lucent Technologies
   101 Crawfords Corner Road
   Holmdel NJ 07733
   zhwang@bell-labs.com

   Thomas Maufer
   3Com Corporation
   5400 Bayfront Plaza
   Santa Clara, CA 95052
   maufer@3com.com

   Christophe Diot
   Sprint ATL
   1 Adrian Court
   Burlingame CA 94010
   USA
   cdiot@sprintlabs.com

   Joseph Thoo
   Nortel Networks
   PO Box 3511, Station C
   Ottawa, ON K1Y 4H7, Canada
   jthoo@nortel.com

   Mark Green
   @Home Networks
   markg@corp.home.net