idnits 2.17.1 

draft-perlman-simple-multicast-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 33
     longer pages, the longest (page 2) being 60 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 34 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 7 instances of too long lines in the document, the longest one
     being 5 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 333 has weird spacing: '... random  gener...'

  == Line 353 has weird spacing: '...N. This  messa...'

  == Line 1182 has weird spacing: '...  times  as th...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'MBGP' on line 439 looks like a reference

  -- Missing reference section? 'MZAP' on line 694 looks like a reference

  -- Missing reference section? 'RFC2365' on line 701 looks like a reference


     Summary: 9 errors (**), 0 flaws (~~), 6 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                     R. Perlman
3	INTERNET DRAFT                                      Sun Microsystems
4	February 1999                                       C-Y Lee
5	                                                    Nortel Networks
6	                                                    A. Ballardie
7	                                                    Research Consultant
8	                                                    J. Crowcroft
9	                                                    UCL
10	                                                    Z. Wang
11	                                                    Lucent Technologies
12	                                                    T. Maufer
13	                                                    3Com Corporation
14	                                                    C. Diot
15	                                                    Sprint
16	                                                    J. Thoo
17	                                                    Nortel Networks
18	                                                    M. Green
19	                                                    @Home Networks

21	    Simple Multicast: A Design for Simple, Low-Overhead Multicast

23	            <draft-perlman-simple-multicast-02.txt>

25	Status of this memo

27	     This document is an Internet-Draft and is in full conformance
28	     with all provisions of Section 10 of RFC2026.

30	     Internet-Drafts are working documents of the Internet Engineering
31	     Task Force (IETF), its areas, and its working groups.  Note that
32	     other groups may also distribute working documents as
33	     Internet-Drafts.

35	     Internet-Drafts are draft documents valid for a maximum of six
36	     months and may be updated, replaced, or obsoleted by other
37	     documents at any time.  It is inappropriate to use Internet-
38	     Drafts as reference material or to cite them other than as
39	     "work in progress."

41	     To view the list of Internet-Draft Shadow Directories, see
42	     http://www.ietf.org/shadow.html.

44	Abstract

46	   This paper describes a design for multicast that is simple to
47	   understand and has low enough router overhead that a single scheme
48	   can work both within and between domains. It also eliminates the need
49	   for coordinated multicast address allocation across the Internet. It
50	   is not very different from the tree-based schemes CBT, PIM-SM, and
51	   BGMP. Essentially all of the mechanisms to support this have already
52	   been implemented in the other designs. The contribution of this
53	   protocol is in what is NOT required to be implemented.

55	   The main idea for simplifying multicast is to consider the identity
56	   of a group to be the 8-byte combination of a "core node" C, and the
57	   multicast address M. The identity of the group is carried in join
58	   messages and data messages. M no longer has to be unique across the
59	   Internet. It only has to be unique per C. The other idea, which is
60	   independent of the first, is to build a bi-directional tree (as is
61	   done in CBT and BGMP) instead of building per-source trees from each
62	   sender.  This reduces the state necessary in routers to support
63	   multicast.

65	Changes from revision 1
66	   - use a Simple Multicast (SM) header instead of a new IP option

68	   - modified branch creation and deletion to avoid loops

70	   - added tree splicing mechanism

72	   - added multicast scoping

74	   - allow both IGMP and host SM Join

76	   - added sender only joins

78	   - third party independence

80	   - layer 2 filtering

82	   - host API and kernel changes

84	1.0 Introduction

86	   IP Multicast has been around for over a decade, and several multicast
87	   protocols have been developed over the years. However, the solutions
88	   are either difficult to understand or expensive to deploy or both. In
89	   particular, we believe that multicast address allocation protocols
90	   are too complex and BGMP in combination with MASC will not scale
91	   easily.

93	   In this paper, we present a design we call Simple Multicast that
94	   reduces the complexity and overhead of multicast. It is not really
95	   "yet another multicast protocol". Instead, it is more like a subset
96	   of other protocols, with one variation: the identifier of a group
97	   consists of both C (the core) and M (the multicast address).
98	   This eliminates the need to have unique multicast addresses and
99	   coordinate multicast addresses across the Internet.

101	1.1 Previous Work

103	   DVMRP was the first multicast routing protocol proposed. It uses a
104	   simple mechanism of flooding and pruning.

106	   The scalability issues with DVMRP led to the development of CBT. In
107	   CBT, a multicast group is formed by choosing a distinguished node,
108	   the "core", and having all members join by sending special join
109	   messages towards the core. The routers along the path keep state
110	   about which ports are in the group. If a router along the path of the
111	   join already has state about that group the join does not proceed
112	   further. Instead the router just "grafts" the new limb onto the tree.
113	   The result is a tree of shortest paths from the core, with only the
114	   routers along the path knowing anything about that group.

116	   In PIM-SM, each node could independently decide whether the volume of
117	   traffic from a particular source is worth switching from a shared
118	   tree to a per-source tree.  Thus, there are two possible trees for
119	   traffic from a particular source for group M; the shared tree and the
120	   source tree. To prevent loops, the shared tree had to be
121	   unidirectional, i.e., to send to the shared tree, the data has to be
122	   encapsulated and unicast to the core.

124	   The other issue that makes current protocols complex is the necessity
125	   for routers to be able to figure out the location of the core based
126	   solely on the multicast address M.  In PIM-SM, this resulted in a
127	   protocol whereby "core-capable" routers are being continuously
128	   advertised. All routers keep track of the current set of live core-
129	   capable routers, and there is a hashing function to map a multicast
130	   address to one of the set of core-capable routers. This advertisement
131	   protocol is confined to within a domain because it was recognized
132	   that this mechanism would not scale to the entire Internet.

134	   For inter-domain multicast, a set of new protocols has been proposed.
135	   The MASC protocol deals with hierarchical block allocation of Class D
136	   address space.  Essentially, it creates a prefix structure in
137	   multicast address space in a way similar to unicast address space.
138	   Because of the limited multicast address space, the allocation has to
139	   be dynamic.  MASC contains mechanisms for collision detection and
140	   de-allocation. Once a block of multicast addresses is allocated, and
141	   no collision is detected for a period of time, the address block is
142	   then given to MAAS servers for actual assignment to multicast groups.
143	   The address block has to be propagated through BGP+ so that routers
144	   throughout the Internet can know the mapping of multicast addresses
145	   to cores, even in other domains. BGMP then uses this information to
146	   know the direction in which a join to multicast address M should be
147	   sent.

149	1.2 Overview of Simple Multicast

151	   The Simple Multicast proposal tries to reduce or eliminate some of
152	   the complexity and overhead of multicast by taking a slightly
153	   different approach.  The basic idea in Simple Multicast is that a
154	   multicast group is created by generating:

156	   - a distinguished node C known as the "core"

158	   - a multicast address M

160	   The multicast group is then identified by the pair (C,M) rather than
161	   just M as in conventional IP multicast. Note that the address M does
162	   not have to be unique across the Internet now. Instead, only the pair
163	   (C,M) has to be unique. That means that every node C in the Internet
164	   can assign the full 28 bits worth of multicast addresses.
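
	   As an illustration of the (C,M) identity, the following is a minimal
	   Python sketch, assuming IPv4 addresses for both C and M. The name
	   group_id and the byte order (C first, then M) are assumptions made
	   for the example; this section does not define a wire format.

```python
import ipaddress
import struct


def group_id(core: str, mcast: str) -> bytes:
    """Return an 8-byte group identity: 4 bytes of C, then 4 bytes of M.

    The ordering is an assumption of this sketch, not taken from the draft.
    """
    c = ipaddress.IPv4Address(core)
    m = ipaddress.IPv4Address(mcast)
    if not m.is_multicast:
        raise ValueError(f"{mcast} is not a class-D (multicast) address")
    return struct.pack("!II", int(c), int(m))


# Two different cores may reuse the same M; the group identities differ.
group_a = group_id("192.0.2.1", "224.1.2.3")
group_b = group_id("198.51.100.7", "224.1.2.3")

# Every core can hand out the full 28-bit class-D space on its own.
ADDRESSES_PER_CORE = 2 ** 28
```

	   Because the core address is part of the key, the two groups above
	   never collide even though they share the same M.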

166	   In Simple Multicast, multicast address allocation and core placement
167	   (i.e., choosing a multicast address M and a core C for a multicast
168	   group) are taken out of the basic multicast protocol. End systems may
169	   find out about the multicast address M and the core C for a group
170	   through one of several possible mechanisms including email
171	   announcement, web advertising, SDR, DNS lookup etc.  Both SM-aware
172	   endnodes and SM-aware routers must recognize the combination of (C,M)
173	   as the identity of the group.

175	   Once the end systems have M and C, they then join the group by
176	   sending a special join message towards the core C, creating state in
177	   the routers along the path until the join packet hits the core or a
178	   router that is already on the tree for this multicast group. This
179	   creates a branch in the bi-directional distribution tree for the
180	   group. The current IGMP mechanism for joining groups is fine,
181	   provided that both C and M appear in the IGMP reply. Until IGMP is
182	   modified to support this, the join message itself can be sent from
183	   the end system. If both C and M appear in the join message, then the
184	   first hop router can initiate the join.

186	   To enable incremental deployment of Simple Multicast, we provide a
187	   mechanism for the join message to traverse non-SM aware routers. (See
188	   Joining a Group).

190	   The multicast tree formed is bi-directional, meaning that traffic can
191	   be injected from any point. The core is just another node in the
192	   tree.  The data packet contains both C and M, and routers look up the
193	   group based on the combination (C,M).

195	   Data packets would need to carry both C and M. There have been a few
196	   suggestions on how this may be done:  1) Define a new IP option and
197	   specify both C and M in it.  2) Define a new protocol and specify the
198	   new protocol in the 'protocol' field of the IPv4 header. Encapsulate
199	   the payload inside this new protocol.  This new protocol header will
200	   contain both C and M.  3) Map (C,M) to a unique class-D address on
201	   the data-link. The destination address of the data packet would be
202	   re-written to a unique class-D address before being forwarded on that
203	   data-link.

205	   Although option processing in general is more expensive, in this case
206	   the option processing is merely forwarding packets by looking at an
207	   extra IP address in the option field. In contrast, other IP options
208	   such as LSR, SSR and Router Alert are more involved.  Hence, from a
209	   purely technical point of view, the first and second approaches can
210	   be implemented in hardware and there is no significant difference
211	   between these two approaches. However, due to current hardware
212	   implementation convention, option processing is more likely done in
213	   software. As a result, we have opted to use the SM header instead.

215	   The third approach does not require data packets or join messages to
216	   carry the core address. SM nodes obtain the unique class-D address
217	   which maps to a group (C,M) from a special node(s) on the data-link.
218	   This approach is appealing because it allows SM applications to join
219	   a group by joining a class-D address just like conventional IP
220	   multicast. On the other hand, it also introduces concerns not unlike
221	   label switching, e.g. vulnerability to loops, ensuring the uniqueness
222	   of addresses at all times, ensuring all nodes on the LAN use the same
223	   address for a group at all times and address recycling, among others.
224	   In this approach, if a unique address on the data-link is not
225	   available for use, data cannot be forwarded. In contrast, if a packet
226	   cannot be label switched, it can be routed.  We are investigating the
227	   feasibility of this approach.

229	   The SM header will carry both C and M. The reason for carrying both C
230	   and M in the header instead of carrying at least one of them in the
231	   destination address is to allow SM aware routers to co-exist with
232	   non-SM aware routers. The destination address in the IP packet is set
233	   to a reserved multicast address, the ALL-SM-NODES, when sending to
234	   networks with SM aware routers.  This ensures that non-SM routers
235	   will not forward SM multicast data packets. When the packet must hop
236	   over non-SM routers, the IP destination address is set to the next
237	   SM-aware router in the path.

239	   A nice feature of Simple Multicast is that, since both C and M are in
240	   the SM header, the destination address in the IP packet can be
241	   replaced with the tunnel endpoint address, and packets can be
242	   'tunneled' with very little work. Instead of having to add and delete
243	   IP headers (if the packet is encapsulated IPIP), the only work is to
244	   write the tunnel endpoint address into the destination address of the
245	   IP header.

247	1.3 Why Simple Multicast

249	   We now discuss some of the advantages of Simple Multicast.

251	   - One protocol is all that is needed.  Currently, we need to deal
252	   with two sets of multicast protocols in order to support multicast in
253	   the Internet: DVMRP, PIM-DM, PIM-SM and CBT etc. for intra-domain
254	   multicast and MASC, MAAS and BGMP for inter-domain support. The
255	   beauty of the Simple Multicast proposal is that only one multicast
256	   protocol is needed for both intra-domain and inter-domain.  This is
257	   possible because Simple Multicast is designed to be scalable.

259	   - Scalability.  Simple Multicast is scalable to the global Internet.
260	   This scalability is achieved by using a trivial multicast address
261	   allocation scheme, decoupling core selection and discovery from the
262	   multicast protocol and using bi-directional trees.  If core discovery
263	   is decoupled from multicast routing protocols such as PIM-SM or CBT,
264	   these protocols would not have to use the bootstrap mechanism to
265	   discover and select cores, a mechanism generally considered to be not
266	   scalable.

268	   - Trivial multicast address allocation. IP Multicast address
269	   allocation is still an unresolved problem. Dynamically allocating
270	   addresses such that addresses are allocated in aggregatable blocks,
271	   while ensuring low probability of address collision (non-uniqueness)
272	   is non-trivial. In Simple Multicast, since (C,M) is the identifier
273	   for a multicast group, address assignment becomes totally trivial,
274	   since addresses only have to be unique per core. Each core can have
275	   the full 28-bit space (over 200 million addresses), so we have
276	   virtually unlimited multicast addresses. Each core can allocate
277	   these addresses independently without Internet-wide coordination.

279	   - Cost effective and efficient delivery trees.  It takes less state
280	   in routers to support a group with n senders with a single shared
281	   tree than with n per-sender trees. A bi-directional shared tree is as
282	   cost effective for delivery of traffic from source S, even if S is
283	   not the core, as a per-source tree rooted at S. The bi-directional
284	   shared tree is much more efficient for delivery of traffic from
285	   non-core source S than a unidirectional tree where the data from S
286	   must be tunneled to the core before being multicast.

288	   Bi-directional trees are more robust. In a unidirectional tree, the
289	   core is needed for relaying packets from all senders. If the core is
290	   down, the tree is gone. For a bi-directional tree, the core does not
291	   hold any particular significance. The core is just another node in
292	   the tree. If the core is down, the tree is merely partitioned and may
293	   still be used for traffic delivery if the application chooses to do
294	   so.

296	   - Incremental deployment.  Simple Multicast routers may be deployed
297	   alongside unicast routers and other multicast routers. Traffic is
298	   effectively tunneled (although the actual mechanism used is more
299	   efficient than tunnels) through routers which do not support Simple
300	   Multicast. Therefore a network manager may incrementally add Simple
301	   Multicast routers as multicast users spread in the network.

303	2.0 The Design

305	   In this section, we describe the design of Simple Multicast and its
306	   basic operations in detail.

308	2.1 Creating a Multicast Group

310	   To create a group, one needs to select a core address and a multicast
311	   address.

313	   Typically most applications consist of a single high-volume source.
314	   For those applications, the core should be the source. For others,
315	   any node close to any member of the group would be a logical choice
316	   for core. Because the tree-building strategy (like BGMP) uses a
317	   single exit point from a domain or any region separated from the rest
318	   of the Internet through expensive links, the traffic pattern
319	   resembles individual trees within domains hooked together with
320	   inter-domain paths. In other words, if S is in your domain, then you
321	   will receive traffic from S through a path internal to your domain
322	   even if the core of the group is outside the domain. Therefore, even
323	   if most of the members of the group are in Europe, and one member of
324	   the group is in Australia, and the Australian is chosen as the core,
325	   the tree will still be a very good tree. Traffic between the
326	   Europeans would be multicast through the tree confined within Europe,
327	   even though the core was in Australia.

329	   As the multicast addresses only need to be unique per core, each core
330	   has over 200 million multicast addresses for allocation. Once the
331	   core is chosen, some very simple mechanisms can be used to generate
332	   the multicast address for the chosen core, for example, querying the
333	   core for an address or random  generation as it is done in SDR (the
334	   collision rate will be significantly lower). Some permanent mapping
335	   of "well-known" addresses for popular groups is also feasible.
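
	   The random-generation option can be sketched as follows. This is
	   only an illustration: the function name random_group_address and
	   the in_use set are hypothetical, and the sketch ignores any
	   reserved class-D ranges a real allocator would have to skip.

```python
import ipaddress
import random

# First address of the class-D (multicast) range.
CLASS_D_BASE = int(ipaddress.IPv4Address("224.0.0.0"))


def random_group_address(in_use, rng=random):
    """Pick a class-D address that this core has not allocated yet.

    `in_use` is the per-core set of addresses already handed out; no
    coordination with any other core is needed.
    """
    while True:
        # 28 random bits select one address inside 224.0.0.0/4.
        m = ipaddress.IPv4Address(CLASS_D_BASE + rng.getrandbits(28))
        if m not in in_use:
            in_use.add(m)
            return m
```

	   Since uniqueness is only required per core, the retry loop only
	   has to consult the core's own allocations.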

337	2.2 Joining a Group

339	   To join a group, one first has to find the core address C and
340	   multicast address M. It is appropriate to have a variety of
341	   mechanisms. A web page advertising a "singles chat group" might
342	   advertise its (C,M) on its web page. Or a provider of some other sort
343	   of service, like stock quotes, might advertise on a web page.
344	   Ideally, clicking on the web page would cause M and C to be
345	   downloaded to the client machine, which would then join the group.
346	   Another mechanism, for instance when arranging a private conference,
347	   might be to be told about M and C via the telephone, or via email.
348	   Yet another mechanism is to have the group (together with a name or a
349	   description) advertised in a directory such as SDR.

351	   If IGMP is extended to support SM, the host sends a membership report
352	   for group (C,M). The SM DR is responsible for forwarding the join off
353	   the LAN. This  message is sent towards the core, creating state in
354	   the routers along the path, so that each router knows which ports are
355	   in the group (C,M).

357	   If there are no SM routers on the LAN, a host may send an SM Join
358	   itself. The destination IP address of the join message is set to the
359	   core IP address. If a non-SM router on the LAN receives the join
360	   message, it will forward it to the core. Data will be tunneled to
361	   this endnode by an upstream SM router.  As there could be potentially
362	   multiple tunnels to the LAN, host SM Join should only be used when
363	   there is no local SM support as may be the case during initial
364	   deployment or when there are too few local members to justify a
365	   network upgrade.  If the next hop towards the core on the LAN is an
366	   SM router, and if it is not an SM DR itself, it will redirect the
367	   join to the SM DR. In this case, if data is tunneled from upstream,
368	   it will be tunneled to the SM router that forwards the join off the
369	   LAN, instead of the endnode. [Note: This approach provides a
370	   migration path whereby as more SM routers are deployed on the LAN,
371	   fewer tunnels are used. It also allows the co-existence of IGMP (with
372	   or without SM support) and host SM Join during the migration
373	   process.]

375	   If a router receives a join for multicast group (C,M), and it
376	   already has state for (C,M), then it merely adds that port to its set
377	   of ports for (C,M) and does not forward the join further.  The result
378	   is a tree of shortest paths from the core to each member.  Each
379	   router on the tree has a database of (C,M, {ports}) that tells it,
380	   for group (C,M), the ports that data should be forwarded to.
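
	   The join handling described above can be sketched as follows. The
	   class and method names are hypothetical, and the sketch assumes the
	   caller already knows the unicast next-hop port towards C; it is an
	   illustration rather than a definitive implementation.

```python
from collections import defaultdict


class Router:
    """Minimal model of the per-group state (C, M, {ports})."""

    def __init__(self, is_core=False):
        self.is_core = is_core
        # Maps (C, M) to the set of ports that are on the tree.
        self.ports = defaultdict(set)

    def handle_join(self, group, in_port, port_to_core):
        """Process a join for `group` received on `in_port`.

        Returns the port on which to forward the join, or None when the
        join stops here (router already on the tree, or it is the core).
        """
        already_on_tree = group in self.ports
        self.ports[group].add(in_port)
        if already_on_tree or self.is_core:
            return None              # graft the new branch; do not forward
        self.ports[group].add(port_to_core)   # parent port towards C
        return port_to_core
```

	   A second join for the same (C,M) arriving on another port only
	   adds that port to the set and is not forwarded further.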

382	   The join message is sent with the Router Alert option. Since the join
383	   message has C as the destination address, if an intermediate router
384	   is not SM aware, it will just forward the join towards the core. When
385	   the join message reaches an SM-aware router R2, it looks at the IP
386	   source address of the join message, say R1. If R1 is a neighbor, R2
387	   adds the port from which the join was received to its list of ports
388	   for (C,M). If R1 is not a neighbor, R2 will send a join-ack to R1. If
389	   R2 is not a neighbor, R1 adds the 'tunnel port' to R2 as its 'parent
390	   port' for (C,M). If R2 is a neighbor, R1 just adds the port as its
391	   parent port for (C,M), since the packet will not need to be tunneled
392	   to get to R2.

394	   A non-member sender may join the group as a sender-only (cf uni-
395	   directional join in CBT). The sender will be on-tree and thus will be
396	   sending keep-alives and receiving heartbeat messages, and hence will
397	   be aware of core liveness. Data will not be forwarded to a
398	   sender-only branch.

400	2.3 Transmitting to multicast group (C,M)

402	   A sender who is a member of the group sends an IP packet with C and
403	   M in the SM header. The destination IP address is set to ALL-SM-
404	   NODES. This ensures non-SM aware nodes will ignore the packet. Only
405	   SM aware routers will forward the packet.

407	   A router that receives an SM packet looks up (C,M) in its forwarding
408	   table. If it knows about (C,M), it checks if the port it received the
409	   packet on is in its database. If not, it drops the packet. If so, it
410	   forwards the packet onto all the other ports listed in its database
411	   for (C,M). If the outgoing port is a tunnel port, the destination
412	   address of the IP header is replaced by the tunnel endpoint, and the
413	   packet will therefore travel across routers that are not SM-aware.
414	   At the other end of the tunnel, the SM-aware router will replace the
415	   destination address with ALL-SM-NODES, or with another tunnel
416	   endpoint's address, depending on whether the packet is being
417	   forwarded on a "real port" or a "tunnel port".
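
	   The forwarding rule can be sketched as follows. This is a hedged
	   illustration: the port model (strings for real ports, a
	   ("tunnel", endpoint) tuple for tunnel ports), the dict-based
	   packet, and the ALL_SM_NODES placeholder string are assumptions
	   of the example. Dropping packets for an unknown (C,M) is also an
	   assumption; the text above only covers the known-group case.

```python
# Placeholder: this section does not give the reserved address value.
ALL_SM_NODES = "ALL-SM-NODES"


def forward(state, group, in_port, packet):
    """Return (out_port, packet_copy) pairs for one received SM packet.

    `state` maps (C, M) to the set of on-tree ports.
    """
    ports = state.get(group)
    if ports is None or in_port not in ports:
        return []                    # off-tree arrival: drop
    copies = []
    for port in ports:
        if port == in_port:
            continue                 # never send back where it came from
        if isinstance(port, tuple) and port[0] == "tunnel":
            dst = port[1]            # rewrite destination to tunnel endpoint
        else:
            dst = ALL_SM_NODES       # real port: reserved multicast address
        copies.append((port, {**packet, "dst": dst}))
    return copies
```

	   Only the destination address changes per outgoing port; the SM
	   header carrying (C,M) is copied unmodified.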

420	   If you are not a member of the group but want to transmit to the
421	   group, you place C into the IP destination address, and put C and M
422	   in the SM header. The packet might travel all the way to the core,
423	   but if it instead hits an SM-aware router R with state about (C,M)
424	   before it gets to the core, R will inject the packet into the tree.
425	   A sender-only member may transmit like a member, but will not be
426	   receiving any packets for this group.
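
	   The non-member case can be sketched in the same style. The
	   function name and the returned tuples are hypothetical and serve
	   only to illustrate the decision an SM-aware router makes.

```python
def handle_non_member_packet(state, group):
    """Decide what to do with a packet addressed to the core C that
    carries (C, M) in its SM header.

    `state` maps (C, M) to the set of on-tree ports of this router.
    """
    core, _mcast = group
    if group in state:
        # This router is on the tree: inject the packet on every port
        # recorded for (C, M).
        return ("inject", set(state[group]))
    # No state for (C, M): keep forwarding towards C by unicast.
    return ("unicast", core)
```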

428	2.4 Inter-domain Multicast

430	   Simple Multicast works both for intra-domain and inter-domain
431	   multicast. Because the join message of Simple Multicast carries the
432	   core IP address, and unicast routing already knows how to reach any
433	   IP address, the join message will be delivered based on the unicast
434	   forwarding table.

436	   2.4.1 Incongruent unicast and multicast topologies

438	   Where the unicast and multicast topologies are incongruent, BGP-4+
439	   [MBGP] allows a network provider to specify a path for accepting
440	   multicast traffic independent of the path unicast traffic would
441	   traverse. In the figure below, AS1 may have a peering agreement with
442	   AS2 to forward its unicast traffic, but a peering agreement with AS3
443	   to forward multicast traffic. A join from AS1 towards any cores in
444	   AS4 would be sent via AS3. A finer granularity of policy may specify
445	   certain network or core ranges that AS3 would carry traffic for.

447	           AS2
448	         *     *
449	        *       *
450	      AS1       AS4
451	        *       *
452	         *     *
453	           AS3

455	   The join message to C should be routed towards the exit router
456	   specified by BGP-4+, for delivery of multicast traffic outside of the
457	   domain.

459	   2.4.2 "3rd Party" Independence

461	   For the case in which SM is used both within and between domains,
462	   joins from different parts of the domain might only converge (merge)
463	   outside the domain. It is not desirable for a domain to depend on
464	   another, "3rd party", domain for the distribution of internally
465	   sourced traffic to other internal receivers. It is therefore
466	   necessary to ensure that joins from different internal receivers
467	   merge at a common point inside the domain.

469	   BGP-4 operates on border routers (BRs) of transit domains, and
470	   ensures that all BRs know which of them acts as egress for a
471	   particular unicast prefix. In some transit domains, the elected
472	   egress router injects external route information internally, so
473	   internal routers know in which direction to forward packets destined
474	   to a particular unicast prefix. In other cases, and in stub domains,
475	   external route information is not injected inside the domain.
476	   Nevertheless, the BRs of these domains know for which unicast
477	   prefix(es) each of them is acting as egress. Thus, domain BR routing
478	   knowledge ensures that joins originated inside a domain converge at a
479	   common point inside the domain.

481	   This principle can be applied recursively across multiple levels of
482	   routing hierarchy.

484	2.5 Failure Recovery

486	   The situations to detect are:

488	   - branch unused

490	   - loop

492	   - path to core broken or changed

494	   - core dead or unreachable

496	   All of the tree building schemes (CBT, PIM-SM, BGMP) need to solve
497	   these problems, and there is no need to do anything radically new.
498	   The only extra mechanism we've introduced is for loop detection.
499	   Since packets can quickly proliferate in a multicast loop, it is
500	   desirable to detect a loop as soon as it forms.  Since SM
501	   uses an SM header, we can make use of a flag that will enable us to
502	   detect a loop on a data packet.

504	   The other mechanisms we specify are similar to those already in place
505	   for PIM, CBT, and BGMP.

507	2.5.1 Unused Branch

509	   A branch must be kept alive with a "keep-alive" message. If R
510	   receives at least one keep-alive message from a child in tree (C,M),
511	   R sends a keep-alive to its parent port for (C,M). If no keep-alive
512	   is received for some amount of time (at least a few keep-alive
513	   intervals) from some child port for (C,M), that port is removed from
514	   the list of ports. If there are no more child ports, then R stops
515	   sending keep-alives, or as an optimization "unjoins" from its parent.
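
	   The keep-alive bookkeeping can be sketched as follows. The
	   function name, the last_seen mapping and the default of three
	   missed intervals are assumptions of the example; the text above
	   only requires "at least a few" intervals.

```python
def expire_children(last_seen, now, interval, missed=3):
    """Remove child ports that have been silent for too long.

    `last_seen` maps each child port of (C, M) to the time its last
    keep-alive arrived. Returns True while at least one child remains,
    i.e. while this router should keep sending keep-alives to its parent.
    """
    stale = [p for p, t in last_seen.items() if now - t > missed * interval]
    for port in stale:
        del last_seen[port]
    return bool(last_seen)
```

	   When the function returns False, the router has no child ports
	   left and may stop sending keep-alives or unjoin from its parent.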

517	2.5.2 Loop

519	   It would be easy to detect a loop if we could assume that any data
520	   packet for which TTL became zero implied there was a loop.
521	   Unfortunately, some applications do an "expanding ring search" or a
522	   traceroute in which packets are launched with very small TTLs. It
523	   would be wrong to conclude there was a loop when the TTL on those
524	   packets expired.

526	   We use a flag in the SM header to indicate a packet that would
527	   indicate a loop if its TTL reached 0. An application launching a
528	   packet with a low TTL would not set that flag. SM routers do not need
529	   to look at the flag except on packets for which TTL expires.
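
   The check a router makes when a TTL expires could look like the
   sketch below. The function name is hypothetical, and the position
   of the flag (most significant bit of the 32-bit flags word) is an
   assumption read off the SM header layout in Section 3.

```python
L_FLAG = 0x80000000  # assumed: 'L' is the most significant flag bit


def ttl_expiry_indicates_loop(ttl, flags):
    """Return True if a packet whose TTL has reached 0 signals a loop.

    Packets sent with deliberately small TTLs (expanding ring search,
    traceroute) leave the flag clear, so their expiry is not a loop.
    """
    return ttl == 0 and bool(flags & L_FLAG)
```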

531	   Loops can also be detected on keep-alive and heartbeat messages
532	   (heartbeats are sent outwards from the core; see next section). The
533	   keep-alive message indicates "hops from furthest leaf". A router
534	   collects keep-alives from its child ports and transmits a keep-alive
535	   that is one hop more than the maximum "hops" it receives in any
536	   keep-alive from a child.

538	   The heartbeat is like a keep-alive, but from the parent. Likewise it
539	   carries a "distance from the core". In either case (heartbeat or
540	   keep-alive) if the distance gets too great a loop is suspected and
541	   the port is removed from the tree and the child rejoins to the core.
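
   A rough sketch of the distance bookkeeping follows. The threshold
   MAX_TREE_DEPTH is a made-up value (this section does not give one)
   and both function names are hypothetical.

```python
MAX_TREE_DEPTH = 64  # hypothetical limit; not specified in this section


def keepalive_hops(child_hops):
    """Hops value a router puts in its own keep-alive.

    child_hops holds the "hops from furthest leaf" values most recently
    received on each child port; it must not be empty.
    """
    return max(child_hops) + 1


def loop_suspected(distance, limit=MAX_TREE_DEPTH):
    """True if a keep-alive or heartbeat distance is implausibly large."""
    return distance > limit
```

   In a loop the advertised distance grows each time a message goes
   around, so it eventually exceeds any fixed limit.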

543	2.5.3 Path to core broken or changed

545	   A parent transmits a "heartbeat" message to its children at regular
546	   intervals. The heartbeat indicates whether the core is known to be
547	   alive. A parent continues sending heartbeat messages even if it stops
548	   receiving "core-alive" heartbeats from its parent. In this way a
549	   subtree will continue functioning even if the core is dead.  And if
550	   the core is not dead, the parent can simply rejoin without causing
551	   disruption to the nodes below it in the tree, where feasible.

553	   If unicast routing indicates the path to the core has changed, R
554	   rejoins to the core, again, without disrupting the subtree below it,
555	   where feasible.

557	   To keep loops from forming, the parent would rejoin the core using a
558	   special join to splice the sub-trees. This splice message must be
559	   forwarded all the way to the core, creating state where there is no
560	   existing state. The core will acknowledge the splice message.

562	   If the splice message hits a downstream router, it will be forwarded
563	   until it reaches the router that originated this splice message. At
564	   this point, the router would realize that it cannot splice the sub-
565	   trees without causing loops. Depending on application requirements,
566	   which the core conveys to routers via heartbeat messages, the router
567	   could either flush the sub-tree and let leaf routers or hosts rejoin
568	   or, if the application so desires, allow the sub-trees to continue
569	   functioning separately while attempting to splice them again when the
570	   unicast route to the core changes. The latter makes more sense when
571	   there is a network partition and the core is not reachable.

573	   Since the heartbeat message is generated at regular intervals even if
574	   a heartbeat is not received from the parent, a very long tree does
575	   not suffer from delay variance that might cause nodes very far from
576	   the core to incorrectly assume the tree was broken.

578	2.5.4 Core dead or unreachable

580	   When the core transmits a heartbeat message it sets the "core alive"
581	   flag. If a router has received a heartbeat message from its parent
582	   with the "core alive" flag set recently enough (3 heartbeat
583	   intervals), then it sets the "core alive" flag in its heartbeat
584	   messages to its children.
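
   The flag propagation rule can be sketched as below. The heartbeat
   interval is a hypothetical value and the function name is ours; only
   the "3 heartbeat intervals" window comes from the text.

```python
HEARTBEAT_INTERVAL = 5.0  # seconds; hypothetical value
CORE_ALIVE_INTERVALS = 3  # from the text: 3 heartbeat intervals


def core_alive_flag(is_core, last_core_alive_rx, now):
    """Decide the "core alive" flag for the next heartbeat a node sends.

    last_core_alive_rx is the time at which the node last received a
    heartbeat with "core alive" set from its parent, or None if it
    never has.
    """
    if is_core:
        return True
    if last_core_alive_rx is None:
        return False
    window = CORE_ALIVE_INTERVALS * HEARTBEAT_INTERVAL
    return now - last_core_alive_rx <= window
```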

586	   If it stops receiving heartbeats with "core alive", it prunes itself
587	   from the old parent and rejoins the core (by sending a splice
588	   message).

590	   The only purpose of knowing whether the core is alive or not is for
591	   applications to decide, if there are multiple trees for a group,
592	   which tree they should transmit on. (see next section)

594	2.5.5 Multiple Trees for Reliability

596	   The core should be selected to be a node that is reliable. However,
597	   if a group will be long-lived and there is the worry that the core
598	   might die, a simple mechanism is to create multiple trees (C1, M1)
599	   and (C2, M2) for this group. All members join both groups. They can
600	   transmit on either group. If "core alive" heartbeats are received
601	   only on group (C1, M1), that is the group to transmit on.

603	   For applications for which instantaneous switchover is more important
604	   than overhead, senders should transmit on both trees.
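
   One possible sender-side selection policy is sketched below. The
   function and parameter names are hypothetical, and picking the
   first live tree when several are alive is an arbitrary choice made
   for the example.

```python
def trees_to_send_on(core_alive, instant_switchover=False):
    """Pick the trees a sender transmits on.

    core_alive maps a tree identifier (C, M) to whether a "core alive"
    heartbeat is currently being received on that tree.
    """
    if instant_switchover:
        # Trade overhead for switchover speed: send on every tree.
        return list(core_alive)
    alive = [tree for tree, ok in core_alive.items() if ok]
    # Either live tree may be used; taking the first is arbitrary.
    return alive[:1]
```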

606	2.6 Access Control

608	   We accomplish access control by allowing the core for the group to be
609	   configured with the set of allowed senders. The core can put the
610	   access rules into the heartbeat message. The heartbeat message
611	   contains a list of address prefixes of authorized senders and
612	   unauthorized senders. If the rules do not fit into the heartbeat, or
613	   the core for privacy reasons does not want to advertise in advance
614	   all the allowed senders, it can specify that no sender other than
615	   itself is allowed. In that case, all senders must tunnel packets to
616	   the core and the core will forward them. Once a sender gets
617	   permission to send, and is known to have data to send, the core can
618	   add that sender's address to the heartbeat message.

620	   For example, if there is some sort of authentication that must be
621	   done in order to get permission, the core initially disallows all
622	   senders, but then when S1 gets permission, it gets added to the list
623	   in the heartbeat message.

625	   Since the heartbeat message gives the access rules, all SM routers
626	   will refuse to forward a packet from a sender disallowed by the
627	   access rules.
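
   A forwarding check against the advertised rules might look like
   the sketch below. How the two lists interact is not spelled out
   here, so the precedence used (an exclude match always refuses; a
   non-empty include list must match) is an assumption, as is the
   function name.

```python
import ipaddress


def sender_allowed(src, include, exclude):
    """Check a source address against prefix lists from the heartbeat.

    include and exclude are lists of prefix strings such as
    "10.1.0.0/16". Assumed precedence: exclude wins, then a non-empty
    include list must match, otherwise the sender is allowed.
    """
    addr = ipaddress.ip_address(src)
    if any(addr in ipaddress.ip_network(p) for p in exclude):
        return False
    if include:
        return any(addr in ipaddress.ip_network(p) for p in include)
    return True
```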

629	   Border/Access routers may also have an additional Access Control List
630	   locally.  For instance, they may have a list of sender
631	   prefixes/addresses allowed to transmit multicast data.  All multicast
632	   traffic with a source address matching these prefixes/addresses will
633	   not be filtered. The Include/Exclude Senders List from the core will
634	   prevent these senders from sending to a group that they are not
635	   permitted to send to.

637	2.7 Dynamically forming more trees

639	   In some cases dynamically formed auxiliary trees make sense,
640	   especially in the inter-domain, where policy might prohibit packets
641	   from A to D to transit domain B. With a core in domain B, or just due
642	   to the shared tree that happened to get formed, packets from senders
643	   in A to receivers in D might traverse domain B. One simple method of
644	   solving the problem is to have A unicast to the core, and have the
645	   core send the multicast. B is still acting as a transit domain
646	   between A and D, but it doesn't know it.

648	   Another solution takes inspiration from the PIM-SM concept of using
649	   the shared tree to find out about per-source trees. The way it works
650	   is that the sender in domain A, say X, sends a message to the core C
651	   telling it that it would like to create a "spin-off" group, (X,M').
652	   Then the core C, in the heartbeat messages for group (C,M) advertises
653	   the spin-off trees that members of (C,M) should also join. The spin-
654	   off tree would, like the original tree, be kept robust through keep-
655	   alives.

657	   Although this does allow creation of multiple trees to support a
658	   single group, this is less expensive than the PIM-SM scheme because
659	   it does not always create a tree for every sender. It only does it
660	   when necessary, and does not need a totally separate tree for each
661	   sender. It only needs one per domain in which there are sources (and
662	   only when the shared tree doesn't work because of transit policy
663	   problems).

665	2.8 Multicast Scoping

667	   A multicast group address can be scoped such that packets matching
668	   the group address are not forwarded outside the defined region.  Two
669	   commonly used scopes are the link-local scope and the global scope
670	   and they do not require configuration.  Routers merely do not forward
671	   the statically assigned link-local scope address (224.0.0.0/24).

673	   The third type of scoping requires network administrators to
674	   configure the perimeter (boundary routers) of the scoped region. This
675	   is called administratively scoped or local scope. At present, this is
676	   achieved by configuring multicast border routers (M-BRs) on a scope
677	   boundary with a boundary scope address range - so-called
678	   Administratively Scoped address range. Multicast traffic flows which
679	   are to be confined within a range must use a class-D address which is
680	   within the range. M-BRs are an impermeable boundary to any multicast
681	   packet with a class-D destination address that falls within any of
682	   their configured Administratively Scoped address ranges.

684	   It is perfectly feasible for SM to use exactly the same mechanism for
685	   achieving multicast scoping. However, multicast scoping as it is
686	   currently defined requires a significant amount of configuration, as
687	   well as co-ordination of the address space for defining scope
688	   boundary ranges.  Any mis-configurations can lead to multicast
689	   packets "leaking" across boundaries they should not.

691	   Multicast scope boundary configurations must conform to certain
692	   rules, such as the rule that boundaries must be completely contained
693	   within one another (the terms "nesting" and "convex" are often used).
694	   The MZAP protocol [MZAP] is implemented on M-BRs to detect
695	   inconsistent administratively scoped boundary configurations. As such
696	   it is essentially a network management tool; it does not correct
697	   mis-configurations.

699	   In SM, the group address (C,M) is scoped according to the unicast
700	   core address C. The advantage of this compared to Administratively
701	   Scoped IP Multicast [RFC2365] is that there is no requirement for
702	   these scoped addresses to be dynamically assigned (via AAP or MAAS)
703	   or announced in the scoped regions (MZAP).

705	2.8.1  Multicast Scoping using unicast boundaries and scope mask
706	   SM has the unique ability to take advantage of the unicast routing
707	   system boundaries (e.g.  subnet, area, AS, AS-Confederation etc.) and
708	   use these as "natural" boundaries for multicast traffic, obviating
709	   the need for the configuration of explicit multicast boundaries.
710	   Furthermore, one group identifier (C, M) can be used with multiple
711	   scopes. It works as follows: assume a (C, M) group identifier is to
712	   be used for scopes A and B, with A nested inside B. A and B are
713	   natural unicast routing boundaries, e.g. area and AS. A unicast
714	   routing system boundary is implicitly identified by a router
715	   aggregating routing information before propagating it over outgoing
716	   interfaces; this is achieved by shortening a prefix mask. For
717	   example, routing information inside boundary A has an associated mask
718	   of 24 bits. The boundary router between A and B reduces this to 16
719	   bits before propagating inside B.

721	   Now, if a SM data packet carried a "scope mask(len)" in the SM
722	   header, the data packet would not pass beyond any unicast routing
723	   system boundary that itself propagates a shorter mask in unicast
724	   route updates it sends. The general rule is: a SM data packet
725	   carrying a "scope mask(len)" is only forwarded over those interfaces
726	   that aggregate unicast routing information using a mask which is
727	   equal length or longer than that specified in the SM data packet
728	   header.

730	                                   |
731	                           (c) /16 | (d) /12
732	                                   |
733	                           --------+-------
734	                           (a) /8  | (b) /20
735	                                   |
736	                                   |

738	   The figure above illustrates a router with 4 interfaces, a, b, c, d,
739	   each of which is aggregating routes with the respective prefix. If a
740	   SM data packet arrives on interface (b) carrying a "scope mask(len)"
741	   of 12, it is forwarded only over interfaces (c) and (d).
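
   The forwarding rule and the four-interface example can be checked
   with a few lines of code. This is a sketch only; the function name
   and the dictionary representation of the interfaces are
   assumptions made for illustration.

```python
def scoped_out_interfaces(if_masks, in_if, scope_len):
    """Interfaces over which a scoped SM data packet may be forwarded.

    if_masks maps an interface name to the mask length it uses when
    aggregating unicast routes. A packet is forwarded only over
    interfaces whose mask length is equal to or longer than scope_len,
    and never back over the arrival interface.
    """
    return sorted(
        name
        for name, mask_len in if_masks.items()
        if name != in_if and mask_len >= scope_len
    )


# The router from the figure: (a) /8, (b) /20, (c) /16, (d) /12.
FIGURE_MASKS = {"a": 8, "b": 20, "c": 16, "d": 12}
```

   With these values, scoped_out_interfaces(FIGURE_MASKS, "b", 12)
   yields interfaces c and d, while interface a (/8) is excluded
   because its mask is shorter than 12.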

743	2.8.2  Multicast Scoping using private network boundaries

745	   A multicast session can be scoped within a private network if the
746	   core address belongs to the private address space and is not
747	   translated to any global address. In this case the boundary routers
748	   can be the filtering or NAT devices at the edge of the network. Since
749	   NAT devices can scope the addresses, the SM data packet itself does
750	   not have to carry the scope mask in the SM header.

752	   Note that for administrative scoping purposes, the function in the
753	   NAT device which is of interest here is the filtering and address
754	   space separation function, not the address translation function.  A
755	   public node will not be able to join a private core if the private
756	   core address is not mapped to any global address. As a result, no
757	   data packets for this scoped core will be forwarded out of the NAT
758	   device.

760	   If the boundary routers are NAT devices, there is no requirement for
761	   the NAT devices to be SM-enabled (i.e. able to translate SM-specific
762	   packets) for the purpose of scoping SM groups. If the NAT is
763	   not SM-enabled, the join message will be filtered according to the
764	   core (IP destination) address and hence forwarding states for (C,M)
765	   will only be created in the defined scope. If the NAT device is SM-
766	   enabled, data packets can be filtered based on the core address C or
767	   the source address. In the case of SM dense mode, C=255.255.255.255.
768	   If the NAT device is not SM-enabled, since the IP destination
769	   address=255.255.255.255, the packets will be filtered. Hence SM
770	   dense-mode traffic is scoped by default, i.e. no dense-mode data
771	   packets will be forwarded across any boundary. If the NAT device is
772	   SM-enabled, a dense-mode data packet is scoped according to its IP
773	   source address.  Source address is scoped in the same manner as core
774	   address.

776	   If two scoped regions intersect topologically, then the address space
777	   in the overlapped region cannot be used by the outer scope, as stated
778	   in RFC2365. This applies here as well, i.e. a scoped group address
779	   cannot have its core address in the address space of the overlapped
780	   region, to avoid the problem of the same (C,M) belonging to different
781	   scopes at the intersecting boundary. This implies a core address C,
782	   scoped within scope X, where scope X is inside scope Y, should be
783	   unique within scopes X and Y; and no core within scope Y should have
784	   that same address C. Further, any other addresses scoped within X
785	   should not be visible to scope Y; all addresses scoped within Y are
786	   visible to scope X. This address separation is already maintained by
787	   NAT devices.

789	2.8.3 Multicast Scoping in IPv6

791	   In IPv6, if a core address is a site-local scope address, then the
792	   corresponding (C,*) will be site-local scope as well.

794	2.9 Additional Features

796	   We are investigating the following additional features, which are not
797	   available in other multicast protocols:

799	   - the ability to select dense-mode. Currently there are routers that
800	   implement dense mode and routers that implement sparse mode, and
801	   typically a domain will implement either sparse or dense mode. There
802	   is no way to choose, per application, which type of tree is more
803	   appropriate.

805	   There are cases in which dense mode makes more sense for an
806	   application.  For example, dense mode is more appropriate if the
807	   number of receivers is so dense that there is very little
808	   optimization gained by creating a tree. Dense mode is also
809	   appropriate when the volume of data is sufficiently low that
810	   optimizing its delivery is not worth the overhead of creating and
811	   maintaining a tree.

813	   With SM we use the convention of core=FF:FF:FF:FF to indicate the
814	   packet should be sent via dense-mode. For such packets no tree is
815	   formed and routers merely forward the packet using reverse path
816	   forwarding.  As in DVMRP, states (S,M), where S is the source IP
817	   address, are created for dense mode groups.

819	   Routers find out whether their neighbors support SM, and other
820	   characteristics of their neighbors, through Hello messages. A dense
821	   mode SM-packet should only be sent to SM-aware neighbors. As with
822	   DVMRP, tunnels can be configured between SM-aware nodes to enable a
823	   wider range for delivery of dense-mode SM packets.

825	   - the ability to join a set of groups. The join message contains (C,
826	   M, mask). That facilitates having content parameterized by M. For
827	   instance, if the set of groups (C,*) is for stock information,
828	   certain bits in M can encode industry, country, etc. To receive
829	   information about all stocks, join (C,*). To receive some subset,
830	   join a more specific (M, mask) for core C.
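
   Matching a packet against a joined (C, M, mask) could be sketched
   as below. The mask semantics are an assumption: bits set in the
   mask must match, so an all-zero mask stands for (C,*). The function
   name is hypothetical, and comparison of the core address C is
   omitted.

```python
import ipaddress


def group_matches(joined_m, mask, packet_m):
    """True if packet_m falls in the set of groups joined as (M, mask).

    Assumed semantics: only the bits set in mask are compared.
    """
    j = int(ipaddress.IPv4Address(joined_m))
    k = int(ipaddress.IPv4Address(mask))
    p = int(ipaddress.IPv4Address(packet_m))
    return (p & k) == (j & k)
```

   For the stock example, a receiver interested in one industry would
   join with a mask covering only the bits that encode the industry.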

832	2.10 SM Issues

834	2.10.1 Host API and Kernel Changes

836	   The SM architecture requires changes to the host Application
837	   Programming Interface (API) and kernel. A host may join a group using
838	   either SM Join, where hosts send joins similarly to an SM router, or
839	   IGMP extended to carry the core address as well as a class-D address.
840	   As noted before, host SM Join should only be used where appropriate,
841	   e.g. when there is no local SM support.

843	   Taking the BSD Sockets API as an example, joining a group is achieved
844	   using a system call; the data structure passed with the system call
845	   as an argument only supports the specification of a class-D address
846	   and interface (IP) address. For SM this data structure needs
847	   modifying to include a core address element, which can be
848	   concatenated with the class-D address to form SM's 8 byte group
849	   identifier. The kernel SM software, or IGMP software, can then make
850	   use of this information to generate a SM join message, or IGMP
851	   Report, respectively.

853	   Similarly, when data is sent to a group, the data structure passed to
854	   the send system call must include a core address. The kernel SM
855	   software can then place this core address in the SM header.  When an
856	   SM packet (identified by the IP protocol field) is received, the
857	   kernel SM software is invoked and the SM header is decapsulated
858	   before being sent to the upper layer.

860	2.10.1.1 Extending IGMP

862	   While not necessary, we propose using TLV in IGMP Membership Report
863	   messages. It is anticipated that IGMP will be extended for various
864	   purposes in future. The use of TLV will facilitate that.

866	   In addition to the class-D address, a field called the extended
867	   address field, for lack of a better term, is defined to carry the
868	   additional address required in IGMPv3, Express, SM and Distributed
869	   Core Multicast (DCM). The IGMP Membership Report message is encoded
870	   as follows:

871	    Type     Value
872	    Classic: S,G (if IGMPv3 with source specific joins)
873	    Express: S,E
874	    Simple:  C,M
875	    DCM:     (S),G where S is a list of channels

876	   Hence the extended address field carries: i) the source address for
877	   classical IP multicast (IGMPv3 with source specific joins); ii) the
878	   source address for Express; iii) the core address for SM; iv) the
879	   pointer to a list of channels for DCM.

881	   Extending IGMP is perfectly feasible - it has been done before in
882	   upgrading from IGMPv1 to IGMPv2, and changes will be required for
883	   IGMPv3 if it gains wider acceptance. The kernel modifications
884	   required to support SM are mainly to handle the additional address
885	   field.  The host API change itself requires only the addition of two
886	   parameters.  We do not, therefore, consider host changes as barriers
887	   to SM deployment.

889	2.10.2 Layer 2 Filtering

891	   In conventional IP multicast, each class D could be mapped to a
892	   distinct MAC address if 28 bits were available at the MAC layer for
893	   mapping. However, since only 23 bits of the MAC address are used for
894	   mapping, 32 IP multicast addresses could potentially be mapped to one
895	   MAC layer address.  Hence higher layer filtering of multicast packets
896	   is required.
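
   The 32-to-1 ambiguity is easy to reproduce. The sketch below uses
   the conventional IPv4-over-Ethernet mapping (prefix 01:00:5e plus
   the low-order 23 bits of the class-D address); the function name is
   ours.

```python
import ipaddress


def multicast_mac(group):
    """Map a class-D address to its Ethernet multicast MAC address."""
    low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    return ":".join(f"{(mac >> s) & 0xFF:02x}" for s in range(40, -1, -8))


# 28 significant class-D bits minus 23 mapped bits leaves 5 bits unmapped.
ADDRESSES_PER_MAC = 2 ** (28 - 23)
```

   For example, 224.1.1.1 and 225.129.1.1 differ only in the unmapped
   bits and therefore share the MAC address 01:00:5e:01:01:01.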

898	   If the low-order 4 bytes of the SM group identifier (the class-D
899	   address) are similarly mapped, there is the potential for each of a
900	   subnet's hosts to join different SM groups, with their group-ids
901	   differing only in the core address portion of the group-id. In this
902	   worst-case scenario the transmission of packets to one group will be
903	   received by hosts belonging to all other SM groups on the subnet; a
904	   group's packets only become distinguishable at the hosts' network
905	   layers. In a more realistic case we might reasonably expect only a
906	   small percentage of a subnet's hosts to receive packets
907	   unnecessarily.

909	   One possible way to reduce the amount of filtering at the network
910	   layer would be to statically map the core address to a multicast
911	   layer 2 address if we assume groups associated with a core are likely
912	   to be related. This would still potentially incur higher layer
913	   filtering of undesired groups, but only those hosts subscribed to
914	   group(s) associated with a particular core would be affected.

916	   The problem of mapping a larger-than-usual network identifier to a
917	   layer 2 address is not unique to SM - the problem manifests itself in
918	   IPv6 and EXPRESS.

920	   One possible way of guaranteeing layer-2 multicast destination
921	   address uniqueness would be to have special node(s) map a unique
922	   layer 2 address to each group-id. Before a node could send, receive
923	   or forward data, it would have to obtain the layer 2 address. IGMP
924	   can be extended for this purpose.

926	   Another possible solution is to have hardware filter based on a group
927	   address at a specific offset and of a specific length. The NIC would
928	   be snooping the IP header, but software should be able to program it
929	   to filter addresses at the desired offset.

931	3.0 Packet formats

933	   This section describes all the packet formats. Simple Multicast could
934	   be implemented as very small modifications to PIM, CBT, or BGMP.

936	   The packet types are:

938	   - data packet

940	   - join-request

942	   - join-ack

944	   - keep-alive (sent by child to parent)

946	   - heartbeat (sent by parent to child)

948	   - flush-tree (sent by parent to child after a loop is detected, to
949	   clear out state from looped tree as quickly as possible and cause
950	   subtree to be reformed)

952	   For all control packets (JOIN-REQUEST, JOIN-ACK, KEEP-ALIVE,
953	   HEARTBEAT, FLUSH-TREE), the "Protocol" field in the IPv4 header is
954	   set to SM (a new protocol value).

956	3.1 SM-'tunnels'

958	   Upstream (towards the core) or downstream SM routers may not be
959	   immediate neighbors, if there are non-SM routers on the path between
960	   them.  In a traditional tunnel between R1 and R2, R1 must add an
961	   extra IP header, and R2 must delete the header. SM gets the same
962	   functionality without adding and deleting headers. Instead all that
963	   is needed is to overwrite the destination address in the IP header to
964	   the address of the "tunnel" endpoint. The reason this can be done is
965	   that the information necessary for SM-routers to route the packet
966	   (namely C and M) is contained in the SM header.

968	   JOIN-REQUESTs and JOIN-ACKs allow tunnel-endpoints to learn of each
969	   other.  The state for a "tunnel" consists of the IP address of the
970	   endpoint, and the number of actual IP hops in the tunnel. The count
971	   of the tunnel's hops is kept because SM counts the length of the
972	   tree, so that senders can know what to set as the TTL in data
973	   packets.

975	3.2 Data Packet Header

977	   IP Header

979	   0               1               2               3
980	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
981	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
982	   |Version|  IHL  |Type of Service|          Total Length         |
983	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
984	   |         Identification        |Flags|      Fragment Offset    |
985	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
986	   |  Time to Live |   Protocol =  |         Header Checksum       |
987	   |               |   PROTO_SM    |                               |
988	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
989	   |                       Source Address                          |
990	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
991	   |                    Destination Address                        |
992	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

994	   SM Header

996	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
997	   |                         Core Address                          |
998	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
999	   |                     Multicast Address                         |
1000	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1001	   |L|                     Reserved Flag bits                      |
1002	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1004	   This SM header includes C, M, loop detect flag, where C=FF:FF:FF:FF
1005	   indicates packet should be delivered dense-mode.

1007	   The 'L' bit in Flag, if set, indicates the TTL for this packet should
1008	   never reach 0 (see Section 2.5.2).

1010	   The IP Destination address is ALL-SM-NODES except in the following
1011	   cases:

1013	   - when a non-member sender transmits the packet, the destination is
1014	   set to the core address. The purpose of this is to enable the packet
1015	   to be unicast until it hits a node that is SM-aware, at which point
1016	   the packet is multicast along the tree from the point at which it
1017	   entered the tree.
1018	   Note that if the non-member sender has joined the group as a
1019	   'sender-only' (c.f. uni-directional join in CBT), then the
1020	   destination address in the data packet is either ALL-SM-NODES or the
1021	   tunnel endpoint (as described below).

1024	   - when the packet is transmitted on a tunnel port, in which case the
1025	   destination address is set to the IP address of the tunnel endpoint.

1027	   Note that at Layer 2, the MAC address is mapped to the Multicast
1028	   Address M of the group (C,M), not to ALL-SM-NODES.
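
   To make the SM header layout concrete, the following sketch packs
   and unpacks the three 32-bit words shown above. Function names are
   hypothetical, and placing the 'L' bit in the most significant bit
   of the flags word is our reading of the diagram.

```python
import socket
import struct

L_FLAG = 0x80000000             # assumed: 'L' is the first (MSB) flag bit
DENSE_MODE_CORE = "255.255.255.255"  # C=FF:FF:FF:FF selects dense mode


def pack_sm_header(core, group, loop_detect):
    """Build the 12-byte SM header: core C, multicast M, flags."""
    flags = L_FLAG if loop_detect else 0
    return (socket.inet_aton(core) + socket.inet_aton(group)
            + struct.pack("!I", flags))


def unpack_sm_header(data):
    """Return (core, group, loop_detect) from the first 12 bytes."""
    core, group, flags = struct.unpack("!4s4sI", data[:12])
    return (socket.inet_ntoa(core), socket.inet_ntoa(group),
            bool(flags & L_FLAG))
```

   A receiver would compare the decoded core against DENSE_MODE_CORE
   to decide whether the packet is to be delivered dense-mode.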

1030	3.2 JOIN-REQUEST

1032	   The following control packet header fields are as defined in CBT:
1033	   addr_len, checksum, Payload Length and # of options.

1035	   0               1               2               3
1036	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1037	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1038	   |  vers |type=1 |  addr len     |         checksum              |
1039	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1040	   |Payload Length |  # of options |           reserved            |
1041	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1042	   |                    Join Originating Router                    |
1043	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1044	   |                       core address C                          |
1045	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1046	   |                       Multicast address M                     |
1047	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1048	   |                       Multicast address mask m                |
1049	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1050	   |  option type  |  option len   |        option value...        |
1051	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1053	   The destination IP address in the IP header is the Core Address.  The
1054	   JOIN-REQUEST is sent with the Router Alert Option.

1056	   The Multicast address and corresponding mask (M,m) may appear
1057	   multiple times. The total length of these fields is specified in the
1058	   "addr_len" field of the common control header.

1060	   The JOIN-REQUEST may contain the following option:

1062	   - Originating TTL. This field is set to the TTL in the IP header of
1063	   this JOIN-REQUEST packet. The receiving SM router ignores this
1064	   option unless the control packet is from a SM router who is not an
1065	   immediate neighbor. The value in this field is used to calculate the
1066	   number of hops in a 'tunnel' = Originating TTL - TTL in the IP
1067	   header for this packet. The value derived is placed in "# of hops in
1068	   tunnel from you to me" in the JOIN-ACK message.

1070	   0               1               2               3
1071	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1072	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1073	   |     1         |       2       |        Originating TTL        |
1074	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1076	   - Sender-Only
1077	   The join would only be successful if the sender is on the Include
1078	   Senders List or NOT in the Exclude Senders List.
1079	   The sender is attached to the tree as per uni-directional Join in CBT.

1081	   0               1               2               3
1082	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1083	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1084	   |     2         |       2       |       Reserved                |
1085	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
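
   The Originating TTL option and the tunnel hop calculation can be
   sketched as follows. The function names are hypothetical; the
   option type (1), option length (2) and 16-bit value field are taken
   from the diagram above.

```python
import struct


def pack_originating_ttl(ttl):
    """Encode the Originating TTL option: type=1, len=2, 16-bit value."""
    return struct.pack("!BBH", 1, 2, ttl)


def tunnel_hops(originating_ttl, received_ttl):
    """Hops in a 'tunnel' = Originating TTL - TTL in the IP header."""
    if not 0 <= received_ttl <= originating_ttl <= 255:
        raise ValueError("TTL values out of range")
    return originating_ttl - received_ttl
```

   For instance, a JOIN-REQUEST sent with TTL 64 that arrives with
   TTL 61 has crossed a 3-hop tunnel, and 3 is the value the receiver
   would return in the JOIN-ACK.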

1087	3.3 JOIN-ACK

1089	   0               1               2               3
1090	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1091	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1092	   |  vers |type=2 |  addr len     |         checksum              |
1093	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1094	   |Payload Length |  # of options |    # of hops in 'tunnel'      |
1095	   |               |               |       from you to me          |
1096	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1097	   |                    Join Originating Router                    |
1098	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1099	   |                       core address C                          |
1100	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1101	   |                       Multicast address M                     |
1102	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1103	   |                       Multicast address mask m                |
1104	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1105	   |  option type  |  option len   |        option value...        |
1106	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1108	   The destination IP address in the IP header is the downstream IP
1109	   source address of the JOIN-REQUEST. The JOIN-ACK is sent with the
1110	   Router Alert Option.

1112	   The Multicast address and corresponding mask (M,m) may appear
1113	   multiple times. The total length of these fields is specified in the
1114	   "addr_len" field.

1116	   The field "# of hops in tunnel from you to me" is ignored unless the
1117	   control packet is from an SM router that is not an immediate
1118	   neighbor. The value is saved as state for this tunnel port.

1120	   The options from the JOIN-REQUEST are copied into the JOIN-ACK, with
1121	   the exception of the "Originating TTL" option. The Originating TTL is
1122	   set to the TTL in the IP header of this JOIN-ACK packet.
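
   As an illustration of the layout above, the following sketch decodes
   the fixed part of a JOIN-ACK and its (M,m) list. It is not a
   normative parser. It assumes IPv4 (4-byte) addresses and assumes
   that "addr_len" counts the bytes occupied by the (M,m) pairs only.
   The names `JoinAck` and `parse_join_ack` are hypothetical. Options
   are returned undecoded.

```python
import ipaddress
import struct
from dataclasses import dataclass

# vers/type, addr len, checksum, payload length, # of options,
# tunnel hops, originating router, core address (16 bytes in total).
FIXED = struct.Struct("!BBHBBHII")


@dataclass
class JoinAck:
    version: int
    msg_type: int
    checksum: int
    payload_len: int
    num_options: int
    tunnel_hops: int
    originator: str
    core: str
    groups: list      # list of (M, m) dotted-quad pairs
    options: bytes    # raw, undecoded option bytes


def _ip(value: int) -> str:
    return str(ipaddress.IPv4Address(value))


def parse_join_ack(data: bytes) -> JoinAck:
    if len(data) < FIXED.size:
        raise ValueError("truncated control header")
    vt, addr_len, csum, plen, nopt, hops, orig, core = FIXED.unpack_from(data)
    if addr_len % 8:
        raise ValueError("addr_len is not a whole number of (M,m) pairs")
    end = FIXED.size + addr_len
    if len(data) < end:
        raise ValueError("truncated (M,m) list")
    groups = []
    for off in range(FIXED.size, end, 8):
        m_addr, m_mask = struct.unpack_from("!II", data, off)
        groups.append((_ip(m_addr), _ip(m_mask)))
    # vers is the high nibble of the first byte, type the low nibble.
    return JoinAck(vt >> 4, vt & 0x0F, csum, plen, nopt, hops,
                   _ip(orig), _ip(core), groups, data[end:])
```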

1124	3.4 KEEP-ALIVE

1126	   0               1               2               3
1127	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1128	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1129	   |  vers | type=3|  addr len     |         checksum              |
1130	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1131	   |Payload Length |  # of options |        reserved               |
1132	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1133	   |                KEEP-ALIVE Originating Router                  |
1134	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1135	   |                       core address C                          |
1136	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1137	   |                       Multicast address M                     |
1138	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1139	   |                       Multicast address mask m                |
1140	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1141	   |  option type  |  option len   |        option value...        |
1142	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1144	   The keep-alive message is sent from a child to a parent (towards
1145	   core), and is sent only if a keep-alive has been received recently
1146	   from a child. The destination IP address in the IP header is ALL-SM-
1147	   NODES or the tunnel endpoint address.

1149	   A single keep-alive can serve as many groups as fit into the list in
1150	   the packet.

1152	   (M,m) may appear multiple times. The total length of these fields is
1153	   specified in the "addr_len" field.

1155	   The KEEP-ALIVE may contain the following options:

1157	   0               1               2               3
1158	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1159	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1160	   |     1         |       10      |I|     reserved flag bits      |
1161	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1162	   |                Include/Exclude Sender Prefix                  |
1163	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1164	   |                Include/Exclude Sender Mask                    |
1165	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1167	   - Include/Exclude Senders List that upstream routers should filter.
1168	   This option may appear multiple times. The 'I' bit is set if this is
1169	   an include sender list, and is zero if this is an exclude sender
1170	   list.

1172	   0               1               2               3
1173	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1174	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1175	   |     2         |       10      |        hop count              |
1176	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1177	   |     Prune Time                |   # of hops in 'tunnel'       |
1178	   |                               |       from you to me          |
1179	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1181	   - KEEP-ALIVE Option. This option should appear the same number of
1182	   times as the address set (C,M,mask). It corresponds to, and applies
1183	   to, the address set (C,M,mask).

1185	   The fields in this option are:
1186	   - hop count: number of hops to the furthest leaf for (C,M,mask),
1187	   incremented at every SM hop. When the KEEP-ALIVE is received from a
1188	   tunnel port, the number of hops in 'tunnel' is also added.

1190	   - Prune Time for (C,M,mask): the time after which, if no KEEP-ALIVE
1191	   is received for group (C,M,mask), the parent should prune off this
1192	   branch.

1194	   - 'Originating TTL'. This is as described in JOIN-REQUEST.
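
   The hop count and Prune Time handling described above can be
   sketched as follows. This is an illustration under stated
   assumptions, not a definitive implementation: the receiving router
   is assumed to add the per-hop increment, Prune Time is assumed to be
   in seconds (the unit is not given here), and each KEEP-ALIVE is
   assumed to restart the prune timer. `ChildState`, `on_keep_alive`
   and `should_prune` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChildState:
    hop_count: int         # hops to the furthest leaf via this child
    prune_deadline: float  # time at which the branch may be pruned


def on_keep_alive(rx_hop_count: int, prune_time: float, now: float,
                  from_tunnel: bool = False,
                  tunnel_hops: int = 0) -> ChildState:
    """Update per-child state from a received KEEP-ALIVE option."""
    hops = rx_hop_count + 1          # one SM hop
    if from_tunnel:
        hops += tunnel_hops          # hops hidden inside the tunnel
    return ChildState(hops, now + prune_time)


def should_prune(state: ChildState, now: float) -> bool:
    """True once no KEEP-ALIVE has arrived within the Prune Time."""
    return now >= state.prune_deadline
```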

1196	3.5 HEARTBEAT

1198	   0               1               2               3
1199	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1200	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1201	   |  vers | type=4|  addr len     |         checksum              |
1202	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1203	   |Payload Length |  # of options |      reserved                 |
1204	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1205	   |                 HEARTBEAT Originating Router                  |
1206	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1207	   |                       core address C                          |
1208	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1209	   |                       Multicast address M                     |
1210	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1211	   |                       Multicast address mask m                |
1212	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1213	   |  option type  |  option len   |        option value...        |
1214	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1216	   The heartbeat is sent by a parent to a child. It is sent periodically
1217	   regardless of whether a heartbeat is received from its parent.  The
1218	   destination IP address is set to ALL-SM-NODES or the tunnel endpoint
1219	   address.

1221	   The HEARTBEAT may contain the following additional options:
1222	   - Include/Exclude Senders List. This is the list of
1223	   allowed/prohibited senders to the group. The format of this option
1224	   is the same as the KEEP-ALIVE Include/Exclude Senders List, although
1225	   it serves a different purpose here.

1227	   - spin-off groups (Ci,Mi). One or more spin-off groups (Ci,Mi) may be
1228	   specified.

1230	   0               1               2               3
1231	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1232	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1233	   |     1         |  #Groupsx8    |       reserved flag bits      |
1234	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1235	   |                       Core Address  Ci                        |
1236	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1237	   |                    Multicast Address Mi                       |
1238	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1240	   - HEARTBEAT Option. This option should appear the same number of
1241	   times as the address set (C,M,mask). It corresponds to, and applies
1242	   to, the address set (C,M,mask).

1244	   The fields in this option are:
1245	   0               1               2               3
1246	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1247	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1248	   |     2         |       6       |        core distance          |
1249	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1250	   |     Time To Shutdown          |   # of hops in 'tunnel'       |
1251	   |                               |      from you to me           |
1252	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1253	   |A|                    reserved                                 |
1254	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1256	   - core distance: number of hops to the core for (C,M,mask),
1257	   incremented at every SM hop. When the HEARTBEAT is received from a
1258	   tunnel port, the number of hops in 'tunnel' is also added.
1259	   - Time To Shutdown: time left before the group should be closed
1260	   down. All ones indicates the group should not be torn down.
1261	   - 'A' bit: if set, indicates that the core is alive or reachable.

1264	   - 'Originating TTL'. This is as described in JOIN-ACK.
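
   A minimal sketch of decoding the HEARTBEAT Option shown above
   follows. It is illustrative only. It assumes the option occupies the
   12 bytes drawn in the diagram, that the 'A' bit is the most
   significant bit of the last 32-bit word, that the receiver adds the
   per-hop increment, and that the tunnel rule applies when the
   HEARTBEAT itself arrives on a tunnel port. The function name
   `parse_heartbeat_option` and the returned dictionary keys are
   hypothetical.

```python
import struct

# type, len, core distance, Time To Shutdown, tunnel hops, flags word
HB_OPTION = struct.Struct("!BBHHHI")
NEVER_SHUT_DOWN = 0xFFFF   # all ones: group should not be torn down
A_BIT = 0x80000000         # core is alive or reachable


def parse_heartbeat_option(data: bytes, from_tunnel: bool = False) -> dict:
    if len(data) < HB_OPTION.size:
        raise ValueError("truncated HEARTBEAT option")
    opt_type, _opt_len, dist, shutdown, tunnel_hops, flags = \
        HB_OPTION.unpack_from(data)
    if opt_type != 2:
        raise ValueError("not a HEARTBEAT option")
    dist += 1                    # one SM hop
    if from_tunnel:
        dist += tunnel_hops      # hops hidden inside the tunnel
    return {
        "core_distance": dist,
        # None stands for "do not tear the group down".
        "time_to_shutdown": None if shutdown == NEVER_SHUT_DOWN else shutdown,
        "core_alive": bool(flags & A_BIT),
    }
```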

1266	3.6 FLUSH-TREE

1268	   0               1               2               3
1269	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1270	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1271	   |  vers | type=5|  addr len     |         checksum              |
1272	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1273	   |Payload Length |  # of options |           reserved            |
1274	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1275	   |                 FLUSH-TREE Originating Router                 |
1276	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1277	   |                       core address C                          |
1278	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1279	   |                       Multicast address M                     |
1280	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1281	   |                       Multicast address mask m                |
1282	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1283	   |  option type  |  option len   |        option value...        |
1284	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1286	   The destination IP address is set to ALL-SM-NODES or the tunnel
1287	   endpoint address.

1289	   The Multicast address and corresponding mask (M,m) may appear
1290	   multiple times. The total length of these fields is specified in the
1291	   "addr_len" field of the common control header.

1293	   No options are currently defined.

1295	4 Acknowledgments

1297	   Many people have contributed ideas to this proposal, including Harald
1298	   Alvestrand, Joel Halpern and Fred Baker. Because SM builds on
1299	   previous work in IP Multicast, the authors are grateful to everyone
1300	   who has contributed to the development of IP Multicast. We would
1301	   like to thank all members of IDMR, in particular Dino Farinacci, Mark
1302	   Handley, Brad Cain, Dave Thaler, Russ White and Ken Carlberg, whose
1303	   helpful comments have improved this proposal. Others who have
1304	   provided helpful technical information include Matthew Yuen and
1305	   Patrick Lee.

1307	References

1309	      DNS Based RP Placement scheme
1310	      Dino Farinacci's presentation in the MBONED WG, 40th IETF Meeting

1312	      Static Multicast, Internet-Draft, March 1998
1313	      M. Ohta, J. Crowcroft

1315	      Express
1316	      IDMR Mailing List discussion

1318	      CBT, Core Based Tree Multicast Routing,
1319	      Ballardie, Cain, Zhang

1321	      PIM-SM, Protocol independent multicast-sparse mode Specification,
1322	      RFC-2117, June 1997
1323	      Estrin, Farinacci, Helmy, Thaler, Deering, Handley,
1324	      Jacobson, Liu, Sharma, and Wei.

1326	      BGMP, Border Gateway Multicast Protocol Specification,
1327	      Thaler, Estrin, Meyer

1329	      MASC, Multicast Address Set Claim Protocol,
1330	      Estrin, Handley, Kumar, Thaler

1332	      IGMP, Internet Group Management Protocol, Version 3,
1333	      Cain, Deering, Thyagarajan

1335	      "A Border Gateway Protocol 4 (BGP-4)", Y. Rekhter & T. Li,
1336	      RFC1771, March 1995

1338	      "Multiprotocol Extensions for BGP-4", RFC 2283, February 1998.
1339	      Bates, T., Chandra, R., Katz, D., and Y. Rekhter.

1341	      "The IP Network Address Translator (NAT)", RFC 1631, May 1994.
1342	      Egevang, K., Francis, P.

1344	      "Administratively Scoped IP Multicast",
1345	      RFC 2365, July 1998.  Meyer, D.

1347	      Distributed Core Multicast, L. Blazevic, J-Y. Le Boudec

1349	      OGMP ftp://cs.ucl.ac.uk/darpa/ogmp.ps.gz

1351	Authors' Addresses

1353	Radia Perlman
1354	Sun Microsystems Laboratories
1355	2 Elizabeth Drive
1356	Chelmsford, MA 01824
1357	Radia.Perlman@sun.com

1359	Cheng-Yin Lee
1360	Nortel Networks
1361	PO Box 3511, Station C
1362	Ottawa, ON K1Y 4H7, Canada
1363	leecy@nortel.com

1365	Tony Ballardie
1366	Research Consultant
1367	aballardie@acm.org

1369	Jon Crowcroft
1370	Department of Computer Science
1371	University College London
1372	Gower Street
1373	London, WC1E 6BT, UK
1374	J.Crowcroft@cs.ucl.ac.uk

1376	Zheng Wang
1377	Bell Labs Lucent Technologies
1378	101 Crawfords Corner Road
1379	Holmdel NJ 07733
1380	zhwang@bell-labs.com

1382	Thomas Maufer
1383	3Com Corporation
1384	5400 Bayfront Plaza
1385	Santa Clara, CA  95052
1386	maufer@3com.com

1388	Christophe Diot
1389	Sprint ATL
1390	1 Adrian Court
1391	Burlingame CA 94010
1392	USA
1393	cdiot@sprintlabs.com

1395	Joseph Thoo
1396	Nortel Networks
1397	PO Box 3511, Station C
1398	Ottawa, ON K1Y 4H7, Canada
1399	jthoo@nortel.com

1401	Mark Green
1402	@Home Networks
1403	markg@corp.home.net