idnits 2.17.1 

draft-perlman-simple-multicast-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 33
     longer pages, the longest (page 2) being 60 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 34 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 7 instances of too long lines in the document, the longest one
     being 5 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 333 has weird spacing: '... random  gener...'

  == Line 353 has weird spacing: '...N. This  messa...'

  == Line 1187 has weird spacing: '...  times  as th...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'MBGP' on line 439 looks like a reference

  -- Missing reference section? 'MZAP' on line 698 looks like a reference

  -- Missing reference section? 'RFC2365' on line 705 looks like a reference


     Summary: 8 errors (**), 0 flaws (~~), 6 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                     R. Perlman
3	INTERNET DRAFT                                      Sun Microsystems
4	October 1999                                       C-Y Lee
5	                                                    Nortel Networks
6	                                                    A. Ballardie
7	                                                    Research Consultant
8	                                                    J. Crowcroft
9	                                                    UCL
10	                                                    Z. Wang
11	                                                    Lucent Technologies
12	                                                    T. Maufer
13	                                                    3Com Corporation
14	                                                    C. Diot
15	                                                    Sprint
16	                                                    J. Thoo
17	                                                    Nortel Networks
18	                                                    M. Green
19	                                                    @Home Networks

21	    Simple Multicast: A Design for Simple, Low-Overhead Multicast

23	            <draft-perlman-simple-multicast-03.txt>

25	Status of this memo

27	     This document is an Internet-Draft and is in full conformance
28	     with all provisions of Section 10 of RFC2026.

30	     Internet-Drafts are working documents of the Internet Engineering
31	     Task Force (IETF), its areas, and its working groups.  Note that
32	     other groups may also distribute working documents as
33	     Internet-Drafts.

35	     Internet-Drafts are draft documents valid for a maximum of six
36	     months and may be updated, replaced, or obsoleted by other
37	     documents at any time.  It is inappropriate to use Internet-
38	     Drafts as reference material or to cite them other than as
39	     "work in progress."

41	     To view the list of Internet-Draft Shadow Directories, see
42	     http://www.ietf.org/shadow.html.

44	Abstract

46	   This paper describes a design for multicast that is simple to
47	   understand and low enough overhead for routers that a single scheme
48	   can work both within and between domains. It also eliminates the need
49	   for coordinated multicast address allocation across the Internet. It
50	   is not very different from the tree-based schemes CBT, PIM-SM, and
51	   BGMP. Essentially all of the mechanisms to support this have already
52	   been implemented in the other designs. The contribution of this
53	   protocol is in what is NOT required to be implemented.

55	   The main idea for simplifying multicast is to consider the identity
56	   of a group to be the 8-byte combination of a "core node" C, and the
57	   multicast address M. The identity of the group is carried in join
58	   messages and data messages. M no longer has to be unique across the
59	   Internet. It only has to be unique per C. The other idea, which is
60	   independent of the first, is to build a bi-directional tree (as is
61	   done in CBT and BGMP) instead of building per-source trees from each
62	   sender.  This reduces the state necessary in routers to support
63	   multicast.

65	Changes from revision 1
66	   - use a Simple Multicast (SM) header instead of a new IP option

68	   - modified branch creation and deletion to avoid loops

70	   - added tree splicing mechanism

72	   - added multicast scoping

74	   - allow both IGMP and host SM Join

76	   - added sender only joins

78	   - third party independence

80	   - layer 2 filtering

82	   - host API and kernel changes

84	1.0 Introduction

86	   IP Multicast has been around for over a decade, and several multicast
87	   protocols have been developed over the years. However, the solutions
88	   are either difficult to understand or expensive to deploy or both. In
89	   particular, we believe that multicast address allocation protocols
90	   are too complex and BGMP in combination with MASC will not scale
91	   easily.

93	   In this paper, we present a design we call Simple Multicast that
94	   reduces the complexity and overhead of multicast. It is not really
95	   "yet another multicast protocol". Instead, it is more like a subset
96	   of other protocols, with one variation: the identifier of a group
97	   consists of both C (the core) and M (the multicast address).
98	   This eliminates the need to have unique multicast addresses and
99	   coordinate multicast addresses across the Internet.

101	1.1 Previous Work

103	   DVMRP was the first multicast routing protocol proposed. It uses a
104	   simple mechanism of flooding and pruning.

106	   The scalability issues with DVMRP led to the development of CBT. In
107	   CBT, a multicast group is formed by choosing a distinguished node,
108	   the "core", and having all members join by sending special join
109	   messages towards the core. The routers along the path keep state
110	   about which ports are in the group. If a router along the path of the
111	   join already has state about that group the join does not proceed
112	   further. Instead the router just "grafts" the new limb onto the tree.
113	   The result is a tree of shortest paths from the core, with only the
114	   routers along the path knowing anything about that group.

116	   In PIM-SM, each node could independently decide whether the volume of
117	   traffic from a particular source is worth switching from a shared
118	   tree to a per-source tree.  Thus, there are two possible trees for
119	   traffic from a particular source for group M; the shared tree and the
120	   source tree. To prevent loops, the shared tree had to be
121	   unidirectional, i.e., to send to the shared tree, the data has to be
122	   encapsulated and unicast to the core.

124	   The other issue that makes current protocols complex is the necessity
125	   for routers to be able to figure out the location of the core based
126	   solely on the multicast address M.  In PIM-SM, this resulted in a
127	   protocol whereby "core-capable" routers are being continuously
128	   advertised. All routers keep track of the current set of live core-
129	   capable routers, and there is a hashing function to map a multicast
130	   address to one of the set of core-capable routers. This advertisement
131	   protocol is confined to within a domain because it was recognized
132	   that this mechanism would not scale to the entire Internet.

134	   For inter-domain multicast, a set of new protocols has been proposed.
135	   The MASC protocol deals with hierarchical block allocation of Class D
136	   address space.  Essentially, it creates a prefix structure in
137	   multicast address space in a way similar to unicast address space.
138	   Because of the limited multicast address space, the allocation has to
139	   be dynamic.  MASC contains mechanisms for collision detection and
140	   de-allocation. Once a block of multicast addresses is allocated, and
141	   no collision is detected for a period of time, the address block is
142	   then given to MAAS servers for actual assignment to multicast groups.
143	   The address block has to be propagated through BGP+ so that routers
144	   throughout the Internet can know the mapping of multicast addresses
145	   to cores, even in other domains. BGMP then uses this information to
146	   know the direction in which a join to multicast address M should be
147	   sent.

149	1.2 Overview of Simple Multicast

151	   The Simple Multicast proposal tries to reduce or eliminate some of
152	   the complexity and overhead of multicast by taking a slightly
153	   different approach.  The basic idea in Simple Multicast is that a
154	   multicast group is created by generating:

156	   - a distinguished node C known as the "core"

158	   - a multicast address M

160	   The multicast group is then identified by the pair (C,M) rather than
161	   just M as in conventional IP multicast. Note that the address M does
162	   not have to be unique across the Internet now. Instead, only the pair
163	   (C,M) has to be unique. That means that every node C in the Internet
164	   can assign the full 28 bits worth of multicast addresses.
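
As an illustration of the (C,M) identity described above, here is a
minimal sketch in Python. The function name group_id and the byte
order (4 bytes of C followed by 4 bytes of M) are assumptions made
for this example; the text only states that the identity is the
8-byte combination of C and M.

```python
import ipaddress
import struct

def group_id(core: str, mcast: str) -> bytes:
    """Pack (C, M) into an 8-byte key: C first, then M (assumed order)."""
    c = ipaddress.IPv4Address(core)
    m = ipaddress.IPv4Address(mcast)
    if not m.is_multicast:
        raise ValueError(f"{mcast} is not a class-D (multicast) address")
    return struct.pack("!II", int(c), int(m))

# Two different cores reuse the same M; the groups remain distinct.
a = group_id("192.0.2.1", "239.1.2.3")
b = group_id("198.51.100.7", "239.1.2.3")
print(len(a), a != b)   # prints: 8 True

# Class D fixes the top 4 bits (1110), leaving 28 bits of M per core.
print(2 ** 28)          # prints: 268435456
```

Because the core address is part of the key, a collision on M alone
is harmless, which is why no Internet-wide coordination is needed.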

166	   In Simple Multicast, multicast address allocation and core placement
167	   (i.e., choosing a multicast address M and a core C for a multicast
168	   group) are taken out of the basic multicast protocol. End systems may
169	   find out about the multicast address M and the core C for a group
170	   through one of several possible mechanisms including email
171	   announcement, web advertising, SDR, DNS lookup etc.  Both SM-aware
172	   endnodes and SM-aware routers must recognize the combination of (C,M)
173	   as the identity of the group.

175	   Once the end systems have M and C, they then join the group by
176	   sending a special join message towards the core C, creating state in
177	   the routers along the path until the join packet hits the core or a
178	   router that is already on the tree for this multicast group. This
179	   creates a branch in the bi-directional distribution tree for the
180	   group. The current IGMP mechanism for joining groups is fine,
181	   provided that both C and M appear in the IGMP reply. Until IGMP is
182	   modified to support this, the join message itself can be sent from
183	   the end system. If both C and M appear in the join message, then the
184	   first hop router can initiate the join.

186	   To enable incremental deployment of Simple Multicast, we provide a
187	   mechanism for the join message to traverse non-SM aware routers. (See
188	   Joining a Group).

190	   The multicast tree formed is bi-directional, meaning that traffic can
191	   be injected from any point. The core is just another node in the
192	   tree.  The data packet contains both C and M, and routers look up the
193	   group based on the combination (C,M).

195	   Data packets would need to carry both C and M. There have been a few
196	   suggestions on how this may be done:  1) Define a new IP option and
197	   specify both C and M in it.  2) Define a new protocol and specify the
198	   new protocol in the 'protocol' field of the IPv4 header. Encapsulate
199	   the payload inside this new protocol.  This new protocol header will
200	   contain both C and M.  3) Map (C,M) to a unique class-D address on
201	   the data-link. The destination address of the data packet would be
202	   re-written to a unique class-D address before being forwarded on that
203	   data-link.

205	   Although option processing in general is more expensive, in this case
206	   the option processing is merely forwarding packets by looking at an
207	   extra IP address in the option field. In contrast, other IP options
208	   such as LSR, SSR and Router Alert are more involved.  Hence, from a
209	   purely technical point of view, the first and second approach can be
210	   implemented in hardware and there is no significant difference
211	   between these two approaches. However, due to current hardware
212	   implementation convention, option processing is more likely done in
213	   software. As a result, we have opted to use the SM header instead.

215	   The third approach does not require data packets or join messages to
216	   carry the core address. SM nodes obtain the unique class-D address
217	   which maps to a group (C,M) from a special node(s) on the data-link.
218	   This approach is appealing because it allows SM applications to join
219	   a group by joining a class-D address just like conventional IP
220	   multicast. On the other hand, it also introduces concerns not unlike
221	   label switching, e.g. vulnerability to loops, ensuring the uniqueness
222	   of addresses at all times, ensuring all nodes on the LAN use the same
223	   address for a group at all times and address recycling, among others.
224	   In this approach, if a unique address on the data-link is not
225	   available for use, data cannot be forwarded. In contrast, if a packet
226	   cannot be label switched, it can be routed.  We are investigating the
227	   feasibility of this approach.

229	   The SM header will carry both C and M. The reason for carrying both C
230	   and M in the header instead of carrying at least one of them in the
231	   destination address is to allow SM aware routers to co-exist with
232	   non-SM aware routers. The destination address in the IP packet is set
233	   to a reserved multicast address, the ALL-SM-NODES, when sending to
234	   networks with SM aware routers.  This ensures that non-SM routers
235	   will not forward SM multicast data packets. When the packet must hop
236	   over non-SM routers, the IP destination address is set to the next
237	   SM-aware router in the path.

239	   A nice feature of Simple Multicast is that, since both C and M are in
240	   the SM header, the destination address in the IP packet can be
241	   replaced with the tunnel endpoint address, and packets can be
242	   'tunneled' with very little work. Instead of having to add and delete
243	   IP headers (if the packet is encapsulated IPIP), the only work is to
244	   write the tunnel endpoint address into the destination address of the
245	   IP header.

247	1.3 Why Simple Multicast

249	   We now discuss some of the advantages of Simple Multicast.

251	   - One protocol is all that is needed.  Currently, we need to deal
252	   with two sets of multicast protocols in order to support multicast in
253	   the Internet: DVMRP, PIM-DM, PIM-SM, CBT, etc. for intra-domain
254	   multicast and MASC, MAAS and BGMP for inter-domain support. The
255	   beauty of the Simple Multicast proposal is that only one multicast
256	   protocol is needed for both intra-domain and inter-domain.  This is
257	   possible because Simple Multicast is designed to be scalable.

259	   - Scalability.  Simple Multicast is scalable to the global Internet.
260	   This scalability is achieved by using a trivial multicast address
261	   allocation scheme, decoupling core selection and discovery from the
262	   multicast protocol and using bi-directional trees.  If core discovery
263	   is decoupled from multicast routing protocols such as PIM-SM or CBT,
264	   these protocols would not have to use the bootstrap mechanism to
265	   discover and select cores, a mechanism generally considered to be not
266	   scalable.

268	   - Trivial multicast address allocation. IP Multicast address
269	   allocation is still an unresolved problem. Dynamically allocating
270	   addresses such that addresses are allocated in aggregatable blocks,
271	   while ensuring low probability of address collision (non-uniqueness)
272	   is non-trivial. In Simple Multicast, since (C,M) is the identifier
273	   for a multicast group, address assignment becomes totally trivial,
274	   since addresses only have to be unique per core. Each core can have
275	   the full 28-bit space (over 200 million addresses), so we have
276	   virtually unlimited multicast addresses. Each core can allocate these
277	   addresses independently without Internet-wide coordination.

279	   - Cost effective and efficient delivery trees.  It takes less state
280	   in routers to support a group with n senders with a single shared
281	   tree than with n per-sender trees. A bi-directional shared tree is
282	   as cost effective for delivery of traffic from source S, even if S is
283	   not the core, as a per-source tree rooted at S. The bi-directional
284	   shared tree is much more efficient for delivery of traffic from
285	   non-core source S than a unidirectional tree where the data from S
286	   must be tunneled to the core before being multicast.

288	   Bi-directional trees are more robust. In a unidirectional tree, the
289	   core is needed for relaying packets from all senders. If the core is
290	   down, the tree is gone. For a bi-directional tree, the core does not
291	   hold any particular significance. The core is just another node in
292	   the tree. If the core is down, the tree is merely partitioned and may
293	   still be used for traffic delivery if the application chooses to do
294	   so.

296	   - Incremental deployment.  Simple Multicast routers may be deployed
297	   alongside unicast routers and other multicast routers. Traffic is
298	   effectively tunneled (although the actual mechanism used is more
299	   efficient than tunnels) through routers which do not support Simple
300	   Multicast. Therefore a network manager may incrementally add Simple
301	   Multicast routers as multicast users spread in the network.

303	2.0 The Design

305	   In this section, we describe the design of Simple Multicast and its
306	   basic operations in detail.

308	2.1 Creating a Multicast Group

310	   To create a group, one needs to select a core address and a multicast
311	   address.

313	   Typically most applications consist of a single high-volume source.
314	   For those applications, the core should be the source. For others,
315	   any node close to any member of the group would be a logical choice
316	   for core. Because the tree-building strategy (like BGMP) uses a
317	   single exit point from a domain or any region separated from the rest
318	   of the Internet through expensive links, the traffic pattern
319	   resembles individual trees within domains hooked together with
320	   inter-domain paths. In other words, if S is in your domain, then you
321	   will receive traffic from S through a path internal to your domain
322	   even if the core of the group is outside the domain. Therefore, even
323	   if most of the members of the group are in Europe, and one member of
324	   the group is in Australia, and the Australian is chosen as the core,
325	   the tree will still be a very good tree. Traffic between the
326	   Europeans would be multicast through the tree confined within Europe,
327	   even though the core was in Australia.

329	   As the multicast addresses only need to be unique per core, each core
330	   has over 200 million multicast addresses for allocation. Once the
331	   core is chosen, some very simple mechanisms can be used to generate
332	   the multicast address for the chosen core, for example, querying the
333	   core for an address or random  generation as it is done in SDR (the
334	   collision rate will be significantly lower). Some permanent mapping
335	   of "well-known" addresses for popular groups is also feasible.
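
The per-core address generation described above can be sketched as
follows. This is a hypothetical allocator, not a mechanism defined
by this draft: the class name CoreAllocator and its methods are
invented for illustration, and the sketch ignores any reserved
ranges within the class-D space that a real core would skip.

```python
import random

CLASS_D_BASE = 0xE0000000   # 224.0.0.0
CLASS_D_SIZE = 1 << 28      # 28 free bits of M per core

class CoreAllocator:
    """Hypothetical allocator: M only has to be unique at this core."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._in_use = set()

    def allocate(self) -> int:
        """Pick a random unused M, as an integer class-D address."""
        while True:
            m = CLASS_D_BASE + self._rng.randrange(CLASS_D_SIZE)
            if m not in self._in_use:   # the collision check is purely local
                self._in_use.add(m)
                return m

    def release(self, m: int) -> None:
        """Return M to the pool when the group ends."""
        self._in_use.discard(m)

core_a = CoreAllocator(seed=1)
core_b = CoreAllocator(seed=1)
# Both cores may hand out the same M; (C,M) still differs because C differs.
print(core_a.allocate() == core_b.allocate())   # prints: True
```

The point of the sketch is that the only state consulted is the
core's own set of addresses in use.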

337	2.2 Joining a Group

339	   To join a group, one first has to find the core address C and
340	   multicast address M. It is appropriate to have a variety of
341	   mechanisms. A web page advertising a "singles chat group" might
342	   advertise its (C,M) on its web page. Or a provider of some other sort
343	   of service, like stock quotes, might advertise on a web page.
344	   Ideally, clicking on the web page would cause M and C to be
345	   downloaded to the client machine, which would then join the group.
346	   Another mechanism, for instance when arranging a private conference,
347	   might be to be told about M and C via the telephone, or via email.
348	   Yet another mechanism is to have the group (together with a name or a
349	   description) advertised in a directory such as SDR.

351	   If IGMP is extended to support SM, the host sends a membership report
352	   for group (C,M). The SM DR is responsible for forwarding the join off
353	   the LAN. This  message is sent towards the core, creating state in
354	   the routers along the path, so that each router knows which ports are
355	   in the group (C,M).

357	   If there are no SM routers on the LAN, a host may send an SM Join
358	   itself. The destination IP address of the join message is set to the
359	   core IP address. If a non-SM router on the LAN receives the join
360	   message, it will forward it to the core. Data will be tunneled to
361	   this endnode by an upstream SM router.  As there could be potentially
362	   multiple tunnels to the LAN, host SM Join should only be used when
363	   there is no local SM support as may be the case during initial
364	   deployment or when there are very few local members to justify a
365	   network upgrade.  If the next hop towards the core on the LAN is an
366	   SM router, and if it is not an SM DR itself, it will redirect the
367	   join to the SM DR. In this case, if data is tunneled from upstream,
368	   it will be tunneled to the SM router that forwards the join off the
369	   LAN, instead of the endnode. [Note: This approach provides a
370	   migration path whereby as more SM routers are deployed on the LAN,
371	   fewer tunnels are used. It also allows the co-existence of IGMP (with
372	   or without SM support) and host SM Join during the migration
373	   process.]

375	   If a router receives a join for multicast group (C,M), and it
376	   already has state for (C,M), then it merely adds that port to its set
377	   of ports for (C,M) and does not forward the join further.  The result
378	   is a tree of shortest paths from the core to each member.  Each
379	   router on the tree has a database of (C,M, {ports}) that tells it,
380	   for group (C,M), the ports that data should be forwarded to.
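
The grafting behaviour described above can be sketched as a small
simulation. The Router class, the send_join helper, and the port
names are assumptions made for this example. The sketch also assumes
that a router records its parent port in the same (C,M) port set as
its child ports, which is what a bi-directional tree requires.

```python
from collections import defaultdict

class Router:
    def __init__(self, name, is_core=False):
        self.name = name
        self.is_core = is_core
        self.ports = defaultdict(set)   # (C, M) -> set of on-tree ports

    def on_tree(self, group):
        return self.is_core or group in self.ports

def send_join(path, group):
    """Walk a join along `path`, a list of (router, in_port, out_port)
    hops ordered from the joining member toward the core.
    Returns the names of the routers that processed the join."""
    visited = []
    for router, in_port, out_port in path:
        visited.append(router.name)
        already = router.on_tree(group)
        router.ports[group].add(in_port)
        if already:
            break                           # graft: do not forward further
        router.ports[group].add(out_port)   # parent port toward the core
    return visited

core = Router("C", is_core=True)
r1, r2, r3 = Router("R1"), Router("R2"), Router("R3")
g = ("192.0.2.1", "239.1.2.3")

# Member A joins via R1 -> R2 -> core: the join travels the whole path.
first = send_join([(r1, "a", "up"), (r2, "p1", "up"), (core, "p2", None)], g)
# Member B joins via R3 -> R2: R2 already has state, so the join stops there.
second = send_join([(r3, "b", "up"), (r2, "p3", "up"), (core, "p2", None)], g)
print(first, second)   # prints: ['R1', 'R2', 'C'] ['R3', 'R2']
```

Only routers on the two join paths hold any state for the group.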

382	   The join message is sent with the Router Alert option. Since the join
383	   message has C as the destination address, if an intermediate router
384	   is not SM aware, it will just forward the join towards the core. When
385	   the join message reaches an SM-aware router R2, it looks at the IP
386	   source address of the join message, say R1. If R1 is a neighbor, R2
387	   adds the port from which the join was received to its list of ports
388	   for (C,M). If R1 is not a neighbor, R2 will send a join-ack to R1. If
389	   R2 is not a neighbor, R1 adds the 'tunnel port' to R2 as its 'parent
390	   port' for (C,M). If R2 is a neighbor, R1 just adds the port as its
391	   parent port for (C,M), since the packet will not need to be tunneled
392	   to get to R2.

394	   A non-member sender may join the group as a sender-only (cf uni-
395	   directional join in CBT). The sender will be on-tree and thus will be
396	   sending keep-alives and receiving heartbeat messages, and hence will
397	   be aware of core liveness. Data will not be forwarded to a
398	   sender-only branch.

400	2.3 Transmitting to multicast group (C,M)

402	   A sender who is a member of the group sends an IP packet with C and
403	   M in the SM header. The destination IP address is set to ALL-SM-
404	   NODES. This ensures non-SM aware nodes will ignore the packet. Only
405	   SM aware routers will forward the packet.

407	   A router that receives an SM packet looks up (C,M) in its forwarding
408	   table. If it knows about (C,M), it checks if the port it received the
409	   packet on is in its database. If not, it drops the packet. If so, it
410	   forwards the packet onto all the other ports listed in its database
411	   for (C,M). If the outgoing port is a tunnel port, the destination
412	   address of the IP header is replaced by the tunnel endpoint, and the
413	   packet will therefore travel across routers that are not SM-aware. At
414	   the other end of the tunnel, the SM-aware router will replace the
415	   destination address with ALL-SM-NODES, or with another tunnel
416	   endpoint's address, depending on whether the packet is being
418	   forwarded on a "real port" or a "tunnel port".
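
The forwarding rule above can be sketched as a single function. The
names forward and tunnel_endpoints are assumptions made for this
example, and ALL_SM_NODES is a placeholder string because the value
of the reserved address is not given here. Dropping a packet for an
unknown (C,M) is also an assumption; the text only specifies the
drop for a packet that arrives on a port not in the database.

```python
ALL_SM_NODES = "ALL-SM-NODES"   # placeholder for the reserved address

def forward(table, group, in_port, tunnel_endpoints):
    """Return a list of (out_port, ip_destination) for one SM data packet.

    table            -- dict mapping (C, M) to the set of on-tree ports
    tunnel_endpoints -- dict mapping a tunnel port to the far-end router
    """
    ports = table.get(group)
    if ports is None or in_port not in ports:
        return []   # unknown group (assumed) or off-tree arrival: drop
    out = []
    for port in sorted(ports - {in_port}):
        # A tunnel port rewrites only the IP destination address;
        # C and M stay untouched in the SM header.
        out.append((port, tunnel_endpoints.get(port, ALL_SM_NODES)))
    return out

g = ("192.0.2.1", "239.1.2.3")
table = {g: {"p1", "p2", "t1"}}
tunnels = {"t1": "203.0.113.9"}
print(forward(table, g, "p1", tunnels))
# prints: [('p2', 'ALL-SM-NODES'), ('t1', '203.0.113.9')]
```

The same function serves at the far end of a tunnel, since the
choice of destination depends only on the outgoing port type.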

420	   If you are not a member of the group but want to transmit to the
421	   group, you place C into the IP destination address, and put C and M
422	   in the SM header. The packet might travel all the way to the core,
423	   but if it instead hits an SM-aware router R with state about (C,M)
424	   before it gets to the core, R will inject the packet into the tree.
425	   A sender-only member may transmit like a member, but will not be
426	   receiving any packets for this group.

428	2.4 Inter-domain Multicast

430	   Simple Multicast works both for intra-domain and inter-domain
431	   multicast. Because the join message of Simple Multicast carries the
432	   core IP address, and unicast routing already knows how to reach any
433	   IP address, the join message will be delivered based on the unicast
434	   forwarding table.

436	   2.4.1 Incongruent unicast and multicast topologies

438	   Where the unicast and multicast topologies are incongruent, BGP-4+
439	   [MBGP] allows a network provider to specify the path on which it
440	   accepts multicast traffic, independent of the path unicast traffic
441	   traverses. In the figure below, AS1 may have a peering agreement with
442	   AS2 to forward its unicast traffic, but a peering agreement with AS3
443	   to forward multicast traffic. A join from AS1 towards any cores in
444	   AS4 would be sent via AS3. A finer granularity of policy may specify
445	   certain network or core ranges that AS3 would carry traffic for.

447	           AS2
448	         *     *
449	        *       *
450	      AS1       AS4
451	        *       *
452	         *     *
453	           AS3

455	   The join message to C should be routed towards the exit router
456	   specified by BGP-4+, for delivery of multicast traffic outside of the
457	   domain.
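
A minimal sketch of the join routing decision for incongruent
topologies follows. The two tables, the function names, and the
example prefix are hypothetical. Falling back to the unicast table
when no multicast-specific route exists is an assumption of this
sketch, not a rule stated in the text.

```python
import ipaddress

def next_hop(rib, address):
    """Longest-prefix match of `address` in rib (dict: prefix -> next hop)."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, hop in rib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None

def join_next_hop(multicast_rib, unicast_rib, core):
    """Prefer a multicast-specific route to the core, else use unicast."""
    return next_hop(multicast_rib, core) or next_hop(unicast_rib, core)

# AS1's view of a core in AS4, following the figure above.
unicast_rib = {"198.51.100.0/24": "AS2"}
multicast_rib = {"198.51.100.0/24": "AS3"}
core = "198.51.100.7"
print(next_hop(unicast_rib, core))                        # prints: AS2
print(join_next_hop(multicast_rib, unicast_rib, core))    # prints: AS3
```

Unicast packets to the core leave via AS2, while the join, and so
the multicast branch it creates, goes via AS3.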

459	   2.4.2 "3rd Party" Independence

461	   For the case in which SM is used both within and between domains,
462	   joins from different parts of the domain might only converge (merge)
463	   outside the domain. It is not desirable for a domain to depend on
464	   another, "3rd party", domain for the distribution of internally
465	   sourced traffic to other internal receivers. It is therefore
466	   necessary to ensure that joins from different internal receivers
467	   merge at a common point inside the domain.

469	   BGP-4 operates on border routers (BRs) of transit domains, and
470	   ensures that all BRs know which of them acts as egress for a
471	   particular unicast prefix. In some transit domains, the elected
472	   egress router injects external route information internally, and
473	   internal routers therefore know in which direction to forward
474	   packets destined to a particular unicast prefix. In other cases, and
475	   in stub domains, external route information is not injected inside
476	   the domain. Nevertheless, the BRs of these domains know for which
477	   unicast prefix(es) each of them is acting as egress. Thus, domain BR
478	   routing knowledge ensures that joins originated inside a domain
479	   converge at a common point inside the domain.

481	   This principle can be applied recursively across multiple levels of
482	   routing hierarchy.

484	2.5 Failure Recovery

486	   The situations to detect are:

488	   - branch unused

490	   - loop

492	   - path to core broken or changed

494	   - core dead or unreachable

496	   Any of the tree building schemes (CBT, PIM-SM, BGMP) need to solve
497	   these problems, and there is no need to do anything radically new.
498	   The only extra mechanism we've introduced is for loop detection.
499	   Since packets can quickly proliferate in a multicast loop, it is
500	   desirable to detect a loop as soon as it forms.  Since SM
501	   uses an SM header, we can make use of a flag that will enable us to
502	   detect a loop on a data packet.

504	   The other mechanisms we specify are similar to those already in place
505	   for PIM, CBT, and BGMP.

507	2.5.1 Unused Branch

509	   A branch must be kept alive with a "keep-alive" message. If R
510	   receives at least one keep-alive message from a child in tree (C,M),
511	   R sends a keep-alive to its parent port for (C,M). If no keep-alive
512	   is received for some amount of time (at least a few keep-alive
513	   intervals) from some child port for (C,M), that port is removed from
514	   the list of ports. If there are no more child ports, then R stops
515	   sending keep-alives, or as an optimization "unjoins" from its parent.
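
The keep-alive bookkeeping at a router R can be sketched as below.
The class name BranchState, the interval of 10 seconds, and the
allowance of three missed keep-alives are assumptions; the text only
requires "at least a few keep-alive intervals". Local members
attached directly to R are not modelled.

```python
class BranchState:
    """Child-port liveness for one (C, M) at router R. Times in seconds."""

    def __init__(self, keepalive_interval=10.0, missed_allowed=3):
        self.timeout = keepalive_interval * missed_allowed
        self.last_seen = {}   # child port -> time of last keep-alive

    def on_keepalive(self, port, now):
        self.last_seen[port] = now

    def expire(self, now):
        """Remove child ports that have been silent too long; return them."""
        dead = [p for p, t in self.last_seen.items() if now - t > self.timeout]
        for p in dead:
            del self.last_seen[p]
        return dead

    def should_send_keepalive(self):
        # R keeps its own branch alive only while it still has a live child.
        return bool(self.last_seen)

b = BranchState()
b.on_keepalive("p1", now=0)
b.on_keepalive("p2", now=25)
print(b.expire(now=31), b.should_send_keepalive())    # prints: ['p1'] True
print(b.expire(now=100), b.should_send_keepalive())   # prints: ['p2'] False
```

Once should_send_keepalive() returns False, R either stays silent
and lets its parent time the branch out, or "unjoins" explicitly.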

517	2.5.2 Loop

519	   It would be easy to detect a loop if we could assume that any data
520	   packet for which TTL became zero implied there was a loop.
521	   Unfortunately, some applications do an "expanding ring search" or a
522	   traceroute in which packets are launched with very small TTLs. It
523	   would be wrong to conclude there was a loop when the TTL on those
524	   packets expired.

526	   We use a flag in the SM header to mark a packet whose TTL reaching 0
527	   would indicate a loop. An application launching a packet with a low
528	   TTL would not set that flag. SM routers do not need to look at the
529	   flag except on packets for which TTL expires.

531	   Loops can also be detected on keep-alive and heartbeat messages
532	   (heartbeats are sent outwards from the core; see next section). The
533	   keep-alive message indicates "hops from furthest leaf". A router
534	   collects keep-alives from its child ports and transmits a keep-alive
535	   that is one hop more than the maximum "hops" it receives in any
536	   keep-alive from a child.

538	   The heartbeat is like a keep-alive, but from the parent. Likewise it
539	   carries a "distance from the core". In either case (heartbeat or
540	   keep-alive) if the distance gets too great a loop is suspected and
541	   the port is removed from the tree and the child rejoins to the core.
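
   The loop checks described in this section can be sketched as below.
   This is a minimal illustration: the threshold MAX_TREE_DISTANCE and
   all function names are assumptions, since the text does not state
   how large a distance is "too great".

```python
MAX_TREE_DISTANCE = 64  # hypothetical threshold for "distance gets too great"


def keepalive_hops(child_hops):
    """Hop count R advertises upstream: one more than the largest child value."""
    return max(child_hops) + 1


def loop_suspected(distance):
    """Applies to the keep-alive hop count and to the heartbeat core distance."""
    return distance > MAX_TREE_DISTANCE


def ttl_expiry_means_loop(ttl, l_flag):
    """A data packet signals a loop only if its TTL ran out and L is set."""
    return ttl == 0 and l_flag
```

   A packet from an expanding ring search would carry L unset, so
   ttl_expiry_means_loop() stays false for it even when the TTL
   expires.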

543	2.5.3 Path to core broken or changed

545	   A parent transmits a "heartbeat" message to its children at regular
546	   intervals. The heartbeat indicates whether the core is known to be
547	   alive. A parent continues sending heartbeat messages even if it stops
548	   receiving "core-alive" heartbeats from its parent. In this way a
549	   subtree will continue functioning even if the core is dead.  And if
550	   the core is not dead, the parent can simply rejoin without causing
551	   disruption to the nodes below it in the tree, where feasible.

553	   If unicast routing indicates the path to the core has changed, R
554	   rejoins to the core, again, without disrupting the subtree below it,
555	   where feasible.

557	   To avoid loops from forming, the parent would rejoin the core using a
558	   special join to splice the sub-trees. This splice message must be
559	   forwarded all the way to the core, creating state where there is no
560	   existing state. The core will acknowledge the splice message.

562	   If the splice message hits a downstream router, it will be forwarded
563	   until it reaches the router that originated this splice message. At
564	   this point, the router would realize that it cannot splice the sub-
565	   trees without causing loops. Depending on the application
566	   requirement, which is conveyed to routers from the core via heartbeat
567	   messages, the router could either flush the sub-tree and let leaf
568	   routers or hosts rejoin, or, if the application so desires, allow the
569	   sub-trees to continue functioning separately while attempting to
570	   splice them again when the unicast route to the core changes. The
571	   latter makes more sense when there is a network partition and the
572	   core is not reachable. The decision to flush the sub-tree or rejoin
573	   the core can be based on information such as the depth of the
574	   sub-tree and distance to core.  This information may be obtained
575	   from the keep-alive and heartbeat messages.

577	   Since the heartbeat message is generated at regular intervals even if
578	   a heartbeat is not received from the parent, a very long tree does
579	   not suffer from delay variance that might cause nodes very far from
580	   the core to incorrectly assume the tree was broken.

582	2.5.4 Core dead or unreachable

584	   When the core transmits a heartbeat message it sets the "core alive"
585	   flag. If a router has received a heartbeat message from its parent
586	   with the "core alive" flag set recently enough (3 heartbeat
587	   intervals), then it sets the "core alive" flag in its heartbeat
588	   messages to its children.
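
   A minimal sketch of how a router might decide the value of the "core
   alive" flag in its own heartbeats is given below. The heartbeat
   interval value and the function name are assumptions; only the
   window of 3 heartbeat intervals comes from the text.

```python
HEARTBEAT_INTERVAL = 5.0                    # seconds; hypothetical value
CORE_ALIVE_WINDOW = 3 * HEARTBEAT_INTERVAL  # "3 heartbeat intervals"


def core_alive_flag(is_core, last_core_alive_rx, now):
    """Value of the "core alive" flag to place in outgoing heartbeats."""
    if is_core:
        # The core itself always sets the flag.
        return True
    if last_core_alive_rx is None:
        # No "core alive" heartbeat has been seen from the parent yet.
        return False
    return now - last_core_alive_rx <= CORE_ALIVE_WINDOW
```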

590	   If it stops receiving heartbeats with "core alive", it prunes itself
591	   from the old parent and rejoins the core (by sending a splice
592	   message).

594	   The only purpose of knowing whether the core is alive or not is for
595	   applications to decide, if there are multiple trees for a group,
596	   which tree they should transmit on (see next section).

598	2.5.5 Multiple Trees for Reliability

600	   The core should be selected to be a node that is reliable. However,
601	   if a group will be long-lived and there is the worry that the core
602	   might die, a simple mechanism is to create multiple trees (C1, M1)
603	   and (C2, M2) for this group. All members join both groups. They can
604	   transmit on either. If a "core alive" heartbeat is received only on
605	   group (C1, M1), that is the group that should be transmitted to.

607	   For applications for which instantaneous switchover is more important
608	   than overhead, senders should transmit on both trees.
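
   The sender-side choice between trees can be sketched as follows.
   The function name and the representation of a tree as a tuple are
   assumptions made for illustration.

```python
def trees_to_send_on(core_alive, instant_switchover=False):
    """Return the trees a sender should transmit on.

    core_alive maps a tree such as ("C1", "M1") to True when a "core alive"
    heartbeat has recently been received on it.
    """
    if instant_switchover:
        # Switchover time matters more than overhead: use every tree.
        return list(core_alive)
    alive = [tree for tree, ok in core_alive.items() if ok]
    # Any live tree would do; taking the first one is an arbitrary choice.
    return alive[:1]
```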

610	2.6 Access Control

612	   We accomplish access control by allowing the core for the group to be
613	   configured with the set of allowed senders. The core can put the
614	   access rules into the heartbeat message. The heartbeat message
615	   contains a list of address prefixes of authorized senders and
616	   unauthorized senders. If the rules do not fit into the heartbeat, or
617	   the core for privacy reasons does not want to advertise in advance
618	   all the allowed senders, it can specify that no sender other than
619	   itself is allowed. In that case, all senders must tunnel packets to
620	   the core and the core will forward them. Once a sender gets
621	   permission to send, and is known to have data to send, the core can
622	   add that sender's address to the heartbeat message.

624	   For example, if there is some sort of authentication that must be
625	   done in order to get permission, the core initially disallows all
626	   senders, but then when S1 gets permission, it gets added to the list
627	   in the heartbeat message.

629	   Since the heartbeat message gives the access rules, all SM routers
630	   will refuse to forward a packet from a sender disallowed by the
631	   access rules.
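
   The forwarding check an SM router applies could look like the sketch
   below. How the two lists combine is not spelled out in this section;
   the sketch assumes that a non-empty list of authorized prefixes is
   authoritative and that the list of unauthorized prefixes is consulted
   only otherwise. The function name and argument layout are
   hypothetical.

```python
import ipaddress


def sender_allowed(src, include, exclude):
    """Check a source address against the prefix lists from the heartbeat."""
    addr = ipaddress.ip_address(src)
    inc = [ipaddress.ip_network(p) for p in include]
    exc = [ipaddress.ip_network(p) for p in exclude]
    if inc:
        # Assumed rule: with an include list, only listed senders pass.
        return any(addr in net for net in inc)
    # Otherwise every sender passes unless it is explicitly excluded.
    return not any(addr in net for net in exc)
```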

633	   Border/Access routers may also have an additional Access Control List
634	   locally.  For instance, they may have a list of sender
635	   prefixes/addresses allowed to transmit multicast data.  All multicast
636	   traffic with a source address matching these prefixes/addresses will
637	   not be filtered. The Include/Exclude Senders List from the core will
638	   prevent these senders from sending to a group that they are not
639	   permitted to send to.

641	2.7 Dynamically forming more trees

643	   In some cases dynamically formed auxiliary trees make sense,
644	   especially in the inter-domain, where policy might prohibit packets
645	   from A to D to transit domain B. With a core in domain B, or just due
646	   to the shared tree that happened to get formed, packets from senders
647	   in A to receivers in D might traverse domain B. One simple method of
648	   solving the problem is to have A unicast to the core, and have the
649	   core send the multicast. B is still acting as a transit domain
650	   between A and D, but it doesn't know it.

652	   Another solution takes inspiration from the PIM-SM concept of using
653	   the shared tree to find out about per-source trees. The way it works
654	   is that the sender in domain A, say X, sends a message to the core C
655	   telling it that it would like to create a "spin-off" group, (X,M').
656	   Then the core C, in the heartbeat messages for group (C,M) advertises
657	   the spin-off trees that members of (C,M) should also join. The spin-
658	   off tree would, like the original tree, be kept robust through keep-
659	   alives.

661	   Although this does allow creation of multiple trees to support a
662	   single group, this is less expensive than the PIM-SM scheme because
663	   it does not always create a tree for every sender. It only does it
664	   when necessary, and does not need a totally separate tree for each
665	   sender. It only needs one per domain in which there are sources (and
666	   only when the shared tree doesn't work because of transit policy
667	   problems).

669	2.8 Multicast Scoping

671	   A multicast group address can be scoped such that packets matching
672	   the group address are not forwarded outside the defined region.  Two
673	   commonly used scopes are the link-local scope and the global scope
674	   and they do not require configuration.  Routers merely do not forward
675	   the statically assigned link-local scope address (224.0.0.0/24).

677	   The third type of scoping requires network administrators to
678	   configure the perimeter (boundary routers) of the scoped region. This
679	   is called administratively scoped or local scope. At present, this is
680	   achieved by configuring multicast border routers (M-BRs) on a scope
681	   boundary with a boundary scope address range - so-called
682	   Administratively Scoped address range. Multicast traffic flows which
683	   are to be confined within a range must use a class-D address which is
684	   within the range. An M-BR is an impermeable boundary to any multicast
685	   packet with a class-D destination address that falls within any of
686	   its configured Administratively Scoped address ranges.

688	   It is perfectly feasible for SM to use exactly the same mechanism for
689	   achieving multicast scoping. However, multicast scoping as it is
690	   currently defined requires a significant amount of configuration, as
691	   well as co-ordination of the address space for defining scope
692	   boundary ranges.  Any mis-configurations can lead to multicast
693	   packets "leaking" across boundaries they should not.

695	   Multicast scope boundary configurations must conform to certain
696	   rules, such as the rule that boundaries must be completely contained
697	   within one another (the terms "nesting" or "convex" are often used).
698	   The MZAP protocol [MZAP] is implemented on M-BRs to detect
699	   inconsistent administratively scoped boundary configurations. As such
700	   it is essentially a network management tool; it does not correct
701	   mis-configurations.

703	   In SM, the group address (C,M) is scoped according to the unicast
704	   core address C. The advantage of this compared to Administratively
705	   Scoped IP Multicast [RFC2365] is that there is no requirement for
706	   these scoped addresses to be dynamically assigned (via AAP or MAAS)
707	   or announced in the scoped regions (MZAP).

709	2.8.1  Multicast Scoping using unicast boundaries and scope mask

711	   SM has the unique ability to take advantage of the unicast routing
712	   system boundaries (e.g.  subnet, area, AS, AS-Confederation etc.) and
713	   use these as "natural" boundaries for multicast traffic, obviating
714	   the need for the configuration of explicit multicast boundaries.
715	   Furthermore, one group identifier (C, M) can be used with multiple
716	   scopes. It works as follows: assume a (C, M) group identifier is to
717	   be used for scopes A and B, with A nested inside B. A and B are
718	   natural unicast routing boundaries, e.g. area, and AS. A unicast
719	   routing system boundary is implicitly identified by a router
720	   aggregating routing information before propagating it over outgoing
721	   interfaces; this is achieved by shortening a prefix mask. For
722	   example, routing information inside boundary A has an associated mask
723	   of 24 bits. The boundary router between A and B reduces this to 16
724	   bits before propagating inside B.

726	   Now, if an SM data packet carried a "scope mask(len)" in the SM
727	   header, the data packet would not pass beyond any unicast routing
728	   system boundary that itself propagates a shorter mask in unicast
729	   route updates it sends. The general rule is: an SM data packet
730	   carrying a "scope mask(len)" is only forwarded over those interfaces
731	   that aggregate unicast routing information using a mask which is
732	   of equal length or longer than that specified in the SM data packet
733	   header.

735	                                   |
736	                           (c) /16 | (d) /12
737	                                   |
738	                           --------+-------
739	                           (a) /8  | (b) /20
740	                                   |
741	                                   |

743	   The figure above illustrates a router with 4 interfaces, a, b, c, d,
744	   each of which is aggregating routes with the respective prefix. If an
745	   SM data packet arrives on interface (b) carrying a "scope mask(len)"
746	   of 12, it is forwarded only over interfaces (c) and (d).
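
   The general rule and the figure can be checked with a short sketch.
   The function name and the dictionary layout are assumptions;
   excluding the arrival interface is ordinary forwarding behaviour
   rather than part of the scope rule itself.

```python
def outgoing_interfaces(agg_len, arrival, scope_mask_len):
    """Interfaces over which a scoped SM data packet may be forwarded.

    agg_len maps an interface name to the prefix length it uses when
    aggregating unicast routes.
    """
    return sorted(
        ifname
        for ifname, length in agg_len.items()
        if ifname != arrival and length >= scope_mask_len
    )
```

   With the interfaces of the figure, {"a": 8, "b": 20, "c": 16,
   "d": 12}, a packet arriving on b with a scope mask length of 12 is
   forwarded over c and d but not over a, whose /8 aggregate is
   shorter than the scope mask.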

748	2.8.2  Multicast Scoping using private network boundaries

750	   A multicast session can be scoped within a private network if the
751	   core address belongs to the private address space and is not
752	   translated to any global address. In this case the boundary routers
753	   can be the filtering or NAT devices at the edge of the network. Since
754	   NAT devices can scope the addresses, the SM data packet itself does
755	   not have to carry the scope mask in the SM header.

757	   Note that for administrative scoping purposes, the function in the
758	   NAT device which is of interest here is the filtering and address
759	   space separation function, not the address translation function.  A
760	   public node will not be able to join a private core if the private
761	   core address is not mapped to any global address. As a result, no
762	   data packets for this scoped core will be forwarded out of the NAT
763	   device.

765	   If the boundary routers are NAT devices, there is no requirement for
766	   the NAT devices to be SM-enabled (i.e. to know how to translate SM
767	   specific packets) for the purpose of scoping SM groups. If the NAT is
768	   not SM-enabled, the join message will be filtered according to the
769	   core (IP destination) address and hence forwarding state for (C,M)
770	   will only be created in the defined scope. If the NAT device is SM-
771	   enabled, data packets can be filtered based on the core address C or
772	   the source address. In the case of SM dense mode, C=255.255.255.255.
773	   If the NAT device is not SM-enabled, since the IP destination
774	   address=255.255.255.255, the packets will be filtered. Hence SM
775	   dense-mode traffic is scoped by default, i.e. no dense-mode data
776	   packets will be forwarded across any boundary. If the NAT device is
777	   SM-enabled, a dense-mode data packet is scoped according to its IP
778	   source address.  Source address is scoped in the same manner as core
779	   address.

781	   If two scoped regions intersect topologically, then the address space
782	   in the overlapped region cannot be used by the outer scope, as stated
783	   in RFC2365. This applies here as well, i.e. a scoped group address
784	   cannot have its core address in the address space of the overlapped
785	   region, to avoid the problem of the same (C,M) belonging to different
786	   scopes at the intersecting boundary. This implies a core address C,
787	   scoped within scope X, where scope X is inside scope Y, should be
788	   unique within scopes X and Y; and no core within scope Y should have
789	   that same address C. Further, any other addresses scoped within X
790	   should not be visible to scope Y; all addresses scoped within Y are
791	   visible to scope X. This address separation is already maintained by
792	   NAT devices.

794	2.8.3 Multicast Scoping in IPv6

796	   In IPv6, if a core address is a site-local scope address, then the
797	   corresponding (C,*) will be site-local scope as well.

799	2.9 Additional Features

801	   We are investigating the following additional features, which are not
802	   available in other multicast protocols:

804	   - the ability to select dense-mode. Currently there are routers that
805	   implement dense mode and routers that implement sparse mode, and
806	   typically a domain will implement either sparse or dense mode. There
807	   is no way to choose, per application, which type of tree is more
808	   appropriate.

810	   There are cases in which dense mode makes more sense for an
811	   application.  For example, dense mode is more appropriate if the
812	   receivers are so densely distributed that there is very little
813	   optimization gained by creating a tree. Dense mode is also
814	   appropriate when the volume of data is sufficiently low that
815	   optimizing its delivery is not worth the overhead of creating and
816	   maintaining a tree.

818	   With SM we use the convention of core=FF:FF:FF:FF to indicate the
819	   packet should be sent via dense-mode. For such packets no tree is
820	   formed and routers merely forward the packet using reverse path
821	   forwarding.  As in DVMRP, states (S,M), where S is the source IP
822	   address, are created for dense mode groups.

824	   Routers find out whether their neighbors support SM, and other
825	   characteristics of their neighbors, through Hello messages. A dense
826	   mode SM-packet should only be sent to SM-aware neighbors. As with
827	   DVMRP, tunnels can be configured between SM-aware nodes to enable a
828	   wider range for delivery of dense-mode SM packets.

830	   - the ability to join a set of groups. The join message contains (C,
831	   M, mask). That facilitates having content parameterized by M. For
832	   instance, if the set of groups (C,*) is for stock information,
833	   certain bits in M can encode industry, country, etc. To receive
834	   information about all stocks, join (C,*). To receive some subset,
835	   join a more specific (M, mask) for core C.
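
   Matching a group against a joined set (C, M, mask) can be sketched
   as below, assuming the mask is an ordinary 32-bit bit mask applied
   to M, so that (C,*) corresponds to an all-zero mask. The function
   name is hypothetical.

```python
import ipaddress


def group_matches(join_m, join_mask, packet_m):
    """True if packet_m falls inside the joined set (M, mask) for a core C."""
    m = int(ipaddress.IPv4Address(join_m))
    mask = int(ipaddress.IPv4Address(join_mask))
    g = int(ipaddress.IPv4Address(packet_m))
    # Compare only the bits selected by the mask.
    return (g & mask) == (m & mask)
```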

837	2.10 SM Issues

839	2.10.1 Host API and Kernel Changes

841	   The SM architecture requires changes to the host Application
842	   Programming Interface (API) and kernel. A host may join a group using
843	   either SM Join, where hosts send joins similarly to an SM router, or
844	   IGMP extended to carry the core address as well as a class-D address.
845	   As noted before, host SM Join should only be used where appropriate,
846	   e.g. when there is no local SM support.

848	   Taking the BSD Sockets API as an example, joining a group is achieved
849	   using a system call; the data structure passed with the system call
850	   as an argument only supports the specification of a class-D address
851	   and interface (IP) address. For SM this data structure needs
852	   modifying to include a core address element, which can be
853	   concatenated with the class-D address to form SM's 8 byte group
854	   identifier. The kernel SM software, or IGMP software, can then make
855	   use of this information to generate an SM join message, or IGMP
856	   Report, respectively.

858	   Similarly, when data is sent to a group, the data structure passed to
859	   the send system call must include a core address. The kernel SM
860	   software can then place this core address in the SM header.  When an
861	   SM packet (identified by the IP protocol field) is received, the
862	   kernel SM software is invoked and the SM header is removed before
863	   the packet is passed to the upper layer.

865	2.10.1.1 Extending IGMP

867	   While not necessary, we propose using TLV in IGMP Membership Report
868	   messages. It is anticipated that IGMP will be extended for various
869	   purposes in future. The use of TLV will facilitate that.

871	   In addition to the class-D address, a field called the extended
872	   address field, for lack of a better term, is defined to carry the
873	   additional address required in IGMPv3, Express, SM and Distributed
874	   Core Multicast (DCM). The IGMP Membership Report message is encoded
875	   as follows:

876	    Type     Value
877	    Classic: S,G (if IGMPv3 with source specific joins)
878	    Express: S,E
879	    Simple:  C,M
880	    DCM:     (S),G where S is a list of channels

881	   Hence the extended address field carries:  i) the source address for
882	   classical IP multicast (IGMPv3 with source specific joins), ii) the
883	   source address for Express, iii) the core address for SM, and iv) the
884	   pointer to a list of channels for DCM.

886	   Extending IGMP is perfectly feasible - it has been done before in
887	   upgrading from IGMPv1 to IGMPv2, and changes will be required for
888	   IGMPv3 if it gains wider acceptance. The kernel modifications
889	   required to support SM are mainly to handle the additional address
890	   field.  The host API change itself requires only the addition of two
891	   parameters.  We do not, therefore, consider host changes as barriers
892	   to SM deployment.

894	2.10.2 Layer 2 Filtering

896	   In conventional IP multicast, each class D address could be mapped
897	   to a distinct MAC address if 28 bits were available at the MAC layer
898	   for mapping. However, since only 23 bits of the MAC address are used
899	   for mapping, 32 IP multicast addresses could potentially be mapped to
900	   one MAC layer address.  Hence higher layer filtering of multicast
901	   packets is required.
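
   The 23-bit mapping can be illustrated with the conventional Ethernet
   mapping for IPv4 multicast, in which the low-order 23 bits of the
   class D address are placed in the low-order 23 bits of
   01:00:5e:00:00:00. The function name is an assumption made for this
   sketch.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 class D address to its Ethernet multicast address."""
    low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    # Format the 48-bit value as six colon-separated octets.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))
```

   For example, 224.1.1.1 and 225.129.1.1 differ only in the 5 bits
   that are dropped, so both map to 01:00:5e:01:01:01 and must be told
   apart above the MAC layer.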

903	   If the low-order 4 bytes of the SM group identifier (the class-D
904	   address) are similarly mapped, there is the potential for each of a
905	   subnet's hosts to join different SM groups, with their group-ids
906	   differing only in the core address portion of the group-id. In this
907	   worst-case scenario the transmission of packets to one group will be
908	   received by hosts belonging to all other SM groups on the subnet; a
909	   group's packets only become distinguishable at the hosts' network
910	   layers. In a more realistic case we might reasonably expect only a
911	   small percentage of a subnet's hosts to receive packets
912	   unnecessarily.

914	   One possible way to reduce the amount of filtering at the network
915	   layer would be to statically map the core address to a multicast
916	   layer 2 address if we assume groups associated with a core are likely
917	   to be related. This would still potentially incur higher layer
918	   filtering of undesired groups, but only those hosts subscribed to
919	   group(s) associated with a particular core would be affected.

921	   The problem of mapping a larger-than-usual network identifier to a
922	   layer 2 address is not unique to SM - the problem manifests itself in
923	   IPv6 and EXPRESS.

925	   One possible way of guaranteeing layer-2 multicast destination
926	   address uniqueness would be to have special node(s) map a unique
927	   layer 2 address to the group-id. Before a node could send, receive or
928	   forward data, it has to obtain the layer 2 address. IGMP can be
929	   extended for this purpose.

931	   Another possible solution is to have hardware filter based on a group
932	   address at a specific offset and of a specific length. The NIC would
933	   be snooping the IP header, but software should be able to program it
934	   to filter addresses at the desired offset.

936	3.0 Packet formats

938	   This section describes all the packet formats. Simple Multicast could
939	   be implemented as very small modifications to PIM, CBT, or BGMP.

941	   The packet types are:

943	   - data packet

945	   - join-request

947	   - join-ack

949	   - keep-alive (sent by child to parent)

951	   - heartbeat (sent by parent to child)

953	   - flush-tree (sent by parent to child after a loop is detected, to
954	   clear out state from looped tree as quickly as possible and cause
955	   subtree to be reformed)

957	   For all control packets (JOIN-REQUEST, JOIN-ACK, KEEP-ALIVE,
958	   HEARTBEAT, FLUSH-TREE), the "Protocol" field in the IPv4 header is
959	   set to SM (a new protocol field).

961	3.1 SM-'tunnels'

963	   Upstream (towards the core) or downstream SM routers may not be
964	   immediate neighbors, if there are non-SM routers on the path between
965	   them.  In a traditional tunnel between R1 and R2, R1 must add an
966	   extra IP header, and R2 must delete the header. SM gets the same
967	   functionality without adding and deleting headers. Instead all that
968	   is needed is to overwrite the destination address in the IP header to
969	   the address of the "tunnel" endpoint. The reason this can be done is
970	   that the information necessary for SM-routers to route the packet
971	   (namely C and M) is contained in the SM header.

973	   JOIN-REQUESTs and JOIN-ACKs allow tunnel-endpoints to learn of each
974	   other.  The state for a "tunnel" consists of the IP address of the
975	   endpoint, and the number of actual IP hops in the tunnel. The count
976	   of the tunnel's hops is kept because SM counts the length of the
977	   tree, so that senders can know what to set as the TTL in data
978	   packets.

980	3.2 Data Packet Header

982	   IP Header

984	   0               1               2               3
985	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
986	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
987	   |Version|  IHL  |Type of Service|          Total Length         |
988	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
989	   |         Identification        |Flags|      Fragment Offset    |
990	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
991	   |  Time to Live |   Protocol =  |         Header Checksum       |
992	   |               | IPPROTO_SM    |                               |
993	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
994	   |                       Source Address                          |
995	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
996	   |                    Destination Address                        |
997	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

999	   SM Header

1001	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1002	   |                         Core Address                          |
1003	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1004	   |                     Multicast Address                         |
1005	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1006	   |Protocol=eg UDP|   Core Mask   |                             |L|
1007	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1009	   This SM header includes C, M, and the loop detect flag, where
1010	   C=FF:FF:FF:FF indicates the packet should be delivered dense-mode.

1012	   The 'L' bit in Flag, if set, indicates the TTL for this packet should
1013	   never reach 0 (see section 2.5.2, Loop).
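
   As a rough sketch, the 12-byte SM header drawn above could be
   encoded and decoded as follows. The field widths are read off the
   diagram; treating the last 16 bits as a flags field whose least
   significant bit is 'L', with the unnamed bits sent as zero, is an
   assumption, as are all names in the code.

```python
import socket
import struct

# Core (4 bytes), Multicast Address (4 bytes), Protocol (1 byte),
# Core Mask (1 byte), 16 bits of flags with 'L' assumed to be the last bit.
SM_HEADER = struct.Struct("!4s4sBBH")
L_FLAG = 0x0001
DENSE_MODE_CORE = "255.255.255.255"  # C=FF:FF:FF:FF


def pack_sm_header(core, group, proto, core_mask, loop_detect):
    flags = L_FLAG if loop_detect else 0
    return SM_HEADER.pack(
        socket.inet_aton(core), socket.inet_aton(group), proto, core_mask, flags
    )


def unpack_sm_header(data):
    core, group, proto, core_mask, flags = SM_HEADER.unpack(data[:SM_HEADER.size])
    return (
        socket.inet_ntoa(core),
        socket.inet_ntoa(group),
        proto,
        core_mask,
        bool(flags & L_FLAG),
    )
```

   An application doing an expanding ring search would call
   pack_sm_header() with loop_detect set to False.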

1015	   The IP Destination address is ALL-SM-NODES except in the following
1016	   cases:

1018	   - when a non-member sender transmits the packet, the destination is
1019	   set to the core address. The purpose of this is to enable the packet
1020	   to be unicast until it hits a node that is SM-aware, at which point
1021	   the packet is multicast along the tree from the point at which it
1022	   entered the tree.  Note that if the non-member sender has joined the
1023	   group as a 'sender-only' (c.f. uni-directional join in CBT), then the
1024	   destination address in the data packet is either ALL-SM-NODES or the
1025	   tunnel endpoint (as described below).

1029	   - when the packet is transmitted on a tunnel port, in which case the
1030	   destination address is set to the IP address of the tunnel endpoint.

1032	   Note that at Layer 2, the MAC address is mapped to the Multicast
1033	   Address M of the group (C,M), not to ALL-SM-NODES.
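
   The choice of IP destination address can be summarized in a small
   sketch. ALL_SM_NODES is kept as a symbolic name because no numeric
   value is given here; the function and parameter names are
   hypothetical.

```python
ALL_SM_NODES = "ALL-SM-NODES"  # symbolic stand-in; no numeric value assumed


def ip_destination(core, sender_on_tree, tunnel_endpoint=None):
    """Pick the IP destination address for an SM data packet.

    sender_on_tree is True for group members and for 'sender-only' joins.
    """
    if not sender_on_tree:
        # Non-member sender: unicast towards the core.
        return core
    if tunnel_endpoint is not None:
        # Transmission on a tunnel port.
        return tunnel_endpoint
    return ALL_SM_NODES
```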

1035	3.2 JOIN-REQUEST

1037	   The following control packet header fields are as defined in CBT:
1038	   addr_len, checksum, Payload Length and # of options.

1040	   0               1               2               3
1041	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1042	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1043	   |  vers |type=1 |  addr len     |         checksum              |
1044	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1045	   |Payload Length |  # of options |           reserved            |
1046	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1047	   |                    Join Originating Router                    |
1048	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1049	   |                       core address C                          |
1050	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1051	   |                       Multicast address M                     |
1052	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1053	   |                       Multicast address mask m                |
1054	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1055	   |  option type  |  option len   |        option value...        |
1056	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1058	   The destination IP address in the IP header is the Core Address.  The
1059	   JOIN-REQUEST is sent with the Router Alert Option.

1061	   The Multicast address and corresponding mask (M,m) may appear
1062	   multiple times. The total length of these fields is specified in the
1063	   "addr_len" field of the common control header.

1065	   The JOIN-REQUEST may contain the following option:

1067	   - Originating TTL. This field is set to the TTL in the IP header of
1068	   this JOIN-REQUEST packet. The receiving SM router ignores this
1069	   option unless the control packet is from an SM router that is not an
1070	   immediate neighbor. The value in this field is used to calculate the
1071	   number of hops in a 'tunnel' = Originating TTL - TTL in the IP
1072	   header of this packet as received. The value derived is placed in
1073	   "# of hops in tunnel from you to me" in the JOIN-ACK message.

1075	   0               1               2               3
1076	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1077	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1078	   |     1         |       2       |        Originating TTL        |
1079	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1081	   - Sender-Only. The join would only be successful if the sender is on
1082	   the Include Senders List or NOT in the Exclude Senders List. The
1083	   sender is attached to the tree as per uni-directional Join in CBT.

1086	   0               1               2               3
1087	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1088	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1089	   |     2         |       2       |       Reserved                |
1090	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
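
   For the Originating TTL option, the encoding shown in its diagram
   and the tunnel hop calculation can be sketched as follows. The
   function names are assumptions; the option type 1, length 2 and the
   16-bit value field are taken from the diagram.

```python
import struct

OPT_ORIGINATING_TTL = 1  # option type from the diagram
OPT_LEN = 2              # option length from the diagram


def pack_originating_ttl_option(ttl):
    """Encode the option as type (1 byte), length (1 byte), value (2 bytes)."""
    return struct.pack("!BBH", OPT_ORIGINATING_TTL, OPT_LEN, ttl)


def tunnel_hops(originating_ttl, received_ttl):
    """Hops in the 'tunnel' = Originating TTL - TTL in the received IP header."""
    if received_ttl > originating_ttl:
        raise ValueError("received TTL exceeds Originating TTL")
    return originating_ttl - received_ttl
```

   For example, a JOIN-REQUEST sent with TTL 64 and received with TTL
   61 has crossed a 'tunnel' of 3 hops, and 3 is the value returned in
   the JOIN-ACK.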

1092	3.3 JOIN-ACK

1094	   0               1               2               3
1095	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1096	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1097	   |  vers |type=2 |  addr len     |         checksum              |
1098	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1099	   |Payload Length |  # of options |    # of hops in 'tunnel'      |
1100	   |               |               |       from you to me          |
1101	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1102	   |                    Join Originating Router                    |
1103	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1104	   |                       core address C                          |
1105	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1106	   |                       Multicast address M                     |
1107	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1108	   |                       Multicast address mask m                |
1109	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1110	   |  option type  |  option len   |        option value...        |
1111	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1113	   The destination IP address in the IP header is the downstream IP
1114	   source address of the JOIN-REQUEST. The JOIN-ACK is sent with the
1115	   Router Alert Option.

1117	   The Multicast address and corresponding mask (M,m) may appear
1118	   multiple times. The total length of these fields is specified in the
1119	   "addr_len" field.
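
   As an illustration of the layout above, the following sketch parses
   the fixed part of a JOIN-ACK and the repeated (M,m) pairs. It
   assumes IPv4 (32-bit) addresses, network byte order, "vers" in the
   high-order four bits of the first octet, and an "addr len" counted
   in octets (8 per (M,m) pair). None of these are stated in this
   section. The checksum is not verified and options are not decoded.

```python
import ipaddress
import struct

# vers/type, addr len, checksum, payload length, # of options,
# # of hops in 'tunnel', Join Originating Router, core address C
FIXED = struct.Struct("!BBHBBH4s4s")  # 16 octets


def parse_join_ack(packet):
    """Parse the fixed header and the (M, m) list of a JOIN-ACK."""
    (vers_type, addr_len, checksum, payload_len, n_options,
     tunnel_hops, origin, core) = FIXED.unpack_from(packet, 0)
    if addr_len % 8:
        raise ValueError("addr len is not a whole number of (M,m) pairs")
    if len(packet) < FIXED.size + addr_len:
        raise ValueError("packet shorter than addr len indicates")
    groups = []
    offset = FIXED.size
    for _ in range(addr_len // 8):
        m_addr, mask = struct.unpack_from("!4s4s", packet, offset)
        groups.append((str(ipaddress.IPv4Address(m_addr)),
                       str(ipaddress.IPv4Address(mask))))
        offset += 8
    return {
        "version": vers_type >> 4,      # assumed high nibble
        "type": vers_type & 0x0F,       # 2 for JOIN-ACK
        "checksum": checksum,
        "payload_length": payload_len,
        "num_options": n_options,
        "tunnel_hops": tunnel_hops,
        "origin": str(ipaddress.IPv4Address(origin)),
        "core": str(ipaddress.IPv4Address(core)),
        "groups": groups,
        "options_offset": offset,       # where the options would start
    }
```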

1121	   The field "# of hops in tunnel from you to me" is ignored unless the
1122	   control packet is from an SM router that is not an immediate
1123	   neighbor. The value is saved as state for this tunnel port.

1125	   The options from the JOIN-REQUEST are copied into the JOIN-ACK, with
1126	   the exception of the "Originating TTL" option. The Originating TTL is
1127	   set to the TTL in the IP header of this JOIN-ACK packet.
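
   The option-copying rule can be sketched as below. Options are
   modelled as (type, value) tuples, which is an assumption made for
   illustration, as is the use of option type 1 with a 2-octet value
   for "Originating TTL" (taken from the option diagram in the
   JOIN-REQUEST description). The sketch does not add an Originating
   TTL option if the request carried none.

```python
ORIGINATING_TTL = 1  # assumed option type for 'Originating TTL'


def ack_options(request_options, ack_ip_ttl):
    """Build JOIN-ACK options from the JOIN-REQUEST options.

    Every option is copied unchanged except 'Originating TTL', whose
    value is replaced by the TTL used in the IP header of the JOIN-ACK.
    """
    copied = []
    for opt_type, value in request_options:
        if opt_type == ORIGINATING_TTL:
            value = ack_ip_ttl.to_bytes(2, "big")
        copied.append((opt_type, value))
    return copied
```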

1129	3.4 KEEP-ALIVE

1131	   0               1               2               3
1132	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1133	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1134	   |  vers | type=3|  addr len     |         checksum              |
1135	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1136	   |Payload Length |  # of options |        reserved               |
1137	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1138	   |                KEEP-ALIVE Originating Router                  |
1139	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1140	   |                       core address C                          |
1141	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1142	   |                       Multicast address M                     |
1143	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1144	   |                       Multicast address mask m                |
1145	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1146	   |  option type  |  option len   |        option value...        |
1147	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1149	   The keep-alive message is sent from a child to a parent (towards the
1150	   core), and is sent only if a keep-alive has been received recently
1151	   from a child. The destination IP address in the IP header is
1152	   ALL-SM-NODES or the tunnel endpoint address.

1154	   A single keep-alive can serve as many groups as fit into the list in
1155	   the packet.

1157	   (M,m) may appear multiple times. The total length of these fields is
1158	   specified in the "addr_len" field.
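
   Since "addr len" is an 8-bit field, the number of (M,m) pairs one
   KEEP-ALIVE can carry is bounded. The sketch below splits a list of
   groups into per-packet batches. It assumes "addr len" counts octets
   and that each (M,m) pair occupies 8 octets (IPv4), giving at most
   31 pairs per packet; other limits such as the MTU are ignored.

```python
MAX_ADDR_LEN = 255   # largest value of the 8-bit "addr len" field
PAIR_SIZE = 8        # assumed: 4-octet M plus 4-octet mask m


def batch_groups(groups):
    """Split `groups` into lists that each fit one KEEP-ALIVE."""
    per_packet = MAX_ADDR_LEN // PAIR_SIZE
    return [groups[i:i + per_packet]
            for i in range(0, len(groups), per_packet)]
```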

1160	   The KEEP-ALIVE may contain the following options:

1162	   0               1               2               3
1163	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1164	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1165	   |     1         |       10      |I|     reserved flag bits      |
1166	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1167	   |                Include/Exclude Sender Prefix                  |
1168	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1169	   |                Include/Exclude Sender Mask                    |
1170	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1172	   - Include/Exclude Senders List that upstream routers should filter.
1173	   This option may appear multiple times. The 'I' bit is set if this is
1174	   an include sender list, and is zero if this is an exclude sender
1175	   list.
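
   A possible encoding of one Include/Exclude Senders List entry is
   sketched below. It assumes IPv4 prefixes, network byte order, and
   that the 'I' bit is the most significant bit of the 16-bit flags
   field, as drawn in the diagram. The function names are
   illustrative only.

```python
import ipaddress
import struct

SENDER_FILTER = struct.Struct("!BBH4s4s")  # type, len, flags, prefix, mask
I_BIT = 0x8000  # assumed position of the 'I' (include) flag


def encode_sender_filter(prefix, mask, include):
    """Encode one Include/Exclude Senders List option (type 1, len 10)."""
    flags = I_BIT if include else 0
    return SENDER_FILTER.pack(1, 10, flags,
                              ipaddress.IPv4Address(prefix).packed,
                              ipaddress.IPv4Address(mask).packed)


def decode_sender_filter(option):
    """Return (prefix, mask, include) from an encoded option."""
    opt_type, opt_len, flags, prefix, mask = SENDER_FILTER.unpack(option)
    if (opt_type, opt_len) != (1, 10):
        raise ValueError("not an Include/Exclude Senders List option")
    return (str(ipaddress.IPv4Address(prefix)),
            str(ipaddress.IPv4Address(mask)),
            bool(flags & I_BIT))
```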

1177	   0               1               2               3
1178	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1179	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1180	   |     2         |       10      |        hop count              |
1181	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1182	   |     Prune Time                |   # of hops in 'tunnel'       |
1183	   |                               |       from you to me          |
1184	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1186	   - KEEP-ALIVE Option. This option should appear the same number of
1187	   times as the address set (C,M,mask). Each instance corresponds to,
1188	   and applies to, one address set (C,M,mask).

1190	   The fields in this option are:
1191	   - Hop count: number of hops to the furthest leaf for (C,M,mask). It
1192	   is incremented at every SM hop; for a KEEP-ALIVE received from a
1193	   tunnel port, the number of hops in 'tunnel' is also added.

1195	   - Prune Time for (C,M,mask): the time after which, if no KEEP-ALIVE
1196	   is received for group (C,M,mask), the parent should prune off this
1197	   branch.

1199	   - 'Originating TTL'. This is as described in JOIN-REQUEST.
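
   The hop count and Prune Time handling can be sketched as follows.
   This is one reading of the text, not a definitive procedure: it
   assumes the receiving router adds one for the SM hop and, for a
   tunnel port, also adds the saved tunnel hop count. It also assumes
   Prune Time is expressed in the same unit as the timestamps passed
   in; the unit is not given in this section.

```python
def next_hop_count(received, from_tunnel=False, tunnel_hops=0):
    """Hop count after processing a KEEP-ALIVE at one SM router."""
    hops = received + 1            # one SM hop
    if from_tunnel:
        hops += tunnel_hops        # hops hidden inside the tunnel
    return hops


def should_prune(now, last_keepalive, prune_time):
    """True if no KEEP-ALIVE arrived within the Prune Time."""
    return now - last_keepalive > prune_time
```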

1201	3.5 HEARTBEAT

1203	   0               1               2               3
1204	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1205	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1206	   |  vers | type=4|  addr len     |         checksum              |
1207	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1208	   |Payload Length |  # of options |      reserved                 |
1209	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1210	   |                 HEARTBEAT Originating Router                  |
1211	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1212	   |                       core address C                          |
1213	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1214	   |                       Multicast address M                     |
1215	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1216	   |                       Multicast address mask m                |
1217	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1218	   |  option type  |  option len   |        option value...        |
1219	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1221	   The heartbeat is sent by a parent to a child. It is sent periodically
1222	   regardless of whether a heartbeat is received from its parent. The
1223	   destination IP address is set to ALL-SM-NODES or the tunnel endpoint
1224	   address.

1226	   The HEARTBEAT may contain the following additional options:
1227	   - Include/Exclude Senders List. This is the list of allowed or
1228	   prohibited senders to the group. The format of this option is the
1229	   same as the KEEP-ALIVE Include/Exclude Senders List, although it
1230	   serves a different purpose here.

1232	   - spin-off groups (Ci,Mi). One or more spin-off groups (Ci,Mi) may be
1233	   specified.

1235	   0               1               2               3
1236	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1237	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1238	   |     1         |  #Groupsx8    |       reserved flag bits      |
1239	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1240	   |                       Core Address  Ci                        |
1241	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1242	   |                    Multicast Address Mi                       |
1243	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1245	   - HEARTBEAT Option. This option should appear the same number of
1246	   times as the address set (C,M,mask). Each instance corresponds to,
1247	   and applies to, one address set (C,M,mask).

1249	   The fields in this option are:
1250	   0               1               2               3
1251	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1252	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1253	   |     2         |       6       |        core distance          |
1254	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1255	   |     Time To Shutdown          |   # of hops in 'tunnel'       |
1256	   |                               |      from you to me           |
1257	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1258	   |A|                    reserved                                 |
1259	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1261	   - Core distance: number of hops to the core for (C,M,mask). It is
1262	   incremented at every SM hop. In addition, when the HEARTBEAT is
1263	   received from a tunnel port, core distance = core distance + number
1264	   of hops in 'tunnel'.
1265	   - Time To Shutdown: time left before the group should be closed
1266	   down. All 'ones' indicates the group should not be torn down.
1267	   - 'A' bit: if set, indicates the core is alive or reachable.

1269	   - 'Originating TTL'. This is as described in JOIN-ACK.
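
   The value part of the HEARTBEAT Option, as drawn in the diagram
   above, can be decoded along these lines. The sketch reads the ten
   octets that follow the option type and length, assumes network
   byte order, and assumes the 'A' bit is the most significant bit of
   the last 32-bit word. It does not check the option length field,
   and the names used are illustrative.

```python
import struct

HEARTBEAT_VALUE = struct.Struct("!HHHI")  # 10 octets
NO_SHUTDOWN = 0xFFFF                      # all 'ones'
A_BIT = 0x80000000                        # assumed position of 'A'


def decode_heartbeat_value(value):
    """Decode core distance, Time To Shutdown, tunnel hops and 'A'."""
    core_distance, shutdown, tunnel_hops, flags = \
        HEARTBEAT_VALUE.unpack_from(value, 0)
    return {
        "core_distance": core_distance,
        # None means the group should not be torn down.
        "time_to_shutdown": None if shutdown == NO_SHUTDOWN else shutdown,
        "tunnel_hops": tunnel_hops,
        "core_alive": bool(flags & A_BIT),
    }
```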

1271	3.6 FLUSH-TREE

1273	   0               1               2               3
1274	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1275	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1276	   |  vers | type=5|  addr len     |         checksum              |
1277	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1278	   |Payload Length |  # of options |           reserved            |
1279	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1280	   |                 FLUSH-TREE Originating Router                 |
1281	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1282	   |                       core address C                          |
1283	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1284	   |                       Multicast address M                     |
1285	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1286	   |                       Multicast address mask m                |
1287	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1288	   |  option type  |  option len   |        option value...        |
1289	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1291	   The destination IP address is set to ALL-SM-NODES or the tunnel
1292	   endpoint address.

1294	   The Multicast address and corresponding mask (M,m) may appear
1295	   multiple times. The total length of these fields is specified in the
1296	   "addr_len" field of the common control header.

1298	   No options are currently defined.

1300	4 Acknowledgments

1302	   Many people have contributed ideas to this proposal, including Harald
1303	   Alvestrand, Joel Halpern and Fred Baker. Because SM builds on
1304	   previous work in IP Multicast, the authors are grateful to everyone
1305	   who has contributed to the development of IP Multicast. We would
1306	   like to thank all members of IDMR, in particular Dino Farinacci,
1307	   Mark Handley, Brad Cain, Dave Thaler, Russ White and Ken Carlberg,
1308	   whose helpful comments have improved this proposal. Others who have
1309	   provided helpful technical information include Matthew Yuen and
1310	   Patrick Lee.

1312	References

1314	      DNS Based RP Placement scheme
1315	      Dino Farinacci's presentation in the MBONED WG, 40th IETF Meeting

1317	      Static Multicast, Internet-Draft, March 1998
1318	      M. Ohta, J. Crowcroft

1320	      Express
1321	      IDMR Mailing List discussion

1323	      CBT, Core Based Tree Multicast Routing,
1324	      Ballardie, Cain, Zhang

1326	      PIM-SM, Protocol independent multicast-sparse mode Specification,
1327	      RFC-2117, June 1997
1328	      Estrin, Farinacci, Helmy, Thaler, Deering, Handley,
1329	      Jacobson, Liu, Sharma, and Wei.

1331	      BGMP, Border Gateway Multicast Protocol Specification,
1332	      Thaler, Estrin, Meyer

1334	      MASC, Multicast Address Set Claim Protocol,
1335	      Estrin, Handley, Kumar, Thaler

1337	      IGMP, Internet Group Management Protocol, Version 3,
1338	      Cain, Deering, Thyagarajan

1340	      "A Border Gateway Protocol 4 (BGP-4)", Y. Rekhter & T. Li,
1341	      RFC 1771, March 1995

1343	      "Multiprotocol Extensions for BGP-4", RFC 2283, February 1998.
1344	      Bates, T., Chandra, R., Katz, D., and Y. Rekhter

1346	      "The IP Network Address Translator (NAT)", RFC 1631, May 1994.
1347	      Egevang, K., Francis, P.

1349	      "Administratively Scoped IP Multicast",
1350	      RFC 2365, July 1998. Meyer, D.

1352	      Distributed Core Multicast, L. Blazevic, J.-Y. Le Boudec

1354	      OGMP ftp://cs.ucl.ac.uk/darpa/ogmp.ps.gz

1356	Authors' Addresses

1358	Radia Perlman
1359	Sun Microsystems Laboratories
1360	2 Elizabeth Drive
1361	Chelmsford, MA 01824
1362	Radia.Perlman@sun.com

1364	Cheng-Yin Lee
1365	Nortel Networks
1366	PO Box 3511, Station C
1367	Ottawa, ON K1Y 4H7, Canada
1368	leecy@nortel.com

1370	Tony Ballardie
1371	Research Consultant
1372	aballardie@acm.org

1374	Jon Crowcroft
1375	Department of Computer Science
1376	University College London
1377	Gower Street
1378	London, WC1E 6BT, UK
1379	J.Crowcroft@cs.ucl.ac.uk

1381	Zheng Wang
1382	Bell Labs Lucent Technologies
1383	101 Crawfords Corner Road
1384	Holmdel NJ 07733
1385	zhwang@bell-labs.com

1387	Thomas Maufer
1388	3Com Corporation
1389	5400 Bayfront Plaza
1390	Santa Clara, CA  95052
1391	maufer@3com.com

1393	Christophe Diot
1394	Sprint ATL
1395	1 Adrian Court
1396	Burlingame CA 94010
1397	USA
1398	cdiot@sprintlabs.com

1400	Joseph Thoo
1401	Nortel Networks
1402	PO Box 3511, Station C
1403	Ottawa, ON K1Y 4H7, Canada
1404	jthoo@nortel.com

1406	Mark Green
1407	@Home Networks
1408	markg@corp.home.net