idnits 2.17.1 

draft-ietf-ipoib-link-multicast-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  == There are 4 instances of lines with non-RFC3849-compliant IPv6 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'SHALL not' in this paragraph:
     
     It is up to the network administrator to select a link MTU to use
     when configuring an IPoIB link. The link MTU SHALL not be greater than
     the MTU of any IB devices on the IPoIB link. Here the IB devices include
     IB switches, CAs, or routers.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     In case an IPoIB link spans more than one IB subnet, the IPoIB link
     MTU MUST not exceed the path MTU of any path connecting two nodes in the
     same IB partition. It is up to the network administrator to determine the
     appropriate path MTU value that will work for any node in the same IPoIB
     link.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'IP6MLD' is defined on line 602, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2373 (ref. 'AARCH') (Obsoleted by RFC
     3513)

  ** Obsolete normative reference: RFC 2461 (ref. 'DISC') (Obsoleted by RFC
     4861)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IBTA'

  ** Obsolete normative reference: RFC 2460 (ref. 'IPV6') (Obsoleted by RFC
     8200)


     Summary: 6 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	INTERNET-DRAFT                                            H.K. Jerry Chu
3	<draft-ietf-ipoib-link-multicast-04.txt>                Sun Microsystems
4	                                                           Vivek Kashyap
5	                                                                     IBM
6	Expires: December, 2003                                       June, 2003

8	             IP link and multicast over InfiniBand networks

10	Status of this Memo

12	   This document is an Internet-Draft and is in full conformance with
13	   all provisions of Section 10 of RFC2026.

15	   Internet-Drafts are working documents of the Internet Engineering
16	   Task Force (IETF), its areas, and its working groups. Note that other
17	   groups may also distribute working documents as Internet-Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time. It is inappropriate to use Internet-Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html.

30	   Copyright (C) The Internet Society (2003).  All Rights Reserved.

32	Abstract

34	   This document specifies a method for setting up IP subnets and
35	   multicast services over InfiniBand(TM) networks. Discussions in this
36	   document are applicable to both IPv4 and IPv6, unless explicitly
37	   specified. A separate document will cover unicast and encapsulation
38	   of IP datagrams over InfiniBand networks.

40	Table of Contents
41	   1.0     Introduction
42	   2.0     Terminology
43	   3.0     Basic IPoIB Transport - Unreliable Datagram
44	   4.0     IB Multicast Architecture
45	   5.0     IB Links vs. IPoIB Links
46	   6.0     Setting up an IPoIB Link
47	   6.1     Maximum Transmission Unit
48	   6.2     IPoIB Link Q_Key
49	   6.3     Other Link Attributes
50	   7.0     The IPoIB Broadcast Group
51	   8.0     Mapping for other Multicast Groups
52	   9.0     Sending and Receiving IP Multicast Packets
53	   10.0    IP Multicast Routing
54	   11.0    New Types of Vulnerability in IB Multicast
55	   12.0    Security Considerations
56	   13.0    Acknowledgments
57	   14.0    References
58	   15.0    Author's Address
59	   16.0    Full Copyright Statement

61	1.0 Introduction

63	   InfiniBand Architecture (IBA) defines four layers of network services
64	   corresponding to layer one through layer four of the OSI reference
65	   model.  For the purpose of running IP over an InfiniBand (IB)
66	   network, the IB link, network, and transport layers collectively
67	   constitute the data link layer to the IP stack. One can find a
68	   general overview of IB architecture related to IP networks in
69	   [IPoIB_ARCH].

71	   This document will focus on the necessary steps in order to lay out
72	   an IP network on top of an IB network. It will describe all the
73	   elements of an IP over InfiniBand (IPoIB) link, how to configure its
74	   associated attributes, and how to set up basic broadcast and
75	   multicast services for it. IPoIB links are the building blocks upon
76	   which an IP network consisting of many IP subnets connected by
77	   routers can be built.  Subnetting allows the containment of broadcast
78	   traffic within a single link. It also provides a certain degree of
79	   isolation, for administrative purposes, between nodes on different
80	   subnets.

82	2.0 Terminology

84	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
85	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
86	   document are to be interpreted as described in [RFC2119].

88	3.0 Basic IPoIB Transport - Unreliable Datagram

90	   InfiniBand defines four types of transport services [IBTA]: reliable
91	   connection, unreliable connection, reliable datagram, and unreliable
92	   datagram. IBA also defines a special raw datagram service for
93	   encapsulation purposes. Both unreliable datagram and raw datagram
94	   define support for multicast. They provide the basic transport
95	   mechanism that best matches the IP datagram paradigm.

97	   IB unreliable datagram provides many additional features such as the
98	   partition key (P_Key) protection, multiple queue pairs (QPs), and
99	   Q_Key protection. Moreover, it defines a 32-bit invariant CRC
100	   checksum, which provides a much stronger protection against data
101	   corruption, compared with the 16-bit CRC that a raw datagram carries.

103	   For these reasons, IB unreliable datagram is considered to be a much
104	   better choice as the basic IPoIB transport than the raw datagram, and
105	   is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH],
106	   [IPoIB_ENCAP]).

108	4.0 IB Multicast Architecture

110	   The following discussion gives a short overview of the multicast
111	   architecture in InfiniBand. For a complete specification, the reader
112	   is referred to [IBTA].

114	   IBA defines two layers of multicast services. Its link layer uses
115	   multicast LIDs (MLIDs) in the Local Route Header (LRH). MLIDs are
116	   allocated by the Subnet Manager (SM) and fall in the range from
117	   0xC000 to 0xFFFE (approximately 16k). MLIDs are used by IB switches
118	   to program their multicast forwarding tables. An IB switch
119	   implementation may, however, support far fewer MLIDs in its
120	   forwarding table.
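
   The size of the multicast LID space can be checked with a few lines
   of arithmetic. The sketch below assumes a 16-bit LID field with
   multicast LIDs running from 0xC000 through 0xFFFE; the constant and
   function names are illustrative only and are not part of IBA.

```python
# Multicast LID (MLID) range, assuming a 16-bit LID field in the LRH.
MLID_FIRST = 0xC000
MLID_LAST = 0xFFFE


def is_multicast_lid(lid: int) -> bool:
    """Return True if lid falls inside the multicast LID range."""
    return MLID_FIRST <= lid <= MLID_LAST


# Number of distinct MLIDs that a Subnet Manager could hand out.
MLID_COUNT = MLID_LAST - MLID_FIRST + 1
```

   MLID_COUNT evaluates to 16383, which is the "approximately 16k"
   figure; a switch may still implement a smaller forwarding table.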

122	   The IB network layer uses multicast GIDs (MGIDs) in the Global Route
123	   Header (GRH). MGIDs closely resemble IPv6 multicast addresses [AARCH]
124	   shown below.

126	   |   8    |  4 |  4 |                  112 bits                   |
127	   +--------+----+----+---------------------------------------------+
128	   |11111111|flgs|scop|                  group ID                   |
129	   +--------+----+----+---------------------------------------------+

131	                                 Figure 1

133	   [IPoIB_ARCH] describes each field in more detail.
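
   Because an MGID has the same 128-bit layout as an IPv6 multicast
   address, the fields of Figure 1 can be pulled apart with ordinary
   integer shifts. The following is a minimal sketch, assuming the MGID
   is given in IPv6 textual notation; split_mgid is a hypothetical
   helper name, not an IBA or IPoIB interface.

```python
import ipaddress


def split_mgid(mgid: str) -> dict:
    """Split a 128-bit MGID into the four fields shown in Figure 1."""
    value = int(ipaddress.IPv6Address(mgid))
    return {
        "prefix": value >> 120,                 # top 8 bits, all ones
        "flags": (value >> 116) & 0xF,          # 4-bit flgs field
        "scope": (value >> 112) & 0xF,          # 4-bit scop field
        "group_id": value & ((1 << 112) - 1),   # low 112 bits
    }
```

   For instance, split_mgid("FF12:401B:8006::2") yields a prefix of
   0xFF, flags of 1, and a scope of 2.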

135	   Since every IB multicast packet is required to carry a LRH and a GRH,
136	   both a valid MGID and a valid MLID are needed before an IB multicast
137	   packet can be constructed.

139	   An IB multicast group is uniquely identified by a valid MGID. Before
140	   a MGID can be used within an IB subnet, either as a destination
141	   address of a multicast packet, or to represent a multicast group that
142	   an IB node can join, an IB multicast group corresponding to the MGID
143	   must be created through the Subnet Administrator (SA). Besides the
144	   MGID, the creator of an IB multicast group must supply values of path
145	   MTU, P_Key, Q_Key, Service Level (SL), FlowLabel, and TClass that
146	   are appropriate for all the potential clients of the multicast group
147	   to use. In return, SA will allocate a MLID to be used by switches in
148	   the local IB subnet.

150	   Unreliable multicast is defined by IBA as an optional functionality
151	   for channel adaptors (CAs) and switches. In today's IP technology,
152	   link multicast has become an indispensable function for better
153	   supporting a modern IP network. For this reason, it is required that
154	   an IPoIB fabric supports multicast. This includes all the CAs and
155	   switches that are part of an IP network.

157	5.0 IB Links vs. IPoIB Links

159	   A link segment on top of which an IP subnet can be configured is
160	   defined in [IPV6] as a communication facility or medium over which
161	   nodes can communicate at the "link" layer.  For most types of
162	   communication media, the boundary between different data link
163	   segments closely follows the physical topology of the network. For
164	   instance, an Ethernet network connected by switches, hubs, or bridges
165	   usually forms a single link segment and broadcast/multicast domain.
166	   Different Ethernet segments can be connected by IP routers at the
167	   network layer to form an IP network.

169	   InfiniBand defines its own link-layer and subnets consisting of nodes
170	   connected by IB switches and routers. However, the IPoIB link
171	   boundary need not follow the IB link boundary. Nodes residing on
172	   different IB subnets can still communicate directly with one another
173	   through IB routers at the InfiniBand network layer. This
174	   communication at the network layer applies to unicast as well as
175	   multicast.

177	   The ultimate requirement for two nodes in the same IB fabric to
178	   communicate at the IB level, besides physical connectivity, is a
179	   common P_Key.

181	   Partitioning in IB provides an isolation mechanism among nodes in an
182	   IB fabric, much like VLANs in the Ethernet network.  Each port of an
183	   HCA (Host Channel Adaptor) contains a P_Key table holding all the
184	   valid P_Keys the port is allowed to use. The P_Key table is set up by
185	   the SM of the local IB subnet. Each QP is programmed with a P_Key
186	   from the local P_Key table. This P_Key is carried in all the outgoing
187	   packets from the QP, and is used to compare against the P_Key of all
188	   incoming packets to the QP. Any packet with an invalid P_Key will be
189	   discarded by the QP and a P_Key violation trap will be generated.  IB
190	   switches may optionally enforce partition checking too.

192	   Following the above, IB partitions are the natural choice for
193	   defining IPoIB link boundaries. They also provide much needed
194	   flexibility for a network administrator to group nodes logically into
195	   different subnets in a large network.

197	6.0 Setting up an IPoIB Link

199	   A network administrator defines an IPoIB link by setting up an IB
200	   partition and assigning it a unique P_Key. Since full-duplex
201	   communication is required among IP nodes, full-membership P_Keys,
202	   that is, those with the high-order bit set to 1, shall be used. An IB
203	   partition may or may not span multiple IB subnets; whether it does
204	   or not is mostly transparent to IPoIB.
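
   The full-membership requirement amounts to a single bit test on the
   16-bit P_Key. A minimal sketch follows; the function name is
   illustrative, and the example P_Key values are arbitrary.

```python
FULL_MEMBER_BIT = 0x8000  # high-order bit of the 16-bit P_Key


def is_full_member_pkey(pkey: int) -> bool:
    """Return True if pkey is a full-membership P_Key."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    return bool(pkey & FULL_MEMBER_BIT)
```

   With this check, 0x8006 qualifies for IPoIB use while 0x0006, the
   same partition number without the high-order bit, does not.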

206	   Each node attached to an IB partition MUST have one of its HCAs
207	   assigned the P_Key to use. Note that the P_Key table of an HCA port
208	   may contain many P_Keys. It is up to the implementation to define the
209	   method by which the P_Key relevant to a particular IPoIB subnet is
210	   determined and conveyed to the IPoIB stack. For instance,
211	   implementations may resort to manual configuration when choosing
212	   the P_Key or a set of P_Keys for IPoIB, and rely on DHCP [DHCP] to
213	   assign an IP subnet number to each IPoIB link.

215	   Once an IB partition is established for IPoIB use, the link MTU and
216	   Q_Key are two other attributes that must be chosen before an IPoIB
217	   link can be configured.

219	6.1 Maximum Transmission Unit

221	   IB defines five permissible maximum payload sizes (MTUs). They are
222	   256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of
223	   1280 bytes or greater. For better compatibility with Ethernet, the
224	   dominant network medium in both LAN and WAN environments, the IPoIB
225	   link MTU should be 1500 bytes or greater. This leaves only 2048 and
226	   4096 bytes as the two acceptable MTUs for IPoIB. Channel adaptors
227	   supporting a MTU less than the minimum requirement can still expose
228	   an acceptable MTU to IP through an adaptation layer that fragments
229	   larger messages into smaller IB packets, and reassembles them on the
230	   receiving end. But this must be done in a way that is transparent to
231	   the IP stack.

233	   It is up to the network administrator to select a link MTU to use
234	   when configuring an IPoIB link. The link MTU SHALL NOT be greater
235	   than the MTU of any IB device on the IPoIB link. Here the IB devices
236	   include IB switches, CAs, and routers.

238	   In general, a maximum link MTU should be employed whenever possible
239	   to attain a better throughput performance. One caveat is that once a
240	   link MTU is chosen for a given IPoIB link, nodes connected by CAs of
241	   a smaller MTU won't be able to join the link unless the whole link
242	   and all the devices attached to it are reconfigured to use the
243	   smaller MTU.
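
   The selection rule above can be sketched as follows: take the
   smallest MTU among the devices on the link, and accept it only if it
   meets the 1500-byte target. This is an illustration under the
   assumption that every device reports one of the five IB MTUs and
   that no fragmenting adaptation layer is in use; max_link_mtu is a
   hypothetical helper, not an administrative interface defined here.

```python
IB_MTUS = (256, 512, 1024, 2048, 4096)  # permissible IB payload sizes
IPOIB_MIN_MTU = 1500                    # target for Ethernet compatibility


def max_link_mtu(device_mtus):
    """Return the largest usable IPoIB link MTU, or None if none fits."""
    mtus = list(device_mtus)
    if not mtus:
        raise ValueError("at least one device MTU is required")
    for mtu in mtus:
        if mtu not in IB_MTUS:
            raise ValueError(f"{mtu} is not a permissible IB MTU")
    # The link MTU may not exceed the MTU of any device on the link.
    limit = min(mtus)
    return limit if limit >= IPOIB_MIN_MTU else None
```

   A link whose devices support 4096, 2048, and 4096 bytes is therefore
   limited to 2048, and a single 1024-byte device leaves no acceptable
   value without an adaptation layer.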

245	   It may be desirable in some cases to use a smaller link MTU than the
246	   full size. For example, bridging an IPoIB link with an Ethernet link
247	   could be made much easier if the IPoIB link MTU is reduced to 1500
248	   bytes. For IPv4, this may require a manual configuration of a
249	   different link MTU than the maximum that all the nodes support.  For
250	   IPv6, one can use the MTU option of the router advertisement [DISC]
251	   to announce a smaller MTU to all the nodes.

253	   In case an IPoIB link spans more than one IB subnet, the IPoIB link
254	   MTU MUST NOT exceed the path MTU of any path connecting two nodes in
255	   the same IB partition. It is up to the network administrator to
256	   determine the appropriate path MTU value that will work for any node
257	   in the same IPoIB link.

259	6.2 IPoIB Link Q_Key

261	   A Q_Key is programmed by the source QP in every IB datagram, and is
262	   compared against the Q_Key of the destination QP.  A Q_Key violation
263	   will cause the offending datagram to be dropped, and a Q_Key
264	   violation counter to be incremented on the receiving port. A trap is
265	   also generated if the feature is supported on that port.

267	   A single Q_Key must be selected for all the QPs attached to an IPoIB
268	   link to use. It is recommended that a controlled Q_Key be used with
269	   the high order bit set. This is to prevent non-privileged software
270	   from fabricating and sending out bogus IP datagrams. All QPs
271	   configured for a given IPoIB link SHALL be assigned the same per-link
272	   Q_Key.
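
   The two Q_Key rules in this section, a controlled key and a single
   key per link, can be expressed as simple checks. The sketch assumes
   the 32-bit Q_Key width used by IBA; the function names and the
   sample values in the note below are hypothetical.

```python
CONTROLLED_QKEY_BIT = 0x80000000  # high-order bit of a 32-bit Q_Key


def is_controlled_qkey(qkey: int) -> bool:
    """Return True if qkey has the high-order (controlled) bit set."""
    if not 0 <= qkey <= 0xFFFFFFFF:
        raise ValueError("Q_Key must fit in 32 bits")
    return bool(qkey & CONTROLLED_QKEY_BIT)


def qps_share_link_qkey(qp_qkeys, link_qkey: int) -> bool:
    """Return True if every QP on the link uses the per-link Q_Key."""
    return all(qkey == link_qkey for qkey in qp_qkeys)
```

   For example, 0x80010000 would count as controlled while 0x00010000
   would not.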

274	6.3 Other Link Attributes

276	   TClass, FlowLabel, HopLimit, and SL are four other attributes that
277	   are required if an IPoIB link covers more than a single IB subnet.
278	   The selection of these values is implementation dependent.
279	   Implementations must take into account the topology of IB subnets
280	   comprising the IPoIB link to ensure a successful communication
281	   between any two nodes in the same IPoIB link.

283	7.0 The IPoIB Broadcast Group

285	   Once an IB partition is created with link attributes identified for
286	   an IPoIB link, the network administrator must create a special IB
287	   all-node multicast group (henceforth referred to as the broadcast
288	   group) with these link attributes for every node on the IPoIB link to
289	   join.  The creation of an IB multicast group is through the use of
290	   the "MCMemberRecord" SA attribute as described in the IBA
291	   specification.

293	   The MGID of an IPoIB broadcast group will embed in it the P_Key of
294	   the IB partition that defines the IPoIB link. A special signature is
295	   also embedded to identify all the MGIDs for IPoIB use only. For IPv4
296	   over IB, the signature will be "0x401B". For IPv6 over IB, the
297	   signature will be "0x601B".

299	   For an IPv4 subnet, the MGID for this special IB multicast group
300	   SHALL have the following format:

302	   |   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
303	   +--------+----+----+----------------+---------+----------+---------+
304	   |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
305	   +--------+----+----+----------------+---------+----------+---------+

307	                                 Figure 2

309	   For an IPv6 subnet, the format of the MGID SHALL look like this:

311	   |   8    |  4 |  4 |     16 bits    | 16 bits |       80 bits      |
312	   +--------+----+----+----------------+---------+--------------------+
313	   |11111111|0001|scop|0110000000011011|< P_Key >|000.............0001|
314	   +--------+----+----+----------------+---------+--------------------+

316	                                 Figure 3

318	   As for the scop bits, if the IPoIB link is fully contained within a
319	   single IB subnet, the scop bits SHALL be set to 2 (link-local).
320	   Otherwise the scope will be set higher.
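
   The layouts in Figures 2 and 3 can be assembled by shifting each
   field into place. The sketch below is one way to do so, using the
   standard ipaddress module only to hold and print the 128-bit value;
   the function and constant names are illustrative.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B


def broadcast_mgid(pkey: int, scope: int = 2, ipv6: bool = False):
    """Build the IPoIB broadcast MGID for a link P_Key (Figures 2 and 3)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    signature = IPV6_SIGNATURE if ipv6 else IPV4_SIGNATURE
    # IPv4: low 32 bits all ones; IPv6: low 80 bits hold the value 1.
    low_bits = 0x1 if ipv6 else 0xFFFFFFFF
    value = ((0xFF << 120) | (0x1 << 116) | (scope << 112)
             | (signature << 96) | (pkey << 80) | low_bits)
    return ipaddress.IPv6Address(value)
```

   With a P_Key of 0x8006 and link-local scope, the IPv4 result is
   FF12:401B:8006::FFFF:FFFF and the IPv6 result is FF12:601B:8006::1.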

322	   The broadcast group for IPv4 will serve to provide a broadcast
323	   service for protocols like ARP to use.

325	   When a node is first brought up on an IPoIB link identified by a
326	   P_Key, it must look for the right broadcast group to join. This is
327	   done by querying the SA MCMemberRecord database for a multicast group
328	   with a MGID matching the one constructed from the link P_Key and the
329	   IPoIB signature. The node SHOULD always look for a MGID of a link-
330	   local scope first before attempting one with a greater scope.

332	   Once the right MGID and broadcast group are identified, the local
333	   node SHOULD use the MTU associated with the broadcast group.  In case
334	   the MTU of the broadcast group is greater than what the local HCA can
335	   support, the node cannot join the IPoIB link and operate as an IP
336	   node. Otherwise the local node must join the broadcast group as a
337	   "full member" and use the rest of the link attributes associated with
338	   the group for all future communication to the link.
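
   The lookup order and the MTU check described above can be sketched
   as follows. The SA MCMemberRecord database is modelled here as a
   plain dictionary keyed by MGID, which is purely a stand-in for a
   real SA query; the record layout, the function names, and the
   choice to try scope values 2 through 0xE in ascending order are
   assumptions made for illustration.

```python
import ipaddress

LINK_LOCAL_SCOPE = 2


def broadcast_mgid(pkey, scope, ipv6):
    """Broadcast MGID built from the P_Key and the IPoIB signature."""
    signature = 0x601B if ipv6 else 0x401B
    low_bits = 0x1 if ipv6 else 0xFFFFFFFF
    value = ((0xFF << 120) | (0x1 << 116) | (scope << 112)
             | (signature << 96) | (pkey << 80) | low_bits)
    return ipaddress.IPv6Address(value)


def find_broadcast_group(sa_records, pkey, hca_mtu, ipv6=False):
    """Return (mgid, record) for the group to join, or None if absent.

    sa_records is a hypothetical dict mapping MGID to a dict of group
    attributes such as {"mtu": 2048}.
    """
    # Try link-local scope first, then progressively greater scopes.
    for scope in range(LINK_LOCAL_SCOPE, 0xF):
        mgid = broadcast_mgid(pkey, scope, ipv6)
        record = sa_records.get(mgid)
        if record is None:
            continue
        if record["mtu"] > hca_mtu:
            # The node cannot operate as an IP node on this link.
            raise RuntimeError("broadcast group MTU exceeds local HCA MTU")
        return mgid, record
    return None
```

   A caller that gets a record back would then issue the "full member"
   join and adopt the remaining attributes from that record.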

340	   In addition to the special all-node multicast group for broadcast
341	   purpose, an all-router multicast group may be created at link
342	   configuration time if an IP router will be attached to the link. This
343	   is to facilitate IP multicast operations described later. An IB
344	   multicast group for the all-router MGID must cover every IB subnet
345	   that the IPoIB link encompasses.  The format of the all-router MGID
346	   will be covered in the next section.

348	8.0 Mapping for other Multicast Groups

350	   The general IP multicast [IPMULT] support over IB is similar to the
351	   case of the special broadcast group discussed above. An algorithmic
352	   mapping is used so that given an IP multicast address, an individual
353	   host can compute the corresponding IB multicast address (MGID) all by
354	   itself without having to consult an external entity. This also
355	   removes the need for an externally maintained IP to IB multicast
356	   mapping table.

358	   The IPoIB multicast mapping is depicted in Figure 4. The same mapping
359	   function is used for both IPv4 and IPv6 except for the IPoIB signature
360	   field.

362	   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits       |
363	   +--------+----+----+-----------------+---------+--------------------+
364	   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID      |
365	   +--------+----+----+-----------------+---------+--------------------+

367	                                 Figure 4

369	   Since a MGID allocated for transporting IP multicast datagrams is
370	   considered only a transient link-layer multicast address, all IB
371	   MGIDs allocated for IPoIB purpose SHOULD have T = 1. The scope bits
372	   SHALL be the same as that of the all-node MGID for the same IPoIB
373	   link.

375	   An IP multicast address is used together with a given IPoIB link
376	   P_Key to form the MGID of the IB multicast group. For IPv6, the lower
377	   80 bits of the group ID are used directly in the lower 80 bits of the
378	   MGID. For IPv4, the group ID is only 28 bits long and the rest of the
379	   80 bits are filled with 0.

381	   The rest of the bits are the same as those of the broadcast MGID.
382	   For example, on an IPoIB link that is fully contained within a single
383	   IB subnet with a P_Key of 0x8006, the MGIDs for the all-router
384	   multicast group with group ID 2 [AARCH, IGMP2] are:

386	   FF12:401B:8006:0:0:0:0:2

388	   or, in a compressed format,

390	   FF12:401B:8006::2

392	   for IPv4, and

394	   FF12:601B:8006:0:0:0:0:2

396	   or, in a compressed format,

398	   FF12:601B:8006::2

400	   for IPv6.

402	   A special case exists for the IPv4 limited broadcast address
403	   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
404	   broadcast MGID for IPv4 networks as described in section 7 above.
405	   Also the IPv6 all-node multicast address "FF0X::1" [AARCH] maps
406	   naturally to the special broadcast MGID for IPv6 networks.
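
   The mapping of this section, including the two special cases, can
   be sketched as a single function. This is an illustration rather
   than a reference implementation; ip_multicast_to_mgid is a
   hypothetical name, and the scope argument is assumed to be supplied
   by the caller from the all-node MGID of the link.

```python
import ipaddress

IPV4_LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def ip_multicast_to_mgid(group: str, pkey: int, scope: int = 2):
    """Map an IP multicast (or IPv4 limited broadcast) address to an MGID."""
    addr = ipaddress.ip_address(group)
    if addr.version == 4:
        signature = 0x401B
        if addr == IPV4_LIMITED_BROADCAST:
            group_id = 0xFFFFFFFF               # broadcast MGID of Figure 2
        elif addr.is_multicast:
            group_id = int(addr) & 0x0FFFFFFF   # 28-bit IPv4 group ID
        else:
            raise ValueError("not an IPv4 multicast or broadcast address")
    else:
        if not addr.is_multicast:
            raise ValueError("not an IPv6 multicast address")
        signature = 0x601B
        group_id = int(addr) & ((1 << 80) - 1)  # lower 80 bits
    value = ((0xFF << 120) | (0x1 << 116) | (scope << 112)
             | (signature << 96) | (pkey << 80) | group_id)
    return ipaddress.IPv6Address(value)
```

   With a P_Key of 0x8006, the all-router addresses 224.0.0.2 and
   FF02::2 map to FF12:401B:8006::2 and FF12:601B:8006::2 respectively,
   matching the example above.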

408	9.0 Sending and Receiving IP Multicast Packets

410	   Multicast in InfiniBand differs in a number of ways from multicast in
411	   Ethernet. This adds some complexity to an IPoIB implementation when
412	   supporting IP multicast over IB.

414	   A) An IB multicast group must be explicitly created through the SA
415	   before it can be used.

417	   This implies that in order to send a packet destined for an IP
418	   multicast address, the IPoIB implementation must check with the SA on
419	   the outbound link first for a "MCMemberRecord" that matches the MGID.
420	   If one does exist, the MLID associated with the multicast group is
421	   used as the DLID for the packet. Otherwise, it implies no member
422	   exists on the local link.  If the scope of the IP multicast group is
423	   beyond link-local, the packet must be sent to the on-link routers
424	   through the use of the all-router multicast group or the broadcast
425	   group. This is to allow local routers to forward the packet to
426	   multicast listeners on remote networks.  The all-router multicast
427	   group is preferred over the broadcast group for better efficiency. If
428	   the all-router multicast group does not exist, the sender can assume
429	   that there are no routers on the local link; hence the packet can be
430	   safely dropped.

432	   B) A multicast sender must join the target multicast group as a
433	   "SendOnlyNonMember" before outgoing multicast messages from it can be
434	   successfully routed. The "SendOnlyNonMember" join is different from
435	   the regular "FullMember" join in two aspects. First, both types of
436	   joins enable multicast packets to be routed FROM the local port, but
437	   only the "FullMember" join causes multicast packets to be routed TO
438	   the port.  Second, the sender port of a "SendOnlyNonMember" join will
439	   not be counted as a member of the multicast group for purposes of
440	   group creation and deletion.

442	   The following pseudocode demonstrates the steps in a typical
443	   implementation when processing an egress multicast packet.

445	   if the egress port is already a "SendOnlyNonMember", or a
446	   "FullMember"
447	           => send the packet

449	   else if the target multicast group exists
450	           => do "SendOnlyNonMember" join
451	           => send the packet

453	   else if scope > link-local AND the all-router multicast group exists
454	           => send the packet to all routers
455	   else
456	           => drop the packet
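
   The same decision sequence can be written as a small runnable
   function. This is only a sketch: the set of joined groups and the
   set of existing groups stand in for state that a real
   implementation would obtain from the SA, and all names are
   hypothetical.

```python
from enum import Enum

LINK_LOCAL_SCOPE = 2


class Egress(Enum):
    SEND = "send"
    JOIN_AND_SEND = "join as SendOnlyNonMember, then send"
    SEND_TO_ROUTERS = "send to the all-router group"
    DROP = "drop"


def egress_decision(mgid, ip_scope, joined, existing, all_router_mgid):
    """Decide how to handle one egress multicast packet.

    joined:   set of MGIDs the egress port has already joined
    existing: set of MGIDs for which an IB multicast group exists
    ip_scope: scope of the destination IP multicast address
    """
    if mgid in joined:
        return Egress.SEND
    if mgid in existing:
        joined.add(mgid)  # stands in for the SendOnlyNonMember join
        return Egress.JOIN_AND_SEND
    if ip_scope > LINK_LOCAL_SCOPE and all_router_mgid in existing:
        return Egress.SEND_TO_ROUTERS
    return Egress.DROP
```

   Calling the function twice for the same existing group returns
   JOIN_AND_SEND the first time and SEND thereafter, since the join is
   recorded in the joined set.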

458	   Implementations should cache the information about the existence of
459	   an IB multicast group, its MLID and other attributes. This is to
460	   avoid expensive SA calls on every outgoing multicast packet. Senders
461	   MUST subscribe to the multicast group create and delete traps in
462	   order to monitor the status of specific IB multicast groups. E.g.,
463	   multicast packets directed to the all-router multicast group due to a
464	   lack of listeners on the local subnet must be forwarded to the right
465	   multicast group if the group is created later.  This happens when a
466	   listener shows up on the local subnet.

468	   A node joining an IP multicast group must first construct a MGID
469	   according to the rule described in section 8 above. Once the correct
470	   MGID is calculated, the node must call the SA of the outbound link to
471	   attempt a "FullMember" join of the IB multicast group corresponding
472	   to the MGID.  If the IB multicast group doesn't already exist, one
473	   must be created first with the IPoIB link MTU. For the rest of the
474	   attributes, the same values from the all-node multicast/broadcast
475	   group SHOULD be used.

477	   The join request will cause the local port to be added to the
478	   multicast group. It also enables the SM to program IB switches and
479	   routers with the new multicast information to ensure the correct
480	   forwarding of multicast packets for the group.

482	   When a node leaves an IP multicast group, it SHOULD make a
483	   "FullMember" leave request to the SA. This gives SM an opportunity to
484	   update relevant forwarding information, to delete an IB multicast
485	   group if the local port is the last FullMember to leave, and free up
486	   the MLID allocated for it. The specific algorithm is implementation-
487	   dependent, and is out of the scope of this document.

489	   Note that for an IPoIB link that spans more than one IB subnet
490	   connected by IB routers, an adequate multicast forwarding support at
491	   the IB level is required for multicast packets to reach listeners on
492	   a remote IB subnet. The specific mechanism for this will be covered
493	   in [IBTA], and is beyond the scope of IPoIB.

495	10.0 IP Multicast Routing

497	   IP multicast routing requires multicast routers to receive a copy of
498	   every link multicast packet on a locally connected link [IPMULT,
499	   IP6MLD].  For Ethernet this is usually achieved by turning on the
500	   promiscuous multicast mode on a locally connected Ethernet interface.

502	   IBA does not provide any hardware support for promiscuous multicast
503	   mode.  Fortunately a promiscuous multicast mode can be emulated in
504	   the software running on a router through the following steps.

506	   A) Obtain a list of all active IB multicast groups from the local SA.

508	   B) Make a "NonMember" join request to the SA for every group that has
509	   a signature in its MGID matching the one for either IPv4 or IPv6.

511	   C) Subscribe to the IB multicast group creation events using a
512	   wildcarded MGID so that the router can "NonMember" join all IB
513	   multicast groups created subsequently for IPv4 or IPv6.

515	   The "NonMember" join has the same effect as a "FullMember" join
516	   except that the former will not be counted as a member of the
517	   multicast group for purposes of group creation or deletion. That is,
518	   when the last "FullMember" leaves a multicast group, the group can be
519	   safely deleted by the SA without regard to any "NonMember" routers.
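
   Step B depends on recognizing IPoIB groups by the signature field
   of the MGID. A minimal sketch of that filter follows, assuming the
   router has already obtained the list of active MGIDs in IPv6
   textual notation; the function names are illustrative, and the
   "NonMember" join itself is left to the SA interface in use.

```python
import ipaddress

IPOIB_SIGNATURES = {0x401B, 0x601B}  # IPv4 over IB, IPv6 over IB


def mgid_signature(mgid: str) -> int:
    """Extract the 16-bit signature that follows the flags and scope."""
    return (int(ipaddress.IPv6Address(mgid)) >> 96) & 0xFFFF


def groups_to_join(active_mgids):
    """Return the MGIDs a router should "NonMember" join."""
    return [m for m in active_mgids if mgid_signature(m) in IPOIB_SIGNATURES]
```

   The same filter can be applied to each group reported by a creation
   event in step C.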

521	11.0 New Types of Vulnerability in IB Multicast

523	   Many IB multicast functions are subject to failures due to a number
524	   of possible resource constraints. These include the creation of IB
525	   multicast groups, the join calls ("SendOnlyNonMember", "FullMember",
526	   and "NonMember"), and the attaching of a QP to a multicast group.

528	   In general, the occurrence of these failure conditions is highly
529	   implementation dependent, and is believed to be rare. Usually a
530	   failed multicast operation at the IB level can be propagated back to
531	   the IP level, causing the original operation to fail, and the
532	   initiator of the operation to be notified. But some IB multicast
533	   functions are not tied to any foreground operation, making their
534	   failures hard to detect. E.g., if an IP multicast router attempts to
535	   "NonMember" join a newly created multicast group in the local subnet,
536	   but the join call fails, packet forwarding for that particular
537	   multicast group will likely fail silently, that is, without the
538	   attention of local multicast senders. This type of problem can add
539	   more vulnerability to the already unreliable IP multicast operations.

541	   Implementations should log error messages upon any failure from an IB
542	   multicast operation. Network administrators should be aware of this
543	   vulnerability, and reserve enough multicast resources at the points
544	   where IP multicast will be used heavily. E.g., HCAs with ample
545	   multicast resources should be used at any IP multicast router.
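
   The logging recommendation above might look like the following
   sketch for operations with no foreground caller, such as a router's
   background "NonMember" join. McastResourceError and checked_mcast_op
   are hypothetical names introduced only for illustration; a real
   implementation would map its own IB-level failure codes instead.

```python
import logging

log = logging.getLogger("ipoib.mcast")


class McastResourceError(Exception):
    """Hypothetical error for an IB multicast operation that ran out of resources."""


def checked_mcast_op(name, mgid, operation):
    """Run a background IB multicast operation and log any failure.

    Returns True on success and False on failure, so that the caller can
    decide whether to retry or to raise an alarm.
    """
    try:
        operation()
        return True
    except McastResourceError as exc:
        log.error("IB multicast %s failed for MGID %s: %s",
                  name, mgid.hex(), exc)
        return False
```

   Because nothing at the IP level is waiting on such an operation, the
   log entry may be the only trace that forwarding for the group is
   broken.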

547	12.0 Security Considerations

549	   All the operations for creating and configuring an IPoIB link
550	   described in this document, including assigning P_Keys to CAs,
551	   creating IB multicast groups in the SA, creating and attaching QPs
552	   to IB multicast groups, etc., are privileged operations, and MUST be
553	   protected by the underlying operating system. This is to prevent
554	   malicious, non-privileged software from hijacking important resources
555	   and configurations.  For example, a bogus IPoIB broadcast group may
556	   prevent a proper one from being created when the network
557	   administrator tries to set up a link.

559	   Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent
560	   non-privileged software from fabricating IP datagrams to send, as
561	   mentioned in section 6.2.
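
   As a small illustration, a configuration check for the Q_Key could be
   sketched as below. It assumes the IBA convention that a Q_Key with
   its most significant bit set is a controlled Q_Key; the function name
   is hypothetical.

```python
CONTROLLED_QKEY_BIT = 0x80000000  # assumed: high-order bit marks a controlled Q_Key


def is_controlled_qkey(qkey: int) -> bool:
    """True if the 32-bit Q_Key has its most significant bit set."""
    if not 0 <= qkey <= 0xFFFFFFFF:
        raise ValueError("a Q_Key is a 32-bit value")
    return bool(qkey & CONTROLLED_QKEY_BIT)
```

   An administrative tool could use such a check to warn when an IPoIB
   link is being configured with a Q_Key that is not controlled.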

563	13.0 Acknowledgments

565	   The authors would like to thank Bruce Beukema, David Brean, Dan
566	   Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
567	   Erik Nordmark, Greg Pfister, Renato Recio, Kanoj Sarcar, Satya
568	   Sharma, and David L. Stevens for their suggestions and many
569	   clarifications on the IBA specification.

571	14.0 References

573	   [AARCH]   Hinden, R. and S. Deering, "IP Version 6 Addressing
574	             Architecture", RFC 2373, July 1998.

576	   [DHCP]    Droms, R., "Dynamic Host Configuration Protocol", RFC 2131,
577	             March 1997.

579	   [DISC]    Narten, T., Nordmark, E. and W. Simpson, "Neighbor
580	             Discovery for IP Version 6 (IPv6)", RFC 2461, December
581	             1998.

583	   [HOSTS]   Braden, R., "Requirements for Internet Hosts --
584	             Communication Layers", RFC 1122, October 1989.

586	   [IBTA]    InfiniBand Trade Association, "InfiniBand Architecture
587	             Specification", Release 1.0.a, www.infinibandta.org.

589	   [IGMP2]   Fenner, W., "Internet Group Management Protocol, Version 2",
590	             RFC 2236, November 1997.

592	   [IPMULT]  Deering, S., "Host Extensions for IP Multicasting", RFC
593	             1112, August 1989.

595	   [IPoIB_ARCH]  draft-ietf-ipoib-architecture-01.txt

597	   [IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-01.txt

599	   [IPV6]    Deering, S. and R. Hinden, "Internet Protocol, Version 6
600	             (IPv6) Specification", RFC 2460, December 1998.

602	   [IP6MLD]  Deering, S., Fenner, W. and B. Haberman, "Multicast
603	             Listener Discovery (MLD) for IPv6", RFC 2710, October 1999.

605	   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
606	             Requirement Levels", BCP 14, RFC 2119, March 1997.

608	15.0 Authors' Addresses

610	   H.K. Jerry Chu
611	   17 Network Circle, UMPK17-201
612	   Menlo Park, CA 94025
613	   USA

615	   Phone: +1 650 786-5146
616	   EMail: jerry.chu@sun.com

618	   Vivek Kashyap
619	   IBM
620	   15450, SW Koll Parkway
621	   Beaverton, OR 97006

623	   Phone: 503 578 3422
624	   EMail: vivk@us.ibm.com

626	16.0 Full Copyright Statement

628	   Copyright (C) The Internet Society (2003).  All Rights Reserved.

630	   This document and translations of it may be copied and furnished to
631	   others, and derivative works that comment on or otherwise explain it
632	   or assist in its implementation may be prepared, copied, published
633	   and distributed, in whole or in part, without restriction of any
634	   kind, provided that the above copyright notice and this paragraph are
635	   included on all such copies and derivative works.  However, this
636	   document itself may not be modified in any way, such as by removing
637	   the copyright notice or references to the Internet Society or other
638	   Internet organizations, except as needed for the purpose of
639	   developing Internet standards in which case the procedures for
640	   copyrights defined in the Internet Standards process must be
641	   followed, or as required to translate it into languages other than
642	   English.

644	   The limited permissions granted above are perpetual and will not be
645	   revoked by the Internet Society or its successors or assigns.

647	   This document and the information contained herein is provided on an
648	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
649	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
650	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
651	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
652	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.