INTERNET DRAFT                                             Vivek Kashyap
                                                                     IBM
Expiration Date: June 15, 2002                         December 15, 2001

IP over InfiniBand(IPoIB) Architecture

Status of this memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as Reference material or to cite them other than as ``work in progress''.

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand is a high speed, channel based interconnect between systems and devices. This document presents an overview of the InfiniBand architecture. It further describes the requirements and guidelines for the transmission of IP over InfiniBand. Discussions in this document are applicable to both IPv4 and IPv6 unless explicitly specified. The encapsulation of IP over InfiniBand and the mechanism for IP address resolution on IB fabrics will be described in separate documents.

Kashyap [Page 1] INTERNET-DRAFT IPoIB architecture December 15, 2001

Table of Contents

   1.0 Introduction to InfiniBand
      1.1 InfiniBand Architecture Specification
      1.2 Overview of InfiniBand Architecture
         1.2.1 InfiniBand Addresses
            1.2.1.1 Unicast GIDs
            1.2.1.2 Multicast GIDs
         1.2.2 InfiniBand Multicast Groups
   2.0 Management of InfiniBand subnet
   3.0 IP over IB requirements
      3.1 InfiniBand as link layer
      3.2 Multicast support
         3.2.1 Mapping IP multicast to IB multicast
         3.2.2 Transient flag in IB MGIDs
      3.3 IP subnets across IB subnets ?
      3.4 Multicast address to LID mapping
   4.0 IP subnets in InfiniBand fabrics
      4.1 IPoIB VLANs
      4.2 Multicast in IPoIB subnets
         4.2.1 Sending IP multicast datagrams
         4.2.2 Receiving multicast packets
            4.2.2.1 Impact of InfiniBand Architecture Limits
         4.2.3 Leaving/Deleting a multicast group
   5.0 QoS and related issues
   6.0 Security Considerations
   7.0 Acknowledgement
   8.0 References
   9.0 Author's address

1.0 Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop an I/O specification to deliver a channel based, switched fabric technology. The InfiniBand standard is aimed at meeting the requirements of scalability, reliability, availability and performance of servers in data centers.

1.1 InfiniBand Architecture Specification

The InfiniBand Trade Association specification is available for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

For a more complete overview the reader is referred to chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network (SAN) for connecting multiple independent processor platforms, I/O platforms and I/O devices. The IBA SAN is a communications and management infrastructure supporting both I/O and inter-processor communications for one or more computer systems. An IBA SAN consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and IB routers (connecting IB subnets).
I/O units can range in complexity from a single ASIC IBA attached device such as a LAN adapter to a large memory rich RAID subsystem.

An IBA network may be subdivided into subnets interconnected by routers. These are IB routers and IB subnets and not IP routers or IP subnets. This document will refer to InfiniBand routers and subnets as 'IB routers' and 'IB subnets' respectively. The IP routers and IP subnets will be referred to as 'routers' and 'subnets' respectively.

Each IB node or switch may attach to one or more switches, and nodes may also attach directly to each other. Each IB unit interfaces with the link by way of channel adapters (CAs). The architecture supports multiple CAs per unit with each CA providing one or more ports that connect to the fabric. Each CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent. However, each of the ports may include multiple QPs (queue pairs) that may be directly addressed from a remote peer. From the point of view of data transfer the QP number (QPN) is part of the address.

IBA supports both connection oriented and datagram service between the ports. The peers are identified by QPN and the port identifier. There are two exceptions: QPNs are not used when packets are multicast, and QPNs are not used in the raw datagram mode.

A port, in a data packet, is identified by a local ID (LID) and optionally a Global ID (GID). The GID in the packet is needed only when communicating across IB subnets though it may always be included. The GID is 128 bits long and is formed by the concatenation of a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant portion (GUID). The LID is a 16 bit value that is assigned when the port becomes active.

Note that the GUID is the only persistent identifier of a port. However, it cannot be used as an address in a packet. If the prefix is modified then the GID may change.
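
The GID construction described above (a 64 bit subnet prefix concatenated with a 64 bit GUID) can be illustrated with a minimal sketch. The helper name make_gid, the example GUID and the second prefix are hypothetical values chosen for illustration; they are not taken from the IB specification.

```python
import ipaddress

def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit IB subnet prefix and a 64-bit GUID into a 128-bit GID."""
    if not 0 <= subnet_prefix < 1 << 64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid < 1 << 64:
        raise ValueError("GUID must fit in 64 bits")
    # A GID has the same 128-bit layout as an IPv6 address, so the
    # IPv6Address type is used here purely as a convenient container.
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)

# Hypothetical values, for illustration only.
EXAMPLE_GUID = 0x0002_C903_0000_1234
gid_before = make_gid(0xFE80_0000_0000_0000, EXAMPLE_GUID)
gid_after = make_gid(0xFEC0_0000_0000_0001, EXAMPLE_GUID)

print(gid_before)               # fe80::2:c903:0:1234
print(gid_before != gid_after)  # True: same GUID, new prefix, different GID
```

The low 64 bits (the GUID) are identical in both results, which is why the GUID is the persistent identifier while the GID is not.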
The subnet manager may attempt to keep the LID values constant across reboots but that is not a requirement. The assignment of the GID and the LID is done by the subnet manager.

Every IB subnet has at least one subnet manager (SM) component that controls the fabric. It assigns the LIDs and GIDs. The subnet manager also programs the switches so that they route packets between destinations. The subnet manager and a related component, the subnet administrator (SA), are the central repository of all information that is required to set up and bring up the fabric.

IB routers are components that route packets between IB subnets based on the GIDs. Thus within an IB subnet a packet may or may not include a GID but when going across IB subnets the GID must be included. A LID is always needed in a packet since the destination within a subnet is determined by it.

A CA and a switch may have multiple ports. Each CA port is assigned its own LID or a range of LIDs. The ports of a switch are not addressable by LIDs/GIDs or, in other words, are transparent to other end nodes. Each port has its own set of buffers. The buffering is channeled through virtual lanes (VLs) where each VL has its own flow control. There may be up to 16 VLs.

VLs provide a mechanism for creating multiple virtual links within a single physical link. All ports must support VL15 which is reserved exclusively for subnet management datagrams and hence does not concern the IPoIB discussions. The actual VL that a packet uses is configured by the SM in the switch/channel adapter tables and is determined based on the Service Level (SL) specified in every packet. There are 16 possible SLs.

In addition to the features described above, viz. Queue Pairs (QPs), Service Levels (SLs) and addressing (GID/LID), IBA also defines the following:

Partitioning: Every packet, but for the raw datagrams, carries the partition key (P_Key).
These values are used for isolation in the fabric. A switch (this is an optional feature) may be programmed by the SM to drop packets not having a certain key. The CA ports always check for the P_Keys. A CA port may belong to multiple partitions. P_Key checking is optional at IB routers.

Q_Keys: These are used to enforce access rights for reliable and unreliable IB datagram services. Raw datagram services don't use Q_Keys. At communication establishment the endpoints exchange the Q_Keys and must always use the relevant Q_Keys when communicating with one another. Multicast packets use the Q_Key associated with the multicast group.

Multicast support: A switch may support multicasting, i.e. replication of packets across multiple output ports. This is an optional feature. Similarly, support for sending/receiving multicast packets is optional in CAs. A multicast group is identified by a GID. The GID format is as defined in [RFC2373] on IPv6 addressing. Thus from the point of view of IPv6 over InfiniBand the data link multicast address looks like the network address. To receive multicast packets an IB node must explicitly join the multicast group by sending a request to the SM. A node may send packets to any multicast group. In both cases the multicast LID to be used in the packets is received from the SM.

There are 6 methods for data transfer in IB architecture. These are:

1. Unreliable Datagram (unacknowledged - connectionless)

The UD service is connectionless and unacknowledged. It allows the QP to communicate with any unreliable datagram QP on any node.

The switches and hence each link can support only a certain MTU. The permitted MTU values are 256 bytes, 512 bytes, 1024 bytes, 2048 bytes and 4096 bytes. A UD packet cannot be larger than the smallest link MTU between the two peers.

2. 
Reliable Datagram (acknowledged - multiplexed)

The RD service is multiplexed over connections between nodes called End to End Contexts (EECs), which allows each RD QP to communicate with any RD QP on any node with an established EEC. Multiple QPs can use the same EEC and a single QP can use multiple EECs (one for each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

The RC service associates a local QP with one and only one remote QP. Messages may be as large as 2^31 bytes in length. The CA implementation takes care of segmentation and reassembly.

4. Unreliable Connected (unacknowledged - connection oriented)

The UC service associates one local QP with one and only one remote QP. There is no acknowledgment and hence no resend of lost or corrupted packets. Such packets are therefore simply dropped. It is similar to RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

The Ethertype raw datagram packet contains a generic transport header that is not interpreted by the CA but it specifies the protocol type. The ethertype values are the same as those defined in RFC1700.

6. Raw IPv6 (unacknowledged - connectionless)

Using IPv6 raw datagram service, the IBA CA can support standard protocol layers atop IPv6 (such as TCP/UDP). Thus native IPv6 packets can be bridged into the IBA SAN and delivered directly to a port and to its IPv6 raw datagram QP.

The first 4 types are referred to as IB transports. The latter two are classified as Raw datagrams. There is no indication of the QP number in the raw datagram packets. The raw datagram packets are limited by the link MTU in size.

The two connected modes and the reliable datagram mode may also support 'Automatic Path Migration (APM)'. This is an optional facility that provides for a hardware based path failover.
An alternate path is associated with the QP when the connection/EE context is first created. If unrecoverable errors are encountered the connection switches to using the alternate path.

1.2.1 InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6 architecture in terms of the InfiniBand subnet structure and global identifiers (GIDs).

The InfiniBand architecture defines the global identifier associated with a port as follows:

   GID (Global Identifier): A 128-bit unicast or multicast identifier used to identify a port on a channel adapter, a port on a router, a switch, or a multicast group. A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional properties/restrictions defined within IBA to facilitate efficient discovery, communication, and routing.

   Note: These rules apply only to IBA operation and do not apply to raw IPv6 operation unless specifically called out.

The raw IPv6 operation referred to in the note in the definition above is the IPv6 mode of InfiniBand's raw datagram service. It does not mean IPv6 itself. The routers and switches referred to in the above definition are the InfiniBand routers and switches.

The InfiniBand (IB) specification defines two types of GIDs: unicast and multicast.

1.2.1.1 Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes. The IB specification states:

a. link local: This is defined to be FE80/10. The IB routers will not forward packets with a link local address in source or destination beyond the IB subnet.

b. site local: FEC0/10
   A unicast GID used within a collection of subnets which is unique within that collection (e.g. a data center or campus) but is not necessarily globally unique. IB routers must not forward any packets with either a site-local Source GID or a site-local Destination GID outside of the site.

c. global: A unicast GID with a global prefix, i.e. 
an IB router may use this GID to route packets throughout an enterprise or internet.

1.2.1.2 Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses. The IB specification defines the multicast GIDs as follows:

   FFxy:<112 bits>

Flag bits:

The nibble denoted by x above holds the 4 flag bits: 000T. The first three bits are reserved and are set to zero. The last bit is defined as follows:

   T=0: denotes a permanently assigned, i.e. well known, GID
   T=1: denotes a transient group

Scope bits:

The 4 bits denoted by y in the GID above are the scope bits. These scope values are described in Table 1.

   scope value     Address value
   -----------     ------------------
   0               Reserved
   1               Unassigned
   2               Link-local
   3               Unassigned
   4               Unassigned
   5               Site-local
   6               Unassigned
   7               Unassigned
   8               Organization-local
   9               Unassigned
   0xA             Unassigned
   0xB             Unassigned
   0xC             Unassigned
   0xD             Unassigned
   0xE             Global
   0xF             Reserved

                Table 1

The IB specification further refers to [RFC_2373] and [RFC_2375] while defining the well known multicast addresses. However, it then states that the well known addresses apply to IB raw IPv6 datagrams only. It must be noted though that a multicast group can be associated with only a single MGID. Thus the same MGID cannot be associated with the UD mode and the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

IB multicast groups (multicast GIDs) are managed by the subnet manager (SM). The SM explicitly programs the IB switches in the fabric to ensure that the packets are received by all the members of the multicast group.

A multicast group is created by sending a create request to the SM. The subnet manager records the group's multicast GID and the associated characteristics. The group characteristics are defined by the group path MTU, whether the group will be used for raw datagrams or unreliable datagrams, the service level, the partition key associated with the group, the LID (local identifier) associated with the group etc.
These characteristics are defined at the time of the group creation. The interested reader may look up the 'MCGroupRecord' attribute in the IB architecture specification [IB_ARCH].

The LID is associated with the multicast group by the subnet manager (SM) at the time of the multicast group creation. The SM determines the multicast tree based on all the group members and programs the relevant switches. The multicast LID is used by the switches to route the packets.

Any member IB node wanting to participate in the multicast group must join the group. As part of the join operation the node is returned the group characteristics. At the same time the subnet manager ensures that the requester can indeed participate in the group by verifying that it can support the group MTU and that it can reach the rest of the group members. Other group characteristics may need verification too.

The SM, for groups that span IB subnet boundaries, must interact with IB routers to determine the presence of this group in other IB subnets. If present the MTU must match across the IB subnets. P_Key is another characteristic that must match across IB subnets since the P_Key inserted into a packet is not modified by the IB switches or IB routers. Thus if the P_Keys did not match, either the IB router(s) or the destinations on other subnets might drop the packets.

These characteristics are returned to the IB endnode that joins the multicast group. A join operation may cause the SM to reprogram the fabric so that the new member can participate in the multicast group.

2.0 Management of InfiniBand subnet

To aid in the monitoring and configuration of InfiniBand subnet components a set of MIBs needs to be defined. MIBs are needed for the channel adapters, InfiniBand interfaces, InfiniBand subnet manager, InfiniBand subnet management agents and to allow the management of specific device properties.
It must be noted that the management objects addressed in the IPoIB documents are for all of the IB subnet components and are not limited to IP (over IB). The relevant MIBs will be described in separate documents.

3.0 IP over IB requirements

As described in section 1.0, the InfiniBand architecture provides a broad set of capabilities to choose from when implementing IP over InfiniBand networks. The IPoIB specification MUST NOT require changes in IP and higher layer protocols. Nor should it mandate requirements on IP stacks to implement special user level programs. It is an aim that the IPoIB changes be amenable to modularisation and incorporation into existing implementations at the same level as other media types.

3.1 InfiniBand as link layer

InfiniBand architecture provides multiple methods of data exchange between two endpoints as was noted above. These are:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram : Raw IPv6 (R6)
                : Raw Ethertype (RE)

IPoIB can be implemented over any, multiple or all of these services. A case can be made for support on any of the transport methods depending on the desired features.

The IB specification requires Unreliable Datagram mode to be supported by all the IB nodes. The host channel adapters (HCAs) are specifically required to support Reliable Connected (RC) and Unreliable Connected (UC) modes but the same is not the case with target channel adapters (TCAs). Support for the two Raw Datagram modes is entirely optional. The Raw Datagram mode supports a 16-bit CRC as against the better protection provided by the use of a 32-bit CRC in other modes.

For the sake of simplicity, ease of implementation and integration with existing stacks, it is desirable that the fabric support multicasting. This is possible only in Unreliable Datagram (UD) and IB's Raw Datagram modes.
Thus it is only the UD mode that is universal, supports multicast, and provides a robust CRC. Given these conditions it is a MUST that an IP stack support IP over the UD transport mode of InfiniBand.

Unreliable datagrams, however, are limited by the link MTU. The connected modes, in contrast, can offer significant benefit in terms of performance by utilising a larger MTU. Reliability is also enhanced if the underlying feature of automatic path migration of connected modes is utilised.

An implementation MAY choose to provide IP over non-UD transport modes in addition to the mandatory IP over UD function.

InfiniBand communication is addressed to a QP at a port. Therefore the IPoIB interface is identified by the port identifier as well as a QP that is associated with the interface. The address resolution process for IPoIB MUST also determine the associated QPN along with determining the port identifier.

An interface MAY be associated with multiple QPNs. This provides a mode of implementation wherein a single IP address is associated with different QPNs. Such an association may be used to demultiplex the incoming packets based on the QPN, avoiding or reducing the upper-layer port based lookup. An implementation may choose to support such a function.

The methods of implementation of the above modes of IP over InfiniBand will be investigated and described in other documents.

3.2 Multicast support

The InfiniBand specification makes support of multicasting in the switches optional. It is RECOMMENDED that multicast capable switches be used in IPoIB subnets.

Lack of multicast capable switches however doesn't mean that multicasting cannot be supported. In such a case the underlying IB layer MUST emulate multicast while ensuring that it is transparent to the IP stack. The translation from IP addresses to IB MGIDs must be independent of the IB fabric's multicast capability.
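
As a recap of the transport selection argument in section 3.1, combined with the multicast requirement above, the properties of each mode can be written down as a small table and filtered. This is only a sketch: the Transport record and its field names are made up for illustration, and the entry for RD is an assumption, since this document does not state whether RD support is required.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transport:
    name: str
    required_on_all_nodes: bool  # mandated for every IB node
    multicast: bool              # mode can carry multicast packets
    crc_bits: int                # width of the strongest CRC on the packet

TRANSPORTS = [
    Transport("RC", False, False, 32),  # required on HCAs but not on TCAs
    Transport("UC", False, False, 32),  # required on HCAs but not on TCAs
    Transport("RD", False, False, 32),  # ASSUMPTION: not universally required
    Transport("UD", True, True, 32),
    Transport("Raw IPv6", False, True, 16),       # optional, 16-bit CRC only
    Transport("Raw Ethertype", False, True, 16),  # optional, 16-bit CRC only
]

def mandatory_candidates(transports):
    """Modes that are universal, multicast capable and protected by a 32-bit CRC."""
    return [
        t.name
        for t in transports
        if t.required_on_all_nodes and t.multicast and t.crc_bits == 32
    ]

print(mandatory_candidates(TRANSPORTS))  # ['UD']
```

Only UD passes all three filters, which matches the conclusion that IP over UD is the mandatory mode.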

3.2.1 Mapping IP multicast to IB multicast

Well known IP multicast groups are defined for both IPv4 and IPv6 (RFC_1700, RFC_2373). Multicast groups may also be dynamically created at any time. To avoid creating unnecessary duplicates of multicast packets in the fabric, and to avoid unnecessary handling of such packets at the hosts, each of the IP multicast groups needs to be associated with a different IB multicast group.

A process MUST be defined for mapping the IP multicast addresses to unique IB multicast addresses. Every IPoIB node MUST be capable of making this mapping decision independently.

3.2.2 Transient flag in IB MGIDs

The IB specification describes the flag bits as discussed in section 1.2.1.2. The IB specification also defines some well known IB multicast GIDs (MGIDs). These MGIDs are reserved for IB's Raw datagram mode, which is incompatible with the other transports of IB. Any mapping that is defined from IP multicast addresses therefore MUST NOT fall into IB's definition of a well-known address. Therefore all IPoIB related multicast GIDs will always set the transient bit.

3.3 IP subnets across IB subnets ?

Some implementations may desire to support multiple clusters of machines in their own IB subnets but otherwise part of a common IP subnet. For such a solution the IB specification needs multiple upgrades. Some of the required enhancements are:

1) A method for creating IB multicast GIDs that span multiple IB subnets. The partition keys and other parameters need to be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology across IB subnets.

3) The process and protocols needed between IB nodes and IB routers.

Until the above conditions are met it is not possible to implement IPoIB subnets that span IB subnets. The IPoIB standards can however be defined with this possibility in mind.
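
Returning briefly to the transient flag rule of section 3.2.2: whether a given MGID is transient can be checked mechanically from the FFxy layout and the scope values of Table 1. The following is a minimal sketch; the function names are hypothetical, and the example MGIDs are arbitrary values that do not come from any IPoIB mapping.

```python
import ipaddress

# Assigned scope values from Table 1; every other value is reserved or unassigned.
SCOPES = {0x2: "link-local", 0x5: "site-local",
          0x8: "organization-local", 0xE: "global"}

def parse_mgid(mgid: str) -> dict:
    """Split the second byte of an FFxy:... multicast GID into flags and scope."""
    raw = ipaddress.IPv6Address(mgid).packed
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID")
    flags = raw[1] >> 4    # the nibble x: 000T
    scope = raw[1] & 0x0F  # the nibble y
    return {
        "transient": bool(flags & 0x1),
        "scope": SCOPES.get(scope, "reserved or unassigned"),
    }

def acceptable_for_ipoib(mgid: str) -> bool:
    """IPoIB MGIDs must not be well known, so the T bit has to be set."""
    return parse_mgid(mgid)["transient"]

print(parse_mgid("ff12::1"))            # {'transient': True, 'scope': 'link-local'}
print(acceptable_for_ipoib("ff02::1"))  # False
```

The second example has T=0 and therefore falls in the well known range that section 3.2.2 rules out.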

3.4 Multicast address to LID mapping

In a generic LAN setup the IP multicast addresses are directly mapped to a link layer multicast address. In the case of InfiniBand this is only partly true. A mapping of multicast IP addresses to IB MGIDs can be standardised. But the IPoIB driver on the host must determine the LID that needs to be used when sending to the particular multicast group.

A mapping from the IP multicast address or the corresponding IB multicast group to a LID is not required for the following reasons:

1) Sending/receiving IP multicast

An IB node cannot be assured of its packets reaching all the multicast members without itself joining the IB multicast group. This is because the relevant switches are programmed by the IB subnet manager only on receiving a join request. Thus the sender/receiver will always have to join the IB multicast groups and keep track of the groups it has already joined. Mapping directly to the LID doesn't help if the group has not been joined.

Thus the implementation is required to keep track of the IB groups joined. It can therefore also record the corresponding LID, removing the need to map the IP multicast address to the LID.

2) Reduction of LID conflicts

The LIDs in the range 0xC000 to 0xFFFE are designated as the multicast LIDs by IBA. This limits the range to 2^14 - 1 entries (16383 entries). Since IPv4 has 2^28 multicast addresses, this implies that about 2^14 (16K) IPv4 multicast groups could map to a single LID. It is better to let the SM decide on a more efficient usage of the multicast LID space.

3) SM and IB architecture should stay unaffected

A mapping of the LIDs can conflict with the subnet manager (SM) implementations. The SM is under no restrictions to choose a particular LID for any multicast group. Thus it could end up utilising, for some other multicast group, a LID that an IP multicast address maps to, since not everything on IB subnets is governed by the IPoIB rules.

4) No need to plan for LID conflicts

Allowing the SM to decide on the LIDs also avoids having to come up with a solution to handle LID conflicts with other multicast groups.

Thus it is best to avoid such a mapping and leave it to the individual implementations to determine the LID from the SM. There is no extra work involved in this determination since the SM has to be contacted anyway for the IB multicast group join/create operations.

IPoIB will not standardise a mapping from IP multicast addresses to LIDs.

4.0 IP subnets in InfiniBand fabrics

The IPoIB subnet is overlaid over the IB subnet. The IPoIB subnet is brought up in the following steps:

Note: the join/leave operations at the IP level will be referred to as IP_join/IP_leave and the join/leave operations at the IB level will be referred to as IB_join/IB_leave in this document.

1. The all-IP nodes group is created

The fabric administrator creates the IB multicast group corresponding to the all-IP nodes/IPv4 broadcast (henceforth called 'broadcast group') when the IPv6/IPv4 subnet is set up. The method by which the broadcast group is set up is not defined by IPoIB.

2. All IPoIB interfaces IB_join the broadcast group

The administrator chooses the parameters that are valid for the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass and the MTU. All multicast packets in the IP subnet must use these values. Therefore any other multicast groups set up in the IPoIB subnet MUST be set up with these attributes. In the future, as the IB specification associates more meaning with the various values and defines IB QoS, different values for IP multicast traffic may be possible.

The IB_join of the broadcast group by the IPoIB nodes builds the IPoIB subnet. The broadcast group defines the span and the members of the IPoIB subnet. The IB_join to the broadcast group has the additional benefit of distributing these values to all the members of the subnet.

The IP interface MTU for the IP over Unreliable Datagram interface is the path MTU value returned when the broadcast MGID is joined. This is the largest MTU that can be used across the IPoIB subnet without fragmenting. The IPoIB specification for IP over non-UD modes of transmission MUST also define the MTU that can be used with it. The IP over non-UD implementation may require other parameters to be determined and exchanged in addition to the MTU.

4.1 IPoIB VLANs

The endpoints in an IB subnet must have compatible P_Keys to communicate with one another. Thus the administrator when setting up an IP subnet over an IB subnet must ensure that all the members have compatible P_Keys. An IP subnet can have only one P_Key associated with it to ensure that all IP nodes in it can talk to one another. An endpoint may however have multiple P_Keys.

The IB architecture specifies that there can be only one MGID associated with a multicast group in the IB subnet. The P_Key can be included in the MGID mappings from the IP multicast addresses. Since the P_Key is unique in the IB subnet the inclusion of the P_Key in the IB MGIDs ensures unique MGID mappings are created. Every unique broadcast group MGID so formed creates a separate abstract IPoIB link and hence an IPoIB VLAN.

It is an implementation choice on how the P_Key related to the IPoIB subnet is determined by the IP stack. It could be a configuration parameter initialised by some means by the administrator. The method employed by an implementation to determine the P_Key is beyond the scope of IPoIB.

4.2 Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts and rules as on any other media. However, unlike most other media multicast over InfiniBand requires interaction with another entity, the IB subnet manager. This section describes the outline of the process and suggests some guidelines.

IB architecture specifies the following format for IB multicast packets when used over unreliable datagram (UD) mode:

   +--------+-------+---------+---------+-------+---------+---------+
   |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
   |Routing |Routing|Transport|Extended |Payload| CRC     | CRC     |
   |Header  |Header |Header   |Transport| (IP)  |         |         |
   |        |       |         |Header   |       |         |         |
   +--------+-------+---------+---------+-------+---------+---------+

For details about the various headers please refer to InfiniBand Architecture Specification [IB_ARCH].

The Global routing header (GRH) includes the IB multicast group GID. The Local routing header (LRH) includes the local identifier (LID). The IB switches in the fabric route the packet based on the LID.

The GID is made available to the receiving IB user (the IPoIB interface driver for example). The driver can therefore determine the IB group the packet belongs to.

IPv4 defines three levels of multicast compliance. These are:

   Level 0: No support for IP multicasting
   Level 1: Support for sending but not receiving multicasts
   Level 2: Full support for IP multicasting

In IPv6 there is no such distinction. Full multicast support is mandatory.

Additionally, all IPv4 subnets support broadcast (255.255.255.255). IPv4 broadcast can always be sent/received by all IPv4 interfaces.

Every IPoIB subnet requires the broadcast GID to be defined. Thus a packet can always be broadcast.

4.2.1 Sending IP multicast datagrams

An IP host may send a multicast packet at any time to any multicast address. The IP layer conveys the multicast packet to the IPoIB interface driver/module. This module attempts to IB_join the relevant IB multicast group. This is required since otherwise InfiniBand architecture does not guarantee that the packet will reach its destinations.

The subnet manager builds a logical tree across the participating switches/IB routers to ensure that the multicast packet is received by all the members of the multicast group. The IB_join operation causes the SM to rebuild/modify this routing tree to include the new endnode. It may have to (re)program some of the switches and IB routers to reflect the new topology. Therefore if the IB_join is not done there is a possibility that the fabric will fail to deliver the packet to some or all the recipients.

If the multicast group does not exist the IB_join will fail. This can imply that there are no listeners on the subnet and the router doesn't expect to forward packets received on this group. However, this may not be the case. The IB group may not exist because the SM ran out of resources or the SM policy allows only a limited set of multicast groups to be created. Additionally it is not reasonable to expect the router to create IB groups for all the IP multicast addresses that it may be called upon to forward. It must be noted that unlike many other media IBA does not have a promiscuous mode in which the router can accept all the packets.

Therefore, the multicast module of the IPoIB interface, when sending a multicast packet, needs to do one of the following:

1) Join the IB multicast group corresponding to the IP multicast address.

This is the RECOMMENDED option for multicast if the sender is itself a member of the IP multicast group. As noted earlier, a particular IB multicast group may not exist for some reason. In such a case the implementation MUST fall back to one of the following methods.

2) Send the multicast packet out with the IB MGID/MLID associated with the all-systems IP multicast address (224.0.0.1/FF02::1).

An IPv4 implementation that fails to join the IB group corresponding to the IPv4 multicast address being sent to, as in 1) above, must fall back to this option or to the option given below.
3) In IPv4 subnets, if both of the above methods fail, the packet MUST be sent with the IB MGID/MLID corresponding to the IPv4 limited broadcast address (255.255.255.255).

4.2.2 Receiving multicast packets

The IP host must create the IB multicast group corresponding to the IP address and then join it. This follows from the IBA requirement that the receiver must join the relevant IB multicast group. A router could create the group on receiving the IGMP/MLD report, but then the IP host would have to be informed of the creation. Therefore, it is simpler for the IB interface module on the IP host to first create the IB group and then send the IGMP/MLD message to the router. The router in turn needs to IB_join the specified IB group on receiving the IGMP/MLD report. This report must be sent out on the broadcast-MGID to ensure reception by the router(s).

The router MAY choose to create IB groups corresponding to the IP groups it expects to forward.

Thus the creation of IB groups is done by IP receivers or IP routers only and not by senders, thereby keeping things simple. The host must first try to join the group and only on failure attempt to create it.

4.2.2.1 Impact of InfiniBand Architecture Limits

It must be noted that if the group exists or the creation succeeds the group will be IB_joined. However, in case the join doesn't succeed for some reason, the node can still transmit to the multicast group using the broadcast/all-IP nodes MGID since that is mandatory.

It may be that the IB MGID could not be created/joined because of a transient error or a policy limit/resource constraint at the SM. It may also be created at a later point in time. The receiver therefore would not be in the IB MGID corresponding to the IP address. Unfortunately there is no IB level support to let the listener know of the new IB MGID being created.
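The send-side fallback order of section 4.2.1 and the receive-side join-then-create order of section 4.2.2 can be illustrated with a minimal sketch. Everything named here is hypothetical: FakeSubnetManager, ib_join, ib_create, the max_groups limit, and the string MGIDs are stand-ins invented for illustration and are not InfiniBand or IPoIB interfaces. The sketch assumes the IPv4 case, in which the broadcast group is always available.

```python
class FakeSubnetManager:
    """Hypothetical in-memory stand-in for the SM; not a real IB API."""

    def __init__(self, max_groups):
        self.max_groups = max_groups  # models an SM policy/resource limit
        self.groups = {}              # MGID -> set of member names

    def ib_join(self, mgid, member):
        """Join an existing group; fails if the group does not exist."""
        if mgid not in self.groups:
            return False
        self.groups[mgid].add(member)
        return True

    def ib_create(self, mgid):
        """Create a group; fails once the assumed SM limit is reached."""
        if mgid in self.groups:
            return True
        if len(self.groups) >= self.max_groups:
            return False
        self.groups[mgid] = set()
        return True


def receiver_join(sm, member, mgid):
    """Receiver order: try to join first, create only on failure."""
    if sm.ib_join(mgid, member):
        return True
    if sm.ib_create(mgid):
        return sm.ib_join(mgid, member)
    return False  # caller may retry later if the failure was transient


def choose_ipv4_send_mgid(sm, member, group_mgid, all_systems_mgid,
                          broadcast_mgid):
    """Sender order: specific group, then all-systems, then broadcast."""
    for mgid in (group_mgid, all_systems_mgid):
        if sm.ib_join(mgid, member):
            return mgid
    return broadcast_mgid  # the broadcast group is mandatory on the subnet
```

Note that the sender never calls ib_create in this sketch, which mirrors the rule that only receivers and routers create IB groups.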
If the underlying IB level indicates a transient failure the listener could periodically retry to join the IB group. The exact parameters and timers for such retries or an alternate solution are beyond the scope of IPoIB. These parameters, if needed, should be derived from the IB specification.

Note that multicasting can still continue since the packets can be sent out on the broadcast MGID (and MLID). A multicast listener won't receive any packets on this multicast address if other nodes could join the group but it couldn't. It must be realised that such a situation is not very likely.

An HCA or TCA may have a limit on the number of MGIDs it can support. Thus, even though the groups may not be limited at the subnet manager and in the subnet as such, they may be limited at a particular interface. It is advisable to choose an adequately provisioned xCA when setting up an IPoIB subnet.

4.2.3 Leaving/Deleting a multicast group

An IPv4 sender (level 1 compliance) IB_joins the IB multicast group only because that is the only way to guarantee reception of the packets by all the group recipients. The sender must, however, IB_leave the group at some time. It is advisable that a sender, when not a receiver on the group, start a timer per multicast group sent to. The sender leaves the IB group when the timer goes off. It restarts the timer if another message is sent.

This recommendation doesn't apply to the IB broadcast group. It also doesn't apply to the IB group corresponding to the all-hosts multicast group. An IPv4 host must always remain a member of the broadcast group. It MAY choose to remain a member of the all-hosts group.

Thus a sender that chooses to always send to the broadcast group and not to the specific multicast group does not need to implement a timer.

An IP multicast receiver MUST IB_leave the corresponding IB multicast group when it IP_leaves the IP multicast group.
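The per-group sender timer recommended in section 4.2.3 can be sketched as follows. This is only an illustration under stated assumptions: the class name SenderGroupTimers, the idle_timeout value, and the string MGIDs are hypothetical, and this document does not specify a timeout. Time is passed in explicitly so the logic stays independent of any particular clock.

```python
class SenderGroupTimers:
    """Tracks, per IB group, when a pure sender should IB_leave."""

    def __init__(self, idle_timeout, exempt_mgids):
        self.idle_timeout = idle_timeout  # assumed value; not specified here
        # Groups the timer never applies to (broadcast, and optionally
        # the all-hosts group).
        self.exempt = set(exempt_mgids)
        self.deadline = {}                # MGID -> time at which to leave

    def on_send(self, mgid, now):
        """Start the timer on first send, restart it on later sends."""
        if mgid not in self.exempt:
            self.deadline[mgid] = now + self.idle_timeout

    def expire(self, now):
        """Return the groups whose timers went off; caller IB_leaves them."""
        due = sorted(m for m, t in self.deadline.items() if t <= now)
        for mgid in due:
            del self.deadline[mgid]
        return due
```

A driver would call on_send for every multicast transmission and call expire periodically, issuing an IB_leave for each MGID returned.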
In the case of an IPv4 implementation the receiver may choose to continue to be a sender (level 1 compliance). It MAY choose to not IB_leave the IB group but start a timer as explained above.

A router is RECOMMENDED to IB_leave the IB multicast group when there are no members of the IP multicast address in the subnet and it has no explicit knowledge of any need to forward such packets.

The router and the IP hosts SHOULD NOT IB_delete the IB multicast group when they IB_leave the group. It is possible for the same IB multicast group to be used by a non-IP protocol. The IB specification mentions an IB specific protocol that will delete the IB groups when it determines that there are no IB members of the group.

5.0 QoS and related issues

The IB specification suggests the use of service levels (SLs) for load balancing, QoS and deadlock avoidance within an IB subnet. But the IB specification leaves the usage and mode of determination of the SL for the application to decide. The SL and list of SLs are available in the SA but it is up to the endnode's application to choose the 'right' value.

Every IPoIB implementation will determine the relevant SL value based on its own policy. No method or process for choosing the SL will be defined by the IPoIB standards.

6.0 Security Considerations

Any multicast/broadcast communication is inherently insecure since anyone can receive the data. The applications must implement appropriate authentication/encryption methods for data security.

The IP subnet communication can be disrupted by creating the IB broadcast/multicast groups with incompatible parameters. The implementations must leverage IB specific methods to protect against such situations.

7.0 Acknowledgement

This document has benefited from the comments and suggestions of the members of the IPoIB working group and the members of the InfiniBand(SM) Trade Association.
8.0 References

[IB_ARCH]  InfiniBand Architecture Specification, Volume 1.0
[RFC_2373] IP Version 6 Addressing Architecture
[RFC_2375] IPv6 Multicast Address Assignments
[RFC_1700] Assigned Numbers
[RFC_1112] Host extensions for IP multicasting
[RFC_2236] Internet Group Management Protocol, Version 2
[RFC_2710] Multicast Listener Discovery

9.0 Author's Address

Vivek Kashyap
IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.