INTERNET DRAFT                                                   Vivek Kashyap
<draft-ietf-ipoib-architecture-01.txt>                                     IBM
Expiration Date: June 15, 2002                               December 15, 2001


                        IP over InfiniBand(IPoIB) Architecture

Status of this memo

        This document is an Internet-Draft and is in full conformance
        with all provisions of Section 10 of RFC 2026.

        Internet-Drafts are working documents of the Internet
        Engineering Task Force (IETF), its areas, and its working
        groups. Note that other groups may also distribute working
        documents as Internet-Drafts.

        Internet-Drafts are draft documents valid for a maximum of six
        months and may be updated, replaced, or obsoleted by other
        documents at any time. It is inappropriate to use
        Internet-Drafts as reference material or to cite them other
        than as ``work in progress''.

        The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt

        The list of Internet-Draft Shadow Directories can be accessed
        at http://www.ietf.org/shadow.html

        This memo provides information for the Internet community.
        This memo does not specify an Internet standard of any kind.
        Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

        InfiniBand is a high speed, channel based interconnect between
        systems and devices.

        This document presents an overview of the InfiniBand
        architecture. It further describes the requirements and
        guidelines for the transmission of IP over InfiniBand.
        Discussions in this document are applicable to both IPv4 and
        IPv6 unless explicitly specified. The encapsulation of IP over


Kashyap                                                         [Page 1]


INTERNET-DRAFT             IPoIB architecture          December 15, 2001


        InfiniBand and the mechanism for IP address resolution on IB
        fabrics will be described in separate documents.

Table of Contents

        1.0     Introduction to InfiniBand
        1.1     InfiniBand Architecture Specification
        1.2     Overview of InfiniBand Architecture
        1.2.1   InfiniBand Addresses
        1.2.1.1 Unicast GIDs
        1.2.1.2 Multicast GIDs
        1.2.2   InfiniBand Multicast Groups
        2.0     Management of InfiniBand subnet
        3.0     IP over IB requirements
        3.1     InfiniBand as link layer
        3.2     Multicast support
        3.2.1   Mapping IP multicast to IB multicast
        3.2.2   Transient flag in IB MGIDs
        3.3     IP subnets across IB subnets ?
        3.4     Multicast address to LID mapping
        4.0     IP subnets in InfiniBand fabrics
        4.1     IPoIB VLANs
        4.2     Multicast in IPoIB subnets
        4.2.1   Sending IP multicast datagrams
        4.2.2   Receiving multicast packets
        4.2.2.1 Impact of InfiniBand Architecture Limits
        4.2.3   Leaving/Deleting a multicast group
        5.0     QoS and related issues
        6.0     Security Considerations
        7.0     Acknowledgement
        8.0     References
        9.0     Author's address

1.0 Introduction to InfiniBand

        The InfiniBand Trade Association(IBTA) was formed to develop
        an I/O specification to deliver a channel based, switched
        fabric technology. The InfiniBand standard is aimed at meeting
        the requirements of scalability, reliability, availability and
        performance of servers in data centers.

1.1 InfiniBand Architecture Specification

        The InfiniBand Trade Association specification is available
        for download from http://www.infinibandta.org.


1.2 Overview of InfiniBand Architecture

        For a more complete overview the reader is referred to
        chapter 3 of the InfiniBand specification.

        InfiniBand Architecture (IBA) defines a System Area Network
        (SAN) for connecting multiple independent processor platforms,
        I/O platforms and I/O devices. The IBA SAN is a communications
        and management infrastructure supporting both I/O and
        inter-processor communications for one or more computer
        systems.

        An IBA SAN consists of processor nodes and I/O units connected
        through an IBA fabric made up of cascaded switches and IB
        routers (connecting IB subnets). I/O units can range in
        complexity from single ASIC IBA attached devices such as a LAN
        adapter to a large memory rich RAID subsystem.

        An IBA network may be subdivided into subnets interconnected
        by routers. These are IB routers and IB subnets and not IP
        routers or IP subnets. This document will refer to InfiniBand
        routers and subnets as 'IB routers' and 'IB subnets'
        respectively. The IP routers and IP subnets will be referred
        to as 'routers' and 'subnets' respectively.

        Each IB node or switch may attach to one or more switches, or
        directly to another node or switch. Each IB unit interfaces
        with the link by way of channel adapters (CAs). The
        architecture supports multiple CAs per unit with each CA
        providing one or more ports that connect to the fabric. Each
        CA appears as a node to the fabric.

        The ports are the endpoints to which the data is sent.
        However, each of the ports may include multiple QPs (queue
        pairs) that may be directly addressed from a remote peer. From
        the point of view of data transfer the QP number (QPN) is part
        of the address.

        IBA supports both connection oriented and datagram service
        between the ports. The peers are identified by QPN and the
        port identifier. There are two exceptions. QPNs are not used
        when packets are multicast. QPNs are also not used in the raw
        datagram mode.

        A port, in a data packet, is identified by a local ID (LID)
        and optionally a Global ID (GID). The GID in the packet is
        needed only when communicating across an IB subnet though it
        may always be included.

        The GID is 128 bits long and is formed by the concatenation of
        a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant
        portion (GUID). The LID is a 16 bit value that is assigned
        when the port becomes active. Note that the GUID is the only
        persistent identifier of a port. However, it cannot be used as
        an address in a packet. If the prefix is modified then the GID
        may change. The subnet manager may attempt to keep the LID
        values constant across reboots but that is not a requirement.
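
        The GID layout described above can be sketched as follows.
        This is a minimal illustration only: the function names and
        the example prefix and GUID values are assumptions made for
        this sketch and are not taken from the IB specification.

```python
from typing import Tuple


def make_gid(subnet_prefix: int, guid: int) -> bytes:
    """Concatenate a 64-bit IB subnet prefix and a 64-bit GUID into a GID."""
    for name, value in (("subnet_prefix", subnet_prefix), ("guid", guid)):
        if not 0 <= value < 2**64:
            raise ValueError(f"{name} must fit in 64 bits")
    # The prefix occupies the upper 64 bits, the GUID the lower 64 bits.
    return ((subnet_prefix << 64) | guid).to_bytes(16, "big")


def split_gid(gid: bytes) -> Tuple[int, int]:
    """Return (subnet_prefix, guid) for a 128-bit GID."""
    if len(gid) != 16:
        raise ValueError("a GID is exactly 128 bits (16 bytes)")
    return int.from_bytes(gid[:8], "big"), int.from_bytes(gid[8:], "big")


# Hypothetical values, for illustration only.
EXAMPLE_PREFIX = 0xFE80_0000_0000_0000
EXAMPLE_GUID = 0x0002_C903_0000_1234

gid = make_gid(EXAMPLE_PREFIX, EXAMPLE_GUID)
```

        Changing the prefix changes the GID while the GUID portion
        stays the same, which is why only the GUID is persistent.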

        Every IB subnet has at least one subnet manager (SM) component
        that controls the fabric. The SM assigns the LIDs and GIDs and
        programs the switches so that they route packets between
        destinations. The SM and a related component, the subnet
        administrator (SA), are the central repository of all
        information that is required to set up and bring up the
        fabric.

        IB routers are components that route packets between IB
        subnets based on the GIDs. Thus within an IB subnet a packet
        may or may not include a GID but when going across an IB
        subnet the GID must be included. A LID is always needed in a
        packet since the destination within a subnet is determined by
        it.

        A CA and a switch may have multiple ports. Each CA port is
        assigned its own LID or a range of LIDs. The ports of a switch
        are not addressable by LIDs/GIDs or in other words, are
        transparent to other end nodes. Each port has its own set of
        buffers. The buffering is channeled through virtual lanes(VL)
        where each VL has its own flow control. There may be up to 16
        VLs.

        VLs provide a mechanism for creating multiple virtual links
        within a single physical link. All ports must support VL15
        which is reserved exclusively for subnet management datagrams
        and hence doesn't concern the IPoIB discussions. The actual VL
        that a packet uses is configured by the SM in the
        switch/channel adapter tables and is determined based on the
        Service Level (SL) specified in every packet. There are 16
        possible SLs.
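
        The SL to VL selection can be pictured as a table lookup. The
        sketch below is a simplification under stated assumptions: a
        single flat table per port, a hypothetical helper name
        (select_vl) and an invented example table. The real tables
        are programmed by the subnet manager.

```python
MANAGEMENT_VL = 15  # reserved for subnet management datagrams
NUM_SLS = 16        # there are 16 possible Service Levels


def select_vl(sl_to_vl, sl):
    """Return the data VL configured for the given Service Level."""
    if not 0 <= sl < NUM_SLS:
        raise ValueError("SL must be in the range 0..15")
    vl = sl_to_vl[sl]
    # Data traffic must never be placed on the management lane.
    if not 0 <= vl < MANAGEMENT_VL:
        raise ValueError("data packets may only use VL0..VL14")
    return vl


# Hypothetical table for a port that implements four data VLs.
example_table = [sl % 4 for sl in range(NUM_SLS)]
```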

        In addition to the features described above viz. Queue
        Pairs(QPs), Service Levels(SLs) and addressing(GID/LID), IBA
        also defines the following:

        Partitioning:

                Every packet, except for raw datagrams, carries a
                partition key (P_Key). These values are used for
                isolation in the fabric. A switch (this is an optional
                feature) may be programmed by the SM to drop packets
                not having a certain key. The CA ports always check
                for the P_Keys. A CA port may belong to multiple
                partitions. P_Key checking is optional at IB routers.

        Q_Keys:

                These are used to enforce access rights for reliable
                and unreliable IB datagram services. Raw datagram
                services don't use Q_Keys. At communication
                establishment the endpoints exchange the Q_Keys and
                must always use the relevant Q_Keys when communicating
                with one another. Multicast packets use the Q_Key
                associated with the multicast group.

        Multicast support:

                A switch may support multicasting i.e. replication of
                packets across multiple output ports. This is an
                optional feature. Similarly, support for
                sending/receiving multicast packets is optional in
                CAs. A multicast group is identified by a GID. The GID
                format is as defined in [RFC2373] on IPv6 addressing.
                Thus from an IPv6 over InfiniBand's point of view the
                data link multicast address looks like the network
                address. An IB node must explicitly join a multicast
                group by sending a request to the SM to receive
                multicast packets. A node may send packets to any
                multicast group. In both cases the multicast LID to be
                used in the packets is received from the SM.

        The IB architecture defines 6 methods for data transfer.
        These are:

        1. Unreliable Datagram (unacknowledged - connectionless)

                The UD service is connectionless and unacknowledged.
                It allows the QP to communicate with any unreliable
                datagram QP on any node.


                The switches and hence each link can support only a
                certain MTU. The permitted MTU values are 256, 512,
                1024, 2048 and 4096 bytes. A UD packet cannot be
                larger than the smallest link MTU between the two
                peers.

        2. Reliable Datagram    (acknowledged - multiplexed)

                The RD service is multiplexed over connections between
                nodes called end-to-end contexts (EECs), which allow
                each RD QP to communicate with any RD QP on any node
                with an established EEC. Multiple QPs can use the same
                EEC and a single QP can use multiple EECs (one for
                each remote node per reliable datagram domain).

        3. Reliable Connected (acknowledged - connection oriented)

                The RC service associates a local QP with one and only
                one remote QP. Messages may be as large as 2^31 bytes
                in length. The CA implementation takes care of
                segmentation and reassembly.

        4. Unreliable Connected (unacknowledged - connection oriented)

                The UC service associates one local QP with one and
                only one remote QP. There is no acknowledgment and
                hence no resend of lost or corrupted packets. Such
                packets are therefore simply dropped. It is similar to
                RC otherwise.

        5. Raw Ethertype (unacknowledged - connectionless)

                The Ethertype raw datagram packet contains a generic
                transport header that is not interpreted by the CA but
                it specifies the protocol type. The values for
                ethertype are the same as defined in RFC1700 for
                ethertype.

        6. Raw IPv6 (unacknowledged - connectionless)

                Using IPv6 raw datagram service, the IBA CA can
                support standard protocol layers atop IPv6 (such as
                TCP/UDP). Thus native IPv6 packets can be bridged into
                the IBA SAN and delivered directly to a port and to
                its IPv6 raw datagram QP.

        The first 4 types are referred to as IB transports. The latter
        two are classified as Raw datagrams. There is no indication of
        the QP number in the raw datagram packets. The raw datagram
        packets are limited by the link MTU in size.
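
        The MTU limit on UD and raw datagram packets amounts to
        taking the minimum over the links on the path. A minimal
        sketch, assuming the per-link MTUs are already known (the
        helper names path_mtu and fits_in_one_packet are invented for
        this illustration):

```python
IB_MTU_VALUES = (256, 512, 1024, 2048, 4096)  # bytes


def path_mtu(link_mtus):
    """Smallest link MTU along the path between two peers."""
    link_mtus = list(link_mtus)
    if not link_mtus:
        raise ValueError("a path has at least one link")
    for mtu in link_mtus:
        if mtu not in IB_MTU_VALUES:
            raise ValueError(f"{mtu} is not a valid IB MTU")
    return min(link_mtus)


def fits_in_one_packet(payload_len, link_mtus):
    """True if the payload can travel as a single UD or raw datagram."""
    return payload_len <= path_mtu(link_mtus)
```

        For example, a path whose links support 4096, 2048 and 1024
        bytes can carry datagram payloads of at most 1024 bytes.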

        The two connected modes and the reliable datagram mode may
        also support 'Automatic Path Migration(APM)'. This is an
        optional facility that provides for a hardware based path
        failover. An alternate path is associated with the QP when the
        connection/EE context is first created. If unrecoverable
        errors are encountered the connection switches to using the
        alternate path.

1.2.1 InfiniBand Addresses

        The InfiniBand architecture borrows heavily from the IPv6
        architecture in terms of the InfiniBand subnet structure and
        global identifiers (GIDs).

        The InfiniBand architecture defines the global identifier
        associated with a port as follows:

                GID (Global Identifier): A 128-bit unicast or
                multicast identifier used to identify a port on a
                channel adapter, a port on a router, a switch, or a
                multicast group. A GID is a valid 128-bit IPv6
                address(per RFC 2373) with additional
                properties/restrictions defined within IBA to
                facilitate efficient discovery, communication, and
                routing.

                Note: These rules apply only to IBA operation and do
                not apply to raw IPv6 operation unless specifically
                called out.

        The raw IPv6 operation referred to in the note in the
        definition above is the IPv6 mode of InfiniBand's raw datagram
        service. It does not mean IPv6 itself. The routers and
        switches referred to in the above definition are the
        InfiniBand routers and switches.

        The InfiniBand(IB) specification defines two types of GIDs:
        unicast and multicast.

1.2.1.1 Unicast GIDs

        The unicast GIDs are defined, as in IPv6, with three scopes.
        The IB specification states:

                a. link local: This is defined to be FE80/10.

                               The IB routers will not forward packets
                               with a link local address in source or
                               destination beyond the IB subnet.

                b. site local: FEC0/10

                               A unicast GID used within a collection
                               of subnets which is unique within that
                               collection (e.g. a data center or
                               campus) but is not necessarily globally
                               unique. IB routers must not forward any
                               packets with either a site-local Source
                               GID or a site-local Destination GID
                               outside of the site.

                c. global:     A unicast GID with a global prefix,
                               i.e. an IB router may use this GID to
                               route packets throughout an enterprise
                               or internet.
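
        The three unicast scopes can be told apart by prefix. The
        sketch below uses the Python standard library ipaddress
        module. It assumes that any unicast GID outside FE80/10 and
        FEC0/10 is treated as global, which is a simplification made
        for this example. The function name is hypothetical.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("fe80::/10")
SITE_LOCAL = ipaddress.ip_network("fec0::/10")
MULTICAST = ipaddress.ip_network("ff00::/8")


def unicast_gid_scope(gid: str) -> str:
    """Classify a unicast GID given in IPv6 textual form."""
    addr = ipaddress.IPv6Address(gid)
    if addr in MULTICAST:
        raise ValueError("not a unicast GID")
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in SITE_LOCAL:
        return "site-local"
    # Simplification: everything else is treated as a global prefix.
    return "global"
```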

1.2.1.2  Multicast GIDs

        The multicast GIDs also parallel the IPv6 multicast addresses.
        The IB specification defines the multicast GIDs as follows:

                FFxy:<112 bits>

        Flag bits:

        The nibble denoted by x above holds the 4 flag bits: 000T. The
        first three bits are reserved and are set to zero. The last
        bit is defined as follows:

                T=0: denotes a permanently assigned i.e. well known GID
                T=1: denotes a transient group

        Scope bits:

        The 4 bits, denoted by y in the GID above, are the scope bits.
        These scope values are described in Table 1.


                Scope value             Address scope

                    0                        Reserved
                    1                        Unassigned
                    2                        Link-local
                    3                        Unassigned
                    4                        Unassigned
                    5                        Site-local
                    6                        Unassigned
                    7                        Unassigned
                    8                        Organization-local
                    9                        Unassigned
                    0xA                      Unassigned
                    0xB                      Unassigned
                    0xC                      Unassigned
                    0xD                      Unassigned
                    0xE                      Global
                    0xF                      Reserved

                                Table 1
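
        The flag and scope fields of a multicast GID can be decoded
        from its second byte. A minimal sketch follows; the function
        name parse_mgid and the returned tuple layout are assumptions
        of this example. The scope names are those of Table 1.

```python
SCOPE_NAMES = {
    0x2: "link-local",
    0x5: "site-local",
    0x8: "organization-local",
    0xE: "global",
}


def parse_mgid(mgid: bytes):
    """Return (transient, scope_name) for a 16-byte multicast GID."""
    if len(mgid) != 16 or mgid[0] != 0xFF:
        raise ValueError("a multicast GID is 16 bytes and starts with 0xFF")
    flags = mgid[1] >> 4      # the nibble 'x' in FFxy
    scope = mgid[1] & 0x0F    # the nibble 'y' in FFxy
    if flags & 0xE:
        raise ValueError("the three reserved flag bits must be zero")
    transient = bool(flags & 0x1)  # T=1 denotes a transient group
    if scope in (0x0, 0xF):
        name = "reserved"
    else:
        name = SCOPE_NAMES.get(scope, "unassigned")
    return transient, name
```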


        The IB specification further refers to [RFC_2373] and
        [RFC_2375] while defining the well known multicast addresses.
        However, it then states that the well known addresses apply to
        IB raw IPv6 datagrams only. It must be noted though that a
        multicast group can be associated with only a single MGID.
        Thus the same MGID cannot be associated with the UD mode and
        the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

        IB multicast groups (multicast GIDs) are managed by the subnet
        manager(SM). The SM explicitly programs the IB switches in the
        fabric to ensure that the packets are received by all the
        members of the multicast group.

        A multicast group is created by sending a create request to
        the SM. The subnet manager records the group's multicast GID
        and the associated characteristics. The group characteristics
        are defined by the group path MTU, whether the group will be
        used for raw datagrams or unreliable datagrams, the service
        level, the partition key associated with the group, the
        LID(local identifier) associated with the group etc. These
        characteristics are defined at the time of the group creation.
        The interested reader may lookup the 'MCGroupRecord' attribute
        in the IB architecture specification[IB_ARCH].

        The LID is associated with the multicast group by the subnet
        manager(SM) at the time of the multicast group creation. The
        SM determines the multicast tree based on all the group
        members and programs the relevant switches. The multicast LID
        is used by the switches to route the packets.

        Any IB node wanting to participate in the multicast group must
        join the group. As part of the join operation the node is
        returned the group characteristics. At the same time the
        subnet manager ensures that the requester can indeed
        participate in the group by verifying that it can support the
        group MTU and that it can reach the rest of the group members.
        Other group characteristics may need verification too.

        The SM, for groups that span IB subnet boundaries, must
        interact with IB routers to determine the presence of this
        group in other IB subnets. If present the MTU must match
        across the IB subnets.

        P_Key is another characteristic that must match across IB
        subnets since the P_Key inserted into a packet is not modified
        by the IB switches or IB routers. Thus if the P_Keys do not
        match, the IB routers themselves or the destinations on other
        subnets might drop the packets.

        These characteristics are returned to the IB endnode that
        joins the multicast group. A join operation may cause the SM
        to reprogram the fabric so that the new member can participate
        in the multicast group.
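
        The join checks described in this section might look roughly
        like the sketch below. It models only the MTU and P_Key
        checks named in the text. The Port and McGroup classes and
        the sm_handle_join function are hypothetical and do not
        correspond to any defined SM interface.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    lid: int
    max_mtu: int
    pkeys: frozenset


@dataclass
class McGroup:
    mgid: bytes
    mlid: int   # multicast LID chosen by the SM at group creation
    mtu: int
    pkey: int
    member_lids: set = field(default_factory=set)


def sm_handle_join(group: McGroup, port: Port) -> dict:
    """Verify the requester, record it, return the group characteristics."""
    if port.max_mtu < group.mtu:
        raise ValueError("requester cannot support the group MTU")
    if group.pkey not in port.pkeys:
        raise ValueError("requester is not in the group's partition")
    group.member_lids.add(port.lid)
    # A real SM may now reprogram switches to extend the multicast tree.
    return {"mlid": group.mlid, "mtu": group.mtu, "pkey": group.pkey}
```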

2.0 Management of InfiniBand subnet

        To aid in the monitoring and configuration of InfiniBand
        subnet components a set of MIBs needs to be defined. MIBs are
        needed for the channel adapters, InfiniBand interfaces,
        InfiniBand subnet manager, InfiniBand subnet management agents
        and to allow the management of specific device properties. It
        must be noted that the management objects addressed in the
        IPoIB documents are for all of the IB subnet components and
        are not limited to IP(over IB). The relevant MIBs will be
        described in separate documents.

3.0 IP over IB requirements

        As described in section 1.0, the InfiniBand architecture
        provides a broad set of capabilities to choose from when
        implementing IP over InfiniBand networks.

        The IPoIB specification MUST NOT require changes in IP and
        higher layer protocols. Nor should it mandate requirements on
        IP stacks to implement special user level programs. It is an
        aim that the IPoIB changes be amenable to modularisation and
        incorporation into existing implementations at the same level
        as other media types.

3.1 InfiniBand as link layer

        InfiniBand architecture provides multiple methods of data
        exchange between two endpoints as was noted above. These are:

                Reliable Connected (RC)
                Reliable Datagram  (RD)
                Unreliable Connected (UC)
                Unreliable Datagram (UD)
                Raw Datagram : Raw IPv6 (R6)
                             : Raw Ethertype (RE)

        IPoIB can be implemented over any, multiple or all of these
        services. A case can be made for support on any of the
        transport methods depending on the desired features.

        The IB specification requires Unreliable Datagram mode to be
        supported by all the IB nodes. The host channel adapters(HCAs)
        are specifically required to support Reliable connected(RC) and
        Unreliable connected(UC) modes but the same is not the case
        with target channel adapters(TCAs). Support for the two Raw
        Datagram modes is entirely optional. The Raw Datagram mode is
        protected only by a 16-bit CRC, as against the better
        protection provided by the 32-bit CRC used in the other modes.

        For the sake of simplicity, ease of implementation and
        integration with existing stacks, it is desirable that the
        fabric support multicasting. This is possible only in
        Unreliable datagram (UD) and IB's Raw datagram modes.

        Thus it is only the UD mode that is universal, supports
        multicast, and provides a robust CRC. Given these conditions
        it is a MUST that an IP stack support IP over the UD transport
        mode of InfiniBand.

        However, unreliable datagrams are limited by the link MTU. The
        connected modes, in contrast to this limitation, can offer
        significant benefit in terms of performance by utilising a
        larger MTU. Reliability is also enhanced if the underlying
        feature of automatic path migration of connected modes is
        utilised. An implementation MAY choose to provide IP over
        non-UD transport modes in addition to the mandatory IP over UD
        function.
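
        The reasoning of this section can be restated as a small
        feature table. The table entries below are taken from the
        text above. The dictionary layout and the function name are
        assumptions of this sketch.

```python
TRANSPORT_FEATURES = {
    # name: is it required on all IB nodes, does it support multicast,
    #       and how wide is the CRC protecting the packet
    "UD":  {"required_on_all_nodes": True,  "multicast": True,  "crc_bits": 32},
    "RC":  {"required_on_all_nodes": False, "multicast": False, "crc_bits": 32},
    "UC":  {"required_on_all_nodes": False, "multicast": False, "crc_bits": 32},
    "RD":  {"required_on_all_nodes": False, "multicast": False, "crc_bits": 32},
    "Raw": {"required_on_all_nodes": False, "multicast": True,  "crc_bits": 16},
}


def mandatory_candidates(features):
    """Transports that are universal, multicast capable and use a 32-bit CRC."""
    return sorted(
        name
        for name, f in features.items()
        if f["required_on_all_nodes"] and f["multicast"] and f["crc_bits"] >= 32
    )
```

        Only UD passes all three tests, which matches the MUST stated
        above. The other modes remain available as optional additions.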

        InfiniBand communication is addressed to a QP at a port.
        Therefore the IPoIB interface is identified by the port
        identifier as well as a QP that is associated with the
        interface. The address resolution process for IPoIB MUST also
        determine the associated QPN along with determining the port
        identifier.
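
        One way to picture the result of address resolution is as a
        (port identifier, QPN) pair. The sketch below is illustrative
        only: the class names are hypothetical, a GID stands in for
        the port identifier, and the 24-bit QPN width is assumed from
        the IB architecture. The actual link-layer address format is
        left to the encapsulation document.

```python
from dataclasses import dataclass

QPN_MAX = 2**24 - 1  # assumption: QPNs are 24-bit values


@dataclass(frozen=True)
class IPoIBLinkAddress:
    gid: bytes  # 128-bit port identifier
    qpn: int    # queue pair associated with the IPoIB interface

    def __post_init__(self):
        if len(self.gid) != 16:
            raise ValueError("a GID is 16 bytes")
        if not 0 <= self.qpn <= QPN_MAX:
            raise ValueError("QPN out of range")


class NeighborTable:
    """Maps an IP address to the link address learned for it."""

    def __init__(self):
        self._entries = {}

    def learn(self, ip: str, addr: IPoIBLinkAddress) -> None:
        self._entries[ip] = addr

    def resolve(self, ip: str) -> IPoIBLinkAddress:
        # Raises KeyError when resolution has not completed yet.
        return self._entries[ip]
```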

        An interface MAY be associated with multiple QPNs. This
        provides a mode of implementation wherein a single IP address
        is associated with different QPNs. Such an association may be
        used to demultiplex the incoming packets based on the QPN
        avoiding or reducing the upper-layer port based lookup. An
        implementation may choose to support such a function.

        The methods of implementation of the above modes of IP over
        InfiniBand will be investigated and described in other
        documents.

3.2 Multicast support

        The InfiniBand specification makes support of multicasting in
        the switches optional. It is RECOMMENDED that multicast capable
        switches be used in IPoIB subnets. Lack of multicast capable
        switches however doesn't mean that multicasting cannot be
        supported. In such a case the underlying IB layer MUST emulate
        multicast while ensuring that it is transparent to the IP
        stack.

        The translation from IP addresses to IB MGIDs must be
        independent of the IB fabric's multicast capability.

3.2.1 Mapping IP multicast to IB multicast

        Well known IP multicast groups are defined for both IPv4 and
        IPv6 (RFC_1700, RFC_2373). Multicast groups may also be
        dynamically created at any time. To avoid creating unnecessary
        duplicates of multicast packets in the fabric, and to avoid
        unnecessary handling of such packets at the hosts, each of
        the IP multicast groups needs to be associated with a
        different IB multicast group.

        A process MUST be defined for mapping the IP multicast
        addresses to unique IB multicast addresses. Every IPoIB node
        MUST be capable of making this mapping decision
        independently.

3.2.2 Transient flag in IB MGIDs

        The IB specification describes the flag bits as discussed in
        section 1.2.1.2. The IB specification also defines some well
        known IB multicast GIDs (MGIDs). These MGIDs are reserved for
        IB's Raw datagram mode, which is incompatible with the other
        transports of IB. Any mapping that is defined from IP
        multicast addresses therefore MUST NOT fall into IB's
        definition of a well-known address.

        Therefore all IPoIB related multicast GIDs will always set the
        transient bit.

3.3 IP subnets across IB subnets ?

        Some implementations may desire to support multiple clusters
        of machines in their own IB subnets but otherwise part of a
        common IP subnet. For such a solution the IB specification
        needs multiple upgrades. Some of the required enhancements
        are:

        1) A method for creating IB multicast GIDs that span multiple
           IB subnets. The partition keys and other parameters need to
           be consistent across IB subnets.

        2) An IB routing protocol to determine the IB topology across
           IB subnets.

        3) A definition of the process and protocols needed between IB
           nodes and IB routers.

        Until the above conditions are met it is not possible to
        implement IPoIB subnets that span IB subnets. The IPoIB
        standards can however be defined with this possibility in
        mind.

3.4 Multicast address to LID mapping

        In a generic LAN setup the IP multicast addresses are directly
        mapped to a link layer multicast address. In the case of
        InfiniBand this is only partly true. A mapping of multicast IP
        to IB MGIDs can be standardised. But the IPoIB driver on the
        host must determine the LID that needs to be used when sending
        to the particular multicast group.


        A mapping from the IP multicast address or the corresponding
        IB multicast group to a LID is not required because of the
        following reasons:

                1) Sending/receiving IP multicast

                   An IB node cannot be assured of its packets
                   reaching all the multicast members without itself
                   joining the IB multicast group. This is because the
                   relevant switches are programmed by the IB subnet
                   manager only on receiving a join request.

                   Thus the sender/receiver will always have to join
                   the IB multicast groups and keep track of the
                   groups it has already joined. Mapping directly to
                   the LID doesn't help if the group has not been
                   joined.

                   Thus the implementation is required to keep track
                   of the IB groups joined. It can therefore also
                   record the corresponding LID removing the need to
                   map the IP multicast address to the LID.

                2) Reduction of LID conflicts

                   The LIDs in the range 0xC000 to 0xFFFE are
                   designated as the multicast LIDs by IBA. This
                   limits the range to 2^14 - 1 entries (16383
                   entries). Since IPv4 has 2^28 multicast group
                   addresses, this implies that about 2^14 or 16K IPv4
                   multicast groups could map to a single LID. It is
                   better to let the SM decide on a more efficient
                   usage of the multicast LID space.

                3) SM and IB architecture should stay unaffected.

                   A mapping of the LIDs can conflict with the subnet
                   manager (SM) implementations. The SM is under no
                   obligation to choose a particular LID for any
                   multicast group. Thus it could assign a LID that
                   maps from an IP multicast address to some other
                   multicast group, since not everything on IB
                   subnets is governed by the IPoIB rules.

                4) No need to plan for LID conflicts

                   Allowing the SM to decide on the LIDs also avoids
                   having to come up with a solution to handle LID
                   conflicts with other multicast groups.
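
        The arithmetic behind reason 2 can be checked with a short
        sketch. The constant and function names are illustrative
        only. The 28-bit case is the size of the IPv4 multicast
        (224.0.0.0/4) address space; the 32-bit case counts every bit
        of an IPv4 address and gives a figure of roughly 2^18.

```python
# Illustrative arithmetic only; names are not taken from any specification.
MLID_FIRST = 0xC000  # first multicast LID designated by IBA
MLID_LAST = 0xFFFE   # last multicast LID designated by IBA


def multicast_lid_count() -> int:
    """Number of LIDs in the inclusive range MLID_FIRST..MLID_LAST."""
    return MLID_LAST - MLID_FIRST + 1


def min_groups_on_busiest_lid(address_bits: int) -> int:
    """Pigeonhole bound: if 2**address_bits groups are spread over the
    multicast LIDs by any fixed mapping, at least this many groups
    share the most heavily loaded LID."""
    groups = 2 ** address_bits
    lids = multicast_lid_count()
    return -(-groups // lids)  # ceiling division


if __name__ == "__main__":
    print("multicast LIDs:", multicast_lid_count())
    print("28-bit space, groups per LID:", min_groups_on_busiest_lid(28))
    print("32-bit space, groups per LID:", min_groups_on_busiest_lid(32))
```

        Either way the number of IP groups sharing a LID is large,
        which is the point of reason 2.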


        Thus it is best to avoid such a mapping and leave it to the
        individual implementations to determine the LID from the SM.
        There is no extra work involved in this determination since
        the SM has to be contacted anyway for the IB multicast group
        join/create operations.

        IPoIB will not standardise IP multicast addresses to LID
        mapping.

4.0 IP subnets in InfiniBand fabrics

        The IPoIB subnet is overlaid over the IB subnet. The IPoIB
        subnet is brought up in the following steps:

        Note: the join/leave operations at the IP level will be
              referred to as IP_join/IP_leave and the join/leave
              operations at the IB level will be referred to as
              IB_join/IB_leave in this document.

        1. The all-IP nodes group is created

        The fabric administrator creates the IB multicast group
        corresponding to the all-IP nodes/IPv4 broadcast (henceforth
        called 'broadcast group') when the IPv6/IPv4 subnet is set up.
        The method by which the broadcast group is set up is not
        defined by IPoIB.

        2. All IPoIB interfaces IB_join the broadcast group

        The administrator chooses the parameters that are valid for
        the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass
        and the MTU. All multicast packets in the IP subnet must use
        these values. Therefore any other multicast groups set up in
        the IPoIB subnet MUST be set up with these attributes. In the
        future, as the IB specification associates more meaning with
        the various values and defines IB QoS, different values for
        IP multicast traffic may be possible.

        The IB_join of the broadcast group by the IPoIB nodes builds
        the IPoIB subnet. The broadcast group defines the span and the
        members of the IPoIB subnet. The IB_join to the broadcast
        group has the additional benefit of distributing these values
        to all the members of the subnet.

        The IP interface MTU for the IP over Unreliable Datagram
        interface is the path MTU value returned when the broadcast
        MGID is joined. This is the largest MTU that can be used
        across the IPoIB subnet without fragmenting. The IPoIB
        specification for IP over non-UD modes of transmission MUST
        also define the MTU that can be used with it. The IP over
        non-UD implementation may require other parameters to be
        determined and exchanged in addition to the MTU.
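
        As a rough sketch of the rules in this section, an
        implementation could record the attributes learned from the
        broadcast group and compare every other group against them.
        The class and field names below are hypothetical and do not
        correspond to any IB verbs or SA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupParams:
    """Attributes the administrator chooses for the broadcast group."""
    p_key: int
    q_key: int
    hop_limit: int
    flow_id: int
    tclass: int
    mtu: int


class IPoIBSubnet:
    """Hypothetical per-interface view of one IPoIB subnet."""

    def __init__(self, broadcast_params: GroupParams) -> None:
        # Values distributed to the node by the IB_join of the
        # broadcast group.
        self.params = broadcast_params

    @property
    def ip_mtu(self) -> int:
        # For IP over UD the interface MTU is the MTU returned when
        # the broadcast MGID is joined.
        return self.params.mtu

    def check_group(self, candidate: GroupParams) -> bool:
        """True if another multicast group uses the subnet's attributes."""
        return candidate == self.params
```

        A group whose attributes differ in any field, for example a
        different MTU, would fail check_group() and therefore not
        belong in this IPoIB subnet.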

4.1 IPoIB VLANs

        The endpoints in an IB subnet must have compatible P_Keys to
        communicate with one another. Thus the administrator when
        setting up an IP subnet over an IB subnet must ensure that all
        the members have compatible P_Keys. An IP subnet can have only
        one P_Key associated with it to ensure that all IP nodes in it
        can talk to one another. An endpoint may however have multiple
        P_Keys.

        The IB architecture specifies that there can be only one MGID
        associated with a multicast group in the IB subnet. The P_Key
        can be included in the MGID mappings from the IP multicast
        addresses. Since the P_Key is unique in the IB subnet the
        inclusion of the P_Key in the IB MGIDs ensures unique MGID
        mappings are created. Every unique broadcast group MGID so
        formed creates a separate abstract IPoIB link and hence an
        IPoIB VLAN.

        How the IP stack determines the P_Key related to the IPoIB
        subnet is an implementation choice. It could be a
        configuration parameter initialised by some means by the
        administrator. The method employed by an implementation to
        determine the P_Key is beyond the scope of IPoIB.
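
        The effect of including the P_Key in the MGID can be
        illustrated as follows. The byte layout used here is made up
        for this example and is NOT the mapping defined by the IPoIB
        encapsulation document; only the idea that the P_Key is one
        of the inputs is taken from the text above.

```python
def make_mgid(p_key: int, group_id: int) -> bytes:
    """Pack a 16-bit P_Key and a 32-bit group id into a 16-byte value.

    The field positions are hypothetical and chosen only to show that
    two different P_Keys can never yield the same MGID.
    """
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= group_id <= 0xFFFFFFFF:
        raise ValueError("group id must fit in 32 bits")
    return (
        b"\xff"                       # multicast marker (illustrative)
        + bytes(3)                    # unused in this sketch
        + p_key.to_bytes(2, "big")    # the P_Key of the IPoIB subnet
        + bytes(6)                    # unused in this sketch
        + group_id.to_bytes(4, "big") # identifies the IP group
    )


if __name__ == "__main__":
    broadcast = 0xFFFFFFFF
    # Same IP group, two P_Keys: two MGIDs, hence two IPoIB VLANs.
    print(make_mgid(0x8001, broadcast).hex())
    print(make_mgid(0x8002, broadcast).hex())
```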

4.2 Multicast in IPoIB subnets

        IP multicast on InfiniBand subnets follows the same concepts
        and rules as on any other media. However, unlike most other
        media, multicast over InfiniBand requires interaction with
        another entity, the IB subnet manager. This section describes
        the outline of the process and suggests some guidelines.

        IB architecture specifies the following format for IB
        multicast packets when used over unreliable datagram (UD)
        mode:

       +--------+-------+---------+---------+-------+---------+---------+
       |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
       |Routing |Routing|Transport|Extended |Payload| CRC     |  CRC    |
       |Header  |Header |Header   |Transport| (IP)  |         |         |
       |        |       |         |Header   |       |         |         |
       +--------+-------+---------+---------+-------+---------+---------+

        For details about the various headers please refer to
        InfiniBand Architecture Specification [IB_ARCH].

        The Global routing header (GRH) includes the IB multicast
        group GID. The Local routing header (LRH) includes the local
        identifier (LID). The IB switches in the fabric route the
        packet based on the LID.

        The GID is made available to the receiving IB user (the IPoIB
        interface driver for example). The driver can therefore
        determine the IB group the packet belongs to.

        IPv4 defines three levels of multicast compliance. These are:

                Level 0: No support for IP multicasting

                Level 1: Support for sending but not receiving multicasts

                Level 2: Full support for IP multicasting

        In IPv6 there is no such distinction. Full multicast support
        is mandatory. Additionally, all IPv4 subnets support
        broadcast (255.255.255.255). IPv4 broadcast can always be
        sent/received by all IPv4 interfaces.

        Every IPoIB subnet requires the broadcast GID to be defined.
        Thus a packet can always be broadcast.

4.2.1 Sending IP multicast datagrams

        An IP host may send a multicast packet at any time to any
        multicast address.

        The IP layer conveys the multicast packet to the IPoIB
        interface driver/module. This module attempts to IB_join the
        relevant IB multicast group. This is required since otherwise
        InfiniBand architecture does not guarantee that the packet
        will reach its destinations.

        The subnet manager builds a logical tree across the
        participating switches/IB routers to ensure that the multicast
        packet is received by all the members of the multicast group.
        The IB_join operation causes the SM to rebuild/modify this
        routing tree to include the new endnode. It may have to
        (re)program some of the switches and IB routers to reflect the
        new topology. Therefore if the IB_join is not done there is a
        possibility that the fabric will fail to deliver the packet to
        some or all the recipients.

        If the multicast group does not exist the IB_join will fail.
        This can imply that there are no listeners on the subnet and
        the router doesn't expect to forward packets received on this
        group. However, this may not be the case. The IB group may not
        exist because the SM ran out of resources or the SM policy
        allows only a limited set of multicast groups to be created.
        Additionally it is not reasonable to expect the router to
        create IB groups for all the IP multicast addresses that it
        may be called upon to forward. It must be noted that, unlike
        many other media, IBA does not have a promiscuous mode in
        which the router can accept all the packets.

        Therefore, the multicast module of the IPoIB interface, when
        sending a multicast packet, needs to do one of the following:

                1) Join the IB multicast group corresponding to the IP
                   multicast address. This is the RECOMMENDED option
                   for multicast if the sender is itself a member of
                   the IP multicast group.

                   As noted earlier, a particular IB multicast group
                   may not exist for some reason. In such a case the
                   implementation MUST fall back to one of the
                   following methods.

                2) Send the multicast packet out with the
                   IB MGID/MLID associated with the all-systems IP
                   multicast address (224.0.0.1/FF02::1).

                   An IPv4 implementation that fails to join the IB
                   group corresponding to the IPv4 multicast address
                   being sent to, as in 1) above, must fall back to
                   this method or to the method given below.

                3) In IPv4 subnets if both the above conditions fail
                   then the packet MUST be sent with the IB MGID/MLID
                   corresponding to the IPv4 limited broadcast
                   address (255.255.255.255).
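
        The fallback order described above might be sketched as
        follows for an IPv4 sender. StubSM, JoinError and
        choose_send_group are hypothetical names, not part of any IB
        or IPoIB interface, and IP addresses stand in for the
        corresponding IB MGIDs. The sketch assumes the broadcast
        group was already joined when the subnet was brought up.

```python
ALL_SYSTEMS_V4 = "224.0.0.1"
BROADCAST_V4 = "255.255.255.255"


class JoinError(Exception):
    """Raised by the hypothetical SM client when an IB_join fails."""


class StubSM:
    """Stand-in for the subnet manager; knows which groups exist."""

    def __init__(self, existing_groups):
        self.existing = set(existing_groups)
        self.joined = set()

    def ib_join(self, group: str) -> None:
        if group not in self.existing:
            raise JoinError(group)
        self.joined.add(group)


def choose_send_group(sm: StubSM, dest: str) -> str:
    """Return the address whose IB group should carry the packet.

    Order: 1) the group for the destination itself, 2) the
    all-systems group, 3) the IPv4 limited broadcast group.
    """
    for candidate in (dest, ALL_SYSTEMS_V4):
        try:
            sm.ib_join(candidate)
            return candidate
        except JoinError:
            continue
    # The broadcast group always exists in an IPoIB subnet.
    return BROADCAST_V4


if __name__ == "__main__":
    sm = StubSM({ALL_SYSTEMS_V4})
    print(choose_send_group(sm, "239.1.2.3"))
```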

4.2.2 Receiving multicast packets

        The IP host must create the IB multicast group corresponding
        to the IP address and then join it. This follows from the IBA
        requirement that the receiver must join the relevant IB
        multicast group.

        A router could create the group on receiving the IGMP/MLD
        report but then the IP host would have to be informed of the
        creation. Therefore, it is simpler for the IB interface module
        on the IP host to first create the IB group and then send the
        IGMP/MLD message to the router. The router in turn needs to
        IB_join the specified IB group on receiving the IGMP/MLD
        report. This report must be sent out on the broadcast-MGID to
        ensure reception by the router(s).

        The router MAY choose to create IB groups corresponding to the
        IP groups it expects to forward.

        Thus the creation of IB groups is done by IP receivers or IP
        routers only and not by senders, thereby keeping things
        simple. The host must first try to join the group and only on
        failure attempt to create it.
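
        A minimal sketch of the receiver-side order of operations
        (join first, create only on failure, then report) is given
        below. InMemorySM, GroupNotFound and receiver_ip_join are
        hypothetical names used for illustration; a real
        implementation would talk to the SM/SA instead.

```python
class GroupNotFound(Exception):
    """Raised when an IB_join targets a group that does not exist."""


class InMemorySM:
    """Toy subnet manager: maps each MGID to its set of members."""

    def __init__(self):
        self.groups = {}

    def ib_join(self, mgid, node):
        if mgid not in self.groups:
            raise GroupNotFound(mgid)
        self.groups[mgid].add(node)

    def ib_create(self, mgid):
        self.groups.setdefault(mgid, set())


def receiver_ip_join(sm, node, mgid, send_report):
    """IB-level work done by a host when it IP_joins a group."""
    try:
        sm.ib_join(mgid, node)       # first try to join
    except GroupNotFound:
        sm.ib_create(mgid)           # create only on failure
        sm.ib_join(mgid, node)
    # The IGMP/MLD report follows; it is sent on the broadcast MGID
    # so that the router(s) receive it and can IB_join the group.
    send_report(mgid)


if __name__ == "__main__":
    sm = InMemorySM()
    receiver_ip_join(sm, "hostA", "mgid-1", print)
    print(sm.groups)
```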

4.2.2.1 Impact of InfiniBand Architecture Limits

        It must be noted that if the group exists or the creation
        succeeds the group will be IB_joined. However, if the join
        fails for some reason, the node can still transmit to the
        multicast group using the broadcast/all-IP nodes MGID since
        that is mandatory.

        It may be that the IB MGID could not be created/joined because
        of a transient error or policy limit/resource constraint at
        the SM. It may also be created at a later point in time. The
        receiver therefore would not be in the IB MGID corresponding
        to the IP address. Unfortunately there is no IB level support
        to let the listener know of the new IB MGID being created.

        If the underlying IB level indicates a transient failure the
        listener could periodically retry to join the IB group. The
        exact parameters and timers for such retries or an alternate
        solution are beyond the scope of IPoIB. These parameters, if
        needed, should be derived from the IB specification.
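
        One possible shape for such a periodic retry is sketched
        below. The attempt count and interval are arbitrary
        placeholders, since IPoIB does not define them, and
        TransientJoinError and join_with_retry are hypothetical
        names.

```python
import time


class TransientJoinError(Exception):
    """Hypothetical signal that an IB_join failed transiently."""


def join_with_retry(ib_join, mgid, attempts=5, interval=1.0,
                    sleep=time.sleep):
    """Call ib_join(mgid) until it succeeds or attempts run out.

    Returns True on success and False if every attempt failed with
    a transient error. Other exceptions propagate to the caller.
    """
    for attempt in range(attempts):
        try:
            ib_join(mgid)
            return True
        except TransientJoinError:
            # Wait before the next try, but not after the last one.
            if attempt + 1 < attempts:
                sleep(interval)
    return False
```

        While the retries are in progress the node can keep sending
        on the broadcast MGID, as noted above.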


        Note that multicasting can still continue since the packets
        can be sent out on the broadcast MGID (and MLID). However, a
        multicast listener won't receive any packets sent on the
        group's own MGID if other nodes could join that group but the
        listener could not. It must be realised that such a situation
        is not very likely.

        An HCA or TCA may have a limit on the number of MGIDs it can
        support. Thus, even though the groups may not be limited at
        the subnet manager and in the subnet as such, they may be
        limited at a particular interface. It is advisable to choose
        an adequately provisioned xCA when setting up an IPoIB
        subnet.

4.2.3 Leaving/Deleting a multicast group

        An IPv4 sender (level 1 compliance) IB_joins the IB multicast
        group only because that is the only way to guarantee reception
        of the packets by all the group recipients. The sender must
        however IB_leave the group at some time. It is advisable that
        a sender, when not a receiver on the group, start a timer per
        multicast group sent to. The sender leaves the IB group when
        the timer expires. It restarts the timer whenever another
        message is sent to that group.

        This recommendation doesn't apply to the IB broadcast group.
        It also doesn't apply to the IB group corresponding to the
        all-hosts multicast group. An IPv4 host must always remain a
        member of the broadcast group. It MAY choose to remain a
        member of the all-hosts group.

        Thus a sender that chooses to always send to the broadcast
        group and not to the specific multicast group does not need to
        implement a timer.
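
        The per-group idle timer suggested for senders could look
        roughly like the following. SenderGroupTimers is a
        hypothetical helper; the 60 second default is an arbitrary
        placeholder, and the caller is assumed never to register the
        broadcast group with it.

```python
import time


class SenderGroupTimers:
    """Tracks groups joined only for sending and leaves idle ones."""

    def __init__(self, ib_leave, idle_timeout=60.0,
                 clock=time.monotonic):
        self._ib_leave = ib_leave          # callback taking an MGID
        self._idle_timeout = idle_timeout  # seconds without a send
        self._clock = clock
        self._last_sent = {}               # MGID -> time of last send

    def note_send(self, mgid):
        """Start or restart the timer for a group just sent to."""
        self._last_sent[mgid] = self._clock()

    def expire(self):
        """IB_leave every group idle for at least idle_timeout.

        Meant to be called periodically; returns the groups left.
        """
        now = self._clock()
        expired = [m for m, t in self._last_sent.items()
                   if now - t >= self._idle_timeout]
        for mgid in expired:
            del self._last_sent[mgid]
            self._ib_leave(mgid)
        return expired
```

        Passing the clock in as a parameter keeps the sketch easy to
        exercise without waiting for real time to elapse.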

        An IP multicast receiver MUST IB_leave the corresponding IB
        multicast group when it IP_leaves the IP multicast group. In
        the case of an IPv4 implementation the receiver may choose to
        continue to be a sender (level 1 compliance). It MAY choose to
        not IB_leave the IB group but start a timer as explained
        above.

        A router is RECOMMENDED to IB_leave the IB multicast group
        when there are no members of the IP multicast address in the
        subnet and it has no explicit knowledge of any need to forward
        such packets.

        The router and the IP hosts SHOULD NOT IB_delete the IB
        multicast group when they IB_leave the group. It is possible
        for the same IB multicast group to be used by a non-IP
        protocol. The IB specification mentions an IB specific
        protocol that will delete the IB groups when it determines
        that there are no IB members of the group.

5.0 QoS and related issues

        The IB specification suggests the use of service levels for
        load balancing, QoS and deadlock avoidance within an IB
        subnet. But the IB specification leaves the usage of the SL
        and the method of determining it to the application. The SL
        and list of SLs are available in the SA but it is up to the
        endnode's application to choose the 'right' value.

        Every IPoIB implementation will determine the relevant SL
        value based on its own policy. No method or process for
        choosing the SL will be defined by the IPoIB standards.

6.0 Security Considerations

        Any multicast/broadcast communication is inherently insecure
        since anyone can receive the data. The applications must
        implement appropriate authentication/encryption methods for
        data security.

        The IP subnet communication can be disrupted by creating the
        IB broadcast/multicast groups with incompatible parameters.
        The implementations must leverage IB specific methods to
        protect against such situations.

7.0 Acknowledgement

        This document has benefited from the comments and suggestions
        of the members of the IPoIB working group and the members of
        the InfiniBand(SM) Trade Association.

8.0 References

[IB_ARCH]       InfiniBand Architecture Specification, Volume 1.0
[RFC_2373]      IP Version 6 Addressing Architecture
[RFC_2375]      IPv6 Multicast Address Assignments
[RFC_1700]      Assigned Numbers
[RFC_1112]      Host extensions for IP multicasting
[RFC_2236]      Internet Group Management Protocol, Version 2
[RFC_2710]      Multicast Listener Discovery


9.0 Author's Address

Vivek Kashyap

IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

        Copyright (C) The Internet Society (2001). All Rights Reserved.

        This document and translations of it may be copied and
        furnished to others, and derivative works that comment on or
        otherwise explain it or assist in its implementation may be
        prepared, copied, published and distributed, in whole or in
        part, without restriction of any kind, provided that the above
        copyright notice and this paragraph are included on all such
        copies and derivative works. However, this document itself may
        not be modified in any way, such as by removing the copyright
        notice or references to the Internet Society or other Internet
        organizations, except as needed for the purpose of developing
        Internet standards in which case the procedures for copyrights
        defined in the Internet Standards process must be followed, or
        as required to translate it into languages other than
        English.

        The limited permissions granted above are perpetual and will
        not be revoked by the Internet Society or its successors or
        assigns.

        This document and the information contained herein is provided
        on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
        ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
        USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
        ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
        PARTICULAR PURPOSE.

