INTERNET DRAFT                                              V. Kashyap
                                                                   IBM
Expiration Date: October 27, 2001                       April 27, 2001


                 IPv4 and ARP over InfiniBand networks

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This document presents a way of encapsulating IPv4 and Address
Resolution Protocol (ARP) packets over InfiniBand and also describes a
mechanism for IPv4 address resolution on InfiniBand fabrics.

Table of Contents

1.0 Introduction
2.0 InfiniBand data link

Kashyap                                                       [Page 1]
INTERNET-DRAFT       IPv4 and ARP over InfiniBand       April 26, 2001

    2.1 UD packet format
        2.1.1 Local Routing header
        2.1.2 Global Routing header
        2.1.3 Base Transport Header
        2.1.4 Datagram Extended Transport Header
        2.1.5 IPv4 over UD requirements
3.0 IPv4 Address resolution
    3.1 Path MTU
    3.2 Service Level
    3.3 InfiniBand ARP
        3.3.1 InfiniBand ARP header
        3.3.2 Hardware address format
            3.3.2.1 LID
            3.3.2.2 Capability flag
            3.3.2.3 QPN and Q_Key
            3.3.2.4 GID
    3.4 InfiniBand ARP process
    3.5 ARP packet encapsulation
    3.6 IPv4 across IB subnet implementation
4.0 IPv4 encapsulation in UD packets
5.0 Additional Features
6.0 IANA Considerations
7.0 Security Considerations
8.0 References
9.0 Author's Address
10.0 APPENDIX A
Full Copyright Statement

1.0 Introduction

The reader is referred to APPENDIX A at the end of this document for a
brief description of the InfiniBand(TM) architecture. The InfiniBand
specification [1] can be found at www.infinibandta.org.

The document 'IP over InfiniBand: Overview, issues and requirements'
[2] provides a short overview of the InfiniBand architecture and of the
issues in specifying IP over InfiniBand.

This document restricts itself to IPv4 and ARP over InfiniBand. A
subsequent document will define IPv6 encapsulation and address
resolution over InfiniBand.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

2.0 InfiniBand data link

InfiniBand (IB) provides multiple methods of packet exchange between
two endpoints. These are:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram
     - Raw IPv6 (R6)
     - Raw Ethertype (RE)

IPv4 and ARP can be specified over any one, several or all of these
methods.
A case can be made for support on any of the methods depending on the
desired parameters. However, only Unreliable Datagram is required to
be supported by all IB nodes. Host channel adapters (HCAs) are
additionally required to support the Reliable Connected and Unreliable
Connected modes.

Additionally, for the sake of simplicity and ease of implementation
and integration with existing stacks, it is desirable that the fabric
support multicast (and broadcast). This is possible only in the
Unreliable Datagram (UD) and Raw Datagram modes of IB.

Given the above conditions this document specifies a method to
encapsulate IPv4 and ARP over the UD mode of InfiniBand.

An IPv4 over InfiniBand implementation MUST support IPv4 and ARP over
the Unreliable Datagram mode of InfiniBand. The Address Resolution
Protocol (ARP) MUST NOT be supported over any mode other than
Unreliable Datagram.

2.1 UD packet format

The UD packet may be transmitted in two ways:

1) Local (within an IB subnet) packets

+--------+---------+---------+-------+---------+---------+
|Local   |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header   |Transport|       |         |         |
|        |         |Header   |       |         |         |
+--------+---------+---------+-------+---------+---------+

2) Global (between IB subnets) packets

+--------+-------+---------+---------+-------+---------+---------+
|Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header |Header   |Transport|       |         |         |
|        |       |         |Header   |       |         |         |
+--------+-------+---------+---------+-------+---------+---------+

2.1.1 Local Routing header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Virtual|Link   |Service|Rsr|LNH|     Destination Local ID      |
| Lane  |Version| Level |vd |   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Reserved |    Packet Length    |        Source Local ID        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Of the header elements, the sending node's IPv4 stack must know the
Service Level, the destination LID and the source LID. In addition,
the packet length cannot specify a payload of more than the path MTU
between the source and the destination ports. The other values are
either well-known standard values or are determined from other known
values. For example, the VL is determined from the SL.

2.1.2 Global Routing header

This header is used when the packet must traverse IB subnet
boundaries. The GRH has the same layout as the IPv6 header, and a GID
has the same format as an IPv6 address.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |              Flow Label               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Payload Length         |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Source GID                           |
|                                                               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Destination GID                        |
|                                                               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header is needed only if the packet is sent across IB subnets.
Note that from the point of view of the IPv4 layer the GID is another
form of MAC address, albeit an incomplete one, since the LID is always
needed for any communication.

The version is always set to 6. The Traffic Class, Flow Label etc. are
likely to be determined in response to a policy, or default values may
be used. The Next Header field always indicates the BTH (Base
Transport Header). The Hop Limit is a function of the configuration.
Only the destination GID needs to be determined from the resolution of
the target IPv4 address to the link layer address.
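
As an illustration of the GRH layout described above, the following is a minimal Python sketch of how the 40-byte header could be serialised. It is not taken from the InfiniBand specification: the helper name pack_grh and the constant names are hypothetical, and the Next Header value 0x1B is an assumption about the code point that announces a BTH, which should be checked against [1].

```python
import struct

GRH_VERSION = 6           # the text above says the version is always 6
NEXT_HEADER_BTH = 0x1B    # assumption: code point announcing a BTH
GRH_LEN = 40              # 8 bytes of fixed fields + two 16-byte GIDs


def pack_grh(tclass, flow_label, payload_len, hop_limit, sgid, dgid,
             next_header=NEXT_HEADER_BTH):
    """Serialise a GRH in network byte order (hypothetical helper)."""
    if len(sgid) != 16 or len(dgid) != 16:
        raise ValueError("a GID is 128 bits (16 bytes)")
    if not 0 <= tclass <= 0xFF:
        raise ValueError("Traffic Class is an 8-bit field")
    if not 0 <= flow_label <= 0xFFFFF:
        raise ValueError("Flow Label is a 20-bit field")
    # First 32-bit word: 4-bit version, 8-bit traffic class,
    # 20-bit flow label, exactly as drawn in the diagram.
    word0 = (GRH_VERSION << 28) | (tclass << 20) | flow_label
    # Second word: 16-bit payload length, 8-bit next header,
    # 8-bit hop limit. struct raises struct.error if out of range.
    fixed = struct.pack("!IHBB", word0, payload_len, next_header, hop_limit)
    return fixed + bytes(sgid) + bytes(dgid)
```

In this sketch only dgid would come from address resolution; the remaining arguments stand for the policy or default values mentioned in the text.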

2.1.3 Base Transport Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    OpCode     |S|M|PC | Tver  |     Partition Key (P_Key)     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Reserved    |          Destination Queue Pair (QP)          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|A|  Reserved   |            Packet Sequence Number             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Of these, the P_Key and the destination QP must be determined as part
of the IPv4 address resolution process. The rest of the fields are
either not used by UD mode or are filled in by the channel adapter
based on local conditions/values.

The P_Key index in the P_Key table is attached to the QP used for
transmission of packets. If the P_Key table on the port is more than
one entry deep, the software needs to decide which P_Key to use.

Note: The P_Key table can be written only by the SM [1].

When multicasting, the destination QP is always set to 0xFFFFFF.

2.1.4 Datagram Extended Transport Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Queue Key (Q_Key)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Reserved    |               Source Queue Pair               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header includes the sender's queue pair number and the Q_Key used
in the communication.

2.1.5 IPv4 over UD requirements

Based on the above headers, the IPv4 implementation must know the
following information before it can send a packet to a peer:

   1. LID
   2. GID
      A GID is required if IPv4 subnets span multiple IB subnets. A
      GID is also required, albeit the multicast group's GID, when
      sending multicast packets.
   3. Service Level
   4. Path MTU between the communicating ports
   5. Partition Key
   6. Queue Pair Number
   7. Q_Key

3.0 IPv4 Address resolution

Address resolution in its most basic form requires a mapping from the
IPv4 address to the link layer address. This is generally the port
identifier. However, a packet in IB requires additional auxiliary
information as noted above. All this information must be determined.

Of the information noted in the previous section, the peer knows its
own LID, GID, Partition Key, Queue Pair and Q_Key. It can therefore
return these values in the ARP reply.

The service level and path MTU can only be determined once the port
identifiers of both endpoints are known. The identifiers can then be
used to determine the service level (SL) and the path MTU between
them.

Thus, to get all the information that can comprise a link address in
InfiniBand UD fabrics, the subnet manager/subnet administrator (SM/SA)
needs to be consulted. Such a setup however introduces unwanted
complexity and possibly delay. Additionally, it may impact the
scalability of the IPv4 subnets in IB subnets.

The solution proposed in this draft does away with consulting the
SM/SA for the missing information. This is achieved by utilising the
subnet wide parameters that are configured for the IB multicast GID
corresponding to IPv4 broadcast [5].

3.1 Path MTU

Instead of determining the path MTU from the SM/SA, a subnet wide path
MTU is satisfactory. This not only removes the requirement of querying
the SM/SA but is also likely to ease integration with existing stacks,
since it is not common to utilise destination based MTUs at the link
level in a given subnet. Path based MTUs may also cause fragmentation
of multicast packets over some links and are thus not preferred.

Every IPv4 subnet over InfiniBand will use only one IPv4 subnet wide
MTU.
This value is derived from the MTU configured for the IB multicast
group corresponding to the IPv4 broadcast address (see section 3.4).
This MTU cannot be greater than 4096 bytes. The link MTUs defined by
InfiniBand are of sizes 256, 512, 1024, 2048 and 4096 bytes. These
values apply to the packet payload as defined in the InfiniBand
specification and do not include the headers [1].

3.2 Service Level

The service level(s) to be used between two endpoints are recorded in
the SA. The SA is queried with the identifiers of the two endpoints to
receive the SLs. However, the choice of a particular SL depends on the
policy implemented at the endpoints, as long as the chosen SL is one
of the valid SLs listed in the SA for the two endpoints.

It is desirable to avoid going to the SM/SA to determine the SL in the
interest of simplicity of implementation.

3.3 InfiniBand ARP

This document proposes to utilise the address resolution protocol as
defined in RFC826 [6]. The ARP request packet is broadcast to the IPv4
subnet. This packet includes the target IPv4 address and the sender's
link layer address. The response, carrying the target's link layer
address, is unicast to the sender.

The incorporation of the above protocol allows easy integration with
existing stacks.

3.3.1 InfiniBand ARP header

The standard ARP packet header is of the form (as per RFC 826):

   16 bits: hardware type
   16 bits: protocol type
    8 bits: length of hardware address
    8 bits: length of protocol address
   16 bits: ARP operation

The hardware type will take the value corresponding to InfiniBand. A
request to IANA will be made for this allocation.

The rest of the fields will be used consistent with RFC 826.

The remaining fields in the packet hold the sender/target hardware and
protocol addresses.

   [ sender hardware address ]
   [ sender protocol address ]
   [ target hardware address ]
   [ target protocol address ]

3.3.2 Hardware address format

    16 bits : LID
     8 bits : Capability flag (UC|RC|RE|R6|QPN)
    24 bits : QPN
    32 bits : Q_Key
   128 bits : GID

Note that this is the format on the wire. It does not imply the data
structure used by an end host in its ARP cache.

3.3.2.1 LID

This is the LID associated with the port to which the IPv4 address is
attached by way of the logical interface.

3.3.2.2 Capability flag

Only the first 5 bits are defined. The rest are reserved for future
use.

The first 4 bits denote the InfiniBand modes over which IPv4 is
supported:

   UC - unreliable connected
   RC - reliable connected
   RE - raw ethertype
   R6 - raw IPv6

The support of IPv4 over UD is mandatory and therefore need not be
indicated in these bits. The rest are all optional. The
implementation details of the other modes are beyond the scope of this
document. The flags provide a way for IPv4 over IB implementations to
advertise these capabilities to one another. Whether the capabilities
are then used is a choice between the communicating endpoints.

QPN flag:

The QPN flag indicates that the endpoint supports applications that
are tied to specific QPs. Since there may be a large number of QPs
available at the endpoints (the QP number is 24 bits), an endpoint can
choose to map various services (protocol and port pairs) to specific
QPNs. This flag indicates the use of such demultiplexing. The flag
will be set by hosts that want to advertise such a use. Endpoints that
do not support QPN demultiplexing do not set this flag.

The presence of multiple QPNs for the same IPv4 address introduces
multiple link addresses (differing in QPNs). Most implementations are
unable to handle such a case.
To provide for interoperability, the receiver is free to ignore this
flag, continue to use the default QPN (described below) and not
determine the service related QPN. By the same token, a host that
implements QPN based demultiplexing MUST accept packets that are
received on the default QPN even if it is demultiplexing the
corresponding service by use of QPNs.

The method of resolving a service to the corresponding QPN is not
defined in this document.

3.3.2.3 QPN and Q_Key

This is the default QPN to be used to communicate with the endnode.
The sender lists the QPN to which it expects packets to be sent, and
the target replies with its own QPN. The Q_Key is the corresponding
Q_Key the endpoints intend to use.

3.3.2.4 GID

The GID is needed only if an IB subnet boundary is traversed. Some
implementations may prefer this mode, though it is not recommended to
implement IPv4 subnets spanning IB subnets [5].

The use of the GID also fulfils the need of implementations that might
prefer to use a well defined, largely invariant link address to
identify endpoints.

Note that to actually send a packet, the LID of the next hop (the IB
router or the peer) is always needed.

3.4 InfiniBand ARP process

The source broadcasts the ARP_REQUEST packet to the IPv4 subnet. The
mapping from IPv4 broadcast to IB multicast group is defined in
draft-kashyap-ipoib-ipv4-multicast-00.txt [5]. The broadcast packet
itself is a UD packet and hence requires the parameters listed in
section 2.1.5.

As defined in [5], the IPv4 subnet is set up with the IPv4 broadcast
address mapped to an IB multicast group. This address is registered
with the IB subnet's SM/SA. Along with it are registered
characteristics such as the:

   LID
   P_Key
   Q_Key
   Service Level
   MTU
   Traffic Class
   Hop Limit
   Flow Label

Note that since these are applicable to the IPv4 broadcast address,
the fabric administrator must ensure that these parameters are
honoured across the IPv4 subnet.
Otherwise the broadcast cannot be sent to all the IPv4 hosts: those
that cannot honour these parameters will not be able to join the IPv4
broadcast address. Thus the service level must be supported across the
multicast group, the MTU must be common across the subnet, and so on.

All these values are returned to the node joining the group. Therefore
the act of joining the IPv4 broadcast address resolves many of the
parameters needed to send/receive packets in the IPv4 subnet.

Every IPv4 interface MUST join the IB multicast group corresponding to
the IPv4 subnet broadcast address. This first step is necessary for
address resolution and for general IPv4 and ARP support on InfiniBand
subnets.

In the interest of simplicity it is RECOMMENDED that implementations
use the parameters returned by the joining of the IPv4 broadcast group
in all communication.

Implementations are free to utilise IB specific messages and methods
to determine alternate values if they so desire.

Thus the InfiniBand ARP broadcast packet utilises the information
received as above, and the rest of the process is therefore identical
to the standard implementation of ARP over Ethernet as described in
RFC826.

All multicast packets in IB use the QP number 0xFFFFFF, thus it does
not have to be determined for sending the ARP broadcast.

During the ARP request/response the QPN and Q_Key being used by the
two endpoints are exchanged along with the LID and the GID. The rest
of the values, as stated above, are determined at the time the
broadcast group was joined.
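
The values exchanged during ARP, as described above, travel in the hardware address format of section 3.3.2. The following Python sketch shows one way that 26-byte (208-bit) address could be packed and parsed. It is only an illustration under stated assumptions: the class name IBHardwareAddress and the CAP_* constants are hypothetical, and the bit positions of the capability flags are assumed to follow the listed order (UC, RC, RE, R6, QPN) starting at the most significant bit, which the draft does not spell out.

```python
import struct
from dataclasses import dataclass

# Assumed bit positions: flags in the order listed in section 3.3.2,
# most significant bit first. The draft does not fix the positions.
CAP_UC, CAP_RC, CAP_RE, CAP_R6, CAP_QPN = 0x80, 0x40, 0x20, 0x10, 0x08

HW_ADDR_LEN = 26  # 16 + 8 + 24 + 32 + 128 bits


@dataclass
class IBHardwareAddress:
    lid: int      # 16-bit LID of the port
    caps: int     # 8-bit capability flag
    qpn: int      # 24-bit default QPN
    q_key: int    # 32-bit Q_Key
    gid: bytes    # 128-bit GID

    def pack(self) -> bytes:
        """Return the on-the-wire form in network byte order."""
        if not 0 <= self.lid <= 0xFFFF:
            raise ValueError("LID is a 16-bit field")
        if not 0 <= self.caps <= 0xFF:
            raise ValueError("capability flag is an 8-bit field")
        if not 0 <= self.qpn <= 0xFFFFFF:
            raise ValueError("QPN is a 24-bit field")
        if not 0 <= self.q_key <= 0xFFFFFFFF:
            raise ValueError("Q_Key is a 32-bit field")
        if len(self.gid) != 16:
            raise ValueError("GID is 128 bits (16 bytes)")
        return (struct.pack("!HB", self.lid, self.caps)
                + self.qpn.to_bytes(3, "big")
                + struct.pack("!I", self.q_key)
                + bytes(self.gid))

    @classmethod
    def unpack(cls, data: bytes) -> "IBHardwareAddress":
        """Parse a 26-byte hardware address taken from an ARP packet."""
        if len(data) != HW_ADDR_LEN:
            raise ValueError("expected a 26-byte hardware address")
        lid, caps = struct.unpack("!HB", data[:3])
        qpn = int.from_bytes(data[3:6], "big")
        (q_key,) = struct.unpack("!I", data[6:10])
        return cls(lid, caps, qpn, q_key, bytes(data[10:26]))
```

Under this layout the ARP "length of hardware address" field would carry the value 26. As section 3.3.2 notes, the wire format does not dictate how a host stores entries in its ARP cache.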

3.5 ARP packet encapsulation

The ARP packet takes the format:

+-------+------+---------+---------+---------+---------+---------+---------+
|Local  |      |Base     |Datagram |         |   ARP   |Invariant| Variant |
|Routing| GRH  |Transport|Extended |Ethertype| Request |   CRC   |   CRC   |
|Header |Header|Header   |Transport|         |         |         |         |
|       |      |         |Header   |         |         |         |         |
+-------+------+---------+---------+---------+---------+---------+---------+

The Ethertype is the value 0x0806 as defined in RFC1700 [7].

The GRH is always included to allow for the following two cases:

   1. The InfiniBand specification requires the use of the GRH when
      multicasting.
   2. The GRH is needed if the packet is being transmitted across IB
      subnets.

3.6 IPv4 across IB subnet implementation

Such an implementation is not recommended. However, if it is
implemented, the LID and GID corresponding to a particular IPv4
address will not belong to the same port. This makes no difference
from the point of view of the IPv4 and ARP caches. The IB ARP
implementation MUST however ensure that the LID of the IB router is
used when the packet is to be sent across an IB subnet.

4.0 IPv4 encapsulation in UD packets

+-------+------+---------+---------+---------+---------+---------+---------+
|Local  |      |Base     |Datagram |         |  IPv4   |Invariant| Variant |
|Routing| GRH  |Transport|Extended |Ethertype| Payload |   CRC   |   CRC   |
|Header |Header|Header   |Transport|         |         |         |         |
|       |      |         |Header   |         |         |         |         |
+-------+------+---------+---------+---------+---------+---------+---------+

The Ethertype is the value 0x0800 as defined in RFC1700 [7].

The GRH is included to allow for the following two cases:

   1. The InfiniBand specification requires the use of the GRH when
      multicasting.
   2. The GRH is needed if the packet is being transmitted across IB
      subnets.

Note that the GRH may be omitted from unicast packets that are
destined for the same IB subnet. The determination of this requirement
rests with the IB driver.
It does not matter to the upper layers whether a GRH was included in
the packet headers or not. InfiniBand implementations will
interoperate in all cases [1].

5.0 Additional Features

This document has presented a simple, efficient, interoperable method
of address resolution and of ARP and IPv4 encapsulation in InfiniBand
packets.

The basic desire of the author is to present a method that integrates
easily with existing implementations. Another strong desire is to
ensure interoperability between implementations by requiring default
values that are easy to set up, without inhibiting those
implementations that need some additional features.

There may be situations where implementations desire more 'optimal'
performance or features. Such implementations depend on the fabric
administrator ensuring that the SM/SA and the fabric components are
correctly set up for the desired features.

These cases could be:

1. Use of methods other than UD for IPv4 packets

   Example: One could utilise Reliable Connected mode to get a higher
   MTU, and utilise UDP for some communications since the lower layer
   is reliable.

   The ARP proposal in this document allows for an indication of such
   a capability. It is up to the implementations to then utilise
   InfiniBand specific ways (use of the Connection Manager etc.) to
   set up the necessary communication. The IPv4 encapsulation can stay
   the same except that the relevant IB headers will be used.

   The only requirement is that the Address Resolution Protocol MUST
   be implemented over the UD mode of InfiniBand. This allows for a
   common mode of determining the link address and link capabilities
   across the IPv4 subnet.

2. Use of alternate SL

   Some implementations might want to determine alternate SL values
   from the SM/SA. This is a valid option but is unrelated to the IPv4
   implementation. This document recommends that the SL associated
   with the IPv4 subnet wide broadcast address, i.e. with the
   corresponding IB multicast group, which by definition is valid for
   the whole of the IPv4 subnet, be used by default.

   Alternate choices depend on the implementation consulting the SM/SA
   and on the fabric administrator ensuring that such choices are
   valid and available. Such a choice could also depend on the quality
   of service mappings from IPv4 to InfiniBand implemented on the
   host.

3. Use of alternate TClass, FlowLabel

   The logic governing these parameters is the same as in the previous
   case for the SL. Additionally, these may be determined by a policy
   that is intertwined with IPv4 routing or IB routing or both. The
   discussion of such issues is not relevant to this document.

4. Use of QPN demultiplexing

   QPN demultiplexing has been described in some detail in section
   3.3.2.2. Thus the implementations that desire the more 'optimal'
   behaviour can obtain it in an interoperable way. The method of
   determining the service bindings to QPs is beyond the scope of this
   document.

6.0 IANA Considerations

To support ARP over InfiniBand, an assignment of the Address
Resolution Parameter 'Number Hardware Type (hrd)' is required. This
number may be assigned as per the first-come-first-served policy
defined in RFC2434 [8].

7.0 Security Considerations

This document specifies IPv4 packet transmission over a broadcast
network. Any network of this kind is vulnerable to a sender claiming
another's identity in order to forge traffic or eavesdrop. It is the
responsibility of the higher layers or applications to implement
suitable counter-measures if this is a problem.

8.0 References

[1] InfiniBand Architecture Specification, Volume 1, Release 1.0.
[2] draft-kashyap-ipoib_requirements-00.txt. V. Kashyap.
[3] RFC2373: IP Version 6 Addressing Architecture. R. Hinden,
    S. Deering.
[4] RFC2375: IPv6 Multicast Address Assignments. R. Hinden,
    S. Deering.
[5] draft-kashyap-ipoib-ipv4-multicast-00.txt. V. Kashyap.
[6] RFC826: An Ethernet Address Resolution Protocol. David C. Plummer.
[7] RFC1700: Assigned Numbers. J. Reynolds, J. Postel.
[8] RFC2434: Guidelines for Writing an IANA Considerations Section in
    RFCs. T. Narten, H. Alvestrand.

9.0 Author's Address

   Vivek Kashyap
   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Work: 503 578 3422
   Email: vivk@us.ibm.com

10.0 APPENDIX A: Introduction to InfiniBand

For a more complete overview the reader is referred to chapter 3 of
the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network (SAN) for
connecting multiple independent processor platforms, I/O platforms and
I/O devices. The IBA SAN is a communications and management
infrastructure supporting both I/O and inter-processor communications
for one or more computer systems.

An IBA SAN consists of processor nodes and I/O units connected through
an IBA fabric made up of cascaded switches and IB routers (connecting
IB subnets). I/O units can range in complexity from single ASIC IBA
attached devices, such as a LAN adapter, to a large memory rich RAID
subsystem.

An IBA network is subdivided into subnets interconnected by IB
routers. These are IB routers and IB subnets, not IP routers or IP
subnets.

Each IB node or switch may attach to one or more switches, or nodes
may attach directly to each other. Each node interfaces with the link
by way of channel adapters (CAs). The architecture supports multiple
CAs per unit, with each CA providing one or more ports that connect to
the fabric. Each CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent. However, each
of the ports may include multiple QPs (queue pairs) that may be
directly addressed from a remote peer. From the point of view of data
transfer the QP number (QPN) is part of the address.

IBA supports both connection oriented and datagram service between the
ports.
The peers are identified by the QPN and the port identifier. In raw
datagram mode the QPN is not used.

A port may be identified by a Local ID (LID) and optionally a Global
ID (GID). The GID is 128 bits long and is formed by the concatenation
of a 64 bit subnet prefix and a 64 bit EUI-64 compliant portion
(GUID). The LID is a 16 bit value that is assigned when the port
becomes active. Note that the GUID is the only persistent identifier
of a port. However, it cannot be used as an address in a packet. If
the prefix is modified then the GID may change. The subnet manager may
attempt to keep the LID values constant across shutdowns but that is
not a requirement.

The assignment of the GID and the LID is done by the subnet manager.
Every IB subnet has at least one subnet manager component that
controls the fabric. It assigns the LIDs and GIDs and programs the
switches so that they route packets between destinations. The subnet
manager and a related component, the subnet administrator (SA), are
the central repository of all information that is required to set up
and bring up the fabric.

IB routers are components that route packets between IB subnets based
on the GIDs. Thus within an IB subnet a packet may or may not include
a GID, but when crossing an IB subnet boundary the GID must be
included. A LID is always needed in a packet since the destination
within a subnet is determined by it.

A CA and a switch may have multiple ports. Each CA port is assigned
its own LID or a range of LIDs. The ports of a switch are not
addressable by LIDs/GIDs; in other words, they are transparent to
other end nodes. Each port has its own set of buffers. The buffering
is channeled through virtual lanes (VLs), where each VL has its own
flow control. There may be up to 16 VLs.

VLs provide a mechanism for creating multiple virtual links within a
single physical link.
All ports however must support VL15, which is reserved exclusively for
subnet management datagrams and hence does not concern the IPoIB
discussions. The actual VL that a port uses is configured by the SM
and is based on the Service Level (SL) specified in every packet.
There are 16 possible SLs.

In addition to the features described above, viz. Queue Pairs (QPs),
Service Levels (SLs) and addressing (GID/LID), IBA also defines the
following:

P_Keys or partition keys:

   Every packet, except for the raw datagrams, carries the partition
   key (P_Key). These values are used for isolation in the fabric. A
   switch may be programmed by the SM to drop packets not having a
   certain key (this is an optional feature). The same is the case
   with the receiving CA.

Q_Keys:

   These are used to enforce access rights for reliable and unreliable
   IB datagram services. Raw datagram services do not require this
   value. At communication establishment the endpoints exchange the
   Q_Keys and must always use the relevant Q_Keys when communicating
   with one another.

Multicast support:

   A switch may support multicasting, i.e. replication of packets
   across multiple output ports. This is an optional feature at the
   switches. A multicast group is identified by a GID. The GID format
   is as defined in RFC 2373 on IPv6 addressing. Thus from the point
   of view of IPv6 over IB the data link multicast address looks like
   the network address. An IB node must explicitly join a multicast
   group by a request to the SM in order to receive packets. A node
   may send packets to any multicast group. In both cases the
   multicast LID to be used in the packets is received from the SM.

There are 6 transport types specified by the IB architecture. These
are:

1. Unreliable Datagram (unacknowledged - connectionless)

   The UD service is connectionless and unacknowledged. It allows the
   QP to communicate with any unreliable datagram QP on any node.

   The switches, and hence each link, can support only a certain MTU.
   The possible MTUs are 256 bytes, 512 bytes, 1024 bytes, 2048 bytes
   and 4096 bytes. A UD packet cannot be larger than the smallest link
   MTU between the two peers.

2. Reliable Datagram (acknowledged - multiplexed)

   The RD service is multiplexed over connections between nodes called
   End to End Contexts (EECs), which allows each RD QP to communicate
   with any RD QP on any node with an established EEC. Multiple QPs
   can use the same EEC and a single QP can use multiple EECs (one for
   each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

   The RC service associates a local QP with one and only one remote
   QP. The message sizes may be as large as 2^31 bytes in length. The
   CA implementation takes care of segmentation and assembly.

4. Unreliable Connected (unacknowledged - connection oriented)

   The UC service associates one local QP with one and only one remote
   QP. There is no acknowledgment and hence no resend of lost or
   corrupted packets. Such packets are therefore simply dropped. It is
   similar to RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

   The Ethertype raw datagram packet contains a generic transport
   header that is not interpreted by the CA but specifies the protocol
   type. The values for ethertype are the same as defined in RFC1700
   for ethertype.

6. Raw IPv6 (unacknowledged - connectionless)

   Using the IPv6 raw datagram service, the IBA CA can support
   standard protocol layers atop IPv6 (such as TCP/UDP). Thus native
   IPv6 packets can be bridged into the IBA SAN and delivered directly
   to a port and to its IPv6 raw datagram QP.

The first 4 are referred to as IB transports. The latter two are
classified as Raw datagrams. There is no indication of the QP number
in the raw datagram packets. The raw datagram packets are limited in
size by the link MTU.
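
The link MTU limit that applies to UD and raw datagram packets, as described above, can be illustrated with a short Python sketch. The function names path_mtu and fits_in_ud are hypothetical, and the sketch assumes the caller already knows the MTU of every link between the two peers, which this draft otherwise avoids by using one subnet wide MTU.

```python
# Link MTUs defined by InfiniBand, in bytes (payload only, no headers).
IB_LINK_MTUS = (256, 512, 1024, 2048, 4096)


def path_mtu(link_mtus):
    """Return the smallest link MTU along a path (hypothetical helper)."""
    mtus = list(link_mtus)
    if not mtus:
        raise ValueError("a path has at least one link")
    for mtu in mtus:
        if mtu not in IB_LINK_MTUS:
            raise ValueError("%r is not an InfiniBand link MTU" % (mtu,))
    return min(mtus)


def fits_in_ud(payload_len, link_mtus):
    """True if a payload of payload_len bytes fits in one UD packet."""
    return 0 <= payload_len <= path_mtu(link_mtus)
```

For a path made of links with MTUs of 2048, 1024 and 4096 bytes, path_mtu returns 1024, so a 1500 byte payload would not fit in a single UD packet on that path.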

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.