Current Meeting Report

2.2.7 IP over InfiniBand (ipoib)

NOTE: This charter is a snapshot of the 53rd IETF Meeting in Minneapolis, MN USA. It may now be out-of-date. Last Modified: 08-Mar-02
Chair(s):
H.K. Jerry Chu
Bill Strahm
Internet Area Director(s):
Thomas Narten
Erik Nordmark
Internet Area Advisor:
Thomas Narten
Mailing Lists:
To Subscribe:
In Body: subscribe ipoverib
E-mail archive: t.html
Description of Working Group:

InfiniBand is an emerging standard intended as an interconnect for processor and I/O systems and devices (see the InfiniBand Trade Association web site for details). IP is one type of traffic (and a very important one) that could use this interconnect. InfiniBand would benefit greatly from a standardized method of handling IP traffic on IB fabrics. It is also important to be able to manage InfiniBand devices in a common way.

The working group will specify the procedures and protocols to support IPv4/v6 over an InfiniBand fabric. Further, it will specify the set of MIB objects to allow management of the InfiniBand protocol.

The scope of this WG is limited to the definition of an encapsulation format for carrying IPv4 and IPv6 over IB networks and for performing address resolution between IP addresses and IB link-layer addresses. At the present time, more advanced functionalities such as mapping IP QoS into IB-specific capabilities are out of scope. Such work items may be considered in the future, but will require a recharter.

Work items

1. Specify a standards track procedure for supporting ARP/ND packets, and resolving IP addresses to IB link addresses.

2. Specify a standards track encapsulation for carrying IPv4 and IPv6 packets over IB.

3. Determine how to transfer IP multicast over IB, and specify a standard for doing so. IB has an optional receiver join multicast capability. Current working group plans are to use IB multicast as part of ARP, so using it for IP multicast as well may be a reasonable approach.

4. Specify a standards track channel adapter MIB that will allow management of an InfiniBand channel adapter. InfiniBand types will also need to be approved and added to the ifType values defined by IANA.

5. Specify a standards track baseboard management MIB that will allow management of specified device properties.

6. Specify sample counter MIBs to allow InfiniBand sample counters to be exposed to external SNMP management applications.

Goals and Milestones:
Done   Submit initial Internet-Draft of ARP encapsulation
Done   Submit initial Internet-Draft of Requirements/Overview
Done   Submit initial Internet-Draft of IP V4/V6 Encapsulation
Done   Submit initial Internet-Draft of Infiniband-Like MIB
Jul 01   Submit initial Internet-Draft of Channel Adapter MIB
Done   Submit initial Internet-Draft of Multicast
Nov 01   Submit initial Internet-Draft of Baseboard MIB
Nov 01   Submit initial Internet-Draft of Sample Counter MIB
Feb 02   Submit initial Internet-Draft of Subnet Management MIB
Mar 02   Submit ARP/IP/Multicast encapsulation drafts for IESG Last Call
Mar 02   Submit Infiniband-Like MIB for IESG Last Call
Mar 02   Submit Channel Adapter MIB for IESG Last Call
No Request For Comments

Current Meeting Report

Minutes of IPoIB meeting 3/18/2002

(about 42 people in the room)

Administrative items were reviewed:
- blue sheets
- minute taker (Jim Pinkerton)
- agenda

Proposed agenda:
- link and multicast draft (Jerry) 30 min
- architecture/encap (Vivek) 30 min
- Advanced capabilities (Vivek) 30 min
- Next steps (Jerry) 15 minutes

Jerry reviewed the changes from the multicast draft -00 to the -01. He stated that the changes were minor, to address feedback he had received.

IPoIB Link Boundary:
- Treat four IB layers as L2 to IP
- IB partitions <-> IP links
- IB partition may span multiple IB subnets
o Must use right scope bits
- Leave IB cross-subnet unicast/multicast details to IBTA

The first point above allows the IPoIB spec to largely ignore the details of when and whether traffic crosses IB subnets (with the caveat on scope bits above).

Multicast Address mapping
ServiceRecord vs. algorithmic mapping: we chose the latter because the advantage of ServiceRecord wasn't clear. It also avoids an additional lookup.
How many address bits to use? Decided to use the whole multicast group ID (see spec). There were some concerns that some implementations only enable the lower 23 bits on Ethernet. Some felt that these were "implementation bugs" and that there was no reason we shouldn't use the full address space.
Embed an IPoIB signature and P_Key in MGIDs. The IBTA requires each MGID to be unique within an IB subnet, not within a partition. Thus the P_Key must be part of the multicast address to ensure the multicast address is unique across the IB subnet.
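The algorithmic mapping can be illustrated with a minimal Python sketch. The field layout used here (0xFF prefix, 4-bit flags, 4-bit scope, 16-bit signature, 16-bit P_Key, 80-bit group ID) and the default flag and scope values are assumptions taken from the layout later published for IPoIB, not from the draft discussed at this meeting; the function name ipoib_mgid is hypothetical. The 401B and 601B signatures are the ones mentioned later in these minutes.

```python
import ipaddress

# IPoIB signatures mentioned in the minutes: 0x401B for IPv4, 0x601B for IPv6.
IPOIB_SIGNATURE = {4: 0x401B, 6: 0x601B}


def ipoib_mgid(group: str, p_key: int, scope: int = 0x2, flags: int = 0x1) -> bytes:
    """Build a 16-byte MGID from an IP multicast address and a P_Key.

    Assumed layout (not confirmed by the minutes):
    0xFF | flags(4) | scope(4) | signature(16) | P_Key(16) | group ID(80)
    """
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not (0 <= scope <= 0xF and 0 <= flags <= 0xF):
        raise ValueError("flags and scope must each fit in 4 bits")

    if addr.version == 4:
        # All IPv4 multicast addresses share the top 4 bits (1110),
        # so only the low 28 bits carry information.
        group_id = int(addr) & 0x0FFFFFFF
    else:
        # Keep the low-order 80 bits of the IPv6 multicast address.
        group_id = int(addr) & ((1 << 80) - 1)

    return (
        bytes([0xFF, (flags << 4) | scope])
        + IPOIB_SIGNATURE[addr.version].to_bytes(2, "big")
        + p_key.to_bytes(2, "big")
        + group_id.to_bytes(10, "big")
    )
```

Because the P_Key is part of the result, the same IP group in two partitions yields two distinct MGIDs, which is the uniqueness property the discussion above is after.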
Default MTU? Some folks wanted a default; others asked why. The current design is to let the admin decide and set it in the MCGroupRecord. It must work for both unicast and multicast.
Minimal MTU? Thomas stated he felt there needed to be a minimum MTU.
Jerry commented that IPv6 is 1280, IB physical minimum is 2048.
Jim Pinkerton mentioned that his concern is complexity. It seems easier to just require 2048-byte MTUs as a minimum. There was quite a bit of discussion on the pluses of setting a clear direction on what IPoIB requires vs. what folks could implement. Thomas and JimP argued for a minimum 2 KB MTU; Vivek and (???) argued for not specifying one and allowing the admin to control this.
Multicast Sender
Does the sender need to join the target multicast group to send?
- The current draft is written with the assumption that it does not.
<skipped stuff here>

Multicast Forwarding (routing)
- RFC 2236/2710 currently require interfaces on routers to be in promiscuous multicast mode.
- Solution: the IPoIB driver module always sends a copy to the all-routers multicast group
- Router and listener will receive duplicate copies
o Driver skips joins on routers
o Concern that it feels hacky.
o Vivek comments he has an alternative solution which doesn't require two copies

What if multicast join fails?
- Join the all-router multicast group
- Will receive every multicast packet
o Need to filter packets

Vivek's Presentation

Review architecture draft
Encapsulation draft
Advanced capabilities

He received quite a few comments on the -00, but comments on the -01 have been primarily about grammar and typos.

He briefly reviewed some of the main architectural themes of the arch spec: what the link characteristics are, and the definition of a multicast GID. The "401B" and "601B" signatures stand for "IPv4/6 over IB". He pointed out that the default scope for multicast is "local", and that it can be made wider once the IBTA defines how this is done.

Vivek points out that the current draft requires two multicast packets to be sent.

Vivek reviewed how Ethernet handles this:
o Join the IP multicast group
o Ask interface to update reception filters
o Send IGMP/MLD report
- IPoIB interface filter
o Join IB multicast group
   - Create if it doesn't exist
o Send IGMP/MLD report
- Router cannot forward unless it joins every group.

Possible solution:
- Ensure IGMP/MLD report reaches router
o Host sends report to all-routers MGID. One of:
   - Send to all-routers MGID and IP group MGID
   - Send report to broadcast MGID only (all receive it)
   - Solution is within the IPoIB interface driver
o Router joins MGID for the group

Jerry pointed out that this possible solution requires a smarter router: it has to parse IGMP (two versions) and MLD. Jerry is also concerned about the leave operation, which falls back to a timer rather than a one-to-one mapping from the IP layer to the IB layer. Vivek agrees, but feels it is appropriate since IB is not Ethernet, and feels it can be localized to just the driver in an end-node. For a router he's not as sure. He mentioned, though, that Voltaire (an IB router company) has reviewed the proposal and thought it was workable.

Leaving/Deleting MGIDs
- never deleted by IPoIB elements
o IB spec lists IB algorithms for pruning
o MGIDs may end up shared with other protocols
- IB_leave MGID when IP leaves IP mcast
- If pure sender IB_Leave after some idle time
- Router IB_Leaves if no listeners and no need to forward packets

Vivek pointed out that the issue is not whether the leave is optional; the leave is required. The issue is the deletion of the MGID.

<missed a slide>

Link layer address in current draft is 20 bytes: GID:QPN:reserved (16+3+1)
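As a minimal sketch of that 20-byte layout, the Python below packs and unpacks an address in the order the minutes give (GID, then QPN, then one reserved byte). The function names and the big-endian encoding of the QPN are assumptions for illustration; the draft itself is the authority on byte order and field placement.

```python
def pack_link_address(gid: bytes, qpn: int) -> bytes:
    """Pack GID (16 bytes) : QPN (3 bytes) : reserved (1 byte) = 20 bytes."""
    if len(gid) != 16:
        raise ValueError("GID must be exactly 16 bytes")
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    # Reserved byte is sent as zero in this sketch.
    return gid + qpn.to_bytes(3, "big") + b"\x00"


def unpack_link_address(addr: bytes) -> tuple[bytes, int]:
    """Split a 20-byte link-layer address back into (GID, QPN)."""
    if len(addr) != 20:
        raise ValueError("link-layer address must be exactly 20 bytes")
    gid = addr[:16]
    qpn = int.from_bytes(addr[16:19], "big")
    # addr[19] is the reserved byte and is ignored here.
    return gid, qpn
```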

Vivek reviewed the positions on whether ethertype is required.

Discussion on advanced architecture draft.

Vivek reviewed the slides. Advantages of using RC mode (larger MTU, APM, multiple QPs).

The focus is to put the onus of complexity on the advanced modes and keep UD simple. Suggested flags are RC | UC | RD | QPN, possibly SDP. UD is always implied, and thus doesn't need a flag.

Interoperability rules:
- Default: set to zero on transmit, ignore on receive
- Advanced interfaces
o Set supported capabilities on transmit
o Process on receive
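The interoperability rules above can be sketched as follows. The bit values assigned to each flag are hypothetical (the minutes list only the flag names), and treating "process on receive" as intersecting the peer's flags with the local set is one possible implementation choice, not something the draft is reported to mandate.

```python
from enum import IntFlag


class Cap(IntFlag):
    """Suggested capability flags; the bit positions here are hypothetical."""
    RC = 0x1
    UC = 0x2
    RD = 0x4
    QPN = 0x8


def flags_to_transmit(supported: Cap, advanced: bool) -> Cap:
    # Default interfaces set the flags to zero on transmit;
    # advanced interfaces advertise what they support.
    return supported if advanced else Cap(0)


def usable_capabilities(local: Cap, received: Cap, advanced: bool) -> Cap:
    # Default interfaces ignore the flags on receive, so only UD
    # (always implied, no flag) is usable.
    if not advanced:
        return Cap(0)
    # Assumption: an advanced interface uses only what both ends support.
    return local & received
```

An empty result means the two nodes fall back to UD, which every interface supports.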

Walked through some examples.

There was a question on how two nodes decide which advanced capability to use. Vivek's preference is to keep this outside of the spec and allow implementation choice.

Use of multiple QPNs, use SIDR. Vivek walked through a possible packet format.

Vivek has gotten some comments from the reflector (large MTU size, eliminate TCP checksum, add SDP flag).

JimP stated that he felt the main value in this approach is the larger MTU. Vivek agreed.
JimP voiced a concern that the reflector seemed to think that path MTU could be leveraged to support this. JimP's opinion is that this maps much more closely to the ATM model, where a circuit is set up to a particular destination. OSes that support path MTU would not be able to enable this capability easily. There seemed to be some consensus on this point.

DHCP Discussion

Vivek voiced some concern about re-opening the argument of LID/GID, and while he would be fine with just using the LID, he does not recommend re-opening the issue.

Thus he is recommending that we use the broadcast GID for the server reply. Proposal:
htype = 32, hlen = 0
Always use client identifier.
- 4 bytes (default zero)
o distinguishes multiple IPoIB interfaces per port
o timestamp or QPN: client needs to remember it
- 16 bytes (GID)
He claims this requires no change to the DHCP server.
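A minimal sketch of the proposed client identifier value, assuming the 4-byte field precedes the 16-byte GID as listed and is encoded big-endian. The names below are hypothetical, and wrapping the value in the DHCP client-identifier option is deliberately left out.

```python
# Fixed header values from the proposal.
HTYPE_IPOIB = 32
HLEN_IPOIB = 0


def ipoib_client_identifier(port_gid: bytes, discriminator: int = 0) -> bytes:
    """Return the 20-byte identifier: 4-byte discriminator + 16-byte GID.

    The discriminator (default zero) tells apart multiple IPoIB
    interfaces on one port; the client has to remember the value it used.
    """
    if len(port_gid) != 16:
        raise ValueError("GID must be exactly 16 bytes")
    if not 0 <= discriminator < (1 << 32):
        raise ValueError("discriminator must fit in 4 bytes")
    return discriminator.to_bytes(4, "big") + port_gid
```

A client that derives the discriminator from a QPN or timestamp would need to store it persistently so that the identifier stays the same after a reboot.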

Jerry commented on <???? Sorry - missed it>. Jerry would like to see a solution that does not require either a server-side or a client-side change. JimP and several others commented that we can probably get away without a server change, but the client will have to be changed to some degree. Another concern was making sure it was clear that the client-id should be preserved across a reboot.

Jerry mentioned that he's not that concerned about using the broadcast. He is concerned that there are other protocols out there that we haven't looked at yet (similar to DHCP in that there were open issues), and strongly recommends folks start looking at other protocols on IPoIB.

MIB Status - led by Sean Harnedy

Reviewed the MIBs
- textual convention MIB
- Interface MIB
- Subnet management agent MIB: has a revised version going to the reflector

Some commenters suggested possibly making the counters 64 bits. It was pointed out that we have drafts on 2 out of 5 chartered MIBs.

Jerry commented that we don't need editors. We have a lack of participation: we need draft writers, not editors.

Next Steps:
Basic drafts: resolve remaining issues (ethertype specifically).
Jerry asked the group whether people felt that there had been enough discussion on the reflector. He gave a quick summary of the issues. The main technical issue is that if ethertype is not present, then arbitrary protocols can't run on top. Several folks discussed their perspectives, some stating it is cleaner to have it explicit, some claiming it is cleaner not to.
Vivek pointed out that in the advanced case, ethertype is not needed.
Jerry pointed out that even in the advanced case you might want to share the QP. Jerry stated his personal opinion (not as co-chair) that he is not sure he buys the argument that it is okay to dedicate the QP.
He thinks it is valid from an IBTA perspective: a QP is a service point.
Jerry's concern is that there may not be that many QPs. JimP also stated he felt it was cleaner, and that it doesn't cost very much for a lot of flexibility in the future. Jerry asked for any more comments. None were offered.

Sense of the room
People voting for ethertype: 8.
People voting against ethertype: 2.
People who don't care one way or the other: 5.
Those who need more time to have an opinion: 2.
Folks that need more time stated that getting more opinions from the reflector would be a good thing.

Jerry is concerned that we haven't gotten enough active review of the drafts.

He also asked for more authors.

MIBs: need more participation
Advanced features: need more discussion
Issue last call before next IETF?


Agenda part 2
IP Over InfiniBand Working Group Management Information Bases