| < draft-ietf-ipoib-architecture-03.txt | draft-ietf-ipoib-architecture-04.txt > | |||
|---|---|---|---|---|
| INTERNET DRAFT | INTERNET DRAFT | |||
| <draft-ietf-ipoib-architecture-03.txt> Vivek Kashyap | <draft-ietf-ipoib-architecture-04.txt> Vivek Kashyap | |||
| Expiration Date: April, 2004 IBM | Expiration Date: October, 2004 IBM | |||
| October, 2003 | April, 2004 | |||
| IP over InfiniBand(IPoIB) Architecture | IP over InfiniBand(IPoIB) Architecture | |||
| Status of this memo | Status of this memo | |||
| This document is an Internet-Draft and is in full conformance | This document is an Internet-Draft and is in full conformance | |||
| with all provisions of Section 10 of RFC 2026. | with all provisions of Section 10 of RFC 2026. | |||
| Internet-Drafts are working documents of the Internet | Internet-Drafts are working documents of the Internet | |||
| Engineering Task Force (IETF), its areas, and its working | Engineering Task Force (IETF), its areas, and its working | |||
| skipping to change at page 2, line 5 | skipping to change at page 2, line 5 | |||
| InfiniBand is a high speed, channel based interconnect between | InfiniBand is a high speed, channel based interconnect between | |||
| systems and devices. | systems and devices. | |||
| This document presents an overview of the InfiniBand | This document presents an overview of the InfiniBand | |||
| architecture. It further describes the requirements and | architecture. It further describes the requirements and | |||
| guidelines for the transmission of IP over InfiniBand. | guidelines for the transmission of IP over InfiniBand. | |||
| Discussions in this document are applicable to both IPv4 and | Discussions in this document are applicable to both IPv4 and | |||
| IPv6 unless explicitly specified. The encapsulation of IP over | IPv6 unless explicitly specified. The encapsulation of IP over | |||
| InfiniBand and the mechanism for IP address resolution on IB | InfiniBand and the mechanism for IP address resolution on IB | |||
| fabrics are covered in [IPOIB_ENCAP] and [IPOIB_DHCP]. | fabrics are covered in other documents. | |||
| Table of Contents | Table of Contents | |||
| 1.0 Introduction to InfiniBand | 1.0 Introduction to InfiniBand | |||
| 1.1 InfiniBand Architecture Specification | 1.1 InfiniBand Architecture Specification | |||
| 1.2 Overview of InfiniBand Architecture | 1.2 Overview of InfiniBand Architecture | |||
| 1.2.1 InfiniBand Addresses | 1.2.1 InfiniBand Addresses | |||
| 1.2.1.1 Unicast GIDs | 1.2.1.1 Unicast GIDs | |||
| 1.2.1.2 Multicast GIDs | 1.2.1.2 Multicast GIDs | |||
| 1.3 InfiniBand Multicast Group Management | 1.3 InfiniBand Multicast Group Management | |||
| skipping to change at page 2, line 28 | skipping to change at page 2, line 28 | |||
| 1.3.2 Join and Leave operations | 1.3.2 Join and Leave operations | |||
| 1.3.2.1 Creating a Multicast Group | 1.3.2.1 Creating a Multicast Group | |||
| 1.3.2.3 Deleting a Multicast Group | 1.3.2.3 Deleting a Multicast Group | |||
| 1.3.2.4 Multicast Group Create/Delete Traps | 1.3.2.4 Multicast Group Create/Delete Traps | |||
| 2.0 Management of InfiniBand Subnet | 2.0 Management of InfiniBand Subnet | |||
| 3.0 IP over IB | 3.0 IP over IB | |||
| 3.1 InfiniBand as Datalink | 3.1 InfiniBand as Datalink | |||
| 3.2 Multicast Support | 3.2 Multicast Support | |||
| 3.2.1 Mapping IP Multicast to IB Multicast | 3.2.1 Mapping IP Multicast to IB Multicast | |||
| 3.2.2 Transient Flag in IB MGIDs | 3.2.2 Transient Flag in IB MGIDs | |||
| 3.3 IP Subnet Across IB Subnets ? | 3.3 IP Subnet Across IB Subnets | |||
| 4.0 IP Subnets in InfiniBand Fabrics | 4.0 IP Subnets in InfiniBand Fabrics | |||
| 4.1 IPoIB VLANs | 4.1 IPoIB VLANs | |||
| 4.2 Multicast in IPoIB Subnets | 4.2 Multicast in IPoIB Subnets | |||
| 4.2.1 Sending IP Multicast Datagrams | 4.2.1 Sending IP Multicast Datagrams | |||
| 4.2.2 Receiving Multicast Packets | 4.2.2 Receiving Multicast Packets | |||
| 4.2.3 Forwarding Multicast Packets | 4.2.3 Forwarding Multicast Packets | |||
| 4.2.4 Impact of InfiniBand Architecture Limits | 4.2.4 Impact of InfiniBand Architecture Limits | |||
| 4.2.5 Leaving/Deleting a Multicast Group | 4.2.5 Leaving/Deleting a Multicast Group | |||
| 5.0 QoS and Related Issues | 5.0 QoS and Related Issues | |||
| 6.0 Security Considerations | 6.0 Security Considerations | |||
| 7.0 Acknowledgements | 7.0 Acknowledgments | |||
| 8.0 References | 8.0 References | |||
| 9.0 Author's Address | 9.0 Author's Address | |||
| 1.0 Introduction to InfiniBand | 1.0 Introduction to InfiniBand | |||
| The InfiniBand Trade Association(IBTA) was formed to develop | The InfiniBand Trade Association(IBTA) was formed to develop | |||
| an I/O specification to deliver a channel based, switched | an I/O specification to deliver a channel based, switched | |||
| fabric technology. The InfiniBand standard is aimed at meeting | fabric technology. The InfiniBand standard is aimed at meeting | |||
| the requirements of scalability, reliability, availability and | the requirements of scalability, reliability, availability and | |||
| performance of servers in data centers. | performance of servers in data centers. | |||
| skipping to change at page 7, line 29 | skipping to change at page 7, line 26 | |||
| its IPv6 raw datagram QP. | its IPv6 raw datagram QP. | |||
| The first 4 types are referred to as IB transports. The latter | The first 4 types are referred to as IB transports. The latter | |||
| two are classified as Raw datagrams. There is no indication of | two are classified as Raw datagrams. There is no indication of | |||
| the QP number in the raw datagram packets. The raw datagram | the QP number in the raw datagram packets. The raw datagram | |||
| packets are limited by the link MTU in size. | packets are limited by the link MTU in size. | |||
| The two connected modes and the reliable datagram mode may | The two connected modes and the reliable datagram mode may | |||
| also support 'Automatic Path Migration(APM)'. This is an | also support 'Automatic Path Migration(APM)'. This is an | |||
| optional facility that provides for a hardware based path | optional facility that provides for a hardware based path | |||
| failover. An alternate path is associated with the QP when the | fail over. An alternate path is associated with the QP when | |||
| connection/EE context is first created. If unrecoverable | the connection/EE context is first created. If unrecoverable | |||
| errors are encountered the connection switches to using the | errors are encountered the connection switches to using the | |||
| alternate path. | alternate path. | |||
| 1.2.1 InfiniBand Addresses | 1.2.1 InfiniBand Addresses | |||
| The InfiniBand architecture borrows heavily from the IPv6 | The InfiniBand architecture borrows heavily from the IPv6 | |||
| architecture in terms of the InfiniBand subnet structure and | architecture in terms of the InfiniBand subnet structure and | |||
| global identifiers (GIDs). | global identifiers (GIDs). | |||
| The InfiniBand architecture defines the global identifier | The InfiniBand architecture defines the GID associated with a | |||
| associated with a port as follows: | port as a 128-bit unicast or multicast identifier. IBA derives | |||
| | the GID address format from the IPv6 format[RFC_2373] with | |||
| GID (Global Identifier): A 128-bit unicast or | some additional properties/restrictions defined to facilitate | |||
| multicast identifier used to identify a port on a | efficient discovery, communication and routing. | |||
| channel adapter, a port on a router, a switch, or a | ||||
| multicast group. A GID is a valid 128-bit IPv6 | ||||
| address(per RFC 2373) with additional | ||||
| properties/restrictions defined within IBA to | ||||
| facilitate efficient discovery, communication, and | ||||
| routing. | ||||
| Note: These rules apply only to IBA operation and do | ||||
| not apply to raw IPv6 operation unless specifically | ||||
| called out. | ||||
| The raw IPv6 operation referred to in the note | Note: | |||
| above is the IPv6 mode of InfiniBand's raw datagram | The IBA refers to [RFC_2373] explicitly. It must be noted | |||
| service. It does not mean IPv6 itself. The routers and | that IBA is therefore unaffected by any further changes | |||
| switches referred to in the above definition are the | that are introduced in IPv6 addressing architecture. | |||
| InfiniBand routers and switches. | ||||
| The InfiniBand(IB) specification defines two types of GIDs: | IBA defines two types of GIDs: | |||
| unicast and multicast. | unicast and, | |||
| | multicast. | |||
| 1.2.1.1 Unicast GIDs | 1.2.1.1 Unicast GIDs | |||
| The unicast GIDs are defined, as in IPv6, with three scopes. | The unicast GIDs are defined, as in IPv6, with three scopes. | |||
| The IB specification states: | The IB specification states: | |||
| a. link local: This is defined to be FE80/10. | a. link local: This is defined to be FE80/10. | |||
| The IB routers will not forward packets with a | The IB routers will not forward packets with a | |||
| link local address in source or destination | link local address in source or destination | |||
| skipping to change at page 8, line 47 | skipping to change at page 8, line 36 | |||
| c. global: | c. global: | |||
| A unicast GID with a global prefix, i.e. an IB | A unicast GID with a global prefix, i.e. an IB | |||
| router may use this GID to route packets | router may use this GID to route packets | |||
| throughout an enterprise or internet. | throughout an enterprise or internet. | |||
| 1.2.1.2 Multicast GIDs | 1.2.1.2 Multicast GIDs | |||
| The multicast GIDs also parallel the IPv6 multicast addresses. | The multicast GIDs also parallel the IPv6 multicast addresses. | |||
| The IB specification defines the multicast GIDs as follows: | The IB specification defines the multicast GIDs as follows: | |||
| FFxy:<112 bits> | FFxy:<112 bits> | |||
| Flag bits: | Flag bits: | |||
| The nibble, denoted by x above, are the 4 flag bits: 000T. | The nibble, denoted by x above, are the 4 flag bits: 000T. | |||
| The first three bits are reserved and are set to zero. The | The first three bits are reserved and are set to zero. The | |||
| last bit is defined as follows: | last bit is defined as follows: | |||
| T=0: denotes a permanently assigned i.e. well known GID | T=0: denotes a permanently assigned i.e. well known GID | |||
| T=1: denotes a transient group | T=1: denotes a transient group | |||
| Scope bits: | Scope bits: | |||
| The 4 bits, denoted by y in the GID above, are the scope | The 4 bits, denoted by y in the GID above, are the scope | |||
| bits. These scope values are described in Table 1. | bits. These scope values are described in Table 1. | |||
| scope value Address value | scope value Address value | |||
| 0 Reserved | 0 Reserved | |||
| 1 Unassigned | 1 Unassigned | |||
| 2 Link-local | 2 Link-local | |||
| 3 Unassigned | 3 Unassigned | |||
| 4 Unassigned | 4 Unassigned | |||
| 5 Site-local | 5 Site-local | |||
| 6 Unassigned | 6 Unassigned | |||
| 7 Unassigned | 7 Unassigned | |||
| 8 Organization-local | 8 Organization-local | |||
| 9 Unassigned | 9 Unassigned | |||
| 0xA Unassigned | 0xA Unassigned | |||
| 0xB Unassigned | 0xB Unassigned | |||
| 0xC Unassigned | 0xC Unassigned | |||
| 0xD Unassigned | 0xD Unassigned | |||
| 0xE Global | 0xE Global | |||
| 0xF Reserved | 0xF Reserved | |||
| Table 1 | Table 1 | |||
| The IB specification further refers to [RFC_2373] and | The IB specification further refers to [RFC_2373] and | |||
| [RFC_2375] while defining the well known multicast addresses. | [RFC_2375] while defining the well known multicast addresses. | |||
| However, it then states that the well known addresses apply to | However, it then states that the well known addresses apply to | |||
| IB raw IPv6 datagrams only. It must be noted though that a | IB raw IPv6 datagrams only. It must be noted though that a | |||
| multicast group can be associated with only a single MGID. | multicast group can be associated with only a single MGID. | |||
| Thus the same MGID cannot be associated with the UD mode and | Thus the same MGID cannot be associated with the UD mode and | |||
| the raw datagram mode. | the raw datagram mode. | |||
| 1.3 InfiniBand Multicast Group Management | 1.3 InfiniBand Multicast Group Management | |||
| skipping to change at page 10, line 36 | skipping to change at page 10, line 25 | |||
| characteristics that define a group. | characteristics that define a group. | |||
| A LID is associated with the multicast group by the subnet | A LID is associated with the multicast group by the subnet | |||
| manager(SM) at the time of the multicast group creation. The | manager(SM) at the time of the multicast group creation. The | |||
| SM determines the multicast tree based on all the group | SM determines the multicast tree based on all the group | |||
| members and programs the relevant switches. The Multicast | members and programs the relevant switches. The Multicast | |||
| LID(MLID) is used by the switches to route the packets. | LID(MLID) is used by the switches to route the packets. | |||
| Any member IB port wanting to participate in the multicast | Any member IB port wanting to participate in the multicast | |||
| group must join the group. As part of the join operation the | group must join the group. As part of the join operation the | |||
| port receives the group characteristics from the SM. At the | node receives the group characteristics from the SM. At the | |||
| same time the subnet manager ensures that the requester can | same time the subnet manager ensures that the requester can | |||
| indeed participate in the group by verifying that it can | indeed participate in the group by verifying that it can | |||
| support the group MTU, and accessibility to the rest of the | support the group MTU, and accessibility to the rest of the | |||
| group members. Other group characteristics may need | group members. Other group characteristics may need | |||
| verification too. | verification too. | |||
| The SM, for groups that span IB subnet boundaries, must | The SM, for groups that span IB subnet boundaries, must | |||
| interact with IB routers to determine the presence of this | interact with IB routers to determine the presence of this | |||
| group in other IB subnets. If present the MTU must match | group in other IB subnets. If present the MTU must match | |||
| across the IB subnets. | across the IB subnets. | |||
| skipping to change at page 11, line 26 | skipping to change at page 11, line 15 | |||
| MGID - Multicast GID for this multicast group | MGID - Multicast GID for this multicast group | |||
| PortGID - Valid GID of the port joining this multicast group | PortGID - Valid GID of the port joining this multicast group | |||
| Q_Key - Q_Key to be used by this multicast group | Q_Key - Q_Key to be used by this multicast group | |||
| MLID - Multicast LID for this multicast group | MLID - Multicast LID for this multicast group | |||
| MTU - MTU for this multicast group | MTU - MTU for this multicast group | |||
| P_Key - Partition key for this multicast group | P_Key - Partition key for this multicast group | |||
| SL - Service Level for this multicast group | SL - Service Level for this multicast group | |||
| Scope - Same as MGID address scope | Scope - Same as MGID address scope | |||
| JoinState - Join/Leave status requested by the port: | JoinState - Join/Leave status requested by the port: | |||
| bit 0: FullMemeber | bit 0: FullMember | |||
| bit 1: NonMember | bit 1: NonMember | |||
| bit 2: SendOnlyNonMember | bit 2: SendOnlyNonMember | |||
| 1.3.1.1 JoinState | 1.3.1.1 JoinState | |||
| The JoinState indicates the membership qualities a port wishes | The JoinState indicates the membership qualities a port wishes | |||
| to add while joining/creating a group or delete when leaving a | to add while joining/creating a group or delete when leaving a | |||
| group. The meaning of the JoinState bits are: | group. The meaning of the JoinState bits are: | |||
| FullMember: | FullMember: | |||
| skipping to change at page 12, line 51 | skipping to change at page 12, line 40 | |||
| the group. | the group. | |||
| Note that a special 'delete' message does not exist. It is a | Note that a special 'delete' message does not exist. It is a | |||
| side effect of the last FullMember 'leave' operation. | side effect of the last FullMember 'leave' operation. | |||
| 1.3.2.4 Multicast Group Create/Delete Traps | 1.3.2.4 Multicast Group Create/Delete Traps | |||
| The SA may be requested by the ports to generate a report | The SA may be requested by the ports to generate a report | |||
| whenever a multicast group is created or deleted. The port can | whenever a multicast group is created or deleted. The port can | |||
| specify the multicast group it is interested in i.e. use a | specify the multicast group it is interested in i.e. use a | |||
| specific MGID or use a wildcard request. The SA will report | specific MGID or use a wild card request. The SA will report | |||
| these events using traps 66 (for creates) and 67 (for | these events using traps 66 (for creates) and 67 (for | |||
| deletes)[IB_ARCH]. | deletes)[IB_ARCH]. | |||
| Therefore, a port wishing to join a group but not create it by | Therefore, a port wishing to join a group but not create it by | |||
| itself may request a create notification or a port might even | itself may request a create notification or a port might even | |||
| request a notification for all groups that are created(a | request a notification for all groups that are created(a | |||
| wildcarded request). The SA will diligently inform them of the | wild card request). The SA will diligently inform them of the | |||
| creation utilising the aforementioned traps. The requestor can | creation utilizing the aforementioned traps. The requester can | |||
| then join the multicast group indicated. Similarly, a | then join the multicast group indicated. Similarly, a | |||
| SendOnlyNonMember or a NonMember might request the SA to | SendOnlyNonMember or a NonMember might request the SA to | |||
| inform it of group deletions. The endnode, on receiving a | inform it of group deletions. The endnode, on receiving a | |||
| delete report, can safely release the resources associated | delete report, can safely release the resources associated | |||
| with the group. The associated MLID is no longer valid for the | with the group. The associated MLID is no longer valid for the | |||
| group and may be reassigned to a new multicast group by the | group and may be reassigned to a new multicast group by the | |||
| SM. | SM. | |||
| 2.0 Management of InfiniBand Subnet | 2.0 Management of InfiniBand Subnet | |||
| To aid in the monitoring and configuration of InfiniBand | To aid in the monitoring and configuration of InfiniBand | |||
| subnet components a set of MIBs need to be defined. MIBs are | subnet components a set of MIB modules need to be defined. | |||
| needed for the channel adapters, InfiniBand interfaces, | MIB modules are needed for the channel adapters, InfiniBand | |||
| InfiniBand subnet manager, InfiniBand subnet management agents | interfaces, InfiniBand subnet manager, InfiniBand subnet | |||
| and to allow the management of specific device properties. It | management agents and to allow the management of specific | |||
| must be noted that the management objects addressed in the | device properties. It must be noted that the management | |||
| IPoIB documents are for all of the IB subnet components and | objects addressed in the IPoIB documents are for all of the | |||
| are not limited to IP(over IB). The relevant MIBs are | IB subnet components and are not limited to IP(over IB). | |||
| described in separate documents and are not covered here. | The relevant MIB modules are described in separate | |||
| | documents and are not covered here. | |||
| 3.0 IP over IB | 3.0 IP over IB | |||
| As described in section 1.0, the InfiniBand architecture | As described in section 1.0, the InfiniBand architecture | |||
| provides a broad set of capabilities to choose from when | provides a broad set of capabilities to choose from when | |||
| implementing IP over InfiniBand networks. | implementing IP over InfiniBand networks. | |||
| The IPoIB specification must not, and does not, require | The IPoIB specification must not, and does not, require | |||
| changes in IP and higher layer protocols. Nor does it mandate | changes in IP and higher layer protocols. Nor does it mandate | |||
| requirements on IP stacks to implement special user level | requirements on IP stacks to implement special user level | |||
| programs. It is an aim of IPoIB specification that the IPoIB | programs. It is an aim of IPoIB specification that the IPoIB | |||
| changes be amenable to modularisation and incorporation into | changes be amenable to modularization and incorporation into | |||
| existing implementations at the same level as other media | existing implementations at the same level as other media | |||
| types. | types. | |||
| 3.1 InfiniBand as Datalink | 3.1 InfiniBand as Datalink | |||
| InfiniBand architecture provides multiple methods of data | InfiniBand architecture provides multiple methods of data | |||
| exchange between two endpoints as was noted above. These are: | exchange between two endpoints as was noted above. These are: | |||
| Reliable Connected (RC) | Reliable Connected (RC) | |||
| Reliable Datagram (RD) | Reliable Datagram (RD) | |||
| skipping to change at page 14, line 42 | skipping to change at page 14, line 26 | |||
| fabric support multicasting. This is possible only in | fabric support multicasting. This is possible only in | |||
| Unreliable datagram (UD) and IB's Raw datagram modes. | Unreliable datagram (UD) and IB's Raw datagram modes. | |||
| Thus it is only the UD mode that is universal, supports | Thus it is only the UD mode that is universal, supports | |||
| multicast, and a robust CRC. Given these conditions it is the | multicast, and a robust CRC. Given these conditions it is the | |||
| obvious choice for IP over InfiniBand [IPOIB_ENCAP]. | obvious choice for IP over InfiniBand [IPOIB_ENCAP]. | |||
| Future documents might consider the connected modes. In | Future documents might consider the connected modes. In | |||
| contrast to the limited link MTU offered by UD mode, the | contrast to the limited link MTU offered by UD mode, the | |||
| connected modes can offer significant benefit in terms of | connected modes can offer significant benefit in terms of | |||
| performance by utilising a larger MTU. Reliability is also | performance by utilizing a larger MTU. Reliability is also | |||
| enhanced if the underlying feature of automatic path migration | enhanced if the underlying feature of automatic path migration | |||
| of connected modes is utilised. | of connected modes is utilized. | |||
| 3.2 Multicast Support | 3.2 Multicast Support | |||
| InfiniBand specification makes support of multicasting in the | InfiniBand specification makes support of multicasting in the | |||
| switches optional. Multicast however, is a basic requirement | switches optional. Multicast however, is a basic requirement | |||
| in IP networks. Therefore, IPoIB requires that multicast | in IP networks. Therefore, IPoIB requires that multicast | |||
| capable InfiniBand fabrics be used to implement IPoIB | capable InfiniBand fabrics be used to implement IPoIB | |||
| subnets. | subnets. | |||
| 3.2.1 Mapping IP Multicast to IB Multicast | 3.2.1 Mapping IP Multicast to IB Multicast | |||
| skipping to change at page 15, line 38 | skipping to change at page 15, line 14 | |||
| section 1.3. The IB specification also defines some well known | section 1.3. The IB specification also defines some well known | |||
| IB multicast GIDs(MGIDs). The MGIDs are reserved for the IB's | IB multicast GIDs(MGIDs). The MGIDs are reserved for the IB's | |||
| Raw datagram mode which is incompatible with the other | Raw datagram mode which is incompatible with the other | |||
| transports of IB. Any mapping that is defined from IP | transports of IB. Any mapping that is defined from IP | |||
| multicast addresses therefore must not fall into IB's | multicast addresses therefore must not fall into IB's | |||
| definition of a well-known address. | definition of a well-known address. | |||
| Therefore all IPoIB related multicast GIDs always set the | Therefore all IPoIB related multicast GIDs always set the | |||
| transient bit. | transient bit. | |||
| 3.3 IP Subnets Across IB Subnets ? | 3.3 IP Subnets Across IB Subnets | |||
| Some implementations may wish to support multiple clusters of | Some implementations may wish to support multiple clusters of | |||
| machines in their own IB subnets but otherwise be part of a | machines in their own IB subnets but otherwise be part of a | |||
| common IP subnet. For such a solution the IB specification | common IP subnet. For such a solution the IB specification | |||
| needs multiple upgrades. Some of the required enhancements | needs multiple upgrades. Some of the required enhancements | |||
| are: | are: | |||
| 1) A method for creating IB multicast GIDs that span multiple | 1) A method for creating IB multicast GIDs that span multiple | |||
| IB subnets. The partition keys and other parameters need to | IB subnets. The partition keys and other parameters need to | |||
| be consistent across IB subnets. | be consistent across IB subnets. | |||
| skipping to change at page 16, line 25 | skipping to change at page 15, line 49 | |||
| The IPoIB subnet is overlaid over the IB subnet. The IPoIB | The IPoIB subnet is overlaid over the IB subnet. The IPoIB | |||
| subnet is brought up in the following steps: | subnet is brought up in the following steps: | |||
| Note: the join/leave operation at the IP level will be | Note: the join/leave operation at the IP level will be | |||
| referred to as IP_join/IP_leave and the join/leave | referred to as IP_join/IP_leave and the join/leave | |||
| operations at the IB level will be referred to as | operations at the IB level will be referred to as | |||
| IB_join/IB_leave in this document. | IB_join/IB_leave in this document. | |||
| 1. The all-IPoIB nodes IB multicast group is created | 1. The all-IPoIB nodes IB multicast group is created | |||
| The fabric administrator creates an IB multicast | The fabric administrator creates a IB multicast | |||
| group(henceforth called 'broadcast group') when the IP subnet | group(henceforth called 'broadcast group') when the IP subnet | |||
| is setup. The 'broadcast group' is defined in [IPOIB_ENCAP]. | is setup. The 'broadcast group' is defined in [IPOIB_ENCAP]. | |||
| The method by which the broadcast group is setup is not | The method by which the broadcast group is setup is not | |||
| defined by IPoIB. The group may be setup at the SM by the | defined by IPoIB. The group may be setup at the SM by the | |||
| administrator or by the first IB_join. | administrator or by the first IB_join. | |||
| As noted earlier, at the time of creating an IB multicast | As noted earlier, at the time of creating an IB multicast | |||
| group, multiple values such as the P_Key, Q_Key, Service | group, multiple values such as the P_Key, Q_Key, Service | |||
| Level, Hop Limit, Flow ID, TClass, MTU etc., have to be | Level, Hop Limit, Flow ID, TClass, MTU etc., have to be | |||
| specified. These values should be such that all potential | specified. These values should be such that all potential | |||
| members of the IB multicast group are able to communicate | members of the IB multicast group are able to communicate | |||
| with one another when using them. In the future, as the IB | with one another when using them. In the future, as the IB | |||
| skipping to change at page 17, line 31 | skipping to change at page 17, line 8 | |||
| However, the P_Key must still be known to the IPoIB endnode | However, the P_Key must still be known to the IPoIB endnode | |||
| before it can join the broadcast-group. The P_Key is included | before it can join the broadcast-group. The P_Key is included | |||
| in the mapping of the broadcast group[IPOIB_ENCAP]. Another | in the mapping of the broadcast group[IPOIB_ENCAP]. Another | |||
| parameter, the scope of the broadcast group, also needs to be | parameter, the scope of the broadcast group, also needs to be | |||
| known to the endnode before it can join the broadcast group. | known to the endnode before it can join the broadcast group. | |||
| It is an implementation choice on how the P_Key and the scope | It is an implementation choice on how the P_Key and the scope | |||
| bits related to the IPoIB subnet are determined by the | bits related to the IPoIB subnet are determined by the | |||
| implementation. These could be configuration parameters | implementation. These could be configuration parameters | |||
| initialised by some means by the administrator. | initialized by some means by the administrator. | |||
| The methods employed by an implementation to determine the | The methods employed by an implementation to determine the | |||
| P_Key and scope bits are not specified by IPoIB. | P_Key and scope bits are not specified by IPoIB. | |||
| 4.1 IPoIB VLANs | 4.1 IPoIB VLANs | |||
| The endpoints in an IB subnet must have compatible P_Keys to | The endpoints in an IB subnet must have compatible P_Keys to | |||
| communicate with one another. Thus the administrator when | communicate with one another. Thus the administrator when | |||
| setting up an IP subnet over an IB subnet must ensure that all | setting up an IP subnet over an IB subnet must ensure that all | |||
| the members have compatible P_Keys. An IP subnet can have only | the members have compatible P_Keys. An IP subnet can have only | |||
| skipping to change at page 18, line 19 | skipping to change at page 17, line 44 | |||
| IP multicast on InfiniBand subnets follows the same concepts | IP multicast on InfiniBand subnets follows the same concepts | |||
| and rules as on any other media. However, unlike most other | and rules as on any other media. However, unlike most other | |||
| media multicast over InfiniBand requires interaction with | media multicast over InfiniBand requires interaction with | |||
| another entity, the IB subnet manager. This section describes | another entity, the IB subnet manager. This section describes | |||
| the outline of the process and suggests some guidelines. | the outline of the process and suggests some guidelines. | |||
| IB architecture specifies the following format for IB | IB architecture specifies the following format for IB | |||
| multicast packets when used over unreliable datagram(UD) | multicast packets when used over unreliable datagram(UD) | |||
| mode: | mode: | |||
| +--------+-------+---------+---------+-------+---------+---------+ | +--------+-------+---------+---------+-------+---------+---------+ | |||
| |Local |Global |Base |Datagram |Packet |Invariant| Variant | | |Local |Global |Base |Datagram |Packet |Invariant| Variant | | |||
| |Routing |Routing|Transport|Extended |Payload| CRC | CRC | | |Routing |Routing|Transport|Extended |Payload| CRC | CRC | | |||
| |Header |Header |Header |Transport| (IP) | | | | |Header |Header |Header |Transport| (IP) | | | | |||
| | | | |Header | | | | | | | | |Header | | | | | |||
| +--------+-------+---------+---------+-------+---------+---------+ | +--------+-------+---------+---------+-------+---------+---------+ | |||
| For details about the various headers please refer to | For details about the various headers please refer to | |||
| InfiniBand Architecture Specification[IB_ARCH]. | InfiniBand Architecture Specification[IB_ARCH]. | |||
| The Global routing header (GRH) includes the IB multicast | The Global routing header (GRH) includes the IB multicast | |||
| group GID. The Local routing header (LRH) includes the local | group GID. The Local routing header (LRH) includes the local | |||
| identifier (LID). The IB switches in the fabric route the | identifier (LID). The IB switches in the fabric route the | |||
| packet based on the LID. | packet based on the LID. | |||
| The GID is made available to the receiving IB user (the IPoIB | The GID is made available to the receiving IB user (the IPoIB | |||
| interface driver for example). The driver can therefore | interface driver for example). The driver can therefore | |||
| determine the IB group the packet belongs to. | determine the IB group the packet belongs to. | |||
| IPv4 defines three levels of multicast compliance. These are: | IPv4 defines three levels of multicast compliance. These are: | |||
| Level 0: No support for IP multicasting | Level 0: No support for IP multicasting | |||
| Level 1: Support for sending but not receiving multicasts | Level 1: Support for sending but not receiving multicasts | |||
| Level 2: Full support for IP multicasting | Level 2: Full support for IP multicasting | |||
| In IPv6 there is no such distinction. Full multicast support | In IPv6 there is no such distinction. Full multicast support | |||
| is mandatory. Additionally, all IPv4 subnets support | is mandatory. Additionally, all IPv4 subnets support | |||
| broadcast(255.255.255.255). IPv4 broadcast can always be | broadcast(255.255.255.255). IPv4 broadcast can always be | |||
| sent/received by all IPv4 interfaces. | sent/received by all IPv4 interfaces. | |||
| Every IPoIB subnet requires the broadcast GID to be defined. | Every IPoIB subnet requires the broadcast GID to be defined. | |||
| Thus a packet can always be broadcast. | Thus a packet can always be broadcast. | |||
| 4.2.1 Sending IP Multicast Datagrams | 4.2.1 Sending IP Multicast Datagrams | |||
| An IP host may send a multicast packet at any time to any | An IP host may send a multicast packet at any time to any | |||
| multicast address. | multicast address. | |||
| The IP layer conveys the multicast packet to the IPoIB | The IP layer conveys the multicast packet to the IPoIB | |||
| interface driver/module. This module attempts to IB_join the | interface driver/module. This module attempts to IB_join the | |||
| relevant IB multicast group. This is required since otherwise | relevant IB multicast group. This is required since otherwise | |||
| skipping to change at page 21, line 47 ¶ | skipping to change at page 21, line 21 ¶ | |||
| The encapsulation of IP packets in InfiniBand is described | The encapsulation of IP packets in InfiniBand is described | |||
| in [IPOIB_ENCAP]. | in [IPOIB_ENCAP]. | |||
| It specifies the use of an 'Ethertype' value [IANA] in all | It specifies the use of an 'Ethertype' value [IANA] in all | |||
| IPoIB communication packets. The link-layer address is | IPoIB communication packets. The link-layer address is | |||
| comprised of the Global Identifier(GID) and the Queue Pair | comprised of the Global Identifier(GID) and the Queue Pair | |||
| Number(QPN) [IPOIB_ENCAP]. | Number(QPN) [IPOIB_ENCAP]. | |||
| To allow for multiple IB subnet based IPoIB subnets, the | To allow for multiple IB subnet based IPoIB subnets, the | |||
| specification utilises the Global Identifier(GID) as part of | specification utilizes the Global Identifier(GID) as part of | |||
| the link-layer address. Since all packets in IB have to use | the link-layer address. Since all packets in IB have to use | |||
| the Local Identifier(LID) the address resolution process has | the Local Identifier(LID) the address resolution process has | |||
| the additional step of resolving the destination GID, returned | the additional step of resolving the destination GID, returned | |||
| in response to ARP/ND request, to the LID[IPOIB_ENCAP]. This | in response to ARP/ND request, to the LID[IPOIB_ENCAP]. This | |||
| phase of address resolution might also be used to determine | phase of address resolution might also be used to determine | |||
| other essential parameters (e.g. the SL, path rate etc.) for | other essential parameters (e.g. the SL, path rate etc.) for | |||
| successful IB communication between two peers. | successful IB communication between two peers. | |||
| As noted earlier, all communication in the IPoIB subnet | As noted earlier, all communication in the IPoIB subnet | |||
| derives the Q_Key to use from the Q_Key specified in the | derives the Q_Key to use from the Q_Key specified in the | |||
| skipping to change at page 22, line 25 ¶ | skipping to change at page 21, line 47 ¶ | |||
| link-addresses. In the case of IPoIB, the link-address | link-addresses. In the case of IPoIB, the link-address | |||
| includes the QPN which might not be constant across reboots or | includes the QPN which might not be constant across reboots or | |||
| even across network interface resets. Therefore, static ARP | even across network interface resets. Therefore, static ARP | |||
| entries or RARP server entries will only work if the | entries or RARP server entries will only work if the | |||
| implementation(s) using these options can ensure that the QPN | implementation(s) using these options can ensure that the QPN | |||
| associated with an interface is invariant across | associated with an interface is invariant across | |||
| reboots/network resets[IPOIB_ENCAP]. | reboots/network resets[IPOIB_ENCAP]. | |||
| 4.5 DHCPv4 and IPoIB | 4.5 DHCPv4 and IPoIB | |||
| DHCPv4 [RFC_2131] utilises a 'client identifier' field | DHCPv4 [RFC_2131] utilizes a 'client identifier' field | |||
| (expected to hold the link-layer address) of 16 bytes. The | (expected to hold the link-layer address) of 16 bytes. The | |||
| address in the case of IPoIB is 20 bytes. To get around this | address in the case of IPoIB is 20 bytes. To get around this | |||
| problem IPoIB specifies [IPOIB_DHCP] that the 'broadcast flag' | problem IPoIB specifies [IPOIB_DHCP] that the 'broadcast flag' | |||
| be used by the client when requesting an IP address. | be used by the client when requesting an IP address. | |||
| 5.0 QoS and Related Issues | 5.0 QoS and Related Issues | |||
| The IB specification suggests the use of service levels for | The IB specification suggests the use of service levels for | |||
| load balancing, QoS and deadlock avoidance within an IB | load balancing, QoS and deadlock avoidance within an IB | |||
| subnet. But the IB specification leaves the usage and mode of | subnet. But the IB specification leaves the usage and mode of | |||
| skipping to change at page 22, line 49 ¶ | skipping to change at page 22, line 23 ¶ | |||
| Every IPoIB implementation will determine the relevant SL | Every IPoIB implementation will determine the relevant SL | |||
| value based on its own policy. No method or process for | value based on its own policy. No method or process for | |||
| choosing the SL has been defined by the IPoIB standards. | choosing the SL has been defined by the IPoIB standards. | |||
| 6.0 Security Considerations | 6.0 Security Considerations | |||
| This document describes the IB architecture as relevant to | This document describes the IB architecture as relevant to | |||
| IPoIB. It further restates issues specified in other | IPoIB. It further restates issues specified in other | |||
| documents. It does not itself specify any requirements. There | documents. It does not itself specify any requirements. There | |||
| are no security issues introduced by this document. IPoIB | are no security issues introduces by this document. IPoIB | |||
| related security issues are described in | related security issues are described in [IPOIB_ENCAP] and | |||
| [IPOIB_ENCAP] and [IPOIB_DHCP]. | [IPOIB_DHCP]. | |||
| 7.0 Acknowledgements | 7.0 Acknowledgments | |||
| This document has benefited from the comments and suggestion | This document has benefited from the comments and suggestions | |||
| of the members of the IPoIB working group and the members of | of the members of the IPoIB working group and the members of | |||
| the InfiniBand(SM) Trade Association. | the InfiniBand(SM) Trade Association. | |||
| 8.0 References | 8.0 References | |||
| 8.1 Normative References | ||||
| [IB_ARCH] InfiniBand Architecture Specification, Volume 1.1 | [IB_ARCH] InfiniBand Architecture Specification, Volume 1.1 | |||
| [IPOIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-06.txt | ||||
| [IPOIB_DHCP] draft-ietf-ipoib-dhcp-over-infiniband-05.txt | ||||
| 8.2 Informative References | ||||
| [RFC_2373] IP Version 6 Addressing Architecture | [RFC_2373] IP Version 6 Addressing Architecture | |||
| [RFC_2375] IPv6 Multicast Address Assignments | [RFC_2375] IPv6 Multicast Address Assignments | |||
| [RFC_1700] Assigned Numbers | [RFC_1700] Assigned Numbers | |||
| [RFC_1112] Host extensions for IP multicasting | [RFC_1112] Host extensions for IP multicasting | |||
| [RFC_2236] Internet Group Management Protocol, Version 2 | [RFC_2236] Internet Group Management Protocol, Version 2 | |||
| [RFC_2710] Multicast Listener Discovery | [RFC_2710] Multicast Listener Discovery | |||
| [IPOIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-05.txt | ||||
| [IPOIB_DHCP] draft-ietf-ipoib-dhcp-over-infiniband-05.txt | ||||
| 9.0 Author's Address | 9.0 Author's Address | |||
| Vivek Kashyap | Vivek Kashyap | |||
| IBM | IBM | |||
| 15450, SW Koll Parkway | 15450, SW Koll Parkway | |||
| Beaverton, OR 97006 | Beaverton, OR 97006 | |||
| Phone: +1 503 578 3422 | Phone: +1 503 578 3422 | |||
| Email: vivk@us.ibm.com | Email: vivk@us.ibm.com | |||
| End of changes. 44 change blocks. | ||||
| 94 lines changed or deleted | 88 lines changed or added | |||
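
The encapsulation text in the diff above describes the IPoIB link-layer address as a GID plus a QPN, and the DHCPv4 text gives its length as 20 bytes against a 16-byte field. The following is a minimal Python sketch of that arithmetic. It assumes the layout commonly given for the encapsulation (one reserved octet, a 24-bit QPN, then the 128-bit GID); the function names and constants are hypothetical and are not taken from either draft.

```python
import ipaddress
import struct

LINK_ADDR_LEN = 20   # IPoIB link-layer address length, as stated in the text
DHCP_FIELD_LEN = 16  # DHCPv4 field size, as stated in the text


def pack_link_address(qpn, gid):
    """Build a 20-byte link-layer address from a QPN and a GID.

    Assumed layout: 1 reserved octet (zero), 24-bit QPN, 128-bit GID.
    """
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    # A GID is 128 bits wide and is conventionally written like an IPv6 address.
    gid_bytes = ipaddress.IPv6Address(gid).packed
    # The 32-bit word carries the reserved octet in its top 8 bits, QPN below it.
    return struct.pack("!I", qpn) + gid_bytes


def unpack_link_address(addr):
    """Split a 20-byte link-layer address back into (qpn, gid_string)."""
    if len(addr) != LINK_ADDR_LEN:
        raise ValueError("expected a 20-byte link-layer address")
    (word,) = struct.unpack("!I", addr[:4])
    return word & 0xFFFFFF, str(ipaddress.IPv6Address(addr[4:]))


def fits_dhcp_field(addr):
    """Report whether the address fits in the 16-byte DHCPv4 field."""
    return len(addr) <= DHCP_FIELD_LEN
```

For any valid QPN and GID the packed address is 20 bytes, so `fits_dhcp_field` returns False. This is the size mismatch the DHCPv4 section works around. It also shows why a changed QPN after a reset yields a different link-layer address and invalidates static ARP entries.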