idnits 2.17.1 draft-ietf-ipoib-ip-over-infiniband-09.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3667, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 911. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 922. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 929. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 935. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line 903), which is fine, but *also* found old RFC 2026, Section 10.4C, paragraph 1 text on line 34. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate instead of verbatim RFC 3978 boilerplate. After 6 May 2005, submission of drafts without verbatim RFC 3978 boilerplate is not accepted. The following non-3978 patterns matched text found in the document. That text should be removed or replaced: By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. 
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 19 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 20 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 11 instances of too long lines in the document, the longest one being 1 character in excess of 72. == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the document. == There are 1 instance of lines with non-RFC3849-compliant IPv6 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == The "Author's Address" (or "Authors' Addresses") section title is misspelled. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'IANA' is defined on line 851, but no explicit reference was found in the text == Unused Reference: 'IP6MLD' is defined on line 872, but no explicit reference was found in the text == Unused Reference: 'IPMULT' is defined on line 875, but no explicit reference was found in the text ** Obsolete normative reference: RFC 3513 (ref. 'AARCH') (Obsoleted by RFC 4291) ** Obsolete normative reference: RFC 2461 (ref. 'DISC') (Obsoleted by RFC 4861) -- Possible downref: Non-RFC (?) normative reference: ref. 'IANA' -- Possible downref: Non-RFC (?) normative reference: ref. 'IBTA' -- Obsolete informational reference (is this intentional?): RFC 2460 (ref. 'IPV6') (Obsoleted by RFC 8200) Summary: 9 errors (**), 0 flaws (~~), 10 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 INTERNET DRAFT 3 H.K. Jerry Chu 4 Expiration Date: July, 2005 Sun Microsystems 5 V. Kashyap 6 IBM 7 January, 2005 9 Transmission of IP over InfiniBand 11 Status of this memo 13 By submitting this Internet-Draft, I certify that any applicable 14 patent or other IPR claims of which I am aware have been disclosed, 15 or will be disclosed, and any of which I become aware will be 16 disclosed, in accordance with RFC 3668. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other documents 25 at any time. 
It is inappropriate to use Internet-Drafts as 26 reference material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 Copyright (C) The Internet Society (2001). All Rights Reserved. 36 Abstract 38 This document specifies a method for encapsulating and transmitting 39 IPv4/IPv6 and Address Resolution Protocol (ARP) packets over 40 InfiniBand (IB). It describes the link layer address to be used when 41 resolving the IP addresses in "IP over InfiniBand (IPoIB)" subnets. 42 The document also describes the mapping from IP multicast addresses 43 to InfiniBand multicast addresses. Additionally this document 44 defines the set up and configuration of IPoIB links. 46 Table of Contents 48 1.0 Introduction 49 2.0 IP over UD Mode 50 3.0 InfiniBand Datalink 51 4.0 Multicast Mapping 52 4.1 Broadcast-GID Parameters 53 5.0 Setting Up an IPoIB Link 54 6.0 Frame Format 55 7.0 Maximum Transmission Unit 56 8.0 IPv6 Stateless Autoconfiguration 57 8.1 IPv6 Link Local Address 58 9.0 Address Mapping - Unicast 59 9.1 Link Information 60 9.1.1 Link Layer Address/Hardware Address 61 9.1.2 Auxiliary Link Information 62 9.2 Address Resolution in IPv4 Subnets 63 9.3 Address Resolution in IPv6 Subnets 64 9.4 Cautionary Note on QPN Caching 65 10.0 Sending and Receiving IP Multicast Packets 66 11.0 IP Multicast Routing 67 12.0 New Types of Vulnerability in IB Multicast 68 13.0 Security Considerations 69 14.0 IANA Considerations 70 15.0 Acknowledgments 71 16.0 References 72 17.0 Author's Addresses 74 1.0 Introduction 76 The InfiniBand specification [IBTA] can be found at 77 www.infinibandta.org. The document [IPoIB_ARCH] provides a short 78 overview of InfiniBand architecture (IBA) along with considerations 79 for specifying IP over InfiniBand networks. 
81 IBA defines multiple modes of transport over which IP may be 82 implemented. The unreliable datagram (UD) transport mode best 83 matches the needs of IP and the need for universality as described 84 in [IPoIB_ARCH]. 86 This document specifies IPoIB over IB's UD mode. The implementation 87 of IP subnets over IB's other transport mechanisms is out of scope 88 of this document. 90 This document describes the necessary steps required in order to lay 91 out an IP network on top of an IB network. It describes all the 92 elements of an IPoIB link, how to configure its associated 93 attributes, and how to set up basic broadcast and multicast services 94 for it. 96 It further describes IP address resolution and the encapsulation of 97 IP and ARP packets in InfiniBand frame. 99 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 100 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 101 document are to be interpreted as described in RFC 2119 [RFC2119]. 103 2.0 IP over UD Mode 105 The unreliable datagram mode of communication is supported by all IB 106 elements be they IB routers, Host Channel Adapters (HCAs) or Target 107 Channel Adapters (TCAs). In addition to being the only universal 108 transmission method it supports multicasting, partitioning and a 109 32-bit CRC [IBTA]. Though multicasting support is optional in IB 110 fabrics, IPoIB architecture requires the participating components to 111 support it. 113 All IPoIB implementations MUST support IP over the UD transport mode 114 of IBA. 116 3.0 InfiniBand Datalink 118 An IB subnet is formed by a network of IB nodes interconnected 119 either directly or via IB switches. IB subnets may be connected 120 using IB routers to form a fabric made of multiple IB subnets. Nodes 121 residing in different IB subnets can communicate directly with one 122 another through IB routers at the IB network layer. Multiple IP 123 subnets may be overlaid over this IB network. 
125 An IP subnet is configured over a communication facility or medium 126 over which nodes can communicate at the "link" layer [IPV6]. E.g. an 127 ethernet segment is a link formed by interconnected 128 switches/hubs/bridges. The segment is therefore defined by the 129 physical topology of the network. This is not the case with IPoIB. 130 IPoIB subnets are built over an abstract "link". The link is defined 131 by its members and common characteristics such as the P_Key, link 132 MTU, and the Q_Key. 134 Any two ports using UD communication mode in an IB fabric can 135 communicate only if they are in the same partition i.e. have the 136 same P_Key and the same Q_Key [IPoIB_ARCH]. The link MTU provides a 137 limit to the size of the payload that may be used. The packet 138 transmission and routing within the IB fabric is also affected by 139 additional parameters such as the traffic class (TClass), hop limit 140 (HopLimit), service level (SL) and the flow label (FlowLabel) 141 [IPoIB_ARCH]. The determination and use of these values for IPoIB 142 communication is described in the following sections. 144 4.0 Multicast Mapping 146 IB identifies multicast groups by the multicast Global Identifiers 147 (MGIDs) which follow the same rules as IPv6 multicast addresses. 148 Hence the MGIDs follow the same rules regarding the transient 149 addresses and scope bits albeit in the context of the IB fabric. The 150 resultant address therefore resembles IPv6 multicast addresses. The 151 documents [IBTA, IPoIB_ARCH] give a detailed description of IB 152 multicast. 154 The IPoIB multicast mapping is depicted in figure 1. The same 155 mapping function is used for both IPv4 and IPv6 except for the IPoIB 156 signature field. 158 Unless explicitly stated, all addresses and fields in the protocol 159 headers in this document are stored in the network byte order. 
161   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
162   +--------+----+----+-----------------+---------+-------------------+
163   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
164   +--------+----+----+-----------------+---------+-------------------+
165                                Figure 1

167   Since an MGID allocated for transporting IP multicast datagrams is
168   considered only a transient link-layer multicast address
169   [IPoIB_ARCH], all IB MGIDs allocated for IPoIB purposes MUST set
170   the T-flag to 1 [IBTA].

172   A special signature is embedded to identify the MGID for IPoIB use
173   only. For IPv4 over IB, the signature MUST be "0x401B". For IPv6
174   over IB, the signature MUST be "0x601B".

176   The IP multicast address is used together with a given IPoIB link
177   P_Key to form the MGID of the IB multicast group. For IPv6 the lower
178   80 bits of the group ID are used directly in the lower 80 bits of
179   the MGID. For IPv4, the group ID is only 28 bits long, and is placed
180   directly in the lower 28 bits of the MGID. The rest of the group ID
181   bits in the MGID are filled with 0.

183   E.g. on an IPoIB link that is fully contained within a single IB
184   subnet with a P_Key of 0x8000, the MGIDs for the all-router
185   multicast group with group ID 2 [AARCH, IGMP2] are:

187      FF12:401B:8000::2, for IPv4 in compressed format, and
188      FF12:601B:8000::2, for IPv6 in compressed format.

190   A special case exists for the IPv4 limited broadcast address
191   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
192   "broadcast-GID", which is defined as follows:

194   |   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
195   +--------+----+----+----------------+---------+----------+---------+
196   |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
197   +--------+----+----+----------------+---------+----------+---------+
198                                Figure 2

200   All MGIDs used in the IPoIB subnet MUST use the scop bits used in
201   the broadcast GID.
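The mapping of section 4 can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the specification: the function name ipoib_mgid and its parameters are hypothetical, and the default scope of 0x2 simply follows the local-subnet example given above.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPoIB signature for IPv4 over IB
IPV6_SIGNATURE = 0x601B  # IPoIB signature for IPv6 over IB


def ipoib_mgid(group, p_key, scope=0x2):
    """Return the 16-octet MGID for an IP multicast address.

    The IPv4 limited broadcast address maps to the broadcast-GID.
    (Hypothetical helper; names are assumptions for illustration.)
    """
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")

    addr = ipaddress.ip_address(group)
    if addr.version == 4:
        signature = IPV4_SIGNATURE
        if addr == ipaddress.IPv4Address("255.255.255.255"):
            group_id = 0xFFFFFFFF              # low 32 bits all ones
        elif addr.is_multicast:
            group_id = int(addr) & 0x0FFFFFFF  # 28-bit IPv4 group ID
        else:
            raise ValueError("not an IPv4 multicast or broadcast address")
    else:
        if not addr.is_multicast:
            raise ValueError("not an IPv6 multicast address")
        signature = IPV6_SIGNATURE
        group_id = int(addr) & ((1 << 80) - 1)  # low 80 bits

    flags_scope = 0x10 | scope  # flags nibble 0001: T-flag set
    value = ((0xFF << 120) | (flags_scope << 112) | (signature << 96)
             | (p_key << 80) | group_id)
    return value.to_bytes(16, "big")


if __name__ == "__main__":
    # Print the MGIDs for the all-router groups in IPv6 text notation.
    for group in ("224.0.0.2", "ff02::2"):
        print(ipaddress.IPv6Address(ipoib_mgid(group, 0x8000)))
```

Rendering the result through ipaddress.IPv6Address is only a convenience for printing the MGID in the compressed notation used in the examples above.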
203 4.1 Broadcast-GID Parameters 205 The broadcast-GID is set up with the following attributes: 207 1. P_Key 208 A "Full Membership" P_Key (high-order bit is set to 1) MUST 209 be used so that all members may communicate with one 210 another. 212 2. Q_Key 213 It is RECOMMENDED that a controlled Q_Key be used with the 214 high order bit set. This is to prevent non-privileged 215 software from fabricating and sending out bogus IP 216 datagrams. 218 3. IB MTU 219 The value assigned to the broadcast-GID must not be greater 220 than any physical link MTU spanned by the IPoIB subnet. 222 The following attributes are required in multicast transmissions and 223 also in unicast transmissions if an IPoIB link covers more than a 224 single subnet. 226 4. Other parameters 227 The selection of TClass, FlowLabel, and HopLimit values is 228 implementation dependent. But it must take into account the 229 topology of IB subnets comprising the IPoIB link in order to 230 allow successful communication between any two nodes in the 231 same IPoIB link. 233 An SL also needs to be assigned to the broadcast-GID. This 234 SL is used in all multicast communication in the subnet. 236 The broadcast-GID's scope bits need to be set based on 237 whether the IPoIB link is confined within an IB subnet or 238 the IPoIB link spans multiple IB subnets. A default of 239 local-subnet scope i.e. 0x2 is RECOMMENDED. A node might 240 determine the scope bits to use by interactively searching 241 for a broadcast-GID of ever greater scope by first starting 242 with the local-scope. Or, an implementation might include 243 the scope bits as a configuration parameter. 245 5.0 Setting Up an IPoIB Link 247 The broadcast-GID, as defined in the previous section MUST be set up 248 for an IPoIB subnet to be formed. Every IPoIB interface MUST 249 "FullMember" join the IB multicast group defined by the broadcast- 250 GID. This multicast group will henceforth be referred to as the 251 broadcast group. 
The join operation returns the MTU, the Q_Key and
252   other parameters associated with the broadcast group. The node then
253   associates the parameters received as a result of the join operation
254   with its IPoIB interface. The broadcast group also serves to provide
255   a link-layer broadcast service for protocols like ARP, net-directed,
256   subnet-directed and all-subnets-directed broadcasts in IPv4 over IB
257   networks.

259   The join operation is successful only if the Subnet Manager (SM)
260   determines that the joining node can support the MTU registered with
261   the broadcast group [IPoIB_ARCH], ensuring support for a common link
262   MTU. The SM also ensures that all the nodes joining the
263   broadcast-GID have paths to one another and can therefore send and
264   receive unicast packets. It further ensures that all the nodes do
265   indeed form a multicast tree that allows packets sent from any
266   member to be replicated to every other member. Thus the IPoIB link
267   is formed by the IPoIB nodes joining the broadcast group. There is
268   no physical demarcation of the IPoIB link other than that determined
269   by the broadcast group membership.

271   The P_Key is a configuration parameter that must be known before the
272   broadcast-GID can be formed. For a node to join a partition, one of
273   its ports must be assigned the relevant P_Key by the SM
274   [IPoIB_ARCH].

276   The method of creating the broadcast group and the assignment/choice
277   of its parameters are up to the implementation and/or the
278   administrator of the IPoIB subnet. The broadcast group may be
279   created by the first IPoIB node to be initialized or it can be
280   created administratively before the IPoIB subnet is set up. It is
281   RECOMMENDED that the creation and deletion of the broadcast group be
282   under administrative control.

284   InfiniBand multicast management, which includes the creation,
285   joining and leaving of IB multicast groups by IB nodes, is described
286   in [IPoIB_ARCH].
288   6.0 Frame Format

290   All IP and ARP datagrams transported over InfiniBand are prefixed by
291   a 4-octet encapsulation header as illustrated below.

293    0                   1                   2                   3
294    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
295   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
296   |                               |                               |
297   |         Type                  |       Reserved                |
298   |                               |                               |
299   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
300                               Figure 3

302   The "Reserved" field MUST be set to zero on send and ignored on
303   receive unless specified differently in a future document.

305   The "Type" field SHALL indicate the encapsulated protocol as per the
306   following table.

308            +----------+-------------+
309            | Type     |   Protocol  |
310            |------------------------|
311            | 0x800    |    IPv4     |
312            |------------------------|
313            | 0x806    |    ARP      |
314            |------------------------|
315            | 0x8035   |    RARP     |
316            |------------------------|
317            | 0x86DD   |    IPv6     |
318            +------------------------+
319                     Table 1

321   These values are taken from the "ETHER TYPE" numbers assigned by
322   Internet Assigned Numbers Authority (IANA). Other network protocols,
323   identified by different values of "ETHER TYPE", may use the
324   encapsulation format defined herein but such use is outside of the
325   scope of this document.

327   |<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
328   +-------+------+---------+---------+-------------+---------+-------+
329   |Local  |      |Base     |Datagram |   4-octet   |         |       |
330   |Routing| GRH* |Transport|Extended |   header    |Invariant|Variant|
331   |Header |Header|Header   |Transport|      +      |  CRC    |  CRC  |
332   |       |      |         |Header   |   IP/ARP    |         |       |
333   +-------+------+---------+---------+-------------+---------+-------+
334                               Figure 4

336   Figure 4 depicts the IB frame encapsulating an IP/ARP datagram. The
337   InfiniBand specification requires the use of Global Routing Header
338   (GRH) [IPoIB_ARCH] when multicasting or when an InfiniBand packet
339   traverses from one IB subnet to another through an IB router.
Its 340 use is optional when used for unicast transmission between nodes 341 within an IB subnet. The IPoIB implementation MUST be able to handle 342 packets received with or without the use of GRH. 344 7.0 Maximum Transmission Unit 346 IB MTU: 347 The IB components i.e. IB links, switches, Channel Adapters 348 (CAs), and IB routers, may support maximum payloads of : 256, 349 512, 1024, 2048 or 4096 octets. The maximum IB payload supported 350 by the IB components in any IB path is the IB MTU for the path. 352 IPoIB-Link MTU: 353 The IPoIB-link MTU is the MTU value associated with the 354 broadcast group. The IPoIB-link MTU can be set to any value up 355 to the smallest IB MTU supported by the IB components comprising 356 the IPoIB link. 358 In order to reduce problems with fragmentation and path-MTU 359 discovery, this document requires that all IPoIB implementations 360 support an MTU of 2044 octets i.e. a 2048 octet IPoIB-link MTU minus 361 the 4 octet encapsulation overhead. Larger and smaller MTUs MAY be 362 supported subject to other existing MTU requirements [IPV6], but the 363 default configuration must support an MTU of 2044 octets. 365 8.0 IPv6 Stateless Autoconfiguration 367 IB architecture associates an EUI-64 identifier termed the GUID 368 (Globally Unique Identifier) [IPoIB_ARCH, IBTA] with each port. The 369 Local Identifier (LID) is unique within an IB subnet only. 371 The interface identifier may be chosen from: 373 1) The EUI-64 compliant GUID assigned by the manufacturer. 375 2) If the IPoIB subnet is fully contained within an IB subnet 376 any of the unique 16-bit LIDs of the port associated with the 377 IPoIB interface. 379 The LID values of a port may change after a reboot/power-cycle 380 of the IB node. Therefore, if a persistent value is desired, it 381 would be prudent to not use the LID to form the interface 382 identifier. 
384      On the other hand, the LID provides an identifier that can be
385      used to create a more anonymous IPv6 address since the LID is
386      not globally unique and is subject to change over time.

388   It is RECOMMENDED that the link-local address be constructed from
389   the port's EUI-64 identifier as given below.

391   [AARCH] requires the interface identifier be created in the
392   "Modified EUI-64" format when derived from an EUI-64 identifier.
393   [IBTA] is unclear whether the GUID should use IEEE EUI-64 format or
394   the "Modified EUI-64" format. Therefore, when creating an interface
395   identifier from the GUID an implementation MUST do the following:

397      => Determine if the GUID is a modified EUI-64 identifier ("u"
398         bit is toggled) as defined by [AARCH]

400      => If the GUID is a modified EUI-64 identifier then the "u" bit
401         MUST NOT be toggled when creating the interface identifier

403      => If the GUID is an unmodified EUI-64 identifier then the "u"
404         bit MUST be toggled in compliance with [AARCH]

406   8.1 IPv6 Link Local Address

408   The IPv6 link local address for an IPoIB interface is formed as
409   described in [AARCH] using the Interface Identifier as described in
410   the previous section.

412   9.0 Address Mapping - Unicast

414   Address resolution in IPv4 subnets is accomplished through the
415   Address Resolution Protocol (ARP) [ARP]. It is accomplished in IPv6
416   subnets using the Neighbor Discovery Protocol [DISC].

418   9.1 Link Information

420   An InfiniBand packet over the UD mode includes multiple headers such
421   as the LRH (local route header), GRH (global route header), BTH
422   (base transport header), and DETH (datagram extended transport
423   header) as depicted in figure 4 and specified in the InfiniBand
424   architecture [IBTA]. All these headers comprise the link-layer in an
425   IPoIB link.

426   The parameters needed in these IBA headers constitute the link-layer
427   information that needs to be determined before an IP packet may be
428   transmitted across the IPoIB link.
430 The parameters that need to be determined are: 432 a) LID 434 The LID is always needed. A packet always includes the LRH 435 that is targeted at the remote node's LID, or an IB router's 436 LID to get to the remote node in another IB subnet. 438 b) Global Identifier (GID) 440 The GID is not needed when exchanging information within an 441 IB subnet though it may be included in any packet. It is an 442 absolute necessity when transmitting across the IB subnet 443 since the IB routers use the GID to correctly forward the 444 packets. The source and destination GIDs are fields included 445 in the GRH. 447 The GID, if formed using the GUID, can be used to 448 unambiguously identify an endpoint. 450 c) Queue Pair Number (QPN) 452 Every unicast UD communication is always directed to a 453 particular queue pair (QP) at the peer. 455 d) Q_Key 457 A Q_Key is associated with each unreliable datagram QPN. The 458 received packets must contain a Q_Key that matches the QP's 459 Q_Key to be accepted. 461 e) P_Key 463 A successful communication between two IB nodes using UD 464 mode can occur only if the two nodes have compatible P_Keys. 465 This is referred to as being in the same partition [IBTA]. 467 f) SL 469 Every IBA packet contains an SL value. A path in IBA is 470 defined by the three-tuple (source LID, destination LID, 471 SL). The SL in turns is mapped to a virtual lane (VL) at 472 every CA, switch that sends/forwards the packet 473 [IPoIB_ARCH]. Multiple SLs may be used between two endpoints 474 to provide for load-balancing, SLs may be used for providing 475 a QoS infrastructure, or may be used to avoid deadlocks in 476 the IBA fabric. 478 Another auxiliary piece of information, not included in the IBA 479 headers, is : 481 g) Path rate 483 IBA defines multiple link speeds. A higher speed transmitter 484 can swamp switches and the CAs. 
To avoid such congestion
485      every source transmitting at greater than 1x speeds is
486      required to determine the "path rate" before the data may be
487      transmitted [IBTA].

489   9.1.1 Link Layer Address/Hardware Address

491   Though the list of information required for a successful transmittal
492   of an IPoIB packet is large, not all the information need be
493   determined during the IP address resolution process.

495   The 20-octet IPoIB link-layer address used in the source/target
496   link-layer address option in IPv6 and the "hardware address" in
497   IPv4/ARP has the same format.

499   The format is as described below:

501    0                   1                   2                   3
502    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
503   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
504   |   Reserved    |              Queue Pair Number                |
505   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
506   |                                                               |
507   +                                                               +
508   |                                                               |
509   +                             GID                               +
510   |                                                               |
511   +                                                               +
512   |                                                               |
513   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
514                               Figure 5

516   a) Reserved Flags

518      These 8 bits are reserved for future use. These bits MUST be
519      set to zero on send and ignored on receive unless specified
520      differently in a future document.

522   b) QPN

524      Every unicast communication in IB architecture is directed
525      to a specific QP [IBTA]. This QP number is included in the
526      link description. All IP communication to the relevant IPoIB
527      interface MUST be directed to this QPN. In the case of IPv4
528      subnets the address resolution protocol (ARP) reply packets
529      are also directed to the same QPN.

531      The choice of the QPN for IP/ARP communication is up to the
532      implementation.

534   c) GID

536      This is one of the GIDs of the port associated with the
537      IPoIB interface [IBTA]. IB associates multiple GIDs with a
538      port. It is RECOMMENDED that the GID formed by the
539      combination of the IB subnet prefix and the port's "Port
540      GUID" [IBTA] be included in the link-layer/hardware address.
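As an illustration of the 20-octet layout in figure 5, the following Python sketch packs and unpacks the link-layer address. The function names are hypothetical and chosen only for this example. The sketch assumes the QPN is a 24-bit value, as implied by the 8-bit reserved field sharing a 32-bit word with it.

```python
import struct


def pack_ipoib_lladdr(qpn, gid):
    """Build the 20-octet IPoIB link-layer address (hypothetical helper).

    Layout: 8 reserved bits (zero on send), 24-bit QPN, 16-octet GID,
    all in network byte order.
    """
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be exactly 16 octets")
    # Packing the QPN as a 32-bit word leaves the reserved octet zero.
    return struct.pack("!I", qpn) + bytes(gid)


def unpack_ipoib_lladdr(addr):
    """Split a 20-octet link-layer address into (qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("IPoIB link-layer address must be 20 octets")
    (word,) = struct.unpack("!I", bytes(addr[:4]))
    # The reserved octet is ignored on receive by masking it off.
    return word & 0xFFFFFF, bytes(addr[4:])
```

Masking rather than rejecting a non-zero reserved octet mirrors the "ignored on receive" rule for the reserved bits.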
542 9.1.2 Auxiliary Link Information 544 The rest of the parameters are determined as follows: 546 a) LID 548 The method of determining the peer's LID is not defined in 549 this document. It is up to the implementation to use any of 550 the IBA approved methods to determine the destination LID. 551 One such method is to use the GID determined during the 552 address resolution, to retrieve the associated LID from the 553 IB routing infrastructure or the Subnet Administrator (SA). 555 It is the responsibility of the administrator to ensure that 556 the IB subnet(s) have unicast connectivity between the IPoIB 557 nodes. The GID exchanged between two endpoints in a 558 multicast message (ARP/ND) does not guarantee the existence 559 of a unicast path between the two. 561 There may be multiple LIDs, and hence paths, between the 562 endpoints. The criteria for selection of the LIDs are beyond 563 the scope of this document. 565 b) Q_Key 567 The Q_Key received on joining the broadcast group MUST be 568 used for all IPoIB communication over the particular IPoIB 569 link. 571 c) P_Key 573 The P_Key to be used in the IP subnet is not discovered but 574 is a configuration parameter. 576 d) SL 578 The method of determining the SL is not defined in this 579 document. The SL is determined by any of the IBA approved 580 methods. 582 e) Path rate 584 The implementation must leverage IB methods to determine the 585 path rate as required. 587 9.2 Address Resolution in IPv4 Subnets 589 The ARP packet header is as defined in [ARP]. The hardware type is 590 set to 32 (decimal) as specified by IANA. The rest of the fields are 591 used as per [ARP]. 593 16 bits: hardware type 594 16 bits: protocol 595 8 bits: length of hardware address 596 8 bits: length of protocol address 597 16 bits: ARP operation 599 The remaining fields in the packet hold the sender/target hardware 600 and protocol addresses. 
602 [ sender hardware address ] 603 [ sender protocol address ] 604 [ target hardware address ] 605 [ target protocol address ] 607 The hardware address included in the ARP packet will be as specified 608 in section 9.1.1 and depicted in figure 5. 610 The length of the hardware address used in ARP packet header 611 therefore is 20. 613 9.3 Address Resolution in IPv6 Subnets 615 The Source/Target Link-layer address option is used in Router 616 Solicit, Router advertisements, Redirect, Neighbor Solicitation and 617 Neighbor Advertisement messages when such messages are transmitted 618 on InfiniBand networks. 620 The source/target address option is specified as follows: 622 Type: 623 Source Link-layer address 1 624 Target Link-layer address 2 626 Length: 3 628 Link-layer address: 630 The link-layer address is as specified in section 9.1.1 and 631 depicted in figure 5. 633 [DISC] specifies the length of source/target option in 634 number of 8-octets as indicated by a length of '3' above. 635 Since the IPoIB link-layer address is only 20-octet long, 636 two octets of zero MUST be prepended to fill the total 637 option length of 24 octets. 639 9.4 Cautionary Note on QPN Caching 641 The link-layer address for IPoIB includes the QPN which might not be 642 constant across reboots or even across network interface resets. 643 Cached QPN entries, such as in static ARP entries or in RARP servers 644 will only work if the implementation(s) using these options ensure 645 that the QPN associated with an interface is invariant across 646 reboots/network resets. 648 It is RECOMMENDED that implementations revalidate ARP caches 649 periodically due to the aforementioned QPN induced volatility of 650 IPoIB link-layer addresses. 652 10.0 Sending and Receiving IP Multicast Packets 654 Multicast in InfiniBand differs in a number of ways from multicast 655 in Ethernet. This adds some complexity to an IPoIB implementation 656 when supporting IP multicast over IB. 
658 A) An IB multicast group must be explicitly created through the SA 659 before it can be used. 661 This implies that in order to send a packet destined for an IP 662 multicast address, the IPoIB implementation must check with the SA 663 on the outbound link first for a "MCMemberRecord" that matches the 664 MGID. If one does exist, the MLID associated with the multicast 665 group is used as the DLID for the packet. Otherwise, it implies no 666 member exists on the local link. If the scope of the IP multicast 667 group is beyond link-local, the packet must be sent to the on-link 668 routers through the use of the all-router multicast group or the 669 broadcast group. This is to allow local routers to forward the 670 packet to multicast listeners on remote networks. The all-router 671 multicast group is preferred over the broadcast group for better 672 efficiency. If the all-router multicast group does not exist, the 673 sender can assume that there are no routers on the local link; hence 674 the packet can be safely dropped. 676 B) A multicast sender must join the target multicast group before 677 outgoing multicast messages from it can be successfully routed. The 678 "SendOnlyNonMember" join is different from the regular "FullMember" 679 join in two aspects. First, both types of joins enable multicast 680 packets to be routed FROM the local port, but only the "FullMember" 681 join causes multicast packets to be routed TO the port. Second, the 682 sender port of a "SendOnlyNonMember" join will not be counted as a 683 member of the multicast group for purposes of group creation and 684 deletion. 686 The following code snippet demonstrates the steps in a typical 687 implementation when processing an egress multicast packet. 
689 if the egress port is already a "SendOnlyNonMember", or a 690 "FullMember" 691 => send the packet 693 else if the target multicast group exists 694 => do "SendOnlyNonMember" join 695 => send the packet 697 else if scope > link-local AND the all-router multicast group exists 698 => send the packet to all routers 700 else 701 => drop the packet 703 Implementations should cache the information about the existence of 704 an IB multicast group, its MLID and other attributes. This is to 705 avoid expensive SA calls on every outgoing multicast packet. 706 Senders MUST subscribe to the multicast group create and delete 707 traps in order to monitor the status of specific IB multicast 708 groups. E.g., multicast packets directed to the all-router multicast 709 group due to a lack of listeners on the local subnet must be 710 forwarded to the right multicast group if the group is created 711 later. This happens when a listener shows up on the local subnet. 713 A node joining an IP multicast group must first construct an MGID 714 according to the rule described in section 4 above. Once the correct 715 MGID is calculated, the node must call the SA of the outbound link 716 to attempt a "FullMember" join of the IB multicast group 717 corresponding to the MGID. If the IB multicast group doesn't already 718 exist, one must be created first with the IPoIB link MTU. The MGID 719 MUST use the same P_Key, Q_Key, SL, MTU and HopLimit as those used 720 in the broadcast-GID. For the remaining attributes, the values 721 used in the broadcast-GID SHOULD also be used. 723 The join request will cause the local port to be added to the 724 multicast group. It also enables the SM to program IB switches and 725 routers with the new multicast information to ensure the correct 726 forwarding of multicast packets for the group. 728 When a node leaves an IP multicast group, it SHOULD make a 729 "FullMember" leave request to the SA.
This gives the SM an opportunity 730 to update relevant forwarding information, to delete an IB multicast 731 group if the local port is the last FullMember to leave, and to free up 732 the MLID allocated for it. The specific algorithm is 733 implementation-dependent, and is out of the scope of this document. 735 Note that for an IPoIB link that spans more than one IB subnet 736 connected by IB routers, adequate multicast forwarding support at 737 the IB level is required for multicast packets to reach listeners on 738 a remote IB subnet. The specific mechanism for this is beyond the 739 scope of IPoIB. 741 11.0 IP Multicast Routing 743 IP multicast routing requires each interface over which the router 744 is operating to be configured to listen to all link-layer multicast 745 addresses generated by IP. For an Ethernet interface this is often 746 achieved by turning on the promiscuous multicast mode on the 747 interface. 749 IBA does not provide any hardware support for promiscuous multicast 750 mode. Fortunately, a promiscuous multicast mode can be emulated in 751 the software running on a router through the following steps. 753 A) Obtain a list of all active IB multicast groups from the local 754 SA. 756 B) Make a "NonMember" join request to the SA for every group that 757 has a signature in its MGID matching the one for either IPv4 or 758 IPv6. 760 C) Subscribe to the IB multicast group creation events using a 761 wildcarded MGID so that the router can "NonMember" join all IB 762 multicast groups created subsequently for IPv4 or IPv6. 764 The "NonMember" join has the same effect as a "FullMember" join 765 except that the former will not be counted as a member of the 766 multicast group for purposes of group creation or deletion. That is, 767 when the last "FullMember" leaves a multicast group, the group can 768 be safely deleted by the SA without regard to any "NonMember" 769 routers.
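Steps A) through C) above can be illustrated with a minimal sketch. It is not tied to any real SA interface: the FakeSA class and its methods (list_groups, join_non_member, subscribe_group_created, create_group) are hypothetical stand-ins invented for this example. The sketch also assumes the MGID layout from section 4, with the 16-bit signature in octets 2 and 3 and the values 0x401B for IPv4 and 0x601B for IPv6.

```python
from dataclasses import dataclass, field

# Assumed from the MGID layout in section 4: octet 0 is 0xFF and
# octets 2-3 carry the signature that marks the group as IPv4 or IPv6.
IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B


def is_ip_mgid(mgid: bytes) -> bool:
    """Return True if a 16-octet MGID carries the IPv4 or IPv6 signature."""
    if len(mgid) != 16 or mgid[0] != 0xFF:
        return False
    signature = int.from_bytes(mgid[2:4], "big")
    return signature in (IPV4_SIGNATURE, IPV6_SIGNATURE)


@dataclass
class FakeSA:
    """Hypothetical in-memory stand-in for the SA; not a real API."""
    groups: list = field(default_factory=list)
    non_member_joins: list = field(default_factory=list)
    listeners: list = field(default_factory=list)

    def list_groups(self):
        return list(self.groups)

    def join_non_member(self, mgid):
        self.non_member_joins.append(mgid)

    def subscribe_group_created(self, callback):
        self.listeners.append(callback)

    def create_group(self, mgid):
        # Simulates a group creation event reaching all subscribers.
        self.groups.append(mgid)
        for callback in list(self.listeners):
            callback(mgid)


def emulate_promiscuous_multicast(sa):
    """Follow steps A, B and C; return the set of MGIDs joined so far."""
    joined = set()

    def join_if_ip(mgid):
        # Join only IPv4/IPv6 groups, and each group at most once.
        if is_ip_mgid(mgid) and mgid not in joined:
            sa.join_non_member(mgid)
            joined.add(mgid)

    for mgid in sa.list_groups():            # step A: list active groups
        join_if_ip(mgid)                     # step B: "NonMember" join
    sa.subscribe_group_created(join_if_ip)   # step C: track new groups
    return joined
```

A real implementation would also have to handle join failures (see section 12) and groups created between steps A) and C); the sketch ignores both.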
771 12.0 New Types of Vulnerability in IB Multicast 773 Many IB multicast functions are subject to failures due to a number 774 of possible resource constraints. These include the creation of IB 775 multicast groups, the join calls ("SendOnlyNonMember", "FullMember", 776 and "NonMember"), and the attaching of a QP to a multicast group. 778 In general, the occurrence of these failure conditions is highly 779 implementation-dependent, and is believed to be rare. Usually a 780 failed multicast operation at the IB level can be propagated back to 781 the IP level, causing the original operation to fail, and the 782 initiator of the operation to be notified. But some IB multicast 783 functions are not tied to any foreground operation, making their 784 failures hard to detect. E.g., if an IP multicast router attempts to 785 "NonMember" join a newly created multicast group in the local 786 subnet, but the join call fails, packet forwarding for that 787 particular multicast group will likely fail silently, that is, 788 without the attention of local multicast senders. This type of 789 problem can add more vulnerability to the already unreliable IP 790 multicast operations. 792 Implementations SHOULD log error messages upon any failure from an 793 IB multicast operation. Network administrators should be aware of 794 this vulnerability, and preserve enough multicast resources at the 795 points where IP multicast will be used heavily. E.g., HCAs with 796 ample multicast resources should be used at any IP multicast router. 798 13.0 Security Considerations 800 This document specifies IP transmission over a multicast network. 801 Any network of this kind is vulnerable to a sender claiming 802 another's identity and forging traffic or eavesdropping. It is the 803 responsibility of the higher layers or applications to implement 804 suitable counter-measures if this is a problem.
806 Successful transmission of IP packets depends on the correct set up 807 of the IPoIB link, creation of the broadcast GID, creation of the 808 QP and its attachment to the broadcast-GID, and the correct 809 determination of various link parameters such as the LID, service 810 level, path rate, etc. These operations, many of which involve 811 interactions with the SM/SA, MUST be protected by the underlying 812 operating system. This is to prevent malicious, non-privileged 813 software from hijacking important resources and configurations. 815 Controlled Q_Keys SHOULD be used in all transmissions. This is to 816 prevent non-privileged software from fabricating IP datagrams. 818 14.0 IANA Considerations 820 To support ARP over InfiniBand, a value for the Address Resolution 821 Parameter "Number Hardware Type (hrd)" is required. IANA has 822 assigned the number "32" to indicate InfiniBand [IANA_ARP]. 824 Proposed uses of the reserved bits in the frame format (Figure 3) and 825 link layer address (Figure 5) MUST be published as RFCs. This 826 document requires that the reserved bits be set to zero on send and 827 ignored on receive. 829 15.0 Acknowledgments 831 The authors would like to thank Bruce Beukema, David Brean, Dan 832 Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten, 833 Erik Nordmark, Greg Pfister, Jim Pinkerton, Renato Recio, Kevin 834 Reilly, Kanoj Sarcar, Satya Sharma, Madhu Talluri, and David L. 835 Stevens for their suggestions and many clarifications on the IBA 836 specification. 838 16.0 References 840 16.1 Normative 842 [AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing 843 Architecture", RFC 3513. 845 [ARP] Plummer D.C., "Ethernet Address Resolution Protocol", 846 RFC 826. 848 [DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor 849 Discovery for IP Version 6 (IPv6)", RFC 2461.
851 [IANA] Internet Assigned Numbers Authority, www.iana.org 853 [IANA_ARP] www.iana.org/assignments/arp-parameters 855 [IBTA] InfiniBand Architecture Specification, 856 www.infinibandta.org/specs 858 [IPoIB_ARCH] Kashyap V., "IP over InfiniBand (IPoIB) Architecture", 859 draft-ietf-ipoib-architecture-04.txt. 861 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 862 Requirement Levels", RFC 2119. 864 16.2 Informative 866 [HOSTS] Braden R., "Requirements for Internet Hosts -- 867 Communication Layers", RFC 1122. 869 [IGMP2] Fenner W., "Internet Group Management Protocol, 870 Version 2", RFC 2236. 872 [IP6MLD] Deering S., Fenner W., Haberman B., "Multicast 873 Listener Discovery (MLD) for IPv6", RFC 2710. 875 [IPMULT] Deering S., "Host Extensions for IP Multicasting", 876 RFC 1112. 878 [IPV6] Deering, S. and R. Hinden, "Internet Protocol, 879 Version 6 (IPv6) Specification", RFC 2460. 881 17.0 Authors' Addresses 883 H.K. Jerry Chu 885 17 Network Circle, UMPK17-201 886 Menlo Park, CA 94025 887 USA 888 Phone: +1 650 786 5146 889 Email: jerry.chu@sun.com 891 Vivek Kashyap 893 15350, SW Koll Parkway 894 Beaverton, OR 97006 895 USA 896 Phone: +1 503 578 3422 897 Email: vivk@us.ibm.com 899 Full Copyright Statement 901 Copyright (C) The Internet Society (2004). This document is subject 902 to the rights, licenses and restrictions contained in BCP 78, and 903 except as set forth therein, the authors retain all their rights. 905 This document and the information contained herein are provided on 906 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 907 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE 908 INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR 909 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 910 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 911 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
913 Intellectual Property Statement 915 The IETF takes no position regarding the validity or scope of any 916 Intellectual Property Rights or other rights that might be claimed 917 to pertain to the implementation or use of the technology described 918 in this document or the extent to which any license under such 919 rights might or might not be available; nor does it represent that 920 it has made any independent effort to identify any such rights. 921 Information on the procedures with respect to rights in RFC 922 documents can be found in BCP 78 and BCP 79. 924 Copies of IPR disclosures made to the IETF Secretariat and any 925 assurances of licenses to be made available, or the result of an 926 attempt made to obtain a general license or permission for the use 927 of such proprietary rights by implementers or users of this 928 specification can be obtained from the IETF on-line IPR repository 929 at http://www.ietf.org/ipr. 931 The IETF invites any interested party to bring to its attention any 932 copyrights, patents or patent applications, or other proprietary 933 rights that may cover technology that may be required to implement 934 this standard. Please address the information to the IETF at ietf- 935 ipr@ietf.org. 937 Acknowledgment 939 Funding for the RFC Editor function is currently provided by the 940 Internet Society.