INTERNET DRAFT
                                                         H.K. Jerry Chu
Expiration Date: February, 2005                        Sun Microsystems
                                                             V. Kashyap
                                                                    IBM
                                                           August, 2004

                   Transmission of IP over InfiniBand

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

   This document specifies a method for encapsulating and transmitting
   IPv4/IPv6 and Address Resolution Protocol (ARP) packets over
   InfiniBand (IB). It describes the link layer address to be used when
   resolving the IP addresses in "IP over InfiniBand (IPoIB)" subnets.
   The document also describes the mapping from IP multicast addresses
   to InfiniBand multicast addresses. Additionally this document
   defines the setup and configuration of IPoIB links.

Table of Contents

   1.0  Introduction
   2.0  IP over UD Mode
   3.0  InfiniBand Datalink
   4.0  Multicast Mapping
     4.1  Broadcast-GID Parameters
   5.0  Setting Up an IPoIB Link
   6.0  Frame Format
   7.0  Maximum Transmission Unit
   8.0  IPv6 Stateless Autoconfiguration
     8.1  IPv6 Link Local Address
   9.0  Address Mapping - Unicast
     9.1  Link Information
       9.1.1  Link Layer Address/Hardware Address
       9.1.2  Auxiliary Link Information
     9.2  Address Resolution in IPv4 Subnets
     9.3  Address Resolution in IPv6 Subnets
     9.4  Cautionary Note on QPN Caching
   10.0 Sending and Receiving IP Multicast Packets
   11.0 IP Multicast Routing
   12.0 New Types of Vulnerability in IB Multicast
   13.0 Security Considerations
   14.0 IANA Considerations
   15.0 Acknowledgments
   16.0 References
   17.0 Author's Addresses

1.0 Introduction

   The InfiniBand specification [IBTA] can be found at
   www.infinibandta.org.
   The document [IPoIB_ARCH] provides a short overview of
   InfiniBand architecture (IBA) along with considerations for
   specifying IP over InfiniBand networks.

   IBA defines multiple modes of transport over which IP may be
   implemented. The unreliable datagram (UD) transport mode best
   matches the needs of IP and the need for universality as described
   in [IPoIB_ARCH].

   This document specifies IPoIB over IB's UD mode. The implementation
   of IP subnets over IB's other transport mechanisms is out of scope
   of this document.

   This document describes the steps required to lay out an IP network
   on top of an IB network. It describes all the elements of an IPoIB
   link, how to configure its associated attributes, and how to set up
   basic broadcast and multicast services for it.

   It further describes IP address resolution and the encapsulation of
   IP and ARP packets in InfiniBand frames.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.0 IP over UD Mode

   The unreliable datagram mode of communication is supported by all IB
   elements, be they IB routers, Host Channel Adapters (HCAs) or Target
   Channel Adapters (TCAs). In addition to being the only universal
   transmission method, it supports multicasting, partitioning and a
   32-bit CRC [IBTA]. Though multicasting support is optional in IB
   fabrics, the IPoIB architecture requires the participating
   components to support it.

   All IPoIB implementations MUST support IP over the UD transport mode
   of IBA.

3.0 InfiniBand Datalink

   An IB subnet is formed by a network of IB nodes interconnected
   either directly or via IB switches. IB subnets may be connected
   using IB routers to form a fabric made of multiple IB subnets.
   Nodes residing in different IB subnets can communicate directly with
   one another through IB routers at the IB network layer. Multiple IP
   subnets may be overlaid over this IB network.

   An IP subnet is configured over a communication facility or medium
   over which nodes can communicate at the "link" layer [IPV6]. For
   example, an Ethernet segment is a link formed by interconnected
   switches/hubs/bridges. The segment is therefore defined by the
   physical topology of the network. This is not the case with IPoIB.
   IPoIB subnets are built over an abstract "link". The link is defined
   by its members and common characteristics such as the P_Key, link
   MTU, and the Q_Key.

   Any two ports using UD communication mode in an IB fabric can
   communicate only if they are in the same partition, i.e., have the
   same P_Key and the same Q_Key [IPoIB_ARCH]. The link MTU provides a
   limit to the size of the payload that may be used. The packet
   transmission and routing within the IB fabric is also affected by
   additional parameters such as the traffic class (TClass), hop limit
   (HopLimit), service level (SL) and the flow label (FlowLabel). The
   determination and use of these values for IPoIB communication is
   described in the following sections.

4.0 Multicast Mapping

   IB identifies multicast groups by the multicast Global Identifiers
   (MGIDs) which follow the same rules as IPv6 multicast addresses.
   Hence the MGIDs follow the same rules regarding the transient
   addresses and scope bits, albeit in the context of the IB fabric.
   The resultant address therefore resembles an IPv6 multicast address.
   The documents [IBTA, IPoIB_ARCH] give a detailed description of IB
   multicast.

   The IPoIB multicast mapping is depicted in figure 1. The same
   mapping function is used for both IPv4 and IPv6 except for the IPoIB
   signature field.
   Unless explicitly stated, all addresses and fields in the protocol
   headers in this document are stored in the network byte order.

   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
   +--------+----+----+-----------------+---------+-------------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
   +--------+----+----+-----------------+---------+-------------------+
                                 Figure 1

   Since an MGID allocated for transporting IP multicast datagrams is
   considered only a transient link-layer multicast address
   [IPoIB_ARCH], all IB MGIDs allocated for IPoIB purpose MUST set the
   T-flag to 1 [IBTA].

   A special signature is embedded to identify the MGID for IPoIB use
   only. For IPv4 over IB, the signature MUST be "0x401B". For IPv6
   over IB, the signature MUST be "0x601B".

   The IP multicast address is used together with a given IPoIB link
   P_Key to form the MGID of the IB multicast group. For IPv6 the lower
   80 bits of the group ID are used directly in the lower 80 bits of
   the MGID. For IPv4, the group ID is only 28 bits long and the rest
   of the bits are filled with 0.

   For example, on an IPoIB link that is fully contained within a
   single IB subnet with a P_Key of 0x8000, the MGIDs for the
   all-router multicast group with group ID 2 [AARCH, IGMP2] are:

      FF12:401B:8000::2, for IPv4 in compressed format, and
      FF12:601B:8000::2, for IPv6 in compressed format.

   A special case exists for the IPv4 limited broadcast address
   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
   "broadcast-GID", which is defined as follows:

   |   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
   +--------+----+----+----------------+---------+----------+---------+
   |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
   +--------+----+----+----------------+---------+----------+---------+
                                 Figure 2

   All MGIDs used in the IPoIB subnet MUST use the scop bits used in
   the broadcast GID.
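
   The mapping of figures 1 and 2 can be sketched in a few lines of
   Python. This is an illustrative sketch only, not part of the
   specification: the function name ipoib_mgid and its parameters are
   hypothetical, and the default scope of 0x2 simply mirrors the
   worked example above.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPoIB signature for IPv4 over IB
IPV6_SIGNATURE = 0x601B  # IPoIB signature for IPv6 over IB


def ipoib_mgid(ip_address: str, p_key: int, scope: int = 0x2) -> ipaddress.IPv6Address:
    """Map an IP multicast or IPv4 limited broadcast address to an MGID.

    Hypothetical helper; the MGID is returned as an IPv6Address only
    because MGIDs share the IPv6 textual format.
    """
    if not 0 <= p_key <= 0xFFFF or not 0 <= scope <= 0xF:
        raise ValueError("p_key must fit in 16 bits and scope in 4 bits")
    addr = ipaddress.ip_address(ip_address)
    if addr.version == 4:
        signature = IPV4_SIGNATURE
        if int(addr) == 0xFFFFFFFF:
            group_id = 0xFFFFFFFF  # broadcast-GID: low 32 bits all ones
        elif addr.is_multicast:
            group_id = int(addr) & 0x0FFFFFFF  # 28-bit IPv4 group ID
        else:
            raise ValueError("not an IPv4 multicast or limited broadcast address")
    else:
        if not addr.is_multicast:
            raise ValueError("not an IPv6 multicast address")
        signature = IPV6_SIGNATURE
        group_id = int(addr) & ((1 << 80) - 1)  # lower 80 bits of the group ID
    value = (
        (0xFF << 120)        # 8 bits: 11111111
        | (0x1 << 116)       # 4 bits: flags with the T-flag set
        | (scope << 112)     # 4 bits: scop
        | (signature << 96)  # 16 bits: IPoIB signature
        | (p_key << 80)      # 16 bits: P_Key
        | group_id           # 80 bits: group ID
    )
    return ipaddress.IPv6Address(value)
```

   Called with "224.0.0.2" or "ff02::2" and a P_Key of 0x8000, the
   sketch reproduces the two all-router MGIDs listed above.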

4.1 Broadcast-GID Parameters

   The broadcast-GID is set up with the following attributes:

   1. P_Key

      A "Full Membership" P_Key (high-order bit is set to 1) MUST be
      used so that all members may communicate with one another.

   2. Q_Key

      It is RECOMMENDED that a controlled Q_Key be used with the high
      order bit set. This is to prevent non-privileged software from
      fabricating and sending out bogus IP datagrams.

   3. IB MTU

      The value assigned to the broadcast-GID must not be greater than
      any physical link MTU spanned by the IPoIB subnet.

   The following attributes are required in multicast transmissions and
   also in unicast transmissions if an IPoIB link covers more than a
   single subnet.

   4. Other parameters

      The selection of TClass, FlowLabel, and HopLimit values is
      implementation dependent. But it must take into account the
      topology of IB subnets comprising the IPoIB link in order to
      allow successful communication between any two nodes in the same
      IPoIB link.

      An SL also needs to be assigned to the broadcast-GID. This SL is
      used in all multicast communication in the subnet.

      The broadcast-GID's scope bits need to be set based on whether
      the IPoIB link is confined within an IB subnet or the IPoIB link
      spans multiple IB subnets. A default of local-subnet scope, i.e.,
      0x2, is RECOMMENDED. A node might determine the scope bits to use
      by iteratively searching for a broadcast-GID of ever greater
      scope, starting with the local scope. Or, an implementation might
      include the scope bits as a configuration parameter.

5.0 Setting Up an IPoIB Link

   The broadcast-GID, as defined in the previous section, MUST be set
   up for an IPoIB subnet to be formed. Every IPoIB interface MUST
   "FullMember" join the IB multicast group defined by the
   broadcast-GID. This multicast group will henceforth be referred to
   as the broadcast group.
   The join operation returns the MTU, the Q_Key and
   other parameters associated with the broadcast group. The node then
   associates the parameters received as a result of the join operation
   with its IPoIB interface. The broadcast group also serves to provide
   a link-layer broadcast service for protocols like ARP, net-directed,
   subnet-directed and all-subnets-directed broadcasts in IPv4 over IB
   networks.

   The join operation is successful only if the Subnet Manager (SM)
   determines that the joining node can support the MTU registered with
   the broadcast group [IPoIB_ARCH], ensuring support for a common link
   MTU. The SM also ensures that all the nodes joining the
   broadcast-GID have paths to one another and can therefore send and
   receive unicast packets. It further ensures that all the nodes do
   indeed form a multicast tree that allows packets sent from any
   member to be replicated to every other member. Thus the IPoIB link
   is formed by the IPoIB nodes joining the broadcast group. There is
   no physical demarcation of the IPoIB link other than that determined
   by the broadcast group membership.

   The P_Key is a configuration parameter that must be known before the
   broadcast-GID can be formed. For a node to join a partition, one of
   its ports must be assigned the relevant P_Key by the SM
   [IPoIB_ARCH].

   The method of creation of the broadcast group and the
   assignment/choice of its parameters are up to the implementation
   and/or the administrator of the IPoIB subnet. The broadcast group
   may be created by the first IPoIB node to be initialized or it can
   be created administratively before the IPoIB subnet is set up. It is
   advisable to ensure that the creation and deletion of the broadcast
   group are under administrative control.

   InfiniBand multicast management, which includes the creation,
   joining and leaving of IB multicast groups by IB nodes, is described
   in [IPoIB_ARCH].

6.0 Frame Format

   All IP and ARP datagrams transported over InfiniBand are prefixed by
   a 4-octet encapsulation header as illustrated below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   |         Type                  |       Reserved                |
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                               Figure 3

   The type field SHALL indicate the encapsulated protocol as per the
   following table.

       +----------+-------------+
       |   Type   |  Protocol   |
       |------------------------|
       |  0x800   |    IPv4     |
       |------------------------|
       |  0x806   |     ARP     |
       |------------------------|
       |  0x8035  |    RARP     |
       |------------------------|
       |  0x86DD  |    IPv6     |
       +------------------------+
                Table 1

   These values are taken from the "ETHER TYPE" numbers assigned by
   Internet Assigned Numbers Authority (IANA). Other network protocols,
   identified by different values of "ETHER TYPE", may use the
   encapsulation format defined herein but such use is outside of the
   scope of this document.

   |<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
   +-------+------+---------+---------+-------------+---------+-------+
   |Local  |      |Base     |Datagram |   4-octet   |         |       |
   |Routing| GRH* |Transport|Extended |   header    |Invariant|Variant|
   |Header |Header|Header   |Transport|      +      |  CRC    |  CRC  |
   |       |      |         |Header   |   IP/ARP    |         |       |
   +-------+------+---------+---------+-------------+---------+-------+
                                 Figure 4

   Figure 4 depicts the IB frame encapsulating an IP/ARP datagram. The
   InfiniBand specification requires the use of Global Routing Header
   (GRH) [IPoIB_ARCH] when multicasting or when an InfiniBand packet
   traverses from one IB subnet to another through an IB router. Its
   use is optional when used for unicast transmission between nodes
   within an IB subnet.
   The IPoIB implementation MUST be able to handle
   packets received with or without the use of GRH.

7.0 Maximum Transmission Unit

   IB MTU:

      The IB components, i.e., IB links, switches, Channel Adapters
      (CAs), and IB routers, may support maximum payloads of 256, 512,
      1024, 2048 or 4096 octets. The maximum IB payload supported by
      the IB components in any IB path is the IB MTU for the path.

   IPoIB-Link MTU:

      The IPoIB-link MTU is the MTU value associated with the broadcast
      group. The IPoIB-link MTU can be set to any value up to the
      smallest IB MTU supported by the IB components comprising the
      IPoIB link.

   In order to reduce problems with fragmentation and path-MTU
   discovery, this document requires that all IPoIB implementations
   support an MTU of 2044 octets, i.e., a 2048-octet IPoIB-link MTU
   minus the 4-octet encapsulation overhead. Larger and smaller MTUs
   MAY be supported, but the default configuration must support an MTU
   of 2044 octets.

   In IPv6 subnets the MTU may be reduced by a Router Advertisement
   [DISC] containing an MTU option which specifies a smaller MTU, or by
   manual configuration of each node. If a Router Advertisement
   received on an IPoIB interface has an MTU option specifying an MTU
   larger than the link MTU or larger than a manually configured value,
   that MTU option may be logged to system management but must be
   otherwise ignored.

   Similarly, the IPv4 MTU may also be reduced by manual configuration
   of each node.

   For purposes of this document, information received from DHCP is
   considered manually configured.

8.0 IPv6 Stateless Autoconfiguration

   IB architecture associates an EUI-64 identifier termed the GUID
   (Globally Unique Identifier) [IPoIB_ARCH, IBTA] with each port. The
   Local Identifier (LID) is unique within an IB subnet only.

   The interface identifier may be chosen from:

   1) The EUI-64 compliant GUID assigned by the manufacturer.

   2) If the IPoIB subnet is fully contained within an IB subnet, any
      of the unique 16-bit LIDs of the port associated with the IPoIB
      interface.

      The LID values of a port may change after a reboot/power-cycle of
      the IB node. Therefore, if a persistent value is desired, it
      would be prudent to not use the LID to form the interface
      identifier.

      On the other hand, the LID provides an identifier that can be
      used to create a more anonymous IPv6 address since the LID is not
      globally unique and is subject to change over time.

   It is RECOMMENDED that the link-local address be constructed from
   the port's EUI-64 identifier as given below.

   [AARCH] requires the interface identifier be created in the
   "Modified EUI-64" format when derived from an EUI-64 identifier.
   [IBTA] is unclear if the GUID should use IEEE EUI-64 format or the
   "Modified EUI-64" format. Therefore, when creating an interface
   identifier from the GUID an implementation MUST do the following:

      => Determine if the GUID is a modified EUI-64 identifier ("u"
         bit is toggled) as defined by [AARCH]

      => If the GUID is a modified EUI-64 identifier then the "u" bit
         MUST NOT be toggled when creating the interface identifier

      => If the GUID is an unmodified EUI-64 identifier then the "u"
         bit MUST be toggled in compliance with [AARCH]

8.1 IPv6 Link Local Address

   The IPv6 link local address for an IPoIB interface is formed as
   described in [AARCH] using the Interface Identifier as described in
   the previous section.

9.0 Address Mapping - Unicast

   Address resolution in IPv4 subnets is accomplished through the
   Address Resolution Protocol (ARP) [ARP]. It is accomplished in IPv6
   subnets using the Neighbor Discovery protocol [DISC].
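
   As an aside on sections 8.0 and 8.1, the GUID-to-interface-
   identifier rule and the resulting link-local address can be
   sketched in Python as follows. The sketch is illustrative only. How
   an implementation determines whether a GUID is already in Modified
   EUI-64 format is not specified here, so the sketch takes that
   decision as a caller-supplied flag; the function names and the
   sample GUID in the usage note are hypothetical.

```python
import ipaddress

# The "u" (universal/local) bit is 0x02 in the first octet of the
# 64-bit identifier, i.e. bit 57 when the identifier is read as an
# integer in network byte order.
U_BIT = 1 << 57


def interface_id_from_guid(guid: int, guid_is_modified_eui64: bool) -> int:
    """Return the 64-bit interface identifier derived from a port GUID.

    guid_is_modified_eui64 is an assumption supplied by the caller; the
    sketch does not try to detect the GUID format itself.
    """
    if not 0 <= guid < (1 << 64):
        raise ValueError("GUID must fit in 64 bits")
    # Toggle the "u" bit only for an unmodified EUI-64 identifier.
    return guid if guid_is_modified_eui64 else guid ^ U_BIT


def link_local_address(interface_id: int) -> ipaddress.IPv6Address:
    """Combine the fe80::/64 prefix with a 64-bit interface identifier."""
    if not 0 <= interface_id < (1 << 64):
        raise ValueError("interface identifier must fit in 64 bits")
    return ipaddress.IPv6Address((0xFE80 << 112) | interface_id)
```

   For a made-up unmodified GUID of 0x0002C90300001234, the sketch
   yields the interface identifier 0x0202C90300001234 and the
   link-local address fe80::202:c903:0:1234.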

9.1 Link Information

   An InfiniBand packet over the UD mode includes multiple headers such
   as the LRH (local route header), GRH (global route header), BTH
   (base transport header), DETH (datagram extended header) as depicted
   in figure 4 and specified in the InfiniBand architecture [IBTA]. All
   these headers comprise the link-layer in an IPoIB link.

   The parameters needed in these IBA headers constitute the link-layer
   information that needs to be determined before an IP packet may be
   transmitted across the IPoIB link.

   The parameters that need to be determined are:

   a) LID

      The LID is always needed. A packet always includes the LRH that
      is targeted at the remote node's LID, or an IB router's LID to
      get to the remote node in another IB subnet.

   b) Global Identifier (GID)

      The GID is not needed when exchanging information within an IB
      subnet though it may be included in any packet. It is an absolute
      necessity when transmitting across IB subnets since the IB
      routers use the GID to correctly forward the packets. The source
      and destination GIDs are fields included in the GRH.

      The GID, if formed using the GUID, can be used to unambiguously
      identify an endpoint.

   c) Queue Pair Number (QPN)

      Every unicast UD communication is always directed to a particular
      queue pair (QP) at the peer.

   d) Q_Key

      A Q_Key is associated with each unreliable datagram QPN. The
      received packets must contain a Q_Key that matches the QP's Q_Key
      to be accepted.

   e) P_Key

      A successful communication between two IB nodes using UD mode can
      occur only if the two nodes have compatible P_Keys. This is
      referred to as being in the same partition [IBTA].

   f) SL

      Every IBA packet contains an SL value. A path in IBA is defined
      by the three-tuple (source LID, destination LID, SL).
      The SL in turn is mapped to a virtual lane (VL) at every CA or
      switch that sends/forwards the packet [IPoIB_ARCH]. Multiple SLs
      may be used between two endpoints to provide for load-balancing,
      SLs may be used for providing a QoS infrastructure, or may be
      used to avoid deadlocks in the IBA fabric.

   Another auxiliary piece of information, not included in the IBA
   headers, is:

   g) Path rate

      IBA defines multiple link speeds. A higher speed transmitter can
      swamp switches and the CAs. To avoid such congestion every source
      transmitting at greater than 1x speeds is required to determine
      the "path rate" before the data may be transmitted [IBTA].

9.1.1 Link Layer Address/Hardware Address

   Though the list of information required for a successful transmittal
   of an IPoIB packet is large, not all the information need be
   determined during the IP address resolution process.

   The IPoIB link-layer address used in the source/target link-layer
   address option in IPv6 and the "hardware address" in IPv4/ARP has
   the same format.

   The format is as described below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |              Queue Pair Number                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                             GID                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                               Figure 5

   a) Reserved Flags

      These 8 bits are reserved for future use. These bits MUST be set
      to zero on send and ignored on receive unless specified
      differently in a future document.

   b) QPN

      Every unicast communication in IB architecture is directed to a
      specific QP [IBTA]. This QP number is included in the link
      description. All IP communication to the relevant IPoIB interface
      MUST be directed to this QPN.
      In the case of IPv4 subnets the address resolution protocol (ARP)
      reply packets are also directed to the same QPN.

      The choice of the QPN for IP/ARP communication is up to the
      implementation.

   c) GID

      This is one of the GIDs of the port associated with the IPoIB
      interface [IBTA]. IB associates multiple GIDs with a port. It is
      RECOMMENDED that the GID formed by the combination of the IB
      subnet prefix and the port's "Port GUID" [IBTA] be included in
      the link-layer/hardware address.

9.1.2 Auxiliary Link Information

   The rest of the parameters are determined as follows:

   a) LID

      The method of determining the peer's LID is not defined in this
      document. It is up to the implementation to use any of the IBA
      approved methods to determine the destination LID. One such
      method is to use the GID determined during the address
      resolution, to retrieve the associated LID from the IB routing
      infrastructure or the Subnet Administrator (SA).

      It is the responsibility of the administrator to ensure that the
      IB subnet(s) have unicast connectivity between the IPoIB nodes.
      The GID exchanged between two endpoints in a multicast message
      (ARP/ND) does not guarantee the existence of a unicast path
      between the two.

      There may be multiple LIDs, and hence paths, between the
      endpoints. The criteria for selection of the LIDs are beyond the
      scope of this document.

   b) Q_Key

      The Q_Key received on joining the broadcast group MUST be used
      for all IPoIB communication over the particular IPoIB link.

   c) P_Key

      The P_Key to be used in the IP subnet is not discovered but is a
      configuration parameter.

   d) SL

      The method of determining the SL is not defined in this document.
      The SL is determined by any of the IBA approved methods.

   e) Path rate

      The implementation must leverage IB methods to determine the path
      rate as required.
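
   The link-layer/hardware address of section 9.1.1 (figure 5) is 20
   octets long: 8 reserved bits, a 24-bit QPN and a 128-bit GID. A
   minimal Python sketch of packing and parsing it follows. The
   function names are hypothetical, and representing the GID as an
   IPv6Address is an assumption made only because both are 128-bit
   values with the same textual format.

```python
import ipaddress
import struct

LINK_LAYER_ADDRESS_LENGTH = 20  # 4 octets (reserved + QPN) + 16 octets (GID)


def pack_link_layer_address(qpn: int, gid: ipaddress.IPv6Address) -> bytes:
    """Build the 20-octet IPoIB link-layer address in network byte order."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    # The top 8 bits of the first 32-bit word are the reserved flags,
    # which are set to zero on send.
    return struct.pack("!I", qpn) + gid.packed


def unpack_link_layer_address(data: bytes):
    """Return (qpn, gid) from a 20-octet IPoIB link-layer address."""
    if len(data) != LINK_LAYER_ADDRESS_LENGTH:
        raise ValueError("IPoIB link-layer address must be 20 octets")
    (word,) = struct.unpack("!I", data[:4])
    qpn = word & 0x00FFFFFF  # reserved bits are ignored on receive
    return qpn, ipaddress.IPv6Address(data[4:])
```

   The parser masks off the reserved octet instead of rejecting
   non-zero values, so that an address sent by a future revision that
   assigns meaning to those bits is still accepted.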

9.2 Address Resolution in IPv4 Subnets

   The ARP packet header is as defined in [ARP]. The hardware type is
   set to 32 (decimal) as specified by IANA. The rest of the fields are
   used as per [ARP].

      16 bits: hardware type
      16 bits: protocol
       8 bits: length of hardware address
       8 bits: length of protocol address
      16 bits: ARP operation

   The remaining fields in the packet hold the sender/target hardware
   and protocol addresses.

      [ sender hardware address ]
      [ sender protocol address ]
      [ target hardware address ]
      [ target protocol address ]

   The hardware address included in the ARP packet will be as specified
   in section 9.1.1 and depicted in figure 5.

   The length of the hardware address used in the ARP packet header
   therefore is 20.

9.3 Address Resolution in IPv6 Subnets

   The Source/Target Link-layer address option is used in Router
   Solicitation, Router Advertisement, Redirect, Neighbor Solicitation
   and Neighbor Advertisement messages when such messages are
   transmitted on InfiniBand networks.

   The source/target address option is specified as follows:

   Type:
      Source Link-layer address    1
      Target Link-layer address    2

   Length: 3

   Link-layer address:

      The link-layer address is as specified in section 9.1.1 and
      depicted in figure 5.

9.4 Cautionary Note on QPN Caching

   The link-layer address for IPoIB includes the QPN which might not be
   constant across reboots or even across network interface resets.
   Cached QPN entries, such as in static ARP entries or in RARP
   servers, will only work if the implementation(s) using these options
   ensure that the QPN associated with an interface is invariant across
   reboots/network resets.

10.0 Sending and Receiving IP Multicast Packets

   Multicast in InfiniBand differs in a number of ways from multicast
   in Ethernet. This adds some complexity to an IPoIB implementation
   when supporting IP multicast over IB.

   A) An IB multicast group must be explicitly created through the SA
   before it can be used.

   This implies that in order to send a packet destined for an IP
   multicast address, the IPoIB implementation must check with the SA
   on the outbound link first for a "MCMemberRecord" that matches the
   MGID. If one does exist, the MLID associated with the multicast
   group is used as the DLID for the packet. Otherwise, it implies no
   member exists on the local link. If the scope of the IP multicast
   group is beyond link-local, the packet must be sent to the on-link
   routers through the use of the all-router multicast group or the
   broadcast group. This is to allow local routers to forward the
   packet to multicast listeners on remote networks. The all-router
   multicast group is preferred over the broadcast group for better
   efficiency. If the all-router multicast group does not exist, the
   sender can assume that there are no routers on the local link; hence
   the packet can be safely dropped.

   B) A multicast sender must join the target multicast group before
   outgoing multicast messages from it can be successfully routed. The
   "SendOnlyNonMember" join is different from the regular "FullMember"
   join in two aspects. First, both types of joins enable multicast
   packets to be routed FROM the local port, but only the "FullMember"
   join causes multicast packets to be routed TO the port. Second, the
   sender port of a "SendOnlyNonMember" join will not be counted as a
   member of the multicast group for purposes of group creation and
   deletion.

   The following code snippet demonstrates the steps in a typical
   implementation when processing an egress multicast packet.
692 if the egress port is already a "SendOnlyNonMember", or a
693    "FullMember"
694     => send the packet

696 else if the target multicast group exists
697     => do "SendOnlyNonMember" join
698     => send the packet

700 else if scope > link-local AND the all-router multicast group exists
701     => send the packet to all routers

703 else
704     => drop the packet

706 Implementations should cache the information about the existence of
707 an IB multicast group, its MLID and other attributes. This is to
708 avoid expensive SA calls on every outgoing multicast packet.
709 Senders MUST subscribe to the multicast group create and delete
710 traps in order to monitor the status of specific IB multicast
711 groups. E.g., multicast packets directed to the all-router multicast
712 group due to a lack of listeners on the local subnet must be
713 forwarded to the right multicast group if the group is created
714 later. This happens when a listener shows up on the local subnet.

716 A node joining an IP multicast group must first construct an MGID
717 according to the rule described in section 4 above. Once the correct
718 MGID is calculated, the node must call the SA of the outbound link
719 to attempt a "FullMember" join of the IB multicast group
720 corresponding to the MGID. If the IB multicast group doesn't already
721 exist, one must be created first with the IPoIB link MTU. The MGID
722 MUST use the same P_Key, Q_Key, SL, MTU and HopLimit as those used
723 in the broadcast-GID. For the remaining attributes, the values
724 used in the broadcast-GID SHOULD also be used.

726 The join request will cause the local port to be added to the
727 multicast group. It also enables the SM to program IB switches and
728 routers with the new multicast information to ensure the correct
729 forwarding of multicast packets for the group.

731 When a node leaves an IP multicast group, it SHOULD make a
732 "FullMember" leave request to the SA.
This gives the SM an opportunity
733 to update relevant forwarding information, to delete an IB multicast
734 group if the local port is the last FullMember to leave, and free up
735 the MLID allocated for it. The specific algorithm is
736 implementation-dependent, and is out of the scope of this document.

738 Note that for an IPoIB link that spans more than one IB subnet
739 connected by IB routers, adequate multicast forwarding support at
740 the IB level is required for multicast packets to reach listeners on
741 a remote IB subnet. The specific mechanism for this is beyond the
742 scope of IPoIB.

744 11.0 IP Multicast Routing

746 IP multicast routing requires multicast routers to receive a copy of
747 every link multicast packet on a locally connected link [IPMULT,
748 IP6MLD]. For Ethernet this is usually achieved by turning on the
749 promiscuous multicast mode on a locally connected Ethernet
750 interface.

752 IBA does not provide any hardware support for promiscuous multicast
753 mode. Fortunately, a promiscuous multicast mode can be emulated in
754 the software running on a router through the following steps.

756 A) Obtain a list of all active IB multicast groups from the local
757 SA.

759 B) Make a "NonMember" join request to the SA for every group that
760 has a signature in its MGID matching the one for either IPv4 or
761 IPv6.

763 C) Subscribe to the IB multicast group creation events using a
764 wildcarded MGID so that the router can "NonMember" join all IB
765 multicast groups created subsequently for IPv4 or IPv6.

767 The "NonMember" join has the same effect as a "FullMember" join
768 except that the former will not be counted as a member of the
769 multicast group for purposes of group creation or deletion. That is,
770 when the last "FullMember" leaves a multicast group, the group can
771 be safely deleted by the SA without regard to any "NonMember"
772 routers.
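Steps A) to C) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. The FakeSA class is a hypothetical in-memory stand-in for the SA and does not model any real InfiniBand interface. The signature values (0x401B for IPv4, 0x601B for IPv6) and their position in octets 2 and 3 of the MGID are assumed here from the MGID layout that section 4 defines; that section should be treated as authoritative.

```python
IPV4_SIGNATURE = 0x401B  # assumed from section 4
IPV6_SIGNATURE = 0x601B  # assumed from section 4


def is_ipoib_mgid(mgid: bytes) -> bool:
    """True if the MGID carries the IPv4 or IPv6 IPoIB signature."""
    if len(mgid) != 16 or mgid[0] != 0xFF:
        return False
    signature = int.from_bytes(mgid[2:4], "big")
    return signature in (IPV4_SIGNATURE, IPV6_SIGNATURE)


def make_mgid(signature: int, group_id: int) -> bytes:
    """Build a 16-octet example MGID (helper for illustration only)."""
    return (bytes([0xFF, 0x12]) + signature.to_bytes(2, "big")
            + bytes([0xFF, 0xFF]) + group_id.to_bytes(10, "big"))


class FakeSA:
    """Hypothetical in-memory stand-in for the SA."""

    def __init__(self, groups):
        self.groups = list(groups)
        self.non_member_joins = []
        self.creation_listeners = []

    def list_groups(self):
        return list(self.groups)

    def join_non_member(self, mgid):
        self.non_member_joins.append(mgid)

    def subscribe_group_creation(self, callback):
        self.creation_listeners.append(callback)

    def create_group(self, mgid):
        # Simulates another node creating a group, which triggers
        # the creation event delivered to subscribers.
        self.groups.append(mgid)
        for callback in self.creation_listeners:
            callback(mgid)


def emulate_promiscuous_multicast(sa):
    def join_if_ipoib(mgid):
        if is_ipoib_mgid(mgid):
            sa.join_non_member(mgid)

    # Step A: obtain all active groups. Step B: join the matching ones.
    for mgid in sa.list_groups():
        join_if_ipoib(mgid)
    # Step C: join matching groups created later.
    sa.subscribe_group_creation(join_if_ipoib)
```

With a FakeSA that initially holds one IPv4 group and one unrelated group, calling emulate_promiscuous_multicast joins only the IPv4 group, and a later create_group call with an IPv6 MGID adds a second "NonMember" join.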
774 12.0 New Types of Vulnerability in IB Multicast

776 Many IB multicast functions are subject to failures due to a number
777 of possible resource constraints. These include the creation of IB
778 multicast groups, the join calls ("SendOnlyNonMember", "FullMember",
779 and "NonMember"), and the attaching of a QP to a multicast group.

781 In general, the occurrence of these failure conditions is highly
782 implementation dependent, and is believed to be rare. Usually a
783 failed multicast operation at the IB level can be propagated back to
784 the IP level, causing the original operation to fail, and the
785 initiator of the operation to be notified. But some IB multicast
786 functions are not tied to any foreground operation, making their
787 failures hard to detect. E.g., if an IP multicast router attempts to
788 "NonMember" join a newly created multicast group in the local
789 subnet, but the join call fails, packet forwarding for that
790 particular multicast group will likely fail silently, that is,
791 without coming to the attention of local multicast senders. This type
792 of problem can add more vulnerability to the already unreliable IP
793 multicast operations.

795 Implementations should log error messages upon any failure from an
796 IB multicast operation. Network administrators should be aware of
797 this vulnerability, and reserve enough multicast resources at the
798 points where IP multicast will be used heavily. E.g., HCAs with
799 ample multicast resources should be used at any IP multicast router.

801 13.0 Security Considerations

803 This document specifies IP transmission over a multicast network.
804 Any network of this kind is vulnerable to a sender claiming
805 another's identity and forging traffic or eavesdropping. It is the
806 responsibility of the higher layers or applications to implement
807 suitable counter-measures if this is a problem.
809 Successful transmission of IP packets depends on the correct setup
810 of the IPoIB link, creation of the broadcast-GID, creation of the
811 QP and its attachment to the broadcast-GID, and the correct
812 determination of various link parameters such as the LID, service
813 level, path rate, etc. These operations, many of which involve
814 interactions with the SM/SA, MUST be protected by the underlying
815 operating system. This is to prevent malicious, non-privileged
816 software from hijacking important resources and configurations.

818 Controlled Q_Keys SHOULD be used in all transmissions. This is to
819 prevent non-privileged software from fabricating IP datagrams.

821 14.0 IANA Considerations

823 To support ARP over InfiniBand, a value for the Address Resolution
824 Parameter "Number Hardware Type (hrd)" is required. IANA has
825 assigned the number "32" to indicate InfiniBand [IANA_ARP].

827 15.0 Acknowledgments

829 The authors would like to thank Bruce Beukema, David Brean, Dan
830 Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
831 Erik Nordmark, Greg Pfister, Jim Pinkerton, Renato Recio, Kevin
832 Reilly, Kanoj Sarcar, Satya Sharma, Madhu Talluri, and David L.
833 Stevens for their suggestions and many clarifications on the IBA
834 specification.

836 16.0 References

838 [AARCH] Hinden, R. and S. Deering, "IP Version 6 Addressing
839 Architecture", RFC 3513.

841 [DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor
842 Discovery for IP Version 6 (IPv6)", RFC 2461.

844 [HOSTS] Braden, R., "Requirements for Internet Hosts --
845 Communication Layers", RFC 1122.

847 [IBTA] InfiniBand Architecture Specification,
848 www.infinibandta.org.

850 [IGMP2] Fenner, W., "Internet Group Management Protocol,
851 Version 2", RFC 2236.

853 [IP6MLD] Deering, S., Fenner, W. and B. Haberman, "Multicast
854 Listener Discovery (MLD) for IPv6", RFC 2710.

856 [IPMULT] Deering, S., "Host Extensions for IP Multicasting",
857 RFC 1112.
859 [IPoIB_ARCH] Kashyap, V., "IP over InfiniBand (IPoIB) Architecture",
860 draft-ietf-ipoib-architecture-03.txt.

862 [IPV6] Deering, S. and R. Hinden, "Internet Protocol,
863 Version 6 (IPv6) Specification", RFC 2460.

865 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
866 Requirement Levels", RFC 2119.

868 [ARP] Plummer, D.C., "Ethernet Address Resolution Protocol",
869 RFC 826.

871 [IANA] Internet Assigned Numbers Authority, www.iana.org.

873 [IANA_ARP] www.iana.org/assignments/arp-parameters

875 17.0 Authors' Addresses

877 H.K. Jerry Chu

879 17 Network Circle, UMPK17-201
880 Menlo Park, CA 94025
881 USA
882 Phone: +1 650 786 5146
883 Email: jerry.chu@sun.com

885 Vivek Kashyap

887 15350, SW Koll Parkway
888 Beaverton, OR 97006
889 USA
890 Phone: +1 503 578 3422
891 Email: vivk@us.ibm.com

893 Full Copyright Statement

895 Copyright (C) The Internet Society (2004). This document is subject
896 to the rights, licenses and restrictions contained in BCP 78, and
897 except as set forth therein, the authors retain all their rights.

899 This document and the information contained herein are provided on
900 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
901 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
902 INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
903 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
904 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
905 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
907 Intellectual Property Statement

909 The IETF takes no position regarding the validity or scope of any
910 Intellectual Property Rights or other rights that might be claimed
911 to pertain to the implementation or use of the technology described
912 in this document or the extent to which any license under such
913 rights might or might not be available; nor does it represent that
914 it has made any independent effort to identify any such rights.
915 Information on the procedures with respect to rights in RFC
916 documents can be found in BCP 78 and BCP 79.

918 Copies of IPR disclosures made to the IETF Secretariat and any
919 assurances of licenses to be made available, or the result of an
920 attempt made to obtain a general license or permission for the use
921 of such proprietary rights by implementers or users of this
922 specification can be obtained from the IETF on-line IPR repository
923 at http://www.ietf.org/ipr.

925 The IETF invites any interested party to bring to its attention any
926 copyrights, patents or patent applications, or other proprietary
927 rights that may cover technology that may be required to implement
928 this standard. Please address the information to the IETF at
929 ietf-ipr@ietf.org.

931 Acknowledgment

933 Funding for the RFC Editor function is currently provided by the
934 Internet Society.