INTERNET DRAFT                                            H.K. Jerry Chu
Expiration Date: May 2004                               Sun Microsystems
                                                              V. Kashyap
                                                                     IBM
                                                           November 2003

                   Transmission of IP over InfiniBand

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as Reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

   This document specifies a method for encapsulating and transmitting
   IPv4/IPv6 and Address Resolution Protocol (ARP) packets over
   InfiniBand (IB). It describes the link layer address to be used when
   resolving the IP addresses in "IP over InfiniBand (IPoIB)" subnets.
   The document also describes the mapping from IP multicast addresses
   to InfiniBand multicast addresses. Additionally this document
   defines the setup and configuration of IPoIB links.

Table of Contents

   1.0  Introduction
   2.0  IP over UD Mode
   3.0  InfiniBand Datalink
   4.0  Multicast Mapping
      4.1  Broadcast-GID Parameters
   5.0  Setting Up an IPoIB Link
   6.0  Frame Format
   7.0  Maximum Transmission Unit
   8.0  IPv6 Stateless Autoconfiguration
      8.1  IPv6 Link Local Address
   9.0  Address Mapping - Unicast
      9.1  Link Information
         9.1.1  Link Layer Address/Hardware Address
         9.1.2  Auxiliary Link Information
      9.2  Address Resolution in IPv4 Subnets
      9.3  Address Resolution in IPv6 Subnets
      9.4  Cautionary Note on QPN Caching
   10.0 IANA Considerations
   11.0 Sending and Receiving IP Multicast Packets
   12.0 IP Multicast Routing
   13.0 New Types of Vulnerability in IB Multicast
   14.0 Security Considerations
   15.0 Acknowledgments
   16.0 References
   17.0 Author's Addresses

1.0 Introduction

   The InfiniBand specification [IBTA] can be found at
   www.infinibandta.org. The document [IPoIB_ARCH] provides a short
   overview of InfiniBand architecture (IBA) along with considerations
   for specifying IP over InfiniBand networks.

   IBA defines multiple modes of transport over which IP may be
   implemented. The unreliable datagram (UD) transport mode best
   matches the needs of IP and the need for universality as described
   in [IPoIB_ARCH].

   This document specifies IPoIB over IB's UD mode. The implementation
   of IP subnets over IB's other transport mechanisms is out of scope
   of this document.

   This document describes the steps required to lay out an IP network
   on top of an IB network. It describes all the elements of an IPoIB
   link, how to configure its associated attributes, and how to set up
   basic broadcast and multicast services for it.

   It further describes IP address resolution and the encapsulation of
   IP and ARP packets in InfiniBand frames.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.0 IP over UD Mode

   The unreliable datagram mode of communication is supported by all IB
   elements, be they IB routers, Host Channel Adapters (HCAs), or
   Target Channel Adapters (TCAs). In addition to being the only
   universal transmission method, it supports multicasting,
   partitioning, and a 32-bit CRC [IBTA]. Though multicasting support
   is optional in IB fabrics, the IPoIB architecture requires the
   participating components to support it.

   All IPoIB implementations MUST support IP over the UD transport mode
   of IBA.

3.0 InfiniBand Datalink

   An IB subnet is formed by a network of IB nodes interconnected
   either directly or via IB switches. IB subnets may be connected
   using IB routers to form a fabric made of multiple IB subnets. Nodes
   residing in different IB subnets can communicate directly with one
   another through IB routers at the IB network layer. Multiple IP
   subnets may be overlaid over this IB network.

   An IP subnet is configured over a communication facility or medium
   over which nodes can communicate at the "link" layer [IPV6]. For
   example, an Ethernet segment is a link formed by interconnected
   switches/hubs/bridges. The segment is therefore defined by the
   physical topology of the network. This is not the case with IPoIB.
   IPoIB subnets are built over an abstract "link". The link is defined
   by its members and common characteristics such as the P_Key, link
   MTU, and the Q_Key.

   Any two ports using UD communication mode in an IB fabric can
   communicate only if they are in the same partition, i.e., have the
   same P_Key and the same Q_Key [IPoIB_ARCH]. The link MTU provides a
   limit to the size of the payload that may be used.
   The packet transmission and routing within the IB fabric are also
   affected by additional parameters such as the traffic class
   (TClass), hop limit (HopLimit), service level (SL), and the flow
   label (FlowLabel). The determination and use of these values for
   IPoIB communication is described in the following sections.

4.0 Multicast Mapping

   IB identifies multicast groups by multicast Global Identifiers
   (MGIDs), which follow the same rules as IPv6 multicast addresses.
   Hence the MGIDs follow the same rules regarding the transient
   addresses and scope bits, albeit in the context of the IB fabric.
   The resultant address therefore resembles an IPv6 multicast address.
   The documents [IBTA, IPoIB_ARCH] give a detailed description of IB
   multicast.

   The IPoIB multicast mapping is depicted in figure 1. The same
   mapping function is used for both IPv4 and IPv6 except for the IPoIB
   signature field.

   Unless explicitly stated, all addresses and fields in the protocol
   headers in this document are stored in the network byte order.

   |   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
   +--------+----+----+-----------------+---------+-------------------+
   |11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
   +--------+----+----+-----------------+---------+-------------------+
                                Figure 1

   Since an MGID allocated for transporting IP multicast datagrams is
   considered only a transient link-layer multicast address
   [IPoIB_ARCH], all IB MGIDs allocated for IPoIB purposes MUST set the
   T-flag to 1 [IBTA].

   A special signature is embedded to identify the MGID for IPoIB use
   only. For IPv4 over IB, the signature MUST be "0x401B". For IPv6
   over IB, the signature MUST be "0x601B".

   The IP multicast address is used together with a given IPoIB link
   P_Key to form the MGID of the IB multicast group. For IPv6 the lower
   80 bits of the group ID are used directly in the lower 80 bits of
   the MGID.
   For IPv4, the group ID is only 28 bits long and the rest of the bits
   are filled with 0.

   For example, on an IPoIB link that is fully contained within a
   single IB subnet with a P_Key of 0x8000, the MGIDs for the
   all-router multicast group with group ID 2 [AARCH, IGMP2] are:

      FF12:401B:8000::2, for IPv4 in compressed format, and
      FF12:601B:8000::2, for IPv6 in compressed format.

   A special case exists for the IPv4 limited broadcast address
   "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
   "broadcast-GID", which is defined as follows:

   |   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
   +--------+----+----+----------------+---------+----------+---------+
   |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
   +--------+----+----+----------------+---------+----------+---------+
                                Figure 2

   All MGIDs used in the IPoIB subnet MUST use the scop bits used in
   the broadcast-GID.

4.1 Broadcast-GID Parameters

   The broadcast-GID is set up with the following attributes:

   1. P_Key
       A "Full Membership" P_Key (high-order bit is set to 1) MUST
       be used so that all members may communicate with one
       another.

   2. Q_Key
       It is RECOMMENDED that a controlled Q_Key be used with the
       high-order bit set. This is to prevent non-privileged
       software from fabricating and sending out bogus IP
       datagrams.

   3. IB MTU
       The value assigned to the broadcast-GID must not be greater
       than any physical link MTU spanned by the IPoIB subnet.

   The following attributes are required in multicast transmissions and
   also in unicast transmissions if an IPoIB link covers more than a
   single subnet.

   4. Other parameters
       The selection of TClass, FlowLabel, and HopLimit values is
       implementation dependent, but it must take into account the
       topology of IB subnets comprising the IPoIB link in order to
       allow successful communication between any two nodes in the
       same IPoIB link.

       An SL also needs to be assigned to the broadcast-GID. This
       SL is used in all multicast communication in the subnet.

       The broadcast-GID's scope bits need to be set based on
       whether the IPoIB link is confined within an IB subnet or
       the IPoIB link spans multiple IB subnets. A default of
       local-subnet scope, i.e., 0x2, is RECOMMENDED. A node might
       determine the scope bits to use by iteratively searching
       for a broadcast-GID of ever greater scope, starting with
       the local scope. Or, an implementation might include the
       scope bits as a configuration parameter.

5.0 Setting Up an IPoIB Link

   The broadcast-GID, as defined in the previous section, MUST be set
   up for an IPoIB subnet to be formed. Every IPoIB interface MUST
   "FullMember" join the IB multicast group defined by the
   broadcast-GID. This multicast group will henceforth be referred to
   as the broadcast group. The join operation returns the MTU, the
   Q_Key, and other parameters associated with the broadcast group. The
   node then associates the parameters received as a result of the join
   operation with its IPoIB interface. The broadcast group also serves
   to provide a link-layer broadcast service for protocols like ARP,
   net-directed, subnet-directed and all-subnets-directed broadcasts in
   IPv4 over IB networks.

   The join operation is successful only if the Subnet Manager (SM)
   determines that the joining node can support the MTU registered with
   the broadcast group [IPoIB_ARCH], ensuring support for a common link
   MTU. The SM also ensures that all the nodes joining the
   broadcast-GID have paths to one another and can therefore send and
   receive unicast packets. It further ensures that all the nodes do
   indeed form a multicast tree that allows packets sent from any
   member to be replicated to every other member. Thus the IPoIB link
   is formed by the IPoIB nodes joining the broadcast group.
   There is no physical demarcation of the IPoIB link other than that
   determined by the broadcast group membership.

   The P_Key is a configuration parameter that must be known before the
   broadcast-GID can be formed. For a node to join a partition, one of
   its ports must be assigned the relevant P_Key by the SM
   [IPoIB_ARCH].

   The method of creation of the broadcast group and the
   assignment/choice of its parameters are up to the implementation
   and/or the administrator of the IPoIB subnet. The broadcast group
   may be created by the first IPoIB node to be initialized or it can
   be created administratively before the IPoIB subnet is set up. It is
   advisable to ensure that the creation and deletion of the broadcast
   group are under administrative control.

   InfiniBand multicast management, which includes the creation,
   joining, and leaving of IB multicast groups by IB nodes, is
   described in [IPoIB_ARCH].

6.0 Frame Format

   All IP and ARP datagrams transported over InfiniBand are prefixed by
   a 4-octet encapsulation header as illustrated below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   |         Type                  |       Reserved                |
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                Figure 3

   The type field SHALL indicate the encapsulated protocol as per the
   following table.

                      +----------+-------------+
                      |   Type   |  Protocol   |
                      +----------+-------------+
                      |  0x800   |    IPv4     |
                      +----------+-------------+
                      |  0x806   |    ARP      |
                      +----------+-------------+
                      |  0x8035  |    RARP     |
                      +----------+-------------+
                      |  0x86DD  |    IPv6     |
                      +----------+-------------+
                                Table 1

   These values are taken from the "ETHER TYPE" numbers assigned by
   Internet Assigned Numbers Authority (IANA).
   Other network protocols, identified by different values of "ETHER
   TYPE", may use the encapsulation format defined herein, but such use
   is outside the scope of this document.

   |<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
   +-------+------+---------+---------+-------------+---------+-------+
   |Local  |      |Base     |Datagram |   4-octet   |         |       |
   |Routing| GRH* |Transport|Extended |   header    |Invariant|Variant|
   |Header |Header|Header   |Transport|      +      |  CRC    |  CRC  |
   |       |      |         |Header   |   IP/ARP    |         |       |
   +-------+------+---------+---------+-------------+---------+-------+
                                Figure 4

   Figure 4 depicts the IB frame encapsulating an IP/ARP datagram. The
   InfiniBand specification requires the use of the Global Routing
   Header (GRH) [IPoIB_ARCH] when multicasting or when an InfiniBand
   packet traverses from one IB subnet to another through an IB router.
   Its use is optional for unicast transmission between nodes within an
   IB subnet. The IPoIB implementation MUST be able to handle packets
   received with or without the use of GRH.

7.0 Maximum Transmission Unit

   IB MTU:
      The IB components, i.e., IB links, switches, Channel Adapters
      (CAs), and IB routers, may support maximum payloads of: 256,
      512, 1024, 2048, or 4096 octets. The maximum IB payload supported
      by the IB components in any IB path is the IB MTU for the path.

   IPoIB-Link MTU:
      The IPoIB-link MTU is the MTU value associated with the
      broadcast group. The IPoIB-link MTU can be set to any value up to
      the smallest IB MTU supported by the IB components comprising
      the IPoIB link.

   In order to reduce problems with fragmentation and path-MTU
   discovery, this document requires that all IPoIB implementations
   support an MTU of 2044 octets, i.e., a 2048 octet IPoIB-link MTU
   minus the 4 octet encapsulation overhead.
   Larger and smaller MTUs MAY be supported, but the default
   configuration must support an MTU of 2044 octets.

   In IPv6 subnets the MTU may be reduced by a Router Advertisement
   [DISC] containing an MTU option which specifies a smaller MTU, or by
   manual configuration of each node. If a Router Advertisement
   received on an IPoIB interface has an MTU option specifying an MTU
   larger than the link MTU or larger than a manually configured value,
   that MTU option may be logged to system management but must be
   otherwise ignored.

   Similarly, the IPv4 MTU may also be reduced by manual configuration
   of each node.

   For purposes of this document, information received from DHCP is
   considered manually configured.

8.0 IPv6 Stateless Autoconfiguration

   IB architecture associates an EUI-64 identifier termed the GUID
   (Globally Unique Identifier) [IPoIB_ARCH, IBTA] with each port. The
   Local Identifier (LID) is unique within an IB subnet only.

   The interface identifier may be chosen from:

   1) The EUI-64 compliant GUID assigned by the manufacturer.

   2) If the IPoIB subnet is fully contained within an IB subnet
      any of the unique 16-bit LIDs of the port associated with the
      IPoIB interface.

      The LID values of a port may change after a reboot/power-cycle
      of the IB node. Therefore, if a persistent value is desired, it
      would be prudent to not use the LID to form the interface
      identifier.

      On the other hand, the LID provides an identifier that can be
      used to create a more anonymous IPv6 address since the LID is
      not globally unique and is subject to change over time.

   It is RECOMMENDED that the link-local address be constructed from
   the port's EUI-64 identifier as per the rules specified in [AARCH].
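
   The recommended construction can be sketched as follows. This is a
   minimal illustration, not part of the specification: it assumes the
   port GUID is treated as an IEEE EUI-64 and that the "Modified
   EUI-64" rule of [AARCH] applies (the universal/local bit of the
   first octet is inverted). The function names and the sample GUID
   are hypothetical.

```python
import ipaddress

# fe80::/64 prefix: 0xfe80 followed by 48 zero bits (8 octets in total).
LINK_LOCAL_PREFIX = bytes.fromhex("fe80") + bytes(6)


def interface_id_from_guid(guid: bytes) -> bytes:
    """Derive a 64-bit interface identifier from an 8-octet port GUID.

    Assumption: Modified EUI-64 formation, i.e. the universal/local
    bit (0x02 in the first octet) of the EUI-64 is inverted.
    """
    if len(guid) != 8:
        raise ValueError("a port GUID must be exactly 8 octets")
    return bytes([guid[0] ^ 0x02]) + guid[1:]


def link_local_from_guid(guid: bytes) -> ipaddress.IPv6Address:
    """Combine the link-local prefix with the interface identifier."""
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX + interface_id_from_guid(guid))


if __name__ == "__main__":
    # Hypothetical GUID, used only to exercise the functions.
    sample_guid = bytes.fromhex("0002c90300001234")
    print(link_local_from_guid(sample_guid))
```

   A LID-based identifier is deliberately left out of the sketch,
   since the text above does not state how a 16-bit LID is placed
   within the 64-bit identifier.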

8.1 IPv6 Link Local Address

   The IPv6 link local address for an IPoIB interface is formed as
   described in [AARCH] using the Interface Identifier as described in
   the previous section.

9.0 Address Mapping - Unicast

   Address resolution in IPv4 subnets is accomplished through the
   Address Resolution Protocol (ARP) [ARP]. It is accomplished in IPv6
   subnets using the Neighbor Discovery protocol [DISC].

9.1 Link Information

   An InfiniBand packet over the UD mode includes multiple headers such
   as the LRH (local route header), GRH (global route header), BTH
   (base transport header), and DETH (datagram extended transport
   header) as depicted in figure 4 and specified in the InfiniBand
   architecture [IBTA]. All these headers comprise the link-layer in an
   IPoIB link.

   The parameters needed in these IBA headers constitute the link-layer
   information that needs to be determined before an IP packet may be
   transmitted across the IPoIB link.

   The parameters that need to be determined are:

   a) LID

      The LID is always needed. A packet always includes the LRH
      that is targeted at the remote node's LID, or an IB router's
      LID to get to the remote node in another IB subnet.

   b) Global Identifier (GID)

      The GID is not needed when exchanging information within an
      IB subnet though it may be included in any packet. It is an
      absolute necessity when transmitting across the IB subnet
      since the IB routers use the GID to correctly forward the
      packets. The source and destination GIDs are fields included
      in the GRH.

      The GID, if formed using the GUID, can be used to
      unambiguously identify an endpoint.

   c) Queue Pair Number (QPN)

      Every unicast UD communication is always directed to a
      particular queue pair (QP) at the peer.

   d) Q_Key

      A Q_Key is associated with each unreliable datagram QPN. The
      received packets must contain a Q_Key that matches the QP's
      Q_Key to be accepted.

   e) P_Key

      A successful communication between two IB nodes using UD
      mode can occur only if the two nodes have compatible P_Keys.
      This is referred to as being in the same partition [IBTA].

   f) SL

      Every IBA packet contains an SL value. A path in IBA is
      defined by the three-tuple (source LID, destination LID,
      SL). The SL in turn is mapped to a virtual lane (VL) at
      every CA or switch that sends/forwards the packet
      [IPoIB_ARCH]. Multiple SLs may be used between two endpoints
      to provide for load-balancing, SLs may be used for providing
      a QoS infrastructure, or may be used to avoid deadlocks in
      the IBA fabric.

   Another auxiliary piece of information, not included in the IBA
   headers, is:

   g) Path rate

      IBA defines multiple link speeds. A higher speed transmitter
      can swamp switches and the CAs. To avoid such congestion
      every source transmitting at greater than 1x speeds is
      required to determine the "path rate" before the data may be
      transmitted [IBTA].

9.1.1 Link Layer Address/Hardware Address

   Though the list of information required for a successful transmittal
   of an IPoIB packet is large, not all the information need be
   determined during the IP address resolution process.

   The IPoIB link-layer address used in the source/target link-layer
   address option in IPv6 and the "hardware address" in IPv4/ARP has
   the same format.

   The format is as described below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |              Queue Pair Number                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                             GID                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                Figure 5

   a) Reserved Flags

      These 8 bits are reserved for future use.
      These bits MUST be set to zero on send and ignored on receive
      unless specified differently in a future document.

   b) QPN

      Every unicast communication in IB architecture is directed
      to a specific QP [IBTA]. This QP number is included in the
      link description. All IP communication to the relevant IPoIB
      interface MUST be directed to this QPN. In the case of IPv4
      subnets the address resolution protocol (ARP) reply packets
      are also directed to the same QPN.

      The choice of the QPN for IP/ARP communication is up to the
      implementation.

   c) GID

      This is one of the GIDs of the port associated with the
      IPoIB interface [IBTA]. IB associates multiple GIDs with a
      port. It is RECOMMENDED that the GID formed by the
      combination of the IB subnet prefix and the port's GUID be
      included in the link-layer/hardware address.

9.1.2 Auxiliary Link Information

   The rest of the parameters are determined as follows:

   a) LID

      The method of determining the peer's LID is not defined in
      this document. It is up to the implementation to use any of
      the IBA approved methods to determine the destination LID.
      One such method is to use the GID determined during the
      address resolution, to retrieve the associated LID from the
      IB routing infrastructure or the Subnet Administrator (SA).

      It is the responsibility of the administrator to ensure that
      the IB subnet(s) have unicast connectivity between the IPoIB
      nodes. The GID exchanged between two endpoints in a
      multicast message (ARP/ND) does not guarantee the existence
      of a unicast path between the two.

      There may be multiple LIDs, and hence paths, between the
      endpoints. The criteria for selection of the LIDs are beyond
      the scope of this document.

   b) Q_Key

      The Q_Key received on joining the broadcast group MUST be
      used for all IPoIB communication over the particular IPoIB
      link.

   c) P_Key

      The P_Key to be used in the IP subnet is not discovered but
      is a configuration parameter.

   d) SL

      The method of determining the SL is not defined in this
      document. The SL is determined by any of the IBA approved
      methods.

   e) Path rate

      The implementation must leverage IB methods to determine the
      path rate as required.

9.2 Address Resolution in IPv4 Subnets

   The ARP packet header is as defined in [ARP]. The hardware type is
   set to 32 (decimal) as specified by IANA. The rest of the fields are
   used as per [ARP].

      16 bits: hardware type
      16 bits: protocol
       8 bits: length of hardware address
       8 bits: length of protocol address
      16 bits: ARP operation

   The remaining fields in the packet hold the sender/target hardware
   and protocol addresses.

      [ sender hardware address ]
      [ sender protocol address ]
      [ target hardware address ]
      [ target protocol address ]

   The hardware address included in the ARP packet will be as specified
   in section 9.1.1 and depicted in figure 5.

   The length of the hardware address used in the ARP packet header
   therefore is 20.

9.3 Address Resolution in IPv6 Subnets

   The Source/Target Link-layer address option is used in Router
   Solicitation, Router Advertisement, Redirect, Neighbor Solicitation,
   and Neighbor Advertisement messages when such messages are
   transmitted on InfiniBand networks.

   The source/target address option is specified as follows:

   Type:
      Source Link-layer address 1
      Target Link-layer address 2

   Length: 3

   Link-layer address:

      The link-layer address is as specified in section 9.1.1 and
      depicted in figure 5.

9.4 Cautionary Note on QPN Caching

   The link-layer address for IPoIB includes the QPN which might not be
   constant across reboots or even across network interface resets.
   Cached QPN entries, such as in static ARP entries or in RARP
   servers, will only work if the implementation(s) using these options
   ensure that the QPN associated with an interface is invariant across
   reboots/network resets.

10.0 IANA Considerations

   To support ARP over InfiniBand, a value for the Address Resolution
   Parameter "Number Hardware Type (hrd)" is required. IANA has
   assigned the number "32" to indicate InfiniBand [IANA_ARP].

11.0 Sending and Receiving IP Multicast Packets

   Multicast in InfiniBand differs in a number of ways from multicast
   in Ethernet. This adds some complexity to an IPoIB implementation
   when supporting IP multicast over IB.

   A) An IB multicast group must be explicitly created through the SA
   before it can be used.

   This implies that in order to send a packet destined for an IP
   multicast address, the IPoIB implementation must check with the SA
   on the outbound link first for a "MCMemberRecord" that matches the
   MGID. If one does exist, the MLID associated with the multicast
   group is used as the DLID for the packet. Otherwise, it implies no
   member exists on the local link. If the scope of the IP multicast
   group is beyond link-local, the packet must be sent to the on-link
   routers through the use of the all-router multicast group or the
   broadcast group. This is to allow local routers to forward the
   packet to multicast listeners on remote networks. The all-router
   multicast group is preferred over the broadcast group for better
   efficiency. If the all-router multicast group does not exist, the
   sender can assume that there are no routers on the local link; hence
   the packet can be safely dropped.

   B) A multicast sender must join the target multicast group before
   outgoing multicast messages from it can be successfully routed. The
   "SendOnlyNonMember" join is different from the regular "FullMember"
   join in two aspects.
   First, both types of joins enable multicast packets to be routed
   FROM the local port, but only the "FullMember" join causes multicast
   packets to be routed TO the port. Second, the sender port of a
   "SendOnlyNonMember" join will not be counted as a member of the
   multicast group for purposes of group creation and deletion.

   The following pseudocode demonstrates the steps in a typical
   implementation when processing an egress multicast packet.

   if the egress port is already a "SendOnlyNonMember", or a
      "FullMember"
         => send the packet

   else if the target multicast group exists
         => do "SendOnlyNonMember" join
         => send the packet

   else if scope > link-local AND the all-router multicast group exists
         => send the packet to all routers

   else
         => drop the packet

   Implementations should cache the information about the existence of
   an IB multicast group, its MLID and other attributes. This is to
   avoid expensive SA calls on every outgoing multicast packet.
   Senders MUST subscribe to the multicast group create and delete
   traps in order to monitor the status of specific IB multicast
   groups. For example, multicast packets directed to the all-router
   multicast group due to a lack of listeners on the local subnet must
   be forwarded to the right multicast group if the group is created
   later. This happens when a listener shows up on the local subnet.

   A node joining an IP multicast group must first construct an MGID
   according to the rule described in section 4 above. Once the correct
   MGID is calculated, the node must call the SA of the outbound link
   to attempt a "FullMember" join of the IB multicast group
   corresponding to the MGID. If the IB multicast group doesn't already
   exist, one must be created first with the IPoIB link MTU. The MGID
   MUST use the same P_Key, Q_Key, SL, MTU and HopLimit as those used
   in the broadcast-GID.
   The remaining attributes SHOULD also take the values used in the
   broadcast-GID.

   The join request will cause the local port to be added to the
   multicast group. It also enables the SM to program IB switches and
   routers with the new multicast information to ensure the correct
   forwarding of multicast packets for the group.

   When a node leaves an IP multicast group, it SHOULD make a
   "FullMember" leave request to the SA. This gives the SM an
   opportunity to update relevant forwarding information, to delete an
   IB multicast group if the local port is the last FullMember to
   leave, and to free up the MLID allocated for it. The specific
   algorithm is implementation-dependent and is out of the scope of
   this document.

   Note that for an IPoIB link that spans more than one IB subnet
   connected by IB routers, adequate multicast forwarding support at
   the IB level is required for multicast packets to reach listeners on
   a remote IB subnet. The specific mechanism for this is beyond the
   scope of IPoIB.

12.0 IP Multicast Routing

   IP multicast routing requires multicast routers to receive a copy of
   every link multicast packet on a locally connected link [IPMULT,
   IP6MLD]. For Ethernet this is usually achieved by turning on the
   promiscuous multicast mode on a locally connected Ethernet
   interface.

   IBA does not provide any hardware support for promiscuous multicast
   mode. Fortunately a promiscuous multicast mode can be emulated in
   the software running on a router through the following steps.

   A) Obtain a list of all active IB multicast groups from the local
   SA.

   B) Make a "NonMember" join request to the SA for every group that
   has a signature in its MGID matching the one for either IPv4 or
   IPv6.

   C) Subscribe to the IB multicast group creation events using a
   wildcarded MGID so that the router can "NonMember" join all IB
   multicast groups created subsequently for IPv4 or IPv6.

   The "NonMember" join has the same effect as a "FullMember" join
   except that the former is not counted as a member of the multicast
   group for purposes of group creation or deletion. That is, when the
   last "FullMember" leaves a multicast group, the group can be safely
   deleted by the SA without regard to any "NonMember" routers.

13.0 New Types of Vulnerability in IB Multicast

   Many IB multicast functions are subject to failures due to a number
   of possible resource constraints. These include the creation of IB
   multicast groups, the join calls ("SendOnlyNonMember", "FullMember",
   and "NonMember"), and the attaching of a QP to a multicast group.

   In general, the occurrence of these failure conditions is highly
   implementation dependent, and is believed to be rare. Usually a
   failed multicast operation at the IB level can be propagated back to
   the IP level, causing the original operation to fail, and the
   initiator of the operation to be notified. But some IB multicast
   functions are not tied to any foreground operation, making their
   failures hard to detect. E.g., if an IP multicast router attempts to
   "NonMember" join a newly created multicast group in the local
   subnet, but the join call fails, packet forwarding for that
   particular multicast group will likely fail silently, that is,
   without the knowledge of local multicast senders. This type of
   problem can add more vulnerability to the already unreliable IP
   multicast operations.

   Implementations should log error messages upon any failure from an
   IB multicast operation.
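
   As an informal illustration only, a router's "NonMember" join pass
   could log each failed join instead of letting it go unnoticed, as
   in the sketch below. The function name and both callables passed to
   it are hypothetical: is_ip_group stands for the MGID signature
   check, and nonmember_join stands for the SA join request, which is
   assumed here to raise OSError on failure.

```python
import logging

log = logging.getLogger("ipoib.mcast")


def nonmember_join_all(active_mgids, is_ip_group, nonmember_join):
    """NonMember-join every IPv4/IPv6 group and log each failure.

    Returns the list of MGIDs whose join failed so that the caller can
    retry or raise an alarm.
    """
    failed = []
    for mgid in active_mgids:
        if not is_ip_group(mgid):
            continue  # not an IPoIB group; leave it alone
        try:
            nonmember_join(mgid)
        except OSError as exc:
            # Forwarding for this group would otherwise fail silently.
            log.error("NonMember join failed for MGID %s: %s", mgid, exc)
            failed.append(mgid)
    return failed
```

   The same function can be called with a single MGID when a group
   creation event arrives, so that startup and later events share one
   logging path.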
   Network administrators should be aware of this vulnerability, and
   reserve enough multicast resources at the points where IP multicast
   will be used heavily. E.g., HCAs with ample multicast resources
   should be used at any IP multicast router.

14.0 Security Considerations

   This document specifies IP transmission over a multicast network.
   Any network of this kind is vulnerable to a sender claiming
   another's identity and forging traffic or eavesdropping. It is the
   responsibility of the higher layers or applications to implement
   suitable counter-measures if this is a problem.

   Successful transmission of IP packets depends on the correct setup
   of the IPoIB link, creation of the broadcast GID, creation of the
   QP and its attachment to the broadcast-GID, and the correct
   determination of various link parameters such as the LID, service
   level, path rate, etc. These operations, many of which involve
   interactions with the SM/SA, MUST be protected by the underlying
   operating system. This is to prevent malicious, non-privileged
   software from hijacking important resources and configurations.

   Controlled Q_Keys SHOULD be used in all transmissions. This is to
   prevent non-privileged software from fabricating IP datagrams.

15.0 Acknowledgments

   The authors would like to thank Bruce Beukema, David Brean, Dan
   Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
   Erik Nordmark, Greg Pfister, Jim Pinkerton, Renato Recio, Kevin
   Reilly, Kanoj Sarcar, Satya Sharma, Madhu Talluri, and David L.
   Stevens for their suggestions and many clarifications on the IBA
   specification.

16.0 References

   [AARCH]      Hinden, R. and S. Deering, "IP Version 6 Addressing
                Architecture", RFC 3513.

   [DISC]       Narten, T., Nordmark, E. and W. Simpson, "Neighbor
                Discovery for IP Version 6 (IPv6)", RFC 2461.

   [HOSTS]      Braden, R., "Requirements for Internet Hosts --
                Communication Layers", RFC 1122.

   [IBTA]       InfiniBand Architecture Specification,
                www.infinibandta.org.

   [IGMP2]      Fenner, W., "Internet Group Management Protocol,
                Version 2", RFC 2236.

   [IP6MLD]     Deering, S., Fenner, W. and B. Haberman, "Multicast
                Listener Discovery (MLD) for IPv6", RFC 2710.

   [IPMULT]     Deering, S., "Host Extensions for IP Multicasting",
                RFC 1112.

   [IPoIB_ARCH] Kashyap, V., "IP over InfiniBand (IPoIB) Architecture",
                draft-ietf-ipoib-architecture-03.txt.

   [IPV6]       Deering, S. and R. Hinden, "Internet Protocol,
                Version 6 (IPv6) Specification", RFC 2460.

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", RFC 2119.

   [ARP]        Plummer, D.C., "Ethernet Address Resolution Protocol",
                RFC 826.

   [IANA]       Internet Assigned Numbers Authority, www.iana.org

   [IANA_ARP]   www.iana.org/assignments/arp-parameters

17.0 Authors' Addresses

   H.K. Jerry Chu

   17 Network Circle, UMPK17-201
   Menlo Park, CA 94025
   USA
   Phone: +1 650 786 5146
   Email: jerry.chu@sun.com

   Vivek Kashyap

   15350, SW Koll Parkway
   Beaverton, OR 97006
   USA
   Phone: +1 503 578 3422
   Email: vivk@us.ibm.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.