draft-ietf-ipoib-architecture-01.txt
INTERNET DRAFT                                          Vivek Kashyap
                                                                  IBM
Expiration Date: June 15, 2002                      December 15, 2001

               IP over InfiniBand (IPoIB) Architecture

Status of this memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as ``work in progress''.

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html

This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand is a high speed, channel based interconnect between
systems and devices.

This document presents an overview of the InfiniBand
architecture. It further describes the requirements and
guidelines for the transmission of IP over InfiniBand.
Discussions in this document are applicable to both IPv4 and
IPv6 unless explicitly specified. The encapsulation of IP over
InfiniBand and the mechanism for IP address resolution on IB
fabrics will be described in separate documents.

Table of Contents

1.0 Introduction to InfiniBand
    1.1 InfiniBand Architecture Specification
    1.2 Overview of InfiniBand Architecture
        1.2.1 InfiniBand Addresses
            1.2.1.1 Unicast GIDs
            1.2.1.2 Multicast GIDs
        1.2.2 InfiniBand Multicast Groups
2.0 Management of InfiniBand subnet
3.0 IP over IB requirements
    3.1 InfiniBand as link layer
    3.2 Multicast support
        3.2.1 Mapping IP multicast to IB multicast
        3.2.2 Transient flag in IB MGIDs
    3.3 IP subnets across IB subnets ?
    3.4 Multicast address to LID mapping
4.0 IP subnets in InfiniBand fabrics
    4.1 IPoIB VLANs
    4.2 Multicast in IPoIB subnets
        4.2.1 Sending IP multicast datagrams
        4.2.2 Receiving multicast packets
            4.2.2.1 Impact of InfiniBand Architecture Limits
        4.2.3 Leaving/Deleting a multicast group
5.0 QoS and related issues
6.0 Security Considerations
7.0 Acknowledgement
8.0 References
9.0 Author's address

1.0 Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop
an I/O specification to deliver a channel based, switched
fabric technology. The InfiniBand standard is aimed at meeting
the requirements of scalability, reliability, availability and
performance of servers in data centers.

1.1 InfiniBand Architecture Specification

The InfiniBand Trade Association specification is available
for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

For a more complete overview the reader is referred to
chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network
(SAN) for connecting multiple independent processor platforms,
I/O platforms and I/O devices. The IBA SAN is a communications
and management infrastructure supporting both I/O and
inter-processor communications for one or more computer
systems.

An IBA SAN consists of processor nodes and I/O units connected
through an IBA fabric made up of cascaded switches and IB
routers (connecting IB subnets). I/O units can range in
complexity from single ASIC IBA attached devices such as a LAN
adapter to a large memory rich RAID subsystem.

An IBA network may be subdivided into subnets interconnected
by routers. These are IB routers and IB subnets and not IP
routers or IP subnets. This document will refer to InfiniBand
routers and subnets as 'IB routers' and 'IB subnets'
respectively.
The IP routers and IP subnets will be referred
to as 'routers' and 'subnets' respectively.

Each IB node or switch may attach to one or more switches, or
they may attach directly to each other. Each IB unit interfaces
with the link by way of channel adapters (CAs). The
architecture supports multiple CAs per unit with each CA
providing one or more ports that connect to the fabric. Each
CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent.
However, each of the ports may include multiple QPs (queue
pairs) that may be directly addressed from a remote peer. From
the point of view of data transfer the QP number (QPN) is part
of the address.

IBA supports both connection oriented and datagram service
between the ports. The peers are identified by QPN and the
port identifier. There are two exceptions. QPNs are not used
when packets are multicast. QPNs are also not used in the raw
datagram mode.

A port, in a data packet, is identified by a local ID (LID)
and optionally a Global ID (GID). The GID in the packet is
needed only when communicating across IB subnets though it
may always be included.

The GID is 128 bits long and is formed by the concatenation of
a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant
portion (GUID). The LID is a 16 bit value that is assigned
when the port becomes active. Note that the GUID is the only
persistent identifier of a port. However, it cannot be used as
an address in a packet. If the prefix is modified then the GID
may change. The subnet manager may attempt to keep the LID
values constant across reboots but that is not a requirement.

The assignment of the GID and the LID is done by the subnet
manager. Every IB subnet has at least one subnet manager
component that controls the fabric. It assigns the LIDs and
GIDs.
The subnet manager (SM) also programs the switches so that
they route packets between destinations. The subnet manager
and a related component, the subnet administrator (SA), are the
central repository of all information that is required to
set up and bring up the fabric.

IB routers are components that route packets between IB
subnets based on the GIDs. Thus within an IB subnet a packet
may or may not include a GID but when going across IB
subnets the GID must be included. A LID is always needed in a
packet since the destination within a subnet is determined by
it.

A CA and a switch may have multiple ports. Each CA port is
assigned its own LID or a range of LIDs. The ports of a switch
are not addressable by LIDs/GIDs or, in other words, are
transparent to other end nodes. Each port has its own set of
buffers. The buffering is channeled through virtual lanes (VLs)
where each VL has its own flow control. There may be up to 16
VLs.

VLs provide a mechanism for creating multiple virtual links
within a single physical link. All ports must support VL15
which is reserved exclusively for subnet management datagrams
and hence doesn't concern the IPoIB discussions. The actual VL
that a packet uses is configured by the SM in the
switch/channel adapter tables and is determined based on the
Service Level (SL) specified in every packet. There are 16
possible SLs.

In addition to the features described above viz. Queue
Pairs (QPs), Service Levels (SLs) and addressing (GID/LID), IBA
also defines the following:

Partitioning:

    Every packet, but for the raw datagrams, carries the
    partition key (P_Key). These values are used for
    isolation in the fabric. A switch (this is an optional
    feature) may be programmed by the SM to drop packets
    not having a certain key. The CA ports always check
    for the P_Keys. A CA port may belong to multiple
    partitions.
    P_Key checking is optional at IB routers.

Q_Keys:

    These are used to enforce access rights for reliable
    and unreliable IB datagram services. Raw datagram
    services don't use Q_Keys. At communication
    establishment the endpoints exchange the Q_Keys and
    must always use the relevant Q_Keys when communicating
    with one another. Multicast packets use the Q_Key
    associated with the multicast group.

Multicast support:

    A switch may support multicasting i.e. replication of
    packets across multiple output ports. This is an
    optional feature. Similarly, support for
    sending/receiving multicast packets is optional in
    CAs. A multicast group is identified by a GID. The GID
    format is as defined in [RFC_2373] on IPv6 addressing.
    Thus from the point of view of IPv6 over InfiniBand the
    data link multicast address looks like the network
    address. To receive multicast packets an IB node must
    explicitly join a multicast group by sending a request
    to the SM. A node may send packets to any
    multicast group. In both cases the multicast LID to be
    used in the packets is received from the SM.

There are 6 methods for data transfer in IB architecture.
These are:

1. Unreliable Datagram (unacknowledged - connectionless)

    The UD service is connectionless and unacknowledged.
    It allows the QP to communicate with any unreliable
    datagram QP on any node.

    The switches and hence each link can support only a
    certain MTU. The supported MTU values are 256, 512,
    1024, 2048 and 4096 bytes. A UD packet cannot
    be larger than the smallest link MTU between the two
    peers.

2. Reliable Datagram (acknowledged - multiplexed)

    The RD service is multiplexed over connections between
    nodes called End to End Contexts (EECs) which allow
    each RD QP to communicate with any RD QP on any node
    with an established EEC.
    Multiple QPs can use the same
    EEC and a single QP can use multiple EECs (one for
    each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

    The RC service associates a local QP with one and only
    one remote QP. The message sizes may be as large as
    2^31 bytes in length. The CA implementation takes care
    of segmentation and reassembly.

4. Unreliable Connected (unacknowledged - connection oriented)

    The UC service associates one local QP with one and
    only one remote QP. There is no acknowledgment and
    hence no resend of lost or corrupted packets. Such
    packets are therefore simply dropped. It is similar to
    RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

    The Ethertype raw datagram packet contains a generic
    transport header that is not interpreted by the CA but
    it specifies the protocol type. The values for
    ethertype are the same as defined in RFC1700 for
    ethertype.

6. Raw IPv6 (unacknowledged - connectionless)

    Using IPv6 raw datagram service, the IBA CA can
    support standard protocol layers atop IPv6 (such as
    TCP/UDP). Thus native IPv6 packets can be bridged into
    the IBA SAN and delivered directly to a port and to
    its IPv6 raw datagram QP.

The first 4 types are referred to as IB transports. The latter
two are classified as Raw datagrams. There is no indication of
the QP number in the raw datagram packets. The raw datagram
packets are limited by the link MTU in size.

The two connected modes and the reliable datagram mode may
also support 'Automatic Path Migration (APM)'. This is an
optional facility that provides for a hardware based path
failover. An alternate path is associated with the QP when the
connection/EE context is first created. If unrecoverable
errors are encountered the connection switches to using the
alternate path.
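
As an illustration of the addressing rules given earlier in this
section, the following Python sketch forms a GID by concatenating a
64 bit IB subnet prefix with a 64 bit GUID, and derives the largest
usable UD packet size as the smallest link MTU on a path. It is a
minimal sketch only: the function names and the sample GUID are
hypothetical and are not taken from the InfiniBand specification.

```python
import ipaddress

# Link MTU values listed in this section (bytes).
IB_LINK_MTUS = (256, 512, 1024, 2048, 4096)


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64 bit subnet prefix and a 64 bit GUID into a GID."""
    for name, value in (("subnet_prefix", subnet_prefix), ("guid", guid)):
        if not 0 <= value < (1 << 64):
            raise ValueError(f"{name} must fit in 64 bits")
    # The prefix occupies the upper 64 bits, the GUID the lower 64 bits.
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


def ud_path_mtu(link_mtus):
    """Return the largest UD packet size usable across the given links."""
    link_mtus = list(link_mtus)
    if not link_mtus:
        raise ValueError("at least one link is required")
    for mtu in link_mtus:
        if mtu not in IB_LINK_MTUS:
            raise ValueError(f"{mtu} is not a listed IB link MTU")
    # A UD packet cannot exceed the smallest link MTU between the peers.
    return min(link_mtus)


if __name__ == "__main__":
    # Hypothetical GUID combined with the link local prefix.
    print(make_gid(0xFE80000000000000, 0x0011223344556677))
    print(ud_path_mtu([2048, 4096, 1024]))
```

Because the GID depends on the prefix, changing the prefix passed to
make_gid() changes the GID while the GUID stays the same, which
mirrors the persistence remark made above.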

1.2.1 InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6
architecture in terms of the InfiniBand subnet structure and
global identifiers (GIDs).

The InfiniBand architecture defines the global identifier
associated with a port as follows:

    GID (Global Identifier): A 128-bit unicast or
    multicast identifier used to identify a port on a
    channel adapter, a port on a router, a switch, or a
    multicast group. A GID is a valid 128-bit IPv6
    address (per RFC 2373) with additional
    properties/restrictions defined within IBA to
    facilitate efficient discovery, communication, and
    routing.

    Note: These rules apply only to IBA operation and do
    not apply to raw IPv6 operation unless specifically
    called out.

The raw IPv6 operation referred to in the note in the
definition above is the IPv6 mode of InfiniBand's raw datagram
service. It does not mean IPv6 itself. The routers and
switches referred to in the above definition are the
InfiniBand routers and switches.

The InfiniBand (IB) specification defines two types of GIDs:
unicast and multicast.

1.2.1.1 Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes.

The IB specification states:

a. link local:  This is defined to be FE80/10.

                The IB routers will not forward packets
                with a link local address in source or
                destination beyond the IB subnet.

b. site local:  FEC0/10

                A unicast GID used within a collection
                of subnets which is unique within that
                collection (e.g. a data center or
                campus) but is not necessarily globally
                unique. IB routers must not forward any
                packets with either a site-local Source
                GID or a site-local Destination GID
                outside of the site.

c. global:      A unicast GID with a global prefix,
                i.e. an IB router may use this GID to
                route packets throughout an enterprise
                or internet.
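
The three unicast scopes above can be sketched as a simple prefix
check. The Python below is illustrative only: the function names
are hypothetical, and treating every unicast GID that is neither
link local nor site local as global is an assumption made for the
sketch rather than a statement from the IB specification.

```python
import ipaddress

LINK_LOCAL = ipaddress.IPv6Network("fe80::/10")   # FE80/10
SITE_LOCAL = ipaddress.IPv6Network("fec0::/10")   # FEC0/10
MULTICAST = ipaddress.IPv6Network("ff00::/8")     # FFxy:<112 bits>


def unicast_gid_scope(gid: str) -> str:
    """Classify a unicast GID as link-local, site-local or global."""
    addr = ipaddress.IPv6Address(gid)
    if addr in MULTICAST:
        raise ValueError("not a unicast GID")
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in SITE_LOCAL:
        return "site-local"
    # Assumption: anything else carries a global prefix.
    return "global"


def may_leave_ib_subnet(src_gid: str, dst_gid: str) -> bool:
    """IB routers do not forward link local traffic beyond the IB subnet."""
    scopes = (unicast_gid_scope(src_gid), unicast_gid_scope(dst_gid))
    return "link-local" not in scopes


if __name__ == "__main__":
    for gid in ("fe80::1", "fec0::1", "2001:db8::1"):
        print(gid, unicast_gid_scope(gid))
```

The site boundary check for site-local GIDs is not modelled here
since it depends on how a site is configured.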

1.2.1.2 Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses.
The IB specification defines the multicast GIDs as follows:

    FFxy:<112 bits>

Flag bits:

The nibble denoted by x above holds the 4 flag bits: 000T. The
first three bits are reserved and are set to zero. The last
bit is defined as follows:

    T=0: denotes a permanently assigned i.e. well known GID
    T=1: denotes a transient group

Scope bits:

The 4 bits, denoted by y in the GID above, are the scope bits.
These scope values are described in Table 1.

    scope value      Address value

    0                Reserved
    1                Unassigned
    2                Link-local
    3                Unassigned
    4                Unassigned
    5                Site-local
    6                Unassigned
    7                Unassigned
    8                Organization-local
    9                Unassigned
    0xA              Unassigned
    0xB              Unassigned
    0xC              Unassigned
    0xD              Unassigned
    0xE              Global
    0xF              Reserved

                     Table 1

The IB specification further refers to [RFC_2373] and
[RFC_2375] while defining the well known multicast addresses.
However, it then states that the well known addresses apply to
IB raw IPv6 datagrams only. It must be noted though that a
multicast group can be associated with only a single MGID.
Thus the same MGID cannot be associated with the UD mode and
the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

IB multicast groups (multicast GIDs) are managed by the subnet
manager (SM). The SM explicitly programs the IB switches in the
fabric to ensure that the packets are received by all the
members of the multicast group.

A multicast group is created by sending a create request to
the SM. The subnet manager records the group's multicast GID
and the associated characteristics.
The group characteristics
are defined by the group path MTU, whether the group will be
used for raw datagrams or unreliable datagrams, the service
level, the partition key associated with the group, the
LID (local identifier) associated with the group etc. These
characteristics are defined at the time of the group creation.
The interested reader may look up the 'MCGroupRecord' attribute
in the IB architecture specification [IB_ARCH].

The LID is associated with the multicast group by the subnet
manager (SM) at the time of the multicast group creation. The
SM determines the multicast tree based on all the group
members and programs the relevant switches. The multicast LID
is used by the switches to route the packets.

Any member IB node wanting to participate in the multicast
group must join the group. As part of the join operation the
node is returned the group characteristics. At the same time
the subnet manager ensures that the requester can indeed
participate in the group by verifying that it can support the
group MTU and that it can reach the rest of the group members.
Other group characteristics may need verification too.

The SM, for groups that span IB subnet boundaries, must
interact with IB routers to determine the presence of this
group in other IB subnets. If present the MTU must match
across the IB subnets.

P_Key is another characteristic that must match across IB subnets
since the P_Key inserted into a packet is not modified by the
IB switches or IB routers. Thus if the P_Keys didn't match the
IB router(s) themselves might drop the packets or destinations on
other subnets might drop the packets.

These characteristics are returned to the IB endnode that
joins the multicast group. A join operation may cause the SM
to reprogram the fabric so that the new member can participate
in the multicast group.
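
The multicast GID layout of section 1.2.1.2 (FFxy followed by 112
bits) can be decoded as sketched below. The Python is a minimal
illustration: the function name and the returned dictionary are
hypothetical, and only the flag and scope nibbles described in this
document are interpreted.

```python
import ipaddress

# Scope values named in Table 1; 0 and 0xF are reserved and all other
# values are unassigned.
NAMED_SCOPES = {
    0x2: "Link-local",
    0x5: "Site-local",
    0x8: "Organization-local",
    0xE: "Global",
}
RESERVED_SCOPES = (0x0, 0xF)


def parse_mgid(mgid: str) -> dict:
    """Extract the transient flag and the scope from a multicast GID."""
    raw = ipaddress.IPv6Address(mgid).packed
    if raw[0] != 0xFF:
        raise ValueError("a multicast GID starts with 0xFF")
    flags = raw[1] >> 4      # the nibble x: 000T
    scope = raw[1] & 0x0F    # the nibble y: scope bits
    if flags & 0b1110:
        raise ValueError("the three reserved flag bits must be zero")
    if scope in NAMED_SCOPES:
        scope_name = NAMED_SCOPES[scope]
    elif scope in RESERVED_SCOPES:
        scope_name = "Reserved"
    else:
        scope_name = "Unassigned"
    return {"transient": bool(flags & 0b0001), "scope": scope_name}


if __name__ == "__main__":
    print(parse_mgid("ff12::1"))
    print(parse_mgid("ff0e::1"))
```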

2.0 Management of InfiniBand subnet

To aid in the monitoring and configuration of InfiniBand
subnet components a set of MIBs needs to be defined. MIBs are
needed for the channel adapters, InfiniBand interfaces,
InfiniBand subnet manager, InfiniBand subnet management agents
and to allow the management of specific device properties. It
must be noted that the management objects addressed in the
IPoIB documents are for all of the IB subnet components and
are not limited to IP (over IB). The relevant MIBs will be
described in separate documents.

3.0 IP over IB requirements

As described in section 1.0, the InfiniBand architecture
provides a broad set of capabilities to choose from when
implementing IP over InfiniBand networks.

The IPoIB specification MUST NOT require changes in IP and
higher layer protocols. Nor should it mandate requirements on
IP stacks to implement special user level programs. It is an
aim that the IPoIB changes be amenable to modularisation and
incorporation into existing implementations at the same level
as other media types.

3.1 InfiniBand as link layer

InfiniBand architecture provides multiple methods of data
exchange between two endpoints as was noted above. These are:

    Reliable Connected   (RC)
    Reliable Datagram    (RD)
    Unreliable Connected (UC)
    Unreliable Datagram  (UD)
    Raw Datagram : Raw IPv6      (R6)
                 : Raw Ethertype (RE)

IPoIB can be implemented over any one, several or all of these
services. A case can be made for support on any of the
transport methods depending on the desired features.

The IB specification requires Unreliable Datagram mode to be
supported by all the IB nodes. The host channel adapters (HCAs)
are specifically required to support Reliable Connected (RC) and
Unreliable Connected (UC) modes but the same is not the case
with target channel adapters (TCAs).
Support for the two Raw
Datagram modes is entirely optional. The Raw Datagram mode
supports a 16-bit CRC as against the better protection
provided by the use of a 32-bit CRC in other modes.

For the sake of simplicity, ease of implementation and
integration with existing stacks, it is desirable that the
fabric support multicasting. This is possible only in
Unreliable Datagram (UD) and IB's Raw Datagram modes.

Thus it is only the UD mode that is universal, supports
multicast, and has a robust CRC. Given these conditions an
IP stack MUST support IP over the UD transport mode of
InfiniBand.

Unreliable datagrams, however, are limited by the link MTU. The
connected modes, in contrast to this limitation, can offer
significant benefit in terms of performance by utilising a
larger MTU. Reliability is also enhanced if the underlying
feature of automatic path migration of connected modes is
utilised. An implementation MAY choose to provide IP over
non-UD transport modes in addition to the mandatory IP over UD
function.

InfiniBand communication is addressed to a QP at a port.
Therefore the IPoIB interface is identified by the port
identifier as well as a QP that is associated with the
interface. The address resolution process for IPoIB MUST also
determine the associated QPN along with determining the port
identifier.

An interface MAY be associated with multiple QPNs. This
provides a mode of implementation wherein a single IP address
is associated with different QPNs. Such an association may be
used to demultiplex the incoming packets based on the QPN
avoiding or reducing the upper-layer port based lookup. An
implementation may choose to support such a function.

The methods of implementation of the above modes of IP over
InfiniBand will be investigated and described in other
documents.
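
The reasoning in this section that leads to UD as the mandatory
mode can be restated as a filter over the properties of each
transfer method. The Python below is a sketch under stated
assumptions: the class and field names are hypothetical, and the
entry for Reliable Datagram is marked as not universally required
only because this document does not list it as required on all
nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str
    required_on_all_nodes: bool   # must every IB node support it?
    multicast: bool               # can it carry multicast?
    crc_bits: int                 # strongest CRC protecting the packet


TRANSPORTS = (
    Transport("RC", False, False, 32),  # required on HCAs, not on TCAs
    Transport("RD", False, False, 32),  # assumption: not stated as required
    Transport("UC", False, False, 32),  # required on HCAs, not on TCAs
    Transport("UD", True, True, 32),    # required on all IB nodes
    Transport("R6", False, True, 16),   # raw datagram modes are optional
    Transport("RE", False, True, 16),
)


def baseline_candidates(transports=TRANSPORTS):
    """Return the modes that are universal, multicast capable and 32-bit CRC."""
    return [
        t.name
        for t in transports
        if t.required_on_all_nodes and t.multicast and t.crc_bits == 32
    ]


if __name__ == "__main__":
    print(baseline_candidates())
```

Dropping any one of the three conditions admits further modes, which
is why the connected modes are left as an optional addition rather
than a replacement for UD.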

3.2 Multicast support

The InfiniBand specification makes support of multicasting in the
switches optional. It is RECOMMENDED that multicast switches
be used in IPoIB subnets. Lack of multicast capable switches
however doesn't mean that multicasting cannot be supported. In
such a case the underlying IB layer MUST emulate multicast
while ensuring that it is transparent to the IP stack.

The translation from IP addresses to IB MGIDs must be
independent of the IB fabric's multicast capability.

3.2.1 Mapping IP multicast to IB multicast

Well known IP multicast groups are defined for both IPv4 and
IPv6 (RFC_1700, RFC_2373). Multicast groups may also be
dynamically created at any time. To avoid creating unnecessary
duplicates of multicast packets in the fabric, and to avoid
unnecessary handling of such packets at the hosts, each of
the IP multicast groups needs to be associated with a
different IB multicast group.

A process MUST be defined for mapping the IP multicast
addresses to unique IB multicast addresses. Every IPoIB node
MUST be capable of making this mapping decision
independently.

3.2.2 Transient flag in IB MGIDs

The IB specification describes the flag bits as discussed in
section 1.2.1.2. The IB specification also defines some well known
IB multicast GIDs (MGIDs). These MGIDs are reserved for IB's
Raw datagram mode which is incompatible with the other
transports of IB. Any mapping that is defined from IP
multicast addresses therefore MUST NOT fall into IB's
definition of a well-known address.

Therefore all IPoIB related multicast GIDs will always set the
transient bit.

3.3 IP subnets across IB subnets ?

Some implementations may desire to support multiple clusters
of machines in their own IB subnets but otherwise part of a
common IP subnet. For such a solution the IB specification
needs multiple upgrades.
Some of the required enhancements
are:

1) A method for creating IB multicast GIDs that span multiple
   IB subnets. The partition keys and other parameters need to
   be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology
   across IB subnets.

3) The process and protocols needed between IB nodes
   and IB routers.

Until the above conditions are met it is not possible to
implement IPoIB subnets that span IB subnets. The IPoIB
standards can however be defined with this possibility in
mind.

3.4 Multicast address to LID mapping

In a generic LAN setup the IP multicast addresses are directly
mapped to a link layer multicast address. In the case of
InfiniBand this is only partly true. A mapping of multicast IP
to IB MGIDs can be standardised. But the IPoIB driver on the
host must determine the LID that needs to be used when sending
to the particular multicast group.

A mapping from the IP multicast address or the corresponding
IB multicast group to a LID is not required for the
following reasons:

1) Sending/receiving IP multicast

    An IB node cannot be assured of its packets
    reaching all the multicast members without itself
    joining the IB multicast group. This is because the
    relevant switches are programmed by the IB subnet
    manager only on receiving a join request.

    Thus the sender/receiver will always have to join
    the IB multicast groups and keep track of the
    groups it has already joined. Mapping directly to
    the LID doesn't help if the group has not been
    joined.

    Thus the implementation is required to keep track
    of the IB groups joined. It can therefore also
    record the corresponding LID removing the need to
    map the IP multicast address to the LID.

2) Reduction of LID conflicts

    The LIDs in the range 0xC000 to 0xFFFE are
    designated as the multicast LIDs by IBA.
    This limits the range to 2^14 - 1 entries (16383
    entries). With 2^28 possible IPv4 multicast
    addresses, this implies that about 2^14 or 16K IPv4
    multicast groups could map to a single LID. It is
    better to let the SM decide on a more efficient
    usage of the multicast LID space.

3) SM and IB architecture should stay unaffected.

    A mapping of the LIDs can conflict with the subnet
    manager (SM) implementations. The SM is under no
    restrictions to choose a particular LID for any
    multicast group. Thus it could end up utilising a
    LID that maps from an IP multicast address for some
    other multicast group since not everything on IB
    subnets is governed by the IPoIB rules.

4) No need to plan for LID conflicts

    Allowing the SM to decide on the LIDs also avoids
    having to come up with a solution to handle LID
    conflicts with other multicast groups.

Thus it is best to avoid such a mapping and leave it to the
individual implementations to determine the LID from the SM.
There is no extra work involved in this determination since
the SM has to be contacted anyway for the IB multicast group
join/create operations.

IPoIB will not standardise IP multicast addresses to LID
mapping.

4.0 IP subnets in InfiniBand fabrics

The IPoIB subnet is overlaid over the IB subnet. The IPoIB
subnet is brought up in the following steps:

Note: the join/leave operations at the IP level will be
referred to as IP_join/IP_leave and the join/leave
operations at the IB level will be referred to as
IB_join/IB_leave in this document.

1. The all-IP nodes group is created

The fabric administrator creates the IB multicast group
corresponding to the all-IP nodes/IPv4 broadcast (henceforth
called 'broadcast group') when the IPv6/IPv4 subnet is set up.
The method by which the broadcast group is set up is not
defined by IPoIB.

2. All IPoIB interfaces IB_join the broadcast group

The administrator chooses the parameters that are valid for
the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass
and the MTU. All multicast packets in the IP subnet must use
these values. Therefore any other multicast groups set up in
the IPoIB subnet MUST be set up with these attributes. In the
future, as the IB specification associates more meaning with
the various values and defines IB QoS, different values for IP
multicast traffic may be possible.

The IB_join of the broadcast group by the IPoIB nodes builds
the IPoIB subnet. The broadcast group defines the span and the
members of the IPoIB subnet. The IB_join to the broadcast
group has the additional benefit of distributing these values
to all the members of the subnet.

The IP interface MTU for the IP over Unreliable Datagram
interface is the path MTU value returned when the broadcast
MGID is joined. This is the largest MTU that can be used
across the IPoIB subnet without fragmenting. The IPoIB
specification for IP over non-UD modes of transmission MUST
also define the MTU that can be used with it. The IP over
non-UD implementation may require other parameters to be
determined and exchanged in addition to the MTU.

4.1 IPoIB VLANs

The endpoints in an IB subnet must have compatible P_Keys to
communicate with one another. Thus the administrator when
setting up an IP subnet over an IB subnet must ensure that all
the members have compatible P_Keys. An IP subnet can have only
one P_Key associated with it to ensure that all IP nodes in it
can talk to one another. An endpoint may however have multiple
P_Keys.

The IB architecture specifies that there can be only one MGID
associated with a multicast group in the IB subnet. The P_Key
can be included in the MGID mappings from the IP multicast
addresses.
Since the P_Key is unique in the IB subnet, the
inclusion of the P_Key in the IB MGIDs ensures that unique MGID
mappings are created. Every unique broadcast group MGID so
formed creates a separate abstract IPoIB link and hence an
IPoIB VLAN.

It is an implementation choice how the P_Key related to the
IPoIB subnet is determined by the IP stack. It could be a
configuration parameter initialised by some means by the
administrator. The method employed by an implementation to
determine the P_Key is beyond the scope of IPoIB.

4.2 Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts
and rules as on any other media. However, unlike most other
media, multicast over InfiniBand requires interaction with
another entity, the IB subnet manager. This section describes
the outline of the process and suggests some guidelines.

IB architecture specifies the following format for IB
multicast packets when used over unreliable datagram (UD)
mode:

+--------+-------+---------+---------+-------+---------+---------+
|Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header |Header   |Transport| (IP)  |         |         |
|        |       |         |Header   |       |         |         |
+--------+-------+---------+---------+-------+---------+---------+

For details about the various headers please refer to the
InfiniBand Architecture Specification [IB_ARCH].

The Global routing header (GRH) includes the IB multicast group
GID. The Local routing header (LRH) includes the local
identifier (LID). The IB switches in the fabric route the
packet based on the LID.

The GID is made available to the receiving IB user (the IPoIB
interface driver for example). The driver can therefore
determine the IB group the packet belongs to.

IPv4 defines three levels of multicast compliance.
These are:

   Level 0: No support for IP multicasting

   Level 1: Support for sending but not receiving multicasts

   Level 2: Full support for IP multicasting

In IPv6 there is no such distinction. Full multicast support
is mandatory. Additionally, all IPv4 subnets support
broadcast (255.255.255.255). IPv4 broadcast can always be
sent/received by all IPv4 interfaces.

Every IPoIB subnet requires the broadcast GID to be defined.
Thus a packet can always be broadcast.

4.2.1 Sending IP multicast datagrams

An IP host may send a multicast packet at any time to any
multicast address.

The IP layer conveys the multicast packet to the IPoIB
interface driver/module. This module attempts to IB_join the
relevant IB multicast group. This is required since otherwise
the InfiniBand architecture does not guarantee that the packet
will reach its destinations.

The subnet manager builds a logical tree across the
participating switches/IB routers to ensure that the multicast
packet is received by all the members of the multicast group.
The IB_join operation causes the SM to rebuild/modify this
routing tree to include the new endnode. It may have to
(re)program some of the switches and IB routers to reflect the
new topology. Therefore if the IB_join is not done there is a
possibility that the fabric will fail to deliver the packet to
some or all of the recipients.

If the multicast group does not exist the IB_join will fail.
This can imply that there are no listeners on the subnet and
the router doesn't expect to forward packets received on this
group. However, this may not be the case. The IB group may not
exist because the SM ran out of resources or the SM policy
allows only a limited set of multicast groups to be created.
Additionally it is not reasonable to expect the router to
create IB groups for all the IP multicast addresses that it
may be called upon to forward. It must be noted that, unlike
many other media, IBA does not have a promiscuous mode in which
the router can accept all the packets.

Therefore, the multicast module of the IPoIB interface, when
sending a multicast packet, needs to do one of the following:

   1) Join the IB multicast group corresponding to the IP
      multicast address. This is the RECOMMENDED option
      for multicast if the sender is itself a member of
      the IP multicast group.

      As noted earlier, a particular IB multicast group
      may not exist for some reason. In such a case the
      implementation MUST fall back to one of the
      following methods.

   2) Send the multicast packet out with the
      IB MGID/MLID associated with the all-systems IP
      multicast address (224.0.0.1/FF02::1).

      An IPv4 implementation that fails to join the IB
      group corresponding to the IPv4 multicast address
      being sent to, as in 1) above, must fall back to
      this method or to the method given below.

   3) In IPv4 subnets, if both of the above methods fail
      then the packet MUST be sent with the IB MGID/MLID
      corresponding to the IPv4 limited broadcast
      address (255.255.255.255).

4.2.2 Receiving multicast packets

The IP host must create the IB multicast group corresponding
to the IP address and then join it. This follows from the IBA
requirement that the receiver must join the relevant IB
multicast group.

A router could create the group on receiving the IGMP/MLD
report but then the IP host would have to be informed of the
creation. Therefore, it is simpler for the IB interface module
on the IP host to first create the IB group and then send the
IGMP/MLD message to the router. The router in turn needs to
IB_join the specified IB group on receiving the IGMP/MLD
report.
This report must be sent out on the broadcast MGID to
ensure reception by the router(s).

The router MAY choose to create IB groups corresponding to the
IP groups it expects to forward.

Thus the creation of IB groups is done by IP receivers or IP
routers only and not by senders, thereby keeping things simple.
The host must first try to join the group and only on failure
attempt to create it.

4.2.2.1 Impact of InfiniBand Architecture Limits

It must be noted that if the group exists or the creation
succeeds the group will be IB_joined. However, in case the
join doesn't succeed for some reason the node can still
transmit to the multicast group using the broadcast/all-IP
nodes MGID since that is mandatory.

It may be that the IB MGID could not be created/joined because
of a transient error or a policy limit/resource constraint at
the SM. It may also be created at a later point in time. The
receiver therefore would not be in the IB MGID corresponding
to the IP address. Unfortunately there is no IB level support
to let the listener know of the new IB MGID being created.

If the underlying IB level indicates a transient failure the
listener could periodically retry to join the IB group. The
exact parameters and timers for such retries or an alternate
solution are beyond the scope of IPoIB. These parameters, if
needed, should be derived from the IB specification.

Note that multicasting can still continue since the packets
can be sent out on the broadcast MGID (and MLID). A multicast
listener won't receive any packets on this multicast address,
however, if other nodes could join the group but it
couldn't. It must be realised that such a situation is not
very likely.

An HCA or TCA may have a limit on the number of MGIDs it can
support.
Thus, even though the groups may not be limited at
the subnet manager and in the subnet as such, they may be
limited at a particular interface. It is advisable to choose
an adequately provisioned xCA when setting up an IPoIB
subnet.

4.2.3 Leaving/Deleting a multicast group

An IPv4 sender (level 1 compliance) IB_joins the IB multicast
group only because that is the only way to guarantee reception
of the packets by all the group recipients. The sender must
however IB_leave the group at some time. It is advisable that
a sender, when not a receiver on the group, start a timer per
multicast group sent to. The sender leaves the IB group when
the timer goes off. It restarts the timer if another message
is sent.

This recommendation doesn't apply to the IB broadcast group.
It also doesn't apply to the IB group corresponding to the
all-hosts multicast group. An IPv4 host must always remain a
member of the broadcast group. It MAY choose to remain a
member of the all-hosts group.

Thus a sender that chooses to always send to the broadcast
group and not to the specific multicast group does not need to
implement a timer.

An IP multicast receiver MUST IB_leave the corresponding IB
multicast group when it IP_leaves the IP multicast group. In
the case of an IPv4 implementation the receiver may choose to
continue to be a sender (level 1 compliance). It MAY choose
not to IB_leave the IB group but to start a timer as explained
above.

A router is RECOMMENDED to IB_leave the IB multicast group
when there are no members of the IP multicast address in the
subnet and it has no explicit knowledge of any need to forward
such packets.

The router and the IP hosts SHOULD NOT IB_delete the IB
multicast group when they IB_leave the group. It is possible
for the same IB multicast group to be used by a non-IP protocol.
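
The per-group sender timer recommended earlier in this section
could be sketched roughly as follows. This is only an
illustration under stated assumptions: the class name, the
method names, the use of plain strings as group identifiers and
the idle timeout value are all hypothetical, since IPoIB does
not define them. The caller supplies the current time and is
expected to perform the actual IB_leave for each group returned.

```python
class SenderLeaveTimers:
    """Idle timers for IB groups joined only in order to send (hypothetical sketch)."""

    def __init__(self, idle_timeout, exempt=()):
        self.idle_timeout = idle_timeout  # seconds; the value is an assumption
        self.exempt = set(exempt)         # groups that are never left, e.g. broadcast
        self.deadlines = {}               # group id -> time at which to IB_leave

    def on_send(self, group, now):
        """Start or restart the timer when a packet is sent to `group`."""
        if group not in self.exempt:
            self.deadlines[group] = now + self.idle_timeout

    def on_ip_join(self, group):
        """The host became a receiver on `group`; the sender timer no longer applies."""
        self.deadlines.pop(group, None)

    def expired(self, now):
        """Return (and forget) the groups whose timers have gone off.

        The caller is expected to IB_leave each returned group.
        """
        due = sorted(g for g, t in self.deadlines.items() if t <= now)
        for g in due:
            del self.deadlines[g]
        return due
```

In this sketch the broadcast group (and, if desired, the
all-hosts group) would be passed in `exempt`, matching the
exception stated above for those groups.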
The IB specification mentions an IB specific protocol that
will delete the IB groups when it determines that there are no
IB members of the group.

5.0 QoS and related issues

The IB specification suggests the use of service levels (SLs)
for load balancing, QoS and deadlock avoidance within an IB
subnet. But the IB specification leaves the usage and mode of
determination of the SL for the application to decide. The SL
and list of SLs are available in the SA but it is up to the
endnode's application to choose the 'right' value.

Every IPoIB implementation will determine the relevant SL
value based on its own policy. No method or process for
choosing the SL will be defined by the IPoIB standards.

6.0 Security Considerations

Any multicast/broadcast communication is inherently insecure
since anyone can receive the data. The applications must
implement appropriate authentication/encryption methods for
data security.

The IP subnet communication can be disrupted by creating the
IB broadcast/multicast groups with incompatible parameters.
The implementations must leverage IB specific methods to
protect against such situations.

7.0 Acknowledgement

This document has benefited from the comments and suggestions
of the members of the IPoIB working group and the members of
the InfiniBand(SM) Trade Association.

8.0 References

[IB_ARCH]   InfiniBand Architecture Specification, Volume 1.0
[RFC_2373]  IP Version 6 Addressing Architecture
[RFC_2375]  IPv6 Multicast Address Assignments
[RFC_1700]  Assigned Numbers
[RFC_1112]  Host extensions for IP multicasting
[RFC_2236]  Internet Group Management Protocol, Version 2
[RFC_2710]  Multicast Listener Discovery

9.0 Author's Address

Vivek Kashyap

IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and
furnished to others, and derivative works that comment on or
otherwise explain it or assist in its implementation may be
prepared, copied, published and distributed, in whole or in
part, without restriction of any kind, provided that the above
copyright notice and this paragraph are included on all such
copies and derivative works. However, this document itself may
not be modified in any way, such as by removing the copyright
notice or references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will
not be revoked by the Internet Society or its successors or
assigns.

This document and the information contained herein is provided
on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.