INTERNET DRAFT                                            V.Kashyap
                                                                IBM
Expiration Date: June 15, 2002                    December 15, 2001

               IP over InfiniBand(IPoIB) Architecture

Status of this memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as ``work in progress''.

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html

This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand is a high speed, channel based interconnect between
systems and devices.

This memo presents an overview of the InfiniBand architecture.
It further describes the requirements and guidelines for the
transmission of IP over InfiniBand. The encapsulation of IP
over InfiniBand and the mechanism for IP address resolution on
IB fabrics will be described in separate documents.

Table of Contents

1.0 Introduction to InfiniBand
  1.1 InfiniBand Architecture Specification
  1.2 Overview of InfiniBand Architecture
    1.2.1 InfiniBand Addresses
      1.2.1.1 Unicast GIDs
      1.2.1.2 Multicast GIDs
    1.2.2 InfiniBand Multicast Groups
2.0 Management of InfiniBand subnet
3.0 IP over IB requirements
  3.1 InfiniBand as link layer
  3.2 Multicast support
    3.2.1 Mapping IP multicast to IB multicast
    3.2.2 Transient flag in IB MGIDs
  3.3 IP subnets across IB subnets ?
  3.4 Multicast address to LID mapping
  3.5 IP encapsulation
4.0 IP subnets in InfiniBand fabrics
  4.1 IPoIB VLANs
  4.2 Multicast in IPoIB subnets
    4.2.1 Sending IP multicast datagrams
    4.2.2 Receiving multicast packets
      4.2.2.1 IB_join of MGIDs by a listener
    4.2.3 Leaving/Deleting a multicast group
5.0 QoS and related issues
6.0 Security Considerations
7.0 References
8.0 Author's address

1.0 Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop
an I/O specification to deliver a channel based, switched
fabric technology. The InfiniBand standard is aimed at meeting
the requirements of scalability, reliability, availability and
performance of servers in data centers.

1.1 InfiniBand Architecture Specification

The InfiniBand Trade Association specification, version 1.0,
is available for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

For a more complete overview the reader is referred to
chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network
(SAN) for connecting multiple independent processor platforms,
I/O platforms and I/O devices. The IBA SAN is a communications
and management infrastructure supporting both I/O and
inter-processor communications for one or more computer
systems.

An IBA SAN consists of processor nodes and I/O units connected
through an IBA fabric made up of cascaded switches and IB
routers (connecting IB subnets). I/O units can range in
complexity from single ASIC IBA attached devices such as a LAN
adapter to a large memory rich RAID subsystem.

An IBA network is subdivided into subnets interconnected by IB
routers. These are IB routers and IB subnets and not IP
routers or IP subnets.

Each IB node or switch may attach to one or more switches, or
nodes may attach directly to each other.
Each node interfaces with the link by way of channel adapters
(CAs). The architecture supports multiple CAs per unit with
each CA providing one or more ports that connect to the
fabric. Each CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent.
However, each of the ports may include multiple QPs (queue
pairs) that may be directly addressed from a remote peer. From
the point of view of data transfer the QP number (QPN) is part
of the address.

IBA supports both connection oriented and datagram service
between the ports. The peers are identified by QPN and the
port identifier. In raw datagram mode the QPN is not used.

A port may be identified by a local ID (LID) and optionally a
Global ID (GID).

The GID is 128 bits long and is formed by the concatenation of
a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant
portion (GUID). The LID is a 16 bit value that is assigned
when the port becomes active. Note that the GUID is the only
persistent identifier of a port. However, it cannot be used as
an address in a packet. If the prefix is modified then the GID
may change. The subnet manager may attempt to keep the LID
values constant across reboots but that is not a requirement.

The assignment of the GID and the LID is done by the subnet
manager. Every IB subnet has at least one subnet manager
component that controls the fabric. It assigns the LIDs and
GIDs. The subnet manager also programs the switches so that
they route packets between destinations. The subnet manager
and a related component, the subnet administrator (SA), are
the central repository of all information that is required to
set up and bring up the fabric.

IB routers are components that route packets between IB
subnets based on the GIDs.
Thus within an IB subnet a packet may or may not include a
GID, but when crossing IB subnets the GID must be included. A
LID is always needed in a packet since the destination within
a subnet is determined by it.

A CA and a switch may have multiple ports. Each CA port is
assigned its own LID or a range of LIDs. The ports of a switch
are not addressable by LIDs/GIDs or, in other words, are
transparent to other end nodes. Each port has its own set of
buffers. The buffering is channeled through virtual lanes (VL)
where each VL has its own flow control. There may be up to 16
VLs.

VLs provide a mechanism for creating multiple virtual links
within a single physical link. All ports however must support
VL15, which is reserved exclusively for subnet management
datagrams and hence doesn't concern the IPoIB discussions. The
actual VL that a port uses is configured by the SM and is
based on the Service Level (SL) specified in every packet.
There are 16 possible SLs.

In addition to the features described above, viz. Queue Pairs
(QPs), Service Levels (SLs) and addressing (GID/LID), IBA also
defines the following:

P_Keys or partition keys: Every packet, except for the raw
datagrams, carries the partition key (P_Key). These values are
used for isolation in the fabric. A switch (this is an
optional feature) may be programmed by the SM to drop packets
not having a certain key. The CA ports always check for the
P_Keys. A CA port may belong to multiple partitions.

Q_Keys: These are used to enforce access rights for reliable
and unreliable IB datagram services. Raw datagram services
don't require this value. At communication establishment the
endpoints exchange the Q_Keys and must always use the relevant
Q_Keys when communicating with one another.

Multicast support: A switch may support multicasting, i.e.
replication of packets across multiple output ports. This is
an optional feature. A multicast group is identified by a GID.
The GID format is as defined in [RFC 2373] on IPv6 addressing.
Thus from the point of view of IPv6 over IB the data link
multicast address looks like the network address. An IB node
must explicitly join a multicast group by a request to the SM
to receive packets. A node may send packets to any multicast
group. In both cases the multicast LID to be used in the
packets is received from the SM.

There are 6 transport types specified by the IB architecture.
These are:

1. Unreliable Datagram (unacknowledged - connectionless)

   The UD service is connectionless and unacknowledged.
   It allows the QP to communicate with any unreliable
   datagram QP on any node.

   The switches and hence each link can support only a
   certain MTU. The possible MTU values are 256 bytes,
   512 bytes, 1024 bytes, 2048 bytes and 4096 bytes. A UD
   packet cannot be larger than the smallest link MTU
   between the two peers.

2. Reliable Datagram (acknowledged - multiplexed)

   The RD service is multiplexed over connections between
   nodes called End to end contexts (EEC) which allows
   each RD QP to communicate with any RD QP on any node
   with an established EEC. Multiple QPs can use the same
   EEC and a single QP can use multiple EECs (one for
   each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

   The RC service associates a local QP with one and only
   one remote QP. The message sizes may be as large as
   2^31 bytes in length. The CA implementation takes care
   of segmentation and assembly.

4. Unreliable Connected (unacknowledged - connection oriented)

   The UC service associates one local QP with one and
   only one remote QP. There is no acknowledgment and
   hence no resend of lost or corrupted packets.
   Such packets are therefore simply dropped. It is similar
   to RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

   The Ethertype raw datagram packet contains a generic
   transport header that is not interpreted by the CA but
   it specifies the protocol type. The ethertype values
   are the same as those defined in RFC1700.

6. Raw IPv6 (unacknowledged - connectionless)

   Using IPv6 raw datagram service, the IBA CA can
   support standard protocol layers atop IPv6 (such as
   TCP/UDP). Thus native IPv6 packets can be bridged into
   the IBA SAN and delivered directly to a port and to
   its IPv6 raw datagram QP.

The first 4 are referred to as IB transports. The latter two
are classified as Raw datagrams. There is no indication of the
QP number in the raw datagram packets. The raw datagram
packets are limited by the link MTU in size.

1.2.1 InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6
architecture in terms of the InfiniBand subnet structure and
global identifiers (GIDs).

The InfiniBand architecture defines the global identifier
associated with a port as follows:

   GID (Global Identifier): A 128-bit unicast or
   multicast identifier used to identify a port on a
   channel adapter, a port on a router, a switch, or a
   multicast group. A GID is a valid 128-bit IPv6 address
   (per RFC 2373) with additional properties/restrictions
   defined within IBA to facilitate efficient discovery,
   communication, and routing. Note: These rules apply
   only to IBA operation and do not apply to raw IPv6
   operation unless specifically called out.

The raw IPv6 operation referred to in the note in the
definition above is the IPv6 mode of InfiniBand's raw datagram
service. It does not mean IPv6 itself. The routers and
switches referred to in the above definition are the
InfiniBand routers and switches.
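
The GID layout described in section 1.2 (a 64 bit subnet prefix
concatenated with a 64 bit GUID, the result being a valid IPv6
address) can be illustrated with a minimal Python sketch. The
helper names make_gid and split_gid and the example GUID value
are hypothetical and chosen only for illustration; FE80/10 is
the link local prefix the IB specification uses for unicast
GIDs, taken here with the remaining prefix bits set to zero.

```python
import ipaddress

# Link local subnet prefix (FE80::/10 with the remaining prefix bits zero).
LINK_LOCAL_PREFIX = 0xFE80_0000_0000_0000

# Hypothetical EUI-64 style port GUID, used only as sample data.
EXAMPLE_GUID = 0x0002C90300001234


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64 bit subnet prefix and a 64 bit GUID into a GID."""
    for name, value in (("subnet_prefix", subnet_prefix), ("guid", guid)):
        if not 0 <= value < 2**64:
            raise ValueError(f"{name} must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


def split_gid(gid: ipaddress.IPv6Address) -> tuple[int, int]:
    """Return the (subnet prefix, GUID) halves of a GID."""
    value = int(gid)
    return value >> 64, value & (2**64 - 1)


gid = make_gid(LINK_LOCAL_PREFIX, EXAMPLE_GUID)
print(gid, gid.is_link_local)
```

Because only the GUID half is persistent, a change of subnet
prefix yields a different GID for the same port, which is why
the sketch keeps the two halves as separate inputs.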

The InfiniBand (IB) specification defines two types of GIDs:
unicast and multicast.

1.2.1.1 Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes.

The IB specification states:

   a. link local: This is defined to be FE80/10.
      The IB routers will not forward packets
      with a link local address in source or
      destination beyond the IB subnet.

   b. site local: FEC0/10
      A unicast GID used within a collection
      of subnets which is unique within that
      collection (e.g. a data center or
      campus) but is not necessarily globally
      unique. IB routers must not forward any
      packets with either a site-local Source
      GID or a site-local Destination GID
      outside of the site.

   c. global: A unicast GID with a global prefix, i.e. an
      IB router may use this GID to route packets
      throughout an enterprise or internet.

1.2.1.2 Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses.
The IB specification defines the multicast GIDs as follows:

   FFxy:<112 bits>

Flag bits:

The nibble denoted by x above holds the 4 flag bits: 000T. The
first three bits are reserved and are set to zero. The last
bit is defined as follows:

   T=0: denotes a permanently assigned, i.e. well known, GID
   T=1: denotes a transient group

Scope bits:

The 4 bits denoted by y in the GID above are the scope bits.
These are defined as:

   Scope value     Address value

   0               Reserved
   1               Unassigned
   2               Link-local
   3               Unassigned
   4               Unassigned
   5               Site-local
   6               Unassigned
   7               Unassigned
   8               Organization-local
   9               Unassigned
   0xA             Unassigned
   0xB             Unassigned
   0xC             Unassigned
   0xD             Unassigned
   0xE             Global
   0xF             Reserved

                   Table 1

The IB specification further refers to [RFC_2373] and
[RFC_2375] while defining the well known multicast addresses.
However, it then states that the well known addresses apply to
IB raw IPv6 datagrams only.
The IB unreliable datagram (UD) service recognises only one
well known multicast address. This is the
ALL_CHANNEL_ADAPTERS multicast address, defined to be
FF02::1. The scope of this address is limited to a single IB
subnet. It must be noted though that a multicast group can be
associated with only a single MGID. Thus the same MGID cannot
be associated with the UD mode and the raw datagram mode.

1.2.2 InfiniBand Multicast Groups

IB multicast groups (multicast GIDs) are managed by the subnet
manager (SM). The SM explicitly programs the IB switches in
the fabric to ensure that the packets are received by all the
members of the multicast group.

To create a group, a create request is sent to the SM. The
subnet manager records the group GID and the associated
characteristics. The group characteristics are defined by the
group path MTU, whether the group will be used for raw
datagrams or unreliable datagrams, the service level, the
partition key associated with the group, the LID (local
identifier) associated with the group etc. These
characteristics are defined at the time of the group
creation.

The LID is associated with the multicast group by the subnet
manager (SM) at the time of the multicast group creation. An
IB node may request that a specific LID be associated with a
group. The SM determines the multicast tree based on all the
group members and programs the relevant switches. The LID is
used by the switches to route the packets.

Any IB node wanting to participate in the group must join the
group. As part of the join operation the node is returned the
group characteristics. At the same time the subnet manager
ensures that the requester can indeed participate in the group
by verifying that it can support the group MTU and that it can
reach the rest of the group members. Other group
characteristics may need verification too.
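
The create and join bookkeeping described above can be
sketched as follows. This is a toy model under stated
assumptions, not the SM's actual interface: the class and
method names are hypothetical, the P_Key check is a simple
membership test standing in for the "other group
characteristics" that may need verification, reachability
checks and switch programming are omitted, and multicast LIDs
are handed out sequentially from 0xC000 (the start of the
multicast LID range cited in section 3.4).

```python
from dataclasses import dataclass, field

# Link MTU values listed for the Unreliable Datagram service.
IB_MTUS = (256, 512, 1024, 2048, 4096)


@dataclass
class McastGroup:
    mgid: str
    mlid: int
    mtu: int
    p_key: int
    sl: int
    members: set = field(default_factory=set)


class ToySubnetManager:
    """Hypothetical model of the SM's multicast group records."""

    def __init__(self) -> None:
        self.groups: dict[str, McastGroup] = {}
        self._next_mlid = 0xC000  # assumption: sequential MLID allocation

    def create(self, mgid: str, mtu: int, p_key: int, sl: int = 0) -> McastGroup:
        # Characteristics are fixed at the time of group creation.
        if mtu not in IB_MTUS:
            raise ValueError("unsupported group MTU")
        if mgid in self.groups:
            raise ValueError("group already exists")
        group = McastGroup(mgid, self._next_mlid, mtu, p_key, sl)
        self._next_mlid += 1
        self.groups[mgid] = group
        return group

    def join(self, mgid: str, port_lid: int, port_mtu: int, port_p_keys: set) -> McastGroup:
        group = self.groups.get(mgid)
        if group is None:
            raise LookupError("group has not been created")
        if port_mtu < group.mtu:
            raise ValueError("port cannot support the group MTU")
        if group.p_key not in port_p_keys:  # simplified partition check
            raise PermissionError("port is not in the group's partition")
        group.members.add(port_lid)
        return group  # the joiner learns the group characteristics


sm = ToySubnetManager()
sm.create("ff12::1", mtu=2048, p_key=0xFFFF)
joined = sm.join("ff12::1", port_lid=0x11, port_mtu=4096, port_p_keys={0xFFFF})
print(hex(joined.mlid), joined.mtu)
```

Returning the group record from join mirrors the text: the
join both admits the node and tells it the MTU, P_Key and LID
it must use.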

For groups that span IB subnet boundaries, the SM must
interact with IB routers to determine the presence of the
group in other IB subnets. If present, the MTU must match
across the IB subnets.

P_Key is another characteristic that must match across IB
subnets since the P_Key inserted into a packet is not modified
by the IB switches or IB routers. Thus if the P_Keys didn't
match, the IB router(s) might drop the packets or destinations
on other subnets might drop the packets.

These characteristics are returned to the IB endnode that
joins the multicast group. A join operation may cause the SM
to reprogram the fabric so that the new member can participate
in the multicast group.

2.0 Management of InfiniBand subnet

To aid in the monitoring and configuration of InfiniBand
subnet components a set of MIBs needs to be defined. MIBs are
needed for the channel adapters, InfiniBand interfaces,
InfiniBand subnet manager and InfiniBand subnet management
agents, and to allow management of specific device properties.
It must be noted that the management objects addressed in the
IPoIB documents are for all of the IB subnet components and
are not limited to IP (over IB). The relevant MIBs will be
described in separate documents.

3.0 IP over IB requirements

As described above, IB provides a broad set of capabilities to
choose from when implementing IP over IB.

The IPoIB specification MUST NOT require changes in IP and
higher layer protocols. Nor should it mandate requirements on
IP stacks to implement special user level programs. It is an
aim that the IPoIB changes be amenable to modularisation and
incorporation into existing implementations at the same level
as other media types.

3.1 InfiniBand as link layer

InfiniBand (IB) provides multiple methods of packet exchange
between two endpoints as was noted above.
These are:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram : Raw IPv6 (R6)
                : Raw Ethertype (RE)

IPoIB can be implemented over any, multiple or all of these
services. A case can be made for support on any of the
transport methods depending on the desired features.

Unreliable datagrams are limited by the link MTU. The
connected modes, in contrast, can offer significant benefit in
terms of performance by utilising a larger MTU. Reliability is
also enhanced if the underlying feature of automatic path
migration of connected modes is utilised. An implementation
MAY choose to provide IP over non-UD transport modes in
addition to the mandatory IP over UD function.

The IB specification requires Unreliable Datagram mode to be
supported by all the IB nodes. The host channel adapters
(HCAs) are additionally required to support Reliable Connected
and Unreliable Connected modes; target channel adapters (TCAs)
are not. Support for the two Raw Datagram modes is entirely
optional.

For the sake of simplicity and ease of implementation and
integration with existing stacks, it is desirable that the
fabric support multicasting. This is possible only in
Unreliable Datagram (UD) and IB's Raw datagram modes.

Given these conditions it is a MUST that an IP stack support
IP over the UD transport mode of InfiniBand. The support of IP
over other modes of IB transport is optional.

InfiniBand communication is addressed to a QP at a port.
Therefore the IPoIB interface is identified by the port
identifier as well as a QP that is associated with the
interface. The address resolution process for IPoIB MUST also
determine the associated QPN along with determining the port
identifier.

An interface MAY be associated with multiple QPNs.
This provides a mode of implementation wherein a single IP
address is associated with different QPNs. Such an association
may be used to demultiplex the incoming packets based on the
QPN, avoiding or reducing the upper-layer port based lookup.
This amounts to there being multiple MAC addresses associated
with an endpoint. Any process for providing resolution and
support of multiple QPNs per IP address MUST provide for
interoperability with the default version of a single QPN per
IPoIB interface.

3.2 Multicast support

The InfiniBand specification makes support of multicasting in
the switches optional. It is RECOMMENDED that multicast
capable switches be used in IPoIB subnets. Lack of multicast
capable switches however doesn't mean that multicasting cannot
be supported. In such a case the underlying IB layer MUST
emulate multicast while ensuring that it is transparent to the
IP implementation.

The translation from IP addresses to IB MGIDs is independent
of the IB fabric's multicast capability.

3.2.1 Mapping IP multicast to IB multicast

Well known IP multicast groups are defined for both IPv4 and
IPv6 (RFC_1700, RFC_2373). Multicast groups may also be
dynamically created at any time. To avoid creating unnecessary
duplicates of multicast packets in the fabric, and to avoid
unnecessary handling of such packets at the hosts, it is
desirable to associate each of the IP multicast groups with a
different IB multicast GID.

A process MUST be defined for mapping the IP multicast
addresses to unique IB multicast addresses. Every IPoIB node
MUST be capable of making this mapping decision
independently.

3.2.2 Transient flag in IB MGIDs

The IB specification describes the flag bits as discussed in
section 1.2.1.2. The IB specification also defines some well
known IB MGIDs.
These MGIDs are reserved for IB's Raw datagram mode, which is
incompatible with the other transports of IB. Any mapping that
is defined from IP multicast addresses therefore MUST NOT fall
into IB's definition of a well-known address.

Therefore all IPoIB related multicast GIDs will always set the
transient bit.

3.3 IP subnets across IB subnets ?

Some implementations may desire to support multiple clusters
of machines in their own IB subnets but otherwise part of a
common IP subnet. For such a solution the IB specification
needs multiple upgrades:

1) A method for creating IB multicast GIDs that span multiple
   IB subnets. The partition keys and other parameters need to
   be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology
   across IB subnets.

3) A definition of the process and protocols needed between
   IB nodes and IB routers.

Until the above conditions are met it is not possible to
define IPoIB subnets that span IB subnets. The IPoIB standards
can however be implemented with this possibility in mind.

3.4 Multicast address to LID mapping

In a generic LAN setup the IP multicast addresses are mapped
to the destination link layer address directly. In the case of
InfiniBand this is only partly true. A mapping of multicast IP
addresses to IB GIDs can be standardised. But the IPoIB driver
on the host must determine the LID that needs to be used when
sending to the particular multicast group.

A mapping from the IP multicast address or the corresponding
IB multicast group to a LID is not required for the following
reasons:

1) Sending/receiving IP multicast

   An IB node cannot be assured of its packets
   reaching all the multicast members without itself
   joining the IB multicast group. This is because the
   relevant switches are programmed by the IB subnet
   manager only on receiving a join request.

   Thus the sender/receiver will always have to join
   the IB multicast groups and keep track of the
   groups it has already joined. Mapping directly to
   the LID doesn't help if the group has not been
   joined.

   Thus the implementation is required to keep track
   of the IB groups joined. It can therefore also
   record the corresponding LID, removing the need to
   map the IP multicast address to the LID.

2) Reduction of LID conflicts

   The LIDs in the range 0xC000 to 0xFFFE are
   designated as the multicast LIDs by IBA. This
   limits the range to 2^14 - 1 entries (16383
   entries). Since IPv4 multicast addresses carry 28
   bits of group ID, this implies that roughly 2^14
   or 16K IPv4 multicast groups could map to a single
   LID. It is better to let the SM decide on a more
   efficient usage of the multicast LID space.

3) SM and IB architecture should stay unaffected.

   A mapping of the LIDs can conflict with the SM
   implementations. The SM is under no restrictions to
   choose a particular LID for any multicast group.
   Thus it could end up utilising a LID that
   maps from an IP multicast address for some other
   multicast group since not everything on IB subnets
   is governed by the IPoIB rules.

4) No need to plan for LID conflicts

   Allowing the SM to decide on the LIDs also avoids
   having to come up with a solution to handle LID
   conflicts with other multicast groups.

Thus it is best to avoid such a mapping and leave it to the
individual implementations to determine the LID from the SM.
There is no extra work involved in this determination since
the SM has to be contacted anyway for the IB multicast group
join/create operations.

IPoIB will not standardise IP multicast addresses to LID
mapping.

4.0 IP subnets in InfiniBand fabrics

The IPoIB subnet is overlaid over the IB subnet.
The IPoIB subnet is brought up in the following steps:

   Note: the join/leave operations at the IP level will be
   referred to as IP_join/IP_leave and the join/leave
   operations at the IB level will be referred to as
   IB_join/IB_leave in this document.

1. The all-IP nodes group MUST be created

It is a MUST that the administrator set up the IB multicast
group corresponding to all-IP nodes/IPv4 broadcast (henceforth
called 'broadcast group') when the IP (v4/v6) subnet is set
up. The method by which the broadcast group is set up is not
defined by IPoIB.

2. All IPoIB interfaces IB_join the broadcast group

The administrator chooses the parameters that are valid for
the multicast group: P_Key, Q_Key, Hop Limit, Flow ID, TClass
and the MTU. All multicast transmissions in the IP subnet must
use these values. Therefore any other multicast groups set up
in the IPoIB subnet MUST be set up with these attributes. In
the future, as the IB specification associates more meaning
with the various values and defines IB QoS, different values
for IP multicast traffic may be possible.

The IB_join of the broadcast group by the IPoIB nodes builds
the IPoIB subnet. The broadcast group defines the span and the
members of the IPoIB subnet. The IB_join to the broadcast
group has the additional benefit of distributing these values
to all the members of the subnet.

The IP interface MTU for the IP over Unreliable Datagram
interface is the path MTU value returned when the broadcast
MGID is joined. This is the largest MTU that can be used
across the IPoIB subnet without fragmenting. The IPoIB
specification for IP over non-UD modes of transmission MUST
also define the MTU that can be used with it.

4.1 IPoIB VLANs

In an IB subnet, to communicate with one another, the
endpoints must have compatible P_Keys.
Thus the administrator, when setting up an IP subnet over an
IB subnet, must ensure that all the members have compatible
P_Keys. An endpoint may however have multiple P_Keys.

The IB architecture specifies that there can be only one MGID
associated with a multicast group in the IB subnet. The P_Key
can be included in the MGID mappings from the IP multicast
addresses. Since the P_Key is unique in the IB subnet the
inclusion of the P_Key in the IB MGIDs ensures unique MGID
mappings are created. Every unique broadcast group MGID so
formed creates a separate abstract IPoIB link and hence an
IPoIB VLAN.

How the IP stack determines the P_Key related to the IPoIB
subnet is an implementation choice. It could be a
configuration parameter initialised by some means by the
administrator. The method employed by an implementation to
determine the P_Key is beyond the scope of IPoIB.

4.2 Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts
and rules as on any other media. However, unlike most other
media, multicast over InfiniBand requires interaction with
another entity, the IB subnet manager. This section describes
the outline of the process and also suggests some guidelines.

The IB architecture specifies the following format for IB
multicast packets when used over unreliable datagram (UD)
mode:

+--------+-------+---------+---------+-------+---------+---------+
|Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header |Header   |Transport| (IP)  |         |         |
|        |       |         |Header   |       |         |         |
+--------+-------+---------+---------+-------+---------+---------+

For details about the various headers please refer to the
InfiniBand Architecture Specification.

The Global routing header (GRH) includes the IB multicast group
GID.
   The Local routing header (LRH) includes the local identifier
   (LID). The IB switches in the fabric route the packet based on
   the LID.

   The GID is made available to the receiving IB user (the IPoIB
   interface driver, for example). The driver can therefore
   determine the IB group the packet belongs to.

   IPv4 defines three levels of multicast compliance. These are:

      Level 0: No support for IP multicasting

      Level 1: Support for sending but not receiving multicasts

      Level 2: Full support for IP multicasting

   In IPv6 there is no such distinction. Full multicast support is
   mandatory. Additionally, all IPv4 subnets support broadcast
   (255.255.255.255). IPv4 broadcast can always be sent/received by
   all IPv4 interfaces.

   The standard case of broadcast is covered by the requirement that
   the broadcast group's MGID must exist for an IPoIB subnet to be
   formed. Thus level 0 IPv4 multicast support is available by
   default.

4.2.1 Sending IP multicast datagrams

   An IP host may send a multicast packet at any time to any
   multicast address. The join/leave of IB groups will be referred
   to as IB_join/IB_leave in this document. The corresponding IP
   level join/leave will be referred to as IP_join/IP_leave.

   The IP layer conveys the multicast packet to the IPoIB interface
   driver/module. This module attempts to IB_join the relevant IB
   multicast group. This is required since otherwise there is no
   guarantee that the packet will reach its destinations.

   The IB_join could fail if the IB group has not been created. This
   could imply that there are no listeners on the subnet and the
   router doesn't expect to forward packets received on this group.
   However, this may not be the case. The IB group may not exist
   because the SM ran out of resources or the SM policy allows only
   a limited set of multicast groups to be created.
   Additionally, it is not reasonable to expect the router to create
   IB groups for all the IP multicast addresses that it may be
   called upon to forward.

   Therefore, the multicast module of the IPoIB interface, when
   sending a multicast packet, MUST do one of the following:

   1) Join the IB multicast group corresponding to the IP multicast
      address. This is the RECOMMENDED option for multicast if the
      sender is itself a member of the group.

      As noted earlier, a particular IB multicast group may not
      exist for some reason. In such a case the implementation MUST
      fall back to one of the following methods.

   2) Send the multicast packet out with the IB MGID/MLID associated
      with the all-systems IP multicast address
      (224.0.0.1/FF02::1).

      An implementation implementing 1) described above must fall
      back to this option or the option given below on failure to
      join the IB group corresponding to the IPv4 multicast address
      being sent to.

   3) In IPv4 subnets, if both of the above options fail, then the
      packet MUST be sent with the IB MGID/MLID corresponding to the
      IPv4 limited broadcast address (255.255.255.255).

4.2.2 Receiving multicast packets

   An IP host sends an IGMP/MLD report to the router(s) when it
   wants to receive packets on a multicast group. The router could
   then create the IB group. However, to receive the packets the IP
   host must join the corresponding IB multicast group. Therefore,
   it is simpler for the IB interface module on the IP host to first
   create the IB group and then send the IGMP message to the router.
   The router will then IB_join the specified IB group. It may also
   be that an IPoIB subnet doesn't have any routers. In such a case
   there is no router that can be relied on to create the IB groups.

   The router MAY choose to create IB groups corresponding to the IP
   groups it expects to forward.
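
   The sender-side fallback order of section 4.2.1 can be sketched
   as follows. This is only an illustration under stated
   assumptions, not part of the IPoIB definition: the function name
   choose_send_group and the ib_join callback are hypothetical,
   groups are identified by their IP address rather than by the
   mapped MGID/MLID, and the availability of a group is assumed to
   be tested by attempting an IB_join. The text does not define a
   further fallback for IPv6, so the sketch returns None in that
   case.

```python
from typing import Callable, Optional

# All-systems addresses named in section 4.2.1, keyed by IP version.
ALL_SYSTEMS = {4: "224.0.0.1", 6: "FF02::1"}
IPV4_BROADCAST = "255.255.255.255"


def choose_send_group(dest: str, ip_version: int,
                      ib_join: Callable[[str], bool]) -> Optional[str]:
    """Pick the IP address whose IB group should carry a packet to dest.

    ib_join is a hypothetical callback that attempts an IB_join of the
    IB group mapped from the given IP address and reports success.
    """
    # Option 1 (RECOMMENDED): join the group for the destination itself.
    if ib_join(dest):
        return dest
    # Option 2: fall back to the all-systems multicast group.
    all_systems = ALL_SYSTEMS[ip_version]
    if ib_join(all_systems):
        return all_systems
    # Option 3: IPv4 only, use the limited broadcast group, which must
    # exist for the IPoIB subnet to have been formed.
    if ip_version == 4:
        return IPV4_BROADCAST
    # No further fallback is described for IPv6 (assumption: give up).
    return None
```

   For example, an IPv4 sender whose IB_join attempts all fail ends
   up sending on the group for 255.255.255.255, while a sender that
   can join the destination's own group never reaches the fallbacks.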

   Thus the creation of IB groups is done by IP receivers or IP
   routers only and not by senders, thereby keeping things simple.
   The host must first try to join the group and only on failure
   attempt to create it.

4.2.2.1 IB join of MGIDs by a listener

   A multicast listener takes the following steps when it IP_joins
   the IP multicast group:

   1) The IPoIB interface IB_joins the corresponding IB MGID.

   2) If step 1) fails:

      The IPoIB interface creates the IB MGID group and IB_joins it.

   3) If step 2) fails:

      The IPoIB interface records the IB MGID/MLID it will be using
      for the IP multicast group. This decision is based on the
      steps outlined in section 4.2.1.

   4) It may be that the IB MGID could not be created/joined because
      of a transient error or resource constraint at the SM. It may
      also be created at a later point in time. The listener
      therefore would not be in the IB MGID corresponding to the IP
      address. Unfortunately there is no IB level support to let the
      listener know of the new IB MGID being created.

      If the underlying IB level indicated a transient failure, the
      listener periodically retries to join the IB group. Note that
      multicasting can still continue since the packets can be sent
      out on the broadcast MGID.

      The exact parameters and timers for such retries are specified
      in another document.

4.2.3 Leaving/Deleting a multicast group

   An IPv4 sender (level 1 compliance) IB_joins the IB multicast
   group only because that is the only way to guarantee reception of
   the packets by all the group recipients. The sender must,
   however, IB_leave the group at some time. It is RECOMMENDED that
   a sender, when not a receiver on the group, start a timer per
   multicast group sent to. The sender leaves the IB group when the
   timer goes off. It restarts the timer if another message is sent.

   This recommendation doesn't apply to the IB broadcast group.
   It also doesn't apply to the IB group corresponding to the
   all-hosts multicast group. An IPv4 host MUST always remain a
   member of the broadcast group. It MAY choose to remain a member
   of the all-hosts group.

   Thus a sender that chooses to always send to the broadcast group
   and not to the specific multicast group does not need to
   implement a timer.

   An IP multicast receiver MUST IB_leave the corresponding IB
   multicast group when it IP_leaves the IP multicast group. In the
   case of an IPv4 implementation the receiver may choose to
   continue to be a sender (level 1 compliance). It MAY choose to
   not IB_leave the IB group but start a timer as explained above.

   A router is RECOMMENDED to IB_leave the IB multicast group when
   there are no members of the IP multicast address in the subnet
   and it has no explicit knowledge of any need to forward such
   packets.

   The router and the IP hosts MUST NOT IB_delete the IB multicast
   group when they IB_leave the group. It is possible for the same
   IB multicast group to be used by a non-IP protocol. The IB
   specification mentions an IB specific protocol that will delete
   the IB groups when it determines that there are no IB members of
   the group.

5.0 QoS and related issues

   The IB specification suggests the use of service levels (SLs) for
   load balancing, QoS and deadlock avoidance within an IB subnet.
   But the IB specification leaves the usage and mode of
   determination of the SL for the application to decide. The SL and
   list of SLs are available in the SA but it is up to the endnode's
   application to choose the 'right' value.

   Every IPoIB implementation will determine the relevant SL value
   based on its own policy. No method or process for choosing the SL
   will be defined by the IPoIB standards.

6.0 Security Considerations

   Any multicast/broadcast communication is inherently insecure
   since anyone can receive the data.
   The applications must implement appropriate
   authentication/encryption methods for data security.

   The IP subnet communication can be disrupted by creating the IB
   broadcast/multicast groups with incompatible parameters.

   The implementations must leverage IB specific methods to protect
   against such situations.

7.0 References

   [IB_ARCH]   InfiniBand Architecture Specification, Volume 1.0
   [RFC_2373]  IP Version 6 Addressing Architecture
   [RFC_2375]  IPv6 Multicast Address Assignments
   [RFC_1700]  Assigned Numbers
   [RFC_1112]  Host extensions for IP multicasting
   [RFC_2236]  Internet Group Management Protocol, Version 2
   [RFC_2710]  Multicast Listener Discovery

8.0 Author's Address

   Vivek Kashyap

   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Phone: 503 578 3422
   Email: vivk@us.ibm.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001). All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared,
   copied, published and distributed, in whole or in part, without
   restriction of any kind, provided that the above copyright notice
   and this paragraph are included on all such copies and derivative
   works. However, this document itself may not be modified in any
   way, such as by removing the copyright notice or references to
   the Internet Society or other Internet organizations, except as
   needed for the purpose of developing Internet standards in which
   case the procedures for copyrights defined in the Internet
   Standards process must be followed, or as required to translate
   it into languages other than English.

   The limited permissions granted above are perpetual and will not
   be revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
   OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
   IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
   PURPOSE.

   --
   Vivek Kashyap
   Linux Technology Center, IBM
   kashyapv@us.ibm.com
   vivk@us.ibm.com
   503 578 3422 (o)