idnits 2.17.1 draft-ietf-ipoib-architecture-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 23 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 24 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 9 instances of too long lines in the document, the longest one being 3 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. 
If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? 'RFC2373' on line 239 looks like a reference -- Missing reference section? 'IANA' on line 976 looks like a reference Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 INTERNET DRAFT 2 Vivek Kashyap 3 Expiration Date: October, 2004 IBM 4 April, 2004 6 IP over InfiniBand(IPoIB) Architecture 8 Status of this memo 10 This document is an Internet-Draft and is in full conformance 11 with all provisions of Section 10 of RFC 2026. 13 Internet-Drafts are working documents of the Internet 14 Engineering Task Force (IETF), its areas, and its working 15 groups. Note that other groups may also distribute working 16 documents as Internet- Drafts. 18 Internet-Drafts are draft documents valid for a maximum of six 19 months and may be updated, replaced, or obsoleted by other 20 documents at any time. It is inappropriate to use 21 Internet-Drafts as Reference material or to cite them other 22 than as ``work in progress''. 
24 The list of current Internet-Drafts can be accessed at 25 http://www.ietf.org/ietf/1id-abstracts.txt 27 The list of Internet-Draft Shadow Directories can be accessed 28 at http://www.ietf.org/shadow.html 30 This memo provides information for the Internet community. 31 This memo does not specify an Internet standard of any kind. 32 Distribution of this memo is unlimited. 34 Copyright Notice 36 Copyright (C) The Internet Society (2001). All Rights Reserved. 38 Abstract 40 InfiniBand is a high speed, channel based interconnect between 41 systems and devices. 43 This document presents an overview of the InfiniBand 44 architecture. It further describes the requirements and 45 guidelines for the transmission of IP over InfiniBand. 46 Discussions in this document are applicable to both IPv4 and 47 IPv6 unless explicitly specified. The encapsulation of IP over 48 InfiniBand and the mechanism for IP address resolution on IB 49 fabrics are covered in other documents. 51 Table of Contents 53 1.0 Introduction to InfiniBand 54 1.1 InfiniBand Architecture Specification 55 1.2 Overview of InfiniBand Architecture 56 1.2.1 InfiniBand Addresses 57 1.2.1.1 Unicast GIDs 58 1.2.1.2 Multicast GIDs 59 1.3 InfiniBand Multicast Group Management 60 1.3.1 Multicast Member Record 61 1.3.1.1 JoinState 62 1.3.2 Join and Leave operations 63 1.3.2.1 Creating a Multicast Group 64 1.3.2.3 Deleting a Multicast Group 65 1.3.2.4 Multicast Group Create/Delete Traps 66 2.0 Management of InfiniBand Subnet 67 3.0 IP over IB 68 3.1 InfiniBand as Datalink 69 3.2 Multicast Support 70 3.2.1 Mapping IP Multicast to IB Multicast 71 3.2.2 Transient Flag in IB MGIDs 72 3.3 IP Subnet Across IB Subnets 73 4.0 IP Subnets in InfiniBand Fabrics 74 4.1 IPoIB VLANs 75 4.2 Multicast in IPoIB Subnets 76 4.2.1 Sending IP Multicast Datagrams 77 4.2.2 Receiving Multicast Packets 78 4.2.3 Forwarding Multicast Packets 79 4.2.4 Impact of InfiniBand Architecture Limits 80 4.2.5 Leaving/Deleting a Multicast Group 81 5.0 QoS 
and Related Issues 82 6.0 Security Considerations 83 7.0 Acknowledgments 84 8.0 References 85 9.0 Author's Address 87 1.0 Introduction to InfiniBand 89 The InfiniBand Trade Association(IBTA) was formed to develop 90 an I/O specification to deliver a channel based, switched 91 fabric technology. The InfiniBand standard is aimed at meeting 92 the requirements of scalability, reliability, availability and 93 performance of servers in data centers. 95 1.1 InfiniBand Architecture Specification 96 The InfiniBand Trade Association specification is available 97 for download from http://www.infinibandta.org. 99 1.2 Overview of InfiniBand Architecture 101 For a more complete overview the reader is referred to 102 chapter 3 of the InfiniBand specification. 104 InfiniBand Architecture (IBA) defines a System Area 105 Network(SAN) for connecting multiple independent processor 106 platforms, I/O platforms and I/O devices. The IBA SAN is a 107 communications and management infrastructure supporting both 108 I/O and inter-processor communications for one or more 109 computer systems. 111 An IBA SAN consists of processor nodes and I/O units connected 112 through an IBA fabric made up of cascaded switches and IB 113 routers (connecting IB subnets). I/O units can range in 114 complexity from single ASIC IBA attached devices such as a LAN 115 adapter to a large memory rich RAID subsystem. 117 An IBA network may be subdivided into subnets interconnected 118 by routers. These are IB routers and IB subnets and not IP 119 routers or IP subnets. This document will refer to InfiniBand 120 routers and subnets as 'IB routers' and 'IB subnets' 121 respectively. The IP routers and IP subnets will be referred 122 to as 'routers' and 'subnets' respectively. 124 Each IB node or switch may attach to a single or multiple 125 switches or directly with each other. Each IB unit interfaces 126 with the link by way of channel adapters (CAs). 
The 127 architecture supports multiple CAs per unit with each CA 128 providing one or more ports that connect to the fabric. Each 129 CA appears as a node to the fabric. 131 The ports are the endpoints to which the data is sent. 132 However, each of the ports may include multiple QPs (queue 133 pairs) that may be directly addressed from a remote peer. From 134 the point of view of data transfer the QP number (QPN) is part 135 of the address. 137 IBA supports both connection oriented and datagram service 138 between the ports. The peers are identified by QPN and the 139 port identifier. There are two exceptions. QPNs are not used 140 when packets are multicast. QPNs are also not used in the raw 141 datagram mode. 143 A port, in a data packet, is identified by a local ID (LID) 144 and optionally a Global ID (GID). The GID in the packet is 145 needed only when communicating across an IB subnet though it 146 may always be included. 148 The GID is 128 bits long and is formed by the concatenation of 149 a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant 150 portion (GUID). The LID is a 16 bit value that is assigned 151 when the port becomes active. Note that the GUID is the only 152 persistent identifier of a port. However, it cannot be used as 153 an address in a packet. If the prefix is modified then the GID 154 may change. The subnet manager may attempt to keep the LID 155 values constant across reboots but that is not a requirement. 157 The assignment of the GID and the LID is done by the subnet 158 manager. Every IB subnet has at least one subnet manager 159 component that controls the fabric. It assigns the LIDs and 160 GIDs. The subnet manager also programs the switches so that 161 they route packets between destinations. The subnet manager 162 and a related component, the subnet administrator (SA), are the 163 central repository of all information that is required to 164 set up and bring up the fabric.
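The GID construction described above (a 64 bit IB subnet prefix concatenated with a 64 bit GUID) can be illustrated with a short sketch. This is only an illustration and not part of any IPoIB specification. Representing the GID with Python's ipaddress.IPv6Address is a convenience chosen here because the GID format is derived from IPv6; the function names and the example GUID value are hypothetical.

```python
import ipaddress

MASK_64 = (1 << 64) - 1


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64 bit IB subnet prefix and a 64 bit GUID into a GID."""
    if not 0 <= subnet_prefix <= MASK_64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid <= MASK_64:
        raise ValueError("GUID must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


def split_gid(gid: ipaddress.IPv6Address) -> tuple:
    """Return (subnet_prefix, guid) for a 128 bit GID."""
    value = int(gid)
    return value >> 64, value & MASK_64


# Assumption: a link-local style prefix (FE80::/64) and a made-up GUID.
LINK_LOCAL_PREFIX = 0xFE80_0000_0000_0000
EXAMPLE_GUID = 0x0002_C903_0000_1234  # hypothetical value

gid = make_gid(LINK_LOCAL_PREFIX, EXAMPLE_GUID)
print(gid)

# Changing the prefix changes the GID; the GUID part stays the same.
other = make_gid(0xFEC0_0000_0000_0001, EXAMPLE_GUID)
print(split_gid(other)[1] == EXAMPLE_GUID)
```

The sketch also shows why the GUID is the only persistent part: a different subnet prefix yields a different GID for the same port.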
166 IB routers are components that route packets between IB 167 subnets based on the GIDs. Thus within an IB subnet a packet 168 may or may not include a GID but when going across an IB 169 subnet the GID must be included. A LID is always needed in a 170 packet since the destination within a subnet is determined by 171 it. 173 A CA and a switch may have multiple ports. Each CA port is 174 assigned its own LID or a range of LIDs. The ports of a switch 175 are not addressable by LIDs/GIDs or in other words, are 176 transparent to other end nodes. Each port has its own set of 177 buffers. The buffering is channeled through virtual lanes(VL) 178 where each VL has its own flow control. There may be up to 16 179 VLs. 181 VLs provide a mechanism for creating multiple virtual links 182 within a single physical link. All ports must support VL15 183 which is reserved exclusively for subnet management datagrams 184 and hence doesn't concern the IPoIB discussions. The actual VL 185 that a packet uses is configured by the SM in the 186 switch/channel adapter tables and is determined based on the 187 Service Level (SL) specified in every packet. There are 16 188 possible SLs. 190 In addition to the features described above viz. Queue 191 Pairs(QPs), Service Levels(SLs) and addressing(GID/LID), IBA 192 also defines the following: 194 Partitioning: 196 Every packet, but for the raw datagrams, carries the 197 partition key (P_key). These values are used for 198 isolation in the fabric. A switch (this is an optional 199 feature) may be programmed by the SM to drop packets 200 not having a certain key. The CA ports always check 201 for the P_Keys. A CA port may belong to multiple 202 partitions. P_Key checking is optional at IB routers. 204 A P_Key may be described as having 'limited 205 membership' or 'full membership'. For a packet to be 206 accepted at least one of the P_Keys i.e. the P_Key in 207 the packet or the P_Key in the port, must be 'full 208 membership' P_Keys. 
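The membership rule above can be sketched as a small check. The bit layout used here (a 16 bit P_Key whose most significant bit marks full membership and whose low 15 bits identify the partition) is an assumption taken from the InfiniBand Architecture Specification rather than from this document, and the function names are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000  # assumed: MSB set means 'full membership'
KEY_MASK = 0x7FFF         # assumed: low 15 bits identify the partition


def pkey_match(packet_pkey: int, port_pkey: int) -> bool:
    """True if the two P_Keys name the same partition and at least
    one of them is a 'full membership' P_Key."""
    same_partition = (packet_pkey & KEY_MASK) == (port_pkey & KEY_MASK)
    one_is_full = bool((packet_pkey | port_pkey) & FULL_MEMBER_BIT)
    return same_partition and one_is_full


def port_accepts(packet_pkey: int, port_pkey_table: list) -> bool:
    """A CA port may belong to several partitions; accept the packet if
    any P_Key in the port's table matches."""
    return any(pkey_match(packet_pkey, key) for key in port_pkey_table)


print(pkey_match(0xFFFF, 0x7FFF))  # full vs limited, same partition
print(pkey_match(0x7FFF, 0x7FFF))  # limited vs limited
print(pkey_match(0x8001, 0x8002))  # different partitions
```

Two limited members of the same partition therefore cannot exchange packets directly, while each of them can communicate with a full member.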
210 Q_Keys: 212 Q_Keys are used to enforce access rights for reliable 213 and unreliable IB datagram services. Raw datagram 214 services don't use Q_Keys. At communication 215 establishment the endpoints exchange the Q_Keys and 216 must always use the relevant Q_Keys when communicating 217 with one another. Multicast packets use the Q_Key 218 associated with the multicast group. 220 Q_Keys with the most significant bit set are 221 considered controlled Q_Keys (such as the GSI Q_Key) 222 and an HCA does not allow a consumer to arbitrarily 223 specify a controlled Q_Key. An attempt to send a 224 controlled Q_Key results in using the Q_Key in the QP 225 context. Thus the OS maintains control since it can 226 configure the QP context for the controlled Q_Key for 227 privileged consumers. It must be noted that though the 228 notion of a 'controlled Q_Key' is suggested by the IB 229 specification, the specification does not require its use or 230 implementation. 232 Multicast support: 234 A switch may support multicasting, i.e. replication of 235 packets across multiple output ports. This is an 236 optional feature. Similarly, support for 237 sending/receiving multicast packets is optional in 238 CAs. A multicast group is identified by a GID. The GID 239 format is as defined in [RFC2373] on IPv6 addressing. 241 Thus, from the point of view of IPv6 over InfiniBand, the 242 data link multicast address looks like the network 243 address. To receive multicast packets, an IB port must 244 explicitly join a multicast group by sending a request to the SM. 245 A port may send packets to any 246 multicast group. In both cases the multicast LID to be 247 used in the packets is received from the SM. 249 There are 6 methods for data transfer in IB architecture. 250 These are: 252 1. Unreliable Datagram (unacknowledged - connectionless) 254 The UD service is connectionless and unacknowledged. 255 It allows the QP to communicate with any unreliable 256 datagram QP on any node.
258 The switches and hence each link can support only a 259 certain MTU. The supported MTU values are 256 bytes, 512 bytes, 260 1024 bytes, 2048 bytes and 4096 bytes. A UD packet cannot 261 be larger than the smallest link MTU between the two 262 peers. 264 2. Reliable Datagram (acknowledged - multiplexed) 266 The RD service is multiplexed over connections between 267 nodes called end-to-end contexts (EECs), which allow 268 each RD QP to communicate with any RD QP on any node 269 with an established EEC. Multiple QPs can use the same 270 EEC and a single QP can use multiple EECs (one for 271 each remote node per reliable datagram domain). 273 3. Reliable Connected (acknowledged - connection oriented) 275 The RC service associates a local QP with one and only 276 one remote QP. Messages may be as large as 277 2^31 bytes in length. The CA implementation takes care 278 of segmentation and assembly. 280 4. Unreliable Connected (unacknowledged - connection oriented) 282 The UC service associates one local QP with one and 283 only one remote QP. There is no acknowledgment and 284 hence no resend of lost or corrupted packets. Such 285 packets are therefore simply dropped. It is similar to 286 RC otherwise. 288 5. Raw Ethertype (unacknowledged - connectionless) 289 The Ethertype raw datagram packet contains a generic 290 transport header that is not interpreted by the CA but 291 specifies the protocol type. The values for 292 ethertype are the same as defined in RFC1700 for 293 ethertype. 295 6. Raw IPv6 (unacknowledged - connectionless) 297 Using IPv6 raw datagram service, the IBA CA can 298 support standard protocol layers atop IPv6 (such as 299 TCP/UDP). Thus native IPv6 packets can be bridged into 300 the IBA SAN and delivered directly to a port and to 301 its IPv6 raw datagram QP. 303 The first 4 types are referred to as IB transports. The latter 304 two are classified as Raw datagrams. There is no indication of 305 the QP number in the raw datagram packets.
The raw datagram 306 packets are limited by the link MTU in size. 308 The two connected modes and the reliable datagram mode may 309 also support 'Automatic Path Migration(APM)'. This is an 310 optional facility that provides for a hardware based path 311 fail over. An alternate path is associated with the QP when 312 the connection/EE context is first created. If unrecoverable 313 errors are encountered the connection switches to using the 314 alternate path. 316 1.2.1 InfiniBand Addresses 318 The InfiniBand architecture borrows heavily from the IPv6 319 architecture in terms of the InfiniBand subnet structure and 320 global identifiers (GIDs). 322 The InfiniBand architecture defines the GID associated with a 323 port as a 128-bit unicast or multicast identifier. IBA derives 324 the GID address format from the IPv6 format[RFC_2373] with 325 some additional properties/restrictions defined to facilitate 326 efficient discovery, communication and routing. 328 Note: 329 The IBA refers to [RFC_2373] explicitly. It must be noted 330 that IBA is therefore unaffected by any further changes 331 that are introduced in IPv6 addressing architecture. 333 IBA defines two types of GIDs: 334 unicast and, 335 multicast. 337 1.2.1.1 Unicast GIDs 339 The unicast GIDs are defined, as in IPv6, with three scopes. 340 The IB specification states: 342 a. link local: This is defined to be FE80/10. 344 The IB routers will not forward packets with a 345 link local address in source or destination 346 beyond the IB subnet. 348 b. site local: FEC0/10 350 A unicast GID used within a collection of 351 subnets which is unique within that collection 352 (e.g. a data center or campus) but is not 353 necessarily globally unique. IB routers must 354 not forward any packets with either a 355 site-local Source GID or a site-local 356 Destination GID outside of the site. 358 c. global: 359 A unicast GID with a global prefix, i.e. 
an IB 360 router may use this GID to route packets 361 throughout an enterprise or internet. 363 1.2.1.2 Multicast GIDs 365 The multicast GIDs also parallel the IPv6 multicast addresses. 366 The IB specification defines the multicast GIDs as follows: 368 FFxy:<112 bits> 370 Flag bits: 372 The nibble, denoted by x above, are the 4 flag bits: 000T. 374 The first three bits are reserved and are set to zero. The 375 last bit is defined as follows: 377 T=0: denotes a permanently assigned i.e. well known GID 378 T=1: denotes a transient group 380 Scope bits: 382 The 4 bits, denoted by y in the GID above, are the scope 383 bits. These scope values are described in Table 1. 385 scope value Address value 387 0 Reserved 388 1 Unassigned 389 2 Link-local 390 3 Unassigned 391 4 Unassigned 392 5 Site-local 393 6 Unassigned 394 7 Unassigned 395 8 Organization-local 396 9 Unassigned 397 0xA Unassigned 398 0xB Unassigned 399 0xC Unassigned 400 0xD Unassigned 401 0xE Global 402 0xF Reserved 404 Table 1 406 The IB specification further refers to [RFC_2373] and 407 [RFC_2375] while defining the well known multicast addresses. 408 However, it then states that the well known addresses apply to 409 IB raw IPv6 datagrams only. It must be noted though that a 410 multicast group can be associated with only a single MGID. 411 Thus the same MGID cannot be associated with the UD mode and 412 the raw datagram mode. 414 1.3 InfiniBand Multicast Group Management 416 IB multicast groups, identified by Multicast Global 417 Identifiers (MGIDs), are managed by the subnet manager(SM). 418 The SM explicitly programs the IB switches in the fabric to 419 ensure that the packets are received by all the members of the 420 multicast group that request the reception of packets. SM also 421 needs to program the switches such that packets transmitted to 422 the group by any group member reach all receivers in the 423 multicast group. 425 IBA distinguishes between multicast senders and receivers. 
426 Though all members of a multicast group can transmit to the 427 group (and expect their packets to be correctly forwarded) not 428 all members of the group are receivers. A port needs to 429 explicitly request that multicast packets addressed to the 430 group be forwarded to it. 432 A multicast group is created by sending a join request to the 433 SM. As will be explained later, IBA defines multiple modes for 434 joining a multicast group. The subnet manager records the 435 group's multicast GID and the associated characteristics. The 436 group characteristics are defined by the group path MTU, 437 whether the group will be used for raw datagrams or unreliable 438 datagrams, the service level, the partition key associated 439 with the group, the Local Identifier(LID) associated with the 440 group etc. These characteristics are defined at the time of 441 the group creation. The interested reader may lookup the 442 'MCMemberRecord' attribute in the IB architecture 443 specification[IB_ARCH] for the complete list of 444 characteristics that define a group. 446 A LID is associated with the multicast group by the subnet 447 manager(SM) at the time of the multicast group creation. The 448 SM determines the multicast tree based on all the group 449 members and programs the relevant switches. The Multicast 450 LID(MLID) is used by the switches to route the packets. 452 Any member IB port wanting to participate in the multicast 453 group must join the group. As part of the join operation the 454 node receives the group characteristics from the SM. At the 455 same time the subnet manager ensures that the requester can 456 indeed participate in the group by verifying that it can 457 support the group MTU, and accessibility to the rest of the 458 group members. Other group characteristics may need 459 verification too. 461 The SM, for groups that span IB subnet boundaries, must 462 interact with IB routers to determine the presence of this 463 group in other IB subnets. 
If present the MTU must match 464 across the IB subnets. 466 P_Key is another characteristic that must match across IB 467 subnets since the P_Key inserted into a packet is not modified 468 by the IB switches or IB routers. Thus, if the P_Keys did not 469 match, the IB router(s) themselves might drop the packets or 470 destinations on other subnets might drop the packets. 472 A join operation may cause the SM to reprogram the fabric so 473 that the new member can participate in the multicast group. By 474 the same token a leave may cause the SM to reprogram the 475 fabric to stop forwarding the packets to the requester. 477 1.3.1 Multicast Member Record 479 The multicast group is maintained by the SM with each of the 480 group members represented by an MCMemberRecord[IB_ARCH]. Some 481 of its components are: 483 MGID - Multicast GID for this multicast group 484 PortGID - Valid GID of the port joining this multicast group 485 Q_Key - Q_Key to be used by this multicast group 486 MLID - Multicast LID for this multicast group 487 MTU - MTU for this multicast group 488 P_Key - Partition key for this multicast group 489 SL - Service Level for this multicast group 490 Scope - Same as MGID address scope 491 JoinState - Join/Leave status requested by the port: 492 bit 0: FullMember 493 bit 1: NonMember 494 bit 2: SendOnlyNonMember 496 1.3.1.1 JoinState 498 The JoinState indicates the membership qualities a port wishes 499 to add while joining/creating a group or delete when leaving a 500 group. The meanings of the JoinState bits are: 502 FullMember: 503 Messages destined for the group are routed to and from 504 the port. A group may be deleted by the SM if there 505 are no FullMembers in the group. 507 NonMember: 508 Messages destined for the group are routed to and from 509 the port. The port is not considered a member for 510 purposes of group creation/deletion. 512 SendOnlyNonMember: 513 Group messages are routed only from the port and not 514 to the port.
The port is not considered a member for 515 purposes of group creation/deletion. 517 A port may have multiple bits set in its record. In such case 518 the membership qualities are a union of the JoinStates. A port 519 may leave the multicast group for each of the JoinStates 520 individually or in any combination of JoinState 521 bits[IB_ARCH]. 523 1.3.2 Join and Leave Operations 525 An IB port joins a multicast group by sending a join 526 request(SubnAdmSet() method) and leaves a multicast group by 527 sending a leave message (SubnAdmDelete() method) to the SM. 528 The IBA specification[IB_ARCH] describes the methods and 529 attributes to be used when sending these messages. 531 1.3.2.1 Creating a Multicast Group 533 There is no 'create' command to form a new multicast group. 534 The FullMember bit in the JoinState must be set to create a 535 multicast group. In other words, the first FullMember join 536 request will cause the group to be created as a side effect of 537 the join request. Subsequent join or leave requests may 538 contain any combination of the JoinState bits. 540 The creator of the group specifies the Q_Key, MTU, P_Key, SL, 541 FlowLabel, TClass and the Scope value. A creator may request 542 that a suitable MGID be created for it. Alternatively, the 543 request can specify the desired MGID. In both cases the MLID 544 is assigned by the SM. 546 Thus a group will be created with the specified values when 547 the requester sets the FullMember bit and no such group 548 already exists in the subnet. 550 1.3.2.3 Deleting a Multicast Group 552 When the last FullMember leaves the multicast group the SM may 553 delete the multicast group releasing all resources, including 554 those that might exist in the fabric itself, associated with 555 the group. 557 Note that a special 'delete' message does not exist. It is a 558 side effect of the last FullMember 'leave' operation. 
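The create and delete side effects described in sections 1.3.2.1 and 1.3.2.3 can be modelled with a toy group table. This is a minimal sketch, not an SM implementation: the class and method names are hypothetical, the real SubnAdmSet()/SubnAdmDelete() messages and the other MCMemberRecord fields are omitted, and the sketch always deletes a group once its last FullMember leaves, although the SM is only permitted, not required, to do so.

```python
# JoinState bits as listed in section 1.3.1.
FULL_MEMBER = 0x1            # bit 0
NON_MEMBER = 0x2             # bit 1
SEND_ONLY_NON_MEMBER = 0x4   # bit 2


def receives(join_state: int) -> bool:
    """FullMember and NonMember ports have group traffic routed to them."""
    return bool(join_state & (FULL_MEMBER | NON_MEMBER))


class ToySubnetManager:
    """Hypothetical model: MGID -> {port GID: JoinState bits}."""

    def __init__(self):
        self.groups = {}

    def join(self, mgid, port_gid, join_state):
        members = self.groups.get(mgid)
        if members is None:
            # There is no 'create' command: only a FullMember join
            # creates the group, as a side effect of the join.
            if not join_state & FULL_MEMBER:
                raise LookupError("group does not exist")
            members = self.groups[mgid] = {}
        # Multiple bits may be set; the qualities are a union.
        members[port_gid] = members.get(port_gid, 0) | join_state

    def leave(self, mgid, port_gid, join_state):
        members = self.groups[mgid]
        remaining = members.get(port_gid, 0) & ~join_state
        if remaining:
            members[port_gid] = remaining
        else:
            members.pop(port_gid, None)
        # There is no 'delete' message: deletion is a side effect of the
        # last FullMember leave (always applied in this sketch).
        if not any(bits & FULL_MEMBER for bits in members.values()):
            del self.groups[mgid]


sm = ToySubnetManager()
MGID = "ff12::1234"  # hypothetical transient, link-local MGID
sm.join(MGID, "port-A", FULL_MEMBER)            # creates the group
sm.join(MGID, "port-B", SEND_ONLY_NON_MEMBER)   # joins the existing group
sm.leave(MGID, "port-A", FULL_MEMBER)           # last FullMember leaves
print(MGID in sm.groups)
```

In this model port-B loses the group when port-A leaves, which is why a SendOnlyNonMember may want to be told about group deletions.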
560 1.3.2.4 Multicast Group Create/Delete Traps 562 The SA may be requested by the ports to generate a report 563 whenever a multicast group is created or deleted. The port can 564 specify the multicast group it is interested in i.e. use a 565 specific MGID or use a wild card request. The SA will report 566 these events using traps 66 (for creates) and 67 (for 567 deletes)[IB_ARCH]. 569 Therefore, a port wishing to join a group but not create it by 570 itself may request a create notification or a port might even 571 request a notification for all groups that are created(a 572 wild card request). The SA will diligently inform them of the 573 creation utilizing the aforementioned traps. The requester can 574 then join the multicast group indicated. Similarly, a 575 SendOnlyNonMember or a NonMember might request the SA to 576 inform it of group deletions. The endnode, on receiving a 577 delete report, can safely release the resources associated 578 with the group. The associated MLID is no longer valid for the 579 group and may be reassigned to a new multicast group by the 580 SM. 582 2.0 Management of InfiniBand Subnet 584 To aid in the monitoring and configuration of InfiniBand 585 subnet components a set of MIB modules need to be defined. 586 MIB modules are needed for the channel adapters, InfiniBand 587 interfaces, InfiniBand subnet manager, InfiniBand subnet 588 management agents and to allow the management of specific 589 device properties. It must be noted that the management 590 objects addressed in the IPoIB documents are for all of the 591 IB subnet components and are not limited to IP(over IB). 592 The relevant MIB modules are described in separate 593 documents and are not covered here. 595 3.0 IP over IB 597 As described in section 1.0, the InfiniBand architecture 598 provides a broad set of capabilities to choose from when 599 implementing IP over InfiniBand networks. 
601 The IPoIB specification must not, and does not, require 602 changes in IP and higher layer protocols. Nor does it mandate 603 requirements on IP stacks to implement special user level 604 programs. It is an aim of IPoIB specification that the IPoIB 605 changes be amenable to modularization and incorporation into 606 existing implementations at the same level as other media 607 types. 609 3.1 InfiniBand as Datalink 611 InfiniBand architecture provides multiple methods of data 612 exchange between two endpoints as was noted above. These are: 614 Reliable Connected (RC) 615 Reliable Datagram (RD) 616 Unreliable Connected (UC) 617 Unreliable Datagram (UD) 618 Raw Datagram : Raw IPv6 (R6) 619 : Raw Ethertype (RE) 621 IPoIB can be implemented over any, multiple or all of these 622 services. A case can be made for support on any of the 623 transport methods depending on the desired features. 625 The IB specification requires Unreliable Datagram mode to be 626 supported by all the IB nodes. The host channel adapters(HCAs) 627 are specifically required to support Reliable connected(RC) 628 and Unreliable connected(UC) modes but the same is not the 629 case with target channel adapters(TCAs). Support for the two 630 Raw Datagram modes is entirely optional. The Raw Datagram mode 631 supports a 16-bit CRC as against the better protection 632 provided by the use of a 32-bit CRC in other modes. 634 For the sake of simplicity, ease of implementation and 635 integration with existing stacks, it is desirable that the 636 fabric support multicasting. This is possible only in 637 Unreliable datagram (UD) and IB's Raw datagram modes. 639 Thus it is only the UD mode that is universal, supports 640 multicast, and a robust CRC. Given these conditions it is the 641 obvious choice for IP over InfiniBand [IPOIB_ENCAP]. 643 Future documents might consider the connected modes. 
In 644 contrast to the limited link MTU offered by UD mode, the 645 connected modes can offer significant benefit in terms of 646 performance by utilizing a larger MTU. Reliability is also 647 enhanced if the underlying feature of automatic path migration 648 of connected modes is utilized. 650 3.2 Multicast Support 652 The InfiniBand specification makes support of multicasting in the 653 switches optional. Multicast, however, is a basic requirement 654 in IP networks. Therefore, IPoIB requires that multicast 655 capable InfiniBand fabrics be used to implement IPoIB 656 subnets. 658 3.2.1 Mapping IP Multicast to IB Multicast 660 Well known IP multicast groups are defined for both IPv4 and 661 IPv6 (RFC_1700, RFC_2373). Multicast groups may also be 662 dynamically created at any time. To avoid creating unnecessary 663 duplicates of multicast packets in the fabric, and to avoid 664 unnecessary handling of such packets at the hosts, each of the 665 IP multicast groups needs to be associated with a different IB 666 multicast group as far as possible. A process is defined in 667 [IPOIB_ENCAP] for mapping the IP multicast addresses to unique 668 IB multicast addresses. 670 3.2.2 Transient Flag in IB MGIDs 672 The IB specification describes the flag bits as discussed in 673 section 1.2.1.2. The IB specification also defines some well known 674 IB multicast GIDs (MGIDs). These well known MGIDs are reserved for IB's 675 Raw datagram mode, which is incompatible with the other 676 transports of IB. Any mapping that is defined from IP 677 multicast addresses therefore must not fall into IB's 678 definition of a well-known address. 680 Therefore all IPoIB related multicast GIDs always set the 681 transient bit. 683 3.3 IP Subnets Across IB Subnets 685 Some implementations may wish to support multiple clusters of 686 machines in their own IB subnets but otherwise be part of a 687 common IP subnet. For such a solution the IB specification 688 needs multiple upgrades.
Some of the required enhancements 689 are: 691 1) A method for creating IB multicast GIDs that span multiple 692 IB subnets. The partition keys and other parameters need to 693 be consistent across IB subnets. 695 2) An IB routing protocol to determine the IB topology 696 across IB subnets. 698 3) A definition of the process and protocols needed between IB nodes 699 and IB routers. 701 Until the above conditions are met it is not possible to 702 implement IPoIB subnets that span IB subnets. The IPoIB 703 standards have, however, been defined with this possibility in 704 mind. 706 4.0 IP Subnets in InfiniBand Fabrics 708 The IPoIB subnet is overlaid over the IB subnet. The IPoIB 709 subnet is brought up in the following steps: 711 Note: the join/leave operation at the IP level will be 712 referred to as IP_join/IP_leave and the join/leave 713 operations at the IB level will be referred to as 714 IB_join/IB_leave in this document. 716 1. The all-IPoIB nodes IB multicast group is created 718 The fabric administrator creates an IB multicast 719 group (henceforth called 'broadcast group') when the IP subnet 720 is set up. The 'broadcast group' is defined in [IPOIB_ENCAP]. 722 The method by which the broadcast group is set up is not 723 defined by IPoIB. The group may be set up at the SM by the 724 administrator or by the first IB_join. 726 As noted earlier, at the time of creating an IB multicast 727 group, multiple values such as the P_Key, Q_Key, Service 728 Level, Hop Limit, Flow ID, TClass, MTU etc., have to be 729 specified. These values should be such that all potential 730 members of the IB multicast group are able to communicate 731 with one another when using them. In the future, as the IB 732 specification associates more meaning with the various 733 parameters and defines IB QoS, different values for IP 734 multicast traffic may be possible. All unicast packets also 735 need to use the P_Key and Q_Key specified in the broadcast 736 group [IPOIB_ENCAP].
A carefully thought-out configuration is therefore required for
a successful setup of the IPoIB subnet.

2. All IPoIB interfaces IB_join the broadcast group

The broadcast group defines the span and the members of the
IPoIB link. This link gets built up as IPoIB nodes IB_join the
broadcast group.

The IB_join to the broadcast group has the additional benefit
of distributing the above-mentioned multicast group parameters
to all the members of the subnet.

Note that this IB_join to the broadcast group is a FullMember
join. If any of the ports or the switches linking the port to
the rest of the IPoIB subnet cannot support the parameters
(e.g. path MTU or P_Key) associated with the broadcast group,
then the IB_join request will fail and the requesting port will
not become part of the IPoIB subnet.

3. Configuration Parameters

As noted above, parameters such as the Q_Key and Path MTU,
needed for all IPoIB communication, are returned to the IPoIB
node on IB_joining the 'broadcast group'. [IPOIB_ENCAP] also
notes that the parameters used in the broadcast group are used
when creating other multicast groups.

However, the P_Key must still be known to the IPoIB endnode
before it can join the broadcast group. The P_Key is included
in the mapping of the broadcast group [IPOIB_ENCAP]. Another
parameter, the scope of the broadcast group, also needs to be
known to the endnode before it can join the broadcast group.

How the P_Key and the scope bits related to the IPoIB subnet
are determined is an implementation choice. These could be
configuration parameters initialized by some means by the
administrator.

The methods employed by an implementation to determine the
P_Key and scope bits are not specified by IPoIB.

4.1 IPoIB VLANs

The endpoints in an IB subnet must have compatible P_Keys to
communicate with one another.
Thus the administrator, when setting up an IP subnet over an IB
subnet, must ensure that all the members have compatible
P_Keys. An IP subnet can have only one P_Key associated with it
to ensure that all IP nodes in it can talk to one another. An
endpoint may, however, have multiple P_Keys.

The IB architecture specifies that there can be only one MGID
associated with a multicast group in the IB subnet. The P_Key
is included in the MGID mappings from the IP multicast
addresses [IPOIB_ENCAP]. Since the P_Key is unique in the IB
subnet, the inclusion of the P_Key in the IB MGIDs ensures that
unique MGID mappings are created. Every unique broadcast group
MGID so formed creates a separate abstract IPoIB link and hence
an IPoIB VLAN.

4.2 Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts
and rules as on any other media. However, unlike most other
media, multicast over InfiniBand requires interaction with
another entity, the IB subnet manager. This section describes
the outline of the process and suggests some guidelines.

The IB architecture specifies the following format for IB
multicast packets when used over unreliable datagram (UD) mode:

+--------+-------+---------+---------+-------+---------+---------+
|Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
|Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
|Header  |Header |Header   |Transport| (IP)  |         |         |
|        |       |         |Header   |       |         |         |
+--------+-------+---------+---------+-------+---------+---------+

For details about the various headers please refer to the
InfiniBand Architecture Specification [IB_ARCH].

The Global routing header (GRH) includes the IB multicast group
GID. The Local routing header (LRH) includes the local
identifier (LID). The IB switches in the fabric route the
packet based on the LID.
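
To illustrate how the P_Key becomes part of the group GID carried
in the GRH, the following is a minimal sketch of an IPv4-to-MGID
mapping. The byte layout used here (0xFF, a flags nibble with the
transient bit set, a scope nibble, the 16-bit signature 0x401B,
the 16-bit P_Key, and the low 28 bits of the IPv4 group address in
an 80-bit group ID) is an assumption taken from the published
IPoIB encapsulation specification and is not defined by this
document; [IPOIB_ENCAP] remains the authority. The function name
ipv4_mgid and its parameters are hypothetical.

```python
import ipaddress

# Assumed IPoIB signature for IPv4, per the encapsulation specification.
IPV4_SIGNATURE = 0x401B


def ipv4_mgid(group: str, p_key: int, scope: int = 0x2) -> bytes:
    """Build a 16-octet MGID for an IPv4 multicast group (illustrative)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("not an IPv4 multicast address")
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")

    flags = 0x1                           # transient bit always set
    group_id = int(addr) & 0x0FFFFFFF     # low 28 bits of the IP address
    return (
        bytes([0xFF, (flags << 4) | scope])
        + IPV4_SIGNATURE.to_bytes(2, "big")
        + p_key.to_bytes(2, "big")
        + group_id.to_bytes(10, "big")    # 80-bit group ID, zero padded
    )


if __name__ == "__main__":
    # The same IP group under two P_Keys yields two distinct MGIDs,
    # which is what keeps IPoIB VLANs separate.
    print(ipv4_mgid("224.0.0.1", 0x8001).hex())
    print(ipv4_mgid("224.0.0.1", 0x8002).hex())
```

Because the P_Key is embedded in the MGID, the same IP multicast
address maps to different IB groups on different partitions.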

The GID is made available to the receiving IB user (the IPoIB
interface driver, for example). The driver can therefore
determine the IB group the packet belongs to.

IPv4 defines three levels of multicast compliance. These are:

   Level 0: No support for IP multicasting

   Level 1: Support for sending but not receiving multicasts

   Level 2: Full support for IP multicasting

In IPv6 there is no such distinction. Full multicast support is
mandatory. Additionally, all IPv4 subnets support broadcast
(255.255.255.255). IPv4 broadcast can always be sent/received
by all IPv4 interfaces.

Every IPoIB subnet requires the broadcast GID to be defined.
Thus a packet can always be broadcast.

4.2.1 Sending IP Multicast Datagrams

An IP host may send a multicast packet at any time to any
multicast address.

The IP layer conveys the multicast packet to the IPoIB
interface driver/module. This module attempts to IB_join the
relevant IB multicast group. This is required since otherwise
the InfiniBand architecture does not guarantee that the packet
will reach its destinations.

A pure sender may choose to join the multicast group as a
FullMember. In such a case the sender will receive all the
multicast packets transmitted to the IB group. Additionally,
the IB group will not be deleted until the sender leaves the
group.

Alternatively, a sender might IB_join as a SendOnlyNonMember.
In such a case the packets are not routed to the sender, though
packets transmitted by it can reach the other group members.
Additionally, the group can be deleted when all FullMembers
have left the group. The sender can further request delete
updates from the SM.

If the sender does not find the group in existence, it is
recommended in [IPOIB_ENCAP] that the packets be sent to the
MGID corresponding to the all-IP routers address. A sender
could also send the packets to the broadcast group.
The sender might also choose to request 'creation' reports from
the SM.

4.2.2 Receiving Multicast Packets

The IP host must join the IB multicast group corresponding to
the IP address. This follows from the IBA requirement that the
receiver must join the relevant IB multicast group. The group
is automatically created if it does not exist [IB_ARCH].

The IP receivers must IB_leave the IB group when the IP layer
stops listening on the corresponding IP address. The SM can
then choose to delete the group.

4.2.3 Router considerations for IPoIB

IP routers know of the new IP groups created in the subnet by
the use of protocols such as IGMP/MLD. However, this is not
enough for IPoIB since the router needs to IB_join the relevant
IB groups to be able to receive and transmit the packets. There
is no promiscuous mode for listening to all packets.

The IPoIB routers therefore need to request the SM to report
all creations of IB groups in the fabric. The IPoIB router can
then IB_join the reported group. It is not desirable that the
router's IB_joining of a multicast group be considered the same
as the IB_join from a receiver - the router's IB_join shouldn't
prevent the group's deletion when all receivers leave. To
overcome just this type of situation, IBA provides the
NonMember IB_join mode.

The NonMember IB_join mode can be used by IP routers when they
join in response to the create reports. A router should ideally
request the delete reports too so that it can release all the
resources associated with the group. The MLID associated with a
deleted MGID can be reassigned by the SM and therefore there is
a possibility of erroneous transmissions if the MLID is cached.
A router that does not request delete reports will still work
correctly since it will receive the correct MLID, and purge any
old cached value, when it IB_joins the IB group in response to
a create report.

It is reasonable for a router to IB_join as a FullMember if it
is joining the IB group in response to an application/routing
daemon request. In such a case the router might end up
controlling the existence of the IB group (since it is a
FullMember of the group).

4.2.4 Impact of InfiniBand Architecture Limits

An HCA or TCA may have a limit on the number of MGIDs it can
support. Thus, even though the groups may not be limited at the
subnet manager and in the subnet as such, they may be limited
at a particular interface. It is advisable to choose an
adequately provisioned HCA/TCA when setting up an IPoIB subnet.

4.2.5 Leaving/Deleting a Multicast Group

An IPv4 sender (level 1 compliance) IB_joins the IB multicast
group only because that is the only way to guarantee reception
of the packets by all the group recipients. The sender must,
however, IB_leave the group at some time. A sender could, when
not a receiver on the group, start a timer per multicast group
sent to. The sender leaves the IB group when the timer goes
off. It restarts the timer if another message is sent.

This suggestion doesn't apply to the IB broadcast group. It
also doesn't apply to the IB group corresponding to the
all-hosts multicast group. An IPv4 host must always remain a
member of the broadcast group.

An IP multicast receiver IB_leaves the corresponding IB
multicast group when it IP_leaves the IP multicast group. In
the case of an IPv4 implementation the receiver may choose to
continue to be a sender (level 1 compliance), in which case it
may choose not to IB_leave the IB group but start a timer as
explained above.
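
The per-group timer suggested above could be sketched as follows.
This is only an illustration under stated assumptions: the class
SendOnlyGroups, its methods, the idle timeout value, and the
ib_leave callback are all hypothetical names, the caller supplies
the current time, and the IB_join itself is not modeled. Groups
listed in 'exempt' stand for the broadcast group and the
all-hosts group, which the suggestion does not apply to.

```python
from typing import Callable, Dict, Hashable, Iterable, List


class SendOnlyGroups:
    """Tracks idle deadlines for groups joined only in order to send."""

    def __init__(
        self,
        idle_timeout: float,
        ib_leave: Callable[[Hashable], None],
        exempt: Iterable[Hashable] = (),
    ) -> None:
        self.idle_timeout = idle_timeout
        self.ib_leave = ib_leave            # hypothetical leave callback
        self.exempt = set(exempt)           # groups that are never left
        self.deadline: Dict[Hashable, float] = {}

    def on_send(self, mgid: Hashable, now: float) -> None:
        """Start or restart the timer each time a packet is sent."""
        if mgid in self.exempt:
            return
        self.deadline[mgid] = now + self.idle_timeout

    def expire(self, now: float) -> List[Hashable]:
        """IB_leave every group whose timer has gone off."""
        expired = [g for g, d in self.deadline.items() if d <= now]
        for mgid in expired:
            del self.deadline[mgid]
            self.ib_leave(mgid)
        return expired


if __name__ == "__main__":
    left: List[Hashable] = []
    groups = SendOnlyGroups(30.0, left.append, exempt={"broadcast"})
    groups.on_send("group-a", now=0.0)
    groups.on_send("group-a", now=20.0)     # restarts the timer
    groups.expire(now=30.0)                 # nothing has expired yet
    groups.expire(now=50.0)                 # group-a is left here
    print(left)
```

Passing the time in explicitly keeps the sketch independent of
any particular timer facility; a real driver would hook this to
its own timer mechanism.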

As noted elsewhere, the SM can choose to free up the resources
(e.g. routing entries in the switches) associated with the IB
group when the last FullMember IB_leaves the group. The MLID
therefore becomes invalid for the group. The MLID can be
reassigned when a new group is created.

SendOnlyNonMember/NonMember ports caching the MLID need to
avoid using an MLID that has been invalidated or reassigned.
The way out is for them to request group delete reports. An IP
router requesting reports for all groups need not request the
delete report since an IB_join in response to a create report
will return the new MLID association to it.

A router might prefer to IB_leave the IB multicast group when
there are no members of the IP multicast address in the subnet
and it has no explicit knowledge of any need to forward such
packets.

4.3 Transmission of IPoIB packets

The encapsulation of IP packets in InfiniBand is described in
[IPOIB_ENCAP].

It specifies the use of an 'Ethertype' value [IANA] in all
IPoIB communication packets. The link-layer address is
comprised of the Global Identifier (GID) and the Queue Pair
Number (QPN) [IPOIB_ENCAP].

To allow for IPoIB subnets that span multiple IB subnets, the
specification utilizes the Global Identifier (GID) as part of
the link-layer address. Since all packets in IB have to use the
Local Identifier (LID), the address resolution process has the
additional step of resolving the destination GID, returned in
response to an ARP/ND request, to the LID [IPOIB_ENCAP]. This
phase of address resolution might also be used to determine
other essential parameters (e.g. the SL, path rate, etc.) for
successful IB communication between two peers.

As noted earlier, all communication in the IPoIB subnet derives
the Q_Key to use from the Q_Key specified in the broadcast
group.
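
As an illustration of a link-layer address built from the QPN and
the GID, the sketch below packs and unpacks a 20-octet address.
The exact layout (one reserved octet, a 24-bit QPN, then the
16-octet GID) is an assumption based on the published IPoIB
encapsulation specification; this document only states that the
address comprises the GID and the QPN. The function names are
hypothetical.

```python
from typing import Tuple


def pack_link_address(qpn: int, gid: bytes) -> bytes:
    """Return a 20-octet IPoIB link-layer address (assumed layout)."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    reserved = bytes([0])                 # reserved octet, sent as zero
    return reserved + qpn.to_bytes(3, "big") + gid


def unpack_link_address(addr: bytes) -> Tuple[int, bytes]:
    """Split a 20-octet link-layer address into (QPN, GID)."""
    if len(addr) != 20:
        raise ValueError("link-layer address must be 20 octets")
    return int.from_bytes(addr[1:4], "big"), addr[4:]


if __name__ == "__main__":
    # Example GID: a link-local prefix followed by an arbitrary GUID.
    gid = bytes.fromhex("fe80000000000000") + bytes(range(8))
    addr = pack_link_address(0x000123, gid)
    print(len(addr), addr.hex())
    print(unpack_link_address(addr) == (0x123, gid))
```

The GID recovered from such an address is what the sender must
still resolve to a LID before it can transmit.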

4.4 RARP and Static ARP entries

RARP entries or static ARP entries are based on invariant
link-layer addresses. In the case of IPoIB, the link-layer
address includes the QPN, which might not be constant across
reboots or even across network interface resets. Therefore,
static ARP entries or RARP server entries will only work if the
implementation(s) using these options can ensure that the QPN
associated with an interface is invariant across
reboots/network resets [IPOIB_ENCAP].

4.5 DHCPv4 and IPoIB

DHCPv4 [RFC_2131] utilizes a 'client hardware address' (chaddr)
field (expected to hold the link-layer address) of 16 bytes.
The address in the case of IPoIB is 20 bytes. To get around
this problem IPoIB specifies [IPOIB_DHCP] that the 'broadcast
flag' be used by the client when requesting an IP address.

5.0 QoS and Related Issues

The IB specification suggests the use of service levels for
load balancing, QoS and deadlock avoidance within an IB subnet.
But the IB specification leaves the usage and mode of
determination of the SL for the application to decide. The SL
and list of SLs are available in the SA but it is up to the
endnode's application to choose the 'right' value.

Every IPoIB implementation will determine the relevant SL value
based on its own policy. No method or process for choosing the
SL has been defined by the IPoIB standards.

6.0 Security Considerations

This document describes the IB architecture as relevant to
IPoIB. It further restates issues specified in other documents.
It does not itself specify any requirements. There are no
security issues introduced by this document. IPoIB-related
security issues are described in [IPOIB_ENCAP] and
[IPOIB_DHCP].

7.0 Acknowledgments

This document has benefited from the comments and suggestions
of the members of the IPoIB working group and the members of
the InfiniBand(SM) Trade Association.

8.0 References

8.1 Normative References

[IB_ARCH]     InfiniBand Architecture Specification, Volume 1.1

[IPOIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-06.txt

[IPOIB_DHCP]  draft-ietf-ipoib-dhcp-over-infiniband-05.txt

8.2 Informative References

[RFC_2373]    IP Version 6 Addressing Architecture

[RFC_2375]    IPv6 Multicast Address Assignments

[RFC_1700]    Assigned Numbers

[RFC_1112]    Host extensions for IP multicasting

[RFC_2236]    Internet Group Management Protocol, Version 2

[RFC_2710]    Multicast Listener Discovery

9.0 Author's Address

Vivek Kashyap

IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: +1 503 578 3422
Email: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and
furnished to others, and derivative works that comment on or
otherwise explain it or assist in its implementation may be
prepared, copied, published and distributed, in whole or in
part, without restriction of any kind, provided that the above
copyright notice and this paragraph are included on all such
copies and derivative works. However, this document itself may
not be modified in any way, such as by removing the copyright
notice or references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will
not be revoked by the Internet Society or its successors or
assigns.

This document and the information contained herein is provided
on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE
USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.