draft-ietf-ipoib-architecture-02.txt
INTERNET DRAFT
                                                        Vivek Kashyap
Expiration Date: March, 2004                                      IBM
                                                      September, 2003

               IP over InfiniBand (IPoIB) Architecture

Status of this memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html

This memo provides information for the Internet community.
This memo does not specify an Internet standard of any kind.
Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

InfiniBand is a high speed, channel based interconnect between
systems and devices.

This document presents an overview of the InfiniBand
architecture. It further describes the requirements and
guidelines for the transmission of IP over InfiniBand.
Discussions in this document are applicable to both IPv4 and
IPv6 unless explicitly specified. The encapsulation of IP over
InfiniBand and the mechanism for IP address resolution on IB
fabrics are covered in [IPOIB_MCAST], [IPOIB_ENCAP] and
[IPOIB_DHCP].

Table of Contents

1.0 Introduction to InfiniBand
  1.1 InfiniBand Architecture Specification
  1.2 Overview of InfiniBand Architecture
    1.2.1 InfiniBand Addresses
      1.2.1.1 Unicast GIDs
      1.2.1.2 Multicast GIDs
  1.3 InfiniBand Multicast Group Management
    1.3.1 Multicast Member Record
      1.3.1.1 JoinState
    1.3.2 Join and Leave Operations
      1.3.2.1 Creating a Multicast Group
      1.3.2.3 Deleting a Multicast Group
      1.3.2.4 Multicast Group Create/Delete Traps
2.0 Management of InfiniBand Subnet
3.0 IP over IB
  3.1 InfiniBand as Datalink
  3.2 Multicast Support
    3.2.1 Mapping IP Multicast to IB Multicast
    3.2.2 Transient Flag in IB MGIDs
  3.3 IP Subnets Across IB Subnets?
4.0 IP Subnets in InfiniBand Fabrics
  4.1 IPoIB VLANs
  4.2 Multicast in IPoIB Subnets
    4.2.1 Sending IP Multicast Datagrams
    4.2.2 Receiving Multicast Packets
    4.2.3 Forwarding Multicast Packets
    4.2.4 Impact of InfiniBand Architecture Limits
    4.2.5 Leaving/Deleting a Multicast Group
5.0 QoS and Related Issues
6.0 Security Considerations
7.0 Acknowledgements
8.0 References
9.0 Author's Address

1.0 Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop
an I/O specification to deliver a channel based, switched
fabric technology. The InfiniBand standard is aimed at meeting
the requirements of scalability, reliability, availability and
performance of servers in data centers.

1.1 InfiniBand Architecture Specification

The InfiniBand Trade Association specification is available
for download from http://www.infinibandta.org.

1.2 Overview of InfiniBand Architecture

For a more complete overview the reader is referred to
chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area
Network (SAN) for connecting multiple independent processor
platforms, I/O platforms and I/O devices. The IBA SAN is a
communications and management infrastructure supporting both
I/O and inter-processor communications for one or more
computer systems.

An IBA SAN consists of processor nodes and I/O units connected
through an IBA fabric made up of cascaded switches and IB
routers (connecting IB subnets). I/O units can range in
complexity from single ASIC IBA attached devices such as a LAN
adapter to a large memory rich RAID subsystem.

An IBA network may be subdivided into subnets interconnected
by routers. These are IB routers and IB subnets and not IP
routers or IP subnets. This document will refer to InfiniBand
routers and subnets as 'IB routers' and 'IB subnets'
respectively.
The IP routers and IP subnets will be referred
to as 'routers' and 'subnets' respectively.

Each IB node or switch may attach to one or more switches, or
directly to another node. Each IB unit interfaces with the
link by way of channel adapters (CAs). The architecture
supports multiple CAs per unit with each CA providing one or
more ports that connect to the fabric. Each CA appears as a
node to the fabric.

The ports are the endpoints to which the data is sent.
However, each of the ports may include multiple QPs (queue
pairs) that may be directly addressed from a remote peer. From
the point of view of data transfer the QP number (QPN) is part
of the address.

IBA supports both connection oriented and datagram service
between the ports. The peers are identified by QPN and the
port identifier. There are two exceptions. QPNs are not used
when packets are multicast. QPNs are also not used in the raw
datagram mode.

A port, in a data packet, is identified by a local ID (LID)
and optionally a Global ID (GID). The GID in the packet is
needed only when communicating across IB subnets, though it
may always be included.

The GID is 128 bits long and is formed by the concatenation of
a 64 bit IB subnet prefix and a 64 bit EUI-64 compliant
portion (GUID). The LID is a 16 bit value that is assigned
when the port becomes active. Note that the GUID is the only
persistent identifier of a port. However, it cannot be used as
an address in a packet. If the prefix is modified then the GID
may change. The subnet manager may attempt to keep the LID
values constant across reboots but that is not a requirement.

The assignment of the GID and the LID is done by the subnet
manager. Every IB subnet has at least one subnet manager
component that controls the fabric. It assigns the LIDs and
GIDs.
The subnet manager also programs the switches so that
they route packets between destinations. The subnet manager
and a related component, the subnet administrator (SA), are
the central repository of all information that is required to
set up and bring up the fabric.

IB routers are components that route packets between IB
subnets based on the GIDs. Thus within an IB subnet a packet
may or may not include a GID, but when crossing IB subnets the
GID must be included. A LID is always needed in a packet since
the destination within a subnet is determined by it.

A CA and a switch may have multiple ports. Each CA port is
assigned its own LID or a range of LIDs. The ports of a switch
are not addressable by LIDs/GIDs or, in other words, are
transparent to other end nodes. Each port has its own set of
buffers. The buffering is channeled through virtual lanes (VLs)
where each VL has its own flow control. There may be up to 16
VLs.

VLs provide a mechanism for creating multiple virtual links
within a single physical link. All ports must support VL15,
which is reserved exclusively for subnet management datagrams
and hence does not concern the IPoIB discussions. The actual
VL that a packet uses is configured by the SM in the
switch/channel adapter tables and is determined based on the
Service Level (SL) specified in every packet. There are 16
possible SLs.

In addition to the features described above, viz. Queue
Pairs (QPs), Service Levels (SLs) and addressing (GID/LID), IBA
also defines the following:

   Partitioning:

      Every packet, except for raw datagrams, carries the
      partition key (P_Key). These values are used for
      isolation in the fabric. A switch (this is an optional
      feature) may be programmed by the SM to drop packets
      not having a certain key. The CA ports always check
      for the P_Keys. A CA port may belong to multiple
      partitions.
      P_Key checking is optional at IB routers.

      A P_Key may be described as having 'limited
      membership' or 'full membership'. For a packet to be
      accepted, at least one of the P_Keys, i.e. the P_Key in
      the packet or the P_Key in the port, must be a 'full
      membership' P_Key.

   Q_Keys:

      Q_Keys are used to enforce access rights for reliable
      and unreliable IB datagram services. Raw datagram
      services don't use Q_Keys. At communication
      establishment the endpoints exchange the Q_Keys and
      must always use the relevant Q_Keys when communicating
      with one another. Multicast packets use the Q_Key
      associated with the multicast group.

      Q_Keys with the most significant bit set are
      considered controlled Q_Keys (such as the GSI Q_Key)
      and an HCA does not allow a consumer to arbitrarily
      specify a controlled Q_Key. An attempt to send a
      controlled Q_Key results in using the Q_Key in the QP
      context. Thus the OS maintains control since it can
      configure the QP context for the controlled Q_Key for
      privileged consumers. It must be noted that although
      the IB specification suggests the notion of a
      'controlled Q_Key', it does not require its use or
      implementation.

   Multicast support:

      A switch may support multicasting, i.e. replication of
      packets across multiple output ports. This is an
      optional feature. Similarly, support for
      sending/receiving multicast packets is optional in
      CAs. A multicast group is identified by a GID. The GID
      format is as defined in [RFC2373] on IPv6 addressing.
      Thus, from the point of view of IPv6 over InfiniBand,
      the data link multicast address looks like the network
      address. To receive multicast packets, an IB port must
      explicitly join the multicast group by sending a
      request to the SM. A port may send packets to any
      multicast group. In both cases the multicast LID to be
      used in the packets is received from the SM.
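
The acceptance rule given under 'Partitioning' above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not text from the IB specification: it assumes that the 16-bit P_Key carries its membership type in the most significant bit (1 for full, 0 for limited) and the partition number in the low 15 bits, a layout this document does not spell out. The function names are hypothetical.

```python
from typing import Iterable

FULL_MEMBER_BIT = 0x8000  # assumed: top bit of the 16-bit P_Key marks full membership
KEY_BASE_MASK = 0x7FFF    # assumed: low 15 bits identify the partition


def is_full_member(p_key: int) -> bool:
    """Return True if the P_Key carries the (assumed) full-membership bit."""
    return bool(p_key & FULL_MEMBER_BIT)


def p_keys_match(packet_p_key: int, port_p_key: int) -> bool:
    """Apply the rule from the text: same partition, and at least one
    of the two P_Keys must be a 'full membership' P_Key."""
    same_partition = (packet_p_key & KEY_BASE_MASK) == (port_p_key & KEY_BASE_MASK)
    one_is_full = is_full_member(packet_p_key) or is_full_member(port_p_key)
    return same_partition and one_is_full


def port_accepts(packet_p_key: int, port_p_keys: Iterable[int]) -> bool:
    """A CA port may belong to several partitions, so check each of its P_Keys."""
    return any(p_keys_match(packet_p_key, key) for key in port_p_keys)
```

Under these assumptions, two limited members of the same partition cannot exchange packets, while each of them can still communicate with a full member.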

There are six methods for data transfer in the IB architecture.
These are:

1. Unreliable Datagram (unacknowledged - connectionless)

   The UD service is connectionless and unacknowledged.
   It allows the QP to communicate with any unreliable
   datagram QP on any node.

   The switches, and hence each link, can support only a
   certain MTU. The permitted MTU values are 256 bytes,
   512 bytes, 1024 bytes, 2048 bytes and 4096 bytes. A UD
   packet cannot be larger than the smallest link MTU
   between the two peers.

2. Reliable Datagram (acknowledged - multiplexed)

   The RD service is multiplexed over connections between
   nodes called end-to-end contexts (EECs), which allows
   each RD QP to communicate with any RD QP on any node
   with an established EEC. Multiple QPs can use the same
   EEC and a single QP can use multiple EECs (one for
   each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

   The RC service associates a local QP with one and only
   one remote QP. Messages may be as large as 2^31 bytes
   in length. The CA implementation takes care of
   segmentation and reassembly.

4. Unreliable Connected (unacknowledged - connection oriented)

   The UC service associates one local QP with one and
   only one remote QP. There is no acknowledgment and
   hence no resend of lost or corrupted packets. Such
   packets are therefore simply dropped. It is similar to
   RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

   The Ethertype raw datagram packet contains a generic
   transport header that is not interpreted by the CA but
   that specifies the protocol type. The values for
   ethertype are the same as defined in RFC1700 for
   ethertype.

6. Raw IPv6 (unacknowledged - connectionless)

   Using IPv6 raw datagram service, the IBA CA can
   support standard protocol layers atop IPv6 (such as
   TCP/UDP).
   Thus native IPv6 packets can be bridged into
   the IBA SAN and delivered directly to a port and to
   its IPv6 raw datagram QP.

The first four types are referred to as IB transports. The
latter two are classified as Raw datagrams. There is no
indication of the QP number in the raw datagram packets. The
raw datagram packets are limited by the link MTU in size.

The two connected modes and the reliable datagram mode may
also support 'Automatic Path Migration (APM)'. This is an
optional facility that provides for a hardware based path
failover. An alternate path is associated with the QP when the
connection/EE context is first created. If unrecoverable
errors are encountered the connection switches to using the
alternate path.

1.2.1 InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6
architecture in terms of the InfiniBand subnet structure and
global identifiers (GIDs).

The InfiniBand architecture defines the global identifier
associated with a port as follows:

   GID (Global Identifier): A 128-bit unicast or
   multicast identifier used to identify a port on a
   channel adapter, a port on a router, a switch, or a
   multicast group. A GID is a valid 128-bit IPv6
   address (per RFC 2373) with additional
   properties/restrictions defined within IBA to
   facilitate efficient discovery, communication, and
   routing.

   Note: These rules apply only to IBA operation and do
   not apply to raw IPv6 operation unless specifically
   called out.

The raw IPv6 operation referred to in the note
above is the IPv6 mode of InfiniBand's raw datagram
service. It does not mean IPv6 itself. The routers and
switches referred to in the above definition are the
InfiniBand routers and switches.

The InfiniBand (IB) specification defines two types of GIDs:
unicast and multicast.
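
Since a GID is the concatenation of a 64-bit IB subnet prefix and a 64-bit GUID, and is also a valid IPv6 address, the construction can be sketched with Python's standard ipaddress module. This is a minimal illustration only: the helper names are hypothetical, the GUID is a made-up value, and the prefix 0xFE80000000000000 is simply one prefix inside FE80/10 chosen for the example.

```python
import ipaddress
from typing import Tuple

# Assumed for illustration: a 64-bit subnet prefix inside FE80/10.
LINK_LOCAL_PREFIX = 0xFE80_0000_0000_0000
# Hypothetical GUID, not taken from any real device.
EXAMPLE_GUID = 0x0002_C903_0000_1234


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit subnet prefix and a 64-bit GUID into a 128-bit GID."""
    if not 0 <= subnet_prefix < 1 << 64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid < 1 << 64:
        raise ValueError("GUID must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


def split_gid(gid: ipaddress.IPv6Address) -> Tuple[int, int]:
    """Return (subnet prefix, GUID) for a GID."""
    value = int(gid)
    return value >> 64, value & ((1 << 64) - 1)


gid = make_gid(LINK_LOCAL_PREFIX, EXAMPLE_GUID)
```

Changing only the prefix argument yields a different GID for the same port, which is why the GUID, and not the GID, is the persistent identifier.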

1.2.1.1 Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes.
The IB specification states:

   a. link local: This is defined to be FE80/10.

      The IB routers will not forward packets with a
      link local address in source or destination
      beyond the IB subnet.

   b. site local: FEC0/10

      A unicast GID used within a collection of
      subnets which is unique within that collection
      (e.g. a data center or campus) but is not
      necessarily globally unique. IB routers must
      not forward any packets with either a
      site-local Source GID or a site-local
      Destination GID outside of the site.

   c. global:

      A unicast GID with a global prefix, i.e. an IB
      router may use this GID to route packets
      throughout an enterprise or internet.

1.2.1.2 Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses.
The IB specification defines the multicast GIDs as follows:

   FFxy:<112 bits>

Flag bits:

The nibble denoted by x above holds the 4 flag bits: 000T.

The first three bits are reserved and are set to zero. The
last bit is defined as follows:

   T=0: denotes a permanently assigned, i.e. well known, GID
   T=1: denotes a transient group

Scope bits:

The 4 bits denoted by y in the GID above are the scope
bits. These scope values are described in Table 1.

   scope value    Address value

   0              Reserved
   1              Unassigned
   2              Link-local
   3              Unassigned
   4              Unassigned
   5              Site-local
   6              Unassigned
   7              Unassigned
   8              Organization-local
   9              Unassigned
   0xA            Unassigned
   0xB            Unassigned
   0xC            Unassigned
   0xD            Unassigned
   0xE            Global
   0xF            Reserved

                  Table 1

The IB specification further refers to [RFC_2373] and
[RFC_2375] while defining the well known multicast addresses.
However, it then states that the well known addresses apply to
IB raw IPv6 datagrams only.
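
The FFxy layout, the T flag and the scope values of Table 1 can be decoded as in the following Python sketch. It is an illustration only: the function and type names are hypothetical, and rejecting GIDs whose reserved flag bits are set is a choice made for this example rather than a behaviour required by this document.

```python
import ipaddress
from typing import NamedTuple

# Assigned scope values from Table 1; every other non-reserved value is unassigned.
SCOPE_NAMES = {
    0x2: "Link-local",
    0x5: "Site-local",
    0x8: "Organization-local",
    0xE: "Global",
}
RESERVED_SCOPES = (0x0, 0xF)


class MgidInfo(NamedTuple):
    transient: bool   # T flag: True for a transient group, False for well known
    scope: int        # the 4 scope bits (y in FFxy)
    scope_name: str   # name of the scope value per Table 1


def parse_mgid(text: str) -> MgidInfo:
    """Decode the flag and scope nibbles of a multicast GID written as IPv6 text."""
    raw = ipaddress.IPv6Address(text).packed
    if raw[0] != 0xFF:
        raise ValueError("a multicast GID starts with the octet 0xFF")
    flags = raw[1] >> 4      # x in FFxy: 000T
    scope = raw[1] & 0x0F    # y in FFxy
    if flags & 0b1110:
        raise ValueError("the three reserved flag bits must be zero")
    if scope in RESERVED_SCOPES:
        name = "Reserved"
    else:
        name = SCOPE_NAMES.get(scope, "Unassigned")
    return MgidInfo(bool(flags & 0b0001), scope, name)
```

For example, parse_mgid("ff12::1") reports a transient, link-local group; the address itself is an arbitrary value used only for illustration.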
It must be noted though that a
multicast group can be associated with only a single MGID.
Thus the same MGID cannot be associated with the UD mode and
the raw datagram mode.

1.3 InfiniBand Multicast Group Management

IB multicast groups, identified by Multicast Global
Identifiers (MGIDs), are managed by the subnet manager (SM).
The SM explicitly programs the IB switches in the fabric to
ensure that the packets are received by all the members of the
multicast group that request the reception of packets. The SM
also needs to program the switches such that packets
transmitted to the group by any group member reach all
receivers in the multicast group.

IBA distinguishes between multicast senders and receivers.
Though all members of a multicast group can transmit to the
group (and expect their packets to be correctly forwarded), not
all members of the group are receivers. A port needs to
explicitly request that multicast packets addressed to the
group be forwarded to it.

A multicast group is created by sending a join request to the
SM. As will be explained later, IBA defines multiple modes for
joining a multicast group. The subnet manager records the
group's multicast GID and the associated characteristics. The
group characteristics are defined by the group path MTU,
whether the group will be used for raw datagrams or unreliable
datagrams, the service level, the partition key associated
with the group, the Local Identifier (LID) associated with the
group, etc. These characteristics are defined at the time of
the group creation. The interested reader may look up the
'MCMemberRecord' attribute in the IB architecture
specification [IB_ARCH] for the complete list of
characteristics that define a group.

A LID is associated with the multicast group by the subnet
manager (SM) at the time of the multicast group creation.
The
SM determines the multicast tree based on all the group
members and programs the relevant switches. The Multicast
LID (MLID) is used by the switches to route the packets.

Any IB port wanting to participate in the multicast
group must join the group. As part of the join operation the
port receives the group characteristics from the SM. At the
same time the subnet manager ensures that the requester can
indeed participate in the group by verifying that it can
support the group MTU and that it can reach the rest of the
group members. Other group characteristics may need
verification too.

The SM, for groups that span IB subnet boundaries, must
interact with IB routers to determine the presence of this
group in other IB subnets. If present, the MTU must match
across the IB subnets.

P_Key is another characteristic that must match across IB
subnets since the P_Key inserted into a packet is not modified
by the IB switches or IB routers. Thus if the P_Keys didn't
match, the IB routers themselves might drop the packets or
destinations on other subnets might drop the packets.

A join operation may cause the SM to reprogram the fabric so
that the new member can participate in the multicast group. By
the same token a leave may cause the SM to reprogram the
fabric to stop forwarding the packets to the requester.

1.3.1 Multicast Member Record

The multicast group is maintained by the SM with each of the
group members represented by an MCMemberRecord [IB_ARCH].
Some
of its components are:

   MGID      - Multicast GID for this multicast group
   PortGID   - Valid GID of the port joining this multicast group
   Q_Key     - Q_Key to be used by this multicast group
   MLID      - Multicast LID for this multicast group
   MTU       - MTU for this multicast group
   P_Key     - Partition key for this multicast group
   SL        - Service Level for this multicast group
   Scope     - Same as MGID address scope
   JoinState - Join/Leave status requested by the port:
                  bit 0: FullMember
                  bit 1: NonMember
                  bit 2: SendOnlyNonMember

1.3.1.1 JoinState

The JoinState indicates the membership qualities a port wishes
to add while joining/creating a group or delete when leaving a
group. The meanings of the JoinState bits are:

   FullMember:
      Messages destined for the group are routed to and from
      the port. A group may be deleted by the SM if there
      are no FullMembers in the group.

   NonMember:
      Messages destined for the group are routed to and from
      the port. The port is not considered a member for
      purposes of group creation/deletion.

   SendOnlyNonMember:
      Group messages are only routed from the port but not
      to the port. The port is not considered a member for
      purposes of group creation/deletion.

A port may have multiple bits set in its record. In such a case
the membership qualities are a union of the JoinStates. A port
may leave the multicast group for each of the JoinStates
individually or in any combination of JoinState
bits [IB_ARCH].

1.3.2 Join and Leave Operations

An IB port joins a multicast group by sending a join
request (SubnAdmSet() method) and leaves a multicast group by
sending a leave message (SubnAdmDelete() method) to the SM.
The IBA specification [IB_ARCH] describes the methods and
attributes to be used when sending these messages.

1.3.2.1 Creating a Multicast Group

There is no 'create' command to form a new multicast group.
The FullMember bit in the JoinState must be set to create a
multicast group. In other words, the first FullMember join
request will cause the group to be created as a side effect of
the join request. Subsequent join or leave requests may
contain any combination of the JoinState bits.

The creator of the group specifies the Q_Key, MTU, P_Key, SL,
FlowLabel, TClass and the Scope value. A creator may request
that a suitable MGID be created for it. Alternatively, the
request can specify the desired MGID. In both cases the MLID
is assigned by the SM.

Thus a group will be created with the specified values when
the requester sets the FullMember bit and no such group
already exists in the subnet.

1.3.2.3 Deleting a Multicast Group

When the last FullMember leaves the multicast group the SM may
delete the multicast group, releasing all resources associated
with the group, including those that might exist in the fabric
itself.

Note that a special 'delete' message does not exist. Deletion
is a side effect of the last FullMember 'leave' operation.

1.3.2.4 Multicast Group Create/Delete Traps

The SA may be requested by the ports to generate a report
whenever a multicast group is created or deleted. The port can
specify the multicast group it is interested in, i.e. use a
specific MGID, or use a wildcard request. The SA will report
these events using traps 66 (for creates) and 67 (for
deletes) [IB_ARCH].

Therefore, a port wishing to join a group but not create it by
itself may request a create notification, or a port might even
request a notification for all groups that are created (a
wildcarded request). The SA informs the requesters of the
creation using the aforementioned traps. The requester can
then join the multicast group indicated. Similarly, a
SendOnlyNonMember or a NonMember might request the SA to
inform it of group deletions.
The endnode, on receiving a
delete report, can safely release the resources associated
with the group. The associated MLID is no longer valid for the
group and may be reassigned to a new multicast group by the
SM.

2.0 Management of InfiniBand Subnet

To aid in the monitoring and configuration of InfiniBand
subnet components a set of MIBs needs to be defined. MIBs are
needed for the channel adapters, InfiniBand interfaces,
InfiniBand subnet manager, InfiniBand subnet management agents
and to allow the management of specific device properties. It
must be noted that the management objects addressed in the
IPoIB documents are for all of the IB subnet components and
are not limited to IP (over IB). The relevant MIBs are
described in separate documents and are not covered here.

3.0 IP over IB

As described in section 1.0, the InfiniBand architecture
provides a broad set of capabilities to choose from when
implementing IP over InfiniBand networks.

The IPoIB specification must not, and does not, require
changes in IP and higher layer protocols. Nor does it mandate
requirements on IP stacks to implement special user level
programs. It is an aim of the IPoIB specification that the
IPoIB changes be amenable to modularisation and incorporation
into existing implementations at the same level as other media
types.

3.1 InfiniBand as Datalink

InfiniBand architecture provides multiple methods of data
exchange between two endpoints, as was noted above. These are:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram : Raw IPv6 (R6)
                : Raw Ethertype (RE)

IPoIB can be implemented over any, multiple or all of these
services. A case can be made for support on any of the
transport methods depending on the desired features.

The IB specification requires Unreliable Datagram mode to be
supported by all the IB nodes. The host channel adapters (HCAs)
are specifically required to support Reliable Connected (RC)
and Unreliable Connected (UC) modes but the same is not the
case with target channel adapters (TCAs). Support for the two
Raw Datagram modes is entirely optional. The Raw Datagram modes
carry only a 16-bit CRC, whereas the other modes have the
better protection of a 32-bit CRC.

For the sake of simplicity, ease of implementation and
integration with existing stacks, it is desirable that the
fabric support multicasting. This is possible only in
Unreliable Datagram (UD) and IB's Raw Datagram modes.

Thus only the UD mode is universally supported, supports
multicast, and provides a robust CRC. Given these conditions
it is the obvious choice for IP over InfiniBand [IPOIB_MCAST,
IPOIB_ENCAP].

Future documents might consider the connected modes. In
contrast to the limited link MTU offered by UD mode, the
connected modes can offer significant benefit in terms of
performance by utilising a larger MTU. Reliability is also
enhanced if the underlying feature of automatic path migration
of connected modes is utilised.

3.2 Multicast Support

The InfiniBand specification makes support of multicasting in
the switches optional. Multicast, however, is a basic
requirement in IP networks. Therefore, IPoIB requires that
multicast capable InfiniBand fabrics be used to implement
IPoIB subnets.

3.2.1 Mapping IP Multicast to IB Multicast

Well known IP multicast groups are defined for both IPv4 and
IPv6 (RFC_1700, RFC_2373). Multicast groups may also be
dynamically created at any time.
To avoid creating unnecessary
duplicates of multicast packets in the fabric, and to avoid
unnecessary handling of such packets at the hosts, each of the
IP multicast groups needs to be associated with a different IB
multicast group as far as possible. A process is defined in
[IPOIB_MCAST] for mapping the IP multicast addresses to unique
IB multicast addresses.

3.2.2 Transient Flag in IB MGIDs

The IB specification describes the flag bits as discussed in
section 1.2.1.2. The IB specification also defines some well
known IB multicast GIDs (MGIDs). These MGIDs are reserved for
IB's Raw datagram mode, which is incompatible with the other
transports of IB. Any mapping that is defined from IP
multicast addresses therefore must not fall into IB's
definition of a well-known address.

Therefore all IPoIB related multicast GIDs always set the
transient bit.

3.3 IP Subnets Across IB Subnets?

Some implementations may wish to support multiple clusters of
machines in their own IB subnets but otherwise be part of a
common IP subnet. For such a solution the IB specification
needs multiple upgrades. Some of the required enhancements
are:

1) A method for creating IB multicast GIDs that span multiple
   IB subnets. The partition keys and other parameters need to
   be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology
   across IB subnets.

3) A definition of the process and protocols needed between
   IB nodes and IB routers.

Until the above conditions are met it is not possible to
implement IPoIB subnets that span IB subnets. The IPoIB
standards have however been defined with this possibility in
mind.

4.0 IP Subnets in InfiniBand Fabrics

The IPoIB subnet is overlaid over the IB subnet.
The IPoIB 722 subnet is brought up in the following steps: 724 Note: the join/leave operations at the IP level will be 725 referred to as IP_join/IP_leave and the join/leave 726 operations at the IB level will be referred to as 727 IB_join/IB_leave in this document. 729 1. The all-IPoIB nodes IB multicast group is created 731 The fabric administrator creates the IB multicast group 732 corresponding to the all-IPv6 nodes/IPv4 broadcast address (henceforth 733 called 'broadcast group') when the IPv6/IPv4 subnet is set up. 734 The 'broadcast group' mapping from the all-IPv6 nodes and IPv4 735 broadcast addresses is defined in [IPOIB_MCAST]. 737 The method by which the broadcast group is set up is not 738 defined by IPoIB. The group may be set up at the SM by the 739 administrator or by the first IB_join. 741 As noted earlier, at the time of creating an IB multicast 742 group, multiple values such as the P_Key, Q_Key, Service 743 Level, Hop Limit, Flow ID, TClass, MTU, etc., have to be 744 specified. These values should be such that all potential 745 members of the IB multicast group are able to communicate 746 with one another when using them. In the future, as the IB 747 specification associates more meaning with the various 748 parameters and defines IB QoS, different values for IP 749 multicast traffic may be possible. All unicast packets also 750 need to use the P_Key and Q_Key specified in the broadcast 751 group [IPOIB_ENCAP]. A well thought-out 752 configuration is therefore required for a successful setup of the IPoIB 753 subnet. 755 2. All IPoIB interfaces IB_join the broadcast group 757 The broadcast group defines the span and the members of the 758 IPoIB link. This link gets built up as IPoIB nodes IB_join the 759 broadcast group. 761 The IB_join to the broadcast group has the additional benefit 762 of distributing the above-mentioned multicast group parameters 763 to all the members of the subnet. 765 Note that this IB_join to the broadcast group is a FullMember 766 join.
If any of the ports or the switches linking the port to 767 the rest of the IPoIB subnet cannot support the 768 parameters (e.g. path MTU or P_Key) associated with the 769 broadcast group, then the IB_join request will fail and the 770 requesting port will not become part of the IPoIB subnet. 772 3. Configuration Parameters 774 As noted above, parameters such as the Q_Key and Path MTU, needed 775 for all IPoIB communication, are returned to the IPoIB node on 776 IB_joining the 'broadcast group'. [IPOIB_MCAST] also notes 777 that the parameters used in the broadcast group are used when 778 creating other multicast groups. 780 However, the P_Key must still be known to the IPoIB endnode 781 before it can join the broadcast group. The P_Key is included 782 in the mapping of the broadcast group [IPOIB_MCAST]. Another 783 parameter, the scope of the broadcast group, also needs to be 784 known to the endnode before it can join the broadcast group. 786 How the P_Key and the scope 787 bits related to the IPoIB subnet are determined is an 788 implementation choice. These could be configuration parameters 789 initialised by some means by the administrator. 791 The methods employed by an implementation to determine the 792 P_Key and scope bits are not specified by IPoIB. 794 4.1 IPoIB VLANs 796 The endpoints in an IB subnet must have compatible P_Keys to 797 communicate with one another. Thus the administrator, when 798 setting up an IP subnet over an IB subnet, must ensure that all 799 the members have compatible P_Keys. An IP subnet can have only 800 one P_Key associated with it to ensure that all IP nodes in it 801 can talk to one another. An endpoint may, however, have multiple 802 P_Keys. 804 The IB architecture specifies that there can be only one MGID 805 associated with a multicast group in the IB subnet. The P_Key 806 is included in the MGID mappings from the IP multicast 807 addresses [IPOIB_MCAST].
Since the P_Key is unique in the IB 808 subnet, the inclusion of the P_Key in the IB MGIDs ensures that 809 unique MGID mappings are created. Every unique broadcast group 810 MGID so formed creates a separate abstract IPoIB link and 811 hence an IPoIB VLAN. 813 4.2 Multicast in IPoIB subnets 815 IP multicast on InfiniBand subnets follows the same concepts 816 and rules as on any other media. However, unlike most other 817 media, multicast over InfiniBand requires interaction with 818 another entity, the IB subnet manager. This section describes 819 the outline of the process and suggests some guidelines. 821 The IB architecture specifies the following format for IB 822 multicast packets when used over Unreliable Datagram (UD) 823 mode:

825 +--------+-------+---------+---------+-------+---------+---------+
826 |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
827 |Routing |Routing|Transport|Extended |Payload|   CRC   |   CRC   |
828 |Header  |Header |Header   |Transport| (IP)  |         |         |
829 |        |       |         |Header   |       |         |         |
830 +--------+-------+---------+---------+-------+---------+---------+

832 For details about the various headers please refer to the 833 InfiniBand Architecture Specification [IB_ARCH]. 835 The Global routing header (GRH) includes the IB multicast 836 group GID. The Local routing header (LRH) includes the local 837 identifier (LID). The IB switches in the fabric route the 838 packet based on the LID. 840 The GID is made available to the receiving IB user (the IPoIB 841 interface driver, for example). The driver can therefore 842 determine the IB group the packet belongs to. 844 IPv4 defines three levels of multicast compliance. These are: 846 Level 0: No support for IP multicasting 848 Level 1: Support for sending but not receiving multicasts 850 Level 2: Full support for IP multicasting 852 In IPv6 there is no such distinction. Full multicast support 853 is mandatory. Additionally, all IPv4 subnets support 854 broadcast (255.255.255.255).
IPv4 broadcast can always be 855 sent/received by all IPv4 interfaces. 857 Every IPoIB subnet requires the broadcast GID to be defined. 858 Thus a packet can always be broadcast. 860 4.2.1 Sending IP Multicast Datagrams 862 An IP host may send a multicast packet at any time to any 863 multicast address. 865 The IP layer conveys the multicast packet to the IPoIB 866 interface driver/module. This module attempts to IB_join the 867 relevant IB multicast group. This is required since otherwise 868 the InfiniBand architecture does not guarantee that the packet 869 will reach its destinations. 871 A pure sender may choose to join the multicast group as a 872 FullMember. In such a case the sender will also receive the 873 multicast packets transmitted to the group. Additionally, the IB group will 874 not be deleted until the sender leaves the group. 876 Alternatively, a sender might IB_join as a SendOnlyNonMember. 877 In such a case the packets are not routed to the sender, though 878 packets transmitted by it can reach the other group members. 879 Additionally, the group can be deleted when all FullMembers 880 have left the group. The sender can further request delete 881 updates from the SM. 883 If the sender does not find the group in existence, it is 884 recommended in [IPOIB_MCAST] that the packets be sent to the 885 MGID corresponding to the all-IP routers address. A sender 886 could also send the packets to the broadcast group. The 887 sender might also choose to request 'creation' reports from 888 the SM. 890 4.2.2 Receiving Multicast Packets 892 The IP host must IB_join the IB multicast group corresponding to 893 the IP address. This follows from the IBA requirement that the 894 receiver must join the relevant IB multicast group. The group 895 is automatically created if it does not exist [IB_ARCH]. 897 The IP receivers must IB_leave the IB group when the IP layer 898 stops listening on the corresponding IP address. The SM can 899 then choose to delete the group.
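The join behaviour described in sections 4.2.1 and 4.2.2 can be illustrated with a small model. The following Python sketch is not part of any IPoIB or IBA interface: the names ToySubnetManager, JoinMode and choose_destination are hypothetical, the SM is reduced to an in-memory table, and the model assumes that only a FullMember join creates a group and that the SM deletes a group as soon as its last FullMember leaves (the text above only says that the SM can choose to do so).

```python
from enum import Enum


class JoinMode(Enum):
    FULL_MEMBER = "FullMember"
    SEND_ONLY_NON_MEMBER = "SendOnlyNonMember"


class ToySubnetManager:
    """Hypothetical in-memory stand-in for the SM's multicast group table."""

    def __init__(self):
        # Maps an MGID to a {port: JoinMode} membership table.
        self.groups = {}

    def ib_join(self, mgid, port, mode):
        """Return True if the join succeeds, False if the group is absent."""
        if mgid not in self.groups:
            if mode is not JoinMode.FULL_MEMBER:
                # Assumption: a pure sender does not create the group.
                return False
            self.groups[mgid] = {}
        self.groups[mgid][port] = mode
        return True

    def ib_leave(self, mgid, port):
        """Remove a port; drop the group once no FullMember remains."""
        members = self.groups.get(mgid)
        if members is None:
            return
        members.pop(port, None)
        # Assumption: the SM deletes eagerly; a real SM may defer this.
        if JoinMode.FULL_MEMBER not in members.values():
            del self.groups[mgid]

    def receivers(self, mgid):
        """Only FullMembers have multicast packets routed to them."""
        members = self.groups.get(mgid, {})
        return sorted(p for p, m in members.items()
                      if m is JoinMode.FULL_MEMBER)


def choose_destination(sm, mgid, port, fallback_mgid):
    """Pick the MGID a pure sender transmits to.

    If the group does not exist, fall back to another MGID, for example
    the all-IP routers MGID or the broadcast group.
    """
    if sm.ib_join(mgid, port, JoinMode.SEND_ONLY_NON_MEMBER):
        return mgid
    return fallback_mgid
```

In this model a sender that joins as SendOnlyNonMember never appears in receivers() and never keeps a group alive, which mirrors the distinction drawn above between the two join modes.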
901 4.2.3 Router considerations for IPoIB 903 IP routers know of the new IP groups created in the subnet by 904 the use of protocols such as IGMP/MLD. However, this is not 905 enough for IPoIB, since the router needs to IB_join the 906 relevant IB groups to be able to receive and transmit the 907 packets. There is no promiscuous mode for listening to all 908 packets. 910 The IPoIB routers therefore need to request that the SM report 911 all creations of IB groups in the fabric. The IPoIB router can 912 then IB_join the reported group. It is not desirable that the 913 router's IB_joining of a multicast group be considered the 914 same as the IB_join from a receiver - the router's IB_join 915 should not prevent the group's deletion when all receivers 916 leave. To handle just this type of situation, IBA provides 917 the NonMember IB_join mode. 919 The NonMember IB_join mode can be used by IP routers when they 920 join in response to the create reports. A router should 921 ideally request the delete reports too so that it can release 922 all the resources associated with the group. The MLID 923 associated with a deleted MGID can be reassigned by the SM and 924 therefore there is a possibility of erroneous transmissions if 925 the MLID is cached. A router that does not request delete 926 reports will still work correctly since it will receive the 927 correct MLID, and purge any old cached value, when it 928 IB_joins the IB group in response to a create report. 930 It is reasonable for a router to IB_join as a FullMember if it 931 is joining the IB group in response to an application/routing 932 daemon request. In such a case the router might end up 933 controlling the existence of the IB group (since it is a 934 FullMember of the group). 936 4.2.4 Impact of InfiniBand Architecture Limits 938 An HCA or TCA may have a limit on the number of MGIDs it can 939 support.
Thus, even though the groups may not be limited at 940 the subnet manager and in the subnet as such, they may be 941 limited at a particular interface. It is advisable to choose 942 an adequately provisioned HCA/TCA when setting up an IPoIB 943 subnet. 945 4.2.5 Leaving/Deleting a Multicast Group 947 An IPv4 sender (level 1 compliance) IB_joins the IB multicast 948 group only because that is the only way to guarantee reception 949 of the packets by all the group recipients. The sender must, 950 however, IB_leave the group at some time. A sender could, when 951 not a receiver on the group, start a timer for each multicast group 952 it sends to. The sender leaves the IB group when the timer goes 953 off. It restarts the timer if another message is sent. 955 This suggestion doesn't apply to the IB broadcast group. It 956 also doesn't apply to the IB group corresponding to the 957 all-hosts multicast group. An IPv4 host must always remain a 958 member of the broadcast group. 960 An IP multicast receiver IB_leaves the corresponding IB 961 multicast group when it IP_leaves the IP multicast group. In 962 the case of an IPv4 implementation, the receiver may choose to 963 continue to be a sender (level 1 compliance), in which case it 964 may choose not to IB_leave the IB group but to start a timer as 965 explained above. 967 As noted elsewhere, the SM can choose to free up the 968 resources (e.g. routing entries in the switches) associated 969 with the IB group when the last FullMember IB_leaves the group. 970 The MLID therefore becomes invalid for the group. The MLID can 971 be reassigned when a new group is created. 973 SendOnlyNonMember/NonMember ports caching the MLID need to 974 avoid using such a stale MLID. The way out is for them to request 975 group delete reports. An IP router requesting reports for all 976 groups need not request the delete report since an IB_join in 977 response to a create report will return the new MLID 978 association to it.
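The per-group sender timer suggested at the start of this section can be sketched as follows. This is an illustration only, written in Python: the class SenderLeaveTimers, its method names and the idle_timeout parameter are hypothetical and appear in no IPoIB document, and the timeout value is left to the implementation. Timestamps are passed in explicitly so that the logic does not depend on any particular clock.

```python
class SenderLeaveTimers:
    """Tracks idle time per IB multicast group for a pure sender.

    Hypothetical helper: the caller supplies timestamps and performs
    the actual IB_leave for every group returned by expire().
    """

    def __init__(self, idle_timeout, exempt=()):
        self.idle_timeout = idle_timeout
        # Groups that are never left on a timer, such as the broadcast
        # group and the group mapped from the all-hosts address.
        self.exempt = frozenset(exempt)
        # Maps an MGID to the time at which its timer goes off.
        self.deadlines = {}

    def note_send(self, mgid, now):
        """Start the timer on a first send, restart it on later sends."""
        if mgid in self.exempt:
            return
        self.deadlines[mgid] = now + self.idle_timeout

    def expire(self, now):
        """Return the groups whose timers have gone off and forget them."""
        due = sorted(g for g, t in self.deadlines.items() if t <= now)
        for g in due:
            del self.deadlines[g]
        return due
```

A driver might call note_send() on every multicast transmission and call expire() periodically with a monotonic clock reading, issuing an IB_leave for each MGID returned.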
980 A router might prefer to IB_leave the IB multicast group when 981 there are no members of the IP multicast group in the subnet 982 and it has no explicit knowledge of any need to forward such 983 packets. 985 4.3 Transmission of IPoIB packets 987 The encapsulation of IP packets in InfiniBand is described 988 in [IPOIB_ENCAP]. 990 It specifies the use of an 'Ethertype' value [IANA] in all 991 IPoIB communication packets. The link-layer address 992 consists of the Global Identifier (GID) and the Queue Pair 993 Number (QPN) [IPOIB_ENCAP]. 995 To allow for IPoIB subnets that span multiple IB subnets, the 996 specification utilises the Global Identifier (GID) as part of 997 the link-layer address. Since all packets in IB have to use 998 the Local Identifier (LID), the address resolution process has 999 the additional step of resolving the destination GID, returned 1000 in response to an ARP/ND request, to the LID [IPOIB_ENCAP]. This 1001 phase of address resolution might also be used to determine 1002 other essential parameters (e.g. the SL, path rate, etc.) for 1003 successful IB communication between two peers. 1005 As noted earlier, all communication in the IPoIB subnet 1006 derives the Q_Key to use from the Q_Key specified in the 1007 broadcast group. 1009 4.4 RARP and Static ARP entries 1011 RARP entries or static ARP entries are based on invariant 1012 link-layer addresses. In the case of IPoIB, the link-layer address 1013 includes the QPN, which might not be constant across reboots or 1014 even across network interface resets. Therefore, static ARP 1015 entries or RARP server entries will only work if the 1016 implementation(s) using these options can ensure that the QPN 1017 associated with an interface is invariant across 1018 reboots/network resets [IPOIB_ENCAP]. 1020 4.5 DHCPv4 and IPoIB 1022 DHCPv4 [RFC_2131] utilises a 'client hardware address' (chaddr) field 1023 (expected to hold the link-layer address) of 16 bytes. The 1024 address in the case of IPoIB is 20 bytes.
To get around this 1025 problem, IPoIB specifies [IPOIB_DHCP] that the 'broadcast flag' 1026 be used by the client when requesting an IP address. 1028 5.0 QoS and Related Issues 1030 The IB specification suggests the use of service levels (SLs) for 1031 load balancing, QoS and deadlock avoidance within an IB 1032 subnet. However, the IB specification leaves the usage and mode of 1033 determination of the SL for the application to decide. The SL 1034 and list of SLs are available in the SA, but it is up to the 1035 endnode's application to choose the 'right' value. 1037 Every IPoIB implementation will determine the relevant SL 1038 value based on its own policy. No method or process for 1039 choosing the SL has been defined by the IPoIB standards. 1041 6.0 Security Considerations 1043 This document describes the IB architecture as relevant to 1044 IPoIB. It further restates issues specified in other 1045 documents. It does not itself specify any requirements. There 1046 are no security issues introduced by this document. 1047 IPoIB-related security issues are described in 1048 [IPOIB_MCAST], [IPOIB_ENCAP] and [IPOIB_DHCP]. 1050 7.0 Acknowledgements 1052 This document has benefited from the comments and suggestions 1053 of the members of the IPoIB working group and the members of 1054 the InfiniBand(SM) Trade Association.
1056 8.0 References 1058 [IB_ARCH] InfiniBand Architecture Specification, Volume 1.1 1060 [RFC_2373] IP Version 6 Addressing Architecture 1062 [RFC_2375] IPv6 Multicast Address Assignments 1064 [RFC_1700] Assigned Numbers 1066 [RFC_1112] Host extensions for IP multicasting 1068 [RFC_2236] Internet Group Management Protocol, Version 2 1070 [RFC_2710] Multicast Listener Discovery 1072 [IPOIB_MCAST] draft-ietf-ipoib-link-multicast-03.txt 1074 [IPOIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-04.txt 1076 [IPOIB_DHCP] draft-ietf-ipoib-dhcp-over-infiniband-05.txt 1078 9.0 Author's Address 1080 Vivek Kashyap 1082 IBM 1083 15450, SW Koll Parkway 1084 Beaverton, OR 97006 1086 Phone: +1 503 578 3422 1087 Email: vivk@us.ibm.com 1089 Full Copyright Statement 1091 Copyright (C) The Internet Society (2001). All Rights Reserved. 1093 This document and translations of it may be copied and 1094 furnished to others, and derivative works that comment on or 1095 otherwise explain it or assist in its implementation may be 1096 prepared, copied, published and distributed, in whole or in 1097 part, without restriction of any kind, provided that the above 1098 copyright notice and this paragraph are included on all such 1099 copies and derivative works. However, this document itself may 1100 not be modified in any way, such as by removing the copyright 1101 notice or references to the Internet Society or other Internet 1102 organizations, except as needed for the purpose of developing 1103 Internet standards in which case the procedures for copyrights 1104 defined in the Internet Standards process must be followed, or 1105 as required to translate it into languages other than 1106 English. 1108 The limited permissions granted above are perpetual and will 1109 not be revoked by the Internet Society or its successors or 1110 assigns. 
1112 This document and the information contained herein is provided 1113 on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 1114 ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 1115 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE 1116 USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR 1117 ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A 1118 PARTICULAR PURPOSE.