idnits 2.17.1

draft-kashyap-ipoib-connected-mode-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 10
     longer pages, the longest (page 5) being 76 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 10 pages

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 310 has weird spacing: '...refixed by a ...'
  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'MUST not' in this paragraph:

     Both the RC and UC flags MUST not be set at the same time. They are
     mutually exclusive.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 2004) is 7253 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

     No issues found here.

     Summary: 5 errors (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

INTERNET DRAFT                                              V. Kashyap
                                                                   IBM
Expiration Date: December 2004                               June 2004

                 IP over InfiniBand: Connected Mode

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as Reference
   material or to cite them other than as ``work in progress''.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This memo provides information for the Internet community.  This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   This document specifies a method for transmitting IPv4/IPv6
   packets and address resolution over the connected modes of
   InfiniBand.

Table of Contents

   1.0 Introduction
   2.0 IPoIB-connected mode
     2.1 Multicasting
     2.2 Outline of Address Resolution
     2.3 Outline of Connection Setup
   3.0 Address Resolution
     3.1 Link-layer Address
     3.2 IB Connection Setup
     3.3 Service-ID
   4.0 Frame Format
   5.0 Maximum Transmission Unit
     5.1 Per-Connection MTU
   6.0 Security Considerations
   7.0 References

1.0 Introduction

   The InfiniBand specification [IB_ARCH] can be found at
   www.infinibandta.org.  The document [IPoIB_ARCH] provides a
   short overview of the InfiniBand architecture along with
   considerations for specifying IP over InfiniBand networks.

   The InfiniBand architecture (IBA) defines multiple modes of
   transport.  Of these, the unreliable datagram (UD) transport
   method best matches the needs of IP.  IP over InfiniBand (IPoIB)
   over UD is described in [IPoIB_ENCAP].  This document describes
   IP transmission over the connected modes of IBA.

   IBA defines two connected modes:

      1. Reliable Connected (RC)
      2.
         Unreliable Connected (UC)

   As is evident from the nomenclature, the two modes differ mainly
   in providing reliability of data delivery across the connection.
   This document applies equally to both connected modes.  IPoIB
   over these two modes is referred to as IPoIB-CM (connected mode)
   in this document.  For clarity, IPoIB over the unreliable
   datagram mode, as described in [IPoIB_ENCAP], is referred to as
   IPoIB-UD.

   IBA requires that all Host Channel Adapters (HCAs) support the
   reliable and unreliable connected modes [IB_ARCH].  It is
   optional for Target Channel Adapters (TCAs) to support the
   connected modes.

   The connected modes offer link MTUs of up to 2^31 octets in
   length.  Thus the use of connected modes can offer significant
   benefits by supporting reasonably large MTUs.  By contrast, the
   datagram modes of IBA are limited to 4096 octets.

   Reliability is also enhanced if the underlying feature of
   "automatic path migration" supported by the connected modes is
   utilized.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described
   in RFC 2119.

2.0 IPoIB-connected mode

   This document refers extensively to [IPoIB_ENCAP] and extends the
   IPoIB description given there to IPoIB-CM.  Therefore, only the
   additional requirements or enhancements needed to enable IPoIB-CM
   are described.

   The IP encapsulation, default MTU, link layer address format and
   the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM
   exactly as described in [IPoIB_ENCAP].

2.1 Multicasting

   The connected modes of IBA define a non-broadcast, multiple
   access network.  The connected modes of IBA do not support
   multicasting, though every node can communicate with every other
   node if desired.

   This requires that multicasting be emulated in some form by the
   network.  However, in the case of an InfiniBand network, instead
   of an emulation, an unreliable datagram (UD) queue pair (QP)
   can be used to support multicasting while the connected mode QP
   is used for unicast traffic.  Since IBA requires all channel
   adapters to support the UD mode, every implementation supporting
   IPoIB-CM will also be able to utilize UD QPs.

   Multicast mapping, transmission and reception of multicast
   packets, and multicast routing all take place over the UD QP
   associated with the IPoIB-CM interface, in accordance with
   [IPoIB_ENCAP].

2.2 Outline of Address Resolution

   Every IPoIB-CM interface MUST have two QPs associated with it:

      1) A connected mode QP
      2) An unreliable datagram mode QP

   [IPoIB_ENCAP] proposes that the address resolution query is
   multicast over an IB multicast address that is joined by every
   member of the IPoIB subnet.  This IB multicast address is
   referred to as the "broadcast-GID" [IPoIB_ENCAP].  This document
   extends the requirement of joining the "broadcast-GID" to
   IPoIB-CM by requiring every IPoIB-CM interface to "FullMember"
   join the broadcast-GID using the associated UD QP.

   A broadcast-GID is formed from the scope bits, the IP version,
   and the partition key (P_Key) associated with the subnet.  Thus
   these three parameters must be known to the node before an IPoIB
   interface can be brought up.  The exact format and rules to set
   up the broadcast-GID are defined in [IPoIB_ENCAP].

2.3 Outline of Connection Setup

   Once the link address of the remote node is known, an IB
   connection must be set up between the nodes before any IP
   communication may occur.

   To make a connection, the sender must know the service-ID to use
   in the connection request [IB_ARCH].  It must also supply its
   connected mode queue pair to the remote node.
   The peer replies with its queue pair.  Each IB connection is peer
   to peer and uses one connected mode QP at each end.

   Though the address resolution occurs at an individual IP address
   level, the connection between the nodes is at the IB layer.
   Therefore not every address resolution implies a new connection
   between the peers.

3.0 Address Resolution

   Address resolution queries are sent out on the "broadcast-GID"
   over the UD QP associated with the IPoIB-CM interface.  A unicast
   reply is received on the UD QP associated with the IPoIB-CM
   interface.

   An IPoIB-CM implementation MAY use the same UD QP as used by the
   IPoIB-UD implementation if the latter mode is supported in the
   same partition and scope.

3.1 Link-layer Address

   IPoIB encapsulation [IPoIB_ENCAP] describes the link-layer
   address as follows:

      <1 octet reserved>:QP:GID

   This document extends the link-layer address as follows:

      <Flags>:QPN:GID

   Flags:

      This is a single octet field.  If bit 0 is set then it
      implies that, in the sender's view, the subnet is built
      over IB's "reliable connected" i.e. RC mode.  If bit 1 is
      set then it implies that the subnet is built over IB's
      "unreliable connected" i.e. UC mode.  All other bits in
      the octet are reserved and MUST be set to 0.

      If IPoIB-CM is not supported i.e. if the implementation
      only supports IPoIB-UD, then the implementation MUST
      ignore the <Flags> on reception.  It MUST set the
      octet to all zeroes as specified in [IPoIB_ENCAP].

      The RC and UC flags MUST NOT both be set at the same
      time.  They are mutually exclusive.

      The format of the flags is:

         +--+--+--+--+--+--+--+--+
         |RC|UC| 0| 0| 0| 0| 0| 0|
         +--+--+--+--+--+--+--+--+

      Note:
         The above implies that a given IP subnet can only be
         supported on one of the InfiniBand modes at any
         time.
         If the link-layer address includes no flags then it is
         part of an IPoIB-UD subnet; if it includes the RC flag
         then it is part of an IPoIB-RC subnet; if it includes
         the UC flag then it is part of an IPoIB-UC subnet.

   QPN:

      The queue-pair number (QPN) on which the unicast address
      resolution reply will be received.  This allows the
      IPoIB-UD address resolution code and method to be used
      for IPoIB-CM address resolution.

      The QPN also serves another purpose.  It is used to form
      the Service-ID that is used to set up the IB connection.

   On receiving the multicast/broadcast address resolution request,
   the receiver replies with its own link-layer address, including
   the associated UD QPN and the appropriate flag.  If the flags do
   not match then there is a misconfiguration, since the underlying
   IB modes do not match.  In such a case a suitable error
   indication SHOULD be provided to the administrator.

   The receiver's reply is unicast back to the sender after the
   receiver has, as in the case of IPoIB over unreliable datagram
   (IPoIB-UD), resolved the GID to the LID and determined other
   required parameters [IPoIB_ENCAP].

   Once the address resolution is completed, the underlying IB
   connection can be set up.

3.2 IB Connection Setup

   The IB reliable/unreliable mode connection may be set up by
   either of the peers, though it is more likely that the one that
   initiated the address resolution phase, probably as a result of
   the need to send IP data, will initiate the connection setup.
   IBA allows passive-active and active-active connection setup.

   To set up a connection, IB Management Datagrams (MADs) are
   directed to the peer's communication manager (CM).  The
   connection request always contains a Service-ID for the peer to
   associate the request with the appropriate entity.  If the
   request is accepted, the peer returns the relevant connected mode
   QPN in the response MAD.
   The format of the CM connection messages and the IB connection
   setup process are described in [IB_ARCH].

   The CM messages include, among other parameters, the Service-ID,
   Local QPN, and the payload size to use over the connection.

   Note:
      The IB connection is set up using the Service-ID as defined
      in Section 3.3.  The node MUST keep a record of the IB
      connections it is participating in.  The node SHOULD NOT
      attempt another connection to the remote peer using the same
      Service-ID as used for an existing IB connection.

3.3 Service-ID

   The InfiniBand specification defines a block of service IDs for
   IETF use.  The InfiniBand specification has left the definition
   and management of this block to the IETF [IB_ARCH].  The 64-bit
   block is:

 +--------+--------+--------+--------+-------+--------+--------+------+
 |00000001|<-------------------IETF use------------------------------>|
 +--------+--------+--------+--------+-------+--------+--------+------+

   The Service-IDs used by IPoIB will be in the format:

 +--------+--------+--------+--------+-------+-------+--------+-------+
 |00000001|  Type  |Reserved|          QPN           |    Reserved    |
 +--------+--------+--------+--------+-------+-------+--------+-------+

   The Reserved fields MUST be transmitted as zeroes.  They are
   ignored on reception.

   The QPN MUST be that of the UD QP exchanged during address
   resolution.

   The Type MUST be set to 0.

   The service-ID formed using the UD QPN used for address
   resolution MUST be supported by the associated interface.

4.0 Frame Format

   All IP and ARP datagrams transported over InfiniBand are
   prefixed by a 4-octet encapsulation header as described in
   [IPoIB_ENCAP].
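
   As a rough illustration of this 4-octet encapsulation header, the
   following Python sketch packs and unpacks it.  It assumes a 16-bit
   type field followed by a 16-bit reserved field, both in network
   byte order, and it assumes the reserved field is sent as zero.
   The function and constant names are hypothetical and are not part
   of this specification; the type values are the ones this section
   lists for the type field.

```python
import struct
from typing import Tuple

# Type values listed for the encapsulation header's type field.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ETHERTYPE_IPV6 = 0x86DD

# Assumed layout: 16-bit type, then 16-bit reserved, network byte order.
_HEADER = struct.Struct("!HH")


def encapsulate(ethertype: int, payload: bytes) -> bytes:
    """Prefix payload with the 4-octet header (reserved assumed zero)."""
    return _HEADER.pack(ethertype, 0) + payload


def decapsulate(frame: bytes) -> Tuple[int, bytes]:
    """Split a frame into (type, payload); the reserved field is not checked."""
    if len(frame) < _HEADER.size:
        raise ValueError("frame shorter than the 4-octet encapsulation header")
    ethertype, _reserved = _HEADER.unpack_from(frame)
    return ethertype, frame[_HEADER.size:]
```

   For example, encapsulate(ETHERTYPE_IPV6, packet) yields a frame
   whose first two octets are 0x86 0xDD, and decapsulate() on the
   receiving side recovers the type and the original packet.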

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   |         Type                  |       Reserved                |
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The type field SHALL indicate the encapsulated protocol as per
   the following table.

      +----------+-------------+
      |   Type   |   Protocol  |
      |------------------------|
      |   0x800  |    IPv4     |
      |------------------------|
      |   0x806  |    ARP      |
      |------------------------|
      |   0x8035 |    RARP     |
      |------------------------|
      |   0x86DD |    IPv6     |
      +------------------------+

   These values are taken from the "ETHER TYPE" numbers assigned by
   the Internet Assigned Numbers Authority (IANA).  Other network
   protocols, identified by different values of "ETHER TYPE", may
   use the encapsulation format defined herein, but such use is
   outside of the scope of this document.

5.0 Maximum Transmission Unit

   An IB connection, once set up, might be used for both IPv4 and
   IPv6, or it could be used for only one of them while a different
   connection is used for the other.  The link MTU MUST be able to
   support the minimum MTU required by the protocols.

   The default MTU of the IPoIB-CM interface is 2044 octets, i.e.
   the 2048-octet IPoIB-link MTU minus the 4-octet encapsulation
   header.

   The connected modes of InfiniBand allow message sizes up to 2^31
   octets.  Therefore, IPoIB-CM can use a much larger MTU for
   unicast communication between any two endpoints.  At the same
   time, the maximum and/or optimal payload that can be received or
   sent over an InfiniBand connection is dependent on the
   implementation, the HCA and the resources configured.

   An implementation MAY utilize the following mechanism to
   request/accept MTUs across an IB connection.

5.1 Per-Connection MTU

   Every IB connection setup message includes a "private data"
   field [IB_ARCH].
   The private data field MUST carry the following
   information:

          0             15
         +----------------+
         |  Desired MTU   |
         +----------------+
         |  Minimum MTU   |
         +----------------+

   The connection setup message (CM REQ) MUST carry the requested
   MTU in the "Desired MTU" field and the minimum acceptable MTU in
   the "Minimum MTU" field.  The "Minimum MTU" value SHOULD NOT be
   less than the MTU set for multicast communication, i.e. the MTU
   received on "FullMember" join of the broadcast-GID on the
   associated UD QP.  The "Desired" and "Minimum" MTUs may be set to
   the same value.

   If the "Desired MTU" is not acceptable to the peer then it MUST
   indicate its preferred value in the "Desired MTU" field when
   rejecting (CM REJ) the request.  If the "Desired MTU" is lower
   than the minimum MTU that can be supported, the connection MUST
   be rejected (CM REJ message) with the minimum acceptable MTU set
   in both the desired and minimum MTU fields.

   It is up to the implementation to utilize this mechanism for
   setting the per IB connection MTU.  The IPoIB interface must
   account for the 4-octet encapsulation header, and so the IPoIB
   MTU over the connection will be smaller by that amount.

6.0 Security Considerations

   A node may be returned a false set of flags by an impostor.  This
   may cause unnecessary connection attempts and some
   delay/disruption in IPoIB communication.  The same is the case if
   wrong/spurious QPN values are provided during address resolution
   broadcast/multicast.

7.0 References

   [IB_ARCH]     InfiniBand Architecture Specification, version 1.1
                 www.infinibandta.org

   [IPoIB_ARCH]  draft-ietf-ipoib-architecture-04.txt, V. Kashyap

   [IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-06.txt,
                 H.K. Jerry Chu, V.
                 Kashyap

Author's Address

   Vivek Kashyap

   15350, SW Koll Parkway
   Beaverton, OR 97006

   Phone: +1 503 578 3422
   Email: vivk@us.ibm.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it or
   assist in its implementation may be prepared, copied, published and
   distributed, in whole or in part, without restriction of any kind,
   provided that the above copyright notice and this paragraph are included
   on all such copies and derivative works.  However, this document itself
   may not be modified in any way, such as by removing the copyright notice
   or references to the Internet Society or other Internet organizations,
   except as needed for the purpose of developing Internet standards in
   which case the procedures for copyrights defined in the Internet
   Standards process must be followed, or as required to translate it into
   languages other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an "AS
   IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
   FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
   LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT
   INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
   FITNESS FOR A PARTICULAR PURPOSE.