Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Updates: 8166 (if approved)                             November 8, 2019
Intended status: Standards Track
Expires: May 11, 2020

    RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
                 draft-ietf-nfsv4-rpcrdma-cm-pvt-data-05

Abstract

   This document specifies the format of RDMA-CM Private Data exchanged
   between RPC-over-RDMA version 1 peers as part of establishing a
   connection.
   The addition of the private data payload specified in this document
   is an optional extension that does not alter the RPC-over-RDMA
   version 1 protocol.  This document updates RFC 8166.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 11, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Advertised Transport Properties
     3.1.  Inline Threshold Size
     3.2.  Remote Invalidation
   4.  Private Data Message Format
     4.1.  Interoperability Considerations
       4.1.1.  Amongst RPC-over-RDMA Version 1 Implementations
       4.1.2.  Amongst Implementations of Other Upper-Layer Protocols
   5.  Updating the Message Format
     5.1.  Feature Support Flags
     5.2.  Inline Threshold Values
   6.  IANA Considerations
     6.1.  Guidance for Designated Experts
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   The RPC-over-RDMA version 1 transport protocol [RFC8166] enables
   payload data transfer using Remote Direct Memory Access (RDMA) for
   upper-layer protocols based on Remote Procedure Calls (RPC)
   [RFC5531].  The terms "Remote Direct Memory Access" (RDMA) and
   "Direct Data Placement" (DDP) are introduced in [RFC5040].

   The two most immediate shortcomings of RPC-over-RDMA version 1 are:

   o  Setting up an RDMA data transfer (via RDMA Read or Write) can be
      costly.  The small default size of messages transmitted using RDMA
      Send forces the use of RDMA Read or Write operations even for
      relatively small messages and data payloads.

      The original specification of RPC-over-RDMA version 1 provided an
      out-of-band protocol for passing inline threshold values between
      connected peers [RFC5666].  However, [RFC8166] eliminated support
      for this protocol making it unavailable for this purpose.

   o  Unlike most other contemporary RDMA-enabled storage protocols,
      there is no facility in RPC-over-RDMA version 1 that enables the
      use of remote invalidation [RFC5042].

   RPC-over-RDMA version 1 has no means of extending its XDR definition
   in such a way that interoperability with existing implementations is
   preserved.  As a result, an out-of-band mechanism is needed to help
   relieve these constraints for existing RPC-over-RDMA version 1
   implementations.

   This document specifies a simple, non-XDR-based message format
   designed to be passed between RPC-over-RDMA version 1 peers at the
   time each RDMA transport connection is first established.  The
   mechanism assumes that the underlying RDMA transport has a private
   data field that is passed between peers at connection time, such as
   is present in the iWARP protocol (described in Section 7.1 of
   [RFC5044]) or the InfiniBand Connection Manager [IBA].

   To enable current RPC-over-RDMA version 1 implementations to
   interoperate with implementations that support the private message
   format described in this document, implementation of the private data
   message is OPTIONAL.  When the private data message has been
   successfully exchanged, peers may choose to perform extended RDMA
   semantics.  However, the private message format does not alter the
   XDR definition specified in [RFC8166].

   The message format is intended to be further extensible within the
   normal scope of such IETF work (see Section 5 for further details).
   Section 6 of the current document defines an IANA registry for this
   purpose.  In addition, interoperation between implementations of
   RPC-over-RDMA version 1 that present this message format to peers and
   those that do not recognize this message format is guaranteed.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Advertised Transport Properties

3.1.  Inline Threshold Size

   Section 3.3.2 of [RFC8166] defines the term "inline threshold."  An
   inline threshold is the maximum number of bytes that can be
   transmitted using one RDMA Send and one RDMA Receive.  There is a
   pair of inline thresholds for a connection: a client-to-server
   threshold and a server-to-client threshold.

   If an incoming message exceeds the size of a receiver's inline
   threshold, the receive operation fails and the connection is
   typically terminated.  To convey a message larger than a receiver's
   inline threshold, an NFS client uses explicit RDMA data transfer
   operations, which are more expensive to use than RDMA Send.

   The default value of inline thresholds for RPC-over-RDMA version 1
   connections is 1024 bytes (see Section 3.3.3 of [RFC8166]).  This
   value is adequate for nearly all NFS version 3 procedures.

   NFS version 4 COMPOUND operations [RFC7530] are larger on average
   than NFS version 3 procedures [RFC1813], forcing clients to use
   explicit RDMA operations for frequently-issued requests such as
   LOOKUP and GETATTR.  The use of RPCSEC_GSS security also increases
   the average size of RPC messages, due to the larger size of
   RPCSEC_GSS credential material included in RPC headers [RFC7861].

   If a sender and receiver could somehow agree on larger inline
   thresholds, frequently-used RPC transactions could avoid the cost of
   explicit RDMA operations.

3.2.  Remote Invalidation

   After an RDMA data transfer operation completes, an RDMA consumer can
   use remote invalidation to request that the remote peer RNIC
   invalidate an STag associated with the data transfer [RFC5042].

   An RDMA consumer requests remote invalidation by posting an RDMA Send
   With Invalidate Work Request in place of an RDMA Send Work Request.
   Each RDMA Send With Invalidate carries one STag to invalidate.  The
   receiver of an RDMA Send With Invalidate performs the requested
   invalidation and then reports that invalidation as part of the
   completion of a waiting Receive Work Request.

   If both peers support remote invalidation, an RPC-over-RDMA responder
   might use remote invalidation when replying to an RPC request that
   provided chunks.  Because one of the chunks has already been
   invalidated, finalizing the results of the RPC is made simpler and
   faster.

   However, there are some important caveats that contraindicate the
   blanket use of remote invalidation:

   o  Remote invalidation is not supported by all RNICs.

   o  Not all RPC-over-RDMA responder implementations can generate RDMA
      Send With Invalidate Work Requests.

   o  Not all RPC-over-RDMA requester implementations can recognize when
      remote invalidation has occurred.

   o  On one connection in different RPC-over-RDMA transactions, or in a
      single RPC-over-RDMA transaction, an RPC-over-RDMA requester can
      expose a mixture of STags that may be invalidated remotely and
      some that must not be.  No indication is provided at the RDMA
      layer as to which is which.

   A responder therefore must not employ remote invalidation unless it
   is aware of support for it in its own RDMA stack, and on the
   requester.
   And, without altering the XDR structure of RPC-over-RDMA version 1
   messages, it is not possible to support remote invalidation with
   requesters that mix STags that may and must not be invalidated
   remotely in a single RPC or on the same connection.

   There are some NFS/RDMA client implementations whose STags are always
   safe to invalidate remotely.  For such clients, indicating to the
   responder that remote invalidation is always safe can allow such
   invalidation without the need for additional protocol to be defined.

4.  Private Data Message Format

   With an InfiniBand lower layer, for example, RDMA connection setup
   uses a Connection Manager when establishing a Reliable Connection
   [IBA].  When an RPC-over-RDMA version 1 transport connection is
   established, the client (which actively establishes connections) and
   the server (which passively accepts connections) populate the CM
   Private Data field exchanged as part of CM connection establishment.

   The transport properties exchanged via this mechanism are fixed for
   the life of the connection.  Each new connection presents an
   opportunity for a fresh exchange.  An implementation of the extension
   described in this document MUST be prepared for the settings to
   change upon a reconnection.

   For RPC-over-RDMA version 1, the CM Private Data field is formatted
   as described in the following subsection.  RPC clients and servers
   use the same format.  If the capacity of the Private Data field is
   too small to contain this message format, the underlying RDMA
   transport is not managed by a Connection Manager, or the underlying
   RDMA transport uses Private Data for its own purposes, the CM Private
   Data field cannot be used on behalf of RPC-over-RDMA version 1.
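
   As a rough illustration of what populating the CM Private Data field
   can look like, the sketch below packs the eight-octet message whose
   layout is defined in the remainder of this section, using the size
   encoding from Section 5.2.  The function and constant names are
   hypothetical and are not taken from any implementation.  The mask
   0x01 for bit 15 assumes the diagram's numbering, in which bit 8 is
   the most significant bit of the Flags octet.  Rejecting sizes that
   are not multiples of 1024 is a choice made for this sketch, not a
   rule stated in this document.

```python
import struct

FORMAT_IDENTIFIER = 0xF6AB0E18    # fixed Format Identifier value (Section 4)
MESSAGE_VERSION = 1               # the only version defined in this document
FLAG_REMOTE_INVALIDATION = 0x01   # assumed mask: bit 15 = low-order bit of Flags


def encode_size(nbytes: int) -> int:
    """Encode a threshold per Section 5.2: divide by 1024, subtract one."""
    # Assumption of this sketch: only whole multiples of 1024 are accepted.
    if nbytes % 1024 != 0 or not 1024 <= nbytes <= 256 * 1024:
        raise ValueError("size must be a multiple of 1024 from 1KB to 256KB")
    return nbytes // 1024 - 1


def build_private_data(send_size: int, recv_size: int,
                       remote_invalidation: bool) -> bytes:
    """Pack the eight octets in network byte order."""
    flags = FLAG_REMOTE_INVALIDATION if remote_invalidation else 0
    return struct.pack("!IBBBB", FORMAT_IDENTIFIER, MESSAGE_VERSION, flags,
                       encode_size(send_size), encode_size(recv_size))
```

   A peer that sends and receives up to 4096 bytes inline and tolerates
   remote invalidation would call build_private_data(4096, 4096, True)
   and hand the resulting eight octets to its connection manager.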

   The first 8 octets of the CM Private Data field are formatted as
   follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Format Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Format Identifier:  This field contains a fixed 32-bit value that
      identifies the content of the Private Data field as an
      RPC-over-RDMA version 1 CM Private Data message.  In RPC-over-RDMA
      version 1 Private Data, the value of this field is always
      0xf6ab0e18, in network byte order.  The use of this field is
      further expanded upon in Section 4.1.

   Version:  This 8-bit field contains a message format version number.
      The value "1" in this field indicates that exactly eight octets
      are present, that they appear in the order described in this
      section, and that each has the meaning defined in this section.
      Further considerations about the use of this field are discussed
      in Section 5.

   Flags:  This 8-bit field contains bit flags that indicate the support
      status of optional features, such as remote invalidation.  The
      meaning of these flags is defined in Section 5.1.

   Send Size:  This 8-bit field contains an encoded value corresponding
      to the maximum number of bytes this peer is prepared to transmit
      in a single RDMA Send on this connection.  The value is encoded as
      described in Section 5.2.

   Receive Size:  This 8-bit field contains an encoded value
      corresponding to the maximum number of bytes this peer is prepared
      to receive with a single RDMA Receive on this connection.  The
      value is encoded as described in Section 5.2.

4.1.  Interoperability Considerations

   The extension described in this document is designed to allow
   RPC-over-RDMA version 1 implementations that use CM Private Data to
   interoperate fully with RPC-over-RDMA version 1 implementations that
   do not exchange this information.  Implementations that use this
   extension must also interoperate fully with RDMA implementations that
   use CM Private Data for other purposes.  Realizing these goals
   requires that implementations of this extension follow the practices
   described in the rest of this section.

4.1.1.  Amongst RPC-over-RDMA Version 1 Implementations

   When a peer does not receive a CM Private Data message that conforms
   to Section 4, it needs to act as if the remote peer supports only the
   default RPC-over-RDMA version 1 settings, as defined in [RFC8166].
   In other words, the peer MUST behave as if a Private Data message was
   received in which bit 15 of the Flags field is zero, and both Size
   fields contain the value zero.

4.1.2.  Amongst Implementations of Other Upper-Layer Protocols

   The Format Identifier field in the message format defined in this
   document is provided to enable implementations to distinguish
   RPC-over-RDMA version 1 Private Data from application-specific
   private data inserted by applications other than RPC-over-RDMA
   version 1.  Examples of other applications that make use of CM
   Private Data include iWARP, via the MPA enhancement described in
   [RFC6581], and iSCSI extensions for RDMA (iSER), as defined in
   [RFC7145].

   During connection establishment, an implementation of the extension
   described in this document checks the Format Identifier field before
   decoding subsequent fields.
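
   The fallback rule in Section 4.1.1 and the identifier check just
   described can be sketched as follows.  This is an illustration under
   stated assumptions, not a reference implementation: all names are
   hypothetical, the 0x01 mask for bit 15 assumes the diagram's bit
   numbering (bit 8 is the most significant bit of the Flags octet), and
   treating an unrecognized Version value the same as an absent message
   is a choice made for this sketch.  The two helper functions apply the
   rules given in Sections 5.1 and 5.2.

```python
import struct
from typing import NamedTuple

FORMAT_IDENTIFIER = 0xF6AB0E18
FLAG_REMOTE_INVALIDATION = 0x01   # assumed mask for bit 15 of the Flags field


class PeerProperties(NamedTuple):
    remote_invalidation: bool
    send_size: int    # bytes the peer is prepared to send inline
    recv_size: int    # bytes the peer is prepared to receive inline


# Flags bit 15 clear and both Size fields zero decode to these values.
DEFAULTS = PeerProperties(False, 1024, 1024)


def parse_private_data(data: bytes) -> PeerProperties:
    """Decode a peer's CM Private Data, falling back to the defaults."""
    if len(data) < 8:
        return DEFAULTS
    ident, version, flags, send, recv = struct.unpack("!IBBBB", data[:8])
    if ident != FORMAT_IDENTIFIER:
        return DEFAULTS           # some other consumer's private data
    if version != 1:
        return DEFAULTS           # assumption: unknown versions are ignored
    return PeerProperties(bool(flags & FLAG_REMOTE_INVALIDATION),
                          (send + 1) * 1024, (recv + 1) * 1024)


def outbound_threshold(own_send_size: int, peer: PeerProperties) -> int:
    """Smaller of the local send size and the peer's receive size."""
    return min(own_send_size, peer.recv_size)


def may_send_with_invalidate(local_flag: bool, peer: PeerProperties) -> bool:
    """Remote invalidation is permitted only when both peers set the flag."""
    return local_flag and peer.remote_invalidation
```

   Returning the default properties rather than raising an error keeps
   the caller's connection logic identical whether or not the remote
   peer implements the extension.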
   If the RPC-over-RDMA version 1 CM Private Data Format Identifier is
   not present as the first 4 octets, an RPC-over-RDMA version 1
   receiver MUST ignore the CM Private Data, behaving as if no
   RPC-over-RDMA version 1 Private Data has been provided (see above).

5.  Updating the Message Format

   Although the message format described in this document provides the
   ability for the client and server to exchange particular information
   about the local RPC-over-RDMA implementation, it is possible that
   there will be a future need to exchange additional properties.  This
   would make it necessary to extend or otherwise modify the format
   described in this document.

   Any modification faces the problem of interoperating properly with
   implementations of RPC-over-RDMA version 1 that are unaware of the
   existence of the new format.  These include implementations that do
   not recognize the exchange of CM Private Data as well as those that
   recognize only the format described in this document.

   Given the message format described in this document, these
   interoperability constraints could be met by the following sorts of
   new message formats:

   o  A format that uses a different value for the first four bytes of
      the format, as provided for in the registry described in
      Section 6.

   o  A format that uses the same value for the Format Identifier field
      and a value other than one (1) in the Version field.

   Although it is possible to reorganize the last three of the eight
   bytes in the existing format, extended formats are unlikely to do so.
   New formats would take the form of extensions of the format described
   in this document with added fields starting at byte eight of the
   format and changes to the definition of previously reserved flags.

5.1.  Feature Support Flags

   The bits in the Flags field are labeled from bit 8 to bit 15, as
   shown in the diagram above.
   When the Version field contains the value "1", the bits in the Flags
   field are to be set as follows:

   Bit 15:  When both connection peers have set this flag in their CM
      Private Data, the responder MAY use RDMA Send With Invalidate when
      transmitting RPC Replies.  Each RDMA Send With Invalidate MUST
      invalidate an STag associated only with the XID in the rdma_xid
      field of the RPC-over-RDMA Transport Header it carries.

      When either peer on a connection clears this flag, the responder
      MUST use only RDMA Send when transmitting RPC Replies.

   Bits 14 - 8:  These bits are reserved and are always zero when the
      Version field contains 1.

5.2.  Inline Threshold Values

   Inline threshold sizes from 1KB to 256KB can be represented in the
   Send Size and Receive Size fields.  A sender computes the encoded
   value by dividing the actual value by 1024 and subtracting one from
   the result.  A receiver decodes this value by performing a
   complementary set of operations.

   The client uses the smaller of its own send size and the server's
   reported receive size as the client-to-server inline threshold.  The
   server uses the smaller of its own send size and the client's
   reported receive size as the server-to-client inline threshold.

6.  IANA Considerations

   In accordance with [RFC8126], the author requests that IANA create a
   new registry in the "Remote Direct Data Placement" Protocol Category
   Group.  The new registry is to be called the "RDMA-CM Private Data
   Identifier Registry".  This is a registry of 32-bit numbers that
   identify the upper-layer protocol associated with data that appears
   in the application-specific RDMA-CM Private Data area.  The fields in
   this registry include: Format Identifier, Description, and Reference.

   The initial contents of this registry are a single entry:

   +------------------+------------------------------------+-----------+
   | Format           | Format Description                 | Reference |
   | Identifier       |                                    |           |
   +------------------+------------------------------------+-----------+
   | 0xf6ab0e18       | RPC-over-RDMA version 1 CM Private | [RFC-TBD] |
   |                  | Data                               |           |
   +------------------+------------------------------------+-----------+

           Table 1: RDMA-CM Private Data Identifier Registry

   IANA is to assign subsequent new entries in this registry using the
   Expert Review policy as defined in Section 4.5 of [RFC8126].

6.1.  Guidance for Designated Experts

   The Designated Expert (DE), appointed by the IESG, should ascertain
   the existence of suitable documentation that defines the semantics
   and format of the private data, and verify that the document is
   permanently and publicly available.  Documentation produced outside
   the IETF must not conflict with work that is active or already
   published within the IETF.

   The new Reference field should contain a reference to that
   documentation.  The DE can assign new Format Identifiers at random as
   long as they do not conflict with existing entries in this registry.
   The Description field should contain the name of a distinct
   upper-layer RDMA consumer that will use the private data.

   The DE will post the request to the nfsv4 WG mailing list (or a
   successor to that list, if such a list exists), for comment and
   review.  The DE will approve or deny the request and publish notice
   of the decision within 30 days.

7.  Security Considerations

   The Private Data extension specified in this document inherits the
   security considerations of the protocols it extends, which in this
   case are defined in [RFC8166].

   The integrity of CM Private Data and the authenticity of its source
   are ensured by the use of the Reliable Connected (RC) Queue Pair (QP)
   type, required by RPC-over-RDMA version 1.  Any attempts to interfere
   with or hijack an RC connection will result in the connection being
   immediately terminated.  Additional relevant analysis of RDMA
   security appears in the Security Considerations section of [RFC5042].

   Improperly setting one of the fields in the private message payload
   will result in a greatly increased risk of disconnection (i.e.,
   self-imposed Denial of Service).  There is no increased risk of
   exposing upper-layer data inappropriately.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [IBA]      InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

              Available from https://www.infinibandta.org/

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC5044]  Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
              Carrier, "Marker PDU Aligned Framing for TCP
              Specification", RFC 5044, DOI 10.17487/RFC5044, October
              2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC6581]  Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S.
              Wise, "Enhanced Remote Direct Memory Access (RDMA)
              Connection Establishment", RFC 6581, DOI 10.17487/RFC6581,
              April 2012.

   [RFC7145]  Ko, M. and A. Nezhinsky, "Internet Small Computer System
              Interface (iSCSI) Extensions for the Remote Direct Memory
              Access (RDMA) Specification", RFC 7145,
              DOI 10.17487/RFC7145, April 2014.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

Acknowledgments

   Thanks to Christoph Hellwig and Devesh Sharma for suggesting this
   approach, and to Tom Talpey and Dave Noveck for their expert comments
   and review.  The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes.

Author's Address

   Charles Lever
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com