Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                             June 13, 2019
Expires: December 15, 2019

    RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
                 draft-ietf-nfsv4-rpcrdma-cm-pvt-data-03

Abstract

   This document specifies the format of RDMA-CM Private Data exchanged
   between RPC-over-RDMA version 1 peers as part of establishing a
   connection.  Such private data is used to indicate peer support for
   remote invalidation and larger-than-default inline thresholds.
   The addition of the private data payload specified in this document
   is an OPTIONAL extension.  The RPC-over-RDMA version 1 protocol does
   not require the payload to be present.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 15, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Advertised Transport Properties
     3.1.  Inline Threshold Size
     3.2.  Remote Invalidation
   4.  Private Data Message Format
     4.1.  Interoperability Considerations
   5.  Updating the Message Format
     5.1.  Feature Support Flags
     5.2.  Inline Threshold Values
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   The RPC-over-RDMA version 1 transport protocol [RFC8166] enables
   payload data transfer using Remote Direct Memory Access (RDMA) for
   upper layer protocols based on Remote Procedure Calls (RPC)
   [RFC5531].  The terms "Remote Direct Memory Access" (RDMA) and
   "Direct Data Placement" (DDP) are introduced in [RFC5040].

   The two most immediate shortcomings of RPC-over-RDMA version 1 are:

   o  Setting up an RDMA data transfer (via RDMA Read or Write) can be
      costly.  The small default size of messages transmitted using RDMA
      Send forces the use of RDMA Read or Write operations even for
      relatively small messages and data payloads.

      The original specification of RPC-over-RDMA version 1 provided an
      out-of-band protocol for passing inline threshold values between
      connected peers [RFC5666].  However, [RFC8166] eliminated support
      for this protocol making it unavailable for this purpose.

   o  Unlike most other contemporary RDMA-enabled storage protocols,
      there is no facility in RPC-over-RDMA version 1 that enables the
      use of remote invalidation [RFC5042].

   RPC-over-RDMA version 1 has no means of extending its XDR definition
   in such a way that interoperability with existing implementations is
   preserved.  As a result, an out-of-band mechanism is needed to help
   relieve these constraints for existing RPC-over-RDMA version 1
   implementations.

   This document specifies a simple, non-XDR-based message format
   designed to be passed between RPC-over-RDMA version 1 peers at the
   time each RDMA transport connection is first established.  The
   purpose of such a message exchange is to enable the connecting peers
   to indicate support for transport properties that are not defined in
   the base RPC-over-RDMA version 1 protocol defined in [RFC8166].

   The message format is intended to be further extensible within the
   normal scope of such IETF work (see Section 5 for further details).
   Section 6 of the current document defines an IANA registry for this
   purpose.  In addition, interoperation between implementations of RPC-
   over-RDMA version 1 that present this message format to peers and
   those that do not recognize this message format is guaranteed.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Advertised Transport Properties

3.1.  Inline Threshold Size

   Section 3.3.2 of [RFC8166] defines the term "inline threshold."  An
   inline threshold is the maximum number of bytes that can be
   transmitted using one RDMA Send and one RDMA Receive.  There is a
   pair of inline thresholds for a connection: a client-to-server
   threshold and a server-to-client threshold.
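
   As an illustration only, the following Python sketch shows how the
   pair of thresholds could be derived once both peers have advertised
   their sizes, following the rule given in Section 5.2 and the
   1024-byte default from [RFC8166].  The function and constant names
   are hypothetical and are not defined by this document.

```python
# Default inline threshold for RPC-over-RDMA version 1, in bytes
# (Section 3.3.3 of RFC 8166).
DEFAULT_INLINE_THRESHOLD = 1024


def negotiate_inline_thresholds(client_send, client_recv,
                                server_send, server_recv):
    """Return (client_to_server, server_to_client) thresholds in bytes.

    Each direction uses the smaller of the sender's send size and the
    receiver's advertised receive size (hypothetical helper).
    """
    client_to_server = min(client_send, server_recv)
    server_to_client = min(server_send, client_recv)
    return client_to_server, server_to_client


def fits_inline(message_len, threshold=DEFAULT_INLINE_THRESHOLD):
    """True if a message of message_len bytes can go in one RDMA Send."""
    return message_len <= threshold
```

   A message for which fits_inline() returns False would have to be
   conveyed using explicit RDMA data transfer operations instead.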

   If an incoming message exceeds the size of a receiver's inline
   threshold, the receive operation fails and the connection is
   typically terminated.  To convey a message larger than a receiver's
   inline threshold, an NFS client uses explicit RDMA data transfer
   operations, which are more expensive to use than RDMA Send.

   The default value of inline thresholds for RPC-over-RDMA version 1
   connections is 1024 bytes (see Section 3.3.3 of [RFC8166]).  This
   value is adequate for nearly all NFS version 3 procedures.

   NFS version 4 COMPOUND operations [RFC7530] are larger on average
   than NFS version 3 procedures [RFC1813], forcing clients to use
   explicit RDMA operations for frequently-issued requests such as
   LOOKUP and GETATTR.  The use of RPCSEC_GSS security also increases
   the average size of RPC messages, due to the larger size of
   RPCSEC_GSS credential material included in RPC headers [RFC7861].

   If a sender and receiver could somehow agree on larger inline
   thresholds, frequently-used RPC transactions could avoid the cost of
   explicit RDMA operations.

3.2.  Remote Invalidation

   After an RDMA data transfer operation completes, an RDMA consumer can
   use remote invalidation to request that the remote peer RNIC
   invalidate an STag associated with the data transfer [RFC5042].

   An RDMA consumer requests remote invalidation by posting an RDMA Send
   With Invalidate Work Request in place of an RDMA Send Work Request.
   Each RDMA Send With Invalidate carries one STag to invalidate.  The
   receiver of an RDMA Send With Invalidate performs the requested
   invalidation and then reports that invalidation as part of the
   completion of a waiting Receive Work Request.

   If both peers support remote invalidation, an RPC-over-RDMA responder
   might use remote invalidation when replying to an RPC request that
   provided chunks.
   Because one of the chunks has already been invalidated, finalizing
   the results of the RPC is made simpler and faster.

   However, there are some important caveats which contraindicate the
   blanket use of remote invalidation:

   o  Remote invalidation is not supported by all RNICs.

   o  Not all RPC-over-RDMA responder implementations can generate RDMA
      Send With Invalidate Work Requests.

   o  Not all RPC-over-RDMA requester implementations can recognize when
      remote invalidation has occurred.

   o  On one connection in different RPC-over-RDMA transactions, or in a
      single RPC-over-RDMA transaction, an RPC-over-RDMA requester can
      expose a mixture of STags that may be invalidated remotely and
      some that must not be.  No indication is provided at the RDMA
      layer as to which is which.

   A responder therefore must not employ remote invalidation unless it
   is aware of support for it in its own RDMA stack, and on the
   requester.  And, without altering the XDR structure of RPC-over-RDMA
   version 1 messages, it is not possible to support remote invalidation
   with requesters that mix STags that may and must not be invalidated
   remotely in a single RPC or on the same connection.

   There are some NFS/RDMA client implementations whose STags are always
   safe to invalidate remotely.  For such clients, indicating to the
   responder that remote invalidation is always safe can allow such
   invalidation without the need for additional protocol to be defined.

4.  Private Data Message Format

   With an InfiniBand lower layer, for example, RDMA connection setup
   uses a Connection Manager when establishing a Reliable Connection
   [IBARCH].  When an RPC-over-RDMA version 1 transport connection is
   established, the client (which actively establishes connections) and
   the server (which passively accepts connections) populate the CM
   Private Data field exchanged as part of CM connection establishment.
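
   The following Python sketch, offered as a non-normative illustration,
   shows one way a peer might build and interpret the eight-octet
   payload laid out in the remainder of this section.  All names are
   hypothetical.  Three points are assumptions of the sketch rather than
   statements from this document: the remote invalidation flag (bit 15
   in the diagram's numbering) is taken to be the low-order bit of the
   Flags octet, sizes passed to the encoder are required to be exact
   multiples of 1024, and a payload carrying an unrecognized Version is
   treated like an absent payload.

```python
import struct

FORMAT_IDENTIFIER = 0xF6AB0E18    # fixed value given in Section 4
VERSION_1 = 1
# Assumption: bit 15 of the second 32-bit word is the low-order bit
# of the Flags octet.
FLAG_REMOTE_INVALIDATION = 0x01
# Format Identifier (32 bits), then Version, Flags, Send Size and
# Receive Size (8 bits each), all in network byte order.
_LAYOUT = struct.Struct("!IBBBB")
# Behavior when no conforming message is received: 1024-byte
# thresholds and no remote invalidation.
DEFAULTS = (1024, 1024, False)


def encode_size(nbytes):
    """Encode a size in bytes: divide by 1024, then subtract one."""
    # Assumption: only exact multiples of 1024 are accepted.
    if nbytes % 1024 or not 1024 <= nbytes <= 256 * 1024:
        raise ValueError("size must be a multiple of 1024, 1KB to 256KB")
    return nbytes // 1024 - 1


def decode_size(encoded):
    """Complementary operation: add one, then multiply by 1024."""
    return (encoded + 1) * 1024


def build_private_data(send_size, recv_size, remote_invalidation):
    """Return the eight octets to place in the CM Private Data field."""
    flags = FLAG_REMOTE_INVALIDATION if remote_invalidation else 0
    return _LAYOUT.pack(FORMAT_IDENTIFIER, VERSION_1, flags,
                        encode_size(send_size), encode_size(recv_size))


def parse_private_data(data):
    """Return (send_size, recv_size, remote_invalidation) for a peer."""
    if data is None or len(data) < _LAYOUT.size:
        return DEFAULTS
    ident, version, flags, send, recv = _LAYOUT.unpack_from(data)
    # Assumption: an unknown Version is handled like a missing payload.
    if ident != FORMAT_IDENTIFIER or version != VERSION_1:
        return DEFAULTS
    return (decode_size(send), decode_size(recv),
            bool(flags & FLAG_REMOTE_INVALIDATION))
```

   Under these assumptions, a peer advertising a 4096-byte send size, an
   8192-byte receive size, and remote invalidation support would emit
   the octets f6 ab 0e 18 01 01 03 07.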

   The transport properties exchanged via this mechanism are fixed for
   the life of the connection.  Each new connection presents an
   opportunity for a fresh exchange.  An implementation of the extension
   described in this document MUST be prepared for the settings to
   change upon a reconnection.

   For RPC-over-RDMA version 1, the CM Private Data field is formatted
   as described in the following subsection.  RPC clients and servers
   use the same format.  If the capacity of the Private Data field is
   too small to contain this message format, the underlying RDMA
   transport is not managed by a Connection Manager, or the underlying
   RDMA transport uses Private Data for its own purposes, the CM Private
   Data field cannot be used on behalf of RPC-over-RDMA version 1.

   The first 8 octets of the CM Private Data field are to be formatted
   as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Format Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Format Identifier:  This field contains a fixed 32-bit value that
      identifies the content of the Private Data field as an RPC-over-
      RDMA version 1 CM Private Data message.  The value of this field
      is always 0xf6ab0e18, in network byte order.  The use of this
      field is further expanded upon in Section 4.1.

   Version:  This 8-bit field contains a message format version number.
      The value "1" in this field indicates that exactly eight octets
      are present, that they appear in the order described in this
      section, and that each has the meaning defined in this section.
      Further considerations about the use of this field are discussed
      in Section 5.

   Flags:  This 8-bit field contains bit flags that indicate the support
      status of optional features, such as remote invalidation.  The
      meaning of these flags is defined in Section 5.1.

   Send Size:  This 8-bit field contains an encoded value corresponding
      to the maximum number of bytes this peer is prepared to transmit
      in a single RDMA Send on this connection.  The value is encoded as
      described in Section 5.2.

   Receive Size:  This 8-bit field contains an encoded value
      corresponding to the maximum number of bytes this peer is prepared
      to receive with a single RDMA Receive on this connection.  The
      value is encoded as described in Section 5.2.

4.1.  Interoperability Considerations

   The extension described in this document is designed to allow RPC-
   over-RDMA version 1 implementations that use this extension to
   interoperate fully with RPC-over-RDMA version 1 implementations that
   do not exchange this information.  Realizing this goal requires that
   implementations of this extension follow the practices described in
   the rest of this section.

   RPC-over-RDMA version 1 implementations that support the extension
   described in this document are intended to interoperate fully with
   RPC-over-RDMA version 1 implementations that do not recognize the
   exchange of CM Private Data.  When a peer does not receive a CM
   Private Data message which conforms to Section 4, it needs to act as
   if the remote peer supports only the default RPC-over-RDMA version 1
   settings, as defined in [RFC8166].  In other words, the peer is to
   behave as if a Private Data message was received in which bit 15 of
   the Flags field is zero, and both Size fields contain the value zero.

   The Format Identifier field is provided in order to distinguish RPC-
   over-RDMA version 1 Private Data from private data inserted by layers
   below or above RPC-over-RDMA version 1.
   During connection establishment, RPC-over-RDMA version 1
   implementations check for this protocol number before decoding
   subsequent fields.  If this protocol number is not present as the
   first 4 octets, an RPC-over-RDMA receiver needs to ignore the CM
   Private Data (i.e., behave as if no RPC-over-RDMA version 1 Private
   Data has been provided).

5.  Updating the Message Format

   Although the message format described in this document provides the
   ability for the client and server to exchange particular information
   about the local RPC-over-RDMA implementation, it is possible that
   there will be a future need to exchange additional properties.  This
   would make it necessary to extend or otherwise modify the format
   described in this document.

   Any modification faces the problem of interoperating properly with
   implementations of RPC-over-RDMA version 1 that are unaware of the
   existence of the new format.  These include implementations that do
   not recognize the exchange of CM Private Data as well as those that
   recognize only the format described in this document.

   Given the message format described in this document, these
   interoperability constraints could be met by the following sorts of
   new message formats:

   o  A format which uses a different value for the first four bytes of
      the format, as provided for in the registry described in
      Section 6.

   o  A format which uses the same value for the Format Identifier field
      and a value other than one (1) in the Version field.

   Although it is possible to reorganize the last three of the eight
   bytes in the existing format, extended formats are unlikely to do so.
   New formats would take the form of extensions of the format described
   in this document with added fields starting at byte eight of the
   format and changes to the definition of previously reserved flags.

5.1.  Feature Support Flags

   The bits in the Flags field are labeled from bit 8 to bit 15, as
   shown in the diagram above.  When the Version field contains the
   value "1", the bits in the Flags field are to be set as follows:

   Bit 15:  When both connection peers have set this flag in their CM
      Private Data, the responder MAY use RDMA Send With Invalidate when
      transmitting RPC Replies.  Each RDMA Send With Invalidate MUST
      invalidate an STag associated only with the XID in the rdma_xid
      field of the RPC-over-RDMA Transport Header it carries.

      When either peer on a connection clears this flag, the responder
      MUST use only RDMA Send when transmitting RPC Replies.

   Bits 14 - 8:  These bits are reserved and are always zero.

5.2.  Inline Threshold Values

   Inline threshold sizes from 1KB to 256KB can be represented in the
   Send Size and Receive Size fields.  A sender computes the encoded
   value by dividing the actual value by 1024 and subtracting one from
   the result.  A receiver decodes this value by performing a
   complementary set of operations.

   The client uses the smaller of its own send size and the server's
   reported receive size as the client-to-server inline threshold.  The
   server uses the smaller of its own send size and the client's
   reported receive size as the server-to-client inline threshold.

6.  IANA Considerations

   In accordance with [RFC8126], the author requests that IANA create a
   new registry in the "Remote Direct Data Placement" Protocol Category
   Group.  The new registry is to be called the "RDMA-CM Private Data
   Identifier Registry".  This is a registry of 32-bit numbers that
   identify the Upper Layer protocol associated with data that appears
   in the RDMA-CM Private Data area.
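
   A consumer of this registry could be sketched in Python as shown
   here.  The dictionary is a hypothetical in-memory copy that holds
   only the identifier defined in Section 4 of this document; the
   function name is likewise an assumption made for illustration.

```python
import struct

# Hypothetical local copy of the registry, keyed by the 32-bit
# identifier carried in the first four octets of the private data.
PRIVATE_DATA_FORMATS = {
    0xF6AB0E18: "RPC-over-RDMA version 1 CM Private Data",
}


def identify_private_data(data):
    """Return the registered description, or None if unrecognized."""
    if len(data) < 4:
        return None
    # The identifier is read in network byte order.
    (identifier,) = struct.unpack_from("!I", data)
    return PRIVATE_DATA_FORMATS.get(identifier)
```

   A receiver that gets None back would treat the private data as
   belonging to some other layer and ignore it.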

   The information that must be provided to add an entry to this
   registry will be an IESG-approved Standards Track specification
   defining the semantics and interoperability requirements of the
   proposed new value and the fields to be recorded in the registry.
   The fields in this registry include: Field Identifier, Format
   Description, and Reference.

   The initial contents of this registry are a single entry:

   +-----------------+-------------------------------------+-----------+
   | Field           | Format Description                  | Reference |
   | Identifier      |                                     |           |
   +-----------------+-------------------------------------+-----------+
   | 0xf6ab0e18      | RPC-over-RDMA version 1 CM Private  | [RFC-TBD] |
   |                 | Data                                |           |
   +-----------------+-------------------------------------+-----------+

           Table 1: RDMA-CM Private Data Identifier Registry

   The Expert Review policy, as defined in Section 4.5 of [RFC8126], is
   to be used to handle requests to add new entries to the "RDMA-CM
   Private Data Identifier Registry".  New protocol numbers can be
   assigned at random as long as they do not conflict with existing
   entries in this registry.

7.  Security Considerations

   The private data extension specified in this document inherits the
   security considerations of the link layer protocols it extends; e.g.,
   the MPA protocol, as specified in [RFC5044] and extended in
   [RFC6581].  Additional relevant analysis of RDMA security appears in
   the Security Considerations section of [RFC5042].

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E.
              Deleganes, "Direct Data Placement Protocol (DDP) / Remote
              Direct Memory Access Protocol (RDMAP) Security", RFC 5042,
              DOI 10.17487/RFC5042, October 2007.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC5044]  Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
              Carrier, "Marker PDU Aligned Framing for TCP
              Specification", RFC 5044, DOI 10.17487/RFC5044, October
              2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC6581]  Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S.
              Wise, "Enhanced Remote Direct Memory Access (RDMA)
              Connection Establishment", RFC 6581, DOI 10.17487/RFC6581,
              April 2012.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7861]  Adamson, A. and N.
              Williams, "Remote Procedure Call (RPC) Security Version
              3", RFC 7861, DOI 10.17487/RFC7861, November 2016.

Acknowledgments

   Thanks to Christoph Hellwig and Devesh Sharma for suggesting this
   approach, and to Tom Talpey and Dave Noveck for their expert comments
   and review.  The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes.

Author's Address

   Charles Lever
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com