Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                               May 5, 2019
Expires: November 6, 2019

    RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
                draft-ietf-nfsv4-rpcrdma-cm-pvt-data-02

Abstract

   This document specifies the format of RDMA-CM Private Data exchanged
   between RPC-over-RDMA version 1 peers as part of establishing a
   connection. Such private data is used to indicate peer support for
   remote invalidation and larger-than-default inline thresholds.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 6, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Advertised Transport Properties
     3.1.  Inline Threshold Size
     3.2.  Remote Invalidation
   4.  Private Data Message Format
     4.1.  Interoperability Considerations
   5.  Updating the Message Format
     5.1.  Feature Support Flags
     5.2.  Inline Threshold Values
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   The RPC-over-RDMA version 1 transport protocol [RFC8166] enables
   payload data transfer using Remote Direct Memory Access (RDMA) for
   upper layer protocols based on Remote Procedure Calls (RPC)
   [RFC5531]. The terms "Remote Direct Memory Access" (RDMA) and
   "Direct Data Placement" (DDP) are introduced in [RFC5040].

   The two most immediate shortcomings of RPC-over-RDMA version 1 are:

   o  Setting up an RDMA data transfer (via RDMA Read or Write) can be
      costly. The small default size of messages transmitted using RDMA
      Send forces the use of RDMA Read or Write operations even for
      relatively small messages and data payloads.
      The original specification of RPC-over-RDMA version 1 provided an
      out-of-band protocol for passing inline threshold values between
      connected peers [RFC5666]. However, [RFC8166] eliminated support
      for this protocol, making it unavailable for this purpose.

   o  Unlike most other contemporary RDMA-enabled storage protocols,
      there is no facility in RPC-over-RDMA version 1 that enables the
      use of remote invalidation [RFC5042].

   RPC-over-RDMA version 1 has no means of extending its XDR definition
   in such a way that interoperability with existing implementations is
   preserved.
   As a result, an out-of-band mechanism is needed to help
   relieve these constraints for existing RPC-over-RDMA version 1
   implementations.

   This document specifies a simple, non-XDR-based message format
   designed to be passed between RPC-over-RDMA version 1 peers at the
   time each RDMA transport connection is first established. The
   purpose of such a message exchange is to enable the connecting peers
   to indicate support for transport properties that are not defined in
   the base RPC-over-RDMA version 1 protocol defined in [RFC8166].

   The message format can be extended as needed. In addition,
   interoperation between implementations of RPC-over-RDMA version 1
   that present this message format to peers and those that do not
   recognize this message format is guaranteed.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Advertised Transport Properties

3.1.  Inline Threshold Size

   Section 3.3.2 of [RFC8166] defines the term "inline threshold." An
   inline threshold is the maximum number of bytes that can be
   transmitted using one RDMA Send and one RDMA Receive. There is a
   pair of inline thresholds for a connection: a client-to-server
   threshold and a server-to-client threshold.

   If an incoming message exceeds the size of a receiver's inline
   threshold, the receive operation fails and the connection is
   typically terminated. To convey a message larger than a receiver's
   inline threshold, an NFS client uses explicit RDMA data transfer
   operations, which are more expensive to use than RDMA Send.
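
   The sender-side consequence of an inline threshold can be sketched
   in a few lines of Python. This is only an illustration, not part of
   the protocol: the names fits_inline and DEFAULT_INLINE_THRESHOLD are
   hypothetical, and the 1024-byte value is the RPC-over-RDMA version 1
   default given in Section 3.3.3 of [RFC8166].

```python
DEFAULT_INLINE_THRESHOLD = 1024  # bytes; RPC-over-RDMA version 1 default


def fits_inline(message_len, receiver_threshold=DEFAULT_INLINE_THRESHOLD):
    """Return True if a message can be conveyed with a single RDMA Send.

    A message longer than the receiver's inline threshold has to be
    conveyed with explicit RDMA data transfer operations instead.
    """
    return message_len <= receiver_threshold
```

   With the default threshold, a 1025-byte message already requires
   explicit RDMA, whereas a peer that accepts 4096 bytes per Receive
   would let the same message be sent inline.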

   The default value of inline thresholds for RPC-over-RDMA version 1
   connections is 1024 bytes (see Section 3.3.3 of [RFC8166]). This
   value is adequate for nearly all NFS version 3 procedures.

   NFS version 4 COMPOUND operations [RFC7530] are larger on average
   than NFS version 3 procedures [RFC1813], forcing clients to use
   explicit RDMA operations for frequently-issued requests such as
   LOOKUP and GETATTR. The use of RPCSEC_GSS security also increases
   the average size of RPC messages, due to the larger size of
   RPCSEC_GSS credential material included in RPC headers [RFC7861].

   If a sender and receiver could somehow agree on larger inline
   thresholds, frequently-used RPC transactions could avoid the cost of
   explicit RDMA operations.

3.2.  Remote Invalidation

   After an RDMA data transfer operation completes, an RDMA peer can use
   remote invalidation to request that the remote peer RNIC invalidate
   an STag associated with the data transfer [RFC5042].

   An RDMA consumer requests remote invalidation by posting an RDMA Send
   With Invalidate Work Request in place of an RDMA Send Work Request.
   Each RDMA Send With Invalidate carries one STag to invalidate. The
   receiver of an RDMA Send With Invalidate performs the requested
   invalidation and then reports that invalidation as part of the
   completion of a waiting Receive Work Request.

   An RPC-over-RDMA responder can use remote invalidation when replying
   to an RPC request that provided Read or Write chunks. The requester
   thus avoids dispatching an extra Work Request, the resulting context
   switch, and the invalidation completion interrupt as part of
   completing an RPC transaction that uses chunks. The upshot is faster
   completion of RPC transactions that involve RDMA data transfer.

   There are some important caveats which contraindicate the blanket use
   of remote invalidation:

   o  Remote invalidation is not supported by all RNICs.

   o  Not all RPC-over-RDMA responder implementations can generate RDMA
      Send With Invalidate Work Requests.

   o  Not all RPC-over-RDMA requester implementations can recognize when
      remote invalidation has occurred.

   o  On one connection in different RPC-over-RDMA transactions, or in a
      single RPC-over-RDMA transaction, an RPC-over-RDMA requester can
      expose a mixture of STags that may be invalidated remotely and
      some that must not be. No indication is provided at the RDMA
      layer as to which is which.

   A responder therefore must not employ remote invalidation unless it
   is aware of support for it in its own RDMA stack, and on the
   requester. And, without altering the XDR structure of RPC-over-RDMA
   version 1 messages, it is not possible to support remote invalidation
   with requesters that mix STags that may and must not be invalidated
   remotely in a single RPC or on the same connection.

   However, it is possible to provide a simple signaling mechanism for a
   requester to indicate it can deal with remote invalidation of any
   STag it has presented to a responder. There are some NFS/RDMA client
   implementations that can successfully make use of such a signaling
   mechanism.

4.  Private Data Message Format

   With an InfiniBand lower layer, for example, RDMA connection setup
   uses a Connection Manager when establishing a Reliable Connection
   [IBARCH]. When an RPC-over-RDMA version 1 transport connection is
   established, the client (which actively establishes connections) and
   the server (which passively accepts connections) populate the CM
   Private Data field exchanged as part of CM connection establishment.

   The transport properties exchanged via this mechanism are fixed for
   the life of the connection. Each new connection presents an
   opportunity for a fresh exchange.

   For RPC-over-RDMA version 1, the CM Private Data field is formatted
   as described below.
   RPC clients and servers
   use the same format. If the capacity of the Private Data field is
   too small to contain this message format, the underlying RDMA
   transport is not managed by a Connection Manager, or the underlying
   RDMA transport uses Private Data for its own purposes, the CM Private
   Data field cannot be used on behalf of RPC-over-RDMA version 1.

   The first 8 octets of the CM Private Data field are to be formatted
   as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Format Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Format Identifier: This field contains a fixed 32-bit value that
      identifies the content of the Private Data field as an
      RPC-over-RDMA version 1 CM Private Data message. The value of
      this field is always 0xf6ab0e18, in network byte order. The use
      of this field is further expanded upon in Section 4.1.

   Version: This 8-bit field contains a message format version number.
      The value "1" in this field indicates that exactly eight octets
      are present, that they appear in the order described in this
      section, and that each has the meaning defined in this section.

      Further considerations about the use of this field are discussed
      in Section 5.

   Flags: This 8-bit field contains bit flags that indicate the support
      status of optional features, such as remote invalidation. The
      meaning of these flags is defined in Section 5.1.

   Send Size: This 8-bit field contains an encoded value corresponding
      to the maximum number of bytes this peer is prepared to transmit
      in a single RDMA Send on this connection. The value is encoded as
      described in Section 5.2.

   Receive Size: This 8-bit field contains an encoded value
      corresponding to the maximum number of bytes this peer is prepared
      to receive with a single RDMA Receive on this connection. The
      value is encoded as described in Section 5.2.

4.1.  Interoperability Considerations

   The extension described in this document is designed to allow
   RPC-over-RDMA version 1 implementations that use this extension to
   interoperate fully with RPC-over-RDMA version 1 implementations that
   do not exchange this information. Realizing this goal requires that
   implementations of this extension follow the practices described in
   the rest of this section.

   RPC-over-RDMA version 1 implementations that support the extension
   described in this document are intended to interoperate fully with
   RPC-over-RDMA version 1 implementations that do not recognize the
   exchange of CM Private Data. When a peer does not receive a CM
   Private Data message which conforms to Section 4, it needs to act as
   if the remote peer supports only the default RPC-over-RDMA version 1
   settings, as defined in [RFC8166]. In other words, the peer is to
   behave as if a Private Data message was received in which bit 15 of
   the Flags field is zero, and both Size fields contain the value zero.

   The Format Identifier field is provided in order to distinguish
   RPC-over-RDMA version 1 Private Data from private data inserted by
   layers below or above RPC-over-RDMA version 1. During connection
   establishment, RPC-over-RDMA version 1 implementations check for this
   protocol number before decoding subsequent fields. If this protocol
   number is not present as the first 4 octets, an RPC-over-RDMA
   receiver needs to ignore the CM Private Data (i.e., behave as if no
   RPC-over-RDMA version 1 Private Data has been provided).

5.  Updating the Message Format

   Although the message format described in this document provides the
   ability for the client and server to exchange particular information
   about the local RPC-over-RDMA implementation, it is possible that
   there will be a future need to exchange additional properties. This
   would make it necessary to extend or otherwise modify the format
   described in this document.

   Any modification faces the problem of interoperating properly with
   implementations of RPC-over-RDMA version 1 that are unaware of the
   existence of the new format. These include implementations that
   do not recognize the exchange of CM Private Data as well as those
   that recognize only the format described in this document.

   Given the message format described in this document, these
   interoperability constraints could be met by the following sorts of
   new message formats:

   o  A format which uses a different value for the first four bytes of
      the format, as provided for in the registry described in
      Section 6.

   o  A format which uses the same value for the Format Identifier field
      and a value other than one (1) in the Version field.

   Although it is possible to reorganize the last three of the eight
   bytes in the existing format, extended formats are unlikely to do so.
   New formats would take the form of extensions of the format described
   in this document with added fields starting at byte eight of the
   format and changes to the definition of previously reserved flags.

5.1.  Feature Support Flags

   The bits in the Flags field are labeled from bit 8 to bit 15, as
   shown in the diagram above. When the Version field contains the
   value "1", the bits in the Flags field are to be set as follows:

   Bit 15: When both connection peers have set this flag in their CM
      Private Data, the responder MAY use RDMA Send With Invalidate when
      transmitting RPC Replies.
      Each RDMA Send With Invalidate MUST
      invalidate an STag associated only with the XID in the rdma_xid
      field of the RPC-over-RDMA Transport Header it carries.
      When either peer on a connection clears this flag, the responder
      MUST use only RDMA Send when transmitting RPC Replies.

   Bits 14 - 8: These bits are reserved and are always zero.

5.2.  Inline Threshold Values

   Inline threshold sizes from 1KB to 256KB can be represented in the
   Send Size and Receive Size fields. A sender computes the encoded
   value by dividing the actual value by 1024 and subtracting one from
   the result. A receiver decodes this value by performing a
   complementary set of operations.

   The client uses the smaller of its own send size and the server's
   reported receive size as the client-to-server inline threshold. The
   server uses the smaller of its own send size and the client's
   reported receive size as the server-to-client inline threshold.

6.  IANA Considerations

   In accordance with [RFC8126], the author requests that IANA create a
   new registry in the "Remote Direct Data Placement" Protocol Category
   Group. The new registry is to be called the "RDMA-CM Private Data
   Identifier Registry". This is a registry of 32-bit numbers that
   identify the Upper Layer protocol associated with data that appears
   in the RDMA-CM Private Data area.

   The information that must be provided to add an entry to this
   registry will be an IESG-approved Standards Track specification
   defining the semantics and interoperability requirements of the
   proposed new value and the fields to be recorded in the registry.
   The fields in this registry include: Field Identifier, Format
   Description, and Reference.
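
   As an illustration of the message format in Section 4, the fallback
   behavior in Section 4.1, and the encoding rules in Sections 5.1 and
   5.2, the following Python sketch packs and unpacks the eight-octet
   message and derives per-connection settings. It is a minimal sketch
   under stated assumptions, not a reference implementation. All
   function and constant names are hypothetical. Treating an
   unrecognized Version value like a missing message, and rounding
   sizes that are not multiples of 1024 downward, are assumptions that
   this document does not specify.

```python
import struct

FORMAT_ID = 0xF6AB0E18    # Format Identifier value from Section 4
VERSION = 1               # the only Version value this document defines
FLAG_REMOTE_INVAL = 0x01  # bit 15 in the diagram: low-order bit of Flags
DEFAULT_THRESHOLD = 1024  # RPC-over-RDMA version 1 default, in bytes
LAYOUT = "!IBBBB"         # network byte order: 32-bit word, four octets


def encode_size(nbytes):
    """Section 5.2: divide by 1024, then subtract one."""
    if not 1024 <= nbytes <= 256 * 1024:
        raise ValueError("inline threshold must be between 1KB and 256KB")
    # Assumption: a size that is not a multiple of 1024 is rounded down.
    return nbytes // 1024 - 1


def decode_size(code):
    """Complement of encode_size: add one, then multiply by 1024."""
    return (code + 1) * 1024


def pack_private_data(send_size, recv_size, remote_inval):
    """Build the eight octets a peer would place in CM Private Data."""
    flags = FLAG_REMOTE_INVAL if remote_inval else 0
    return struct.pack(LAYOUT, FORMAT_ID, VERSION, flags,
                       encode_size(send_size), encode_size(recv_size))


def unpack_private_data(data):
    """Return (send_size, recv_size, remote_inval) advertised by a peer.

    Falls back to the Section 4.1 defaults when the message is absent,
    too short, or does not begin with the Format Identifier.
    """
    defaults = (DEFAULT_THRESHOLD, DEFAULT_THRESHOLD, False)
    if not data or len(data) < 8:
        return defaults
    fmt_id, version, flags, snd, rcv = struct.unpack(LAYOUT, data[:8])
    # Assumption: an unrecognized Version is treated like no message.
    if fmt_id != FORMAT_ID or version != VERSION:
        return defaults
    return decode_size(snd), decode_size(rcv), bool(flags & FLAG_REMOTE_INVAL)


def negotiate(local, remote):
    """Combine two (send_size, recv_size, remote_inval) tuples.

    Returns the local peer's send threshold, its receive threshold, and
    whether RDMA Send With Invalidate may be used on the connection.
    """
    send_threshold = min(local[0], remote[1])
    recv_threshold = min(remote[0], local[1])
    return send_threshold, recv_threshold, local[2] and remote[2]


wire = pack_private_data(4096, 4096, True)
print(wire.hex())  # f6ab0e1801010303
print(negotiate((4096, 4096, True), unpack_private_data(b"")))
# (1024, 1024, False)
```

   The second call shows the interoperability case: a peer that sends
   no Private Data is handled as if it had advertised the default
   1024-byte thresholds and no support for remote invalidation.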

   The initial contents of this registry are a single entry:

   +-----------------+-------------------------------------+-----------+
   | Field           | Format Description                  | Reference |
   | Identifier      |                                     |           |
   +-----------------+-------------------------------------+-----------+
   | 0xf6ab0e18      | RPC-over-RDMA version 1 CM Private  | [RFC-TBD] |
   |                 | Data                                |           |
   +-----------------+-------------------------------------+-----------+

           Table 1: RDMA-CM Private Data Identifier Registry

   The Expert Review policy, as defined in Section 4.5 of [RFC8126], is
   to be used to handle requests to add new entries to the "RDMA-CM
   Private Data Identifier Registry". New protocol numbers can be
   assigned at random as long as they do not conflict with existing
   entries in this registry.

7.  Security Considerations

   RDMA-CM Private Data typically traverses the link layer in the clear.
   A man-in-the-middle attack could alter the settings exchanged at
   connect time such that one or both peers might perform operations
   that result in premature termination of the connection.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T.
              Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

Acknowledgments

   Thanks to Christoph Hellwig and Devesh Sharma for suggesting this
   approach, and to Tom Talpey and Dave Noveck for their expert comments
   and review. The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes.

Author's Address

   Charles Lever
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com