Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                        September 18, 2017
Expires: March 22, 2018

    RDMA Connection Manager Private Data For RPC-Over-RDMA Version 1
                 draft-cel-nfsv4-rpcrdma-cm-pvt-msg-02

Abstract

   This document specifies the format of RDMA-CM Private Data exchanged
   between RPC-over-RDMA version 1 peers as a transport connection is
   established. Such messages indicate peer support for Remote
   Invalidation and larger-than-default inline thresholds. The message
   format is extensible.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 22, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Advertised Transport Capabilities
     3.1.  Inline Threshold Size
     3.2.  Remote Invalidation
   4.  Private Data Message Format
     4.1.  Fixed Mandatory Fields
     4.2.  Extending the Message Format
   5.  Interoperability Considerations
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   The RPC-over-RDMA version 1 transport protocol enables the use of
   RDMA data transfer for upper layer protocols based on RPC [RFC8166].
   The terms "Remote Direct Memory Access" (RDMA) and "Direct Data
   Placement" (DDP) are introduced in [RFC5040].

   The two most immediate shortcomings of RPC-over-RDMA version 1 are:

   o  Setting up an RDMA data transfer (via RDMA Read or Write) can be
      costly. The small maximum size of messages transmitted using RDMA
      Send forces the use of RDMA Read or Write operations even for
      relatively small messages and data payloads.

   o  Unlike most other contemporary RDMA-enabled storage protocols,
      there is no facility in RPC-over-RDMA version 1 that enables the
      use of Remote Invalidation [RFC5042].

   The original specification of RPC-over-RDMA version 1 provided an
   out-of-band protocol for passing inline threshold values between
   connected peers [RFC5666]. However, [RFC8166] eliminated support for
   this protocol, making it unavailable for this purpose.

   RPC-over-RDMA version 1 has no means of extending its XDR definition
   such that interoperability with existing implementations is
   preserved. As a result, an out-of-band mechanism is needed to help
   relieve these limitations for existing RPC-over-RDMA version 1
   implementations.

   This document specifies a simple, non-XDR-based message format
   designed to pass between RPC-over-RDMA version 1 peers as each RDMA
   transport connection is first established. The purpose of this
   message format is two-fold:

   o  To provide immediate relief from certain performance constraints
      inherent in RPC-over-RDMA version 1

   o  To enable experimentation with parameters of the base RDMA
      transport over which RPC-over-RDMA runs

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

3.  Advertised Transport Capabilities

3.1.  Inline Threshold Size

   Section 3.3.2 of [RFC8166] defines the term "inline threshold." An
   inline threshold is the maximum number of bytes that can be
   transmitted using only one RDMA Send and one RDMA Receive. There is
   a pair of inline thresholds per transport connection, one for each
   direction of message flow.

   If an incoming message exceeds the size of a receiver's inline
   threshold, the receive operation fails and the connection is
   typically terminated. To convey a message larger than a receiver's
   inline threshold, an NFS client uses explicit RDMA data transfer
   operations, which are more expensive to use than RDMA Send.

   The default value of inline thresholds for RPC-over-RDMA version 1
   connections is 1024 bytes in both directions (see Section 3.3.3 of
   [RFC8166]). This value is adequate for nearly all NFS version 3
   procedures.

   NFS version 4 COMPOUNDs are larger on average than NFSv3 procedures,
   forcing clients to use explicit RDMA operations for frequently-issued
   requests such as LOOKUP and GETATTR. The use of RPCSEC_GSS security
   also increases the average size of RPC messages, due to the larger
   size of credential material in RPC headers [RFC7861].

   If a sender and receiver can agree on larger inline thresholds, more
   RPC transactions avoid the cost of explicit RDMA operations.

3.2.  Remote Invalidation

   After an RDMA data transfer operation completes, an RDMA peer can use
   Remote Invalidation to request that the remote peer's RNIC invalidate
   an STag associated with the data transfer [RFC5042].

   An RDMA consumer requests Remote Invalidation by posting an RDMA Send
   With Invalidate Work Request in place of an RDMA Send Work Request.
   The RDMA Send With Invalidate carries the R_key value of the STag to
   invalidate. Invalidation of that R_key is performed and then
   reported as part of the completion of a waiting Receive Work Request.

   An RPC-over-RDMA responder might use Remote Invalidation when
   replying to an RPC request that provided Read or Write chunks. The
   requester avoids an extra Work Request, context switch, and interrupt
   to invalidate one chunk as part of completing an RPC transaction.
   The upshot is faster completion of RPC transactions that involve RDMA
   data transfer.
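
   The saving described above can be illustrated from the requester's
   side. The following is a minimal sketch, not taken from any
   implementation: the structure rpc_rkey and the function
   rkeys_needing_local_inv are hypothetical names, and the sketch
   assumes the receive completion tells the requester whether a Remote
   Invalidation occurred and which R_key it covered. It counts the
   R_keys of one RPC that the requester must still invalidate locally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-RPC bookkeeping; not part of any real RDMA API. */
struct rpc_rkey {
    uint32_t rkey;   /* R_key registered for one chunk of this RPC */
    bool     valid;  /* true until the R_key has been invalidated */
};

/*
 * Called when the Receive for an RPC Reply completes. If the responder
 * used RDMA Send With Invalidate, "remote_inv" is true and "inv_rkey"
 * is the R_key reported as already invalidated.
 *
 * Marks that R_key as invalid and returns the number of R_keys that
 * still need a local invalidation Work Request.
 */
size_t rkeys_needing_local_inv(struct rpc_rkey *keys, size_t nkeys,
                               bool remote_inv, uint32_t inv_rkey)
{
    size_t remaining = 0;

    assert(keys != NULL || nkeys == 0);

    for (size_t i = 0; i < nkeys; i++) {
        /* The RNIC has already invalidated this one for us. */
        if (remote_inv && keys[i].valid && keys[i].rkey == inv_rkey)
            keys[i].valid = false;
        if (keys[i].valid)
            remaining++;
    }
    return remaining;
}
```

   With two registered R_keys and a reply that remotely invalidates one
   of them, the function returns one: a single local invalidation
   remains instead of two. Without Remote Invalidation it returns two.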

   There are some important caveats which might contraindicate the use
   of Remote Invalidation:

   o  Remote Invalidation is not supported by all RNICs.

   o  Not all RPC-over-RDMA requester implementations can recognize when
      Remote Invalidation has occurred.

   o  Not all RPC-over-RDMA responder implementations can generate RDMA
      Send With Invalidate Work Requests.

   o  An RPC-over-RDMA requester that supports Remote Invalidation may
      choose to use R_keys that must not be invalidated remotely.

   o  On one connection, RPC-over-RDMA requesters can mix R_keys that
      may be invalidated remotely with some that must not.

   o  RPC-over-RDMA requesters often register more than one R_key per
      RPC. In one RPC, they can mix R_keys that may be invalidated
      remotely with some that must not.

   Thus a responder must not employ Remote Invalidation unless it is
   aware of support for it in its own RDMA stack and on the requester.
   Moreover, without altering the XDR structure of RPC-over-RDMA
   version 1 messages, it is not possible to support Remote Invalidation
   with requesters that mix R_keys that may be invalidated remotely with
   R_keys that must not.

   However, it is possible to provide a simple signaling mechanism for a
   requester to indicate it can deal with Remote Invalidation of any
   R_key it presents to a responder.

4.  Private Data Message Format

   With an InfiniBand lower layer, for example, RDMA connection setup
   uses the InfiniBand Connection Manager to establish a Reliable
   Connection [IBARCH]. When an RPC-over-RDMA version 1 transport
   connection is established, the client (which actively establishes
   connections) and the server (which passively accepts connections) MAY
   populate the CM Private Data field exchanged as part of CM connection
   establishment.

   The transport properties exchanged via this mechanism are fixed for
   the life of the connection. Each new connection presents an
   opportunity for a fresh exchange.

   For RPC-over-RDMA version 1, the CM Private Data field is formatted
   as described in this section. RPC clients and servers use the same
   format. If the capacity of the Private Data field is too small to
   contain this message format, or the underlying RDMA transport is not
   managed by a Connection Manager, CM Private Data cannot be used.

4.1.  Fixed Mandatory Fields

   The first 8 octets of the CM Private Data field are to be formatted
   as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Magic Number                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   Send Size   | Receive Size  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Magic Number
      This field contains a fixed 32-bit value that identifies the
      content of the Private Data field as an RPC-over-RDMA version 1 CM
      Private Data message. The value of this field MUST be 0xf6ab0e18,
      in big-endian order.

   Version
      This 8-bit field contains a message format version number. The
      value "1" in this field indicates that exactly eight octets are
      present, that they appear in the order described in this section,
      and that each has the meaning defined in this section.

   Flags
      This 8-bit field contains eight bit flags that indicate the
      support status of optional features, such as Remote Invalidation.
      The meaning of these flags is defined in Section 4.1.1.

   Send Size
      This 8-bit field contains an encoded value corresponding to the
      maximum number of bytes this peer will transmit in a single RDMA
      Send. The value is encoded as described in Section 4.1.2.

   Receive Size
      This 8-bit field contains an encoded value corresponding to the
      maximum number of bytes this peer can receive with a single RDMA
      Receive. The value is encoded as described in Section 4.1.2.

   The requester MUST use the smaller of its own send size and the
   responder's reported receive size as the requester-to-responder
   inline threshold. The responder MUST use the smaller of its own send
   size and the requester's reported receive size as the responder-to-
   requester inline threshold.

4.1.1.  Feature Support Flags

   The bits in the Flags field are labeled from bit 8 to bit 15, as
   shown in the diagram above. When the Version field contains the
   value "1", the bits in the Flags field have the following meaning:

   Bit 15
      When a requester sets this flag, it sends only R_keys that can
      tolerate Remote Invalidation. When a responder sets this flag, it
      can generate RDMA Send With Invalidate Work Requests. When both
      peers on a connection set this flag, the responder MAY use RDMA
      Send With Invalidate when transmitting RPC Replies. When either
      peer on a connection clears this flag, the responder MUST use RDMA
      Send when transmitting RPC Replies.

   Bits 14 - 8
      These bits are reserved and must be zero.

4.1.2.  Encoding the Inline Threshold Value

   Inline threshold sizes from 1KB to 256KB can be represented in the
   Send Size and Receive Size fields. A sender computes the encoded
   value by dividing the actual value by 1024 and subtracting one from
   the result. A receiver decodes this value by adding one to the
   encoded value and multiplying the result by 1024.

4.2.  Extending the Message Format

   The Private Data format described above can be extended by adding
   optional fields that follow the first eight octets, or by making use
   of one of the reserved bits in the Flags field. To introduce such
   changes while preserving interoperability, a new Version number is
   to be allocated, and new fields and bit flags are to be defined. A
   description of how receivers should behave if they do not recognize
   the new format is to be provided as well. Such situations may be
   addressed by specifying the new format in a document updating this
   one.

5.  Interoperability Considerations

   This extension is intended to interoperate with RPC-over-RDMA version
   1 implementations that do not support the exchange of CM Private
   Data. When a peer does not receive a CM Private Data message which
   conforms to Section 4, it MUST act as if the remote peer supports
   only the default RPC-over-RDMA version 1 settings as defined in
   [RFC8166]. In other words, the peer is to behave as if a Private
   Data message was received in which bit 15 of the Flags field is zero
   and both Size fields contain the value zero.

6.  IANA Considerations

   This document does not require actions by IANA.

7.  Security Considerations

   RDMA-CM Private Data typically traverses the link layer in the clear.
   A man-in-the-middle attack could alter the settings exchanged at
   connect time such that one or both peers might perform operations
   that result in premature termination of the connection.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

Acknowledgments

   Thanks to Christoph Hellwig and Devesh Sharma for suggesting this
   approach. The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

   Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
   Working Group Chair Spencer Shepler, and NFSV4 Working Group
   Secretary Thomas Haynes for their support.

Author's Address

   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   United States of America

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com