Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Standards Track                          March 28, 2017
Expires: September 29, 2017

 RDMA Connection Manager Private Messages For RPC-Over-RDMA Version One
                 draft-cel-nfsv4-rpcrdma-cm-pvt-msg-01

Abstract

   This document specifies the format of RDMA-CM Private Data exchanged
   between RPC-over-RDMA Version One peers as a transport connection is
   established.
   Such messages indicate peer support for Remote Invalidation and
   larger-than-default inline thresholds.  The message format is
   extensible.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 29, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Advertised Transport Capabilities
     2.1.  Inline Threshold Size
     2.2.  Remote Invalidation
   3.  Private Data Message Format
     3.1.  Fixed Mandatory Fields
     3.2.  Extending The Message Format
   4.  Interoperability Considerations
   5.  IANA Considerations
   6.  Security Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Acknowledgments
   Author's Address

1.  Introduction

   The RPC-over-RDMA Version One transport protocol enables the use of
   RDMA data transfer for upper layer protocols based on RPC
   [I-D.ietf-nfsv4-rfc5666bis].  The terms "Remote Direct Memory Access"
   (RDMA) and "Direct Data Placement" (DDP) are introduced in [RFC5040].

   The two most immediate shortcomings of RPC-over-RDMA Version One are:

   o  Setting up an RDMA data transfer (via RDMA Read or Write) can be
      costly.  The small maximum size of messages transmitted using RDMA
      Send forces the use of RDMA Read or Write operations even for
      relatively small messages and data payloads.

   o  Unlike most other contemporary RDMA-enabled storage protocols,
      there is no facility in RPC-over-RDMA Version One that enables the
      use of Remote Invalidation [RFC5042].

   The original specification of RPC-over-RDMA Version One provided an
   out-of-band protocol for passing inline threshold settings between
   connected peers [RFC5666].  However, [I-D.ietf-nfsv4-rfc5666bis]
   eliminated support for this protocol, making it unavailable for this
   purpose.

   RPC-over-RDMA Version One has no means of extending its XDR
   definition such that interoperability with existing implementations
   is preserved.  As a result, an out-of-band mechanism is needed to
   help relieve these limitations for existing RPC-over-RDMA Version One
   implementations.

   This document specifies a simple, non-XDR-based message format
   designed to pass between RPC-over-RDMA Version One peers as each RDMA
   transport connection is first established.  The purpose of this
   message format is two-fold:

   o  To provide immediate relief from certain performance constraints
      inherent in RPC-over-RDMA Version One

   o  To enable experimentation with parameters of the base RDMA
      transport over which RPC-over-RDMA runs

2.  Advertised Transport Capabilities

2.1.  Inline Threshold Size

   Section 3.3.2 of [I-D.ietf-nfsv4-rfc5666bis] defines the term "inline
   threshold."  Each transport connection has a pair of inline
   thresholds, one for each direction of message flow, which limit the
   size of RPC-over-RDMA messages conveyed using RDMA Send and Receive.

   If an incoming message exceeds the size of a receiver's inline
   threshold, the receive operation fails and the connection is
   typically terminated.  To convey a message larger than a receiver's
   inline threshold, an NFS client uses explicit RDMA operations, which
   are more expensive to use than RDMA Send.

   The default value of inline thresholds for RPC-over-RDMA Version One
   connections is 1024 bytes in both directions (see Section 3.3.3 of
   [I-D.ietf-nfsv4-rfc5666bis]).  This value is adequate for nearly all
   NFS Version 3 procedures.

   NFS Version 4 COMPOUNDs are larger on average than NFSv3 procedures,
   forcing clients to use explicit RDMA operations for frequently-issued
   requests such as LOOKUP and GETATTR.
   The use of RPCSEC_GSS security also increases the average size of RPC
   messages, due to the larger size of credential material in RPC
   headers [RFC7861].

   If a sender and receiver can agree on larger inline thresholds, more
   RPC transactions avoid the cost of explicit RDMA operations.

2.2.  Remote Invalidation

   After an RDMA data transfer operation completes, an RDMA peer can use
   Remote Invalidation to request that the remote peer RNIC invalidate
   an STag associated with the data transfer [RFC5042].

   An RDMA consumer requests Remote Invalidation by posting an RDMA Send
   With Invalidate Work Request in place of an RDMA Send Work Request.
   The RDMA Send With Invalidate carries the R_key value of the STag to
   invalidate.  Invalidation of that R_key is performed and then
   reported as part of the completion of a waiting Receive Work Request.

   An RPC-over-RDMA responder might use Remote Invalidation when
   replying to an RPC request that provided Read or Write chunks.  The
   requester avoids an extra Work Request, context switch, and interrupt
   to invalidate one chunk as part of completing an RPC transaction.
   The upshot is faster completion of RPC transactions that involve RDMA
   data transfer.

   There are some important caveats that might contraindicate the use of
   Remote Invalidation:

   o  Remote Invalidation is not supported by all RNICs.

   o  Not all RPC-over-RDMA requester implementations can recognize when
      Remote Invalidation has occurred.

   o  Not all RPC-over-RDMA responder implementations can generate RDMA
      Send With Invalidate Work Requests.

   o  An RPC-over-RDMA requester that supports Remote Invalidation may
      choose to use R_keys that must not be invalidated remotely.

   o  On one connection, RPC-over-RDMA requesters can mix R_keys that
      may be invalidated remotely with some that must not.

   o  RPC-over-RDMA requesters often register more than one R_key per
      RPC.
      In one RPC, they can mix R_keys that may be invalidated remotely
      with some that must not.

   Thus a responder must not employ Remote Invalidation unless it is
   aware of support for it both in its own RDMA stack and on the
   requester.  Moreover, without altering the XDR structure of
   RPC-over-RDMA Version One messages, it is not possible to support
   Remote Invalidation with requesters that mix R_keys that may be
   invalidated remotely with R_keys that must not.

   However, it is possible to provide a simple signaling mechanism for a
   requester to indicate it can deal with Remote Invalidation of any
   R_key it presents to a responder.

3.  Private Data Message Format

   With an InfiniBand lower layer, for example, RDMA connection setup
   uses the InfiniBand Connection Manager to establish a Reliable
   Connection [IBTA-IB].  When an RPC-over-RDMA Version One transport
   connection is established, the client (which actively establishes
   connections) and the server (which passively accepts connections) MAY
   populate the CM Private Data field exchanged as part of CM connection
   establishment.

   For RPC-over-RDMA Version One, the CM Private Data field is formatted
   as described in this section.  RPC clients and servers use the same
   format.  If the capacity of the Private Data field is too small to
   contain this message format, or the underlying RDMA transport is not
   managed by a Connection Manager, CM Private Data cannot be used.

3.1.  Fixed Mandatory Fields

   The first 8 octets of the CM Private Data field are to be formatted
   as follows:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                         Magic Number                          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |    Version    |     Flags     |   Send Size   | Receive Size  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Magic Number
      This field contains a fixed 32-bit value that identifies the
      content of the Private Data field as an RPC-over-RDMA Version One
      CM Private Data message.  The value of this field MUST be
      0xf6ab0e18, in big-endian order.

   Version
      This 8-bit field contains a message format version number.  The
      value "1" in this field indicates that exactly eight octets are
      present, that they appear in the order described in this section,
      and that each has the meaning defined in this section.

   Flags
      This 8-bit field contains eight bit flags that indicate the
      support status of optional features, such as Remote Invalidation.
      The meaning of these flags is defined in Section 3.1.1.

   Send Size
      This 8-bit field contains an encoded value corresponding to the
      largest RPC-over-RDMA message this peer can transmit using RDMA
      Send.  The value is encoded as described in Section 3.1.2.

   Receive Size
      This 8-bit field contains an encoded value corresponding to the
      largest RPC-over-RDMA message this peer can receive via posted
      receive buffers.  The value is encoded as described in
      Section 3.1.2.

   The requester MUST use the smaller of its maximum send size and the
   responder's maximum receive size as the requester-to-responder inline
   threshold.  The responder MUST use the smaller of its maximum send
   size and the requester's maximum receive size as the responder-to-
   requester inline threshold.

3.1.1.  Feature Support Flags

   The bits in the Flags field are labeled from bit 8 to bit 15, as
   shown in the diagram above.  When the Version field contains the
   value "1", the bits in the Flags field have the following meaning:

   Bit 15
      When a requester sets this flag, it sends only R_keys that can
      tolerate Remote Invalidation.  When a responder sets this flag, it
      can generate RDMA Send With Invalidate Work Requests.  When both
      peers on a connection set this flag, the responder MAY use RDMA
      Send With Invalidate when transmitting RPC Replies.  When either
      peer on a connection clears this flag, the responder MUST use RDMA
      Send when transmitting RPC Replies.

   Bits 14 - 8
      These bits are reserved and must be zero.

3.1.2.  Encoding the Inline Threshold Value

   Inline threshold sizes from 1KB to 256KB can be represented in the
   Send Size and Receive Size fields.  A sender computes the encoded
   value by dividing the actual value by 1024 and subtracting one from
   the result.  A receiver decodes this value by adding one to it and
   multiplying the result by 1024.  For example, an inline threshold of
   4096 bytes is encoded as the value 3.

3.2.  Extending The Message Format

   The Private Data format described above can be extended by adding
   optional fields that follow the first eight octets, or by making use
   of one of the reserved bits in the Flags field.  To introduce such
   changes while preserving interoperability, a new Version number is to
   be allocated, and new fields and bit flags are to be defined.  A
   description of how receivers should behave if they do not recognize
   the new format is to be provided as well.  Such situations may be
   addressed by specifying the new format in a document updating this
   one.

4.  Interoperability Considerations

   This extension is intended to interoperate with RPC-over-RDMA Version
   One implementations that do not support the exchange of CM Private
   Data.
   When a peer does not receive a CM Private Data message that conforms
   to Section 3, it MUST act as if the remote peer supports only the
   default RPC-over-RDMA Version One settings as defined in
   [I-D.ietf-nfsv4-rfc5666bis].  In other words, the peer is to behave
   as if a Private Data message was received in which bit 15 of the
   Flags field is zero and both Size fields contain the value zero.

5.  IANA Considerations

   This document does not require actions by IANA.

6.  Security Considerations

   RDMA-CM Private Data typically traverses the link layer in the clear.
   A man-in-the-middle attack could alter the settings exchanged at
   connect time such that one or both peers might perform operations
   that result in premature termination of the connection.

7.  References

7.1.  Normative References

   [I-D.ietf-nfsv4-rfc5666bis]
              Lever, C., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call, Version
              One", draft-ietf-nfsv4-rfc5666bis-11 (work in progress),
              March 2017.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

7.2.  Informative References

   [IBTA-IB]  InfiniBand Trade Association, "InfiniBand(TM) Architecture
              Specification Volume 1 Release 1.2", November 2007.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

Appendix A.  Acknowledgments

   Thanks to Christoph Hellwig and Devesh Sharma for suggesting this
   approach.  The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

   Special thanks go to Transport Area Director Spencer Dawkins, nfsv4
   Working Group Chair Spencer Shepler, and nfsv4 Working Group
   Secretary Thomas Haynes for their support.

Author's Address

   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI  48104
   USA

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com
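
Appendix B.  Example Encoding (Non-Normative)

   The following is a minimal, non-normative sketch of how the eight
   octets described in Section 3.1 might be packed and parsed, and how
   the inline thresholds and the use of Remote Invalidation might then
   be chosen.  It is written in Python purely for illustration.  All
   function and constant names are hypothetical and do not come from any
   existing implementation.  The sketch assumes that "bit 15" in
   Section 3.1.1 is the least significant bit of the Flags octet (bits
   are numbered from the left in the diagram), that sizes which are not
   a multiple of 1024 are rounded down, and that reserved flag bits are
   ignored on receipt.  None of these choices is mandated by this
   document.

```python
import struct

# Values taken from Section 3.1 of this document.
CM_PVT_MAGIC = 0xF6AB0E18
CM_PVT_VERSION = 1

# Assumption: bit 15 of the second 32-bit word is the least significant
# bit of the 8-bit Flags field.
FLAG_REMOTE_INVAL = 0x01

# Magic (32 bits), then Version, Flags, Send Size, Receive Size (8 bits
# each), all in network (big-endian) byte order: 8 octets in total.
_HEADER = struct.Struct("!IBBBB")

# Defaults used when no conforming message is received (Section 4).
DEFAULT_SETTINGS = (1024, 1024, False)


def encode_size(nbytes):
    """Encode a size as described in Section 3.1.2: nbytes / 1024 - 1."""
    units = nbytes // 1024  # assumption: round down to a whole KB
    if not 1 <= units <= 256:
        raise ValueError("inline threshold must be between 1KB and 256KB")
    return units - 1


def decode_size(encoded):
    """Inverse of encode_size: (encoded + 1) * 1024."""
    return (encoded + 1) * 1024


def pack_private_data(send_size, recv_size, remote_inval):
    """Build the 8-octet CM Private Data message."""
    flags = FLAG_REMOTE_INVAL if remote_inval else 0
    return _HEADER.pack(CM_PVT_MAGIC, CM_PVT_VERSION, flags,
                        encode_size(send_size), encode_size(recv_size))


def unpack_private_data(data):
    """Return (send_size, recv_size, remote_inval) for the remote peer."""
    if data is None or len(data) < _HEADER.size:
        return DEFAULT_SETTINGS
    magic, version, flags, send, recv = _HEADER.unpack_from(data)
    if magic != CM_PVT_MAGIC or version != CM_PVT_VERSION:
        return DEFAULT_SETTINGS
    # Reserved flag bits are ignored here (an assumption of this sketch).
    return (decode_size(send), decode_size(recv),
            bool(flags & FLAG_REMOTE_INVAL))


def negotiate(requester, responder):
    """Apply the rules in Sections 3.1 and 3.1.1.

    Each argument is a (send_size, recv_size, remote_inval) tuple.
    """
    req_send, req_recv, req_inval = requester
    rsp_send, rsp_recv, rsp_inval = responder
    return {
        "requester_to_responder": min(req_send, rsp_recv),
        "responder_to_requester": min(rsp_send, req_recv),
        "remote_invalidation": req_inval and rsp_inval,
    }
```

   For example, pack_private_data(4096, 8192, True) yields the octets
   f6 ab 0e 18 01 01 03 07.  A peer that receives no Private Data at all
   is treated as advertising 1024-byte thresholds in both directions
   with Remote Invalidation disabled.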