Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Obsoletes: 5667 (if approved)                              June 13, 2016
Intended status: Standards Track
Expires: December 15, 2016

             Network File System (NFS) Direct Data Placement
                     draft-ietf-nfsv4-rfc5667bis-00

Abstract

   This document defines the bindings of the various Network File System
   (NFS) versions to the Remote Direct Memory Access (RDMA) operations
   supported by the RPC-over-RDMA transport protocol. It describes the
   use of direct data placement by means of server-initiated RDMA
   operations into client-supplied buffers for implementations of NFS
   versions 2, 3, 4, and 4.1 over such an RDMA transport. This document
   obsoletes RFC 5667.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 15, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Planned Changes To This Document
   2.  Transfers from NFS Client to NFS Server
   3.  Transfers from NFS Server to NFS Client
   4.  NFS Versions 2 and 3 Mapping
   5.  NFS Version 4 Mapping
     5.1.  NFS Version 4 Callbacks
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgments
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Author's Address

1.  Introduction

   The Remote Direct Memory Access (RDMA) Transport for Remote Procedure
   Call (RPC) [I-D.ietf-nfsv4-rfc5666bis] allows an RPC client
   application to post buffers in a Chunk list for specific arguments
   and results from an RPC call. The RDMA transport header conveys this
   list of client buffer addresses to the server where the application
   can associate them with client data and use RDMA operations to
   transfer the results directly to and from the posted buffers on the
   client. The client and server must agree on a consistent mapping of
   posted buffers to RPC.
   This document details the mapping for each
   version of the NFS protocol [RFC1094] [RFC1813] [RFC7530] [RFC5661].

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2.  Planned Changes To This Document

   The following changes will be made, relative to [RFC5667]:

   o  References to [RFC5666] will be replaced with references to
      [I-D.ietf-nfsv4-rfc5666bis]. Corrections and updates relative to
      new language in [I-D.ietf-nfsv4-rfc5666bis] will be introduced.

   o  References to obsolete RFCs will be replaced.

   o  The reference to a nonexistent NFSv4 SYMLINK operation will be
      replaced with NFSv4 CREATE(NF4LNK).

   o  The discussion of 12KB and 36KB inline thresholds will be removed.

   o  The discussion of NFSv4 COMPOUND handling will be completed.

   o  An explicit discussion of NFSv4.0 and NFSv4.1 backchannel
      operation will be introduced.

   o  An IANA Considerations section is required by IDNITS.

   o  Code excerpts will be modernized.

   Other minor changes and editorial corrections may also be made.

2.  Transfers from NFS Client to NFS Server

   The RDMA Read list, in the RDMA transport header, allows an RPC
   client to marshal RPC call data selectively. Large chunks of data,
   such as the file data of an NFS WRITE request, MAY be referenced by
   an RDMA Read list and be moved efficiently and directly placed by an
   RDMA Read operation initiated by the server.

   The process of identifying these chunks for the RDMA Read list can be
   implemented entirely within the RPC layer. It is transparent to the
   upper-level protocol, such as NFS.
   For instance, the file data
   portion of an NFS WRITE request can be selected as an RDMA "chunk"
   within the eXternal Data Representation (XDR) marshaling code of RPC
   based on a size criterion, independently of the NFS protocol layer.
   The XDR unmarshaling on the receiving system can identify the
   correspondence between Read chunks and protocol elements via the XDR
   position value encoded in the Read chunk entry.

   RPC RDMA Read chunks are employed by this NFS mapping to convey
   specific NFS data to the server in a manner that allows it to be
   directly placed. The following sections describe this mapping for
   versions of the NFS protocol.

3.  Transfers from NFS Server to NFS Client

   The RDMA Write list, in the RDMA transport header, allows the client
   to post one or more buffers into which the server will RDMA Write
   designated result chunks directly. If the client sends a null Write
   list, then results from the RPC call will be returned either as an
   inline reply, as chunks in an RDMA Read list of server-posted
   buffers, or in a client-posted reply buffer.

   Each posted buffer in a Write list is represented as an array of
   memory segments. This allows the client some flexibility in
   submitting discontiguous memory segments into which the server will
   scatter the result. Each segment is described by a triplet
   consisting of the segment handle or steering tag (STag), segment
   length, and memory address or offset.

      struct xdr_rdma_segment {
         uint32 handle;    /* Registered memory handle */
         uint32 length;    /* Length of the chunk in bytes */
         uint64 offset;    /* Chunk virtual address or offset */
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list  *next;
      };

   The sum of the segment lengths yields the total size of the buffer,
   which MUST be large enough to accept the result. If the buffer is
   too small, the server MUST return an XDR encode error. The server
   MUST return the result data for a posted buffer by progressively
   filling its segments, perhaps leaving some trailing segments unfilled
   or partially full if the size of the result is less than the total
   size of the buffer segments.

   The server returns the RDMA Write list to the client with the segment
   length fields overwritten to indicate the amount of data RDMA written
   to each segment. Results returned by direct placement MUST NOT be
   returned by other methods, e.g., by Read chunk list or inline. If no
   result data at all is returned for the element, the server places no
   data in the buffer(s), but does return zeros in the segment length
   fields corresponding to the result.

   The RDMA Write list allows the client to provide multiple result
   buffers -- each buffer maps to a specific result in the reply. The
   NFS client and server implementations agree by specifying the mapping
   of results to buffers for each RPC procedure. The following sections
   describe this mapping for versions of the NFS protocol.

   Through the use of RDMA Write lists in NFS requests, it is not
   necessary to employ the RDMA Read lists in the NFS replies, as
   described in the RPC-over-RDMA protocol. This enables more efficient
   operation, by avoiding the need for the server to expose buffers for
   RDMA, and also avoiding "RDMA_DONE" exchanges.
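
   The segment-filling rules given earlier in this section (fill
   segments in order, report the bytes written in each length field,
   return zeros for unused segments, and fail when the buffer is too
   small) can be illustrated with a short, non-normative Python sketch.
   The names Segment, BufferTooSmall, and fill_write_chunk are
   hypothetical and introduced here only for illustration; they are not
   part of this specification or of any RPC-over-RDMA implementation.
   The sketch computes only the returned segment lengths and performs
   no actual RDMA Write.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical in-memory form of one Write chunk segment triplet."""
    handle: int   # registered memory handle (STag)
    length: int   # bytes available on input, bytes written on output
    offset: int   # virtual address or offset


class BufferTooSmall(Exception):
    """Illustrative stand-in for the XDR encode error the server returns."""


def fill_write_chunk(segments: List[Segment], result_len: int) -> List[Segment]:
    """Return the segments as echoed in the reply for a result of result_len bytes."""
    if sum(seg.length for seg in segments) < result_len:
        raise BufferTooSmall("posted buffer cannot hold the result")
    remaining = result_len
    returned = []
    for seg in segments:
        # Fill segments progressively, in the order the client posted them.
        written = min(seg.length, remaining)
        # A real server would RDMA Write `written` bytes to (handle, offset) here.
        returned.append(Segment(seg.handle, written, seg.offset))
        remaining -= written
    return returned
```

   For example, under these assumptions, three 4096-byte segments and a
   5000-byte result yield returned lengths of 4096, 904, and 0.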
   Clients MAY
   additionally employ RDMA Reply chunks to receive entire messages, as
   described in [I-D.ietf-nfsv4-rfc5666bis].

4.  NFS Versions 2 and 3 Mapping

   A single RDMA Write list entry MAY be posted by the client to receive
   either the opaque file data from a READ request or the pathname from
   a READLINK request. The server MUST ignore a Write list for any
   other NFS procedure, as well as any Write list entries beyond the
   first in the list.

   Similarly, a single RDMA Read list entry MAY be posted by the client
   to supply the opaque file data for a WRITE request or the pathname
   for a SYMLINK request. The server MUST ignore any Read list for
   other NFS procedures, as well as additional Read list entries beyond
   the first in the list.

   Because there are no NFS version 2 or 3 requests that transfer bulk
   data in both directions, it is not necessary to post requests
   containing both Write and Read lists. Any unneeded Read or Write
   lists are ignored by the server.

   In the case where the outgoing request or expected incoming reply is
   larger than the maximum size supported on the connection, it is
   possible for the RPC layer to post the entire message or result in a
   special "RDMA_NOMSG" message type that is transferred entirely by
   RDMA. This is implemented in RPC, below NFS, and therefore has no
   effect on the message contents.

   Non-RDMA (inline) WRITE transfers MAY employ the "RDMA_MSGP" padding
   method described in the RPC-over-RDMA protocol, if the appropriate
   value for the server is known to the client. Padding allows the
   opaque file data to arrive at the server in an aligned fashion, which
   may improve server performance.

   The NFS version 2 and 3 protocols are frequently limited in practice
   to requests containing less than or equal to 8 kilobytes and 32
   kilobytes of data, respectively.
   In these cases, it is often
   practical to support basic operation without employing a
   configuration exchange as discussed in [I-D.ietf-nfsv4-rfc5666bis].
   The server MUST post buffers large enough to receive the largest
   possible incoming message (approximately 12 KB for NFS version 2, or
   36 KB for NFS version 3, would be more than sufficient), and the
   client can post buffers large enough to receive replies based on the
   "rsize" it is using with the server, plus a fixed overhead for the
   RPC and NFS headers. Because the server MUST NOT return data in
   excess of this size, the client can be assured of the adequacy of its
   posted buffer sizes.

   Flow control is handled dynamically by the RPC RDMA protocol, and
   write padding is OPTIONAL and therefore MAY remain unused.

   Alternatively, if the server is administratively configured to values
   appropriate for all its clients, the same assurance of
   interoperability within the domain can be made.

   The use of a configuration protocol with NFS versions 2 and 3 is
   therefore OPTIONAL. Employing a configuration exchange may allow
   some advantage to server resource management through accurately
   sizing buffers, enabling the server to know exactly how many RDMA
   Reads may be in progress at once on the client connection, and
   enabling client write padding, which may be desirable for certain
   servers when RDMA Read is impractical.

5.  NFS Version 4 Mapping

   This specification applies to the first minor version of NFS version
   4 (NFSv4.0) and any subsequent minor versions that do not override
   this mapping.

   The Write list MUST be considered only for the COMPOUND procedure.
   This procedure returns results from a sequence of operations. Only
   the opaque file data from an NFS READ operation and the pathname from
   a READLINK operation MUST utilize entries from the Write list.

   If there is no Write list, i.e., the list is null, then any READ or
   READLINK operations in the COMPOUND MUST return their data inline.
   The NFSv4.0 client MUST ensure in this case that any result of its
   READ and READLINK requests will fit within its receive buffers, in
   order to avoid a resulting RDMA transport error upon transfer. The
   server is not required to detect this.

   The first entry in the Write list MUST be used by the first READ or
   READLINK in the COMPOUND request. The next Write list entry is used
   by the next READ or READLINK, and so on. If there are more READ or
   READLINK operations than Write list entries, then any remaining
   operations MUST return their results inline.

   If a Write list entry is presented, then the corresponding READ or
   READLINK MUST return its data via an RDMA Write to the buffer
   indicated by the Write list entry. If the Write list entry has zero
   RDMA segments, or if the total size of the segments is zero, then the
   corresponding READ or READLINK operation MUST return its result
   inline.

   The following example shows an RDMA Write list with three posted
   buffers A, B, and C. The designated operations in the compound
   request, READ and READLINK, consume the posted buffers by writing
   their results back to each buffer.

      RDMA Write list:

         A --> B --> C

      Compound request:

         PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                       |                   |                   |
                       v                   v                   v
                       A                   B                   C

   If the client does not want to have the READLINK result returned
   directly, then it provides a zero-length array of segment triplets
   for buffer B or sets the values in the segment triplet for buffer B
   to zeros so that the READLINK result MUST be returned inline.

   The situation is similar for RDMA Read lists sent by the client and
   applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
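
   The Write list matching rules described in this section for READ and
   READLINK can be sketched in Python as follows. This is a
   non-normative illustration: match_write_list, WriteChunk, and
   DDP_ELIGIBLE are hypothetical names, operations are represented as
   plain strings, and each Write list entry is reduced to a list of
   segment lengths, since handles and offsets do not affect matching.

```python
from typing import List, Optional, Sequence, Tuple

# Simplification: one Write list entry is just its list of segment lengths.
WriteChunk = Sequence[int]

# Only these operations may consume Write list entries.
DDP_ELIGIBLE = {"READ", "READLINK"}


def match_write_list(
    ops: Sequence[str], write_list: Sequence[WriteChunk]
) -> List[Tuple[str, Optional[int]]]:
    """Pair each COMPOUND operation with a Write list index, or None for inline."""
    next_entry = 0
    plan: List[Tuple[str, Optional[int]]] = []
    for op in ops:
        if op not in DDP_ELIGIBLE:
            plan.append((op, None))
            continue
        if next_entry >= len(write_list):
            # More READ/READLINK operations than entries: return inline.
            plan.append((op, None))
            continue
        chunk = write_list[next_entry]
        # An entry with no segments or zero total size is still matched to
        # this operation, but it forces the result to be returned inline.
        plan.append((op, next_entry if sum(chunk) > 0 else None))
        next_entry += 1
    return plan
```

   Applied to the compound request in the example, with an empty entry
   posted for buffer B, the first READ is matched to entry 0, the
   READLINK result is returned inline, and the final READ is matched to
   entry 2.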
   Additionally, inline segments too large to fit in posted buffers MAY
   be transferred in special "RDMA_NOMSG" messages.

   Non-RDMA (inline) WRITE transfers MAY employ the "RDMA_MSGP" padding
   method described in the RPC-over-RDMA protocol, if the appropriate
   value for the server is known to the client. Padding allows the
   opaque file data to arrive at the server in an aligned fashion, which
   may improve server performance. In order to ensure accurate
   alignment for all data, it is likely that the client will restrict
   its use of OPTIONAL padding to COMPOUND requests containing only a
   single WRITE operation.

   Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
   COMPOUND is not bounded, even when RDMA chunks are in use. While it
   might appear that a configuration protocol exchange (such as the one
   described in [I-D.ietf-nfsv4-rfc5666bis]) would help, in fact the
   layering issues involved in building COMPOUNDs by NFS make such a
   mechanism unworkable.

   However, typical NFS version 4 clients rarely issue such problematic
   requests. In practice, they behave in much more predictable ways; in
   fact, most still support the traditional rsize/wsize mount
   parameters. Therefore, most NFS version 4 clients function over
   RPC-over-RDMA in the same way as NFS versions 2 and 3, operationally.

   There are, however, advantages to allowing both client and server to
   operate with prearranged size constraints, for example, use of the
   sizes to better manage the server's response cache. An extension to
   NFS version 4 supporting a more comprehensive exchange of upper-layer
   parameters is part of [RFC5661].

5.1.  NFS Version 4 Callbacks

   The NFS version 4 protocols support server-initiated callbacks to
   selected clients, in order to notify them of events such as recalled
   delegations.
   These callbacks present no particular difficulty when
   framed over RPC-over-RDMA, since such callbacks do not carry
   bulk data such as NFS READ or NFS WRITE. They MAY be transmitted
   inline via RDMA_MSG, or if the callback message or its reply
   overflows the negotiated buffer sizes for a callback connection, they
   MAY be transferred via the RDMA_NOMSG method as described above for
   other exchanges.

   One special case is noteworthy: in NFS version 4.1, the callback
   channel is optionally negotiated to be on the same connection as one
   used for client requests. In this case, and because the transaction
   ID (XID) is present in the RPC-over-RDMA header, the client MUST
   ascertain whether the message is in fact an RPC REPLY, and therefore
   a reply to a prior request and carrying its XID, before processing it
   as such. By the same token, the server MUST ascertain whether an
   incoming message on such a callback-eligible connection is an RPC
   CALL, before optionally processing the XID.

   In the callback case, the XID present in the RPC-over-RDMA header
   will potentially have any value, which may (or may not) collide with
   an XID used by the client for a previous or future request. The
   client and server MUST inspect the RPC component of the message to
   determine its potential disposition as either an RPC CALL or RPC
   REPLY, prior to processing this XID, and MUST NOT reject or accept it
   without also determining the proper context.

6.  IANA Considerations

   NFS use of direct data placement introduces a need for an additional
   NFS port number assignment for networks that share traditional UDP
   and TCP port spaces with RDMA services. The iWARP [RFC5041]
   [RFC5040] protocol is such an example (InfiniBand is not).

   NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
   listen for clients on UDP and TCP port 2049, and additionally, they
   register these with the portmapper and/or rpcbind [RFC1833] service.
   However, [RFC7530] requires NFS servers for version 4 to listen on
   TCP port 2049, and they are not required to register.

   An NFS version 2 or version 3 server supporting RPC-over-RDMA on such
   a network and registering itself with the RPC portmapper MAY choose
   an arbitrary port, or MAY use the alternative well-known port number
   for its RPC-over-RDMA service. The chosen port MAY be registered
   with the RPC portmapper under the netid assigned by the requirement
   in [I-D.ietf-nfsv4-rfc5666bis].

   An NFS version 4 server supporting RPC-over-RDMA on such a network
   MUST use the alternative well-known port number for its RPC-over-RDMA
   service. Clients SHOULD connect to this well-known port without
   consulting the RPC portmapper (as for NFSv4/TCP).

   The port number assigned to an NFS service over an RPC-over-RDMA
   transport is available from the IANA port registry [RFC3232].

7.  Security Considerations

   The RDMA transport for RPC [I-D.ietf-nfsv4-rfc5666bis] supports all
   RPC [RFC5531] security models, including RPCSEC_GSS [RFC2203]
   security and link-level security. The choice of RDMA Read and RDMA
   Write to convey RPC arguments and results, respectively, does not
   affect this, since it only changes the method of data transfer.
   Specifically, the requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure
   that this choice does not introduce new vulnerabilities.

   Because this document defines only the binding of the NFS protocols
   atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security
   considerations are therefore to be described at that layer.

8.  Acknowledgments

   The author gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey on the original NFS Direct Data Placement specification
   [RFC5667]. The author also wishes to thank Bill Baker and Greg
   Marsden for their support of this work.

9.  References

9.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, DOI 10.17487/RFC2203,
              September 1997.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

9.2.  Informative References

   [I-D.ietf-nfsv4-rfc5666bis]
              Lever, C., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call, Version
              One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress),
              May 2016.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094,
              March 1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3232]  Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced
              by an On-line Database", RFC 3232, DOI 10.17487/RFC3232,
              January 2002.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040,
              October 2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC5667]  Talpey, T. and B. Callaghan, "Network File System (NFS)
              Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
              January 2010.

Author's Address

   Charles Lever (editor)
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   USA

   Phone: +1 734 274 2396
   Email: chuck.lever@oracle.com