idnits 2.17.1

draft-ietf-nfsv4-nfsdirect-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748
     on line 488.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 498.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 505.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 511.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment. If not, you may need to add the pre-RFC5378
     disclaimer. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 22, 2008) is 5907 days in the past. Is this
     intentional?
  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Downref: Normative reference to an Informational RFC: RFC 1094

  ** Obsolete normative reference: RFC 1831 (Obsoleted by RFC 5531)

  ** Downref: Normative reference to an Informational RFC: RFC 1813

  ** Obsolete normative reference: RFC 3530 (Obsoleted by RFC 7530)

     Summary: 5 errors (**), 0 flaws (~~), 1 warning (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    NFSv4 Working Group                                           Tom Talpey
3    Internet-Draft                                                    NetApp
4    Intended status: Standards Track                         Brent Callaghan
5    Expires: August 23, 2008                                           Apple
6                                                           February 22, 2008

8                          NFS Direct Data Placement
9                        draft-ietf-nfsv4-nfsdirect-07

11   Status of this Memo

13      By submitting this Internet-Draft, each author represents that any
14      applicable patent or other IPR claims of which he or she is aware
15      have been or will be disclosed, and any of which he or she becomes
16      aware will be disclosed, in accordance with Section 6 of BCP 79.

18      Internet-Drafts are working documents of the Internet Engineering
19      Task Force (IETF), its areas, and its working groups. Note that
20      other groups may also distribute working documents as Internet-
21      Drafts.

23      Internet-Drafts are draft documents valid for a maximum of six
24      months and may be updated, replaced, or obsoleted by other
25      documents at any time. It is inappropriate to use Internet-Drafts
26      as reference material or to cite them other than as "work in
27      progress."

29      The list of current Internet-Drafts can be accessed at
30      http://www.ietf.org/ietf/1id-abstracts.txt

32      The list of Internet-Draft Shadow Directories can be accessed at
33      http://www.ietf.org/shadow.html.

35   Abstract

37      This draft defines the bindings of the various Network File System
38      (NFS) versions to the Remote Direct Memory Access (RDMA) operations
39      supported by the RPC/RDMA transport protocol. It describes the use
40      of direct data placement by means of server-initiated RDMA operations
41      into client-supplied buffers for implementations of NFS versions 2,
42      3, 4 and 4.1 over such an RDMA transport.

44   Table of Contents

46      1.  Introduction
47      2.  Transfers from NFS Client to NFS Server
48      3.  Transfers from NFS Server to NFS Client
49      4.  NFS Versions 2 and 3 Mapping
50      5.  NFS Version 4 Mapping
51        5.1.  NFS Version 4 Callbacks
52      6.  Security Considerations
53      7.  IANA Considerations
54      8.  Acknowledgements
55      9.  Normative References
56      10. Informative References
57      11. Authors' Addresses
58      12. Intellectual Property and Copyright Statements
59      Acknowledgment

61   Requirements Language

63      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
64      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
65      document are to be interpreted as described in [RFC2119].

67   1.  Introduction

69      The Remote Direct Memory Access (RDMA) Transport for Remote Procedure
70      Calls (RPC) [RPCRDMA] allows an RPC client application to post
71      buffers in a Chunk list for specific arguments and results from an
72      RPC call. The RDMA transport header conveys this list of client
73      buffer addresses to the server where the application can associate
74      them with client data and use RDMA operations to transfer the results
75      directly to and from the posted buffers on the client. The client
76      and server must agree on a consistent mapping of posted buffers to
77      RPC. This document details the mapping for each version of the NFS
78      protocol [RFC1094] [RFC1813] [RFC3530] [NFSv4.1].

80   2.  Transfers from NFS Client to NFS Server

82      The RDMA Read list, in the RDMA transport header, allows an RPC
83      client to marshal RPC call data selectively. Large chunks of data,
84      such as the file data of an NFS WRITE request, MAY be referenced by
85      an RDMA Read list and be moved efficiently and directly-placed by an
86      RDMA READ operation initiated by the server.

88      The process of identifying these chunks for the RDMA Read list can be
89      implemented entirely within the RPC layer. It is transparent to the
90      upper-level protocol, such as NFS. For instance, the file data
91      portion of an NFS WRITE request can be selected as an RDMA "chunk"
92      within the XDR marshaling code of RPC based on a size criterion,
93      independently of the NFS protocol layer. The XDR unmarshaling on the
94      receiving system can identify the correspondence between Read chunks
95      and protocol elements via the XDR position value encoded in the Read
96      chunk entry.

98      RPC RDMA Read chunks are employed by this NFS mapping to convey
99      specific NFS data to the server in a manner which may be directly
100     placed. The following sections describe this mapping for versions of
101     the NFS protocol.

103  3.  Transfers from NFS Server to NFS Client

105     The RDMA Write list, in the RDMA transport header, allows the client
106     to post one or more buffers into which the server will RDMA Write
107     designated result chunks directly. If the client sends a null write
108     list, then results from the RPC call will be returned as either an
109     inline reply, as chunks in an RDMA Read list of server-posted
110     buffers, or in a client-posted reply buffer.

112     Each posted buffer in a Write list is represented as an array of
113     memory segments. This allows the client some flexibility in
114     submitting discontiguous memory segments into which the server will
115     scatter the result. Each segment is described by a triplet
116     consisting of the segment handle or steering tag (STag), segment
117     length, and memory address or offset.

119        struct xdr_rdma_segment {
120           uint32 handle;   /* Registered memory handle */
121           uint32 length;   /* Length of the chunk in bytes */
122           uint64 offset;   /* Chunk virtual address or offset */
123        };

125        struct xdr_write_chunk {
126           struct xdr_rdma_segment target<>;
127        };

129        struct xdr_write_list {
130           struct xdr_write_chunk entry;
131           struct xdr_write_list  *next;
132        };

134     The sum of the segment lengths yields the total size of the buffer,
135     which MUST be large enough to accept the result. If the buffer is
136     too small, the server MUST return an XDR encode error. The server
137     MUST return the result data for a posted buffer by progressively
138     filling its segments, perhaps leaving some trailing segments unfilled
139     or partially full if the size of the result is less than the total
140     size of the buffer segments.

142     The server returns the RDMA Write list to the client with the segment
143     length fields overwritten to indicate the amount of data RDMA Written
144     to each segment. Results returned by direct placement MUST NOT be
145     returned by other methods, e.g., by read chunk list or inline. If no
146     result data at all is returned for the element, the server places no
147     data in the buffer(s), but does return zeroes in the segment length
148     fields corresponding to the result.

150     The RDMA Write list allows the client to provide multiple result
151     buffers - each buffer maps to a specific result in the reply. The NFS
152     client and server implementations agree by specifying the mapping of
153     results to buffers for each RPC procedure. The following sections
154     describe this mapping for versions of the NFS protocol.

156     Through the use of RDMA Write lists in NFS requests, it is not
157     necessary to employ the RDMA Read lists in the NFS replies, as
158     described in the RPC/RDMA protocol. This enables more efficient
159     operation, by avoiding the need for the server to expose buffers for
160     RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY
161     additionally employ RDMA Reply chunks to receive entire messages, as
162     described in [RPCRDMA].

164  4.  NFS Versions 2 and 3 Mapping

166     A single RDMA Write list entry MAY be posted by the client to receive
167     either the opaque file data from a READ request or the pathname from
168     a READLINK request. The server MUST ignore a Write list for any
169     other NFS procedure, as well as any Write list entries beyond the
170     first in the list.

172     Similarly, a single RDMA Read list entry MAY be posted by the client
173     to supply the opaque file data for a WRITE request or the pathname
174     for a SYMLINK request. The server MUST ignore any Read list for
175     other NFS procedures, as well as additional Read list entries beyond
176     the first in the list.

178     Because there are no NFS version 2 or 3 requests that transfer bulk
179     data in both directions, it is not necessary to post requests
180     containing both Write and Read lists. Any unneeded Read or Write
181     lists are ignored by the server.
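
The Write list rules for NFS versions 2 and 3, together with the segment fill rules of Section 3, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not part of the protocol: the names `Segment`, `WRITE_LIST_PROCS`, `select_write_chunk`, and `fill_segments` are hypothetical, procedures are represented as plain strings, and a `ValueError` stands in for the XDR encode error the server would return.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One xdr_rdma_segment triplet (field names mirror the XDR)."""
    handle: int  # registered memory handle (STag)
    length: int  # segment length in bytes
    offset: int  # virtual address or offset


# Hypothetical lookup table: the only NFSv2/v3 procedures whose result
# may be placed through the Write list.
WRITE_LIST_PROCS = {"READ", "READLINK"}


def select_write_chunk(proc: str,
                       write_list: List[List[Segment]]
                       ) -> Optional[List[Segment]]:
    """Return the one chunk the server may use, or None if it is ignored."""
    if proc not in WRITE_LIST_PROCS or not write_list:
        return None
    return write_list[0]  # entries beyond the first are ignored


def fill_segments(chunk: List[Segment], result_len: int) -> List[int]:
    """Fill segments in order; return the length written to each one."""
    if result_len > sum(seg.length for seg in chunk):
        # Stand-in for the XDR encode error returned by the server.
        raise ValueError("posted buffer too small for result")
    written = []
    remaining = result_len
    for seg in chunk:
        n = min(seg.length, remaining)
        written.append(n)  # trailing segments may receive 0 bytes
        remaining -= n
    return written
```

For a 5000-byte READ result and a chunk of two 4096-byte segments, `fill_segments` reports 4096 and 904 bytes. These are the values the server would write back into the segment length fields of the returned Write list.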

183     In the case where the outgoing request or expected incoming reply is
184     larger than the maximum size supported on the connection, it is
185     possible for the RPC layer to post the entire message or result in a
186     special "RDMA_NOMSG" message type which is transferred entirely by
187     RDMA. This is implemented in RPC, below NFS, and therefore has no
188     effect on the message contents.

190     Non-RDMA (inline) WRITE transfers MAY optionally employ the
191     "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
192     appropriate value for the server is known to the client. Padding
193     allows the opaque file data to arrive at the server in an aligned
194     fashion, which may improve server performance.

196     The NFS version 2 and 3 protocols are frequently limited in practice
197     to requests containing less than or equal to 8 kilobytes and 32
198     kilobytes of data, respectively. In these cases, it is often
199     practical to support basic operation without employing a
200     configuration exchange as discussed in [RPCRDMA]. The server MUST
201     post buffers large enough to receive the largest possible incoming
202     message (approximately 12KB for NFS version 2, or 36KB for NFS
203     version 3, would be vastly sufficient), and the client can post
204     buffers large enough to receive replies based on the "rsize" it is
205     using to the server, plus a fixed overhead for the RPC and NFS
206     headers. Because the server MUST NOT return data in excess of this
207     size, the client can be assured of the adequacy of its posted buffer
208     sizes.

210     Flow control is handled dynamically by the RPC RDMA protocol, and
211     write padding is OPTIONAL and therefore MAY remain unused.

213     Alternatively, if the server is administratively configured to values
214     appropriate for all its clients, the same assurance of
215     interoperability within the domain can be made.

217     The use of a configuration protocol with NFS v2 and v3 is therefore
218     OPTIONAL. Employing a configuration exchange may allow some advantage
219     to server resource management through accurately sizing buffers,
220     enabling the server to know exactly how many RDMA Reads may be in
221     progress at once on the client connection, and enabling client write
222     padding which may be desirable for certain servers when RDMA Read is
223     impractical.

225  5.  NFS Version 4 Mapping

227     This specification applies to the first minor version of NFS version
228     4 (NFSv4.0) and any subsequent minor versions that do not override
229     this mapping.

231     The Write list MUST be considered only for the COMPOUND procedure.

233     This procedure returns results from a sequence of operations. Only
234     the opaque file data from an NFS READ operation, and the pathname
235     from a READLINK operation MUST utilize entries from the Write list.

237     If there is no Write list, i.e., the list is null, then any READ or
238     READLINK operations in the COMPOUND MUST return their data inline.
239     The NFSv4.0 client MUST ensure in this case that any result of its
240     READ and READLINK requests will fit within its receive buffers, in
241     order to avoid a resulting RDMA transport error upon transfer. The
242     server is not required to detect this.

244     The first entry in the Write list MUST be used by the first READ or
245     READLINK in the COMPOUND request. The next Write list entry MUST be
246     used by the next READ or READLINK, and so on. If there are more READ
247     or READLINK operations than Write list entries, then any remaining
248     operations MUST return their results inline.

250     If a Write list entry is presented, then the corresponding READ or
251     READLINK MUST return its data via an RDMA WRITE to the buffer
252     indicated by the Write list entry. If the Write list entry has zero
253     RDMA segments, or if the total size of the segments is zero, then the
254     corresponding READ or READLINK operation MUST return its result
255     inline.
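
The pairing of Write list entries with READ and READLINK operations in a COMPOUND can be sketched as follows. This is a minimal illustration, assuming a simplified model in which operations are plain strings and each Write list entry is a list of segment lengths. The names `CHUNK_ELIGIBLE` and `map_compound` are hypothetical and do not come from any NFS implementation.

```python
from typing import List, Optional, Tuple

# Operations whose results may consume a Write list entry.
CHUNK_ELIGIBLE = ("READ", "READLINK")


def map_compound(ops: List[str],
                 write_list: Optional[List[List[int]]]
                 ) -> List[Tuple[str, Optional[int]]]:
    """Pair each READ/READLINK with a Write list entry index, or with
    None when its result must be returned inline.

    Each Write list entry is modeled as a list of segment lengths.
    """
    entries = write_list or []  # a null Write list behaves as empty
    next_entry = 0
    plan = []
    for op in ops:
        if op not in CHUNK_ELIGIBLE:
            continue  # other operations never use the Write list
        if next_entry >= len(entries):
            plan.append((op, None))  # more operations than entries
            continue
        index = next_entry
        next_entry += 1  # the entry is consumed even if it is empty
        if sum(entries[index]) == 0:
            plan.append((op, None))  # zero segments or zero total size
        else:
            plan.append((op, index))
    return plan
```

With a COMPOUND of `PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ` and a Write list of `[[4096], [], [8192]]`, the sketch pairs the first READ with entry 0, returns the READLINK inline because its entry has no segments, and pairs the last READ with entry 2.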

257     The following example shows an RDMA Write list with three posted
258     buffers A, B, and C. The designated operations in the compound
259     request, READ and READLINK, consume the posted buffers by writing
260     their results back to each buffer.

262        RDMA Write list:

264           A --> B --> C

266        Compound request:

268           PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
269                         |                   |                   |
270                         v                   v                   v
271                         A                   B                   C

273     If the client does not want to have the READLINK result returned
274     directly, then it provides a zero length array of segment triplets
275     for buffer B or sets the values in the segment triplet for buffer B
276     to zeros so that the READLINK result MUST be returned inline.

278     The situation is similar for RDMA Read lists sent by the client and
279     applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
280     Additionally, inline segments too large to fit in posted buffers MAY
281     be transferred in special "RDMA_NOMSG" messages.

283     Non-RDMA (inline) WRITE transfers MAY optionally employ the
284     "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
285     appropriate value for the server is known to the client. Padding
286     allows the opaque file data to arrive at the server in an aligned
287     fashion, which may improve server performance. In order to ensure
288     accurate alignment for all data, it is likely that the client will
289     restrict its use of OPTIONAL padding to COMPOUND requests containing
290     only a single WRITE operation.

292     Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
293     COMPOUND is not bounded, even when RDMA chunks are in use. While it
294     might appear that a configuration protocol exchange (such as the one
295     described in [RPCRDMA]) would help, in fact the layering issues
296     involved in building COMPOUNDs by NFS make such a mechanism
297     unworkable.

299     However, typical NFS version 4 clients rarely issue such problematic
300     requests. In practice, they behave in much more predictable ways; in
301     fact, most still support the traditional rsize/wsize mount parameters.
302     Therefore, most NFS version 4 clients function over RPC/RDMA in the
303     same way as NFS versions 2 and 3, operationally.

305     There are, however, advantages to allowing both client and server to
306     operate with prearranged size constraints, for example, use of the
307     sizes to better manage the server's response cache. An extension to
308     NFS version 4 supporting a more comprehensive exchange of upper layer
309     parameters is part of [NFSv4.1].

311  5.1.  NFS Version 4 Callbacks

313     The NFS version 4 protocols support server-initiated callbacks to
314     selected clients, in order to notify them of events such as recalled
315     delegations, etc. These callbacks present no particular issue to
316     being framed over RPC/RDMA, since such callbacks do not carry bulk
317     data such as read or write. They MAY be transmitted inline via
318     RDMA_MSG, or if the callback message or its reply overflow the
319     negotiated buffer sizes for a callback connection, they MAY be
320     transferred via the RDMA_NOMSG method as described above for other
321     exchanges.

323     One special case is noteworthy: in NFS version 4.1, the callback
324     channel is optionally negotiated to be on the same connection as one
325     used for client requests. In this case, and because the XID is
326     present in the RPC/RDMA header, the client MUST ascertain whether the
327     message is in fact an RPC REPLY, and therefore a reply to a prior
328     request and carrying its XID, before processing it as such. By the
329     same token, the server MUST ascertain whether an incoming message on
330     such a callback-eligible connection is an RPC CALL, before optionally
331     processing the XID.

333     In the callback case, the XID present in the RPC/RDMA header will
334     potentially have any value which may (or may not) collide with an XID
335     used by the client for a previous or future request. The client and
336     server MUST inspect the RPC component of the message to determine its
337     potential disposition as either an RPC CALL or RPC REPLY, prior to
338     processing this XID, and MUST NOT reject or accept it without also
339     determining the proper context.

341  6.  Security Considerations

343     The RDMA transport for RPC [RPCRDMA] supports all RPC [RFC1831bis]
344     security models, including RPCSEC_GSS [RFC2203] security and link-
345     level security. The choice of RDMA Read and RDMA Write to return RPC
346     arguments and results, respectively, does not affect this, since it
347     only changes the method of data transfer. Specifically, the
348     requirements of [RPCRDMA] ensure that this choice does not introduce
349     new vulnerabilities.

351     Because this document defines only the binding of the NFS protocols
352     atop [RPCRDMA], all relevant security considerations are therefore to
353     be described at that layer.

355  7.  IANA Considerations

357     NFS use of direct data placement introduces a need for an additional
358     NFS port number assignment for networks which share traditional UDP
359     and TCP port spaces with RDMA services. The iWARP [RFC5041]
360     [RFC5040] protocol is such an example (InfiniBand is not).

362     NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
363     listen for clients on UDP and TCP port 2049, and additionally, they
364     register these with the portmapper and/or rpcbind [RFC1833] service.
365     However, [RFC3530] requires NFS servers for version 4 to listen on
366     TCP port 2049, and they are not required to register.

368     An NFS version 2 or version 3 server supporting RPC/RDMA on such a
369     network and registering itself with the RPC portmapper MAY choose an
370     arbitrary port, or MAY use the alternative well-known port number for
371     its RPC/RDMA service. The chosen port MAY be registered with the RPC
372     portmapper under the netid assigned by the requirement in [RPCRDMA].

374     An NFS version 4 server supporting RPC/RDMA on such a network MUST
375     use the alternative well-known port number for its RPC/RDMA service.
376     Clients SHOULD connect to this well-known port without consulting the
377     RPC portmapper (as for NFSv4/TCP).

379     The port number assigned to an NFS service over an RPC/RDMA transport
380     is available from the IANA port registry [RFC3232].

382  8.  Acknowledgements

384     The authors would like to thank Dave Noveck and Chet Juszczak for
385     their contributions to this document.

387  9.  Normative References

389     [RFC2119]
390        S. Bradner, "Key words for use in RFCs to Indicate Requirement
391        Levels",
392        Best Current Practice,
393        BCP 14, RFC 2119, March 1997.

395     [RFC1094]
396        "NFS: Network File System Protocol Specification",
397        (NFS version 2) Informational RFC,
398        http://www.ietf.org/rfc/rfc1094.txt

400     [RFC1831bis]
401        R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
402        Specification Version 2",
403        Standards Track RFC

405     [RFC1813]
406        B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
407        Specification",
408        Informational RFC,
409        http://www.ietf.org/rfc/rfc1813.txt

411     [RFC1833]
412        R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
413        Standards Track RFC,
414        http://www.ietf.org/rfc/rfc1833.txt

416     [RFC3530]
417        S. Shepler, et al., "NFS version 4 Protocol",
418        Standards Track RFC,
419        http://www.ietf.org/rfc/rfc3530.txt

421     [NFSv4.1]
422        S. Shepler et al., ed., "NFSv4 Minor Version 1"
423        Internet Draft Work in Progress,
424        draft-ietf-nfsv4-minorversion1

426     [RFC2203]
427        M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
428        Standards Track RFC,
429        http://www.ietf.org/rfc/rfc2203.txt

431  10.  Informative References

433     [RFC3232]
434        Internet Assigned Numbers Authority (IANA),
435        Port Registry database,
436        http://www.ietf.org/rfc/rfc3232.txt
437        http://www.iana.org/assignments/port-numbers

439     [RPCRDMA]
440        T. Talpey, B. Callaghan, "Remote Direct Memory Access Transport
441        for Remote Procedure Call"
442        Internet Draft Work in Progress,
443        draft-ietf-nfsv4-rpcrdma

445     [RFC5041]
446        H. Shah et al., "Direct Data Placement over Reliable Transports",
447        Standards Track RFC

449     [RFC5040]
450        R. Recio et al., "A Remote Direct Memory Access Protocol
451        Specification",
452        Standards Track RFC

454  11.  Authors' Addresses

456     Tom Talpey
457     Network Appliance, Inc.
458     1601 Trapelo Road, #16
459     Waltham, MA 02451 USA

461     Phone: +1 781 768 5329
462     EMail: thomas.talpey@netapp.com
463     Brent Callaghan
464     Apple Computer, Inc.
465     MS: 302-4K
466     2 Infinite Loop
467     Cupertino, CA 95014 USA

469     EMail: brentc@apple.com

471  12.  Intellectual Property and Copyright Statements

473     Full Copyright Statement

475     Copyright (C) The IETF Trust (2008).

477     This document is subject to the rights, licenses and restrictions
478     contained in BCP 78, and except as set forth therein, the authors
479     retain all their rights.

481     This document and the information contained herein are provided on
482     an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
483     REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
484     IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
485     WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
486     WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
487     ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
488     FOR A PARTICULAR PURPOSE.

490     Intellectual Property
491     The IETF takes no position regarding the validity or scope of any
492     Intellectual Property Rights or other rights that might be claimed
493     to pertain to the implementation or use of the technology described
494     in this document or the extent to which any license under such
495     rights might or might not be available; nor does it represent that
496     it has made any independent effort to identify any such rights.
497     Information on the procedures with respect to rights in RFC
498     documents can be found in BCP 78 and BCP 79.

500     Copies of IPR disclosures made to the IETF Secretariat and any
501     assurances of licenses to be made available, or the result of an
502     attempt made to obtain a general license or permission for the use
503     of such proprietary rights by implementers or users of this
504     specification can be obtained from the IETF on-line IPR repository
505     at http://www.ietf.org/ipr.

507     The IETF invites any interested party to bring to its attention any
508     copyrights, patents or patent applications, or other proprietary
509     rights that may cover technology that may be required to implement
510     this standard. Please address the information to the IETF at ietf-
511     ipr@ietf.org.

513     Acknowledgment
514     Funding for the RFC Editor function is provided by the IETF
515     Administrative Support Activity (IASA).