NFSv4 Working Group                                           Tom Talpey
Internet-Draft                                                    NetApp
Intended status: Standards Track                         Brent Callaghan
Expires: October 17, 2008                                          Apple
                                                          April 16, 2008

                       NFS Direct Data Placement
                     draft-ietf-nfsv4-nfsdirect-08

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This draft defines the bindings of the various Network File System
   (NFS) versions to the Remote Direct Memory Access (RDMA) operations
   supported by the RPC/RDMA transport protocol. It describes the use
   of direct data placement by means of server-initiated RDMA operations
   into client-supplied buffers for implementations of NFS versions 2,
   3, 4 and 4.1 over such an RDMA transport.

Table of Contents

   1.  Introduction
   2.  Transfers from NFS Client to NFS Server
   3.  Transfers from NFS Server to NFS Client
   4.  NFS Versions 2 and 3 Mapping
   5.  NFS Version 4 Mapping
     5.1.  NFS Version 4 Callbacks
   6.  Port Usage Considerations
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgments
   10. Normative References
   11. Informative References
   12. Authors' Addresses
   13. Intellectual Property and Copyright Statements
   Acknowledgment

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.  Introduction

   The Remote Direct Memory Access (RDMA) Transport for Remote Procedure
   Calls (RPC) [RPCRDMA] allows an RPC client application to post
   buffers in a Chunk list for specific arguments and results from an
   RPC call. The RDMA transport header conveys this list of client
   buffer addresses to the server where the application can associate
   them with client data and use RDMA operations to transfer the results
   directly to and from the posted buffers on the client. The client
   and server must agree on a consistent mapping of posted buffers to
   RPC. This document details the mapping for each version of the NFS
   protocol [RFC1094] [RFC1813] [RFC3530] [NFSv4.1].

2.  Transfers from NFS Client to NFS Server

   The RDMA Read list, in the RDMA transport header, allows an RPC
   client to marshal RPC call data selectively. Large chunks of data,
   such as the file data of an NFS WRITE request, MAY be referenced by
   an RDMA Read list and be moved efficiently and directly-placed by an
   RDMA Read operation initiated by the server.

   The process of identifying these chunks for the RDMA Read list can be
   implemented entirely within the RPC layer. It is transparent to the
   upper-level protocol, such as NFS. For instance, the file data
   portion of an NFS WRITE request can be selected as an RDMA "chunk"
   within the eXternal Data Representation (XDR) marshaling code of RPC
   based on a size criterion, independently of the NFS protocol layer.
   The XDR unmarshaling on the receiving system can identify the
   correspondence between Read chunks and protocol elements via the XDR
   position value encoded in the Read chunk entry.

   RPC RDMA Read chunks are employed by this NFS mapping to convey
   specific NFS data to the server in a manner which may be directly
   placed. The following sections describe this mapping for versions of
   the NFS protocol.

3.  Transfers from NFS Server to NFS Client

   The RDMA Write list, in the RDMA transport header, allows the client
   to post one or more buffers into which the server will RDMA Write
   designated result chunks directly. If the client sends a null Write
   list, then results from the RPC call will be returned as either an
   inline reply, as chunks in an RDMA Read list of server-posted
   buffers, or in a client-posted reply buffer.

   Each posted buffer in a Write list is represented as an array of
   memory segments. This allows the client some flexibility in
   submitting discontiguous memory segments into which the server will
   scatter the result. Each segment is described by a triplet
   consisting of the segment handle or steering tag (STag), segment
   length, and memory address or offset.

      struct xdr_rdma_segment {
         uint32 handle;   /* Registered memory handle */
         uint32 length;   /* Length of the chunk in bytes */
         uint64 offset;   /* Chunk virtual address or offset */
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list  *next;
      };

   The sum of the segment lengths yields the total size of the buffer,
   which MUST be large enough to accept the result. If the buffer is
   too small, the server MUST return an XDR encode error. The server
   MUST return the result data for a posted buffer by progressively
   filling its segments, perhaps leaving some trailing segments unfilled
   or partially full if the size of the result is less than the total
   size of the buffer segments.

   The server returns the RDMA Write list to the client with the segment
   length fields overwritten to indicate the amount of data RDMA Written
   to each segment. Results returned by direct placement MUST NOT be
   returned by other methods, e.g., by Read chunk list or inline.
   If no result data at all is returned for the element, the server
   places no data in the buffer(s), but does return zeroes in the
   segment length fields corresponding to the result.

   The RDMA Write list allows the client to provide multiple result
   buffers - each buffer maps to a specific result in the reply. The
   NFS client and server implementations agree by specifying the mapping
   of results to buffers for each RPC procedure. The following sections
   describe this mapping for versions of the NFS protocol.

   Through the use of RDMA Write lists in NFS requests, it is not
   necessary to employ the RDMA Read lists in the NFS replies, as
   described in the RPC/RDMA protocol. This enables more efficient
   operation, by avoiding the need for the server to expose buffers for
   RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY
   additionally employ RDMA Reply chunks to receive entire messages, as
   described in [RPCRDMA].

4.  NFS Versions 2 and 3 Mapping

   A single RDMA Write list entry MAY be posted by the client to receive
   either the opaque file data from a READ request or the pathname from
   a READLINK request. The server MUST ignore a Write list for any
   other NFS procedure, as well as any Write list entries beyond the
   first in the list.

   Similarly, a single RDMA Read list entry MAY be posted by the client
   to supply the opaque file data for a WRITE request or the pathname
   for a SYMLINK request. The server MUST ignore any Read list for
   other NFS procedures, as well as additional Read list entries beyond
   the first in the list.

   Because there are no NFS version 2 or 3 requests that transfer bulk
   data in both directions, it is not necessary to post requests
   containing both Write and Read lists. Any unneeded Read or Write
   lists are ignored by the server.
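
   As an informal illustration only, the selection rule described above
   for NFS versions 2 and 3 can be sketched as follows. The function
   name select_chunks, the use of plain strings for procedure names, and
   the representation of each list as a Python sequence of opaque
   entries are assumptions made for this sketch; they are not defined
   by this document or by [RPCRDMA].

```python
from typing import Optional, Sequence, Tuple

# Procedures whose bulk results may be placed through the Write list.
WRITE_LIST_PROCEDURES = frozenset({"READ", "READLINK"})
# Procedures whose bulk arguments may be supplied through the Read list.
READ_LIST_PROCEDURES = frozenset({"WRITE", "SYMLINK"})


def select_chunks(
    procedure: str,
    read_list: Sequence[str],
    write_list: Sequence[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (Read list entry, Write list entry) the server honors.

    None in either position means that list is ignored for this call.
    Only the first entry of a list is ever considered (hypothetical
    model: entries are opaque labels, not real chunk descriptors).
    """
    read_entry = None
    write_entry = None
    if procedure in READ_LIST_PROCEDURES and read_list:
        read_entry = read_list[0]
    if procedure in WRITE_LIST_PROCEDURES and write_list:
        write_entry = write_list[0]
    return read_entry, write_entry
```

   In this sketch, a READ posted with two Write list entries uses only
   the first entry, and a procedure such as GETATTR has both of its
   lists ignored.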

   In the case where the outgoing request or expected incoming reply is
   larger than the maximum size supported on the connection, it is
   possible for the RPC layer to post the entire message or result in a
   special "RDMA_NOMSG" message type which is transferred entirely by
   RDMA. This is implemented in RPC, below NFS, and therefore has no
   effect on the message contents.

   Non-RDMA (inline) WRITE transfers MAY employ the "RDMA_MSGP" padding
   method described in the RPC/RDMA protocol, if the appropriate value
   for the server is known to the client. Padding allows the opaque
   file data to arrive at the server in an aligned fashion, which may
   improve server performance.

   The NFS version 2 and 3 protocols are frequently limited in practice
   to requests containing no more than 8 kilobytes and 32 kilobytes of
   data, respectively. In these cases, it is often practical to support
   basic operation without employing a configuration exchange as
   discussed in [RPCRDMA]. The server MUST post buffers large enough to
   receive the largest possible incoming message (approximately 12KB
   for NFS version 2, or 36KB for NFS version 3, would be more than
   sufficient), and the client can post buffers large enough to receive
   replies based on the "rsize" it is using with the server, plus a
   fixed overhead for the RPC and NFS headers. Because the server MUST
   NOT return data in excess of this size, the client can be assured of
   the adequacy of its posted buffer sizes.

   Flow control is handled dynamically by the RPC RDMA protocol, and
   write padding is OPTIONAL and therefore MAY remain unused.

   Alternatively, if the server is administratively configured to values
   appropriate for all its clients, the same assurance of
   interoperability within the domain can be made.

   The use of a configuration protocol with NFS v2 and v3 is therefore
   OPTIONAL.
   Employing a configuration exchange may allow some advantage to
   server resource management through accurately sizing buffers,
   enabling the server to know exactly how many RDMA Reads may be in
   progress at once on the client connection, and enabling client write
   padding which may be desirable for certain servers when RDMA Read is
   impractical.

5.  NFS Version 4 Mapping

   This specification applies to the first minor version of NFS version
   4 (NFSv4.0) and any subsequent minor versions that do not override
   this mapping.

   The Write list MUST be considered only for the COMPOUND procedure.
   This procedure returns results from a sequence of operations. Only
   the opaque file data from an NFS READ operation, and the pathname
   from a READLINK operation MUST utilize entries from the Write list.

   If there is no Write list, i.e., the list is null, then any READ or
   READLINK operations in the COMPOUND MUST return their data inline.
   The NFSv4.0 client MUST ensure in this case that any result of its
   READ and READLINK requests will fit within its receive buffers, in
   order to avoid a resulting RDMA transport error upon transfer. The
   server is not required to detect this.

   The first entry in the Write list MUST be used by the first READ or
   READLINK in the COMPOUND request. The next Write list entry is used
   by the next READ or READLINK, and so on. If there are more READ or
   READLINK operations than Write list entries, then any remaining
   operations MUST return their results inline.

   If a Write list entry is presented, then the corresponding READ or
   READLINK MUST return its data via an RDMA Write to the buffer
   indicated by the Write list entry. If the Write list entry has zero
   RDMA segments, or if the total size of the segments is zero, then the
   corresponding READ or READLINK operation MUST return its result
   inline.
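
   The assignment rules above can be illustrated with a short sketch.
   It is not a definitive implementation: the function name
   assign_write_chunks, the modeling of a Write list entry as a list of
   segment lengths, and the use of None to mean "returned inline" are
   all assumptions introduced here for illustration. The sketch also
   assumes that an entry with no usable space is still consumed by its
   corresponding operation, so that later entries keep their positions.

```python
from typing import Dict, Optional, Sequence

# COMPOUND operations whose results are eligible for direct placement.
DDP_OPERATIONS = ("READ", "READLINK")


def assign_write_chunks(
    operations: Sequence[str],
    write_list: Optional[Sequence[Sequence[int]]],
) -> Dict[int, Optional[int]]:
    """Map each READ/READLINK position to a Write list entry index.

    Each Write list entry is modeled (hypothetically) as a sequence of
    segment lengths. A value of None means the result goes inline.
    """
    entries = list(write_list or [])  # a null Write list has no entries
    placement: Dict[int, Optional[int]] = {}
    next_entry = 0
    for position, operation in enumerate(operations):
        if operation not in DDP_OPERATIONS:
            continue
        if next_entry >= len(entries):
            # More READ/READLINK operations than Write list entries.
            placement[position] = None
            continue
        # Zero segments, or segments totalling zero bytes, force inline.
        usable = sum(entries[next_entry]) > 0
        placement[position] = next_entry if usable else None
        next_entry += 1
    return placement


compound = ["PUTFH", "LOOKUP", "READ", "PUTFH", "LOOKUP", "READLINK",
            "PUTFH", "LOOKUP", "READ"]
# Three posted entries; the second has no segments at all.
print(assign_write_chunks(compound, [[4096], [], [8192, 4096]]))
# prints {2: 0, 5: None, 8: 2}
```

   With a null Write list, every READ and READLINK in the same COMPOUND
   maps to None, that is, all results are returned inline.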

   The following example shows an RDMA Write list with three posted
   buffers A, B, and C. The designated operations in the compound
   request, READ and READLINK, consume the posted buffers by writing
   their results back to each buffer.

      RDMA Write list:

         A --> B --> C

      Compound request:

         PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                        |                   |                   |
                        v                   v                   v
                        A                   B                   C

   If the client does not want to have the READLINK result returned
   directly, then it provides a zero-length array of segment triplets
   for buffer B or sets the values in the segment triplet for buffer B
   to zeros so that the READLINK result MUST be returned inline.

   The situation is similar for RDMA Read lists sent by the client and
   applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
   Additionally, inline segments too large to fit in posted buffers MAY
   be transferred in special "RDMA_NOMSG" messages.

   Non-RDMA (inline) WRITE transfers MAY employ the "RDMA_MSGP" padding
   method described in the RPC/RDMA protocol, if the appropriate value
   for the server is known to the client. Padding allows the opaque
   file data to arrive at the server in an aligned fashion, which may
   improve server performance. In order to ensure accurate alignment
   for all data, it is likely that the client will restrict its use of
   OPTIONAL padding to COMPOUND requests containing only a single WRITE
   operation.

   Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
   COMPOUND is not bounded, even when RDMA chunks are in use. While it
   might appear that a configuration protocol exchange (such as the one
   described in [RPCRDMA]) would help, in fact the layering issues
   involved in building COMPOUNDs by NFS make such a mechanism
   unworkable.

   However, typical NFS version 4 clients rarely issue such problematic
   requests.
   In practice, they behave in much more predictable ways; in fact,
   most still support the traditional rsize/wsize mount parameters.
   Therefore, most NFS version 4 clients function over RPC/RDMA in the
   same way as NFS versions 2 and 3, operationally.

   There are, however, advantages to allowing both client and server to
   operate with prearranged size constraints, for example, use of the
   sizes to better manage the server's response cache. An extension to
   NFS version 4 supporting a more comprehensive exchange of upper layer
   parameters is part of [NFSv4.1].

5.1.  NFS Version 4 Callbacks

   The NFS version 4 protocols support server-initiated callbacks to
   selected clients, in order to notify them of events such as recalled
   delegations. These callbacks present no particular difficulty in
   being framed over RPC/RDMA, since such callbacks do not carry bulk
   data such as NFS READ or NFS WRITE. They MAY be transmitted inline
   via RDMA_MSG, or if the callback message or its reply overflows the
   negotiated buffer sizes for a callback connection, they MAY be
   transferred via the RDMA_NOMSG method as described above for other
   exchanges.

   One special case is noteworthy: in NFS version 4.1, the callback
   channel is optionally negotiated to be on the same connection as one
   used for client requests. In this case, and because the XID is
   present in the RPC/RDMA header, the client MUST ascertain whether the
   message is in fact an RPC REPLY, and therefore a reply to a prior
   request and carrying its XID, before processing it as such. By the
   same token, the server MUST ascertain whether an incoming message on
   such a callback-eligible connection is an RPC CALL, before optionally
   processing the XID.

   In the callback case, the XID present in the RPC/RDMA header will
   potentially have any value which may (or may not) collide with an XID
   used by the client for a previous or future request.
   The client and server MUST inspect the RPC component of the message
   to determine its potential disposition as either an RPC CALL or RPC
   REPLY, prior to processing this XID, and MUST NOT reject or accept it
   without also determining the proper context.

6.  Port Usage Considerations

   NFS use of direct data placement introduces a need for an additional
   NFS port number assignment for networks which share traditional UDP
   and TCP port spaces with RDMA services. The iWARP [RFC5041]
   [RFC5040] protocol is one such example (InfiniBand is not).

   NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
   listen for clients on UDP and TCP port 2049, and additionally, they
   register these with the portmapper and/or rpcbind [RFC1833] service.
   However, [RFC3530] requires NFS servers for version 4 to listen on
   TCP port 2049, and they are not required to register.

   An NFS version 2 or version 3 server supporting RPC/RDMA on such a
   network and registering itself with the RPC portmapper MAY choose an
   arbitrary port, or MAY use the alternative well-known port number for
   its RPC/RDMA service. The chosen port MAY be registered with the RPC
   portmapper under the netid assigned by the requirement in [RPCRDMA].

   An NFS version 4 server supporting RPC/RDMA on such a network MUST
   use the alternative well-known port number for its RPC/RDMA service.
   Clients SHOULD connect to this well-known port without consulting the
   RPC portmapper (as for NFSv4/TCP).

   The port number assigned to an NFS service over an RPC/RDMA transport
   is available from the IANA port registry [RFC3232].

7.  Security Considerations

   The RDMA transport for RPC [RPCRDMA] supports all RPC [RFC1831bis]
   security models, including RPCSEC_GSS [RFC2203] security and
   link-level security.
   The choice of RDMA Read and RDMA Write to return RPC arguments and
   results, respectively, does not affect this, since it only changes
   the method of data transfer. Specifically, the requirements of
   [RPCRDMA] ensure that this choice does not introduce new
   vulnerabilities.

   Because this document defines only the binding of the NFS protocols
   atop [RPCRDMA], all relevant security considerations are therefore to
   be described at that layer.

8.  IANA Considerations

   This document has no IANA considerations.

9.  Acknowledgments

   The authors would like to thank Dave Noveck and Chet Juszczak for
   their contributions to this document.

10.  Normative References

   [RFC2119]
      S. Bradner, "Key words for use in RFCs to Indicate Requirement
      Levels",
      Best Current Practice,
      BCP 14, RFC 2119, March 1997.

   [RFC1094]
      "NFS: Network File System Protocol Specification",
      (NFS version 2) Informational RFC,
      http://www.ietf.org/rfc/rfc1094.txt

   [RFC1831bis]
      R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
      Specification Version 2",
      Standards Track RFC

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification",
      Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC1833]
      R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1833.txt

   [RFC3530]
      S. Shepler, et al., "NFS version 4 Protocol",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc3530.txt

   [NFSv4.1]
      S. Shepler et al., ed., "NFSv4 Minor Version 1"
      Internet Draft Work in Progress,
      draft-ietf-nfsv4-minorversion1

   [RFC2203]
      M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc2203.txt

11.  Informative References

   [RFC3232]
      Internet Assigned Numbers Authority (IANA),
      Port Registry database,
      http://www.ietf.org/rfc/rfc3232.txt
      http://www.iana.org/assignments/port-numbers

   [RPCRDMA]
      T. Talpey, B. Callaghan, "Remote Direct Memory Access Transport
      for Remote Procedure Call"
      Internet Draft Work in Progress,
      draft-ietf-nfsv4-rpcrdma

   [RFC5041]
      H. Shah et al., "Direct Data Placement over Reliable Transports",
      Standards Track RFC

   [RFC5040]
      R. Recio et al., "A Remote Direct Memory Access Protocol
      Specification",
      Standards Track RFC

12.  Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   1601 Trapelo Road, #16
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com

13.  Intellectual Property and Copyright Statements

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).