idnits 2.17.1

draft-ietf-nfsv4-nfsdirect-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 465.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 441.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 448.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 454.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of
     the newer disclaimer which includes the IETF Trust according to RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'MUST not' in this paragraph:

     The server returns the RDMA Write list to the client with the segment
     length fields overwritten to indicate the amount of data RDMA Written to
     each segment. Results returned by direct placement MUST not be returned
     by other methods, e.g. by read chunk list or inline. If no result data
     at all is returned for the element, the server places no data in the
     buffer(s), but does return zeroes in the segment length fields
     corresponding to the result.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 1831 (Obsoleted by RFC 5531)

  ** Obsolete normative reference: RFC 1832 (Obsoleted by RFC 4506)

  ** Downref: Normative reference to an Informational RFC: RFC 1094

  ** Downref: Normative reference to an Informational RFC: RFC 1813

  ** Obsolete normative reference: RFC 3530 (Obsoleted by RFC 7530)


     Summary: 8 errors (**), 0 flaws (~~), 3 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2     Internet-Draft                                                Tom Talpey
3     Expires: April 2007                                      Brent Callaghan

5     Document: draft-ietf-nfsv4-nfsdirect-04                    October, 2007

7                            NFS Direct Data Placement

9     Status of this Memo

11       By submitting this Internet-Draft, each author represents that any
12       applicable patent or other IPR claims of which he or she is aware
13       have been or will be disclosed, and any of which he or she becomes
14       aware will be disclosed, in accordance with Section 6 of BCP 79.

16       Internet-Drafts are working documents of the Internet Engineering
17       Task Force (IETF), its areas, and its working groups. Note that
18       other groups may also distribute working documents as Internet-
19       Drafts.

21       Internet-Drafts are draft documents valid for a maximum of six
22       months and may be updated, replaced, or obsoleted by other
23       documents at any time. It is inappropriate to use Internet-Drafts
24       as reference material or to cite them other than as "work in
25       progress."

27       The list of current Internet-Drafts can be accessed at
28          http://www.ietf.org/ietf/1id-abstracts.txt

30       The list of Internet-Draft Shadow Directories can be accessed at
31          http://www.ietf.org/shadow.html.

33    Abstract

35       The RDMA transport for ONC RPC provides direct data placement for NFS
36       data. Direct data placement not only reduces the amount of data that
37       needs to be copied in an NFS call, but allows much of the data
38       movement over the network to be implemented in RDMA hardware. This
39       draft describes the use of direct data placement by means of server-
40       initiated RDMA operations into client-supplied buffers in a Chunk
41       list for implementations of NFS versions 2, 3, and 4 over an RDMA
42       transport.

44    Table of Contents

46       1. Introduction
47       2. Transfers from NFS Client to NFS Server
48       3. Transfers from NFS Server to NFS Client
49       4. NFS Versions 2 and 3 Mapping
50       5. NFS Version 4 Mapping
51       6. Security
52       7. IANA Considerations
53       8. Acknowledgements
54       9. Normative References
55       10. Informative References
56       11. Authors' Addresses
57       12. Intellectual Property and Copyright Statements
58       Acknowledgement

60    Requirements Language

62       The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
63       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
64       document are to be interpreted as described in [RFC2119].

66    1. Introduction

68       The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client
69       application to post buffers in a Chunk list for specific arguments
70       and results from an RPC call. The RDMA transport header conveys this
71       list of client buffer addresses to the server where the application
72       can associate them with client data and use RDMA operations to
73       transfer the results directly to and from the posted buffers on the
74       client. The client and server must agree on a consistent mapping of
75       posted buffers to RPC. This document details the mapping for each
76       version of the NFS protocol [RFC1831] [RFC1832] [RFC1094] [RFC1813]
77       [RFC3530] [NFSv4.1].

79    2. Transfers from NFS Client to NFS Server

81       The RDMA Read list, in the RDMA transport header, allows an RPC
82       client to marshal RPC call data selectively. Large chunks of data,
83       such as the file data of an NFS WRITE request, MAY be referenced by
84       an RDMA Read list and be moved efficiently and directly-placed by an
85       RDMA READ operation initiated by the server.

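A minimal Python sketch of selective marshaling follows. It is an illustration only: the names ReadChunk, marshal_selectively and INLINE_THRESHOLD are hypothetical, the 1024-byte threshold is an arbitrary assumption, and the recorded position is simply the inline offset at which a large item was left out. The actual chunk encoding is defined by [RPCRDMA], not by this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

INLINE_THRESHOLD = 1024  # hypothetical size criterion, in bytes


@dataclass
class ReadChunk:
    position: int  # offset in the inline stream where the data belongs
    data: bytes    # stands in for a registered client buffer


def marshal_selectively(items: List[bytes],
                        threshold: int = INLINE_THRESHOLD
                        ) -> Tuple[bytes, List[ReadChunk]]:
    """Split call data into an inline stream and a Read list."""
    inline = bytearray()
    read_list: List[ReadChunk] = []
    for item in items:
        if len(item) > threshold:
            # Large item: keep it out of the inline stream and record
            # where it would have appeared.
            read_list.append(ReadChunk(position=len(inline), data=item))
        else:
            inline += item
    return bytes(inline), read_list
```

In this sketch the file data of a large WRITE would end up in the Read list, to be fetched by the server, while the small surrounding arguments stay inline.
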
87       The process of identifying these chunks for the RDMA Read list can be
88       implemented entirely within the RPC layer. It is transparent to the
89       upper-level protocol, such as NFS. For instance, the file data
90       portion of an NFS WRITE request can be selected as an RDMA "chunk"
91       within the XDR marshaling code of RPC based on a size criterion,
92       independently of the NFS protocol layer. The XDR unmarshaling on the
93       receiving system can identify the correspondence between Read chunks
94       and protocol elements via the XDR position value encoded in the Read
95       chunk entry.

97       RPC RDMA Read chunks are employed by this NFS mapping to convey
98       specific NFS data to the server in a manner which may be directly
99       placed. The following sections describe this mapping for versions of
100      the NFS protocol.

102   3. Transfers from NFS Server to NFS Client

104      The RDMA Write list, in the RDMA transport header, allows the client
105      to post one or more buffers into which the server will RDMA Write
106      designated result chunks directly. If the client sends a null write
107      list, then results from the RPC call will be returned as either an
108      inline reply, as chunks in an RDMA Read list of server-posted
109      buffers, or in a client-posted reply buffer.

111      Each posted buffer in a Write list is represented as an array of
112      memory segments. This allows the client some flexibility in
113      submitting discontiguous memory segments into which the server will
114      scatter the result. Each segment is described by a triplet
115      consisting of the segment handle or steering tag (STag), segment
116      length, and memory address or offset.

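The segment triplet and the scattering of a result across a posted buffer can be sketched in Python as follows. This is a hedged illustration, not part of the protocol: Segment and scatter_lengths are hypothetical names, the function assumes the server fills segments in order, and raising ValueError merely stands in for whatever error the server would report when the buffer is too small.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    handle: int  # registered memory handle (STag)
    length: int  # segment length in bytes
    offset: int  # virtual address or offset


def scatter_lengths(result_len: int, chunk: List[Segment]) -> List[int]:
    """Return how many bytes of the result land in each segment."""
    capacity = sum(seg.length for seg in chunk)
    if result_len > capacity:
        # Stand-in for the server's error return (an assumption).
        raise ValueError("posted buffer too small for result")
    written = []
    remaining = result_len
    for seg in chunk:
        n = min(seg.length, remaining)  # fill this segment as far as possible
        written.append(n)
        remaining -= n
    return written
```

For three 4096-byte segments and a 5000-byte result, the sketch places 4096 bytes in the first segment, 904 in the second, and none in the third.
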
118         struct xdr_rdma_segment {
119            uint32 handle;   /* Registered memory handle */
120            uint32 length;   /* Length of the chunk in bytes */
121            uint64 offset;   /* Chunk virtual address or offset */
122         };

124         struct xdr_write_chunk {
125            struct xdr_rdma_segment target<>;
126         };

128         struct xdr_write_list {
129            struct xdr_write_chunk entry;
130            struct xdr_write_list  *next;
131         };

133      The sum of the segment lengths yields the total size of the buffer,
134      which MUST be large enough to accept the result. If the buffer is
135      too small, the server MUST return an XDR encode error. The server
136      MUST return the result data for a posted buffer by progressively
137      filling its segments, perhaps leaving some trailing segments unfilled
138      or partially full if the size of the result is less than the total
139      size of the buffer segments.

141      The server returns the RDMA Write list to the client with the segment
142      length fields overwritten to indicate the amount of data RDMA Written
143      to each segment. Results returned by direct placement MUST NOT be
144      returned by other methods, e.g. by read chunk list or inline. If no
145      result data at all is returned for the element, the server places no
146      data in the buffer(s), but does return zeroes in the segment length
147      fields corresponding to the result.

149      The RDMA Write list allows the client to provide multiple result
150      buffers - each buffer maps to a specific result in the reply. The NFS
151      client and server implementations agree by specifying the mapping of
152      results to buffers for each RPC procedure. The following sections
153      describe this mapping for versions of the NFS protocol.

155      Through the use of RDMA Write lists in NFS requests, it is not
156      necessary to employ the RDMA Read lists in the NFS replies, as
157      described in the RPC/RDMA protocol. This enables more efficient
158      operation, by avoiding the need for the server to expose buffers for
159      RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY
160      additionally employ RDMA Reply chunks to receive entire messages, as
161      described in [RPCRDMA].

163   4. NFS Versions 2 and 3 Mapping

165      A single RDMA Write list entry MAY be posted by the client to receive
166      either the opaque file data from a READ request or the pathname from
167      a READLINK request. The server MUST ignore a Write list for any
168      other NFS procedure, as well as any Write list entries beyond the
169      first in the list.

171      Similarly, a single RDMA Read list entry MAY be posted by the client
172      to supply the opaque file data for a WRITE request or the pathname
173      for a SYMLINK request. The server MUST ignore any Read list for
174      other NFS procedures, as well as additional Read list entries beyond
175      the first in the list.

177      Because there are no NFS version 2 or 3 requests that transfer bulk
178      data in both directions, it is not necessary to post requests
179      containing both Write and Read lists. Any unneeded Read or Write
180      lists are ignored by the server.

182      In the case where the outgoing request or expected incoming reply is
183      larger than the maximum size supported on the connection, it is
184      possible for the RPC layer to post the entire message or result in a
185      special "RDMA_NOMSG" message type which is transferred entirely by
186      RDMA. This is implemented in RPC, below NFS and therefore has no
187      effect on the message contents.

189      Non-RDMA (inline) WRITE transfers MAY employ the
190      "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
191      appropriate value for the server is known to the client. Padding
192      allows the opaque file data to arrive at the server in an aligned
193      fashion, which may improve server performance.

195      The NFS version 2 and 3 protocols are frequently limited in practice
196      to requests containing less than or equal to 8 kilobytes and 32
197      kilobytes of data, respectively. In these cases, it is often
198      practical to support basic operation without employing a
199      configuration exchange as discussed in [RPCRDMA]. The server MUST
200      post buffers large enough to receive the largest possible incoming
201      message (approximately 12KB for NFS version 2, or 36KB for NFS
202      version 3, would be more than sufficient), and the client can post
203      buffers large enough to receive replies based on the "rsize" it is
204      using with the server, plus a fixed overhead for the RPC and NFS
205      headers. Because the server MUST NOT return data in excess of this
206      size, the client can be assured of the adequacy of its posted buffer
207      sizes.

209      Flow control is handled dynamically by the RPC RDMA protocol, and
210      write padding is OPTIONAL and therefore MAY remain unused.

212      Alternatively, if the server is administratively configured to values
213      appropriate for all its clients, the same assurance of
214      interoperability within the domain can be made.

216      The use of a configuration protocol with NFS v2 and v3 is therefore
217      OPTIONAL. Employing a configuration exchange may allow some advantage
218      to server resource management through accurately sizing buffers,
219      enabling the server to know exactly how many RDMA Reads may be in
220      progress at once on the client connection, and enabling client write
221      padding which may be desirable for certain servers when RDMA Read is
222      impractical.

224   5. NFS Version 4 Mapping

226      This specification applies to the first minor version of NFS version
227      4 (NFSv4.0) and any subsequent minor versions that do not override
228      this mapping.

230      The Write list MUST be considered only for the COMPOUND procedure.
231      This procedure returns results from a sequence of operations. Only
232      the opaque file data from an NFS READ operation, and the pathname
233      from a READLINK operation MUST utilize entries from the Write list.

235      If there is no Write list, i.e. the list is null, then any READ or
236      READLINK operations in the COMPOUND MUST return their data inline.
237      The NFSv4.0 client MUST ensure that any result of its READ and
238      READLINK requests fits within its receive buffers, lest an RDMA
239      transport error result upon transfer.

241      The first entry in the Write list MUST be used by the first READ or
242      READLINK in the COMPOUND request. The next Write list entry is used
243      by the next READ or READLINK, and so on. If there are more READ or
244      READLINK operations than Write list entries, then any remaining
245      operations MUST return their results inline.

247      If a Write list entry is presented, then the corresponding READ or
248      READLINK MUST return its data via an RDMA WRITE to the buffer
249      indicated by the Write list entry. If the Write list entry has zero
250      RDMA segments, or if the total size of the segments is zero, then the
251      corresponding READ or READLINK operation MUST return its result
252      inline.

254      The following example shows an RDMA Write list with three posted
255      buffers A, B, and C. The designated operations in the compound
256      request, READ and READLINK, consume the posted buffers by writing
257      their results back to each buffer.

259         RDMA Write list:

261            A --> B --> C

263         Compound request:

265            PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
266                          |                   |                   |
267                          v                   v                   v
268                          A                   B                   C

270      If the client does not want to have the READLINK result returned
271      directly, then it provides a zero length array of segment triplets
272      for buffer B or sets the values in the segment triplet for buffer B
273      to zeros so that the READLINK result MUST be returned inline.

275      The situation is similar for RDMA Read lists sent by the client and
276      applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.

278      Additionally, inline segments too large to fit in posted buffers MAY
279      be transferred in special "RDMA_NOMSG" messages.

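The Write list consumption rules of this section can be sketched in Python. This is a hedged illustration under stated assumptions: assign_write_list is a hypothetical name, operations are represented only by their names, and each Write list entry is reduced to a list of its segment lengths. The sketch returns, for each READ or READLINK, the index of the Write list entry it uses, or None when the result is returned inline.

```python
from typing import Dict, Optional, Sequence

DIRECT_OPS = {"READ", "READLINK"}


def assign_write_list(ops: Sequence[str],
                      write_list: Sequence[Sequence[int]]
                      ) -> Dict[int, Optional[int]]:
    """Map each READ/READLINK position to a Write list index or None."""
    mapping: Dict[int, Optional[int]] = {}
    next_entry = 0
    for i, op in enumerate(ops):
        if op not in DIRECT_OPS:
            continue  # other operations never use the Write list
        if next_entry >= len(write_list):
            mapping[i] = None  # Write list exhausted (or null): inline
            continue
        seg_lengths = write_list[next_entry]
        # An entry with no segments, or zero total size, is still
        # consumed but forces the result inline.
        mapping[i] = next_entry if sum(seg_lengths) > 0 else None
        next_entry += 1
    return mapping
```

Applied to the compound request in the example above with buffer B posted as a zero-length array, the first READ uses entry A, the READLINK is returned inline, and the last READ uses entry C.
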
281      Non-RDMA (inline) WRITE transfers MAY employ the
282      "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
283      appropriate value for the server is known to the client. Padding
284      allows the opaque file data to arrive at the server in an aligned
285      fashion, which may improve server performance. In order to ensure
286      accurate alignment for all data, it is likely that the client will
287      restrict its use of OPTIONAL padding to COMPOUND requests containing
288      only a single WRITE operation.

290      Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
291      COMPOUND is unbounded, even when RDMA chunks are in use. While it
292      might appear that a configuration protocol exchange (such as the one
293      described in [RPCRDMA]) would help, in fact the layering issues
294      involved in building COMPOUNDs by NFS make such a mechanism
295      unworkable.

297      However, typical NFS version 4 clients rarely issue such problematic
298      requests. In practice, they behave in much more predictable ways; in
299      fact, most still support the traditional rsize/wsize mount parameters.
300      Therefore, most NFS version 4 clients function over RPC/RDMA in the
301      same way as NFS versions 2 and 3, operationally.

303      There are, however, advantages to allowing both client and server to
304      operate with prearranged size constraints, for example use of the
305      sizes to better manage the server's response cache. An extension to
306      NFS version 4 supporting a more comprehensive exchange of upper layer
307      parameters is part of [NFSv4.1].

309   6. Security

311      The RDMA transport for ONC RPC supports RPCSEC_GSS security as well
312      as link-level security. The use of RDMA Write to return RPC results
313      does not affect ONC RPC security.

315   7. IANA Considerations

317      NFS use of direct data placement introduces a need for an additional
318      NFS port number assignment for networks which share traditional UDP
319      and TCP port spaces with RDMA services. The iWARP [DDP] [RDMAP]
320      protocol is such an example (InfiniBand is not).

322      NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
323      listen for clients on UDP and TCP port 2049, and additionally, they
324      register these with the portmapper and/or rpcbind [RFC1833] service.
325      However, NFS servers for version 4 [RFC3530] are required by that
326      specification to listen on TCP port 2049, and are not required to
327      register.

329      An NFS version 2 or version 3 server supporting RPC/RDMA on such a
330      network and registering itself with the RPC portmapper MAY choose an
331      arbitrary port, or MAY use the alternative well-known port number
332      assigned for its RPC/RDMA service by IANA. The chosen port MAY be
333      registered with the RPC portmapper under the netid assigned by the
334      requirement in [RPCRDMA].

336      An NFS version 4 server supporting RPC/RDMA on such a network
337      MUST use the alternative well-known port number assigned for its
338      RPC/RDMA service by IANA. Clients SHOULD connect to this well-known
339      port without consulting the RPC portmapper (as for NFSv4/TCP). The
340      following port is assigned to an NFS service over an RPC/RDMA
341      transport:

343         nfs-rdma 2050

345   8. Acknowledgements

347      The authors would like to thank Dave Noveck and Chet Juszczak for
348      their contributions to this document.

350   9. Normative References

352      [RFC2119]
353         S. Bradner, "Key words for use in RFCs to Indicate Requirement
354         Levels",
355         Best Current Practice,
356         BCP 14, RFC 2119, March 1997.

358      [RFC1831]
359         R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
360         Version 2",
361         Standards Track RFC,
362         http://www.ietf.org/rfc/rfc1831.txt

364      [RFC1832]
365         R. Srinivasan, "XDR: External Data Representation Standard",
366         Standards Track RFC,
367         http://www.ietf.org/rfc/rfc1832.txt

369      [RFC1094]
370         "NFS: Network File System Protocol Specification",
371         (NFS version 2) Informational RFC,
372         http://www.ietf.org/rfc/rfc1094.txt

374      [RFC1813]
375         B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
376         Specification",
377         Informational RFC,
378         http://www.ietf.org/rfc/rfc1813.txt

380      [RFC1833]
381         R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
382         Standards Track RFC,
383         http://www.ietf.org/rfc/rfc1833.txt

385      [RFC3530]
386         S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
387         Eisler, D. Noveck, "NFS version 4 Protocol",
388         Standards Track RFC,
389         http://www.ietf.org/rfc/rfc3530.txt

391   10. Informative References

393      [RPCRDMA]
394         T. Talpey, B. Callaghan, "RDMA Transport for ONC RPC",
395         Internet Draft Work in Progress,
396         draft-ietf-nfsv4-rpcrdma

398      [NFSv4.1]
399         S. Shepler et al., ed., "NFSv4 Minor Version 1",
400         Internet Draft Work in Progress,
401         draft-ietf-nfsv4-minorversion1

403      [DDP]
404         H. Shah et al., "Direct Data Placement over Reliable Transports",
405         Standards Track RFC,
406         draft-ietf-rddp-ddp

408      [RDMAP]
409         R. Recio et al., "An RDMA Protocol Specification",
410         Standards Track RFC,
411         draft-ietf-rddp-rdmap

413   11. Authors' Addresses
414      Tom Talpey
415      Network Appliance, Inc.
416      375 Totten Pond Road
417      Waltham, MA 02451 USA

419      Phone: +1 781 768 5329
420      EMail: thomas.talpey@netapp.com

422      Brent Callaghan
423      Apple Computer, Inc.
424      MS: 302-4K
425      2 Infinite Loop
426      Cupertino, CA 95014 USA

428      EMail: brentc@apple.com

430   12. Intellectual Property and Copyright Statements

432      Intellectual Property Statement

434      The IETF takes no position regarding the validity or scope of any
435      Intellectual Property Rights or other rights that might be claimed
436      to pertain to the implementation or use of the technology described
437      in this document or the extent to which any license under such
438      rights might or might not be available; nor does it represent that
439      it has made any independent effort to identify any such rights.
440      Information on the procedures with respect to rights in RFC
441      documents can be found in BCP 78 and BCP 79.

443      Copies of IPR disclosures made to the IETF Secretariat and any
444      assurances of licenses to be made available, or the result of an
445      attempt made to obtain a general license or permission for the use
446      of such proprietary rights by implementers or users of this
447      specification can be obtained from the IETF on-line IPR repository
448      at http://www.ietf.org/ipr.

450      The IETF invites any interested party to bring to its attention any
451      copyrights, patents or patent applications, or other proprietary
452      rights that may cover technology that may be required to implement
453      this standard. Please address the information to the IETF at ietf-
454      ipr@ietf.org.

456      Disclaimer of Validity

458      This document and the information contained herein are provided on
459      an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
460      REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
461      THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
462      EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
463      THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
464      ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
465      PARTICULAR PURPOSE.

467      Copyright Statement

469      Copyright (C) The Internet Society (2006).

471      This document is subject to the rights, licenses and restrictions
472      contained in BCP 78, and except as set forth therein, the authors
473      retain all their rights.

475   Acknowledgement

477      Funding for the RFC Editor function is currently provided by the
478      Internet Society.