idnits 2.17.1

draft-ietf-nfsv4-nfsdirect-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 387.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** The document seems to lack an RFC 3978 Section 5.4 Reference to BCP 78.

  ** The document seems to lack an RFC 3978 Section 5.5 (updated by RFC 4748)
     Disclaimer -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

  ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure
     Acknowledgement -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure
     Acknowledgement -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC1831' is defined on line 307, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1832' is defined on line 313, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1094' is defined on line 318, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1813' is defined on line 323, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3530' is defined on line 329, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 1831 (Obsoleted by RFC 5531)

  ** Obsolete normative reference: RFC 1832 (Obsoleted by RFC 4506)

  ** Downref: Normative reference to an Informational RFC: RFC 1094

  ** Downref: Normative reference to an Informational RFC: RFC 1813

  ** Obsolete normative reference: RFC 3530 (Obsoleted by RFC 7530)

     Summary: 13 errors (**), 0 flaws (~~), 7 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2     Internet-Draft                                       Brent Callaghan
3     Expires: April 2006                                       Tom Talpey

5     Document: draft-ietf-nfsv4-nfsdirect-02                October, 2005

7                          NFS Direct Data Placement

9     Status of this Memo

11    By submitting this Internet-Draft, each author represents that any
12    applicable patent or other IPR claims of which he or she is aware
13    have been or will be disclosed, and any of which he or she becomes
14    aware will be disclosed, in accordance with Section 6 of BCP 79.

16    Internet-Drafts are working documents of the Internet Engineering
17    Task Force (IETF), its areas, and its working groups. Note that
18    other groups may also distribute working documents as
19    Internet-Drafts.

21    Internet-Drafts are draft documents valid for a maximum of six
22    months and may be updated, replaced, or obsoleted by other
23    documents at any time. It is inappropriate to use Internet-Drafts as
24    reference material or to cite them other than as "work in
25    progress."

27    The list of current Internet-Drafts can be accessed at
28    http://www.ietf.org/ietf/1id-abstracts.txt
29    The list of Internet-Draft Shadow Directories can be accessed at
30    http://www.ietf.org/shadow.html.

32    Abstract

34    The RDMA transport for ONC RPC provides direct data placement for NFS
35    data. Direct data placement not only reduces the amount of data that
36    needs to be copied in an NFS call, but allows much of the data
37    movement over the network to be implemented in RDMA hardware. This
38    draft describes the use of direct data placement by means of
39    server-initiated RDMA operations into client-supplied buffers in a Chunk
40    list for implementations of NFS versions 2, 3, and 4 over an RDMA
41    transport.

43    Table of Contents

45    1. Introduction
46    2. Transfers from NFS Client to NFS Server
47    3. Transfers from NFS Server to NFS Client
48    4. NFS Versions 2 and 3 Mapping
49    5. NFS Version 4 Mapping
50    6. Security
51    7. IANA Considerations
52    8. Acknowledgements
53    9. Normative References
54    10. Informative References
55    11. Authors' Addresses
56    12. Intellectual Property and Copyright Statements
57    Acknowledgement

59    1. Introduction

61    The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client
62    application to post buffers in a Chunk list for specific arguments
63    and results from an RPC call. The RDMA transport header conveys this
64    list of client buffer addresses to the server where the application
65    can associate them with client data and use RDMA operations to
66    transfer the results directly to and from the posted buffers on the
67    client. The client and server must agree on a consistent mapping of
68    posted buffers to RPC. This document details the mapping for each
69    version of the NFS protocol.

71    2. Transfers from NFS Client to NFS Server

73    The RDMA Read list, in the RDMA transport header, allows an RPC
74    client to marshal RPC call data selectively. Large chunks of data,
75    such as the file data of an NFS WRITE request, may be referenced by
76    an RDMA Read list and be moved efficiently and directly-placed by an
77    RDMA READ operation initiated by the server.

79    The process of identifying these chunks for the RDMA Read list can be
80    implemented entirely within the RPC layer. It is transparent to the
81    upper-level protocol, such as NFS.
      For instance, the file data
82    portion of an NFS WRITE request can be selected as an RDMA "chunk"
83    within the XDR marshalling code of RPC based on a size criterion,
84    independently of the NFS protocol layer. The XDR unmarshalling on the
85    receiving system can identify the correspondence between Read chunks
86    and protocol elements via the XDR position value encoded in the Read
87    chunk entry.

89    RPC RDMA Read chunks are employed by this NFS mapping to convey
90    specific NFS data to the server in a manner which may be directly
91    placed. The following sections describe this mapping for versions of
92    the NFS protocol.

94    3. Transfers from NFS Server to NFS Client

96    The RDMA Write list, in the RDMA transport header, allows the client
97    to post one or more buffers into which the server will RDMA Write
98    designated result chunks directly. If the client sends a null write
99    list, then results from the RPC call will be returned as either an
100   inline reply, as chunks in an RDMA Read list of server-posted
101   buffers, or in a client-posted reply buffer.

103   Each posted buffer in a Write list is represented as an array of
104   memory segments. This allows the client some flexibility in
105   submitting discontiguous memory segments into which the server will
106   scatter the result. Each segment is described by a triplet
107   consisting of the segment handle or steering tag (STag), segment
108   length, and memory address or offset.

110       struct xdr_rdma_segment {
111           uint32 handle;   /* Registered memory handle */
112           uint32 length;   /* Length of the chunk in bytes */
113           uint64 offset;   /* Chunk virtual address or offset */
114       };

116       struct xdr_write_chunk {
117           struct xdr_rdma_segment target<>;
118       };

120       struct xdr_write_list {
121           struct xdr_write_chunk entry;
122           struct xdr_write_list  *next;
123       };

125   The sum of the segment lengths yields the total size of the buffer,
126   which must be large enough to accept the result.
      If the buffer is
127   too small, the server must return an XDR encode error. The server
128   must return the result data for a posted buffer by progressively
129   filling its segments, perhaps leaving some trailing segments unfilled
130   or partially full if the size of the result is less than the total
131   size of the buffer segments.

133   The server returns the RDMA Write list to the client with the segment
134   length fields overwritten to indicate the amount of data RDMA Written
135   to each segment. Results returned by direct placement must not be
136   returned by other methods, e.g. by read chunk list or inline. If no
137   result data at all is returned for the element, the server places no
138   data in the buffer(s), but does return zeroes in the segment length
139   fields corresponding to the result.

141   The RDMA Write list allows the client to provide multiple result
142   buffers - each buffer must map to a specific result in the reply. The
143   NFS client and server implementations must agree on the mapping of
144   results to buffers for each RPC procedure. The following sections
145   describe this mapping for versions of the NFS protocol.

147   Through the use of RDMA Write lists in NFS requests, it is not
148   necessary to employ the RDMA Read lists in the NFS replies, as
149   described in the RPC/RDMA protocol. This enables more efficient
150   operation, by avoiding the need for the server to expose buffers for
151   RDMA, and also avoiding "RDMA_DONE" exchanges. Clients may
152   additionally employ RDMA Reply chunks to receive entire messages, as
153   described in [RPCRDMA].

155   4. NFS Versions 2 and 3 Mapping

157   A single RDMA Write list entry may be posted by the client to receive
158   either the opaque file data from a READ request or the pathname from
159   a READLINK request. The server will ignore a Write list for any
160   other NFS procedure, as well as any Write list entries beyond the
161   first in the list.

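
As a rough illustration only, the segment-filling rule from Section 3 might be applied by a server to the single Write list entry it honors for a READ or READLINK result along the following lines. This is a sketch under stated assumptions, not text from the specification: the names rdma_segment and fill_write_chunk are hypothetical, no RDMA operation is actually issued, and the -1 return value merely stands in for the XDR encode error the server must report when the posted buffer is too small.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C mirror of the draft's xdr_rdma_segment triplet. */
struct rdma_segment {
    uint32_t handle;  /* registered memory handle (STag) */
    uint32_t length;  /* in: segment capacity; out: bytes written */
    uint64_t offset;  /* virtual address or offset */
};

/*
 * Hypothetical helper: distribute a result of result_len bytes over
 * the segments of one Write chunk, filling them in order.
 *
 * On success, each length field is overwritten with the number of
 * bytes placed in that segment (zero for unused trailing segments)
 * and 0 is returned. If the summed segment lengths cannot hold the
 * result, -1 is returned and the segments are left unchanged.
 */
int fill_write_chunk(struct rdma_segment *segs, size_t nsegs,
                     uint64_t result_len)
{
    uint64_t capacity = 0;
    for (size_t i = 0; i < nsegs; i++)
        capacity += segs[i].length;

    if (result_len > capacity)
        return -1;  /* buffer too small for the result */

    uint64_t remaining = result_len;
    for (size_t i = 0; i < nsegs; i++) {
        uint32_t n = segs[i].length;
        if (remaining < n)
            n = (uint32_t)remaining;
        /* A real server would RDMA Write n bytes to this segment here. */
        segs[i].length = n;
        remaining -= n;
    }
    return 0;
}
```

With three 4096-byte segments and a 5000-byte result, this sketch reports 4096, 904, and 0 bytes in the returned length fields, which corresponds to one full segment, one partially full segment, and one unfilled trailing segment.
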
163   Similarly, a single RDMA Read list entry may be posted by the client
164   to supply the opaque file data for a WRITE request or the pathname
165   for a SYMLINK request. The server will ignore any Read list for
166   other NFS procedures, as well as additional Read list entries beyond
167   the first in the list.

169   Because there are no NFS version 2 or 3 requests that transfer bulk
170   data in both directions, it is not necessary to post requests
171   containing both Write and Read lists. Any unneeded Read or Write
172   lists are ignored by the server.

174   In the case where the outgoing request or expected incoming reply is
175   larger than the maximum size supported on the connection, it is
176   possible for the RPC layer to post the entire message or result in a
177   special "RDMA_NOMSG" message type which is transferred entirely by
178   RDMA. This is implemented in RPC, below NFS and therefore has no
179   effect on the message contents.

181   Non-RDMA (inline) WRITE transfers may optionally employ the
182   "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
183   appropriate value for the server is known to the client. Padding
184   allows the opaque file data to arrive at the server in an aligned
185   fashion, which may improve server performance.

187   The NFS version 2 and 3 protocols are frequently limited in practice
188   to requests containing no more than 8 kilobytes and 32
189   kilobytes of data, respectively. In these cases, it is often
190   practical to support basic operation without employing a
191   configuration exchange as discussed in [RPCRDMA]. The server can
192   post buffers large enough to receive the largest possible incoming
193   message (approximately 12KB/36KB would be more than sufficient in the
194   above cases), and the client can post buffers large enough to receive
195   replies based on the "rsize" it is using to the server.
      Because the
196   server will never return data in excess of this size, the client can
197   be assured of the adequacy of its posted buffer sizes.

199   Flow control is handled dynamically by the RPC RDMA protocol, and
200   write padding is optional and therefore may remain unused.

202   Alternatively, if the server is administratively configured to values
203   appropriate for all its clients, the same assurance of
204   interoperability within the domain can be made.

206   The use of a configuration protocol with NFS v2 and v3 is therefore
207   optional. Employing a configuration exchange may allow some advantage
208   to server resource management through accurately sizing buffers,
209   enabling the server to know exactly how many RDMA Reads may be in
210   progress at once on the client connection, and enabling client write
211   padding which may be desirable for certain servers when RDMA Read is
212   impractical.

214   5. NFS Version 4 Mapping

216   This specification applies to the first minor version of NFS version
217   4 (NFSv4.0) and any subsequent minor versions that do not override
218   this mapping.

220   The Write list will be considered only for the COMPOUND procedure.
221   This procedure returns results from a sequence of operations. Only
222   the opaque file data from an NFS READ operation, and the pathname
223   from a READLINK operation will utilize entries from the Write list.

225   If there is no Write list, i.e. the list is null, then any READ or
226   READLINK operations in the COMPOUND must return their data inline.
227   The NFSv4.0 client must ensure that any result of its READ and
228   READLINK requests will fit within its receive buffers, or an RDMA
229   transport error may occur.

231   The first entry in the Write list must be used by the first READ or
232   READLINK in the COMPOUND request. The next Write list entry is used
233   by the next READ or READLINK, and so on.
      If there are more READ or
234   READLINK operations than Write list entries, then any remaining
235   operations must return their results inline.

237   If a Write list entry is presented, then the corresponding READ or
238   READLINK must return its data via an RDMA WRITE to the buffer
239   indicated by the Write list entry. If the Write list entry has zero
240   RDMA segments, or if the total size of the segments is zero, then the
241   corresponding READ or READLINK operation must return its result
242   inline.

244   The following example shows an RDMA Write list with three posted
245   buffers A, B, and C. The designated operations in the compound
246   request, READ and READLINK, consume the posted buffers by writing
247   their results back to each buffer.

249       RDMA Write list:

251           A --> B --> C

253       Compound request:

255           PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
256                         |                   |                   |
257                         v                   v                   v
258                         A                   B                   C

260   If the client does not want to have the READLINK result returned
261   directly, then it provides a zero length array of segment triplets
262   for buffer B or sets the values in the segment triplet for buffer B
263   to zeros so that the READLINK result will be returned inline.

265   The situation is similar for RDMA Read lists sent by the client and
266   applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
267   Additionally, inline segments too large to fit in posted buffers may
268   be transferred in special "RDMA_NOMSG" messages.

270   Non-RDMA (inline) WRITE transfers may optionally employ the
271   "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
272   appropriate value for the server is known to the client. Padding
273   allows the opaque file data to arrive at the server in an aligned
274   fashion, which may improve server performance. In order to ensure
275   accurate alignment for all data, it is likely that the client will
276   restrict its use of optional padding to COMPOUND requests containing
277   only a single WRITE operation.

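
The Write list assignment rule described earlier in this section can be sketched roughly as follows. This is an illustrative sketch rather than part of the specification: the op_kind enumeration, the RETURN_INLINE marker, and the assign_write_chunks function are all hypothetical names, and each Write list entry is reduced to the total size of its segments (zero meaning "no segments" or "all segments zero length").

```c
#include <stddef.h>

/* Hypothetical, reduced set of COMPOUND operation kinds. */
enum op_kind { OP_PUTFH, OP_LOOKUP, OP_READ, OP_READLINK, OP_OTHER };

/* Marker: the operation uses no Write list entry (result is inline). */
#define RETURN_INLINE (-1)

/*
 * Hypothetical helper: walk the operations of a COMPOUND and record,
 * for each one, the index of the Write list entry that receives its
 * result, or RETURN_INLINE.
 *
 * chunk_bytes[j] is the summed segment length of Write list entry j.
 * Entries are consumed in order by READ and READLINK only. An entry
 * whose total size is zero is still consumed, but the matching
 * operation returns its result inline. READ/READLINK operations
 * beyond the end of the list also return inline.
 */
void assign_write_chunks(const enum op_kind *ops, size_t nops,
                         const unsigned long *chunk_bytes, size_t nchunks,
                         int *placement)
{
    size_t next = 0;  /* next unconsumed Write list entry */

    for (size_t i = 0; i < nops; i++) {
        placement[i] = RETURN_INLINE;
        if (ops[i] != OP_READ && ops[i] != OP_READLINK)
            continue;           /* never uses the Write list */
        if (next >= nchunks)
            continue;           /* list exhausted: inline */
        if (chunk_bytes[next] != 0)
            placement[i] = (int)next;
        next++;                 /* entry consumed even when empty */
    }
}
```

Applied to the example above with buffer B posted as a zero-length entry, the sketch directs the first READ to entry 0 (A), returns the READLINK result inline, and directs the final READ to entry 2 (C).
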
279   Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
280   COMPOUND is unbounded, even when RDMA chunks are in use. While it
281   might appear that a configuration protocol exchange (such as the one
282   described in [RPCRDMA]) would help, in fact the layering issues
283   involved in building COMPOUNDs by NFS make such a mechanism
284   unworkable. Instead, an extension to NFS version 4 supporting a more
285   comprehensive exchange of upper layer (NFSv4) parameters is proposed
286   in [NFSSESS]. This proposal also addresses other uses of the sizes,
287   such as in the server's response cache.

289   6. Security

291   The RDMA transport for ONC RPC supports RPCSEC_GSS security as well
292   as link-level security. The use of RDMA Write to return RPC results
293   does not affect ONC RPC security.

295   7. IANA Considerations

297   NFS use of direct data placement introduces no new IANA
298   considerations.

300   8. Acknowledgements

302   The authors would like to thank Dave Noveck and Chet Juszczak for
303   their contributions to this document.

305   9. Normative References

307   [RFC1831]
308       R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
309       Version 2",
310       Standards Track RFC,
311       http://www.ietf.org/rfc/rfc1831.txt

313   [RFC1832]
314       R. Srinivasan, "XDR: External Data Representation Standard",
315       Standards Track RFC,
316       http://www.ietf.org/rfc/rfc1832.txt

318   [RFC1094]
319       "NFS: Network File System Protocol Specification",
320       (NFS version 2) Informational RFC,
321       http://www.ietf.org/rfc/rfc1094.txt

323   [RFC1813]
324       B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
325       Specification",
326       Informational RFC,
327       http://www.ietf.org/rfc/rfc1813.txt

329   [RFC3530]
330       S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
331       Eisler, D. Noveck, "NFS version 4 Protocol",
332       Standards Track RFC,
333       http://www.ietf.org/rfc/rfc3530.txt

335   10. Informative References

337   [RPCRDMA]
338       B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
339       Internet Draft Work in Progress,
340       draft-ietf-nfsv4-rpcrdma

342   [NFSSESS]
343       T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions",
344       Internet Draft Work in Progress,
345       draft-ietf-nfsv4-sess

347   11. Authors' Addresses

349       Brent Callaghan
350       1614 Montalto Dr.
351       Mountain View, California 94040 USA

353       Phone: +1 650 968 2333
354       EMail: brent.callaghan@gmail.com

356       Tom Talpey
357       Network Appliance, Inc.
358       375 Totten Pond Road
359       Waltham, MA 02451 USA

361       Phone: +1 781 768 5329
362       EMail: thomas.talpey@netapp.com

364   12. Intellectual Property and Copyright Statements

366   Intellectual Property Statement
367   The IETF takes no position regarding the validity or scope of any
368   Intellectual Property Rights or other rights that might be claimed
369   to pertain to the implementation or use of the technology described
370   in this document or the extent to which any license under such
371   rights might or might not be available; nor does it represent that
372   it has made any independent effort to identify any such rights.
373   Information on the procedures with respect to rights in RFC
374   documents can be found in BCP 78 and BCP 79.

376   Copies of IPR disclosures made to the IETF Secretariat and any
377   assurances of licenses to be made available, or the result of an
378   attempt made to obtain a general license or permission for the use
379   of such proprietary rights by implementers or users of this
380   specification can be obtained from the IETF on-line IPR repository at
381   http://www.ietf.org/ipr.

383   The IETF invites any interested party to bring to its attention any
384   copyrights, patents or patent applications, or other proprietary
385   rights that may cover technology that may be required to implement
386   this standard. Please address the information to the IETF at
387   ietf-ipr@ietf.org.

389   Disclaimer of Validity

391   This document and the information contained herein are provided on
392   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
393   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
394   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
395   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
396   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
397   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

399   Copyright Statement

401   Copyright (C) The Internet Society (2005). This document is
402   subject to the rights, licenses and restrictions contained in BCP 78,
403   and except as set forth therein, the authors retain all their
404   rights.

406   Acknowledgement

408   Funding for the RFC Editor function is currently provided by the
409   Internet Society.