idnits 2.17.1 draft-ietf-nfsv4-nfsdirect-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 400. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line 368), which is fine, but *also* found old RFC 2026, Section 10.4C, paragraph 1 text on line 35. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** The document seems to lack an RFC 3978 Section 5.4 Reference to BCP 78. ** The document seems to lack an RFC 3978 Section 5.5 (updated by RFC 4748) Disclaimer -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. 
== No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC1831' is defined on line 303, but no explicit reference was found in the text == Unused Reference: 'RFC1832' is defined on line 309, but no explicit reference was found in the text == Unused Reference: 'RFC1094' is defined on line 314, but no explicit reference was found in the text == Unused Reference: 'RFC1813' is defined on line 319, but no explicit reference was found in the text == Unused Reference: 'RFC3530' is defined on line 325, but no explicit reference was found in the text ** Obsolete normative reference: RFC 1831 (Obsoleted by RFC 5531) ** Obsolete normative reference: RFC 1832 (Obsoleted by RFC 4506) ** Downref: Normative reference to an Informational RFC: RFC 1094 ** Downref: Normative reference to an Informational RFC: RFC 1813 ** Obsolete normative reference: RFC 3530 (Obsoleted by RFC 7530) == Outdated reference: A later version (-09) exists of draft-ietf-nfsv4-rpcrdma-00 == Outdated reference: A later version (-02) exists of draft-ietf-nfsv4-sess-00 Summary: 15 errors (**), 0 flaws (~~), 9 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet-Draft Brent Callaghan 2 Expires: January 2005 Sun Microsystems, Inc. 3 Tom Talpey 4 Network Appliance, Inc. 6 Document: draft-ietf-nfsv4-nfsdirect-00.txt July, 2004 8 NFS Direct Data Placement 10 Status of this Memo 12 By submitting this Internet-Draft, I certify that any applicable 13 patent or other IPR claims of which I am aware have been disclosed, 14 or will be disclosed, and any of which I become aware will be dis- 15 closed, in accordance with RFC 3668. 
17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six 23 months and may be updated, replaced, or obsoleted by other docu- 24 ments at any time. It is inappropriate to use Internet-Drafts as 25 reference material or to cite them other than as "work in 26 progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt The list of Inter- 30 net-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 Copyright Notice 35 Copyright (C) The Internet Society (2004). All Rights Reserved. 37 Abstract 39 The RDMA transport for ONC RPC provides direct data placement for NFS 40 data. Direct data placement not only reduces the amount of data that 41 needs to be copied in an NFS call, but allows much of the data 42 movement over the network to be implemented in RDMA hardware. This 43 draft describes the use of direct data placement by means of server- 44 initiated RDMA operations into client-supplied buffers in a Chunk 45 list for implementations of NFS versions 2, 3, and 4 over an RDMA 46 transport. 48 Table of Contents 50 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 51 2. RDMA Read List . . . . . . . . . . . . . . . . . . . . . . . 2 52 3. RDMA Write List . . . . . . . . . . . . . . . . . . . . . . 3 53 4. NFS Versions 2 and 3 Mapping . . . . . . . . . . . . . . . . 4 54 5. NFS Version 4 Mapping . . . . . . . . . . . . . . . . . . . 5 55 6. Security . . . . . . . . . . . . . . . . . . . . . . . . . . 7 56 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . 7 57 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7 58 9. Normative References . . . . . . . . . . . . . . . . . . . . 7 59 10. Informative References . . . . . . . . . . . . . 
. . . . . 8 60 11. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 9 61 12. Full Copyright Statement . . . . . . . . . . . . . . . . . 9 62 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 10 64 1. Introduction 66 The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client 67 application to post buffers in a Chunk list for specific arguments 68 and results from an RPC call. The RDMA transport header conveys this 69 list of client buffer addresses to the server where the application 70 can associate them with client data and use RDMA operations to 71 transfer the results directly to and from the posted buffers on the 72 client. The client and server must agree on a consistent mapping of 73 posted buffers to RPC. This document details the mapping for each 74 version of the NFS protocol. 76 2. RDMA Read List 78 The RDMA Read list, in the RDMA transport header, allows an RPC 79 client to marshall RPC call data selectively. Large chunks of data, 80 such as the file data of an NFS WRITE request, can be referenced from 81 the RDMA Read list and be moved efficiently and direct-placed by an 82 RDMA READ operation initiated by the server. 84 The process of identifying these chunks for the RDMA Read list can be 85 implemented entirely within the RPC layer. It is transparent to the 86 upper-level protocol, such as NFS. For instance, the file data 87 portion of an NFS WRITE request can be selected as an RDMA "chunk" 88 within the XDR marshalling code of RPC based on a size criterion, 89 independently of the NFS protocol layer. The XDR unmarshalling on the 90 receiving system can identify the correspondence between Read chunks 91 and protocol elements via the XDR position value encoded in the Read 92 chunk entry. 94 However, the implementation of the RDMA Write list requires some help 95 from the NFS protocol layer to identify the candidate chunks. 
Since 96 there is no simple XDR position value to unambiguously label Write 97 chunks, both client and server must agree on a mapping of write chunk 98 entries to protocol elements. 100 3. RDMA Write List 102 The RDMA Write list, in the RDMA transport header, allows the client 103 to post one or more buffers into which the server will RDMA Write 104 designated result chunks directly. If the client sends a null write 105 list, then results from the RPC call will be returned as either an 106 in-line reply, as chunks in an RDMA Read list of server-posted 107 buffers, or in a client-posted reply buffer. 109 Each posted buffer in a Write list is represented as an array of 110 memory segments. This allows the client some flexibility in 111 submitting discontiguous memory segments into which the server will 112 scatter the result. Each segment is described by a triplet 113 consisting of the segment handle or steering tag (STag), segment 114 length, and memory address or offset. 116 struct xdr_rdma_segment { 117 uint32 handle; /* Registered memory handle */ 118 uint32 length; /* Length of the chunk in bytes */ 119 uint64 offset; /* Chunk virtual address or offset */ 120 }; 122 struct xdr_write_chunk { 123 struct xdr_rdma_segment target<>; 124 }; 126 struct xdr_write_list { 127 struct xdr_write_chunk entry; 128 struct xdr_write_list *next; 129 }; 131 The sum of the segment lengths yields the total size of the buffer, 132 which must be large enough to accept the result. If the buffer is 133 too small, the server must return an XDR encode error. The server 134 must return the result data for a posted buffer by progressively 135 filling its segments, perhaps leaving some trailing segments unfilled 136 or partially full if the size of the result is less than the total 137 size of the buffer segments. 139 The server returns the RDMA Write list to the client with the segment 140 length fields overwritten to indicate the amount of data RDMA Written 141 to each segment. 
Results returned by direct placement must not be 142 returned by other methods, e.g. by read chunk list or in-line. 144 The RDMA Write list allows the client to provide multiple result 145 buffers - each buffer must map to a specific result in the reply. The 146 NFS client and server implementations must agree on the mapping of 147 results to buffers for each RPC procedure. The following sections 148 describe this mapping for versions of the NFS protocol. 150 Through the use of RDMA Write lists in NFS requests, it is not 151 necessary to employ the RDMA Read lists in the NFS replies, as 152 described in the RPC/RDMA protocol. This enables more efficient 153 operation, by avoiding the need for the server to expose buffers for 154 RDMA, and also avoiding "RDMA_DONE" messages. 156 4. NFS Versions 2 and 3 Mapping 158 A single RDMA Write list entry may be posted by the client to receive 159 either the opaque file data from a READ request or the pathname from 160 a READLINK request. The server will ignore a Write list for any 161 other NFS procedure, as well as any Write list entries beyond the 162 first in the list. 164 Similarly, a single RDMA Read list entry may be posted by the client 165 to supply the opaque file data for a WRITE request or the pathname 166 for a SYMLINK request. The server will ignore any Read list for 167 other NFS procedures, as well as additional Read list entries beyond 168 the first in the list. 170 Because there are no NFS version 2 or 3 requests that transfer bulk 171 data in both directions, it is not necessary to post requests 172 containing both Write and Read lists. Any unneeded Read or Write 173 lists are ignored by the server. 175 In the case where the outgoing request or expected incoming reply is 176 larger than the maximum size supported on the connection, it is 177 possible for the RPC layer to post the entire message or result in a 178 special "RDMA_NOMSG" message type which is transferred entirely by 179 RDMA. 
This is implemented in RPC, below NFS, and therefore has no 180 effect on the message contents. 182 Non-RDMA (inline) WRITE transfers may optionally employ the 183 "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the 184 appropriate value for the server is known to the client. Padding 185 allows the opaque file data to arrive at the server in an aligned 186 fashion, which may improve server performance. 188 The NFS version 2 and 3 protocols are frequently limited in practice 189 to requests containing no more than 8 kilobytes and 32 190 kilobytes of data, respectively. In these cases, it is often 191 practical to support basic operation without employing a 192 configuration exchange as discussed in [RPCRDMA]. The server can 193 post buffers large enough to receive the largest possible incoming 194 message (approximately 12KB/36KB would be more than sufficient in the 195 above cases), and the client can post buffers large enough to receive 196 replies based on the "rsize" it is using with the server. Flow control 197 is handled dynamically by the RPC/RDMA protocol, and write padding is 198 optional and therefore may remain unused. 200 Alternatively, if the server is administratively configured to values 201 appropriate for all its clients, the same assurance of 202 interoperability within the domain can be made. 204 The use of a configuration protocol with NFS v2 and v3 is therefore 205 optional. Employing a configuration exchange may allow some advantage 206 to server resource management through accurately sizing buffers, 207 enabling the server to know exactly how many RDMA Reads may be in 208 progress at once on the client connection, and enabling client write 209 padding, which may be desirable for certain servers when RDMA Read is 210 impractical. 212 5. NFS Version 4 Mapping 214 This specification applies to the first minor version of NFS version 215 4 (NFSv4.0) and any subsequent minor versions that do not override 216 this mapping.
218 The Write list will be considered only for the COMPOUND procedure. 219 This procedure returns results from a sequence of operations. Only 220 the opaque file data from an NFS READ operation and the pathname 221 from a READLINK operation will utilize entries from the Write list. 223 If there is no Write list, i.e. the list is null, then any READ or 224 READLINK operations in the COMPOUND must return their data either in- 225 line or via RDMA READ (using the Read list). 227 The first entry in the Write list must be used by the first READ or 228 READLINK in the request. The next Write list entry must be used by the 229 next READ or READLINK, and so on. If there are more READ or READLINK 230 operations than Write list entries, then any remaining operations 231 must return their results in-line or via the Read list. 233 If a Write list entry is presented, then the corresponding READ or 234 READLINK must return its data via an RDMA WRITE to the buffer 235 indicated by the Write list entry. If the Write list entry has zero 236 RDMA segments, or if the total size of the segments is zero, then the 237 corresponding READ or READLINK operation must return its result in- 238 line or via the Read list. 240 The following example shows an RDMA Write list with three posted 241 buffers A, B, and C. The designated operations in the compound 242 request, READ and READLINK, consume the posted buffers by writing 243 their results back to each buffer. 245 RDMA Write list: 247 A --> B --> C 249 Compound request: 251 PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ 252 | | | 253 v v v 254 A B C 256 If the client does not want to have the READLINK result returned 257 directly, then it provides a zero-length array of segment triplets 258 for buffer B or sets the values in the segment triplet for buffer B 259 to zeros so that the READLINK result will be returned in-line.
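As a non-normative illustration, the assignment of Write list entries to READ and READLINK operations described above can be sketched in C. The names op_kind, rdma_segment, write_chunk, chunk_capacity, map_write_list, and PLACE_INLINE are hypothetical and are not defined by this draft or by [RPCRDMA]. The sketch assumes, as the example with buffer B suggests, that an empty Write list entry is still consumed by its corresponding operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation tags; only READ and READLINK use the Write list. */
enum op_kind { OP_PUTFH, OP_LOOKUP, OP_READ, OP_READLINK };

/* Mirrors the xdr_rdma_segment triplet from Section 3. */
struct rdma_segment {
    uint32_t handle;   /* Registered memory handle (STag) */
    uint32_t length;   /* Length of the segment in bytes */
    uint64_t offset;   /* Virtual address or offset */
};

/* One Write list entry: a posted buffer made of zero or more segments. */
struct write_chunk {
    const struct rdma_segment *segs;
    size_t nsegs;
};

/* Marker for "return in-line or via the Read list" (an assumption here). */
#define PLACE_INLINE (-1)

/* Total size of a posted buffer: the sum of its segment lengths. */
uint64_t chunk_capacity(const struct write_chunk *chunk)
{
    uint64_t total = 0;
    for (size_t i = 0; i < chunk->nsegs; i++)
        total += chunk->segs[i].length;
    return total;
}

/*
 * For each operation in a COMPOUND, record in place[i] the index of the
 * Write list entry that receives its result, or PLACE_INLINE.
 */
void map_write_list(const enum op_kind *ops, size_t nops,
                    const struct write_chunk *wlist, size_t nwlist,
                    int *place)
{
    size_t next = 0;   /* next unconsumed Write list entry */

    assert(ops != NULL && place != NULL);
    assert(wlist != NULL || nwlist == 0);

    for (size_t i = 0; i < nops; i++) {
        place[i] = PLACE_INLINE;
        if (ops[i] != OP_READ && ops[i] != OP_READLINK)
            continue;              /* operation never uses the Write list */
        if (next >= nwlist)
            continue;              /* more operations than entries */
        if (chunk_capacity(&wlist[next]) > 0)
            place[i] = (int)next;  /* direct placement into this buffer */
        next++;                    /* entry is consumed even when empty */
    }
}
```

Applied to the nine-operation compound shown above with entries A, B, and C, where B is given zero segments, this sketch assigns A to the first READ, leaves the READLINK result to be returned in-line, and assigns C to the final READ.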
261 The situation is similar for RDMA Read lists, and applies to the 262 NFSv4.0 WRITE and SYMLINK procedures as it does for v3. Additionally, inline 263 segments too large to fit in posted buffers may be transferred in 264 special "RDMA_NOMSG" messages. 266 Non-RDMA (inline) WRITE transfers may optionally employ the 267 "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the 268 appropriate value for the server is known to the client. Padding 269 allows the opaque file data to arrive at the server in an aligned 270 fashion, which may improve server performance. In order to ensure 271 accurate alignment for all data, it is likely that the client will 272 restrict its use of optional padding to COMPOUND requests containing 273 only a single WRITE operation. 275 Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 276 COMPOUND is unbounded, even when RDMA chunks are in use. While it 277 might appear that a configuration protocol exchange (as described in 278 [RPCRDMA]) would help, in fact the layering issues involved in 279 building COMPOUNDs by NFS make such a mechanism unworkable. Instead, 280 an extension to NFS version 4 supporting a more comprehensive 281 exchange of upper-layer (NFSv4) parameters is proposed in 282 [NFSSESSIONS]. This proposal also addresses other uses of these sizes, 283 such as in the server's response cache. 285 6. Security 287 The RDMA transport for ONC RPC supports RPCSEC_GSS security as well 288 as link-level security. The use of RDMA Write to return RPC results 289 does not affect ONC RPC security. 291 7. IANA Considerations 293 NFS use of direct data placement introduces no new IANA 294 considerations. 296 8. Acknowledgements 298 The authors would like to thank Dave Noveck and Chet Juszczak for 299 their contributions to this document. 301 9. Normative References 303 [RFC1831] 304 R.
Srinivasan, "RPC: Remote Procedure Call Protocol Specification 305 Version 2", 306 Standards Track RFC, 307 http://www.ietf.org/rfc/rfc1831.txt 309 [RFC1832] 310 R. Srinivasan, "XDR: External Data Representation Standard", 311 Standards Track RFC, 312 http://www.ietf.org/rfc/rfc1832.txt 314 [RFC1094] 315 "NFS: Network File System Protocol Specification", 316 (NFS version 2) Informational RFC, 317 http://www.ietf.org/rfc/rfc1094.txt 319 [RFC1813] 320 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol 321 Specification", 322 Informational RFC, 323 http://www.ietf.org/rfc/rfc1813.txt 325 [RFC3530] 326 S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. 327 Eisler, D. Noveck, "NFS version 4 Protocol", 328 Standards Track RFC, 329 http://www.ietf.org/rfc/rfc3530.txt 331 10. Informative References 333 [RPCRDMA] 334 B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC" 335 Internet Draft Work in Progress, 336 http://www.ietf.org/internet-drafts/ 337 draft-ietf-nfsv4-rpcrdma-00.txt 339 [NFSSESSIONS] 340 T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions" 341 Internet Draft Work in Progress, 342 http://www.ietf.org/internet-drafts/ 343 draft-ietf-nfsv4-sess-00.txt 345 11. Authors' Addresses 347 Brent Callaghan 348 Sun Microsystems, Inc. 349 17 Network Circle 350 Menlo Park, California 94025 USA 352 Phone: +1 650 786 5067 353 EMail: brent.callaghan@sun.com 355 Tom Talpey 356 Network Appliance, Inc. 357 375 Totten Pond Road 358 Waltham, MA 02451 USA 360 Phone: +1 781 768 5329 361 EMail: thomas.talpey@netapp.com 363 12. Full Copyright Statement 365 Copyright (C) The Internet Society (2004). This document is sub- 366 ject to the rights, licenses and restrictions contained in BCP 78 367 and except as set forth therein, the authors retain all their 368 rights. 
370 This document and the information contained herein are provided on 371 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REP- 372 RESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE 373 INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR 374 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 375 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 376 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 378 Intellectual Property 380 The IETF takes no position regarding the validity or scope of any 381 Intellectual Property Rights or other rights that might be claimed 382 to pertain to the implementation or use of the technology described 383 in this document or the extent to which any license under such 384 rights might or might not be available; nor does it represent that 385 it has made any independent effort to identify any such rights. 386 Information on the procedures with respect to rights in RFC docu- 387 ments can be found in BCP 78 and BCP 79. 389 Copies of IPR disclosures made to the IETF Secretariat and any 390 assurances of licenses to be made available, or the result of an 391 attempt made to obtain a general license or permission for the use 392 of such proprietary rights by implementers or users of this speci- 393 fication can be obtained from the IETF on-line IPR repository at 394 http://www.ietf.org/ipr. 396 The IETF invites any interested party to bring to its attention any 397 copyrights, patents or patent applications, or other proprietary 398 rights that may cover technology that may be required to implement 399 this standard. Please address the information to the IETF at ietf- 400 ipr@ietf.org. 402 Acknowledgement 404 Funding for the RFC Editor function is currently provided by the 405 Internet Society.