Internet-Draft                                               Tom Talpey
Expires: December 2006                                  Brent Callaghan

Document: draft-ietf-nfsv4-nfsdirect-03                      June, 2006

                       NFS Direct Data Placement

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   The RDMA transport for ONC RPC provides direct data placement for NFS
   data. Direct data placement not only reduces the amount of data that
   needs to be copied in an NFS call, but allows much of the data
   movement over the network to be implemented in RDMA hardware. This
   draft describes the use of direct data placement by means of
   server-initiated RDMA operations into client-supplied buffers in a
   Chunk list for implementations of NFS versions 2, 3, and 4 over an
   RDMA transport.

Table of Contents

   1.  Introduction
   2.  Transfers from NFS Client to NFS Server
   3.  Transfers from NFS Server to NFS Client
   4.  NFS Versions 2 and 3 Mapping
   5.  NFS Version 4 Mapping
   6.  Security
   7.  IANA Considerations
   8.  Acknowledgements
   9.  Normative References
   10. Informative References
   11. Authors' Addresses
   12. Intellectual Property and Copyright Statements
   Acknowledgement

1.  Introduction

   The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client
   application to post buffers in a Chunk list for specific arguments
   and results from an RPC call. The RDMA transport header conveys this
   list of client buffer addresses to the server, where the application
   can associate them with client data and use RDMA operations to
   transfer the results directly to and from the posted buffers on the
   client. The client and server must agree on a consistent mapping of
   posted buffers to RPC. This document details the mapping for each
   version of the NFS protocol [RFC1831] [RFC1832] [RFC1094] [RFC1813]
   [RFC3530] [NFSv4.1].

2.  Transfers from NFS Client to NFS Server

   The RDMA Read list, in the RDMA transport header, allows an RPC
   client to marshal RPC call data selectively. Large chunks of data,
   such as the file data of an NFS WRITE request, may be referenced by
   an RDMA Read list and be moved efficiently and placed directly by an
   RDMA READ operation initiated by the server.

   The process of identifying these chunks for the RDMA Read list can be
   implemented entirely within the RPC layer. It is transparent to the
   upper-level protocol, such as NFS. For instance, the file data
   portion of an NFS WRITE request can be selected as an RDMA "chunk"
   within the XDR marshalling code of RPC based on a size criterion,
   independently of the NFS protocol layer. The XDR unmarshalling on the
   receiving system can identify the correspondence between Read chunks
   and protocol elements via the XDR position value encoded in the Read
   chunk entry.

   RPC RDMA Read chunks are employed by this NFS mapping to convey
   specific NFS data to the server in a manner that allows direct
   placement. The following sections describe this mapping for versions
   of the NFS protocol.

3.  Transfers from NFS Server to NFS Client

   The RDMA Write list, in the RDMA transport header, allows the client
   to post one or more buffers into which the server will RDMA Write
   designated result chunks directly. If the client sends a null Write
   list, then results from the RPC call will be returned as either an
   inline reply, as chunks in an RDMA Read list of server-posted
   buffers, or in a client-posted reply buffer.

   Each posted buffer in a Write list is represented as an array of
   memory segments. This allows the client some flexibility in
   submitting discontiguous memory segments into which the server will
   scatter the result. Each segment is described by a triplet
   consisting of the segment handle or steering tag (STag), segment
   length, and memory address or offset.

      struct xdr_rdma_segment {
         uint32 handle;   /* Registered memory handle */
         uint32 length;   /* Length of the chunk in bytes */
         uint64 offset;   /* Chunk virtual address or offset */
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list  *next;
      };

   The sum of the segment lengths yields the total size of the buffer,
   which must be large enough to accept the result. If the buffer is
   too small, the server must return an XDR encode error. The server
   must return the result data for a posted buffer by progressively
   filling its segments, perhaps leaving some trailing segments unfilled
   or partially full if the size of the result is less than the total
   size of the buffer segments.

   The server returns the RDMA Write list to the client with the segment
   length fields overwritten to indicate the amount of data RDMA Written
   to each segment. Results returned by direct placement must not be
   returned by other methods, e.g., by Read chunk list or inline.
   If no result data at all is returned for the element, the server
   places no data in the buffer(s), but does return zeroes in the
   segment length fields corresponding to the result.

   The RDMA Write list allows the client to provide multiple result
   buffers; each buffer must map to a specific result in the reply. The
   NFS client and server implementations must agree on the mapping of
   results to buffers for each RPC procedure. The following sections
   describe this mapping for versions of the NFS protocol.

   Through the use of RDMA Write lists in NFS requests, it is not
   necessary to employ the RDMA Read lists in the NFS replies, as
   described in the RPC/RDMA protocol. This enables more efficient
   operation, by avoiding the need for the server to expose buffers for
   RDMA, and also avoiding "RDMA_DONE" exchanges. Clients may
   additionally employ RDMA Reply chunks to receive entire messages, as
   described in [RPCRDMA].

4.  NFS Versions 2 and 3 Mapping

   A single RDMA Write list entry may be posted by the client to receive
   either the opaque file data from a READ request or the pathname from
   a READLINK request. The server will ignore a Write list for any
   other NFS procedure, as well as any Write list entries beyond the
   first in the list.

   Similarly, a single RDMA Read list entry may be posted by the client
   to supply the opaque file data for a WRITE request or the pathname
   for a SYMLINK request. The server will ignore any Read list for
   other NFS procedures, as well as additional Read list entries beyond
   the first in the list.

   Because there are no NFS version 2 or 3 requests that transfer bulk
   data in both directions, it is not necessary to post requests
   containing both Write and Read lists. Any unneeded Read or Write
   lists are ignored by the server.
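
   As a non-normative illustration, the following C sketch combines the
   rules of Sections 3 and 4: only READ and READLINK results are placed
   through the Write list, and the server fills the segments of the
   posted buffer in order, overwriting each length field with the number
   of bytes placed. All identifiers are hypothetical and are not defined
   by [RPCRDMA]. A memcpy() into a simulated client memory region stands
   in for the RDMA Write, and a return value of -1 stands in for the XDR
   encode error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the xdr_rdma_segment triplet shown in Section 3. */
struct rdma_segment {
    uint32_t handle;   /* registered memory handle (STag) */
    uint32_t length;   /* in: segment capacity; out: bytes placed */
    uint64_t offset;   /* here: offset into the simulated client memory */
};

/* Hypothetical in-memory form of one Write list entry (one posted buffer). */
struct write_chunk {
    struct rdma_segment *seg;
    size_t nseg;
};

/* Subset of NFS version 3 procedure numbers, as assigned in RFC 1813. */
enum nfs3_proc {
    NFS3_READLINK = 5,
    NFS3_READ     = 6,
    NFS3_WRITE    = 7,
    NFS3_SYMLINK  = 10
};

/* Section 4: the Write list is honored only for READ and READLINK. */
int uses_write_list(enum nfs3_proc proc)
{
    return proc == NFS3_READ || proc == NFS3_READLINK;
}

/*
 * Place len bytes of result data into the posted buffer, filling its
 * segments progressively. Each segment length is overwritten with the
 * number of bytes placed there; trailing unused segments are set to 0.
 * Returns 0 on success, or -1 without touching the chunk if the sum of
 * the segment lengths is smaller than the result.
 */
int fill_write_chunk(struct write_chunk *chunk, const uint8_t *data,
                     size_t len, uint8_t *client_mem)
{
    size_t total = 0, done = 0;

    assert(chunk != NULL);
    for (size_t i = 0; i < chunk->nseg; i++)
        total += chunk->seg[i].length;
    if (total < len)
        return -1;

    for (size_t i = 0; i < chunk->nseg; i++) {
        size_t room = chunk->seg[i].length;
        size_t n = (len - done < room) ? len - done : room;

        if (n > 0)  /* stand-in for an RDMA Write to this segment */
            memcpy(client_mem + chunk->seg[i].offset, data + done, n);
        chunk->seg[i].length = (uint32_t)n;
        done += n;
    }
    return 0;
}
```

   For example, placing a 6-byte result into a buffer posted as three
   4-byte segments leaves the returned segment lengths at 4, 2, and 0.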

   In the case where the outgoing request or expected incoming reply is
   larger than the maximum size supported on the connection, it is
   possible for the RPC layer to post the entire message or result in a
   special "RDMA_NOMSG" message type which is transferred entirely by
   RDMA. This is implemented in RPC, below NFS, and therefore has no
   effect on the message contents.

   Non-RDMA (inline) WRITE transfers may optionally employ the
   "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
   appropriate value for the server is known to the client. Padding
   allows the opaque file data to arrive at the server in an aligned
   fashion, which may improve server performance.

   The NFS version 2 and 3 protocols are frequently limited in practice
   to requests containing at most 8 kilobytes and 32 kilobytes of data,
   respectively. In these cases, it is often practical to support basic
   operation without employing a configuration exchange as discussed in
   [RPCRDMA]. The server can post buffers large enough to receive the
   largest possible incoming message (approximately 12KB/36KB would be
   more than sufficient in the above cases), and the client can post
   buffers large enough to receive replies based on the "rsize" it is
   using with the server. Because the server will never return data in
   excess of this size, the client can be assured of the adequacy of its
   posted buffer sizes.

   Flow control is handled dynamically by the RPC RDMA protocol, and
   write padding is optional and therefore may remain unused.

   Alternatively, if the server is administratively configured to values
   appropriate for all its clients, the same assurance of
   interoperability within the domain can be made.

   The use of a configuration protocol with NFS v2 and v3 is therefore
   optional.
   Employing a configuration exchange may allow some advantage to server
   resource management through accurately sizing buffers, enabling the
   server to know exactly how many RDMA Reads may be in progress at once
   on the client connection, and enabling client write padding, which
   may be desirable for certain servers when RDMA Read is impractical.

5.  NFS Version 4 Mapping

   This specification applies to the first minor version of NFS version
   4 (NFSv4.0) and any subsequent minor versions that do not override
   this mapping.

   The Write list will be considered only for the COMPOUND procedure.
   This procedure returns results from a sequence of operations. Only
   the opaque file data from an NFS READ operation and the pathname
   from a READLINK operation will utilize entries from the Write list.

   If there is no Write list, i.e., the list is null, then any READ or
   READLINK operations in the COMPOUND must return their data inline.
   The NFSv4.0 client must ensure that any result of its READ and
   READLINK requests will fit within its receive buffers, or an RDMA
   transport error may occur.

   The first entry in the Write list must be used by the first READ or
   READLINK in the COMPOUND request. The next Write list entry must be
   used by the next READ or READLINK, and so on. If there are more READ
   or READLINK operations than Write list entries, then any remaining
   operations must return their results inline.

   If a Write list entry is presented, then the corresponding READ or
   READLINK must return its data via an RDMA WRITE to the buffer
   indicated by the Write list entry. If the Write list entry has zero
   RDMA segments, or if the total size of the segments is zero, then the
   corresponding READ or READLINK operation must return its result
   inline.

   The following example shows an RDMA Write list with three posted
   buffers A, B, and C.
   The designated operations in the compound request, READ and READLINK,
   consume the posted buffers by writing their results back to each
   buffer.

      RDMA Write list:

         A --> B --> C

      Compound request:

         PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                       |                   |                   |
                       v                   v                   v
                       A                   B                   C

   If the client does not want to have the READLINK result returned
   directly, then it provides a zero length array of segment triplets
   for buffer B or sets the values in the segment triplet for buffer B
   to zeros so that the READLINK result will be returned inline.

   The situation is similar for RDMA Read lists sent by the client and
   applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
   Additionally, inline segments too large to fit in posted buffers may
   be transferred in special "RDMA_NOMSG" messages.

   Non-RDMA (inline) WRITE transfers may optionally employ the
   "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
   appropriate value for the server is known to the client. Padding
   allows the opaque file data to arrive at the server in an aligned
   fashion, which may improve server performance. In order to ensure
   accurate alignment for all data, it is likely that the client will
   restrict its use of optional padding to COMPOUND requests containing
   only a single WRITE operation.

   Unlike in NFS versions 2 and 3, the maximum size of an NFS version 4
   COMPOUND is unbounded, even when RDMA chunks are in use. While it
   might appear that a configuration protocol exchange (such as the one
   described in [RPCRDMA]) would help, in fact the layering issues
   involved in building COMPOUNDs by NFS make such a mechanism
   unworkable. Instead, an extension to NFS version 4 supporting a more
   comprehensive exchange of upper layer (NFSv4) parameters is proposed
   in [NFSv4.1]. This proposal also addresses other uses of the sizes,
   such as in the server's response cache.

6.  Security

   The RDMA transport for ONC RPC supports RPCSEC_GSS security as well
   as link-level security. The use of RDMA Write to return RPC results
   does not affect ONC RPC security.

7.  IANA Considerations

   NFS use of direct data placement may introduce a need for an
   additional NFS port number assignment for networks which share
   traditional UDP and TCP port spaces with RDMA services. The iWARP
   [DDP] [RDMAP] protocol is such an example (InfiniBand is not).

   NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
   listen for clients on UDP and TCP port 2049, and additionally, they
   register these with the portmapper. NFS servers for version 4
   [RFC3530] are required to listen on TCP port 2049, and are not
   required to register.

   An NFS version 2 or version 3 server supporting RPC/RDMA on such a
   network and registering itself with the RPC portmapper may choose an
   arbitrary port, or may be assigned an alternative well-known port
   number for its RPC/RDMA service by IANA. The chosen port must be
   registered with the RPC portmapper under the netid assigned by the
   requirement in [RPCRDMA].

   An NFS version 4 server supporting RPC/RDMA on such a network must be
   assigned an alternative well-known port number for its RPC/RDMA
   service by IANA. Clients will connect to this well-known port
   without consulting the RPC portmapper (as for NFSv4/TCP).

   A server for any subsequent NFS version 4 minor version [NFSv4.1] may
   reuse port 2049 by requiring the client to perform the RDMA session
   negotiation supported by this protocol. If it does not require the
   client to negotiate an RDMA-enabled session, it must use the
   alternative port for RPC/RDMA, as for version 4.

   This is not an issue on non-IP transports such as native InfiniBand,
   where a non-colliding port translation scheme is used [IBPORT].
   On such interfaces, the server can simply listen on the port mapped
   from the IANA-assigned NFS port 2049, or any other port as assigned
   by the native transport. Such assignments are out of the scope of
   IANA, and of this document.

8.  Acknowledgements

   The authors would like to thank Dave Noveck and Chet Juszczak for
   their contributions to this document.

9.  Normative References

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
      Version 2",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc1832.txt

   [RFC1094]
      "NFS: Network File System Protocol Specification",
      (NFS version 2) Informational RFC,
      http://www.ietf.org/rfc/rfc1094.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification",
      Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC3530]
      S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
      Eisler, D. Noveck, "NFS version 4 Protocol",
      Standards Track RFC,
      http://www.ietf.org/rfc/rfc3530.txt

10. Informative References

   [RPCRDMA]
      T. Talpey, B. Callaghan, "RDMA Transport for ONC RPC",
      Internet Draft Work in Progress,
      draft-ietf-nfsv4-rpcrdma

   [NFSv4.1]
      S. Shepler, ed., "NFSv4 Minor Version 1",
      Internet Draft Work in Progress,
      draft-ietf-nfsv4-minorversion1

   [DDP]
      H. Shah et al, "Direct Data Placement over Reliable Transports",
      Internet Draft Work in Progress,
      draft-ietf-rddp-ddp

   [RDMAP]
      R. Recio et al, "An RDMA Protocol Specification",
      Internet Draft Work in Progress,
      draft-ietf-rddp-rdmap

   [IBPORT]
      InfiniBand Trade Association, "IP Addressing Annex",
      available from www.infinibandta.org

11. Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com

12. Intellectual Property and Copyright Statements

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.