idnits 2.17.1 draft-ietf-nfsv4-nfsdirect-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 450. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 468. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 474. ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: The server returns the RDMA Write list to the client with the segment length fields overwritten to indicate the amount of data RDMA Written to each segment. Results returned by direct placement MUST not be returned by other methods, e.g. by read chunk list or inline. 
If no result data at all is returned for the element, the server places no data in the buffer(s), but does return zeroes in the segment length fields corresponding to the result. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 7, 2007) is 6198 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 1831 (Obsoleted by RFC 5531) ** Downref: Normative reference to an Informational RFC: RFC 1094 ** Downref: Normative reference to an Informational RFC: RFC 1813 ** Obsolete normative reference: RFC 3530 (Obsoleted by RFC 7530) Summary: 6 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 NFSv4 Working Group Tom Talpey 3 Internet-Draft Network Appliance, Inc. 4 Intended status: Standards Track Brent Callaghan 5 Expires: November 8, 2007 Apple Computer, Inc. 
6 May 7, 2007 8 NFS Direct Data Placement 9 draft-ietf-nfsv4-nfsdirect-05 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other 25 documents at any time. It is inappropriate to use Internet-Drafts 26 as reference material or to cite them other than as "work in 27 progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 Abstract 37 The RDMA transport for ONC RPC provides direct data placement for NFS 38 data. Direct data placement not only reduces the amount of data that 39 needs to be copied in an NFS call, but allows much of the data 40 movement over the network to be implemented in RDMA hardware. This 41 draft describes the use of direct data placement by means of server- 42 initiated RDMA operations into client-supplied buffers in a Chunk 43 list for implementations of NFS versions 2, 3, and 4 over an RDMA 44 transport. 46 Table of Contents 48 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 49 2. Transfers from NFS Client to NFS Server . . . . . . . . . . 2 50 3. Transfers from NFS Server to NFS Client . . . . . . . . . . 3 51 4. NFS Versions 2 and 3 Mapping . . . . . . . . . . . . . . . . 4 52 5. NFS Version 4 Mapping . . . . . . . . . . . . . . . . . . . 5 53 6. Security . . . . . . . . . . . . . . . . . 
. . . . . . . . . 7 54 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . 7 55 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8 56 9. Normative References . . . . . . . . . . . . . . . . . . . . 8 57 10. Informative References . . . . . . . . . . . . . . . . . . 9 58 11. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 9 59 12. Intellectual Property and Copyright Statements . . . . . 10 60 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 10 62 Requirements Language 64 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 65 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 66 document are to be interpreted as described in [RFC2119]. 68 1. Introduction 70 The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client 71 application to post buffers in a Chunk list for specific arguments 72 and results from an RPC call. The RDMA transport header conveys this 73 list of client buffer addresses to the server where the application 74 can associate them with client data and use RDMA operations to 75 transfer the results directly to and from the posted buffers on the 76 client. The client and server must agree on a consistent mapping of 77 posted buffers to RPC. This document details the mapping for each 78 version of the NFS protocol [RFC1831] [RFC4506] [RFC1094] [RFC1813] 79 [RFC3530] [NFSv4.1]. 81 2. Transfers from NFS Client to NFS Server 83 The RDMA Read list, in the RDMA transport header, allows an RPC 84 client to marshal RPC call data selectively. Large chunks of data, 85 such as the file data of an NFS WRITE request, MAY be referenced by 86 an RDMA Read list and be moved efficiently and directly-placed by an 87 RDMA READ operation initiated by the server. 89 The process of identifying these chunks for the RDMA Read list can be 90 implemented entirely within the RPC layer. It is transparent to the 91 upper-level protocol, such as NFS. 
For instance, the file data 92 portion of an NFS WRITE request can be selected as an RDMA "chunk" 93 within the XDR marshaling code of RPC based on a size criterion, 94 independently of the NFS protocol layer. The XDR unmarshaling on the 95 receiving system can identify the correspondence between Read chunks 96 and protocol elements via the XDR position value encoded in the Read 97 chunk entry. 99 RPC RDMA Read chunks are employed by this NFS mapping to convey 100 specific NFS data to the server in a manner which may be directly 101 placed. The following sections describe this mapping for versions of 102 the NFS protocol. 104 3. Transfers from NFS Server to NFS Client 106 The RDMA Write list, in the RDMA transport header, allows the client 107 to post one or more buffers into which the server will RDMA Write 108 designated result chunks directly. If the client sends a null write 109 list, then results from the RPC call will be returned as either an 110 inline reply, as chunks in an RDMA Read list of server-posted 111 buffers, or in a client-posted reply buffer. 113 Each posted buffer in a Write list is represented as an array of 114 memory segments. This allows the client some flexibility in 115 submitting discontiguous memory segments into which the server will 116 scatter the result. Each segment is described by a triplet 117 consisting of the segment handle or steering tag (STag), segment 118 length, and memory address or offset.

      struct xdr_rdma_segment {
         uint32 handle;   /* Registered memory handle */
         uint32 length;   /* Length of the chunk in bytes */
         uint64 offset;   /* Chunk virtual address or offset */
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list  *next;
      };

135 The sum of the segment lengths yields the total size of the buffer, 136 which MUST be large enough to accept the result. 
If the buffer is 137 too small, the server MUST return an XDR encode error. The server 138 MUST return the result data for a posted buffer by progressively 139 filling its segments, perhaps leaving some trailing segments unfilled 140 or partially full if the size of the result is less than the total 141 size of the buffer segments. 143 The server returns the RDMA Write list to the client with the segment 144 length fields overwritten to indicate the amount of data RDMA Written 145 to each segment. Results returned by direct placement MUST NOT be 146 returned by other methods, e.g., by read chunk list or inline. If no 147 result data at all is returned for the element, the server places no 148 data in the buffer(s), but does return zeroes in the segment length 149 fields corresponding to the result. 151 The RDMA Write list allows the client to provide multiple result 152 buffers; each buffer maps to a specific result in the reply. The NFS 153 client and server implementations agree on the mapping of 154 results to buffers for each RPC procedure. The following sections 155 describe this mapping for versions of the NFS protocol. 157 Through the use of RDMA Write lists in NFS requests, it is not 158 necessary to employ the RDMA Read lists in the NFS replies, as 159 described in the RPC/RDMA protocol. This enables more efficient 160 operation, by avoiding the need for the server to expose buffers for 161 RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY 162 additionally employ RDMA Reply chunks to receive entire messages, as 163 described in [RPCRDMA]. 165 4. NFS Versions 2 and 3 Mapping 167 A single RDMA Write list entry MAY be posted by the client to receive 168 either the opaque file data from a READ request or the pathname from 169 a READLINK request. The server MUST ignore a Write list for any 170 other NFS procedure, as well as any Write list entries beyond the 171 first in the list. 
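As an illustration of the segment-filling rule in Section 3, the following is a minimal sketch of how a server might distribute one result across the segments of a single posted buffer and overwrite the length fields to report what was placed. The names rdma_segment and fill_write_chunk are hypothetical and do not come from this document or from [RPCRDMA]; the RDMA Write itself is not performed and is only marked by a comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the xdr_rdma_segment triplet in Section 3 (hypothetical C form). */
struct rdma_segment {
    uint32_t handle;   /* registered memory handle (STag) */
    uint32_t length;   /* posted length in; bytes written out */
    uint64_t offset;   /* virtual address or offset */
};

/*
 * Hypothetical helper: place result_len bytes into the segments in order.
 * On success, each segment length is overwritten with the number of bytes
 * placed in it (zero for unused trailing segments) and 0 is returned.
 * If the posted buffer is too small, the segments are left untouched and
 * -1 is returned so the caller can report an XDR encode error.
 */
int fill_write_chunk(struct rdma_segment *segs, size_t nsegs,
                     uint64_t result_len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nsegs; i++)
        total += segs[i].length;
    if (total < result_len)
        return -1;

    uint64_t remaining = result_len;
    for (size_t i = 0; i < nsegs; i++) {
        uint32_t n = segs[i].length;
        if (remaining < n)
            n = (uint32_t)remaining;
        /* A real server would RDMA Write n bytes to (handle, offset) here. */
        segs[i].length = n;
        remaining -= n;
    }
    return 0;
}
```

With three 4096-byte segments and a 5000-byte result, this sketch reports lengths of 4096, 904, and 0. A result of zero bytes leaves every length field at zero, matching the "no result data" case.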
173 Similarly, a single RDMA Read list entry MAY be posted by the client 174 to supply the opaque file data for a WRITE request or the pathname 175 for a SYMLINK request. The server MUST ignore any Read list for 176 other NFS procedures, as well as additional Read list entries beyond 177 the first in the list. 179 Because there are no NFS version 2 or 3 requests that transfer bulk 180 data in both directions, it is not necessary to post requests 181 containing both Write and Read lists. Any unneeded Read or Write 182 lists are ignored by the server. 184 In the case where the outgoing request or expected incoming reply is 185 larger than the maximum size supported on the connection, it is 186 possible for the RPC layer to post the entire message or result in a 187 special "RDMA_NOMSG" message type which is transferred entirely by 188 RDMA. This is implemented in RPC, below NFS and therefore has no 189 effect on the message contents. 191 Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the 192 "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the 193 appropriate value for the server is known to the client. Padding 194 allows the opaque file data to arrive at the server in an aligned 195 fashion, which may improve server performance. 197 The NFS version 2 and 3 protocols are frequently limited in practice 198 to requests containing less than or equal to 8 kilobytes and 32 199 kilobytes of data, respectively. In these cases, it is often 200 practical to support basic operation without employing a 201 configuration exchange as discussed in [RPCRDMA]. The server MUST 202 post buffers large enough to receive the largest possible incoming 203 message (approximately 12KB for NFS version 2, or 36KB for NFS 204 version 3, would be vastly sufficient), and the client can post 205 buffers large enough to receive replies based on the "rsize" it is 206 using to the server, plus a fixed overhead for the RPC and NFS 207 headers. 
Because the server MUST NOT return data in excess of this 208 size, the client can be assured of the adequacy of its posted buffer 209 sizes. 211 Flow control is handled dynamically by the RPC RDMA protocol, and 212 write padding is OPTIONAL and therefore MAY remain unused. 214 Alternatively, if the server is administratively configured to values 215 appropriate for all its clients, the same assurance of 216 interoperability within the domain can be made. 218 The use of a configuration protocol with NFS v2 and v3 is therefore 219 OPTIONAL. Employing a configuration exchange may allow some advantage 220 to server resource management through accurately sizing buffers, 221 enabling the server to know exactly how many RDMA Reads may be in 222 progress at once on the client connection, and enabling client write 223 padding which may be desirable for certain servers when RDMA Read is 224 impractical. 226 5. NFS Version 4 Mapping 228 This specification applies to the first minor version of NFS version 229 4 (NFSv4.0) and any subsequent minor versions that do not override 230 this mapping. 232 The Write list MUST be considered only for the COMPOUND procedure. 233 This procedure returns results from a sequence of operations. Only 234 the opaque file data from an NFS READ operation, and the pathname 235 from a READLINK operation MUST utilize entries from the Write list. 237 If there is no Write list, i.e., the list is null, then any READ or 238 READLINK operations in the COMPOUND MUST return their data inline. 239 The NFSv4.0 client MUST ensure that any result of its READ and 240 READLINK requests fits within its receive buffers, lest an RDMA 241 transport error result upon transfer. 243 The first entry in the Write list MUST be used by the first READ or 244 READLINK in the COMPOUND request. The next Write list entry is used 245 by the next READ or READLINK, and so on. 
If there are more READ or 246 READLINK operations than Write list entries, then any remaining 247 operations MUST return their results inline. 249 If a Write list entry is presented, then the corresponding READ or 250 READLINK MUST return its data via an RDMA WRITE to the buffer 251 indicated by the Write list entry. If the Write list entry has zero 252 RDMA segments, or if the total size of the segments is zero, then the 253 corresponding READ or READLINK operation MUST return its result 254 inline. 256 The following example shows an RDMA Write list with three posted 257 buffers A, B, and C. The designated operations in the compound 258 request, READ and READLINK, consume the posted buffers by writing 259 their results back to each buffer.

   RDMA Write list:

      A --> B --> C

   Compound request:

      PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                    |                   |                   |
                    v                   v                   v
                    A                   B                   C

272 If the client does not want to have the READLINK result returned 273 directly, then it provides a zero length array of segment triplets 274 for buffer B or sets the values in the segment triplet for buffer B 275 to zeros so that the READLINK result MUST be returned inline. 277 The situation is similar for RDMA Read lists sent by the client and 278 applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3. 280 Additionally, inline segments too large to fit in posted buffers MAY 281 be transferred in special "RDMA_NOMSG" messages. 283 Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the 284 "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the 285 appropriate value for the server is known to the client. Padding 286 allows the opaque file data to arrive at the server in an aligned 287 fashion, which may improve server performance. In order to ensure 288 accurate alignment for all data, it is likely that the client will 289 restrict its use of OPTIONAL padding to COMPOUND requests containing 290 only a single WRITE operation. 
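The positional pairing of Write list entries with READ and READLINK operations described above can be sketched as follows. This is an illustrative sketch only: the op_kind tags, the map_write_list function, and the convention of passing each entry's total posted size as a plain integer are assumptions made for the example, not definitions from this document or from [RPCRDMA].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation tags; only READ and READLINK consume entries. */
enum op_kind { OP_OTHER, OP_READ, OP_READLINK };

/*
 * For each operation in a COMPOUND, record which Write list entry it uses,
 * or -1 if its result is returned inline.
 *
 * entry_size[k] is the total posted size of Write list entry k; a value of
 * zero stands for an entry with no segments or with all-zero lengths.
 * Such an entry is still consumed by its READ or READLINK, which then
 * returns its result inline.
 */
void map_write_list(const enum op_kind *ops, size_t nops,
                    const uint64_t *entry_size, size_t nentries,
                    int *entry_for_op)
{
    size_t next = 0;   /* next unconsumed Write list entry */

    for (size_t i = 0; i < nops; i++) {
        entry_for_op[i] = -1;
        if (ops[i] != OP_READ && ops[i] != OP_READLINK)
            continue;              /* not eligible for direct placement */
        if (next >= nentries)
            continue;              /* Write list exhausted: inline */
        if (entry_size[next] != 0)
            entry_for_op[i] = (int)next;
        next++;
    }
}
```

Applied to the example request with buffer B zeroed, the two READ operations map to entries 0 (A) and 2 (C), and the READLINK maps to -1 (inline).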
292 Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 293 COMPOUND is unbounded, even when RDMA chunks are in use. While it 294 might appear that a configuration protocol exchange (such as the one 295 described in [RPCRDMA]) would help, in fact the layering issues 296 involved in building COMPOUNDs by NFS make such a mechanism 297 unworkable. 299 However, typical NFS version 4 clients rarely issue such problematic 300 requests. In practice, they behave in much more predictable ways; in 301 fact, most still support the traditional rsize/wsize mount parameters. 302 Therefore, most NFS version 4 clients function over RPC/RDMA in the 303 same way as NFS versions 2 and 3, operationally. 305 There are, however, advantages to allowing both client and server to 306 operate with prearranged size constraints, for example, use of the 307 sizes to better manage the server's response cache. An extension to 308 NFS version 4 supporting a more comprehensive exchange of upper layer 309 parameters is part of [NFSv4.1]. 311 6. Security 313 The RDMA transport for ONC RPC supports RPCSEC_GSS security as well 314 as link-level security. The use of RDMA Write to return RPC results 315 does not affect ONC RPC security. 317 7. IANA Considerations 319 NFS use of direct data placement introduces a need for an additional 320 NFS port number assignment for networks which share traditional UDP 321 and TCP port spaces with RDMA services. The iWARP [DDP] [RDMAP] 322 protocol is such an example (InfiniBand is not). 324 NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally 325 listen for clients on UDP and TCP port 2049, and additionally, they 326 register these with the portmapper and/or rpcbind [RFC1833] service. 327 However, NFS servers for version 4 [RFC3530] are required by that 328 specification to listen on TCP port 2049, and are not required to 329 register. 
331 An NFS version 2 or version 3 server supporting RPC/RDMA on such a 332 network and registering itself with the RPC portmapper MAY choose an 333 arbitrary port, or MAY use the alternative well-known port number 334 assigned by IANA for its RPC/RDMA service. The chosen port MAY be registered with 335 the RPC portmapper under the netid assigned by the requirement in 336 [RPCRDMA]. 338 An NFS version 4 server supporting RPC/RDMA on such a network 339 MUST use the alternative well-known port number assigned by IANA for its 340 RPC/RDMA service. Clients SHOULD connect to this well-known port 341 without consulting the RPC portmapper (as for NFSv4/TCP). The 342 following port is assigned to an NFS service over an RPC/RDMA 343 transport: 345 nfs-rdma 2050 347 8. Acknowledgements 349 The authors would like to thank Dave Noveck and Chet Juszczak for 350 their contributions to this document. 352 9. Normative References 354 [RFC2119] 355 S. Bradner, "Key words for use in RFCs to Indicate Requirement 356 Levels", 357 Best Current Practice, 358 BCP 14, RFC 2119, March 1997. 360 [RFC1831] 361 R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification 362 Version 2", 363 Standards Track RFC, 364 http://www.ietf.org/rfc/rfc1831.txt 366 [RFC4506] 367 M. Eisler, Ed., "XDR: External Data Representation Standard", 368 Standards Track RFC, 369 http://www.ietf.org/rfc/rfc4506.txt 371 [RFC1094] 372 "NFS: Network File System Protocol Specification", 373 (NFS version 2) Informational RFC, 374 http://www.ietf.org/rfc/rfc1094.txt 376 [RFC1813] 377 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol 378 Specification", 379 Informational RFC, 380 http://www.ietf.org/rfc/rfc1813.txt 382 [RFC1833] 383 R. Srinivasan, "Binding Protocols for ONC RPC Version 2", 384 Standards Track RFC, 385 http://www.ietf.org/rfc/rfc1833.txt 387 [RFC3530] 388 S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. 389 Eisler, D. 
Noveck, "NFS version 4 Protocol", 390 Standards Track RFC, 391 http://www.ietf.org/rfc/rfc3530.txt 393 10. Informative References 395 [RPCRDMA] 396 T. Talpey, B. Callaghan, "RDMA Transport for ONC RPC" 397 Internet Draft Work in Progress, 398 draft-ietf-nfsv4-rpcrdma 400 [NFSv4.1] 401 S. Shepler et. al., ed., "NFSv4 Minor Version 1" 402 Internet Draft Work in Progress, 403 draft-ietf-nfsv4-minorversion1 405 [DDP] 406 H. Shah et al, "Direct Data Placement over Reliable Transports", 407 Standards Track RFC, 408 draft-ietf-rddp-ddp 410 [RDMAP] 411 R. Recio et al, "An RDMA Protocol Specification", 412 Standards Track RFC, 413 draft-ietf-rddp-rdmap 415 11. Authors' Addresses 417 Tom Talpey 418 Network Appliance, Inc. 419 375 Totten Pond Road 420 Waltham, MA 02451 USA 422 Phone: +1 781 768 5329 423 EMail: thomas.talpey@netapp.com 425 Brent Callaghan 426 Apple Computer, Inc. 427 MS: 302-4K 428 2 Infinite Loop 429 Cupertino, CA 95014 USA 431 EMail: brentc@apple.com 433 12. Intellectual Property and Copyright Statements 435 Full Copyright Statement 437 Copyright (C) The IETF Trust (2007). 439 This document is subject to the rights, licenses and restrictions 440 contained in BCP 78, and except as set forth therein, the authors 441 retain all their rights. 443 This document and the information contained herein are provided on 444 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 445 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE 446 IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL 447 WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY 448 WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE 449 ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS 450 FOR A PARTICULAR PURPOSE. 
452 Intellectual Property 453 The IETF takes no position regarding the validity or scope of any 454 Intellectual Property Rights or other rights that might be claimed 455 to pertain to the implementation or use of the technology described 456 in this document or the extent to which any license under such 457 rights might or might not be available; nor does it represent that 458 it has made any independent effort to identify any such rights. 460 Information on the procedures with respect to rights in RFC 461 documents can be found in BCP 78 and BCP 79. 463 Copies of IPR disclosures made to the IETF Secretariat and any 464 assurances of licenses to be made available, or the result of an 465 attempt made to obtain a general license or permission for the use 466 of such proprietary rights by implementers or users of this 467 specification can be obtained from the IETF on-line IPR repository 468 at http://www.ietf.org/ipr. 470 The IETF invites any interested party to bring to its attention any 471 copyrights, patents or patent applications, or other proprietary 472 rights that may cover technology that may be required to implement 473 this standard. Please address the information to the IETF at ietf- 474 ipr@ietf.org. 476 Acknowledgment 477 Funding for the RFC Editor function is provided by the IETF 478 Administrative Support Activity (IASA).