Network File System Version 4                                  D. Noveck
Internet-Draft                                          February 1, 2017
Intended status: Standards Track
Expires: August 5, 2017

Transport-generic Network File System (NFS) Upper Layer Bindings To RPC-
                               Over-RDMA
                     draft-dnoveck-nfsv4-nfsulb-00

Abstract

   This document specifies Upper Layer Bindings to allow use of RPC-
   over-RDMA by protocols related to the Network File System (NFS).
   Such bindings are required when using RPC-over-RDMA, in order to
   enable use of Direct Data Placement and for a number of other
   reasons. These bindings are structured to be applicable to all known
   versions of the RPC-over-RDMA transport, including optional
   extensions. All versions of NFS are addressed.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 5, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1. Introduction
   2. Conveying NFS Operations On RPC-Over-RDMA
     2.1. Direct Placement of Request Data
     2.2. Direct Placement of Response Data
     2.3. Scatter-gather when Using DDP
     2.4. DDP-eligibility Violations
     2.5. Long Calls and Replies
   3. Preparatory Material for Multiple Bindings
     3.1. Reply Size Estimation
     3.2. Retry to Deal with Reply Size Mis-estimation
   4. Upper Layer Binding for NFS Versions 2 And 3
     4.1. Auxiliary Protocols
   5. Upper Layer Binding for NFS Version 4
     5.1. DDP-Eligibility
     5.2. NFS Version 4 Reply Size Estimation
     5.3. NFS Version 4 COMPOUND Requests
     5.4. NFS Version 4 Callback
     5.5. Session-Related Considerations
     5.6. Connection Keep-Alive
   6. Extending NFS Upper Layer Bindings
   7. IANA Considerations
   8. Security Considerations
   9. References
     9.1. Normative References
     9.2. Informative References
   Appendix A. Acknowledgments
   Author's Address

1. Introduction

   An RPC-over-RDMA transport, such as the ones defined in
   [I-D.ietf-nfsv4-rfc5666bis] and [rpcrdmav2], may employ direct data
   placement to convey data payloads associated with RPC transactions.
   To enable successful interoperation, RPC client and server
   implementations must agree as to which XDR data items in what
   particular RPC procedures are eligible for direct data placement
   (DDP). Specifying those data items is a major component of a
   protocol's Upper Layer Binding.

   In addition, Upper Layer Bindings are required to include additional
   information to assure that adequate resources are allocated to
   receive RPC replies, and for a number of other reasons.

   This document contains material required of Upper Layer Bindings, as
   specified in [I-D.ietf-nfsv4-rfc5666bis], for the NFS protocol
   versions listed below. In addition, bindings are provided, when
   necessary, for auxiliary protocols used together with NFS versions 2
   and 3.

   o  NFS Version 2 [RFC1094]

   o  NFS Version 3 [RFC1813]

   o  NFS Version 4.0 [RFC7530]

   o  NFS Version 4.1 [RFC5661]

   o  NFS Version 4.2 [RFC7862]

2. Conveying NFS Operations On RPC-Over-RDMA

   This document is written to apply to multiple versions of the RPC-
   over-RDMA transport, only the first of which is currently specified
   by a working group document (in [I-D.ietf-nfsv4-rfc5666bis]).
   However, it is expected that other versions will be created, and this
   document has been structured to support future versions by focusing
   on the functions to be provided by the transport and the transport
   limitations which the Upper Layer Protocols need to accommodate,
   allowing the transport specification and the specifications for
   associated extensions to define how those functions will be provided
   and the details of the transport limitations.

   In the subsections that follow, we will describe the generic function
   to be provided or limitation to be accommodated and follow it with
   material that describes, in general, how that issue is dealt with in
   Version One. For more detail about Version One,
   [I-D.ietf-nfsv4-rfc5666bis] should be consulted. How these issues
   are to be dealt with in future versions is left to the specification
   documents for those versions and associated documents defining
   optional extensions.

   For example:

   o  ULBs within this document define which data items are eligible for
      Direct Data Placement, while transport versions might differ as to
      how this is to be effected. See Sections 2.1 and 2.2 for more
      detail. In both cases, Section 2.3 discusses issues connected
      with the use of discontiguous areas for Direct Data Placement.

   o  Section 2.4 defines the concept of a DDP-eligibility violation and
      requires that such violations be reported, while transport
      versions might differ as to the manner in which the reporting is
      to be done.

   o  Section 2.5 discusses issues arising from limits on the size of
      messages conveyed using RDMA SENDs. Different transport versions
      may have different size limits, while, in the case of replies, the
      ULBs are responsible for specifying how limits on reply sizes are
      to be determined.

2.1. Direct Placement of Request Data

   When DDP-eligible XDR data items appear in a request, the requester
   needs to take special actions in order to provide for the direct
   placement of those items in the responder's memory. The specific
   actions to be taken are defined by the transport version being used.

   In Version One, the Read list in each RPC-over-RDMA transport header
   represents a set of memory regions containing a single item of DDP-
   eligible NFS argument data. Large data items, such as the data
   payload of an NFS version 3 WRITE procedure, can be referenced by the
   Read list. The NFS server pulls such payloads from the client and
   places them directly into its own memory.

2.2. Direct Placement of Response Data

   When a request is such that it is possible for DDP-eligible data
   items to appear in the corresponding reply, the requester needs to
   take special actions in order to provide for the direct placement of
   those items in the requester's memory, if such placement is desired.

   The specific actions to be taken are defined by the transport version
   being used, as is the means to indicate that such direct placement is
   not to be done.

   In Version One, the Write list in each RPC-over-RDMA transport header
   represents a set of memory regions that can receive DDP-eligible NFS
   result data. Large data items, such as the payload of an NFS version
   3 READ procedure, can be referenced by the Write list. The NFS
   server pushes such payloads to the client, placing them directly into
   the client's memory, using target addresses provided by the client
   when sending the request.

   Each Write chunk corresponds to a specific XDR data item in an NFS
   reply. This document describes how NFS client and server
   implementations determine the correspondence between Write chunks and
   XDR results.

2.3. Scatter-gather when Using DDP

   In order to accommodate the storage of multiple data blocks within
   individual cache buffers, the RPC-over-RDMA transport allows the
   addresses used to place a DDP-eligible data item to be discontiguous.
   How these addresses are indicated depends on the transport version.

   Within Version One, a chunk typically corresponds to exactly one XDR
   data item. Each Read chunk is represented as a list of segments at
   the same XDR Position. Each Write chunk is represented as an array
   of segments. An NFS client thus has the flexibility to advertise a
   set of discontiguous memory regions in which to convey a single DDP-
   eligible XDR data item.

2.4. DDP-eligibility Violations

   When the transport header applies the means defined for direct
   placement to an XDR item not defined in the ULB as DDP-eligible, a
   DDP-eligibility violation is recognized. The means by which such
   violations are to be reported is defined by the particular transport
   version being used.

   To report a DDP-eligibility violation within Version One, an NFS
   server returns one of:

   o  An RPC-over-RDMA message of type RDMA_ERROR, with the rdma_xid
      field set to the XID of the matching NFS Call, and the rdma_error
      field set to ERR_CHUNK.

   o  An RPC message (via an RDMA_MSG message) with the xid field set to
      the XID of the matching NFS Call, the mtype field set to REPLY,
      the stat field set to MSG_ACCEPTED, and the accept_stat field set
      to GARBAGE_ARGS.

2.5. Long Calls and Replies

   Because of the use of pre-posted receive buffers whose size is fixed,
   all RPC-over-RDMA transport versions have limits on the size of
   messages which can be conveyed without use of explicit RDMA
   operations, although different transport versions may have different
   limits.
   In particular, when the transport version allows messages to
   be continued across multiple RDMA SENDs, the limit can be
   substantially greater than the receive buffer size. Also note that
   the size of the messages allowed may be reduced because of space
   taken up by the transport header fields.

   Each transport version is responsible for defining the message size
   limits and the means by which the transfer of messages that exceed
   these limits is to be provided for. These means may be different in
   the cases of long calls and replies.

   When using Version One, if an NFS request is too large to be conveyed
   within the NFS server's responder inline threshold, even after any
   DDP-eligible data items have been removed, an NFS client must send
   the request in the form of a Long Call. The entire NFS request is
   sent in a special Read chunk called a Position Zero Read chunk.

   Also when using Version One, if an NFS client determines that the
   maximum size of an NFS reply could be too large to be conveyed within
   its own inline threshold, it provides a Reply chunk in the RPC-over-
   RDMA transport header conveying the NFS request. The server places
   the entire NFS reply in the Reply chunk.

   There exist cases in which an NFS client needs to provide both a
   Position Zero Read chunk and a Reply chunk for the same RPC. One
   common source of such situations is when the RPC authentication
   flavor being used requires that DDP-eligible data items never be
   removed from RPC messages.

3. Preparatory Material for Multiple Bindings

   Although each of the NFS versions and each of the auxiliary protocols
   discussed in Section 4.1 has its own ULB, there is important
   preparatory material in the subsections below that applies to
   multiple ULPs. In particular:

   o  The material in Section 3.1 applies to all of the ULPs discussed
      in this document.

   o  The material in Section 3.2 applies to NFSv2, NFSv3, NFSv4.0, the
      MOUNT protocol, and the NFSACL protocol.

3.1. Reply Size Estimation

   During the construction of each RPC Call message, a client is
   responsible for allocating appropriate resources for receiving the
   matching Reply message. The resources required depend on the
   maximum reply size expected, whether DDP-eligible data items can be
   removed from the reply, and the transport version being used. The
   ULB is responsible for defining how the maximum reply size is to be
   determined, while the specification of the transport version being
   used is responsible for defining how this maximum affects the
   resources to be allocated. Because the responder may not be able to
   send the required response when these resources have not been
   allocated, reliable reply size estimation is necessary to allow
   successful interoperation.

   In many cases the Upper Layer Protocol's XDR definition provides
   enough information to enable the client to make a reliable prediction
   of the maximum size of the expected Reply message. Even if there
   are variable-size data items in the result, the maximum size of the
   RPC Reply message can be reliably estimated in many cases:

   o  The client requests only a specific portion of an object (for
      example, using the "count" and "offset" fields in an NFS READ).

   o  The client has already cached the size of the whole object it is
      about to request (e.g., via a previous NFS GETATTR request).

   o  The client specifies a reply size limit for the particular reply,
      as it does by setting the count field of a READDIR request.

   It is sometimes not possible to determine the maximum Reply message
   size based solely on the above criteria.
   Client implementers can
   choose to provide the largest possible Reply buffer in those cases,
   based on, for instance, the largest possible NFS READ or WRITE
   payload (which is negotiated at mount time).

   There exist cases in which a client cannot be sure any a priori
   determination is fully reliable. Handling of such cases is discussed
   in Section 3.2.

3.2. Retry to Deal with Reply Size Mis-estimation

   For some of the protocols discussed in this document, it is possible
   for a compliant responder to send a valid reply whose length exceeds
   the client's a priori estimate. In such cases, the client needs to
   expect an error indication that indicates the existence of the
   oversize reply. When this happens, the client can either terminate
   that RPC transaction, or retry it with a larger reply size estimate.

   In the case of NFSv4.0, the use of NFS COMPOUND operations raises
   the possibility of non-idempotent requests that combine a non-
   idempotent operation with an operation whose maximum reply size
   cannot be determined with certainty. This makes retrying the
   operation problematic. It should be noted that many operations
   normally considered non-idempotent (e.g., WRITE, SETATTR) are
   actually idempotent. Truly non-idempotent operations are quite
   unusual in COMPOUNDs that include operations with uncertain reply
   sizes.

   Depending on the transport version used, the client's choices may be
   restricted as follows:

   o  The client may be required to treat the error as permanent, with
      retry not allowed.

   o  The client may be allowed to reissue the request with a larger
      reply estimate, unless it is a non-idempotent request. In this
      case, non-idempotent requests may not be retried and will result
      in errors being reported to the issuer.

   o  The client may be allowed to reissue the request with a larger
      reply estimate, in essentially all cases.
      In this case, the
      client has sufficient information to avoid re-executing a non-
      idempotent request and may, if it chooses, retry all requests with
      a larger reply size.

   In the case of Version One, the absence of a distinct error code to
   signal a reply chunk of inadequate size means that retry in this
   situation is not available.

4. Upper Layer Binding for NFS Versions 2 And 3

   This Upper Layer Binding specification applies to NFS Version 2
   [RFC1094] and NFS Version 3 [RFC1813]. For brevity, in this section
   a "legacy NFS client" refers to an NFS client using NFS version 2 or
   NFS version 3 to communicate with an NFS server. Likewise, a "legacy
   NFS server" is an NFS server communicating with clients using NFS
   version 2 or NFS version 3.

   The following XDR data items in NFS versions 2 and 3 are DDP-
   eligible:

   o  The opaque file data argument in the NFS WRITE procedure

   o  The pathname argument in the NFS SYMLINK procedure

   o  The opaque file data result in the NFS READ procedure

   o  The pathname result in the NFS READLINK procedure

   All other argument or result data items in NFS versions 2 and 3 are
   not DDP-eligible.

   A legacy NFS client determines the maximum reply size for each
   operation using the basic criteria outlined in Section 3.1. Such
   clients deal with reply sizes beyond the maximum as described in
   Section 2.5.

4.1. Auxiliary Protocols

   NFS versions 2 and 3 are typically deployed with several other
   protocols, sometimes referred to as "NFS auxiliary protocols." These
   are separate RPC programs that define procedures which are not part
   of the NFS version 2 or version 3 RPC programs.
   These include:

   o  The MOUNT and NLM protocols, introduced in an appendix of
      [RFC1813]

   o  The NSM protocol, described in Chapter 11 of [NSM]

   o  The NFSACL protocol, which does not have a public definition
      (NFSACL here is treated as a de facto standard as there are
      several interoperating implementations).

   RPC-over-RDMA treats these programs as distinct Upper Layer Protocols
   [I-D.ietf-nfsv4-rfc5666bis]. To enable the use of these ULPs on an
   RPC-over-RDMA transport, an Upper Layer Binding specification is
   provided here for each.

4.1.1. MOUNT, NLM, And NSM Protocols

   Typically MOUNT, NLM, and NSM are conveyed via TCP, even in
   deployments where NFS operations are conveyed on RPC-over-RDMA. When
   a legacy server supports these programs on RPC-over-RDMA, it
   advertises the port address via the usual rpcbind service [RFC1833].

   No operation in these protocols conveys a significant data payload,
   and the size of RPC messages in these protocols is uniformly small.
   Therefore, no XDR data items in these protocols are DDP-eligible.
   The largest variable-length XDR data item is an xdr_netobj. In most
   implementations this data item is not larger than 1024 bytes, making
   this size a reasonable basis for reply size estimation. However,
   since this limit is not specified as part of the protocol, the
   techniques described in Section 3.1 should be used to deal with
   situations where these sizes are exceeded.

4.1.2. NFSACL Protocol

   Legacy clients and servers that support the NFSACL RPC program
   typically convey NFSACL procedures on the same connection as the NFS
   RPC program. This obviates the need for separate rpcbind queries to
   discover server support for this RPC program.

   ACLs are typically small, but even large ACLs must be encoded and
   decoded to some degree. Thus, no data item in this Upper Layer
   Protocol is DDP-eligible.

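   The DDP-eligibility rules stated so far for NFS versions 2 and 3 and
   for the auxiliary protocols can be summarized as a small lookup. The
   Python sketch below is illustrative only: the program labels, the
   (procedure, direction) keys, and the is_ddp_eligible helper are
   assumptions made for this example and are not identifiers defined by
   any of the protocols.

```python
# Illustrative summary of Sections 4 and 4.1. All names are hypothetical
# labels chosen for this sketch, not protocol identifiers.

# (procedure, direction) pairs that carry the one DDP-eligible data item
# of that procedure in NFS versions 2 and 3.
LEGACY_NFS_DDP_ELIGIBLE = {
    ("WRITE", "argument"),    # opaque file data
    ("SYMLINK", "argument"),  # pathname
    ("READ", "result"),       # opaque file data
    ("READLINK", "result"),   # pathname
}

# The auxiliary protocols have no DDP-eligible data items at all.
AUXILIARY_PROGRAMS = {"MOUNT", "NLM", "NSM", "NFSACL"}


def is_ddp_eligible(program: str, procedure: str, direction: str) -> bool:
    """Return True if the procedure has a DDP-eligible item in that direction."""
    if program in AUXILIARY_PROGRAMS:
        return False
    if program in ("NFSv2", "NFSv3"):
        return (procedure, direction) in LEGACY_NFS_DDP_ELIGIBLE
    raise ValueError(f"program not covered by this sketch: {program}")
```

   A sender could consult such a table before building a chunk list, and
   a receiver could use the same table to recognize a DDP-eligibility
   violation.
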
   For procedures whose replies do not include an ACL object, the size
   of a reply is determined directly from the NFSACL program's XDR
   definition.

   There is no protocol-wide size limit for NFS version 3 ACLs, and
   there is no mechanism in either the NFSACL or NFS programs for a
   legacy client to ascertain the largest ACL a legacy server can store.
   Legacy client implementations should choose a maximum size for ACLs
   based on their own internal limits. A recommended lower bound for
   this maximum is 32,768 bytes, though a larger Reply chunk (up to the
   negotiated rsize setting) can be provided. Since no limit is
   specified as part of the protocol, the techniques described in
   Section 3.1 should be used to deal with situations where these
   recommended bounds are exceeded.

5. Upper Layer Binding for NFS Version 4

   This Upper Layer Binding specification applies to all protocols
   defined in NFS Version 4.0 [RFC7530], NFS Version 4.1 [RFC5661], and
   NFS Version 4.2 [RFC7862].

5.1. DDP-Eligibility

   Only the following XDR data items in the COMPOUND procedure of all
   NFS version 4 minor versions are DDP-eligible:

   o  The opaque data field in the WRITE4args structure

   o  The linkdata field of the NF4LNK arm in the createtype4 union

   o  The opaque data field in the READ4resok structure

   o  The linkdata field in the READLINK4resok structure

   o  In minor version 2 and newer, the rpc_data field of the
      read_plus_content union (further restrictions on the use of this
      data item follow below).

5.1.1. READ_PLUS Replies

   The NFS version 4.2 READ_PLUS operation returns a complex data type
   [RFC7862]. The rpr_contents field in the result of this operation is
   an array of read_plus_content unions, one arm of which contains an
   opaque byte stream (d_data).

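   The eligibility list in Section 5.1, including the minor-version
   condition on the READ_PLUS item, can be expressed as a table keyed by
   structure and field. This is a minimal Python sketch under stated
   assumptions: the (structure, field) strings simply repeat the wording
   of the list in Section 5.1, and nfs4_ddp_eligible is a hypothetical
   helper, not part of any NFS implementation.

```python
# Illustrative table for Section 5.1. Keys repeat the wording of that
# list; the value is the lowest NFSv4 minor version in which the item is
# DDP-eligible. The helper name is an assumption of this sketch.
NFS4_DDP_ELIGIBLE = {
    ("WRITE4args", "data"): 0,
    ("createtype4", "linkdata"): 0,       # NF4LNK arm
    ("READ4resok", "data"): 0,
    ("READLINK4resok", "linkdata"): 0,
    ("read_plus_content", "rpc_data"): 2,  # READ_PLUS, minor version 2+
}


def nfs4_ddp_eligible(structure: str, field: str, minor_version: int) -> bool:
    """Return True if the field is DDP-eligible in the given minor version."""
    lowest = NFS4_DDP_ELIGIBLE.get((structure, field))
    return lowest is not None and minor_version >= lowest
```

   Any item absent from the table, for example an attribute list
   returned by GETATTR, is not DDP-eligible in any minor version.
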
   The size of d_data is limited to the value of the rpa_count field,
   but the protocol does not bound the number of elements which can be
   returned in the rpr_contents array. In order to make the size of
   READ_PLUS replies predictable by NFS version 4.2 clients, the
   following restrictions are placed on the use of the READ_PLUS
   operation on RPC-over-RDMA transports:

   o  An NFS version 4.2 client MUST NOT provide more than one Write
      chunk for any READ_PLUS operation. When providing a Write chunk
      for a READ_PLUS operation, an NFS version 4.2 client MUST provide
      a Write chunk that is either empty (which forces all result data
      items for this operation to be returned inline) or large enough to
      receive rpa_count bytes in a single element of the rpr_contents
      array.

   o  If the Write chunk provided for a READ_PLUS operation by an NFS
      version 4.2 client is not empty, an NFS version 4.2 server MUST
      use that chunk for the first element of the rpr_contents array
      that has an rpc_data arm.

   o  An NFS version 4.2 server MUST NOT return more than two elements
      in the rpr_contents array of any READ_PLUS operation. It returns
      as much of the requested byte range as it can fit within these two
      elements. If the NFS version 4.2 server has not asserted rpr_eof
      in the reply, the NFS version 4.2 client SHOULD send additional
      READ_PLUS requests for any remaining bytes.

5.2. NFS Version 4 Reply Size Estimation

   An NFS version 4 client provides a Reply chunk when the maximum
   possible reply size is larger than the client's responder inline
   threshold.

   There are certain NFS version 4 data items whose size cannot be
   estimated by clients reliably, however, because there is no protocol-
   specified size limit on these structures.
   These include:

   o  The attrlist4 field

   o  Fields containing ACLs such as fattr4_acl, fattr4_dacl,
      fattr4_sacl

   o  Fields in the fs_locations4 and fs_locations_info4 data structures

   o  Opaque fields which pertain to pNFS layout metadata, such as
      loc_body, loh_body, da_addr_body, lou_body, lrf_body,
      fattr_layout_types and fs_layout_types.

5.2.1. Reply Size Estimation for Minor Version 0

   The items enumerated above in Section 5.2 make it difficult to
   predict the maximum size of GETATTR replies that interrogate
   variable-length attributes. As discussed in Section 3.1, client
   implementations can rely on their own internal architectural limits
   to bound the reply size. However, since such limits are not
   guaranteed to be reliable, use of the techniques discussed in
   Section 3.2 may sometimes be necessary.

   It is best to avoid issuing single COMPOUNDs that contain both non-
   idempotent operations and operations where the maximum reply size
   cannot be reliably predicted.

5.2.2. Reply Size Estimation for Minor Version 1 And Newer

   In NFS version 4.1 and newer minor versions, the csa_fore_chan_attrs
   argument of the CREATE_SESSION operation contains a
   ca_maxresponsesize field. The value in this field can be taken as
   the absolute maximum size of replies generated by a replying NFS
   version 4 server.

   This value can be used in cases where it is not possible to estimate
   a reply size upper bound precisely. In practice, objects such as
   ACLs, named attributes, layout bodies, and security labels are much
   smaller than this maximum.

5.3. NFS Version 4 COMPOUND Requests

   The NFS version 4 COMPOUND procedure allows the transmission of more
   than one DDP-eligible data item per Call and Reply message. An NFS
   version 4 client provides XDR Position values in each Read chunk to
   disambiguate which chunk is associated with which argument data item.
   However, NFS version 4 server and client implementations must agree
   in advance on how to pair Write chunks with returned result data
   items.

   The mechanism specified in Section 4.3.2 of
   [I-D.ietf-nfsv4-rfc5666bis] is applied here, with additional
   restrictions that appear below. In the following list, an "NFS Read"
   operation refers to any NFS Version 4 operation which has a DDP-
   eligible result data item (i.e., a READ, READ_PLUS, or READLINK
   operation).

   o  If an NFS version 4 client wishes all DDP-eligible items in an NFS
      reply to be conveyed inline, it leaves the Write list empty.

   o  The first chunk in the Write list MUST be used by the first NFS
      Read operation in an NFS version 4 COMPOUND procedure. The next
      Write chunk is used by the next NFS Read operation, and so on.

   o  If an NFS version 4 client has provided a matching non-empty Write
      chunk, then the corresponding NFS Read operation MUST return its
      DDP-eligible data item using that chunk.

   o  If an NFS version 4 client has provided an empty matching Write
      chunk, then the corresponding NFS Read operation MUST return all
      of its result data items inline.

   o  If an NFS Read operation returns a union arm which does not
      contain a DDP-eligible result, and the NFS version 4 client has
      provided a matching non-empty Write chunk, an NFS version 4 server
      MUST return an empty Write chunk in that Write list position.

   o  If there are more NFS Read operations than Write chunks, then
      remaining NFS Read operations in an NFS version 4 COMPOUND that
      have no matching Write chunk MUST return their results inline.

5.3.1. NFS Version 4 COMPOUND Example

   The following example shows a Write list with three Write chunks, A,
   B, and C. The NFS version 4 server consumes the provided Write
   chunks by writing the results of the designated operations in the
   compound request (READ and READLINK) back to each chunk.

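   The pairing rules of Section 5.3, applied to this example, can be
   sketched in code; the diagram that follows the sketch shows the same
   pairing. This Python sketch is a simplified model under stated
   assumptions: WriteChunk and pair_write_chunks are hypothetical names,
   a length of zero stands for an empty Write chunk, and the rule for a
   union arm with no DDP-eligible result is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Operations with a DDP-eligible result ("NFS Read" operations, Section 5.3).
NFS_READ_OPS = {"READ", "READ_PLUS", "READLINK"}


@dataclass
class WriteChunk:
    name: str
    length: int  # 0 models an empty Write chunk


def pair_write_chunks(ops: List[str],
                      write_list: List[WriteChunk]) -> List[Tuple[int, str, str]]:
    """Pair each NFS Read operation with a Write chunk name or "inline".

    Chunks are consumed in Write list order by NFS Read operations in
    COMPOUND order. An empty chunk, or no remaining chunk, means the
    result is returned inline.
    """
    chunks = iter(write_list)
    pairing = []
    for index, op in enumerate(ops):
        if op not in NFS_READ_OPS:
            continue  # operations without DDP-eligible results use no chunk
        chunk = next(chunks, None)  # None once the Write list is exhausted
        if chunk is None or chunk.length == 0:
            pairing.append((index, op, "inline"))
        else:
            pairing.append((index, op, chunk.name))
    return pairing


ops = "PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ".split()
write_list = [WriteChunk("A", 4096), WriteChunk("B", 1024), WriteChunk("C", 4096)]
for index, op, placement in pair_write_chunks(ops, write_list):
    print(index, op, placement)
```

   Setting the length of chunk B to zero in this sketch moves the
   READLINK result inline while the two READ results still use chunks A
   and C. The chunk lengths shown are arbitrary example values.
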
      Write list:

         A --> B --> C

      NFS version 4 COMPOUND request:

         PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                       |                   |                   |
                       v                   v                   v
                       A                   B                   C

   If the NFS version 4 client does not want to have the READLINK result
   returned via RDMA, it provides an empty Write chunk for buffer B to
   indicate that the READLINK result must be returned inline.

5.4. NFS Version 4 Callback

   The NFS version 4 protocols support server-initiated callbacks to
   notify clients of events such as recalled delegations.

5.4.1. NFS Version 4.0 Callback

   NFS version 4.0 implementations typically employ a separate TCP
   connection to handle callback operations, even when the forward
   channel uses an RPC-over-RDMA transport.

   No operation in the NFS version 4.0 callback RPC program conveys a
   significant data payload. Therefore, no XDR data item in this RPC
   program is DDP-eligible.

   A CB_RECALL reply is small and fixed in size. The CB_GETATTR reply
   contains a variable-length fattr4 data item. See Section 5.2.1 for a
   discussion of reply size prediction for this data item.

   An NFS version 4.0 client advertises netids and ad hoc port addresses
   for contacting its NFS version 4.0 callback service using the
   SETCLIENTID operation.

5.4.2. NFS Version 4.1 Callback

   In NFS version 4.1 and newer minor versions, callback operations may
   appear on the same connection as is used for NFS version 4 forward
   channel client requests. NFS version 4 clients and servers MUST use
   the mechanism described in [I-D.ietf-nfsv4-rpcrdma-bidirection] when
   backchannel operations are conveyed on RPC-over-RDMA transports.

   The csa_back_chan_attrs argument of the CREATE_SESSION operation
   contains a ca_maxresponsesize field. The value in this field can be
   taken as the absolute maximum size of backchannel replies generated
   by a replying NFS version 4 client.

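   The use of ca_maxresponsesize as a fallback bound, here and for the
   forward channel in Section 5.2.2, can be sketched as follows. This is
   a minimal Python sketch under stated assumptions: reply_size_bound
   and exceeds_inline_threshold are hypothetical helpers, sizes are
   plain byte counts, and how a transport version handles a reply that
   exceeds the inline threshold is left to that version's specification.

```python
from typing import Optional


def reply_size_bound(precise_estimate: Optional[int],
                     ca_maxresponsesize: int) -> int:
    """Return an upper bound on the size of a reply within a session.

    precise_estimate is the bound derived from the XDR definition and
    the request arguments, or None when no reliable bound is available.
    The negotiated ca_maxresponsesize caps the result in either case.
    """
    if precise_estimate is None:
        return ca_maxresponsesize
    return min(precise_estimate, ca_maxresponsesize)


def exceeds_inline_threshold(bound: int, inline_threshold: int) -> bool:
    """Return True if a reply of this size cannot be conveyed inline."""
    return bound > inline_threshold
```

   For instance, with a ca_maxresponsesize of 4096 and no precise
   estimate, the bound is 4096 bytes, which exceeds an inline threshold
   of 1024 bytes. Both numbers are arbitrary example values.
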
   There are no DDP-eligible data items in callback procedures defined
   in NFS version 4.1 or NFS version 4.2.  However, some callback
   operations, such as messages that convey device ID information, can
   be large, in which case a Long Call or Reply might be required.

   When an NFS version 4.1 client reports a backchannel
   ca_maxrequestsize that is larger than the connection's inline
   thresholds, the NFS version 4 client can support Long Calls.
   Otherwise, an NFS version 4 server MUST use Short messages to convey
   backchannel operations.

5.5.  Session-Related Considerations

   Typically, the presence of an NFS session [RFC5661] has no effect on
   the operation of RPC-over-RDMA.  None of the operations introduced to
   support NFS sessions contain DDP-eligible data items.  There is no
   need to match the number of session slots with the number of
   available RPC-over-RDMA credits.

   However, there are some rare error conditions which require special
   handling when an NFS session is operating on an RPC-over-RDMA
   transport.  For example, a requester might receive, in response to an
   RPC request, an RDMA_ERROR message with an rdma_err value of
   ERR_CHUNK, or an RDMA_MSG containing an RPC_GARBAGEARGS reply.
   Within RPC-over-RDMA Version One, this class of error can be
   generated for two different reasons:

   o  An XDR error was detected while parsing the RPC-over-RDMA headers.

   o  There was an error sending the response because, for example, a
      necessary reply chunk was not provided or the one provided was of
      insufficient length.

   These two situations, which arise due to incorrect implementations or
   underestimation of reply size, have different implications with
   regard to Exactly-Once Semantics.  An XDR error in decoding the
   request precludes the execution of the request on the responder, but
   failure to send a reply indicates that some or all of the operations
   were executed.
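   As an illustration only, the sketch below shows one way a requester
   might keep enough state to act on an error response of this kind.
   Because neither an RDMA_ERROR message nor an RPC_GARBAGEARGS reply
   carries SEQUENCE results, the sketch records the session and slot
   for each outstanding XID at send time.  The names SlotRef,
   SlotTracker, record, and fail are hypothetical; they are not defined
   by this document or taken from any NFS implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SlotRef:
    """Session slot occupied by one outstanding request (hypothetical type)."""
    session_id: bytes
    slot_id: int
    sequence_id: int


class SlotTracker:
    """Maps the XID of each outstanding RPC to the session slot it occupies."""

    def __init__(self) -> None:
        self._by_xid: Dict[int, SlotRef] = {}

    def record(self, xid: int, ref: SlotRef) -> None:
        # Called when the request is sent, before any reply can arrive.
        self._by_xid[xid] = ref

    def fail(self, xid: int) -> Optional[SlotRef]:
        # Called for an RDMA_ERROR or RPC_GARBAGEARGS response. The slot is
        # looked up by XID so the caller can retire it and report an error.
        # Returns None when the XID is unknown (for example, already handled).
        return self._by_xid.pop(xid, None)

    def outstanding(self) -> int:
        # Number of requests still awaiting a reply.
        return len(self._by_xid)


# Example: one request is sent on slot 3 and later fails at the transport.
tracker = SlotTracker()
tracker.record(0x1234, SlotRef(session_id=b"\x01" * 16, slot_id=3, sequence_id=7))
failed = tracker.fail(0x1234)
```

   How the slot is actually retired is left to the caller, since that
   depends on the session implementation in use.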
   In both instances, the client SHOULD NOT retry the operation without
   addressing reply resource inadequacy.  Such a retry can result in the
   same sort of error seen previously.  Instead, it is best to consider
   the operation as completed unsuccessfully and report an error to the
   consumer who requested the RPC.

   In addition, within the error response, the requester does not have
   the result of the execution of the SEQUENCE operation, which
   identifies the session, slot, and sequence id for the request which
   has failed.  The xid associated with the request, obtained from the
   rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to
   determine the session and slot for the request which failed, and the
   slot must be properly retired.  If this is not done, the slot could
   be rendered permanently unavailable.

5.6.  Connection Keep-Alive

   NFS version 4 client implementations often rely on a transport-layer
   keep-alive mechanism to detect when an NFS version 4 server has
   become unresponsive.  When an NFS server is no longer responsive,
   client-side keep-alive terminates the connection, which in turn
   triggers reconnection and RPC retransmission.

   Some RDMA transports (such as Reliable Connections on InfiniBand)
   have no keep-alive mechanism.  Without a disconnect or new RPC
   traffic, such connections can remain alive long after an NFS server
   has become unresponsive.  Once an NFS client has consumed all
   available RPC-over-RDMA credits on that transport connection, it will
   forever await a reply before sending another RPC request.

   NFS version 4 clients SHOULD reserve one RPC-over-RDMA credit to use
   for periodic server or connection health assessment.  This credit can
   be used to drive an RPC request on an otherwise idle connection,
   triggering either a quick affirmative server response or immediate
   connection termination.

6.
Extending NFS Upper Layer Bindings

   RPC programs such as NFS are required to have an Upper Layer Binding
   specification to interoperate on RPC-over-RDMA transports
   [I-D.ietf-nfsv4-rfc5666bis].  Via standards action, the Upper Layer
   Binding specified in this document can be extended to cover versions
   of the NFS version 4 protocol specified after NFS version 4 minor
   version 2, or separately published extensions to an existing NFS
   version 4 minor version, as described in [I-D.ietf-nfsv4-versioning].

7.  IANA Considerations

   NFS use of direct data placement introduces a need for an additional
   NFS port number assignment for networks that share traditional UDP
   and TCP port spaces with RDMA services.  The iWARP protocol [RFC5041]
   [RFC5040] is one such example (InfiniBand is not).

   NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
   listen for clients on UDP and TCP port 2049, and additionally, they
   register these with the portmapper and/or rpcbind [RFC1833] service.
   However, [RFC7530] requires NFS version 4 servers to listen on TCP
   port 2049, and they are not required to register.

   An NFS version 2 or version 3 server supporting RPC-over-RDMA on such
   a network and registering itself with the RPC portmapper MAY choose
   an arbitrary port, or MAY use the alternative well-known port number
   for its RPC-over-RDMA service.  The chosen port MAY be registered
   with the RPC portmapper under the netid assigned by the requirement
   in [I-D.ietf-nfsv4-rfc5666bis].

   An NFS version 4 server supporting RPC-over-RDMA on such a network
   MUST use the alternative well-known port number for its RPC-over-RDMA
   service.  Clients SHOULD connect to this well-known port without
   consulting the RPC portmapper (as for NFS version 4 on TCP
   transports).

   The port number assigned to an NFS service over an RPC-over-RDMA
   transport is available from the IANA port registry [RFC3232].

8.
Security Considerations

   RPC-over-RDMA supports all RPC security models, including RPCSEC_GSS
   security [RFC2203] and transport-level security.  The choice of RDMA
   Read and RDMA Write to convey RPC arguments and results does not
   affect this, since it changes only the method of data transfer.
   Specifically, the requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure
   that this choice does not introduce new vulnerabilities.

   Because this document defines only the binding of the NFS protocols
   atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security
   considerations are described at that layer.

9.  References

9.1.  Normative References

   [I-D.ietf-nfsv4-rfc5666bis]
              Lever, C., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call, Version
              One", draft-ietf-nfsv4-rfc5666bis-09 (work in progress),
              January 2017.

   [I-D.ietf-nfsv4-rpcrdma-bidirection]
              Lever, C., "Bi-directional Remote Procedure Call On
              RPC-over-RDMA Transports",
              draft-ietf-nfsv4-rpcrdma-bidirection-06 (work in
              progress), January 2017.

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, DOI 10.17487/RFC2203, September
              1997.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.
   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

9.2.  Informative References

   [I-D.ietf-nfsv4-rfc5667bis]
              Lever, C., "Remote Direct Memory Access Transport for
              Remote Procedure Call, Version One",
              draft-ietf-nfsv4-rfc5667bis-04 (work in progress),
              January 2017.

   [I-D.ietf-nfsv4-versioning]
              Noveck, D., "Rules for NFSv4 Extensions and Minor
              Versions", draft-ietf-nfsv4-versioning-09 (work in
              progress), December 2016.

   [NSM]      The Open Group, "Protocols for Interworking: XNFS, Version
              3W", February 1998.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3232]  Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced
              by an On-line Database", RFC 3232, DOI 10.17487/RFC3232,
              January 2002.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5667]  Talpey, T. and B. Callaghan, "Network File System (NFS)
              Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
              January 2010.

   [rpcrdmav2]
              Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version Two",
              December 2016.

              Work in progress.

Appendix A.  Acknowledgments

   The author gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey on the original NFS Direct Data Placement specification
   [RFC5667].
   A large part of the material in this document is taken from
   [I-D.ietf-nfsv4-rfc5667bis], written by Chuck Lever.  The author
   wishes to acknowledge the debt he owes to Chuck for his work in
   providing an updated Upper Layer Binding for the NFS-related
   protocols.

   The author also wishes to thank Bill Baker and Greg Marsden for their
   support of the work to revive RPC-over-RDMA.

Author's Address

   David Noveck
   26 Locust Avenue
   Lexington, MA  02421
   USA

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com