Network File System Version 4                                  D. Noveck
Internet-Draft                                             July 25, 2017
Intended status: Standards Track
Expires: January 26, 2018

Transport-generic Network File System (NFS) Upper Layer Bindings To RPC-
                               Over-RDMA
                     draft-dnoveck-nfsv4-nfsulb-01

Abstract

   This document specifies Upper Layer Bindings to allow use of RPC-
   over-RDMA by protocols related to the Network File System (NFS).
   Such bindings are required when using RPC-over-RDMA, in order to
   enable use of Direct Data Placement and for a number of other
   reasons. These bindings are structured to be applicable to all known
   versions of the RPC-over-RDMA transport, including optional
   extensions. All versions of NFS are addressed.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 26, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
   2.  Conveying NFS Operations On RPC-Over-RDMA
     2.1.
           Direct Placement of Request Data
     2.2.  Direct Placement of Response Data
     2.3.  Scatter-gather when Using DDP
     2.4.  DDP-eligibility Violations
     2.5.  Long Calls and Replies
   3.  Preparatory Material for Multiple Bindings
     3.1.  Reply Size Estimation
     3.2.  Retry to Deal with Reply Size Mis-estimation
   4.  Upper Layer Binding for NFS Versions 2 And 3
     4.1.  Auxiliary Protocols
       4.1.1.  MOUNT, NLM, And NSM Protocols
       4.1.2.  NFSACL Protocol
   5.  Upper Layer Binding for NFS Version 4
     5.1.  DDP-Eligibility
       5.1.1.  READ_PLUS Replies
     5.2.  NFS Version 4 Reply Size Estimation
       5.2.1.  Reply Size Estimation for Minor Version 0
       5.2.2.  Reply Size Estimation for Minor Version 1 And Newer
     5.3.  NFS Version 4 COMPOUND Requests
       5.3.1.  NFS Version 4 COMPOUND Example
     5.4.  NFS Version 4 Callback
       5.4.1.  NFS Version 4.0 Callback
       5.4.2.  NFS Version 4.1 Callback
     5.5.  Session-Related Considerations
     5.6.  Connection Keep-Alive
   6.  Extending NFS Upper Layer Bindings
   7.  IANA Considerations
   8.  Security Considerations
   9.  References
     9.1.
           Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgments
   Author's Address

1.  Introduction

   An RPC-over-RDMA transport, such as the ones defined in [RFC8166] and
   [rpcrdmav2], may employ direct data placement to convey data payloads
   associated with RPC transactions. To enable successful
   interoperation, RPC client and server implementations must agree as
   to which XDR data items in what particular RPC procedures are
   eligible for direct data placement (DDP). Specifying those data
   items is a major component of a protocol's Upper Layer Binding.

   Upper Layer Bindings are also required to include information to
   assure that adequate resources are allocated to receive RPC replies,
   and for a number of other reasons.

   This document contains material required of Upper Layer Bindings, as
   specified in [RFC8166], for the NFS protocol versions listed below.
   In addition, bindings are provided, when necessary, for auxiliary
   protocols used together with NFS versions 2 and 3.

   o  NFS Version 2 [RFC1094]

   o  NFS Version 3 [RFC1813]

   o  NFS Version 4.0 [RFC7530]

   o  NFS Version 4.1 [RFC5661]

   o  NFS Version 4.2 [RFC7862]

2.  Conveying NFS Operations On RPC-Over-RDMA

   This document is written to apply to multiple versions of the RPC-
   over-RDMA transport, only the first of which is currently specified
   by a Proposed Standard (in [RFC8166]).
   However, it is expected that
   other versions will be created, and this document has been structured
   to support future versions by focusing on the functions to be
   provided by the transport and the transport limitations which the
   Upper Layer Protocols need to accommodate, allowing the transport
   specification and the specifications for associated extensions to
   define how those functions will be provided and the details of the
   transport limitations.

   In the subsections that follow, we will describe the generic function
   to be provided or limitation to be accommodated and follow it with
   material that describes, in general, how that issue is dealt with in
   Version One. For more detail about Version One, [RFC8166] should be
   consulted. How these issues are to be dealt with in future versions
   is left to the specification documents for those versions and
   associated documents defining optional extensions.

   For example:

   o  ULBs within this document define which data items are eligible for
      Direct Data Placement, while transport versions might differ as to
      how this is to be effected. See Sections 2.1 and 2.2 for more
      detail. In both cases, Section 2.3 discusses issues connected
      with the use of discontiguous areas for Direct Data Placement.

   o  Section 2.4 defines the concept of a DDP-eligibility violation and
      requires that such violations be reported, while transport
      versions might differ as to the manner in which the reporting is
      to be done.

   o  Section 2.5 discusses issues arising from limits on the size of
      messages conveyed using RDMA SENDs. Different transport versions
      may have different size limits, while, in the case of replies, the
      ULBs are responsible for specifying how limits on reply sizes are
      to be determined.

2.1.
      Direct Placement of Request Data

   When DDP-eligible XDR data items appear in a request, the requester
   needs to take special actions in order to provide for the direct
   placement of those items in the responder's memory. The specific
   actions to be taken are defined by the transport version being used.

   In Version One, the Read list in each RPC-over-RDMA transport header
   represents a set of memory regions containing a single item of DDP-
   eligible NFS argument data. Large data items, such as the data
   payload of an NFS version 3 WRITE procedure, can be referenced by the
   Read list. The NFS server pulls such payloads from the client and
   places them directly into its own memory.

2.2.  Direct Placement of Response Data

   When a request is such that it is possible for DDP-eligible data
   items to appear in the corresponding reply, the requester needs to
   take special actions in order to provide for the direct placement of
   those items in the requester's memory, if such placement is desired.
   The specific actions to be taken are defined by the transport version
   being used, as is the means to indicate that such direct placement is
   not to be done.

   In Version One, the Write list in each RPC-over-RDMA transport header
   represents a set of memory regions that can receive DDP-eligible NFS
   result data. Large data items, such as the payload of an NFS version
   3 READ procedure, can be referenced by the Write list. The NFS
   server pushes such payloads to the client, placing them directly into
   the client's memory, using target addresses provided by the client
   when sending the request.

   Each Write chunk corresponds to a specific XDR data item in an NFS
   reply. This document describes how NFS client and server
   implementations determine the correspondence between Write chunks and
   XDR results.

2.3.
      Scatter-gather when Using DDP

   In order to accommodate the storage of multiple data blocks within
   individual cache buffers, the RPC-over-RDMA transport allows the
   addresses at which a DDP-eligible data item is placed to be
   discontiguous. How these addresses are indicated depends on the
   transport version.

   Within Version One, a chunk typically corresponds to exactly one XDR
   data item. Each Read chunk is represented as a list of segments at
   the same XDR Position. Each Write chunk is represented as an array
   of segments. An NFS client thus has the flexibility to advertise a
   set of discontiguous memory regions in which to convey a single DDP-
   eligible XDR data item.

2.4.  DDP-eligibility Violations

   When the means that the transport header defines for the direct
   placement of an XDR data item is applied to an XDR data item not
   defined in the ULB as DDP-eligible, a DDP-eligibility violation is
   recognized. The means by which such violations are to be reported is
   defined by the particular transport version being used.

   To report a DDP-eligibility violation within Version One, an NFS
   server returns one of:

   o  An RPC-over-RDMA message of type RDMA_ERROR, with the rdma_xid
      field set to the XID of the matching NFS Call, and the rdma_error
      field set to ERR_CHUNK.

   o  An RPC message (via an RDMA_MSG message) with the xid field set to
      the XID of the matching NFS Call, the mtype field set to REPLY,
      the stat field set to MSG_ACCEPTED, and the accept_stat field set
      to GARBAGE_ARGS.

2.5.  Long Calls and Replies

   Because of the use of pre-posted receive buffers whose size is fixed,
   all RPC-over-RDMA transport versions have limits on the size of
   messages which can be conveyed without use of explicit RDMA
   operations, although different transport versions may have different
   limits.
   In particular, when the transport version allows messages to
   be continued across multiple RDMA SENDs, the limit can be
   substantially greater than the receive buffer size. Also note that
   the size of the messages allowed may be reduced because of space
   taken up by the transport header fields.

   Each transport version is responsible for defining the message size
   limits and the means by which the transfer of messages that exceed
   these limits is to be provided for. These means may be different in
   the cases of long calls and replies.

   When using Version One, if an NFS request is too large to be conveyed
   within the NFS server's responder inline threshold, even after any
   DDP-eligible data items have been removed, an NFS client must send
   the request in the form of a Long Call. The entire NFS request is
   sent in a special Read chunk called a Position Zero Read chunk.

   Also when using Version One, if an NFS client determines that the
   maximum size of an NFS reply could be too large to be conveyed within
   its own inline threshold, it provides a Reply chunk in the RPC-over-
   RDMA transport header conveying the NFS request. The server places
   the entire NFS reply in the Reply chunk.

   There exist cases in which an NFS client needs to provide both a
   Position Zero Read chunk and a Reply chunk for the same RPC. One
   common source of such situations is when the RPC authentication
   flavor being used requires that DDP-eligible data items never be
   removed from RPC messages.

3.  Preparatory Material for Multiple Bindings

   Although each of the NFS versions and each of the auxiliary protocols
   discussed in Section 4.1 has its own ULB, there is important
   preparatory material in the subsections below that applies to
   multiple ULPs. In particular:

   o  The material in Section 3.1 applies to all of the ULPs discussed
      in this document.

   o  The material in Section 3.2 applies to NFSv2, NFSv3, NFSv4.0, the
      MOUNT protocol, and the NFSACL protocol.

3.1.  Reply Size Estimation

   During the construction of each RPC Call message, a client is
   responsible for allocating appropriate resources for receiving the
   matching Reply message. The resources required depend on the
   maximum reply size expected, whether DDP-eligible data items can be
   removed from the reply, and the transport version being used. The
   ULB is responsible for defining how the maximum reply size is to be
   determined, while the specification of the transport version being
   used is responsible for defining how this maximum affects the
   resources to be allocated. Because the responder may not be able to
   send the required response when these resources have not been
   allocated, reliable reply size estimation is necessary to allow
   successful interoperation.

   In many cases the Upper Layer Protocol's XDR definition provides
   enough information to enable the client to make a reliable prediction
   of the maximum size of the expected Reply message. Even if there
   are variable-size data items in the result, the maximum size of the
   RPC Reply message can be reliably estimated in many cases:

   o  The client requests only a specific portion of an object (for
      example, using the "count" and "offset" fields in an NFS READ).

   o  The client has already cached the size of the whole object it is
      about to request (e.g., via a previous NFS GETATTR request).

   o  The client specifies a reply size limit for the particular reply,
      as it does by setting the count field of a READDIR request.

   It is sometimes not possible to determine the maximum Reply message
   size based solely on the above criteria.
   Client implementers can
   choose to provide the largest possible Reply buffer in those cases,
   based on, for instance, the largest possible NFS READ or WRITE
   payload (which is negotiated at mount time).

   There exist cases in which a client cannot be sure any a priori
   determination is fully reliable. Handling of such cases is discussed
   in Section 3.2.

3.2.  Retry to Deal with Reply Size Mis-estimation

   For some of the protocols discussed in this document, it is possible
   for a compliant responder to send a valid reply whose length exceeds
   the client's a priori estimate. In such cases, the client needs to
   expect an error indication that reports the existence of the
   oversize reply. When this happens, the client can either terminate
   that RPC transaction, or retry it with a larger reply size estimate.

   In the case of NFSv4.0, the use of NFS COMPOUND operations raises
   the possibility of non-idempotent requests that combine a non-
   idempotent operation with an operation whose maximum reply size
   cannot be determined with certainty. This makes retrying the
   operation problematic. It should be noted that many operations
   normally considered non-idempotent (e.g., WRITE, SETATTR) are
   actually idempotent. Truly non-idempotent operations are quite
   unusual in COMPOUNDs that include operations with uncertain reply
   sizes.

   Depending on the transport version used, the client's choices may be
   restricted as follows:

   o  The client may be required to treat the error as permanent, with
      retry not allowed.

   o  The client may be allowed to reissue the request with a larger
      reply estimate, unless it is a non-idempotent request. In this
      case, non-idempotent requests may not be retried and will result
      in errors being reported to the issuer.

   o  The client may be allowed to reissue the request with a larger
      reply estimate, in essentially all cases.
      In this case, the
      client has sufficient information to avoid re-executing a non-
      idempotent request and may, if it chooses, retry all requests with
      a larger reply size.

   In the case of Version One, the absence of a distinct error code to
   signal a reply chunk of inadequate size means that retry in this
   situation is not available.

4.  Upper Layer Binding for NFS Versions 2 And 3

   This Upper Layer Binding specification applies to NFS Version 2
   [RFC1094] and NFS Version 3 [RFC1813]. For brevity, in this section
   a "legacy NFS client" refers to an NFS client using NFS version 2 or
   NFS version 3 to communicate with an NFS server. Likewise, a "legacy
   NFS server" is an NFS server communicating with clients using NFS
   version 2 or NFS version 3.

   The following XDR data items in NFS versions 2 and 3 are DDP-
   eligible:

   o  The opaque file data argument in the NFS WRITE procedure

   o  The pathname argument in the NFS SYMLINK procedure

   o  The opaque file data result in the NFS READ procedure

   o  The pathname result in the NFS READLINK procedure

   All other argument or result data items in NFS versions 2 and 3 are
   not DDP-eligible.

   A legacy NFS client determines the maximum reply size for each
   operation using the basic criteria outlined in Section 3.1. Such
   clients deal with reply sizes beyond the maximum as described in
   Section 2.5.

4.1.  Auxiliary Protocols

   NFS versions 2 and 3 are typically deployed with several other
   protocols, sometimes referred to as "NFS auxiliary protocols." These
   are separate RPC programs that define procedures which are not part
   of the NFS version 2 or version 3 RPC programs.
   These include:

   o  The MOUNT and NLM protocols, introduced in an appendix of
      [RFC1813]

   o  The NSM protocol, described in Chapter 11 of [NSM]

   o  The NFSACL protocol, which does not have a public definition
      (NFSACL here is treated as a de facto standard as there are
      several interoperating implementations).

   RPC-over-RDMA treats these programs as distinct Upper Layer Protocols
   [RFC8166]. To enable the use of these ULPs on an RPC-over-RDMA
   transport, an Upper Layer Binding specification is provided here for
   each.

4.1.1.  MOUNT, NLM, And NSM Protocols

   Typically, MOUNT, NLM, and NSM are conveyed via TCP, even in
   deployments where NFS operations are conveyed on RPC-over-RDMA. When
   a legacy server supports these programs on RPC-over-RDMA, it
   advertises the port address via the usual rpcbind service [RFC1833].

   No operation in these protocols conveys a significant data payload,
   and the size of RPC messages in these protocols is uniformly small.
   Therefore, no XDR data items in these protocols are DDP-eligible.
   The largest variable-length XDR data item is an xdr_netobj. In most
   implementations this data item is not larger than 1024 bytes, making
   this size a reasonable basis for reply size estimation. However,
   since this limit is not specified as part of the protocol, the
   techniques described in Section 3.1 should be used to deal with
   situations where these sizes are exceeded.

4.1.2.  NFSACL Protocol

   Legacy clients and servers that support the NFSACL RPC program
   typically convey NFSACL procedures on the same connection as the NFS
   RPC program. This obviates the need for separate rpcbind queries to
   discover server support for this RPC program.

   ACLs are typically small, but even large ACLs must be encoded and
   decoded to some degree. Thus, no data item in this Upper Layer
   Protocol is DDP-eligible.
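
   The DDP-eligibility rules stated so far for NFS versions 2 and 3 and
   for the auxiliary protocols amount to a small lookup table. The
   following is a minimal sketch in Python, not part of any protocol:
   the table layout, the function name, and the item labels are
   illustrative assumptions, and only the eligibility facts themselves
   come from Section 4.

```python
from typing import FrozenSet, Tuple

# (program, procedure, data item) triples that Section 4 lists as
# DDP-eligible for NFS versions 2 and 3. The string labels are
# hypothetical; they are not identifiers from any XDR definition.
LEGACY_DDP_ELIGIBLE: FrozenSet[Tuple[str, str, str]] = frozenset({
    ("NFS", "WRITE", "file data argument"),
    ("NFS", "SYMLINK", "pathname argument"),
    ("NFS", "READ", "file data result"),
    ("NFS", "READLINK", "pathname result"),
})


def is_ddp_eligible(program: str, procedure: str, item: str) -> bool:
    """Return True only for the items enumerated in Section 4.

    MOUNT, NLM, NSM, and NFSACL have no entries in the table, so every
    lookup for those programs returns False.
    """
    return (program, procedure, item) in LEGACY_DDP_ELIGIBLE
```

   A responder could consult such a table when deciding whether a chunk
   supplied by the requester constitutes a DDP-eligibility violation.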

   For procedures whose replies do not include an ACL object, the size
   of a reply is determined directly from the NFSACL program's XDR
   definition.

   There is no protocol-wide size limit for NFS version 3 ACLs, and
   there is no mechanism in either the NFSACL or NFS programs for a
   legacy client to ascertain the largest ACL a legacy server can store.
   Legacy client implementations should choose a maximum size for ACLs
   based on their own internal limits. A recommended lower bound for
   this maximum is 32,768 bytes, though a larger Reply chunk (up to the
   negotiated rsize setting) can be provided. Since no limit is
   specified as part of the protocol, the techniques described in
   Section 3.1 should be used to deal with situations where these
   recommended bounds are exceeded.

5.  Upper Layer Binding for NFS Version 4

   This Upper Layer Binding specification applies to all protocols
   defined in NFS Version 4.0 [RFC7530], NFS Version 4.1 [RFC5661], and
   NFS Version 4.2 [RFC7862].

5.1.  DDP-Eligibility

   Only the following XDR data items in the COMPOUND procedure of all
   NFS version 4 minor versions are DDP-eligible:

   o  The opaque data field in the WRITE4args structure

   o  The linkdata field of the NF4LNK arm in the createtype4 union

   o  The opaque data field in the READ4resok structure

   o  The linkdata field in the READLINK4resok structure

   o  In minor version 2 and newer, the rpc_data field of the
      read_plus_content union (further restrictions on the use of this
      data item follow below).

5.1.1.  READ_PLUS Replies

   The NFS version 4.2 READ_PLUS operation returns a complex data type
   [RFC7862]. The rpr_contents field in the result of this operation is
   an array of read_plus_content unions, one arm of which contains an
   opaque byte stream (d_data).

   The size of d_data is limited to the value of the rpa_count field,
   but the protocol does not bound the number of elements which can be
   returned in the rpr_contents array. In order to make the size of
   READ_PLUS replies predictable by NFS version 4.2 clients, the
   following restrictions are placed on the use of the READ_PLUS
   operation on RPC-over-RDMA transports:

   o  An NFS version 4.2 client MUST NOT provide more than one Write
      chunk for any READ_PLUS operation. When providing a Write chunk
      for a READ_PLUS operation, an NFS version 4.2 client MUST provide
      a Write chunk that is either empty (which forces all result data
      items for this operation to be returned inline) or large enough to
      receive rpa_count bytes in a single element of the rpr_contents
      array.

   o  If the Write chunk provided for a READ_PLUS operation by an NFS
      version 4.2 client is not empty, an NFS version 4.2 server MUST
      use that chunk for the first element of the rpr_contents array
      that has an rpc_data arm.

   o  An NFS version 4.2 server MUST NOT return more than two elements
      in the rpr_contents array of any READ_PLUS operation. It returns
      as much of the requested byte range as it can fit within these two
      elements. If the NFS version 4.2 server has not asserted rpr_eof
      in the reply, the NFS version 4.2 client SHOULD send additional
      READ_PLUS requests for any remaining bytes.

5.2.  NFS Version 4 Reply Size Estimation

   An NFS version 4 client provides a Reply chunk when the maximum
   possible reply size is larger than the client's responder inline
   threshold.

   There are certain NFS version 4 data items whose size cannot be
   estimated by clients reliably, however, because there is no protocol-
   specified size limit on these structures.
   These include:

   o  The attrlist4 field

   o  Fields containing ACLs such as fattr4_acl, fattr4_dacl, and
      fattr4_sacl

   o  Fields in the fs_locations4 and fs_locations_info4 data structures

   o  Opaque fields which pertain to pNFS layout metadata, such as
      loc_body, loh_body, da_addr_body, lou_body, lrf_body,
      fattr_layout_types, and fs_layout_types

5.2.1.  Reply Size Estimation for Minor Version 0

   The items enumerated above in Section 5.2 make it difficult to
   predict the maximum size of GETATTR replies that interrogate
   variable-length attributes. As discussed in Section 3.1, client
   implementations can rely on their own internal architectural limits
   to bound the reply size. However, since such limits are not
   guaranteed to be reliable, use of the techniques discussed in
   Section 3.2 may sometimes be necessary.

   It is best to avoid issuing single COMPOUNDs that contain both non-
   idempotent operations and operations where the maximum reply size
   cannot be reliably predicted.

5.2.2.  Reply Size Estimation for Minor Version 1 And Newer

   In NFS version 4.1 and newer minor versions, the csa_fore_chan_attrs
   argument of the CREATE_SESSION operation contains a
   ca_maxresponsesize field. The value in this field can be taken as
   the absolute maximum size of replies generated by a replying NFS
   version 4 server.

   This value can be used in cases where it is not possible to estimate
   a reply size upper bound precisely. In practice, objects such as
   ACLs, named attributes, layout bodies, and security labels are much
   smaller than this maximum.

5.3.  NFS Version 4 COMPOUND Requests

   The NFS version 4 COMPOUND procedure allows the transmission of more
   than one DDP-eligible data item per Call and Reply message. An NFS
   version 4 client provides XDR Position values in each Read chunk to
   disambiguate which chunk is associated with which argument data item.
   However, NFS version 4 server and client implementations must agree
   in advance on how to pair Write chunks with returned result data
   items.

   The mechanism specified in Section 4.3.2 of [RFC8166] is applied
   here, with additional restrictions that appear below. In the
   following list, an "NFS Read" operation refers to any NFS Version 4
   operation which has a DDP-eligible result data item (i.e., either a
   READ, READ_PLUS, or READLINK operation).

   o  If an NFS version 4 client wishes all DDP-eligible items in an NFS
      reply to be conveyed inline, it leaves the Write list empty.

   o  The first chunk in the Write list MUST be used by the first READ
      operation in an NFS version 4 COMPOUND procedure. The next Write
      chunk is used by the next READ operation, and so on.

   o  If an NFS version 4 client has provided a matching non-empty Write
      chunk, then the corresponding READ operation MUST return its DDP-
      eligible data item using that chunk.

   o  If an NFS version 4 client has provided an empty matching Write
      chunk, then the corresponding READ operation MUST return all of
      its result data items inline.

   o  If a READ operation returns a union arm which does not contain a
      DDP-eligible result, and the NFS version 4 client has provided a
      matching non-empty Write chunk, an NFS version 4 server MUST
      return an empty Write chunk in that Write list position.

   o  If there are more READ operations than Write chunks, then
      remaining NFS Read operations in an NFS version 4 COMPOUND that
      have no matching Write chunk MUST return their results inline.

5.3.1.  NFS Version 4 COMPOUND Example

   The following example shows a Write list with three Write chunks, A,
   B, and C. The NFS version 4 server consumes the provided Write
   chunks by writing the results of the designated operations in the
   compound request (READ and READLINK) back to each chunk.

      Write list:

         A --> B --> C

      NFS version 4 COMPOUND request:

         PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
                       |                   |                   |
                       v                   v                   v
                       A                   B                   C

   If the NFS version 4 client does not want to have the READLINK result
   returned via RDMA, it provides an empty Write chunk for buffer B to
   indicate that the READLINK result must be returned inline.

5.4.  NFS Version 4 Callback

   The NFS version 4 protocols support server-initiated callbacks to
   notify clients of events such as recalled delegations.

5.4.1.  NFS Version 4.0 Callback

   NFS version 4.0 implementations typically employ a separate TCP
   connection to handle callback operations, even when the forward
   channel uses an RPC-over-RDMA transport.

   No operation in the NFS version 4.0 callback RPC program conveys a
   significant data payload. Therefore, no XDR data item in this RPC
   program is DDP-eligible.

   A CB_RECALL reply is small and fixed in size. The CB_GETATTR reply
   contains a variable-length fattr4 data item. See Section 5.2.1 for a
   discussion of reply size prediction for this data item.

   An NFS version 4.0 client advertises netids and ad hoc port addresses
   for contacting its NFS version 4.0 callback service using the
   SETCLIENTID operation.

5.4.2.  NFS Version 4.1 Callback

   In NFS version 4.1 and newer minor versions, callback operations may
   appear on the same connection as is used for NFS version 4 forward
   channel client requests. NFS version 4 clients and servers MUST use
   the mechanism described in [RFC8167] when backchannel operations are
   conveyed on RPC-over-RDMA transports.

   The csa_back_chan_attrs argument of the CREATE_SESSION operation
   contains a ca_maxresponsesize field. The value in this field can be
   taken as the absolute maximum size of backchannel replies generated
   by a replying NFS version 4 client.
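
   The role that ca_maxresponsesize plays, both here and in
   Section 5.2.2, can be sketched in a few lines of Python. This is an
   illustration under stated assumptions rather than a prescribed
   algorithm: the function names and parameters are hypothetical, and
   capping a requester's own estimate at the session limit is an
   inference from the statement that the field is an absolute maximum.

```python
from typing import Optional


def reply_size_bound(estimate: Optional[int],
                     ca_maxresponsesize: int) -> int:
    """Upper bound on the reply size for one request on a session.

    `estimate` is the requester's own a priori maximum, or None when no
    precise bound can be derived (for example, when the reply carries
    an ACL). The session limit caps the result in either case; that
    capping is an assumption of this sketch.
    """
    if estimate is None:
        return ca_maxresponsesize
    return min(estimate, ca_maxresponsesize)


def needs_reply_chunk(bound: int, inline_threshold: int) -> bool:
    """A Reply chunk is needed when the bound exceeds the threshold.

    `inline_threshold` is the largest reply, in bytes, that can be
    conveyed inline on this connection.
    """
    return bound > inline_threshold
```

   A requester that cannot bound a reply precisely would call
   reply_size_bound(None, limit) and provide a Reply chunk whenever the
   result exceeds the applicable inline threshold.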

   There are no DDP-eligible data items in callback procedures defined
   in NFS version 4.1 or NFS version 4.2.  However, some callback
   operations, such as messages that convey device ID information, can
   be large, in which case a Long Call or Reply might be required.

   When an NFS version 4.1 client reports a backchannel
   ca_maxrequestsize that is larger than the connection's inline
   thresholds, the NFS version 4 client can support Long Calls.
   Otherwise, an NFS version 4 server MUST use Short messages to convey
   backchannel operations.

5.5.  Session-Related Considerations

   Typically, the presence of an NFS session [RFC5661] has no effect on
   the operation of RPC-over-RDMA.  None of the operations introduced to
   support NFS sessions contain DDP-eligible data items.  There is no
   need to match the number of session slots with the number of
   available RPC-over-RDMA credits.

   However, there are some rare error conditions that require special
   handling when an NFS session is operating on an RPC-over-RDMA
   transport.  For example, a requester might receive, in response to an
   RPC request, an RDMA_ERROR message with an rdma_err value of
   ERR_CHUNK, or an RDMA_MSG containing an RPC_GARBAGEARGS reply.
   Within RPC-over-RDMA Version One, this class of error can be
   generated for two different reasons:

   o  An XDR error was detected while parsing the RPC-over-RDMA headers.

   o  There was an error sending the response because, for example, a
      necessary Reply chunk was not provided or the one provided was of
      insufficient length.

   These two situations, which arise due to incorrect implementations or
   underestimation of reply size, have different implications with
   regard to Exactly-Once Semantics.  An XDR error in decoding the
   request precludes the execution of the request on the responder, but
   failure to send a reply indicates that some or all of the operations
   were executed.
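
   Neither error response carries the NFS results of the failed request,
   so a requester may find it useful to record, at the time each request
   is sent, which session slot the request occupies, keyed by its xid.
   The following sketch shows one possible form of that bookkeeping.  It
   is illustrative only: the class, method, and field names are
   hypothetical, and handling both error cases identically is a
   simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SlotRef:
    """Identifies the session slot a request occupies (hypothetical type)."""
    session_id: bytes
    slot_id: int


class OutstandingRequests:
    """Maps the xid of each in-flight request to its session slot."""

    def __init__(self) -> None:
        self._by_xid: Dict[int, SlotRef] = {}

    def sent(self, xid: int, slot: SlotRef) -> None:
        # Record the slot before the request goes on the wire.
        self._by_xid[xid] = slot

    def failed(self, xid: int) -> Optional[SlotRef]:
        # Called when an RDMA_ERROR with ERR_CHUNK or an RPC_GARBAGEARGS
        # reply arrives.  Returns the slot to retire, or None when the
        # xid is not known to this table.
        return self._by_xid.pop(xid, None)
```

   A caller would retire the returned slot and report an error to the
   RPC consumer rather than retransmit the request unchanged.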

   In both instances, the client SHOULD NOT retry the operation without
   addressing reply resource inadequacy.  Such a retry can result in the
   same sort of error seen previously.  Instead, it is best to consider
   the operation as completed unsuccessfully and report an error to the
   consumer who requested the RPC.

   In addition, within the error response, the requester does not have
   the result of the execution of the SEQUENCE operation, which
   identifies the session, slot, and sequence id for the request that
   has failed.  The xid associated with the request, obtained from the
   rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to
   determine the session and slot for the request that failed, and the
   slot must be properly retired.  If this is not done, the slot could
   be rendered permanently unavailable.

5.6.  Connection Keep-Alive

   NFS version 4 client implementations often rely on a transport-layer
   keep-alive mechanism to detect when an NFS version 4 server has
   become unresponsive.  When an NFS server is no longer responsive,
   client-side keep-alive terminates the connection, which in turn
   triggers reconnection and RPC retransmission.

   Some RDMA transports (such as Reliable Connections on InfiniBand)
   have no keep-alive mechanism.  Without a disconnect or new RPC
   traffic, such connections can remain alive long after an NFS server
   has become unresponsive.  Once an NFS client has consumed all
   available RPC-over-RDMA credits on that transport connection, it
   waits indefinitely for a reply before it can send another RPC
   request.

   NFS version 4 clients SHOULD reserve one RPC-over-RDMA credit to use
   for periodic server or connection health assessment.  This credit can
   be used to drive an RPC request on an otherwise idle connection,
   triggering either a quick affirmative server response or immediate
   connection termination.

6.  Extending NFS Upper Layer Bindings

   RPC programs such as NFS are required to have an Upper Layer Binding
   specification to interoperate on RPC-over-RDMA transports [RFC8166].
   Via standards action, the Upper Layer Binding specified in this
   document can be extended to cover versions of the NFS version 4
   protocol specified after NFS version 4 minor version 2, or separately
   published extensions to an existing NFS version 4 minor version, as
   described in [RFC8178].

7.  IANA Considerations

   NFS use of direct data placement introduces a need for an additional
   NFS port number assignment for networks that share traditional UDP
   and TCP port spaces with RDMA services.  The iWARP [RFC5041]
   [RFC5040] protocol is one such example (InfiniBand is not).

   NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
   listen for clients on UDP and TCP port 2049, and additionally, they
   register these with the portmapper and/or rpcbind [RFC1833] service.
   However, [RFC7530] requires NFS version 4 servers to listen on TCP
   port 2049, and they are not required to register.

   An NFS version 2 or version 3 server supporting RPC-over-RDMA on such
   a network and registering itself with the RPC portmapper MAY choose
   an arbitrary port, or MAY use the alternative well-known port number
   for its RPC-over-RDMA service.  The chosen port MAY be registered
   with the RPC portmapper under the netid assigned by the requirement
   in [RFC8166].

   An NFS version 4 server supporting RPC-over-RDMA on such a network
   MUST use the alternative well-known port number for its RPC-over-RDMA
   service.  Clients SHOULD connect to this well-known port without
   consulting the RPC portmapper (as for NFS version 4 on TCP
   transports).

   The port number assigned to an NFS service over an RPC-over-RDMA
   transport is available from the IANA port registry [RFC3232].

8.  Security Considerations

   RPC-over-RDMA supports all RPC security models, including RPCSEC_GSS
   security and transport-level security [RFC2203].  The choice of RDMA
   Read and RDMA Write to convey RPC arguments and results does not
   affect this, since it changes only the method of data transfer.
   Specifically, the requirements of [RFC8166] ensure that this choice
   does not introduce new vulnerabilities.

   Because this document defines only the binding of the NFS protocols
   atop [RFC8166], all relevant security considerations are to be
   described at that layer.

9.  References

9.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, DOI 10.17487/RFC2203, September
              1997.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on
              RPC-over-RDMA Transports", RFC 8167,
              DOI 10.17487/RFC8167, June 2017.

   [RFC8178]  Noveck, D., "Rules for NFSv4 Extensions and Minor
              Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017.

9.2.  Informative References

   [I-D.ietf-nfsv4-rfc5667bis]
              Lever, C., "Remote Direct Memory Access Transport for
              Remote Procedure Call, Version One",
              draft-ietf-nfsv4-rfc5667bis-11 (work in progress), May
              2017.

   [NSM]      The Open Group, "Protocols for Interworking: XNFS, Version
              3W", February 1998.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3232]  Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced
              by an On-line Database", RFC 3232, DOI 10.17487/RFC3232,
              January 2002.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5667]  Talpey, T. and B. Callaghan, "Network File System (NFS)
              Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
              January 2010.

   [rpcrdmav2]
              Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version Two",
              July 2017.

              Work in progress.

Appendix A.  Acknowledgments

   The author gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey on the original NFS Direct Data Placement specification
   [RFC5667].

   A large part of the material in this document is taken from
   [I-D.ietf-nfsv4-rfc5667bis] written by Chuck Lever.  The author
   wishes to acknowledge the debt he owes to Chuck for his work in
   providing an updated Upper Layer Binding for the NFS-related
   protocols.

   The author also wishes to thank Bill Baker and Greg Marsden for their
   support of the work to revive RPC-over-RDMA.

Author's Address

   David Noveck
   1601 Trapelo Road
   Waltham, MA  02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com