| < draft-ietf-nfsv4-rfc5667bis-04.txt | draft-ietf-nfsv4-rfc5667bis-05.txt > | |||
|---|---|---|---|---|
| Network File System Version 4 C. Lever, Ed. | Network File System Version 4 C. Lever, Ed. | |||
| Internet-Draft Oracle | Internet-Draft Oracle | |||
| Obsoletes: 5667 (if approved) January 20, 2017 | Obsoletes: 5667 (if approved) February 3, 2017 | |||
| Intended status: Standards Track | Intended status: Standards Track | |||
| Expires: July 24, 2017 | Expires: August 7, 2017 | |||
| Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA | Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA | |||
| draft-ietf-nfsv4-rfc5667bis-04 | draft-ietf-nfsv4-rfc5667bis-05 | |||
| Abstract | Abstract | |||
| This document specifies Upper Layer Bindings of Network File System | This document specifies Upper Layer Bindings of Network File System | |||
| (NFS) protocol versions to RPC-over-RDMA. Upper Layer Bindings are | (NFS) protocol versions to RPC-over-RDMA. Upper Layer Bindings are | |||
| required to enable RPC-based protocols, such as NFS, to use Direct | required to enable RPC-based protocols, such as NFS, to use Direct | |||
| Data Placement on RPC-over-RDMA. This document obsoletes RFC 5667. | Data Placement on RPC-over-RDMA. This document obsoletes RFC 5667. | |||
| Requirements Language | Requirements Language | |||
| skipping to change at page 1, line 40 ¶ | skipping to change at page 1, line 40 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on July 24, 2017. | This Internet-Draft will expire on August 7, 2017. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2017 IETF Trust and the persons identified as the | Copyright (c) 2017 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| skipping to change at page 2, line 27 ¶ | skipping to change at page 2, line 27 ¶ | |||
| the copyright in such materials, this document may not be modified | the copyright in such materials, this document may not be modified | |||
| outside the IETF Standards Process, and derivative works of it may | outside the IETF Standards Process, and derivative works of it may | |||
| not be created outside the IETF Standards Process, except to format | not be created outside the IETF Standards Process, except to format | |||
| it for publication as an RFC or to translate it into languages other | it for publication as an RFC or to translate it into languages other | |||
| than English. | than English. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 2. Conveying NFS Operations On RPC-Over-RDMA . . . . . . . . . . 3 | 2. Conveying NFS Operations On RPC-Over-RDMA . . . . . . . . . . 3 | |||
| 3. Upper Layer Binding For NFS Versions 2 And 3 . . . . . . . . 5 | 3. Upper Layer Binding For NFS Versions 2 And 3 . . . . . . . . 4 | |||
| 4. Upper Layer Binding For NFS Version 4 . . . . . . . . . . . . 7 | 4. Upper Layer Binding For NFS Version 4 . . . . . . . . . . . . 6 | |||
| 5. Extending NFS Upper Layer Bindings . . . . . . . . . . . . . 13 | 5. Extending NFS Upper Layer Bindings . . . . . . . . . . . . . 12 | |||
| 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 | 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 14 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 13 | |||
| 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| Appendix A. Changes Since RFC 5667 . . . . . . . . . . . . . . . 16 | Appendix A. Changes Since RFC 5667 . . . . . . . . . . . . . . . 15 | |||
| Appendix B. Acknowledgments . . . . . . . . . . . . . . . . . . 17 | Appendix B. Acknowledgments . . . . . . . . . . . . . . . . . . 17 | |||
| Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 18 | Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 1. Introduction | 1. Introduction | |||
| An RPC-over-RDMA transport, such as the one defined in | An RPC-over-RDMA transport, such as the one defined in | |||
| [I-D.ietf-nfsv4-rfc5666bis], may employ direct data placement to | [I-D.ietf-nfsv4-rfc5666bis], may employ direct data placement to | |||
| convey data payloads associated with RPC transactions. To enable | convey data payloads associated with RPC transactions. To enable | |||
| successful interoperation, RPC client and server implementations must | successful interoperation, RPC client and server implementations must | |||
| agree as to which XDR data items in what particular RPC procedures | agree as to which XDR data items in what particular RPC procedures | |||
| are eligible for direct data placement (DDP). | are eligible for direct data placement (DDP). | |||
| skipping to change at page 3, line 24 ¶ | skipping to change at page 3, line 24 ¶ | |||
| 2. Conveying NFS Operations On RPC-Over-RDMA | 2. Conveying NFS Operations On RPC-Over-RDMA | |||
| Definitions of terminology and a general discussion of how RPC-over- | Definitions of terminology and a general discussion of how RPC-over- | |||
| RDMA is used to convey RPC transactions can be found in | RDMA is used to convey RPC transactions can be found in | |||
| [I-D.ietf-nfsv4-rfc5666bis]. In this section, these general | [I-D.ietf-nfsv4-rfc5666bis]. In this section, these general | |||
| principles are applied in the context of conveying NFS procedures on | principles are applied in the context of conveying NFS procedures on | |||
| RPC-over-RDMA. Some issues common to all NFS protocol versions are | RPC-over-RDMA. Some issues common to all NFS protocol versions are | |||
| introduced. | introduced. | |||
| 2.1. The Read List | 2.1. DDP Eligibility Violations | |||
| The Read list in each RPC-over-RDMA transport header represents a set | ||||
| of memory regions containing DDP-eligible NFS argument data. Large | ||||
| data items, such as the data payload of an NFS version 3 WRITE | ||||
| procedure, can be referenced by the Read list. The NFS server pulls | ||||
| such payloads from the client and places them directly into its own | ||||
| memory. | ||||
| Exactly which XDR data items may be conveyed in this fashion is | ||||
| detailed later in this document. | ||||
| 2.2. The Write List | ||||
| The Write list in each RPC-over-RDMA transport header represents a | ||||
| set of memory regions that can receive DDP-eligible NFS result data. | ||||
| Large data items, such as the payload of an NFS version 3 READ | ||||
| procedure, can be referenced by the Write list. The NFS server | ||||
| pushes such payloads to the client, placing them directly into the | ||||
| client's memory. | ||||
| Each Write chunk corresponds to a specific XDR data item in an NFS | ||||
| reply. This document specifies how NFS client and server | ||||
| implementations identify the correspondence between Write chunks and | ||||
| XDR results. | ||||
| Exactly which XDR data items may be conveyed in this fashion is | ||||
| detailed later in this document. | ||||
| 2.3. Long Calls And Replies | ||||
| Small RPC messages are conveyed using RDMA Send operations which are | ||||
| of limited size. If an NFS request is too large to be conveyed | ||||
| within the NFS server's responder inline threshold, and there are no | ||||
| DDP-eligible data items that can be removed, an NFS client must send | ||||
| the request in the form of a Long Call. The entire NFS request is | ||||
| sent in a special Read chunk called a Position Zero Read chunk. | ||||
| If an NFS client determines that the maximum size of an NFS reply | ||||
| could be too large to be conveyed within its own responder inline | ||||
| threshold, it provides a Reply chunk in the RPC-over-RDMA transport | ||||
| header conveying the NFS request. The server places the entire NFS | ||||
| reply in the Reply chunk. | ||||
| When the RPC authentication flavor requires that DDP-eligible data | ||||
| items are never removed from RPC messages, an NFS client can provide | ||||
| both a Position Zero Read chunk and a Reply chunk for the same RPC. | ||||
| These special chunks are discussed in further detail in | ||||
| [I-D.ietf-nfsv4-rfc5666bis]. | ||||
| 2.4. Scatter-Gather Considerations | ||||
| A chunk typically corresponds to exactly one XDR data item. Each | ||||
| Read chunk is represented as a list of segments at the same XDR | ||||
| Position. Each Write chunk is represented as an array of segments. | ||||
| An NFS client thus has the flexibility to advertise a set of | ||||
| discontiguous memory regions in which to convey a single DDP-eligible | ||||
| XDR data item. | ||||
| 2.5. DDP Eligibility Violations | ||||
| To report a DDP-eligibility violation, an NFS server MUST return one | To report a DDP-eligibility violation, an NFS server MUST return one | |||
| of: | of: | |||
| o An RPC-over-RDMA message of type RDMA_ERROR, with the rdma_xid | o An RPC-over-RDMA message of type RDMA_ERROR, with the rdma_xid | |||
| field set to the XID of the matching NFS Call, and the rdma_error | field set to the XID of the matching NFS Call, and the rdma_error | |||
| field set to ERR_CHUNK; or | field set to ERR_CHUNK; or | |||
| o An RPC message (via an RDMA_MSG message) with the xid field set to | o An RPC message (via an RDMA_MSG message) with the xid field set to | |||
| the XID of the matching NFS Call, the mtype field set to REPLY, | the XID of the matching NFS Call, the mtype field set to REPLY, | |||
| the stat field set to MSG_ACCEPTED, and the accept_stat field set | the stat field set to MSG_ACCEPTED, and the accept_stat field set | |||
| to GARBAGE_ARGS. | to GARBAGE_ARGS. | |||
| Subsequent sections of this document describe further considerations | Subsequent sections of this document describe further considerations | |||
| particular to specific NFS protocols or procedures. | particular to specific NFS protocols or procedures. | |||
| 2.6. Reply Size Estimation | 2.2. Reply Size Estimation | |||
| During the construction of each RPC Call message, an NFS client is | During the construction of each RPC Call message, an NFS client is | |||
| responsible for allocating appropriate resources for receiving the | responsible for allocating appropriate resources for receiving the | |||
| matching Reply message. A Reply buffer overrun can result in | matching Reply message. A Reply buffer overrun can result in | |||
| corruption of the Reply message or termination of the transport | corruption of the Reply message or termination of the transport | |||
| connection. Therefore reliable reply size estimation is necessary to | connection. Therefore reliable reply size estimation is necessary to | |||
| ensure successful interoperation. | ensure successful interoperation. This is particularly critical, for | |||
| example, when allocating a Reply chunk. | ||||
| In many cases the Upper Layer Protocol's XDR definition provides | In many cases the Upper Layer Protocol's XDR definition provides | |||
| enough information to enable the client to make a reliable prediction | enough information to enable the client to make a reliable prediction | |||
| of the maximum size of the expected Reply message. If there are | of the maximum size of the expected Reply message. If there are | |||
| variable-size data items in the result, the maximum size of the RPC | variable-size data items in the result, the maximum size of the RPC | |||
| Reply message can be reliably estimated in most cases: | Reply message can be reliably estimated in most cases: | |||
| o The client requests only a specific portion of an object (for | o The client requests only a specific portion of an object (for | |||
| example, using the "count" and "offset" fields in an NFS READ). | example, using the "count" and "offset" fields in an NFS READ). | |||
| o The client has already cached the size of the whole object it is | o The client has already cached the size of the whole object it is | |||
| about to request (say, via a previous NFS GETATTR request). | about to request (say, via a previous NFS GETATTR request). | |||
| It is occasionally not possible to determine the maximum Reply | o The client and server have negotiated a maximum size for all calls | |||
| message size based solely on the above criteria. NFS client | and responses. | |||
| implementers can choose to provide the largest possible Reply buffer | ||||
| in those cases, based on, for instance, the largest possible NFS READ | ||||
| or WRITE payload (which is negotiated at mount time). | ||||
| In rare cases, a client may encounter a reply for which no a priori | ||||
| determination of reply size bound is possible. The client SHOULD | ||||
| expect a transport error to indicate that it must either terminate | ||||
| that RPC transaction, or retry it with a larger Reply chunk. | ||||
| The use of NFS COMPOUND operations raises the possibility of non- | Subsequent sections of this document describe considerations | |||
| idempotent requests that combine a non-idempotent operation with an | particular to specific NFS procedures where it is not possible to | |||
| operation whose reply size is uncertain. This causes potential | determine the maximum Reply message size based solely on the above | |||
| difficulties with retrying the transaction. Note however that many | criteria. | |||
| operations normally considered non-idempotent (e.g., WRITE, SETATTR) | criteria. | |||
| are actually idempotent. Truly non-idempotent operations are quite | ||||
| unusual in COMPOUNDs that include operations with uncertain reply | ||||
| sizes. | ||||
| 3. Upper Layer Binding For NFS Versions 2 And 3 | 3. Upper Layer Binding For NFS Versions 2 And 3 | |||
| This Upper Layer Binding specification applies to NFS Version 2 | This Upper Layer Binding specification applies to NFS Version 2 | |||
| [RFC1094] and NFS Version 3 [RFC1813]. For brevity, in this section | [RFC1094] and NFS Version 3 [RFC1813]. For brevity, in this section | |||
| a "legacy NFS client" refers to an NFS client using NFS version 2 or | a "legacy NFS client" refers to an NFS client using NFS version 2 or | |||
| NFS version 3 to communicate with an NFS server. Likewise, a "legacy | NFS version 3 to communicate with an NFS server. Likewise, a "legacy | |||
| NFS server" is an NFS server communicating with clients using NFS | NFS server" is an NFS server communicating with clients using NFS | |||
| version 2 or NFS version 3. | version 2 or NFS version 3. | |||
| skipping to change at page 6, line 22 ¶ | skipping to change at page 4, line 49 ¶ | |||
| o The pathname argument in the NFS SYMLINK procedure | o The pathname argument in the NFS SYMLINK procedure | |||
| o The opaque file data result in the NFS READ procedure | o The opaque file data result in the NFS READ procedure | |||
| o The pathname result in the NFS READLINK procedure | o The pathname result in the NFS READLINK procedure | |||
| All other argument or result data items in NFS versions 2 and 3 are | All other argument or result data items in NFS versions 2 and 3 are | |||
| not DDP-eligible. | not DDP-eligible. | |||
| A legacy server's response to a DDP-eligibility violation (described | A legacy server's response to a DDP-eligibility violation (described | |||
| in Section 2.5) does not give an indication to legacy clients of | in Section 2.1) does not give an indication to legacy clients of | |||
| whether the server has processed the arguments of the RPC Call, or | whether the server has processed the arguments of the RPC Call, or | |||
| whether the server has accessed or modified client memory associated | whether the server has accessed or modified client memory associated | |||
| with that RPC. | with that RPC. | |||
| A legacy NFS client determines the maximum reply size for each | A legacy NFS client determines the maximum reply size for each | |||
| operation using the basic criteria outlined in Section 2.6. Such | operation using the basic criteria outlined in Section 2.2. | |||
| clients provide a Reply chunk when the maximum possible reply size, | ||||
| exclusive of any data items represented by Write chunks, is larger | ||||
| than the client's responder inline threshold. | ||||
| 3.1. Auxiliary Protocols | 3.1. Auxiliary Protocols | |||
| NFS versions 2 and 3 are typically deployed with several other | NFS versions 2 and 3 are typically deployed with several other | |||
| protocols, sometimes referred to as "NFS auxiliary protocols." These | protocols, sometimes referred to as "NFS auxiliary protocols." These | |||
| are separate RPC programs that define procedures which are not part | are separate RPC programs that define procedures which are not part | |||
| of the NFS version 2 or version 3 RPC programs. These include: | of the NFS version 2 or version 3 RPC programs. These include: | |||
| o The MOUNT and NLM protocols, introduced in an appendix of | o The MOUNT and NLM protocols, introduced in an appendix of | |||
| [RFC1813] | [RFC1813] | |||
| skipping to change at page 7, line 20 ¶ | skipping to change at page 5, line 42 ¶ | |||
| deployments where NFS operations run on RPC-over-RDMA. When a legacy | deployments where NFS operations run on RPC-over-RDMA. When a legacy | |||
| server supports these programs on RPC-over-RDMA, it advertises the | server supports these programs on RPC-over-RDMA, it advertises the | |||
| port address via the usual rpcbind service [RFC1833]. | port address via the usual rpcbind service [RFC1833]. | |||
| No operation in these protocols conveys a significant data payload, | No operation in these protocols conveys a significant data payload, | |||
| and the size of RPC messages in these protocols is uniformly small. | and the size of RPC messages in these protocols is uniformly small. | |||
| Therefore, no XDR data items in these protocols are DDP-eligible. | Therefore, no XDR data items in these protocols are DDP-eligible. | |||
| The largest variable-length XDR data item is an xdr_netobj. In most | The largest variable-length XDR data item is an xdr_netobj. In most | |||
| implementations this data item is not larger than 1024 bytes, making | implementations this data item is not larger than 1024 bytes, making | |||
| reliable reply size estimation straightforward using the criteria | reliable reply size estimation straightforward using the criteria | |||
| outlined in Section 2.6. | outlined in Section 2.2. | |||
| 3.1.2. NFSACL Protocol | 3.1.2. NFSACL Protocol | |||
| Legacy clients and servers that support the NFSACL RPC program | Legacy clients and servers that support the NFSACL RPC program | |||
| typically convey NFSACL procedures on the same connection as the NFS | typically convey NFSACL procedures on the same connection as the NFS | |||
| RPC program. This obviates the need for separate rpcbind queries to | RPC program. This obviates the need for separate rpcbind queries to | |||
| discover server support for this RPC program. | discover server support for this RPC program. | |||
| ACLs are typically small, but even large ACLs must be encoded and | ACLs are typically small, but even large ACLs must be encoded and | |||
| decoded to some degree. Thus no data item in this Upper Layer | decoded to some degree. Thus no data item in this Upper Layer | |||
| skipping to change at page 7, line 42 ¶ | skipping to change at page 6, line 18 ¶ | |||
| For procedures whose replies do not include an ACL object, the size | For procedures whose replies do not include an ACL object, the size | |||
| of a reply is determined directly from the NFSACL program's XDR | of a reply is determined directly from the NFSACL program's XDR | |||
| definition. | definition. | |||
| There is no protocol-wide size limit for NFS version 3 ACLs, and | There is no protocol-wide size limit for NFS version 3 ACLs, and | |||
| there is no mechanism in either the NFSACL or NFS programs for a | there is no mechanism in either the NFSACL or NFS programs for a | |||
| legacy client to ascertain the largest ACL a legacy server can store. | legacy client to ascertain the largest ACL a legacy server can store. | |||
| Legacy client implementations should choose a maximum size for ACLs | Legacy client implementations should choose a maximum size for ACLs | |||
| based on their own internal limits. A recommended lower bound for | based on their own internal limits. A recommended lower bound for | |||
| this maximum is 32,768 bytes, though a larger Reply chunk (up to the | this maximum is 32,768 bytes. | |||
| negotiated rsize setting) can be provided. | ||||
| When an especially large ACL is expected, a Reply chunk might be | ||||
| required. If a legacy NFS server indicates that it cannot return an | ||||
| NFSACL GETACL response because the legacy NFS client has not provided | ||||
| a large enough Reply chunk to receive that response, the legacy NFS | ||||
| client can choose to | ||||
| o Terminate the NFSACL GETACL with an error, or | ||||
| o Allocate a larger Reply chunk and send the same NFSACL GETACL | ||||
| request as a new RPC transaction. The NFS client should avoid | ||||
| retrying the request indefinitely. | ||||
| 4. Upper Layer Binding For NFS Version 4 | 4. Upper Layer Binding For NFS Version 4 | |||
| This Upper Layer Binding specification applies to all protocols | This Upper Layer Binding specification applies to all protocols | |||
| defined in NFS Version 4.0 [RFC7530], NFS Version 4.1 [RFC5661], and | defined in NFS Version 4.0 [RFC7530], NFS Version 4.1 [RFC5661], and | |||
| NFS Version 4.2 [RFC7862]. | NFS Version 4.2 [RFC7862]. | |||
| 4.1. DDP-Eligibility | 4.1. DDP-Eligibility | |||
| Only the following XDR data items in the COMPOUND procedure of all | Only the following XDR data items in the COMPOUND procedure of all | |||
| skipping to change at page 9, line 9 ¶ | skipping to change at page 7, line 44 ¶ | |||
| o An NFS version 4.2 server MUST NOT return more than two elements | o An NFS version 4.2 server MUST NOT return more than two elements | |||
| in the rpr_contents array of any READ_PLUS operation. It returns | in the rpr_contents array of any READ_PLUS operation. It returns | |||
| as much of the requested byte range as it can fit within these two | as much of the requested byte range as it can fit within these two | |||
| elements. If the NFS version 4.2 server has not asserted rpr_eof | elements. If the NFS version 4.2 server has not asserted rpr_eof | |||
| in the reply, the NFS version 4.2 client SHOULD send additional | in the reply, the NFS version 4.2 client SHOULD send additional | |||
| READ_PLUS requests for any remaining bytes. | READ_PLUS requests for any remaining bytes. | |||
| 4.2. NFS Version 4 Reply Size Estimation | 4.2. NFS Version 4 Reply Size Estimation | |||
| An NFS version 4 client provides a Reply chunk when the maximum | Within NFS version 4, there are certain variable-length result data | |||
| possible reply size is larger than the client's responder inline | items whose maximum size cannot be estimated by clients reliably | |||
| threshold. | because there is no protocol-specified size limit on these arrays. | |||
| These include: | ||||
| There are certain NFS version 4 data items whose size cannot be | ||||
| estimated by clients reliably, however, because there is no protocol- | ||||
| specified size limit on these structures. These include: | ||||
| o The attrlist4 field | o The attrlist4 field | |||
| o Fields containing ACLs such as fattr4_acl, fattr4_dacl, | o Fields containing ACLs such as fattr4_acl, fattr4_dacl, | |||
| fattr4_sacl | fattr4_sacl | |||
| o Fields in the fs_locations4 and fs_locations_info4 data structures | o Fields in the fs_locations4 and fs_locations_info4 data structures | |||
| o Opaque fields which pertain to pNFS layout metadata, such as | o Fields opaque to the NFS version 4 protocol which pertain to pNFS | |||
| loc_body, loh_body, da_addr_body, lou_body, lrf_body, | layout metadata, such as loc_body, loh_body, da_addr_body, | |||
| fattr_layout_types and fs_layout_types | lou_body, lrf_body, fattr_layout_types and fs_layout_types | |||
| 4.2.1. Reply Size Estimation For Minor Version 0 | 4.2.1. Reply Size Estimation For Minor Version 0 | |||
| The items enumerated above in Section 4.2 make it difficult to | The NFSv4.0 protocol itself does not impose any bound on the size of | |||
| predict the maximum size of GETATTR replies that interrogate | NFS calls or responses. | |||
| variable-length attributes. As discussed in Section 2.6, client | ||||
| Some of the data items enumerated in Section 4.2 (in particular, the | ||||
| items related to ACLs and fs_locations) make it difficult to predict | ||||
| the maximum size of NFSv4.0 GETATTR replies that interrogate | ||||
| variable-length attributes. As discussed in Section 2.2, client | ||||
| implementations can rely on their own internal architectural limits | implementations can rely on their own internal architectural limits | |||
| to bound the reply size, but such limits are not guaranteed to be | to bound the reply size, but such limits are not always guaranteed to | |||
| reliable. | be reliable. | |||
| If a client implementation is equipped to recognize that a transport | When an especially large NFSv4.0 GETATTR result is expected, a Reply | |||
| error could mean that it provisioned an inadequately sized Reply | chunk might be required. If an NFSv4.0 server indicates that it | |||
| chunk, it can retry the operation with a larger Reply chunk. | cannot return an NFSv4.0 GETATTR response because the requesting | |||
| Otherwise, the client must terminate the RPC transaction. | NFSv4.0 client has not provided a large enough Reply chunk to receive | |||
| that response, the NFSv4.0 client can choose to | ||||
| It is best to avoid issuing single COMPOUNDs that contain both non- | o Terminate the NFSv4.0 GETATTR with an error, or | |||
| idempotent operations and operations where the maximum reply size | ||||
| cannot be reliably predicted. | o Allocate a larger Reply chunk and send the same NFSv4.0 GETATTR | |||
| request as a new RPC transaction. The NFS client should avoid | ||||
| retrying the request indefinitely. | ||||
| The use of NFS COMPOUND operations raises the possibility of requests | ||||
| | that combine a non-idempotent operation (e.g., NFS WRITE) with an | |||
| NFSv4.0 GETATTR that requests one or more variable length results. | ||||
| This combination should be avoided by ensuring that any NFSv4.0 | ||||
| GETATTR operation that might return a result of unpredictable length | ||||
| is sent in an NFS COMPOUND by itself. | ||||
| 4.2.2. Reply Size Estimation For Minor Version 1 And Newer | 4.2.2. Reply Size Estimation For Minor Version 1 And Newer | |||
| In NFS version 4.1 and newer minor versions, the csa_fore_chan_attrs | In NFS version 4.1 and newer minor versions, the csa_fore_chan_attrs | |||
| argument of the CREATE_SESSION operation contains a | argument of the CREATE_SESSION operation contains a | |||
| ca_maxresponsesize field. The value in this field can be taken as | ca_maxresponsesize field. The value in this field can be taken as | |||
| the absolute maximum size of replies generated by a replying NFS | the absolute maximum size of replies generated by a replying NFS | |||
| version 4 server. | version 4 server. | |||
| This value can be used in cases where it is not possible to estimate | This value can be used in cases where it is not possible to estimate | |||
| skipping to change at page 10, line 43 ¶ | skipping to change at page 9, line 41 ¶ | |||
| chunk is used by the next READ operation, and so on. | chunk is used by the next READ operation, and so on. | |||
| o If an NFS version 4 client has provided a matching non-empty Write | o If an NFS version 4 client has provided a matching non-empty Write | |||
| chunk, then the corresponding READ operation MUST return its DDP- | chunk, then the corresponding READ operation MUST return its DDP- | |||
| eligible data item using that chunk. | eligible data item using that chunk. | |||
| o If an NFS version 4 client has provided an empty matching Write | o If an NFS version 4 client has provided an empty matching Write | |||
| chunk, then the corresponding READ operation MUST return all of | chunk, then the corresponding READ operation MUST return all of | |||
| its result data items inline. | its result data items inline. | |||
| o If an READ operation returns a union arm which does not contain a | o If a READ operation returns a union arm which does not contain a | |||
| DDP-eligible result, and the NFS version 4 client has provided a | DDP-eligible result, and the NFS version 4 client has provided a | |||
| matching non-empty Write chunk, an NFS version 4 server MUST | matching non-empty Write chunk, an NFS version 4 server MUST | |||
| return an empty Write chunk in that Write list position. | return an empty Write chunk in that Write list position. | |||
| o If there are more READ operations than Write chunks, then | o If there are more READ operations than Write chunks, then | |||
| remaining NFS READ operations in an NFS version 4 COMPOUND that | remaining NFS READ operations in an NFS version 4 COMPOUND that | |||
| have no matching Write chunk MUST return their results inline. | have no matching Write chunk MUST return their results inline. | |||
| 4.3.1. NFS Version 4 COMPOUND Example | 4.3.1. NFS Version 4 COMPOUND Example | |||
| skipping to change at page 12, line 23 ¶ | skipping to change at page 11, line 23 ¶ | |||
| The csa_back_chan_attrs argument of the CREATE_SESSION operation | The csa_back_chan_attrs argument of the CREATE_SESSION operation | |||
| contains a ca_maxresponsesize field. The value in this field can be | contains a ca_maxresponsesize field. The value in this field can be | |||
| taken as the absolute maximum size of backchannel replies generated | taken as the absolute maximum size of backchannel replies generated | |||
| by a replying NFS version 4 client. | by a replying NFS version 4 client. | |||
| There are no DDP-eligible data items in callback procedures defined | There are no DDP-eligible data items in callback procedures defined | |||
| in NFS version 4.1 or NFS version 4.2. However, some callback | in NFS version 4.1 or NFS version 4.2. However, some callback | |||
| operations, such as messages that convey device ID information, can | operations, such as messages that convey device ID information, can | |||
| be large, in which case a Long Call or Reply might be required. | be large, in which case a Long Call or Reply might be required. | |||
| When an NFS version 4.1 client reports a backchannel | When an NFS version 4.1 client can support Long Calls in its | |||
| ca_maxrequestsize that is larger than the connection's inline | backchannel, it reports a backchannel ca_maxrequestsize that is | |||
| thresholds, the NFS version 4 client can support Long Calls. | larger than the connection's inline thresholds. Otherwise an NFS | |||
| Otherwise an NFS version 4 server MUST use Short messages to convey | version 4 server MUST use only Short messages to convey backchannel | |||
| backchannel operations. | operations. | |||
| 4.5. Session-Related Considerations | 4.5. Session-Related Considerations | |||
| Typically the presence of an NFS session [RFC5661] has no effect on | Typically the presence of an NFS session [RFC5661] has no effect on | |||
| the operation of RPC-over-RDMA. None of the operations introduced to | the operation of RPC-over-RDMA. None of the operations introduced to | |||
| support NFS sessions contain DDP-eligible data items. There is no | support NFS sessions (e.g., SEQUENCE) contain DDP-eligible data items. | |||
| need to match the number of session slots with the number of | There is no need to match the number of session slots with the number | |||
| available RPC-over-RDMA credits. | of available RPC-over-RDMA credits. | |||
| However, there are some rare error conditions which require special | ||||
| handling when an NFS session is operating on an RPC-over-RDMA | ||||
| transport. For example, a requester might receive, in response to an | ||||
| RPC request, an RDMA_ERROR message with an rdma_err value of | ||||
| ERR_CHUNK, or an RDMA_MSG containing an RPC_GARBAGEARGS reply. | ||||
| Within RPC-over-RDMA Version One, this class of error can be | ||||
| generated for two different reasons: | ||||
| o There was an XDR error detected parsing the RPC-over-RDMA headers. | ||||
| o There was an error sending the response, because, for example, a | ||||
| necessary reply chunk was not provided or the one provided is of | ||||
| insufficient length. | ||||
| These two situations, which arise due to incorrect implementations or | ||||
| underestimation of reply size, have different implications with | ||||
| regard to Exactly-Once Semantics. An XDR error in decoding the | ||||
| request precludes the execution of the request on the responder, but | ||||
| failure to send a reply indicates that some or all of the operations | ||||
| were executed. | ||||
| In both instances, the client SHOULD NOT retry the operation without | When an NFS session operates on an RPC-over-RDMA transport, there are | |||
| addressing reply resource inadequacy. Such a retry can result in the | a few additional cases where an RPC transaction can fail. For | |||
| same sort of error seen previously. Instead, it is best to consider | example, a requester might receive, in response to an RPC request, an | |||
| the operation as completed unsuccessfully and report an error to the | RDMA_ERROR message with an rdma_err value of ERR_CHUNK, or an | |||
| consumer who requested the RPC. | RDMA_MSG containing an RPC_GARBAGEARGS reply. These situations are | |||
| no different from existing RPC errors which an NFS session | ||||
| implementation is already prepared to handle for other transports. | ||||
| In addition, within the error response, the requester does not have | As with other transports during such a failure, there might be no | |||
| the result of the execution of the SEQUENCE operation, which | SEQUENCE result available to the requester to distinguish whether | |||
| identifies the session, slot, and sequence id for the request which | failure occurred before or after the requested operations were | |||
| has failed. The xid associated with the request, obtained from the | executed on the responder. When a transport error occurs (e.g., | |||
| rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to | RDMA_ERROR), the requester proceeds as usual to match the incoming | |||
| determine the session and slot for the request which failed, and the | XID value to a waiting RPC Call. The RPC transaction is terminated, | |||
| slot must be properly retired. If this is not done, the slot could | and the result status is reported to the Upper Layer Protocol. The | |||
| be rendered permanently unavailable. | requester's session implementation then determines the session ID and | |||
| slot for the failed request, and performs slot recovery to make that | ||||
| slot usable again. If this is not done, that slot could be rendered | ||||
| permanently unavailable. | ||||
| 4.6. Connection Keep-Alive | 4.6. Retransmission And Keep-Alive | |||
| NFS version 4 client implementations often rely on a transport-layer | NFS version 4 client implementations often rely on a transport-layer | |||
| keep-alive mechanism to detect when an NFS version 4 server has | keep-alive mechanism to detect when an NFS version 4 server has | |||
| become unresponsive. When an NFS server is no longer responsive, | become unresponsive. When an NFS server is no longer responsive, | |||
| client-side keep-alive terminates the connection, which in turn | client-side keep-alive terminates the connection, which in turn | |||
| triggers reconnection and RPC retransmission. | triggers reconnection and RPC retransmission. | |||
| Some RDMA transports (such as Reliable Connections on InfiniBand) | Some RDMA transports (such as Reliable Connections on InfiniBand) | |||
| have no keep-alive mechanism. Without a disconnect or new RPC | have no keep-alive mechanism. Without a disconnect or new RPC | |||
| traffic, such connections can remain alive long after an NFS server | traffic, such connections can remain alive long after an NFS server | |||
| has become unresponsive. Once an NFS client has consumed all | has become unresponsive. Once an NFS client has consumed all | |||
| available RPC-over-RDMA credits on that transport connection, it will | available RPC-over-RDMA credits on that transport connection, it will | |||
| forever await a reply before sending another RPC request. | forever await a reply before sending another RPC request. | |||
| NFS version 4 clients SHOULD reserve one RPC-over-RDMA credit to use | NFS version 4 clients SHOULD reserve one RPC-over-RDMA credit to use | |||
| for periodic server or connection health assessment. This credit can | for periodic server or connection health assessment. This credit can | |||
| be used to drive an RPC request on an otherwise idle connection, | be used to drive an RPC request on an otherwise idle connection, | |||
| triggering either a quick affirmative server response or immediate | triggering either a quick affirmative server response or immediate | |||
| connection termination. | connection termination. | |||
| In addition to network partition and request loss scenarios, RPC- | ||||
| over-RDMA connections can be terminated when a Transport header is | ||||
| malformed, messages are larger than receive resources, or when too | ||||
| many RPC-over-RDMA messages are sent at once. In such cases: | ||||
| o If there is a transport error indicated (ie, RDMA_ERROR) before | ||||
| the disconnect or instead of a disconnect, the requester MUST | ||||
| respond to that error as prescribed by the specification of the | ||||
| RPC transport. Then the NFS version 4 rules for handling | ||||
| retransmission apply. | ||||
| o If there is a transport disconnect and the responder has provided | ||||
| no other response for a request, then only the NFS version 4 rules | ||||
| for handling retransmission apply. | ||||
| 5. Extending NFS Upper Layer Bindings | 5. Extending NFS Upper Layer Bindings | |||
| RPC programs such as NFS are required to have an Upper Layer Binding | RPC programs such as NFS are required to have an Upper Layer Binding | |||
| specification to interoperate on RPC-over-RDMA transports | specification to interoperate on RPC-over-RDMA transports | |||
| [I-D.ietf-nfsv4-rfc5666bis]. Via standards action, the Upper Layer | [I-D.ietf-nfsv4-rfc5666bis]. Via standards action, the Upper Layer | |||
| Binding specified in this document can be extended to cover versions | Binding specified in this document can be extended to cover versions | |||
| of the NFS version 4 protocol specified after NFS version 4 minor | of the NFS version 4 protocol specified after NFS version 4 minor | |||
| version 2, or separately published extensions to an existing NFS | version 2, or separately published extensions to an existing NFS | |||
| version 4 minor version, as described in [I-D.ietf-nfsv4-versioning]. | version 4 minor version, as described in [I-D.ietf-nfsv4-versioning]. | |||
| skipping to change at page 14, line 39 ¶ | skipping to change at page 13, line 39 ¶ | |||
| service. Clients SHOULD connect to this well-known port without | service. Clients SHOULD connect to this well-known port without | |||
| consulting the RPC portmapper (as for NFS version 4 on TCP | consulting the RPC portmapper (as for NFS version 4 on TCP | |||
| transports). | transports). | |||
| The port number assigned to an NFS service over an RPC-over-RDMA | The port number assigned to an NFS service over an RPC-over-RDMA | |||
| transport is available from the IANA port registry [RFC3232]. | transport is available from the IANA port registry [RFC3232]. | |||
| 7. Security Considerations | 7. Security Considerations | |||
| RPC-over-RDMA supports all RPC security models, including RPCSEC_GSS | RPC-over-RDMA supports all RPC security models, including RPCSEC_GSS | |||
| security and transport-level security [RFC2203]. The choice of RDMA | security and transport-level security [RFC2203]. The choice of what | |||
| Read and RDMA Write to convey RPC argument and results does not | Direct Data Placement mechanism to convey RPC argument and results | |||
| affect this, since it changes only the method of data transfer. | does not affect this, since it changes only the method of data | |||
| Specifically, the requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure | transfer. Specifically, the requirements of | |||
| that this choice does not introduce new vulnerabilities. | [I-D.ietf-nfsv4-rfc5666bis] ensure that this choice does not | |||
| introduce new vulnerabilities. | ||||
| Because this document defines only the binding of the NFS protocols | Because this document defines only the binding of the NFS protocols | |||
| atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security | atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security | |||
| considerations are therefore to be described at that layer. | considerations are therefore to be described at that layer. | |||
| 8. References | 8. References | |||
| 8.1. Normative References | 8.1. Normative References | |||
| [I-D.ietf-nfsv4-rfc5666bis] | [I-D.ietf-nfsv4-rfc5666bis] | |||
| skipping to change at page 17, line 19 ¶ | skipping to change at page 16, line 19 ¶ | |||
| results to Write chunks. | results to Write chunks. | |||
| Requirements to ignore extra Read or Write chunks have been removed | Requirements to ignore extra Read or Write chunks have been removed | |||
| from the NFS version 2 and 3 Upper Layer Binding, as they conflict | from the NFS version 2 and 3 Upper Layer Binding, as they conflict | |||
| with [I-D.ietf-nfsv4-rfc5666bis]. | with [I-D.ietf-nfsv4-rfc5666bis]. | |||
| A complete discussion of reply size estimation has been introduced | A complete discussion of reply size estimation has been introduced | |||
| for all protocols covered by the Upper Layer Bindings in this | for all protocols covered by the Upper Layer Bindings in this | |||
| document. | document. | |||
| A section discussing NFS version 4 retransmission and connection loss | ||||
| has been added. | ||||
| The following additional improvements have been made, relative to | The following additional improvements have been made, relative to | |||
| [RFC5667]: | [RFC5667]: | |||
| o An explicit discussion of NFS version 4.0 and NFS version 4.1 | o An explicit discussion of NFS version 4.0 and NFS version 4.1 | |||
| backchannel operation has replaced the previous treatment of | backchannel operation has replaced the previous treatment of | |||
| callback operations. | callback operations. | |||
| o A binding for NFS version 4.2 has been added that includes | o A binding for NFS version 4.2 has been added that includes | |||
| discussion of new data-bearing operations like READ_PLUS. | discussion of new data-bearing operations like READ_PLUS. | |||
| skipping to change at page 18, line 10 ¶ | skipping to change at page 17, line 17 ¶ | |||
| The author gratefully acknowledges the work of Brent Callaghan and | The author gratefully acknowledges the work of Brent Callaghan and | |||
| Tom Talpey on the original NFS Direct Data Placement specification | Tom Talpey on the original NFS Direct Data Placement specification | |||
| [RFC5667]. The author also wishes to thank Bill Baker and Greg | [RFC5667]. The author also wishes to thank Bill Baker and Greg | |||
| Marsden for their support of this work. | Marsden for their support of this work. | |||
| Dave Noveck provided excellent review, constructive suggestions, and | Dave Noveck provided excellent review, constructive suggestions, and | |||
| consistent navigational guidance throughout the process of drafting | consistent navigational guidance throughout the process of drafting | |||
| this document. Dave also contributed the text of Section 4.5. | this document. Dave also contributed the text of Section 4.5. | |||
| Thanks to Karen Deitke for her sharp observations about idempotency, | Thanks to Karen Deitke for her sharp observations about idempotency, | |||
| and the clarity of the discussion of NFS COMPOUNDs. | and the clarity of the discussion of NFS COMPOUNDs and NFS sessions. | |||
| Special thanks go to Transport Area Director Spencer Dawkins, nfsv4 | Special thanks go to Transport Area Director Spencer Dawkins, nfsv4 | |||
| Working Group Chair Spencer Shepler, and nfsv4 Working Group | Working Group Chair Spencer Shepler, and nfsv4 Working Group | |||
| Secretary Thomas Haynes for their support. | Secretary Thomas Haynes for their support. | |||
| Author's Address | Author's Address | |||
| Charles Lever (editor) | Charles Lever (editor) | |||
| Oracle Corporation | Oracle Corporation | |||
| 1015 Granger Avenue | 1015 Granger Avenue | |||
| End of changes. 32 change blocks. | ||||
| 174 lines changed or deleted | 124 lines changed or added | |||
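
The Write chunk matching rules for NFS version 4 READ operations, quoted in the bullet list just before Section 4.3.1, can be illustrated with a small model. This is a hedged sketch and not text from either draft: the names `WriteChunk`, `ReadResult`, and `place_read_results` are hypothetical. It also assumes that READ operations are matched to Write chunks in list order, and that a DDP-eligible result whose matching chunk is non-empty and large enough is placed in that chunk; both assumptions come from outside the quoted bullets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WriteChunk:
    length: int  # 0 models an empty Write chunk


@dataclass
class ReadResult:
    ddp_eligible: bool  # False models a union arm with no DDP-eligible item
    size: int           # bytes of result data (assumed to fit the chunk)


def place_read_results(
    reads: List[ReadResult], chunks: List[WriteChunk]
) -> Tuple[List[str], List[WriteChunk]]:
    """Return (placement per READ, Write chunks as returned by the server).

    Only Write list positions that have a matching READ are modeled.
    """
    placements: List[str] = []
    returned: List[WriteChunk] = []
    for position, read in enumerate(reads):
        if position >= len(chunks):
            # More READs than Write chunks: the remainder return inline.
            placements.append("inline")
            continue
        chunk = chunks[position]
        if chunk.length == 0:
            # Client supplied an empty matching chunk: everything inline.
            placements.append("inline")
            returned.append(WriteChunk(0))
        elif not read.ddp_eligible:
            # Non-empty chunk but nothing DDP-eligible to put in it:
            # the server returns an empty chunk in that position.
            placements.append("inline")
            returned.append(WriteChunk(0))
        else:
            # Assumed default case: the result is placed in the chunk.
            placements.append("chunk")
            returned.append(WriteChunk(read.size))
    return placements, returned
```

In this model a COMPOUND with four READ operations and three Write chunks always returns the fourth READ result inline, whatever the first three positions do.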
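
The keep-alive recommendation in Section 4.6 (reserve one RPC-over-RDMA credit for periodic health assessment) can be sketched as a simple credit accountant. This is an illustrative model only: the class `CreditPool` and its method names are hypothetical, the sketch assumes a grant of at least two credits, and it ignores how credit grants change over the life of a connection.

```python
class CreditPool:
    """Tracks RPC-over-RDMA credits, holding one back for a health probe."""

    def __init__(self, granted: int) -> None:
        if granted < 2:
            # Assumption of this sketch: one credit for probes, at least
            # one for ordinary traffic.
            raise ValueError("sketch assumes at least two granted credits")
        self.granted = granted
        self.ordinary_in_flight = 0
        self.probe_in_flight = False

    def try_send_request(self) -> bool:
        """Claim a credit for an ordinary RPC; never uses the reserved one."""
        if self.ordinary_in_flight < self.granted - 1:
            self.ordinary_in_flight += 1
            return True
        return False

    def complete_request(self) -> None:
        """Release the credit held by a completed ordinary RPC."""
        if self.ordinary_in_flight > 0:
            self.ordinary_in_flight -= 1

    def try_send_probe(self) -> bool:
        """Claim the reserved credit for a health-check RPC."""
        if not self.probe_in_flight:
            self.probe_in_flight = True
            return True
        return False

    def complete_probe(self) -> None:
        """Release the reserved credit once the probe is answered."""
        self.probe_in_flight = False
```

Because ordinary requests can never take the last credit, a client modeled this way can still drive a probe on a connection whose other credits are all waiting on an unresponsive server. The probe then leads to either a prompt reply or connection termination, as the section describes.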