| < draft-ietf-nfsv4-rfc5667bis-00.txt | draft-ietf-nfsv4-rfc5667bis-01.txt > | |||
|---|---|---|---|---|
| Network File System Version 4 C. Lever, Ed. | Network File System Version 4 C. Lever, Ed. | |||
| Internet-Draft Oracle | Internet-Draft Oracle | |||
| Obsoletes: 5667 (if approved) June 13, 2016 | Obsoletes: 5667 (if approved) June 30, 2016 | |||
| Intended status: Standards Track | Intended status: Standards Track | |||
| Expires: December 15, 2016 | Expires: January 1, 2017 | |||
| Network File System (NFS) Direct Data Placement | Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA | |||
| draft-ietf-nfsv4-rfc5667bis-00 | draft-ietf-nfsv4-rfc5667bis-01 | |||
| Abstract | Abstract | |||
| This document defines the bindings of the various Network File System | This document specifies the Upper Layer Bindings of Network File | |||
| (NFS) versions to the Remote Direct Memory Access (RDMA) operations | System (NFS) protocol versions to RPC-over-RDMA transports. Such | |||
| supported by the RPC-over-RDMA transport protocol. It describes the | Upper Layer Bindings are required to enable RPC-based protocols to | |||
| use of direct data placement by means of server-initiated RDMA | use direct data placement when conveying large data payloads on RPC- | |||
| operations into client-supplied buffers for implementations of NFS | over-RDMA transports. This document obsoletes RFC 5667. | |||
| versions 2, 3, 4, and 4.1 over such an RDMA transport. This document | ||||
| obsoletes RFC 5667. | ||||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on December 15, 2016. | This Internet-Draft will expire on January 1, 2017. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction | 1. Introduction | |||
| 1.1. Requirements Language | 1.1. Requirements Language | |||
| 1.2. Planned Changes To This Document | 1.2. Changes Since RFC 5667 | |||
| 2. Transfers from NFS Client to NFS Server | 1.3. Planned Changes To This Document | |||
| 3. Transfers from NFS Server to NFS Client | 2. Conveying NFS Operations On RPC-Over-RDMA Transports | |||
| 4. NFS Versions 2 and 3 Mapping | 2.1. Use Of The Read List | |||
| 5. NFS Version 4 Mapping | 2.2. Use Of The Write List | |||
| 5.1. NFS Version 4 Callbacks | 2.3. Construction Of Individual Chunks | |||
| 6. IANA Considerations | 2.4. Use Of Long Calls And Replies | |||
| 7. Security Considerations | 3. NFS Versions 2 And 3 Upper Layer Binding | |||
| 8. Acknowledgments | 4. NFS Version 4 Upper Layer Binding | |||
| 9. References | 4.1. NFS Version 4 COMPOUND Considerations | |||
| 9.1. Normative References | 4.2. NFS Version 4 Callbacks | |||
| 9.2. Informative References | 5. IANA Considerations | |||
| 6. Security Considerations | ||||
| 7. Acknowledgments | ||||
| 8. References | ||||
| 8.1. Normative References | ||||
| 8.2. Informative References | ||||
| Author's Address | Author's Address | |||
| 1. Introduction | 1. Introduction | |||
| The Remote Direct Memory Access (RDMA) Transport for Remote Procedure | Remote Direct Memory Access Transport for Remote Procedure Call, | |||
| Call (RPC) [I-D.ietf-nfsv4-rfc5666bis] allows an RPC client | Version One [I-D.ietf-nfsv4-rfc5666bis] (RPC-over-RDMA) enables the | |||
| application to post buffers in a Chunk list for specific arguments | use of direct data placement to accelerate the transmission of large | |||
| and results from an RPC call. The RDMA transport header conveys this | data payloads associated with RPC transactions. | |||
| list of client buffer addresses to the server where the application | ||||
| can associate them with client data and use RDMA operations to | Each RPC-over-RDMA transport header can convey lists of memory | |||
| transfer the results directly to and from the posted buffers on the | locations involved in direct transfers of data payloads. These | |||
| client. The client and server must agree on a consistent mapping of | memory locations correspond to XDR data items defined in an Upper | |||
| posted buffers to RPC. This document details the mapping for each | Layer Protocol (such as NFS). | |||
| version of the NFS protocol [RFC1094] [RFC1813] [RFC7530] [RFC5661]. | ||||
| To facilitate interoperation, RPC client and server implementations | ||||
| must agree on what XDR data items in which RPC procedures are | ||||
| eligible for direct data placement (DDP). | ||||
| This document specifies the set of XDR data items in each of the | ||||
| following NFS protocol versions that are eligible for DDP. It also | ||||
| contains additional material required of Upper Layer Bindings as | ||||
| specified in [I-D.ietf-nfsv4-rfc5666bis]. | ||||
| o NFS Version 2 [RFC1094] | ||||
| o NFS Version 3 [RFC1813] | ||||
| o NFS Version 4.0 [RFC7530] | ||||
| o NFS Version 4.1 [RFC5661] | ||||
| o NFS Version 4.2 [I-D.ietf-nfsv4-minorversion2] | ||||
| The Upper Layer Binding specified in this document can be extended to | ||||
| cover the addition of new DDP-eligible XDR data items defined by | ||||
| versions of the NFS version 4 protocol specified after this document | ||||
| has been ratified. | ||||
| 1.1. Requirements Language | 1.1. Requirements Language | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| 1.2. Planned Changes To This Document | 1.2. Changes Since RFC 5667 | |||
| The following changes will be made, relative to [RFC5667]: | Corrections and updates made necessary by new language in | |||
| [I-D.ietf-nfsv4-rfc5666bis] have been introduced. For example, | ||||
| references to deprecated features of RPC-over-RDMA Version One, such | ||||
| as RDMA_MSGP, and the use of the Read list for handling RPC replies, | ||||
| have been removed. The term "mapping" has been replaced with the term | ||||
| "binding" or "Upper Layer Binding" throughout the document. Material | ||||
| that duplicates what is in [I-D.ietf-nfsv4-rfc5666bis] has been | ||||
| deleted. | ||||
| o References to [RFC5666] will be replaced with references to | Material required by [I-D.ietf-nfsv4-rfc5666bis] for Upper Layer | |||
| [I-D.ietf-nfsv4-rfc5666bis]. Corrections and updates relative to | Bindings that was not present in [RFC5667] has been added, including | |||
| new language in [I-D.ietf-nfsv4-rfc5666bis] will be introduced. | discussion of how each NFS version properly estimates the maximum | |||
| size of RPC replies. | ||||
| o References to obsolete RFCs will be replaced. | The following changes have been made, relative to [RFC5667]: | |||
| o The reference to a nonexistent NFSv4 SYMLINK operation will be | o Ambiguous or erroneous uses of RFC 2119 terms have been corrected. | |||
| replaced with NFSv4 CREATE(NF4LNK). | ||||
| o The discussion of 12KB and 36KB inline threshold will be removed. | o References to specific data movement mechanisms have been made | |||
| generic or removed. | ||||
| o The discussion of NFSv4 COMPOUND handling will be completed. | o References to obsolete RFCs have been replaced. | |||
| o Technical corrections have been made. For example, the mention of | ||||
| 12KB and 36KB inline thresholds has been removed. The reference | ||||
| to a nonexistent NFS version 4 SYMLINK operation has been | ||||
| replaced with NFS version 4 CREATE(NF4LNK). | ||||
| o An IANA Considerations Section has replaced the "Port Usage | ||||
| Considerations" Section. | ||||
| o Code excerpts have been removed, and figures have been modernized. | ||||
| o Language inconsistent with or contradictory to | ||||
| [I-D.ietf-nfsv4-rfc5666bis] has been removed from Sections 2 and | ||||
| 3, and both Sections have been combined into Section 2 in the | ||||
| present document. | ||||
| o An explicit discussion of NFSv4.0 and NFSv4.1 backchannel | o An explicit discussion of NFSv4.0 and NFSv4.1 backchannel | |||
| operation will be introduced. | operation will replace the previous treatment of callback | |||
| operations. No NFSv4.x callback operation is DDP-eligible. | ||||
| o An IANA Considerations section is required by IDNITS. | o The binding for NFSv4.1 has been completed. No additional DDP- | |||
| eligible operations exist in NFSv4.1. | ||||
| o Code excerpts will be modernized. | o A binding for NFSv4.2 has been added that includes discussion of | |||
| new data-bearing operations like READ_PLUS. | ||||
| Other minor changes and editorial corrections may also be made. | 1.3. Planned Changes To This Document | |||
| 2. Transfers from NFS Client to NFS Server | The following changes are planned, relative to [RFC5667]: | |||
| The RDMA Read list, in the RDMA transport header, allows an RPC | o The discussion of NFS version 4 COMPOUND handling will be | |||
| client to marshal RPC call data selectively. Large chunks of data, | completed. | |||
| such as the file data of an NFS WRITE request, MAY be referenced by | ||||
| an RDMA Read list and be moved efficiently and directly placed by an | ||||
| RDMA Read operation initiated by the server. | ||||
| The process of identifying these chunks for the RDMA Read list can be | o Remarks about handling DDP-eligibility violations will be | |||
| implemented entirely within the RPC layer. It is transparent to the | introduced. | |||
| upper-level protocol, such as NFS. For instance, the file data | ||||
| portion of an NFS WRITE request can be selected as an RDMA "chunk" | ||||
| within the eXternal Data Representation (XDR) marshaling code of RPC | ||||
| based on a size criterion, independently of the NFS protocol layer. | ||||
| The XDR unmarshaling on the receiving system can identify the | ||||
| correspondence between Read chunks and protocol elements via the XDR | ||||
| position value encoded in the Read chunk entry. | ||||
| RPC RDMA Read chunks are employed by this NFS mapping to convey | o A discussion of how the NFS binding to RPC-over-RDMA is extended | |||
| specific NFS data to the server in a manner that may be directly | by standards action will be added. | |||
| placed. The following sections describe this mapping for versions of | ||||
| the NFS protocol. | ||||
| 3. Transfers from NFS Server to NFS Client | 2. Conveying NFS Operations On RPC-Over-RDMA Transports | |||
| The RDMA Write list, in the RDMA transport header, allows the client | Definitions of terminology and a general discussion of how RPC-over- | |||
| to post one or more buffers into which the server will RDMA Write | RDMA is used to convey RPC transactions can be found in | |||
| designated result chunks directly. If the client sends a null Write | [I-D.ietf-nfsv4-rfc5666bis]. In this section, these general | |||
| list, then results from the RPC call will be returned either as an | principles are applied to the specifics of the NFS protocol. | |||
| inline reply, as chunks in an RDMA Read list of server-posted | ||||
| buffers, or in a client-posted reply buffer. | ||||
| Each posted buffer in a Write list is represented as an array of | 2.1. Use Of The Read List | |||
| memory segments. This allows the client some flexibility in | ||||
| submitting discontiguous memory segments into which the server will | ||||
| scatter the result. Each segment is described by a triplet | ||||
| consisting of the segment handle or steering tag (STag), segment | ||||
| length, and memory address or offset. | ||||
| <CODE BEGINS> | The Read list in each RPC-over-RDMA transport header represents a set | |||
| of memory regions containing DDP-eligible NFS argument data. Large | ||||
| data items, such as the file data payload of an NFS WRITE request, | ||||
| are referenced by the Read list and placed directly into server | ||||
| memory. | ||||
| struct xdr_rdma_segment { | XDR unmarshaling code on the NFS server identifies the correspondence | |||
| uint32 handle; /* Registered memory handle */ | between Read chunks and particular NFS arguments via the chunk | |||
| uint32 length; /* Length of the chunk in bytes */ | Position value encoded in each Read chunk. | |||
| uint64 offset; /* Chunk virtual address or offset */ | ||||
| }; | ||||
| struct xdr_write_chunk { | 2.2. Use Of The Write List | |||
| struct xdr_rdma_segment target<>; | ||||
| }; | ||||
| struct xdr_write_list { | The Write list in each RPC-over-RDMA transport header represents a | |||
| struct xdr_write_chunk entry; | set of memory regions that can receive DDP-eligible NFS result data. | |||
| struct xdr_write_list *next; | Large data items such as the payload of an NFS READ request are | |||
| }; | referenced by the Write list and placed directly into client memory. | |||
| <CODE ENDS> | Each Write chunk corresponds to a specific XDR data item in an NFS | |||
| reply. This document specifies how NFS client and server | ||||
| implementations identify the correspondence between Write chunks and | ||||
| each XDR result. | ||||
| The sum of the segment lengths yields the total size of the buffer, | 2.3. Construction Of Individual Chunks | |||
| which MUST be large enough to accept the result. If the buffer is | ||||
| too small, the server MUST return an XDR encode error. The server | ||||
| MUST return the result data for a posted buffer by progressively | ||||
| filling its segments, perhaps leaving some trailing segments unfilled | ||||
| or partially full if the size of the result is less than the total | ||||
| size of the buffer segments. | ||||
| The server returns the RDMA Write list to the client with the segment | Each Read chunk is represented as a list of segments at the same XDR | |||
| length fields overwritten to indicate the amount of data RDMA written | Position, and each Write chunk is represented as an array of | |||
| to each segment. Results returned by direct placement MUST NOT be | segments. An NFS client thus has the flexibility to advertise a set | |||
| returned by other methods, e.g., by Read chunk list or inline. If no | of discontiguous memory regions in which to send or receive a single | |||
| result data at all is returned for the element, the server places no | DDP-eligible data item. | |||
| data in the buffer(s), but does return zeros in the segment length | ||||
| fields corresponding to the result. | ||||
| The RDMA Write list allows the client to provide multiple result | 2.4. Use Of Long Calls And Replies | |||
| buffers -- each buffer maps to a specific result in the reply. The | ||||
| NFS client and server implementations agree by specifying the mapping | ||||
| of results to buffers for each RPC procedure. The following sections | ||||
| describe this mapping for versions of the NFS protocol. | ||||
| Through the use of RDMA Write lists in NFS requests, it is not | Small RPC messages are conveyed using RDMA Send operations, which are | |||
| necessary to employ the RDMA Read lists in the NFS replies, as | of limited size. If an NFS request is too large to be conveyed via | |||
| described in the RPC-over-RDMA protocol. This enables more efficient | an RDMA Send, and there are no DDP-eligible data items that can be | |||
| operation, by avoiding the need for the server to expose buffers for | removed, an NFS client must send the request using a Long Call. The | |||
| RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY | entire NFS request is sent in a special Read chunk. | |||
| additionally employ RDMA Reply chunks to receive entire messages, as | ||||
| described in [I-D.ietf-nfsv4-rfc5666bis]. | ||||
| 4. NFS Versions 2 and 3 Mapping | If a client expects that an NFS reply will be too large to be | |||
| conveyed via an RDMA Send, it provides a Reply chunk in the RPC-over- | ||||
| RDMA transport header conveying the NFS request. The server can | ||||
| place the entire NFS reply in the Reply chunk. | ||||
| A single RDMA Write list entry MAY be posted by the client to receive | These are described in more detail in [I-D.ietf-nfsv4-rfc5666bis]. | |||
| either the opaque file data from a READ request or the pathname from | ||||
| a READLINK request. The server MUST ignore a Write list for any | ||||
| other NFS procedure, as well as any Write list entries beyond the | ||||
| first in the list. | ||||
| Similarly, a single RDMA Read list entry MAY be posted by the client | 3. NFS Versions 2 And 3 Upper Layer Binding | |||
| to supply the opaque file data for a WRITE request or the pathname | ||||
| for a SYMLINK request. The server MUST ignore any Read list for | ||||
| other NFS procedures, as well as additional Read list entries beyond | ||||
| the first in the list. | ||||
| Because there are no NFS version 2 or 3 requests that transfer bulk | An NFS client MAY send a single Read chunk to supply opaque file data | |||
| data in both directions, it is not necessary to post requests | for an NFS WRITE procedure, or the pathname for an NFS SYMLINK | |||
| containing both Write and Read lists. Any unneeded Read or Write | procedure. For all other NFS procedures, the server MUST ignore Read | |||
| lists are ignored by the server. | chunks that have a non-zero value in their Position fields, and Read | |||
| chunks beyond the first in the Read list. | ||||
| In the case where the outgoing request or expected incoming reply is | Similarly, an NFS client MAY provide a single Write chunk to receive | |||
| larger than the maximum size supported on the connection, it is | either opaque file data from an NFS READ procedure, or the pathname | |||
| possible for the RPC layer to post the entire message or result in a | from an NFS READLINK procedure. The server MUST ignore the Write | |||
| special "RDMA_NOMSG" message type that is transferred entirely by | list for any other NFS procedure, and any Write chunks beyond the | |||
| RDMA. This is implemented in RPC, below NFS, and therefore has no | first in the Write list. | |||
| effect on the message contents. | ||||
| Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the | There are no NFS version 2 or 3 procedures that have DDP-eligible | |||
| "RDMA_MSGP" padding method described in the RPC-over-RDMA protocol, | data items in both their Call and Reply. However, if an NFS client | |||
| if the appropriate value for the server is known to the client. | is sending a Long Call or Reply, it MAY provide a combination of Read | |||
| Padding allows the opaque file data to arrive at the server in an | list, Write list, and/or a Reply chunk in the same transaction. | |||
| aligned fashion, which may improve server performance. | ||||
| The NFS version 2 and 3 protocols are frequently limited in practice | NFS clients already successfully estimate the maximum reply size of | |||
| to requests containing less than or equal to 8 kilobytes and 32 | each operation in order to provide an adequate set of buffers to | |||
| kilobytes of data, respectively. In these cases, it is often | receive each NFS reply. An NFS client provides a Reply chunk when | |||
| practical to support basic operation without employing a | the maximum possible reply size is larger than the client's responder | |||
| configuration exchange as discussed in [I-D.ietf-nfsv4-rfc5666bis]. | inline threshold. | |||
| The server MUST post buffers large enough to receive the largest | ||||
| possible incoming message (approximately 12 KB for NFS version 2, or | ||||
| 36 KB for NFS version 3, would be vastly sufficient), and the client | ||||
| can post buffers large enough to receive replies based on the "rsize" | ||||
| it is using to the server, plus a fixed overhead for the RPC and NFS | ||||
| headers. Because the server MUST NOT return data in excess of this | ||||
| size, the client can be assured of the adequacy of its posted buffer | ||||
| sizes. | ||||
| Flow control is handled dynamically by the RPC RDMA protocol, and | How does the server respond if the client has not provided enough | |||
| write padding is OPTIONAL and therefore MAY remain unused. | Write list resources to handle an NFS READ or READLINK reply? How | |||
| does the server respond if the client has not provided enough Reply | ||||
| chunk resources to handle an NFS reply? | ||||
| Alternatively, if the server is administratively configured to values | 4. NFS Version 4 Upper Layer Binding | |||
| appropriate for all its clients, the same assurance of | ||||
| interoperability within the domain can be made. | ||||
| The use of a configuration protocol with NFS v2 and v3 is therefore | This specification applies to NFS Version 4.0 [RFC7530], NFS Version | |||
| OPTIONAL. Employing a configuration exchange may allow some | 4.1 [RFC5661], and NFS Version 4.2 [I-D.ietf-nfsv4-minorversion2]. | |||
| advantage to server resource management through accurately sizing | It also applies to the callback protocols associated with each of | |||
| buffers, enabling the server to know exactly how many RDMA Reads may | these minor versions. | |||
| be in progress at once on the client connection, and enabling client | ||||
| write padding, which may be desirable for certain servers when RDMA | ||||
| Read is impractical. | ||||
| 5. NFS Version 4 Mapping | An NFS client MAY send a Read chunk to supply opaque file data for a | |||
| WRITE operation or the pathname for a CREATE(NF4LNK) operation in an | ||||
| NFS version 4 COMPOUND procedure. An NFS client MUST NOT send a Read | ||||
| chunk that corresponds with any other XDR data item in any other NFS | ||||
| version 4 operation. | ||||
| This specification applies to the first minor version of NFS version | Similarly, an NFS client MAY provide a Write chunk to receive either | |||
| 4 (NFSv4.0) and any subsequent minor versions that do not override | opaque file data from a READ operation, NFS4_CONTENT_DATA from a | |||
| this mapping. | READ_PLUS operation, or the pathname from a READLINK operation in an | |||
| NFS version 4 COMPOUND procedure. An NFS client MUST NOT provide a | ||||
| Write chunk that corresponds with any other XDR data item in any | ||||
| other NFS version 4 operation. | ||||
| The Write list MUST be considered only for the COMPOUND procedure. | There is no prohibition against an NFS version 4 COMPOUND procedure | |||
| This procedure returns results from a sequence of operations. Only | constructed with both a READ and WRITE operation, say. Thus it is | |||
| the opaque file data from an NFS READ operation and the pathname from | possible for NFS version 4 COMPOUND procedures to use both the Read | |||
| a READLINK operation MUST utilize entries from the Write list. | list and Write list simultaneously. An NFS client MAY provide a Read | |||
| list and a Write list in the same transaction if it is sending a Long | ||||
| Call or Reply. | ||||
| If there is no Write list, i.e., the list is null, then any READ or | Some remarks need to be made about how NFS version 4 clients estimate | |||
| READLINK operations in the COMPOUND MUST return their data inline. | reply size, and how DDP-eligibility violations are reported. | |||
| The NFSv4.0 client MUST ensure in this case that any result of its | ||||
| READ and READLINK requests will fit within its receive buffers, in | ||||
| order to avoid a resulting RDMA transport error upon transfer. The | ||||
| server is not required to detect this. | ||||
| The first entry in the Write list MUST be used by the first READ or | 4.1. NFS Version 4 COMPOUND Considerations | |||
| READLINK in the COMPOUND request. The next Write list entry is used | ||||
| by the next READ or READLINK, and so on. If there are more READ or | ||||
| READLINK operations than Write list entries, then any remaining | ||||
| operations MUST return their results inline. | ||||
| If a Write list entry is presented, then the corresponding READ or | An NFS version 4 COMPOUND procedure supplies arguments for a sequence | |||
| READLINK MUST return its data via an RDMA Write to the buffer | of operations, and returns results from that sequence. A client MAY | |||
| indicated by the Write list entry. If the Write list entry has zero | construct an NFS version 4 COMPOUND procedure that uses more than one | |||
| RDMA segments, or if the total size of the segments is zero, then the | chunk in either the Read list or Write list. The NFS client provides | |||
| corresponding READ or READLINK operation MUST return its result | XDR Position values in each Read chunk to disambiguate which chunk is | |||
| inline. | associated with which XDR data item. | |||
| The following example shows an RDMA Write list with three posted | However, NFS server and client implementations must agree in advance | |||
| buffers A, B, and C. The designated operations in the compound | on how to pair Write chunks with returned result data items. The | |||
| request, READ and READLINK, consume the posted buffers by writing | mechanism specified in [I-D.ietf-nfsv4-rfc5666bis] is applied here: | |||
| their results back to each buffer. | ||||
| RDMA Write list: | o The first chunk in the Write list MUST be used by the first READ | |||
| or READLINK operation in an NFS version 4 COMPOUND procedure. The | ||||
| next Write chunk is used by the next READ or READLINK, and so on. | ||||
| o If there are more READ or READLINK operations than Write chunks, | ||||
| then any remaining operations MUST return their results inline. | ||||
| o If an NFS client presents a Write chunk, then the corresponding | ||||
| READ or READLINK operation MUST return its data by placing data | ||||
| into that chunk. | ||||
| o If the Write chunk has zero RDMA segments, or if the total size of | ||||
| the segments is zero, then the corresponding READ or READLINK | ||||
| operation MUST return its result inline. | ||||
| The following example shows a Write list with three Write chunks, A, | ||||
| B, and C. The server consumes the provided Write chunks by writing | ||||
| the results of the designated operations in the compound request, | ||||
| READ and READLINK, back to each chunk. | ||||
| Write list: | ||||
| A --> B --> C | A --> B --> C | |||
| Compound request: | NFS version 4 COMPOUND request: | |||
| PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ | PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ | |||
| | | | | | | | | |||
| v v v | v v v | |||
| A B C | A B C | |||
| If the client does not want to have the READLINK result returned | If the client does not want to have the READLINK result returned | |||
| directly, then it provides a zero-length array of segment triplets | directly, it provides a zero-length array of segment triplets for | |||
| for buffer B or sets the values in the segment triplet for buffer B | buffer B or sets the values in the segment triplet for buffer B to | |||
| to zeros so that the READLINK result MUST be returned inline. | zeros to indicate that the READLINK result must be returned inline. | |||
| The situation is similar for RDMA Read lists sent by the client and | ||||
| applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3. | ||||
| Additionally, inline segments too large to fit in posted buffers MAY | ||||
| be transferred in special "RDMA_NOMSG" messages. | ||||
| Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the | ||||
| "RDMA_MSGP" padding method described in the RPC-over-RDMA protocol, | ||||
| if the appropriate value for the server is known to the client. | ||||
| Padding allows the opaque file data to arrive at the server in an | ||||
| aligned fashion, which may improve server performance. In order to | ||||
| ensure accurate alignment for all data, it is likely that the client | ||||
| will restrict its use of OPTIONAL padding to COMPOUND requests | ||||
| containing only a single WRITE operation. | ||||
| Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 | Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 | |||
| COMPOUND is not bounded, even when RDMA chunks are in use. While it | COMPOUND is not bounded. However, typical NFS version 4 clients | |||
| might appear that a configuration protocol exchange (such as the one | rarely issue such problematic requests. In practice, NFS version 4 | |||
| described in [I-D.ietf-nfsv4-rfc5666bis]) would help, in fact the | clients behave in much more predictable ways. Rsize and wsize apply | |||
| layering issues involved in building COMPOUNDs by NFS make such a | to COMPOUND operations by capping the total amount of data payload | |||
| mechanism unworkable. | allowed in each COMPOUND. An extension to NFS version 4 supporting a | |||
| comprehensive exchange of upper-layer message size parameters is part | ||||
| However, typical NFS version 4 clients rarely issue such problematic | of [RFC5661]. | |||
| requests. In practice, they behave in much more predictable ways; in | ||||
| fact, most still support the traditional rsize/wsize mount parameters. | ||||
| Therefore, most NFS version 4 clients function over RPC-over-RDMA in | ||||
| the same way as NFS versions 2 and 3, operationally. | ||||
| There are however advantages to allowing both client and server to | ||||
| operate with prearranged size constraints, for example, use of the | ||||
| sizes to better manage the server's response cache. An extension to | ||||
| NFS version 4 supporting a more comprehensive exchange of upper-layer | ||||
| parameters is part of [RFC5661]. | ||||
| 5.1. NFS Version 4 Callbacks | 4.2. NFS Version 4 Callbacks | |||
| The NFS version 4 protocols support server-initiated callbacks to | The NFS version 4 protocols support server-initiated callbacks to | |||
| selected clients, in order to notify them of events such as recalled | notify clients of events such as recalled delegations. There are no | |||
| delegations, etc. These callbacks present no particular issue to | DDP-eligible data items in callback protocols associated with | |||
| being framed over RPC-over-RDMA since such callbacks do not carry | NFSv4.0, NFSv4.1, or NFSv4.2. | |||
| bulk data such as NFS READ or NFS WRITE. They MAY be transmitted | ||||
| inline via RDMA_MSG, or if the callback message or its reply overflow | ||||
| the negotiated buffer sizes for a callback connection, they MAY be | ||||
| transferred via the RDMA_NOMSG method as described above for other | ||||
| exchanges. | ||||
| One special case is noteworthy: in NFS version 4.1, the callback | ||||
| channel is optionally negotiated to be on the same connection as one | ||||
| used for client requests. In this case, and because the transaction | ||||
| ID (XID) is present in the RPC-over-RDMA header, the client MUST | ||||
| ascertain whether the message is in fact an RPC REPLY, and therefore | ||||
| a reply to a prior request and carrying its XID, before processing it | ||||
| as such. By the same token, the server MUST ascertain whether an | ||||
| incoming message on such a callback-eligible connection is an RPC | ||||
| CALL, before optionally processing the XID. | ||||
| In the callback case, the XID present in the RPC-over-RDMA header | In NFS version 4.1 and 4.2, callback operations may appear on the | |||
| will potentially have any value, which may (or may not) collide with | same connection as one used for NFS version 4 client requests. To | |||
| an XID used by the client for a previous or future request. The | operate on RPC-over-RDMA transports, NFS version 4 clients and | |||
| client and server MUST inspect the RPC component of the message to | servers MUST use the mechanism described in | |||
| determine its potential disposition as either an RPC CALL or RPC | [I-D.ietf-nfsv4-rpcrdma-bidirection]. | |||
| REPLY, prior to processing this XID, and MUST NOT reject or accept it | ||||
| without also determining the proper context. | ||||
| 6. IANA Considerations | 5. IANA Considerations | |||
| NFS use of direct data placement introduces a need for an additional | NFS use of direct data placement introduces a need for an additional | |||
| NFS port number assignment for networks that share traditional UDP | NFS port number assignment for networks that share traditional UDP | |||
| and TCP port spaces with RDMA services. The iWARP [RFC5041] | and TCP port spaces with RDMA services. The iWARP [RFC5041] | |||
| [RFC5040] protocol is such an example (InfiniBand is not). | [RFC5040] protocol is such an example (InfiniBand is not). | |||
| NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally | NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally | |||
| listen for clients on UDP and TCP port 2049, and additionally, they | listen for clients on UDP and TCP port 2049, and additionally, they | |||
| register these with the portmapper and/or rpcbind [RFC1833] service. | register these with the portmapper and/or rpcbind [RFC1833] service. | |||
| However, [RFC7530] requires NFS servers for version 4 to listen on | However, [RFC7530] requires NFS servers for version 4 to listen on | |||
| skipping to change at page 9, line 33 ¶ | skipping to change at page 9, line 13 ¶ | |||
| in [I-D.ietf-nfsv4-rfc5666bis]. | in [I-D.ietf-nfsv4-rfc5666bis]. | |||
| An NFS version 4 server supporting RPC-over-RDMA on such a network | An NFS version 4 server supporting RPC-over-RDMA on such a network | |||
| MUST use the alternative well-known port number for its RPC-over-RDMA | MUST use the alternative well-known port number for its RPC-over-RDMA | |||
| service. Clients SHOULD connect to this well-known port without | service. Clients SHOULD connect to this well-known port without | |||
| consulting the RPC portmapper (as for NFSv4/TCP). | consulting the RPC portmapper (as for NFSv4/TCP). | |||
| The port number assigned to an NFS service over an RPC-over-RDMA | The port number assigned to an NFS service over an RPC-over-RDMA | |||
| transport is available from the IANA port registry [RFC3232]. | transport is available from the IANA port registry [RFC3232]. | |||
| 7. Security Considerations | 6. Security Considerations | |||
| The RDMA transport for RPC [I-D.ietf-nfsv4-rfc5666bis] supports all | The RDMA transport for RPC [I-D.ietf-nfsv4-rfc5666bis] supports all | |||
| RPC [RFC5531] security models, including RPCSEC_GSS [RFC2203] | RPC [RFC5531] security models, including RPCSEC_GSS [RFC2203] | |||
| security and link-level security. The choice of RDMA Read and RDMA | security and transport-level security. The choice of RDMA Read and | |||
| Write to return RPC argument and results, respectively, does not | RDMA Write to convey RPC argument and results does not affect this, | |||
| affect this, since it only changes the method of data transfer. | since it only changes the method of data transfer. Specifically, the | |||
| Specifically, the requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure | requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure that this choice | |||
| that this choice does not introduce new vulnerabilities. | does not introduce new vulnerabilities. | |||
| Because this document defines only the binding of the NFS protocols | Because this document defines only the binding of the NFS protocols | |||
| atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security | atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security | |||
| considerations are therefore to be described at that layer. | considerations are therefore to be described at that layer. | |||
| 8. Acknowledgments | 7. Acknowledgments | |||
| The author gratefully acknowledges the work of Brent Callaghan and | The author gratefully acknowledges the work of Brent Callaghan and | |||
| Tom Talpey on the original NFS Direct Data Placement specification | Tom Talpey on the original NFS Direct Data Placement specification | |||
| [RFC5667]. The author also wishes to thank Bill Baker and Greg | [RFC5667]. The author also wishes to thank Bill Baker and Greg | |||
| Marsden for their support of this work. | Marsden for their support of this work. | |||
| 9. References | Dave Noveck provided excellent review, constructive suggestions, and | |||
| consistent navigational guidance throughout the process of drafting | ||||
| this document. | ||||
| 9.1. Normative References | Special thanks go to nfsv4 Working Group Chair Spencer Shepler and | |||
| nfsv4 Working Group Secretary Thomas Haynes for their support. | ||||
| 8. References | ||||
| 8.1. Normative References | ||||
| [I-D.ietf-nfsv4-minorversion2] | ||||
| Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf- | ||||
| nfsv4-minorversion2-41 (work in progress), January 2016. | ||||
| [I-D.ietf-nfsv4-rfc5666bis] | ||||
| Lever, C., Simpson, W., and T. Talpey, "Remote Direct | ||||
| Memory Access Transport for Remote Procedure Call, Version | ||||
| One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress), | ||||
| May 2016. | ||||
| [I-D.ietf-nfsv4-rpcrdma-bidirection] | ||||
| Lever, C., "Bi-directional Remote Procedure Call On RPC- | ||||
| over-RDMA Transports", draft-ietf-nfsv4-rpcrdma- | ||||
| bidirection-05 (work in progress), June 2016. | ||||
| [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", | [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", | |||
| RFC 1833, DOI 10.17487/RFC1833, August 1995, | RFC 1833, DOI 10.17487/RFC1833, August 1995, | |||
| <http://www.rfc-editor.org/info/rfc1833>. | <http://www.rfc-editor.org/info/rfc1833>. | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ | Requirement Levels", BCP 14, RFC 2119, | |||
| RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
| <http://www.rfc-editor.org/info/rfc2119>. | <http://www.rfc-editor.org/info/rfc2119>. | |||
| [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol | [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol | |||
| Specification", RFC 2203, DOI 10.17487/RFC2203, September | Specification", RFC 2203, DOI 10.17487/RFC2203, September | |||
| 1997, <http://www.rfc-editor.org/info/rfc2203>. | 1997, <http://www.rfc-editor.org/info/rfc2203>. | |||
| [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | |||
| Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, | Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, | |||
| May 2009, <http://www.rfc-editor.org/info/rfc5531>. | May 2009, <http://www.rfc-editor.org/info/rfc5531>. | |||
| [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
| "Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
| Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | |||
| <http://www.rfc-editor.org/info/rfc5661>. | <http://www.rfc-editor.org/info/rfc5661>. | |||
| [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System | [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System | |||
| (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, | (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, | |||
| March 2015, <http://www.rfc-editor.org/info/rfc7530>. | March 2015, <http://www.rfc-editor.org/info/rfc7530>. | |||
| 9.2. Informative References | 8.2. Informative References | |||
| [I-D.ietf-nfsv4-rfc5666bis] | ||||
| Lever, C., Simpson, W., and T. Talpey, "Remote Direct | ||||
| Memory Access Transport for Remote Procedure Call, Version | ||||
| One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress), | ||||
| May 2016. | ||||
| [RFC1094] Nowicki, B., "NFS: Network File System Protocol | [RFC1094] Nowicki, B., "NFS: Network File System Protocol | |||
| specification", RFC 1094, DOI 10.17487/RFC1094, March | specification", RFC 1094, DOI 10.17487/RFC1094, March | |||
| 1989, <http://www.rfc-editor.org/info/rfc1094>. | 1989, <http://www.rfc-editor.org/info/rfc1094>. | |||
| [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS | [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS | |||
| Version 3 Protocol Specification", RFC 1813, DOI 10.17487/ | Version 3 Protocol Specification", RFC 1813, | |||
| RFC1813, June 1995, | DOI 10.17487/RFC1813, June 1995, | |||
| <http://www.rfc-editor.org/info/rfc1813>. | <http://www.rfc-editor.org/info/rfc1813>. | |||
| [RFC3232] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced | [RFC3232] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced | |||
| by an On-line Database", RFC 3232, DOI 10.17487/RFC3232, | by an On-line Database", RFC 3232, DOI 10.17487/RFC3232, | |||
| January 2002, <http://www.rfc-editor.org/info/rfc3232>. | January 2002, <http://www.rfc-editor.org/info/rfc3232>. | |||
| [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | |||
| Garcia, "A Remote Direct Memory Access Protocol | Garcia, "A Remote Direct Memory Access Protocol | |||
| Specification", RFC 5040, DOI 10.17487/RFC5040, October | Specification", RFC 5040, DOI 10.17487/RFC5040, October | |||
| 2007, <http://www.rfc-editor.org/info/rfc5040>. | 2007, <http://www.rfc-editor.org/info/rfc5040>. | |||
| [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | |||
| Data Placement over Reliable Transports", RFC 5041, DOI | Data Placement over Reliable Transports", RFC 5041, | |||
| 10.17487/RFC5041, October 2007, | DOI 10.17487/RFC5041, October 2007, | |||
| <http://www.rfc-editor.org/info/rfc5041>. | <http://www.rfc-editor.org/info/rfc5041>. | |||
| [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access | ||||
| Transport for Remote Procedure Call", RFC 5666, DOI | ||||
| 10.17487/RFC5666, January 2010, | ||||
| <http://www.rfc-editor.org/info/rfc5666>. | ||||
| [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) | [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) | |||
| Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, | Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, | |||
| January 2010, <http://www.rfc-editor.org/info/rfc5667>. | January 2010, <http://www.rfc-editor.org/info/rfc5667>. | |||
| Author's Address | Author's Address | |||
| Charles Lever (editor) | Charles Lever (editor) | |||
| Oracle Corporation | Oracle Corporation | |||
| 1015 Granger Avenue | 1015 Granger Avenue | |||
| Ann Arbor, MI 48104 | Ann Arbor, MI 48104 | |||
| End of changes. 69 change blocks. | ||||
| 291 lines changed or deleted | 285 lines changed or added | |||
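
The Write list example in the diff above (chunks A, B, and C matched to READ, READLINK, and READ) can be illustrated with a minimal sketch. It assumes that Write chunks are consumed in order by the READ and READLINK operations of the COMPOUND, and that an empty or all-zero chunk means the result is returned inline, as the text describes. The names `match_write_chunks`, `chunk_is_empty`, and the `(handle, length, offset)` tuple layout are hypothetical and chosen only for illustration; they do not come from either draft.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a "segment triplet": (handle, length, offset).
Segment = Tuple[int, int, int]
WriteChunk = List[Segment]

# Operations whose results the example in the text places into Write chunks.
DDP_ELIGIBLE_OPS = {"READ", "READLINK"}


def chunk_is_empty(chunk: WriteChunk) -> bool:
    """True for a zero-length segment array or for segments that are all zeros."""
    return len(chunk) == 0 or all(seg == (0, 0, 0) for seg in chunk)


def match_write_chunks(ops: List[str],
                       write_list: List[WriteChunk]) -> List[Tuple[str, str]]:
    """Pair each eligible operation with the next Write chunk, in order.

    Returns (operation, disposition) pairs, where disposition is
    "rdma_write" when a usable chunk was provided and "inline" otherwise.
    """
    chunks = iter(write_list)
    dispositions = []
    for op in ops:
        if op not in DDP_ELIGIBLE_OPS:
            continue  # PUTFH, LOOKUP, etc. consume no Write chunk
        chunk = next(chunks, None)
        if chunk is None or chunk_is_empty(chunk):
            dispositions.append((op, "inline"))
        else:
            dispositions.append((op, "rdma_write"))
    return dispositions


# The COMPOUND from the example, with buffer B left empty by the client.
compound = ["PUTFH", "LOOKUP", "READ",
            "PUTFH", "LOOKUP", "READLINK",
            "PUTFH", "LOOKUP", "READ"]
chunk_a = [(0x10, 4096, 0)]   # made-up handle, length, offset
chunk_b = []                  # zero-length array: READLINK result goes inline
chunk_c = [(0x30, 8192, 0)]
result = match_write_chunks(compound, [chunk_a, chunk_b, chunk_c])
print(result)
```

Keeping the empty chunk B in the list, rather than omitting it, preserves the positional pairing so that chunk C still lines up with the second READ.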
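
The callback text in the older revision requires a receiver to determine whether a message is an RPC CALL or an RPC REPLY before acting on its XID. The newer revision defers this to [I-D.ietf-nfsv4-rpcrdma-bidirection]. As a rough illustration only, the sketch below reads the first two XDR words of an RPC message (the XID and the message type, where CALL is 0 and REPLY is 1 per RFC 5531). It assumes the caller has already located the start of the RPC message within the received buffer; the function name `rpc_direction` is hypothetical.

```python
import struct

# msg_type values from the RPC protocol specification (RFC 5531).
RPC_CALL = 0
RPC_REPLY = 1


def rpc_direction(rpc_message: bytes):
    """Return (xid, "call" | "reply") from the start of an RPC message.

    XDR encodes both fields as 32-bit big-endian unsigned integers.
    """
    if len(rpc_message) < 8:
        raise ValueError("RPC message too short to hold xid and msg_type")
    xid, msg_type = struct.unpack("!II", rpc_message[:8])
    if msg_type == RPC_CALL:
        return xid, "call"
    if msg_type == RPC_REPLY:
        return xid, "reply"
    raise ValueError("unknown RPC msg_type: %d" % msg_type)


# A client sharing one connection for forward and callback traffic would
# look up a pending request by XID only when the direction is "reply".
sample = struct.pack("!II", 0x1234, RPC_REPLY) + b"remaining header bytes"
xid, direction = rpc_direction(sample)
```

Because a callback CALL may carry any XID, the direction check comes first; matching on the XID alone could confuse a server-initiated call with a reply to an outstanding client request.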