NFSv4 C. Hellwig Internet-Draft July 02, 2017 Intended status: Standards Track Expires: January 3, 2018 Parallel NFS (pNFS) RDMA Layout draft-hellwig-nfsv4-rdma-layout-00.txt Abstract The Parallel Network File System (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The RDMA Layout Type is defined in this document as an extension to pNFS to allow the use of RDMA Verbs operations to access remote storage, with a special focus on accessing byte addressable persistent memory. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 3, 2018. Copyright Notice Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Hellwig Expires January 3, 2018 [Page 1] Internet-Draft pNFS RDMA Layout July 2017 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . 
. . . . . . . . . . . . . . . . . . . . . . . 2 1.1. Conventions Used in This Document . . . . . . . . . . . . 3 1.2. General Definitions . . . . . . . . . . . . . . . . . . . 3 1.3. Code Components Licensing Notice . . . . . . . . . . . . 4 1.4. XDR Description . . . . . . . . . . . . . . . . . . . . . 4 2. RDMA Layout Description . . . . . . . . . . . . . . . . . . . 6 2.1. Background and Architecture . . . . . . . . . . . . . . . 6 2.2. layouttype4 . . . . . . . . . . . . . . . . . . . . . . . 6 2.3. Device Addressing and Discovery . . . . . . . . . . . . . 7 2.3.1. pnfs_rdma_device_addr4 . . . . . . . . . . . . . . . 7 2.4. Data Structures: Extents and Extent Lists . . . . . . . . 7 2.4.1. Layout Requests and Extent Lists . . . . . . . . . . 9 2.4.2. Layout Commits . . . . . . . . . . . . . . . . . . . 11 2.4.3. Layout Returns . . . . . . . . . . . . . . . . . . . 11 2.4.4. Layout Revocation . . . . . . . . . . . . . . . . . . 12 2.4.5. Client Copy-on-Write Processing . . . . . . . . . . . 12 2.4.6. Extents are Permissions . . . . . . . . . . . . . . . 13 2.4.7. End-of-file Processing . . . . . . . . . . . . . . . 14 2.4.8. Layout Hints . . . . . . . . . . . . . . . . . . . . 15 2.5. Crash Recovery Issues . . . . . . . . . . . . . . . . . . 15 2.6. Transient and Permanent Errors . . . . . . . . . . . . . 15 3. Security Considerations . . . . . . . . . . . . . . . . . . . 16 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 5. References . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.1. Normative References . . . . . . . . . . . . . . . . . . 17 5.2. Informative References . . . . . . . . . . . . . . . . . 18 Appendix A. RFC Editor Notes . . . . . . . . . . . . . . . . . . 18 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 18 1. 
Introduction

Figure 1 shows the overall architecture of a Parallel NFS (pNFS) system:

   +-----------+
   |+-----------+                                 +-----------+
   ||+-----------+                                |           |
   |||           |       NFSv4.1 + pNFS           |           |
   +||  Clients  |<------------------------------>|   Server  |
    +|           |                                |           |
     +-----------+                                |           |
          |||                                     +-----------+
          |||                                           |
          |||                                           |
          ||| Storage        +-----------+              |
          ||| Protocol       |+-----------+             |
          ||+----------------||+-----------+  Control   |
          |+-----------------|||           |    Protocol|
          +------------------+||  Storage  |------------+
                              +|  Systems  |
                               +-----------+

                             Figure 1

The overall approach is that pNFS-enhanced clients obtain sufficient information from the server to enable them to access the underlying storage (on the storage systems) directly. See Section 12 of [RFC5661] for more details.

RDMA ([RFC5040] [RFC5041] [IBARCH]) is a technique for moving data efficiently between end nodes. By directing data into destination buffers as it is sent on a network, and placing it via direct memory access by hardware, the benefits of faster transfers and reduced host overhead are obtained.

Unlike the RPC RDMA transport [RFC8166], the pNFS RDMA layout does not transfer remote procedure calls over RDMA networks, but instead uses raw RDMA READ and WRITE operations to access a memory region exposed on a storage device.

1.1. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.2. General Definitions

The following definitions provide context for the reader.

Byte  This document defines a byte as an octet, i.e., a datum exactly 8 bits in length.

Client  The "client" is the entity that accesses the NFS server's resources.
The client may be an application that contains the logic to access the NFS server directly. The client may also be the traditional operating system client that provides remote file system services for a set of applications.

Server  The "server" is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner.

Metadata Server (MDS)  The metadata server is a pNFS server which provides metadata information for a file system object. It also is responsible for generating layouts for file system objects. Note that the MDS is also responsible for directory-based operations.

1.3. Code Components Licensing Notice

The external data representation (XDR) description and scripts for extracting the XDR description are Code Components as described in Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. These Code Components are licensed according to the terms of Section 4 of "Legal Provisions Relating to IETF Documents".

1.4. XDR Description

This document contains the XDR [RFC4506] description of the NFSv4.1 RDMA layout protocol. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the NFSv4.1 RDMA layout:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

That is, if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do:

   sh extract.sh < spec.txt > rdma_prot.x

The effect of the script is to remove leading white space from each line, plus a sentinel sequence of "///".

The embedded XDR file header follows. Subsequent XDR descriptions, with the sentinel sequence, are embedded throughout the document.
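For readers without a POSIX shell at hand, the extraction step described above can be approximated in Python. This is only an illustrative sketch that mirrors the grep and sed pipeline; the function name extract_xdr is a hypothetical helper and is not part of this document or of any NFS tooling.

```python
import re


def extract_xdr(text: str) -> str:
    """Approximate the grep/sed pipeline: keep only sentinel lines
    and strip the leading white space plus the "///" sentinel."""
    out = []
    for line in text.splitlines():
        # grep '^ *///' : keep lines that start with optional spaces and "///"
        if not re.match(r"^ *///", line):
            continue
        # sed 's?^ */// ??' : drop the sentinel and the single space after it
        line = re.sub(r"^ */// ", "", line, count=1)
        # sed 's?^ *///$??' : a bare sentinel becomes an empty line
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```

Calling extract_xdr on the full text of this document should yield the same XDR as the shell script, under the assumption that the sentinel lines are preserved one per line.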
Note that the XDR code contained in this document depends on types from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both NFS types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.

   /// /*
   ///  * This code was derived from RFCTBD10
   ///  * Please reproduce this note if possible.
   ///  */
   /// /*
   ///  * Copyright (c) 2010,2015 IETF Trust and the persons
   ///  * identified as the document authors.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///
   /// /*
   ///  * nfs4_rdma_layout_prot.x
   ///  */
   ///
   /// %#include "nfsv41.h"
   ///

2. RDMA Layout Description

2.1. Background and Architecture

A pNFS RDMA layout is responsible for mapping from an NFS file (or portion of a file) to memory regions that contain the file. These regions are expressed as extents with 64-bit offsets and lengths using the existing NFSv4 offset4 and length4 types, and map to memory regions that the server has registered, and for which it exposes a handle (R_key or stag) that allows for RDMA READ and RDMA WRITE operations from the client.

The pNFS operation for requesting a layout (LAYOUTGET) includes the "layoutiomode4 loga_iomode" argument, which indicates whether the requested layout is for read-only use or read-write use. A read-only layout may contain holes that are read as zero, whereas a read-write layout will contain allocated, but un-initialized storage in those holes (read as zero, can be written by client). This document also supports client participation in copy-on-write (e.g., for file systems with snapshots) by providing both read-only and un-initialized storage for the same extent in a layout. Reads are initially performed on the read-only storage, with writes going to the un-initialized storage.
After the first write that initializes the un-initialized storage, all reads are performed from that now-initialized writable storage, and the corresponding read-only storage is no longer used.

2.2. layouttype4

The layouttype4 type defined in [RFC5662] is extended with a new value as follows:

   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3,
       LAYOUT4_SCSI            = 4,
       LAYOUT4_RDMA            = 0x80000006
   [[RFC Editor: please modify the LAYOUT4_RDMA
     to be the layouttype assigned by IANA]]
   };

This document defines the structure associated with the layouttype4 value LAYOUT4_RDMA. [RFC5661] specifies the loc_body structure as an XDR type "opaque". The opaque layout is uninterpreted by the generic pNFS client layers, but must be interpreted by the Layout Type implementation.

2.3. Device Addressing and Discovery

Data operations to a storage device require the client to know the network address of the storage device. The NFSv4.1+ GETDEVICEINFO operation (Section 18.40 of [RFC5661]) is used by the client to retrieve that information.

2.3.1. pnfs_rdma_device_addr4

The "pnfs_rdma_device_addr4" data structure is returned by the server as the storage-protocol-specific opaque field da_addr_body in the "device_addr4" structure by a successful GETDEVICEINFO operation [RFC5661]. It contains the network address of the storage device. The RDMA Connection Manager (RDMA/CM) shall be used to establish the queue pair for the RDMA READ and RDMA WRITE operations used by the layout. Details of connection establishment will be provided in future versions of this document.

   /// struct pnfs_rdma_device_addr4 {
   ///     struct netaddr4 addr;   /* address of the device */
   /// };
   ///

2.4. Data Structures: Extents and Extent Lists

A pNFS RDMA layout is a list of extents within a flat array of data in a device.
The RDMA layout describes the individual byte ranges (extents) on the device that make up the file. The offsets and lengths contained in an extent are specified in units of bytes.

   /// enum pnfs_rdma_extent_state4 {
   ///     PNFS_RDMA_READ_WRITE_DATA = 0, /* the data located by
   ///                                       this extent is valid
   ///                                       for reading and
   ///                                       writing. */
   ///     PNFS_RDMA_READ_DATA = 1,       /* the data located by this
   ///                                       extent is valid for
   ///                                       reading only; it may not
   ///                                       be written. */
   ///     PNFS_RDMA_INVALID_DATA = 2,    /* the location is valid; the
   ///                                       data is invalid.  It is a
   ///                                       newly (pre-) allocated
   ///                                       extent.  The client MUST
   ///                                       NOT read from this
   ///                                       space */
   ///     PNFS_RDMA_NONE_DATA = 3        /* the location is invalid.
   ///                                       It is a hole in the file.
   ///                                       The client MUST NOT read
   ///                                       from or write to this
   ///                                       space */
   /// };
   ///
   /// struct pnfs_rdma_extent4 {
   ///     deviceid4    re_device_id;     /* id of the device on
   ///                                       which extent of file is
   ///                                       stored. */
   ///     offset4      re_file_offset;   /* starting byte offset
   ///                                       in the file */
   ///     uint32_t     re_handle;        /* registered memory
   ///                                       handle */
   ///     length4      re_length;        /* size in bytes of the
   ///                                       extent */
   ///     offset4      re_storage_offset;/* starting byte offset
   ///                                       in the volume */
   ///     pnfs_rdma_extent_state4 re_state;
   ///                                    /* state of this extent */
   /// };
   ///
   /// /* RDMA layout-specific type for loc_body */
   /// struct pnfs_rdma_layout4 {
   ///     pnfs_rdma_extent4 rl_extents<>;
   ///                                    /* extents which make up this
   ///                                       layout. */
   /// };
   ///

The RDMA layout consists of a list of extents that map the regions of the file to locations on a device. The "re_storage_offset" field within each extent identifies a location on the device specified by the "re_device_id" field in the extent. Each extent maps a region of the file onto a portion of the specified device.
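The mapping from a file offset to a device location can be illustrated with a small client-side model. This is a minimal sketch, not a normative algorithm: the Python names (ExtentState, Extent, find_extent, to_device_offset) are hypothetical stand-ins for the XDR types above, and the sketch assumes the extent list is already ordered as the server returned it.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class ExtentState(IntEnum):
    # Mirrors pnfs_rdma_extent_state4
    READ_WRITE_DATA = 0
    READ_DATA = 1
    INVALID_DATA = 2
    NONE_DATA = 3


@dataclass(frozen=True)
class Extent:
    # Mirrors pnfs_rdma_extent4 (field names shortened for illustration)
    device_id: bytes
    file_offset: int
    handle: int
    length: int
    storage_offset: int
    state: ExtentState


def find_extent(extents: List[Extent], file_offset: int,
                for_write: bool = False) -> Optional[Extent]:
    """Return the first extent containing file_offset.

    For writes, extents that do not permit writing are skipped."""
    for ext in extents:
        if not (ext.file_offset <= file_offset < ext.file_offset + ext.length):
            continue
        if for_write and ext.state in (ExtentState.READ_DATA,
                                       ExtentState.NONE_DATA):
            continue
        return ext
    return None


def to_device_offset(ext: Extent, file_offset: int) -> int:
    """Translate a file offset into an offset on the extent's device."""
    if ext.state == ExtentState.NONE_DATA:
        raise ValueError("re_storage_offset is not valid for a hole")
    return ext.storage_offset + (file_offset - ext.file_offset)
```

The returned device offset, together with the extent's handle, is what a client would pass to an RDMA READ or RDMA WRITE; issuing the actual verbs call is outside the scope of this sketch.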
The re_file_offset, re_length, and re_state fields for an extent returned from the server are valid for all extents. In contrast, the interpretation of the re_storage_offset field depends on the value of re_state as follows (in increasing order):

PNFS_RDMA_READ_WRITE_DATA  means that re_storage_offset is valid, and points to valid/initialized data that can be read and written.

PNFS_RDMA_READ_DATA  means that re_storage_offset is valid and points to valid/initialized data that can only be read. Write operations are prohibited.

PNFS_RDMA_INVALID_DATA  means that re_storage_offset is valid, but points to invalid, un-initialized data. This data MUST NOT be read from the device until it has been initialized. A read request for a PNFS_RDMA_INVALID_DATA extent MUST fill the user buffer with zeros, unless the extent is covered by a PNFS_RDMA_READ_DATA extent of a copy-on-write file system. Write requests MUST write whole server-sized blocks to the device; bytes not initialized by the user MUST be set to zero. Any write to parts of a device covered by a PNFS_RDMA_INVALID_DATA extent changes the written portion of the extent to PNFS_RDMA_READ_WRITE_DATA; the pNFS client is responsible for reporting this change via LAYOUTCOMMIT.

PNFS_RDMA_NONE_DATA  means that re_storage_offset is not valid, and this extent MUST NOT be used to satisfy write requests. Read requests MAY be satisfied by zero-filling as for PNFS_RDMA_INVALID_DATA. PNFS_RDMA_NONE_DATA extents MAY be returned by requests for readable extents; they are never returned if the request was for a writable extent.

An extent list contains all relevant extents in increasing order of the re_file_offset of each extent; any ties are broken by increasing order of the extent state (re_state).

2.4.1. Layout Requests and Extent Lists

Each request for a layout specifies at least three parameters: file offset, desired size, and minimum size.
If the status of a request indicates success, the extent list returned MUST meet the following criteria:

o  A request for a readable (but not writable) layout MUST return either PNFS_RDMA_READ_DATA or PNFS_RDMA_NONE_DATA extents. It SHALL NOT return PNFS_RDMA_INVALID_DATA or PNFS_RDMA_READ_WRITE_DATA extents.

o  A request for a writable layout MUST return PNFS_RDMA_READ_WRITE_DATA or PNFS_RDMA_INVALID_DATA extents, and it MAY return additional PNFS_RDMA_READ_DATA extents for ranges covered by PNFS_RDMA_INVALID_DATA extents to allow client-side copy-on-write operations. A request for a writable layout SHALL NOT return PNFS_RDMA_NONE_DATA extents.

o  The first extent in the list MUST contain the requested starting offset.

o  The total size of extents within the requested range MUST cover at least the minimum size. One exception is allowed: the total size MAY be smaller if only readable extents were requested and EOF is encountered.

o  Extents in the extent list MUST be logically contiguous for a read-only layout. For a read-write layout, the set of writable extents (i.e., excluding PNFS_RDMA_READ_DATA extents) MUST be logically contiguous. Every PNFS_RDMA_READ_DATA extent in a read-write layout MUST be covered by one or more PNFS_RDMA_INVALID_DATA extents. This overlap of PNFS_RDMA_READ_DATA and PNFS_RDMA_INVALID_DATA extents is the only permitted extent overlap.

o  Extents MUST be ordered in the list by starting offset, with PNFS_RDMA_READ_DATA extents preceding PNFS_RDMA_INVALID_DATA extents in the case of equal re_file_offsets.

The server shall ensure that it has registered handles for the memory regions that the extents in the layout refer to so that RDMA READ and/or RDMA WRITE requests can be performed by the client. Multiple extents may refer to the same handle.
The handle shall be invalidated on a LAYOUTRETURN operation, including implicit layout returns as part of CB_LAYOUTRECALL operations, or when a layout is revoked.

According to [RFC5661], if the minimum requested size, loga_minlength, is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Given the lack of a clear definition of this phrase, in the context of the RDMA layout type, when loga_minlength is zero, the metadata server SHOULD:

o  when processing requests for readable layouts, return all such layouts, even if some extents are in the PNFS_RDMA_NONE_DATA state.

o  when processing requests for writable layouts, return extents which can be returned in the PNFS_RDMA_READ_WRITE_DATA state.

2.4.2. Layout Commits

   ///
   /// /* RDMA layout-specific type for lou_body */
   ///
   /// struct pnfs_rdma_range4 {
   ///     offset4      rr_file_offset;   /* starting byte offset
   ///                                       in the file */
   ///     length4      rr_length;        /* size in bytes */
   /// };
   ///
   /// struct pnfs_rdma_layoutupdate4 {
   ///     pnfs_rdma_range4 rlu_commit_list<>;
   ///                                    /* list of extents which
   ///                                     * now contain valid data.
   ///                                     */
   /// };

The "pnfs_rdma_layoutupdate4" structure is used by the client as the RDMA layout-specific argument in a LAYOUTCOMMIT operation. The "rlu_commit_list" field is a list covering regions of the file layout that were previously in the PNFS_RDMA_INVALID_DATA state, but have been written by the client and SHOULD now be considered in the PNFS_RDMA_READ_WRITE_DATA state. The extents in the commit list MUST be disjoint and MUST be sorted by rr_file_offset. Implementors should be aware that a server MAY be unable to commit regions at a granularity smaller than a file-system block (typically 4 KB or 8 KB).
The block size that the server uses is available as an NFSv4 attribute, and any extents included in the "rlu_commit_list" MUST be aligned to this granularity and have a size that is a multiple of this granularity. Since the block in question is in state PNFS_RDMA_INVALID_DATA, byte ranges not written SHOULD be filled with zeros. This applies even if it appears that the area being written is beyond what the client believes to be the end of file.

2.4.3. Layout Returns

A LAYOUTRETURN operation represents an explicit release of resources by the client. This MAY be done in response to a CB_LAYOUTRECALL or before any recall, in order to avoid a future CB_LAYOUTRECALL. When the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return type, then the layoutreturn_file4 data structure specifies the region of the file layout that is no longer needed by the client.

The LAYOUTRETURN operation is done without any RDMA layout-specific data. The opaque "lrf_body" field of the "layoutreturn_file4" data structure MUST have length zero.

2.4.4. Layout Revocation

Layouts MAY be unilaterally revoked by the server, due to the client's lease time expiring, or the client failing to return a recalled layout in a timely manner. For the RDMA layout type this is accomplished by invalidating the handle for the remote memory region exposed to the client. Once the invalidation has completed, the host channel adapter (HCA) will reject all access from the client to the memory region.

2.4.5. Client Copy-on-Write Processing

Copy-on-write is a mechanism used to support file and/or file system snapshots. When writing to unaligned regions, or to regions smaller than a file system block, the writer MUST copy the portions of the original file data to a new location on disk. This behavior can either be implemented on the client or the server.
The paragraphs below describe how a pNFS RDMA layout client implements access to a file that requires copy-on-write semantics.

Distinguishing the PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_READ_DATA extent types in combination with the allowed overlap of PNFS_RDMA_READ_DATA extents with PNFS_RDMA_INVALID_DATA extents allows copy-on-write processing to be done by pNFS clients. In classic NFS, this operation would be done by the server. Since pNFS enables clients to do direct block access, it is useful for clients to participate in copy-on-write operations. All pNFS RDMA layout clients MUST support this copy-on-write processing.

When a client wishes to write data covered by a PNFS_RDMA_READ_DATA extent, it MUST have requested a writable layout from the server; that layout will contain PNFS_RDMA_INVALID_DATA extents to cover all the data ranges of that layout's PNFS_RDMA_READ_DATA extents. More precisely, for any re_file_offset range covered by one or more PNFS_RDMA_READ_DATA extents in a writable layout, the server MUST include one or more PNFS_RDMA_INVALID_DATA extents in the layout that cover the same re_file_offset range. When performing a write to such an area of a layout, the client MUST effectively copy the data from the PNFS_RDMA_READ_DATA extent for any partial blocks of re_file_offset and range, merge in the changes to be written, and write the result to the PNFS_RDMA_INVALID_DATA extent for the blocks for that re_file_offset and range. That is, if entire blocks of data are to be overwritten by an operation, the corresponding PNFS_RDMA_READ_DATA blocks need not be fetched, but any partial-block writes MUST be merged with data fetched via PNFS_RDMA_READ_DATA extents before storing the result via PNFS_RDMA_INVALID_DATA extents. For the purposes of this discussion, "entire blocks" and "partial blocks" refer to the server's file-system block size.
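The merge step described above can be sketched as follows. This is an illustrative model only: cow_merge and its read_old callback are hypothetical names, the callback stands in for an RDMA READ through the PNFS_RDMA_READ_DATA extent, and the returned buffer is what a client would then store through the PNFS_RDMA_INVALID_DATA extent.

```python
from typing import Callable, Tuple


def cow_merge(read_old: Callable[[int, int], bytes], offset: int,
              data: bytes, block_size: int) -> Tuple[int, bytes]:
    """Build a block-aligned buffer for a copy-on-write store.

    read_old(file_offset, length) returns the old contents of one block.
    Only partially overwritten blocks are fetched; fully overwritten
    blocks are taken from `data` alone.
    Returns (aligned_file_offset, merged_bytes)."""
    if not data:
        raise ValueError("nothing to write")
    start = offset - offset % block_size            # round down
    end = offset + len(data)
    aligned_end = -(-end // block_size) * block_size  # round up
    buf = bytearray(aligned_end - start)

    # Collect the (at most two) blocks that are only partially overwritten.
    partial = set()
    if offset != start:
        partial.add(start)
    if end != aligned_end:
        partial.add(aligned_end - block_size)

    for blk in sorted(partial):
        old = read_old(blk, block_size)
        if len(old) != block_size:
            raise ValueError("short read of old block")
        buf[blk - start:blk - start + block_size] = old

    # Merge in the new data on top of the fetched old contents.
    buf[offset - start:end - start] = data
    return start, bytes(buf)
```

With a 4-byte block size for readability, writing six bytes at offset 2 fetches only the first block, because the second block is overwritten completely.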
Storing of data in a PNFS_RDMA_INVALID_DATA extent converts the written portion of the PNFS_RDMA_INVALID_DATA extent to a PNFS_RDMA_READ_WRITE_DATA extent; all subsequent reads MUST be performed from this extent; the corresponding portion of the PNFS_RDMA_READ_DATA extent MUST NOT be used after storing data in a PNFS_RDMA_INVALID_DATA extent. If a client writes only a portion of an extent, the extent MAY be split at block-aligned boundaries.

When a client wishes to write data to a PNFS_RDMA_INVALID_DATA extent that is not covered by a PNFS_RDMA_READ_DATA extent, it MUST treat this write identically to a write to a file not involved with copy-on-write semantics. Thus, data MUST be written in at least block-sized increments, aligned to multiples of block-sized offsets, and unwritten portions of blocks MUST be zero filled.

2.4.6. Extents are Permissions

Layout extents returned to pNFS clients grant permission to read or write; PNFS_RDMA_READ_DATA and PNFS_RDMA_NONE_DATA are read-only (PNFS_RDMA_NONE_DATA reads as zeros), PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_INVALID_DATA are read/write (PNFS_RDMA_INVALID_DATA reads as zeros, any write converts it to PNFS_RDMA_READ_WRITE_DATA). This is the only means a client has of obtaining permission to perform direct I/O to storage devices; a pNFS client MUST NOT perform direct I/O operations that are not permitted by an extent held by the client. Client adherence to this rule places the pNFS server in control of potentially conflicting storage device operations, enabling the server to determine what does conflict and how to avoid conflicts by granting and recalling extents to/from clients.

If a client makes a layout request that conflicts with an existing layout delegation, the request will be rejected with the error NFS4ERR_LAYOUTTRYLATER. The client is then expected to retry the request after a short interval.
During this interval, the server SHOULD recall the conflicting portion of the layout delegation from the client that currently holds it. This reject-and-retry approach does not prevent client starvation when there is contention for the layout of a particular file. For this reason, a pNFS server SHOULD implement a mechanism to prevent starvation. One possibility is that the server can maintain a queue of rejected layout requests. Each new layout request can be checked to see if it conflicts with a previously rejected request, and if so, the newer request can be rejected. Once the original requesting client retries its request, its entry in the rejected request queue can be cleared, or the entry in the rejected request queue can be removed when it reaches a certain age.

NFSv4 supports mandatory locks and share reservations. These are mechanisms that clients can use to restrict the set of I/O operations that are permissible to other clients. Since all I/O operations ultimately arrive at the NFSv4 server for processing, the server is in a position to enforce these restrictions. However, with pNFS layouts, I/Os will be issued from the clients that hold the layouts directly to the storage devices that host the data. These devices have no knowledge of files, mandatory locks, or share reservations, and are not in a position to enforce such restrictions. For this reason the NFSv4 server MUST NOT grant layouts that conflict with mandatory locks or share reservations. Further, if a conflicting mandatory lock request or a conflicting open request arrives at the server, the server MUST recall the part of the layout in conflict with the request before granting the request.

2.4.7. End-of-file Processing

The end-of-file location can be changed in two ways: implicitly as the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or explicitly as the result of a SETATTR request.
Typically, when a file is truncated by an NFSv4 client via the SETATTR call, the server frees any disk blocks belonging to the file that are beyond the new end-of-file byte, and MUST write zeros to the portion of the new end-of-file block beyond the new end-of-file byte. These actions render any pNFS layouts that refer to the blocks that are freed or written semantically invalid. Therefore, the server MUST recall from clients the portions of any pNFS layouts that refer to blocks that will be freed or written by the server before effecting the file truncation. These recalls may take time to complete; as explained in [RFC5661], if the server cannot respond to the client SETATTR request in a reasonable amount of time, it SHOULD reply to the client with the error NFS4ERR_DELAY.

Blocks in the PNFS_RDMA_INVALID_DATA state that lie beyond the new end-of-file block present a special case. The server has reserved these blocks for use by a pNFS client with a writable layout for the file, but the client has yet to commit the blocks, and they are not yet a part of the file mapping on disk. The server MAY free these blocks while processing the SETATTR request. If so, the server MUST recall any layouts from pNFS clients that refer to the blocks before processing the truncate. If the server does not free the PNFS_RDMA_INVALID_DATA blocks while processing the SETATTR request, it need not recall layouts that refer only to the PNFS_RDMA_INVALID_DATA blocks.

When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or extended explicitly by a SETATTR request, the server need not recall any portions of any pNFS layouts.

2.4.8. Layout Hints

The layout hint attribute specified in [RFC5661] is not supported by the RDMA layout, and the pNFS server MUST reject setting a layout hint attribute with a loh_type value of LAYOUT4_RDMA during OPEN or SETATTR operations.
On a file system only supporting the RDMA layout, a server MUST NOT report the layout_hint attribute in the supported_attrs attribute.

2.5. Crash Recovery Issues

A critical requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts. These requirements and a full discussion of crash recovery issues are covered in the "Crash Recovery" section of the NFSv4.1 specification [RFC5661]. This document contains additional crash recovery material specific only to the RDMA layout.

When the server crashes while the client holds a writable layout, and the client has written data to blocks covered by the layout, and the blocks are still in the PNFS_RDMA_INVALID_DATA state, the client has two options for recovery. If the data that has been written to these blocks is still cached by the client, the client can simply re-write the data via NFSv4, once the server has come back online. However, if the data is no longer in the client's cache, the client MUST NOT attempt to source the data from the data servers. Instead, it SHOULD attempt to commit the blocks in question to the server during the server's recovery grace period, by sending a LAYOUTCOMMIT with the "loca_reclaim" flag set to true. This process is described in detail in Section 18.42.4 of [RFC5661].

2.6. Transient and Permanent Errors

The server may respond to LAYOUTGET with a variety of error statuses. These errors can convey transient conditions or more permanent conditions that are unlikely to be resolved soon.

The error NFS4ERR_RECALLCONFLICT indicates that the server has recently issued a CB_LAYOUTRECALL to the requesting client, making it necessary for the client to respond to the recall before processing the layout request.
A client can wait for that recall to be received and processed, or it can retry as for NFS4ERR_TRYLATER, as described below.

The error NFS4ERR_TRYLATER is used to indicate that the server cannot immediately grant the layout to the client. This may be due to constraints on writable sharing of blocks by multiple clients or to a conflict with a recallable lock (e.g., a delegation). In either case, a reasonable approach for the client is to wait several milliseconds and retry the request. The client SHOULD track the number of retries, and if forward progress is not made, the client SHOULD abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the server.

The error NFS4ERR_LAYOUTUNAVAILABLE MAY be returned by the server if layouts are not supported for the requested file or its containing file system. The server MAY also return this error code if the server is in the process of migrating the file from secondary storage, if there is a conflicting lock that would prevent the layout from being granted, or for any other reason that causes the server to be unable to supply the layout. As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD abandon the attempt to get a layout and perform READ and WRITE operations by sending them to the MDS. It is expected that a client will not cache the file's layoutunavailable state forever. In particular, when the file is closed or opened by the client, issuing a new LAYOUTGET is appropriate.

3. Security Considerations

The pNFS extension partitions the NFSv4.1+ file system protocol into two parts, the control path and the data path (storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path.
The combination of components in a pNFS system is required to preserve the security properties of NFSv4.1+ with respect to an entity accessing data via a client, including security countermeasures to defend against threats that NFSv4.1+ provides defenses for in environments where these threats are considered significant.

The metadata server enforces the file access-control policy at LAYOUTGET time. The client should use suitable authorization credentials for getting the layout for the requested iomode (READ or RW) and the server verifies the permissions and ACL for these credentials, possibly returning NFS4ERR_ACCESS if the client is not allowed the requested iomode. If the LAYOUTGET operation succeeds, the client receives, as part of the layout, a set of credentials allowing it I/O access to the specified data files corresponding to the requested iomode.

When the client acts on I/O operations on behalf of its local users, it MUST authenticate and authorize the user by issuing respective OPEN and ACCESS calls to the metadata server, similar to having NFSv4 data delegations. If access is allowed, the client uses the corresponding (READ or RW) credentials to perform the I/O operations at the data file's storage devices.

When the metadata server receives a request to change a file's permissions or ACL, it SHOULD recall all layouts for that file and it MUST fence off the clients holding outstanding layouts for the respective file by implicitly invalidating the outstanding credentials on all data files comprising the file before committing to the new permissions and ACL. Doing this will ensure that clients re-authorize their layouts according to the modified permissions and ACL by requesting new layouts. Recalling the layouts in this case is a courtesy of the server intended to prevent clients from getting an error on I/Os done after the client was fenced off.

4. IANA Considerations

IANA is requested to assign a new pNFS layout type in the pNFS Layout Types Registry as follows (the value 5 is suggested):

   Layout Type Name: LAYOUT4_RDMA
   Value: 0x00000005
   RFC: RFCTBD10
   How: L (new layout type)
   Minor Versions: 1

5. References

5.1. Normative References

[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", November 2008, .

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", March 1997.

[RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006.

[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.

[RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, January 2010.

[RFC8166] Lever, C., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, June 2017.

5.2. Informative References

[IBARCH] InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1 Release 1.3", March 2015.

[RFC5040] Recio, B., Ed., Metzler, B., Ed., Culley, P., Ed., Hilland, J., Ed., and D. Garcia, Ed., "A Remote Direct Memory Access Protocol Specification", RFC 5040, October 2007.

[RFC5041] Shah, H., Ed., Pinkerton, J., Ed., Recio, B., Ed., and P. Culley, Ed., "Direct Data Placement over Reliable Transports", RFC 5041, October 2007.

Appendix A. RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]

Author's Address

Christoph Hellwig

Email: hch@lst.de
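
Illustrative note (non-normative): the client behavior that Section 2.6 describes for LAYOUTGET errors can be sketched as a small decision function. This is only a sketch under stated assumptions, not part of the protocol. The enum names, the function lg_next_action, and the retry budget LG_MAX_RETRIES are hypothetical and do not appear in this document or in [RFC5661]; the enum values are local to the example and are not the NFSv4.1 wire error codes. The document asks the client to track retries but does not give a limit, so the value 5 is an arbitrary choice.

```c
/* Hypothetical local status codes; NOT the NFSv4.1 wire values. */
enum lg_status {
    LG_OK,                 /* LAYOUTGET succeeded */
    LG_RECALLCONFLICT,     /* stands in for NFS4ERR_RECALLCONFLICT */
    LG_TRYLATER,           /* stands in for NFS4ERR_TRYLATER */
    LG_LAYOUTUNAVAILABLE   /* stands in for NFS4ERR_LAYOUTUNAVAILABLE */
};

/* What the client does next. */
enum lg_action {
    LG_USE_LAYOUT,         /* perform I/O through the layout */
    LG_RETRY,              /* wait briefly, then resend LAYOUTGET */
    LG_FALLBACK_MDS        /* send READ and WRITE to the MDS instead */
};

/* Assumed retry budget; the document does not specify a number. */
#define LG_MAX_RETRIES 5u

/*
 * Choose the next step after a LAYOUTGET reply.
 * 'retries' is the number of retries already made for this request.
 */
static enum lg_action lg_next_action(enum lg_status st, unsigned retries)
{
    switch (st) {
    case LG_OK:
        return LG_USE_LAYOUT;
    case LG_RECALLCONFLICT:
    case LG_TRYLATER:
        /* Transient: retry until the budget is used up, then give up
         * on the layout and fall back to I/O through the MDS. */
        return retries < LG_MAX_RETRIES ? LG_RETRY : LG_FALLBACK_MDS;
    case LG_LAYOUTUNAVAILABLE:
        /* Abandon the layout for now and go through the MDS. */
        return LG_FALLBACK_MDS;
    }
    /* Assumption: any status not listed above is treated like
     * "layout unavailable". */
    return LG_FALLBACK_MDS;
}
```

A caller would invoke lg_next_action after each LAYOUTGET reply, incrementing its retry counter whenever LG_RETRY is returned. Treating NFS4ERR_RECALLCONFLICT like NFS4ERR_TRYLATER corresponds to the second of the two options given in Section 2.6; a client could instead wait for the recall to be processed. Because the layoutunavailable state is not expected to be cached forever, the counter would be reset when the file is next opened.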