Internet-Draft LAYOUT_RECOVERY March 2024
Haynes & Myklebust Expires 22 September 2024 [Page]
Network File System Version 4
Intended Status:
Standards Track
T. Haynes
T. Myklebust

Reporting of Errors via LAYOUTRETURN in NFSv4.2


The Parallel Network File System (pNFS) allows for a file's metadata (MDS) and data (DS) to be on different servers. When the metadata server is restarted, the client can still modify the data file component. During the recovery phase of startup, the metadata server and the data servers work together to recover state (which files are open, last modification time, size, etc). If the client has not encountered errors with the data files, then the state can be recovered, avoiding resilvering of the data files. With any errors, there is no means by which the client can report errors to the metadata server. As such, the metadata server has to assume that file needs resilvering. This document presents an extension to RFC8435 to allow the client to update the metadata and avoid the resilvering.

This note is to be removed before publishing as an RFC.

Discussion of this draft takes place on the NFSv4 working group mailing list (, which is archived at Working Group information can be found at

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 22 September 2024.

Table of Contents

1. Introduction

In the Network File System version4 (NFSv4) with a Parallel NFS (pNFS) Flexible File Layout ([RFC8435]) server, during recovery after a restart, there is no mechanism for the client to inform the metadata server about an error which occurred during a WRITE (see Section 18.32 of [RFC8881]) operation to the data servers in the period of the outage.

Using the process detailed in [RFC8178], the revisions in this document become an extension of NFSv4.2 [RFC7862]. They are built on top of the external data representation (XDR) [RFC4506] generated from [RFC7863].

1.1. Definitions

See Section 1.1 of [RFC8435] for a more complete set of definitions.

(file) data:
that part of the file system object that contains the data to be read or written. It is the contents of the object rather than the attributes of the object.
data server (DS):
a pNFS server that provides the file's data when the file system object is accessed over a file-based protocol.
(file) metadata:
the part of the file system object that contains various descriptive data relevant to the file object, as opposed to the file data itself. This could include the time of last modification, access time, EOF position, etc.
metadata server (MDS):
the pNFS server that provides metadata information for a file system object.

1.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Layout State Recovery

When a metadata server restarts, clients are provided a grace period where they are allowed to recover any state that they had established. With open files, the client can send an OPEN (see Section 18.16 of [RFC8881]) operation with a claim type of CLAIM_PREVIOUS (see Section 9.11 of [RFC8881]). The client uses the RECLAIM_COMPLETE (see Section 18.51 of [RFC8881]) operation to notify the metadata server that it is done reclaiming state.

The NFSv4 Flexible File Layout Type allows for the client to mirror files (see Section 8 of [RFC8435]). With client side mirroring, it is important for the client to inform the metadata server of any I/O errors encountered with one of the mirrors. This is the only way for the metadata server to determine one or more of the mirrors is corrupt and then repair the mirrors via resilvering. The client can use LAYOUTRETURN (see Section 18.44 of [RFC8881]) and the ff_ioerr4 (see Section 9.1.1 of [RFC8435]) structure to inform the metadata server of I/O errors.

A problem is that if the metadata server restarts and the client has errors it needs to report, it can not do so. The LAYOUTRETURN needs a layout stateid to proceed and there is no way for the client to recover layout state. As such, clients have no choice but to not recover files with I/O errors. In turn, the metadata server MUST assume that the mirrors are inconsistent and pick one for resilvering. It is a MUST because even if the metadata server can determine that the client did modify data during the outage, it MUST NOT assume those modifications were consistent.

If the server were to allow the client to use the anonymous stateid of all zeros (see Section 8.2.3 of [RFC8881]) for lrf_stateid in LAYOUTRETURN (see Section 18.44.1 of [RFC8881]), then the client could inform the metadata server of errors encountered. That in turn would allow the metadata server to accurately resilver the file by picking the correct mirror(s).

There are two error scenarios that can occur:

During the grace period:
If the client were to send any lrf_stateid in the LAYOUTRETURN other than the anonymous stateid of all zeros, then the metadata server would respond with an error of NFS4ERR_GRACE (see Section of [RFC8881]).
After the grace period:
If the client were to send any lrf_stateid in the LAYOUTRETURN with the anonymous stateid of all zeros, then the metadata server would respond with an error of NFS4ERR_NO_GRACE (see Section of [RFC8881]).

Also, when the metadata server builds the reply to the LAYOUTRETURN, it MUST NOT bump the seqid of the lorr_stateid.

The metadata server MUST NOT resilver a file if there are clients with outstanding layouts with iomode of LAYOUTIOMODE4_RW. Whether the metadata server prevents all I/O to the file until the resilvering is done or forces all I/O to go through the metadata server or allows a proxy server to update the new data file as it is being reslivered is all an implementation choice. The constraint is that the metadata server is responsible for the reconstruction of the data file and for the consistency of the mirrors. The metadata server MUST NOT have been resilvering the file such that it has a different layout (set of mirror instances) than the client before the restart of the metadata server. Further, the metadata server MUST NOT start a new resilvering of the file during the grace period unless there are no outstanding layouts with iomode of LAYOUTIOMODE4_RW.

If the metadata server detects that the layout being returned in the LAYOUTRETURN does not match the current mirror instances found for the file, then it should ignore the LAYOUTRETURN and resilver the file in question.

Finally, the metadata server MUST determine that any files which are neither explicitly recovered with a CLAIM_PREVIOUS nor have a reported error via a LAYOUTRETURN, have to be resilvered. The client has most likely restarted and lost any state.

3. Security Considerations

There are no new security considerations beyond those in [RFC7862].

4. IANA Considerations

IANA should use the current document (RFC-TBD) as the reference for the new entries.

5. References

5.1. Normative References

Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <>.
Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, , <>.
Haynes, T., "Network File System (NFS) Version 4 Minor Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862, , <>.
Haynes, T., "Network File System (NFS) Version 4 Minor Version 2 External Data Representation Standard (XDR) Description", RFC 7863, DOI 10.17487/RFC7863, , <>.
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <>.
Noveck, D., "Rules for NFSv4 Extensions and Minor Versions", RFC 8178, DOI 10.17487/RFC8178, , <>.
Halevy, B. and T. Haynes, "Parallel NFS (pNFS) Flexible File Layout", RFC 8435, DOI 10.17487/RFC8435, , <>.
Noveck, D., Ed. and C. Lever, "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 8881, DOI 10.17487/RFC8881, , <>.

Appendix A. Acknowledgments

Tigran Mkrtchyan, Jeff Layton, and Rick Macklem provided reviews of the document.

Authors' Addresses

Thomas Haynes
Trond Myklebust