idnits 2.17.1

draft-ietf-nfsv4-minorversion2-20.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The document has examples using IPv4 documentation addresses according
     to RFC6890, but does not use any IPv6 documentation addresses.  Maybe
     there should be IPv6 examples, too?


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (August 13, 2013) is 3908 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 3587

  == Missing Reference: '32K' is mentioned on line 3587, but not defined

  ** Obsolete normative reference: RFC 5661 (Obsoleted by RFC 8881)

  == Outdated reference: A later version (-05) exists of
     draft-ietf-nfsv4-labreqs-04

  == Outdated reference: A later version (-35) exists of
     draft-ietf-nfsv4-rfc3530bis-25

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 5226
     (Obsoleted by RFC 8126)


     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NFSv4                                                     T. Haynes, Ed.
3	Internet-Draft                                                    NetApp
4	Intended status: Standards Track                         August 13, 2013
5	Expires: February 14, 2014

7	                     NFS Version 4 Minor Version 2
8	                 draft-ietf-nfsv4-minorversion2-20.txt

10	Abstract

12	   This Internet-Draft describes NFS version 4 minor version two,
13	   focusing mainly on the protocol extensions made from NFS version 4
14	   minor version 0 and NFS version 4 minor version 1.  Major extensions
15	   introduced in NFS version 4 minor version two include: Server-side
16	   Copy, Application I/O Advise, Space Reservations, Sparse Files,
17	   Application Data Blocks, and Labeled NFS.

19	Requirements Language

21	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
22	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
23	   document are to be interpreted as described in RFC 2119 [RFC2119].

25	Status of this Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at http://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on February 14, 2014.

42	Copyright Notice

44	   Copyright (c) 2013 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.   The NFS Version 4 Minor Version 2 Protocol
61	     1.2.   Scope of This Document
62	     1.3.   NFSv4.2 Goals
63	     1.4.   Overview of NFSv4.2 Features
64	       1.4.1.  Server-side Copy
65	       1.4.2.  Application I/O Advise
66	       1.4.3.  Sparse Files
67	       1.4.4.  Space Reservation
68	       1.4.5.  Application Data Hole (ADH) Support
69	       1.4.6.  Labeled NFS
70	     1.5.   Differences from NFSv4.1
71	   2.  Minor Versioning
72	   3.  Server-side Copy
73	     3.1.   Introduction
74	     3.2.   Protocol Overview
75	       3.2.1.  Overview of Copy Operations
76	       3.2.2.  Locking the Files
77	       3.2.3.  Intra-Server Copy
78	       3.2.4.  Inter-Server Copy
79	       3.2.5.  Server-to-Server Copy Protocol
80	     3.3.   Requirements for Operations
81	       3.3.1.  netloc4 - Network Locations
82	       3.3.2.  Copy Offload Stateids
83	     3.4.   Security Considerations
84	       3.4.1.  Inter-Server Copy Security
85	   4.  Support for Application IO Hints
86	   5.  Sparse Files
87	     5.1.   Introduction
88	     5.2.   Terminology
89	     5.3.   New Operations
90	       5.3.1.  READ_PLUS
91	       5.3.2.  WRITE_PLUS
92	   6.  Space Reservation
93	     6.1.   Introduction
94	   7.  Application Data Hole Support
95	     7.1.   Generic Framework
96	       7.1.1.  Data Hole Representation
97	       7.1.2.  Data Content
98	     7.2.   An Example of Detecting Corruption
99	     7.3.   Example of READ_PLUS
100	   8.  Labeled NFS
101	     8.1.   Introduction
102	     8.2.   Definitions
103	     8.3.   MAC Security Attribute
104	       8.3.1.  Delegations
105	       8.3.2.  Permission Checking
106	       8.3.3.  Object Creation
107	       8.3.4.  Existing Objects
108	       8.3.5.  Label Changes
109	     8.4.   pNFS Considerations
110	     8.5.   Discovery of Server Labeled NFS Support
111	     8.6.   MAC Security NFS Modes of Operation
112	       8.6.1.  Full Mode
113	       8.6.2.  Guest Mode
114	     8.7.   Security Considerations
115	   9.  Sharing change attribute implementation details with NFSv4
116	       clients
117	     9.1.   Introduction
118	   10. Security Considerations
119	   11. Error Values
120	     11.1.  Error Definitions
121	       11.1.1. General Errors
122	       11.1.2. Server to Server Copy Errors
123	       11.1.3. Labeled NFS Errors
124	     11.2.  New Operations and Their Valid Errors
125	     11.3.  New Callback Operations and Their Valid Errors
126	   12. New File Attributes
127	     12.1.  New RECOMMENDED Attributes - List and Definition
128	            References
129	     12.2.  Attribute Definitions
130	   13. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
131	   14. NFSv4.2 Operations
132	     14.1.  Operation 59: COPY - Initiate a server-side copy
133	     14.2.  Operation 60: OFFLOAD_ABORT - Cancel a server-side
134	            copy
135	     14.3.  Operation 61: COPY_NOTIFY - Notify a source server of
136	            a future copy
137	     14.4.  Operation 62: OFFLOAD_REVOKE - Revoke a destination
138	            server's copy privileges
139	     14.5.  Operation 63: OFFLOAD_STATUS - Poll for status of a
140	            server-side copy
141	     14.6.  Modification to Operation 42: EXCHANGE_ID -
142	            Instantiate Client ID
143	     14.7.  Operation 64: WRITE_PLUS
144	     14.8.  Operation 67: IO_ADVISE - Application I/O access
145	            pattern hints
146	     14.9.  Changes to Operation 51: LAYOUTRETURN
147	     14.10. Operation 65: READ_PLUS
148	     14.11. Operation 66: SEEK
149	   15. NFSv4.2 Callback Operations
150	     15.1.  Operation 15: CB_OFFLOAD - Report results of an
151	            asynchronous operation
152	   16. IANA Considerations
153	   17. References
154	     17.1.  Normative References
155	     17.2.  Informative References
156	   Appendix A.  Acknowledgments
157	   Appendix B.  RFC Editor Notes
158	   Author's Address

160	1.  Introduction

162	1.1.  The NFS Version 4 Minor Version 2 Protocol

164	   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
165	   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
166	   version, NFSv4.0, is described in [I-D.ietf-nfsv4-rfc3530bis] and the
167	   second minor version, NFSv4.1, is described in [RFC5661].  NFSv4.2
168	   follows the guidelines for minor versioning that are listed in
169	   Section 11 of [I-D.ietf-nfsv4-rfc3530bis].

171	   As a minor version, NFSv4.2 is consistent with the overall goals for
172	   NFSv4, but extends the protocol so as to better meet those goals,
173	   based on experiences with NFSv4.1.  In addition, NFSv4.2 has adopted
174	   some additional goals, which motivate some of the major extensions in
175	   NFSv4.2.

177	1.2.  Scope of This Document

179	   This document describes the NFSv4.2 protocol.  With respect to
180	   NFSv4.0 and NFSv4.1, this document does not:

182	   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to
183	      contrast with NFSv4.2

185	   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols

187	   o  clarify the NFSv4.0 or NFSv4.1 protocols; that is, any
188	      clarifications made here apply to NFSv4.2 and to neither of the
189	      prior protocols

191	   The full XDR for NFSv4.2 is presented in [4.2xdr].

193	1.3.  NFSv4.2 Goals

195	   The goal of the design of NFSv4.2 is to take common local file system
196	   features and offer them remotely.  These features might

198	   o  already be available on the servers, e.g., sparse files

200	   o  be under development as a new standard, e.g., SEEK_HOLE and
201	      SEEK_DATA

203	   o  be used by clients with the servers via some proprietary means,
204	      e.g., Labeled NFS

206	   but the clients are not able to leverage them on the server within
207	   the confines of the NFS protocol.

209	1.4.  Overview of NFSv4.2 Features

211	1.4.1.  Server-side Copy

213	   A traditional file copy from one server to another results in the
214	   data being put on the network twice - source to client and then
215	   client to destination.  New operations are introduced to allow the
216	   client to authorize the two servers to interact directly.  As this
217	   copy can be lengthy, asynchronous support is also provided.

219	1.4.2.  Application I/O Advise

221	   Applications and clients want to advise the server as to expected I/O
222	   behavior.  Using IO_ADVISE (see Section 14.8) to communicate future
223	   I/O behavior such as whether a file will be accessed sequentially or
224	   randomly, and whether a file will or will not be accessed in the near
225	   future, allows servers to optimize future I/O requests for a file by,
226	   for example, prefetching or evicting data.  This operation can be
227	   used to support the posix_fadvise function as well as other
228	   applications such as databases and video editors.
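
   As a rough illustration of the application-side call that IO_ADVISE
   is meant to support, the following Python sketch issues a
   posix_fadvise hint on a local file.  It relies only on
   os.posix_fadvise from the standard library, which is available on
   some Unix platforms only.  The helper name advise_sequential is
   hypothetical, and whether a given NFS client translates such a hint
   into IO_ADVISE is an implementation matter that this sketch does not
   address.

```python
import os
import tempfile


def advise_sequential(path):
    """Hint that `path` will be read sequentially.

    Returns True if the hint was issued, False if the platform or the
    file system does not accept it.  (Hypothetical helper.)
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # platform does not expose posix_fadvise
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 ask for the advice to cover the whole file
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    except OSError:
        return False  # the hint was rejected for this file
    finally:
        os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 4096)
        tmp.flush()
        print("hint issued:", advise_sequential(tmp.name))
```

   The hint is advisory in this sketch: a False result does not change
   the outcome of later reads.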

230	1.4.3.  Sparse Files

232	   Sparse files are files that contain unallocated data blocks, known as
233	   holes.  Such holes are typically transferred as zeros during I/O.
234	   READ_PLUS (see Section 14.10) allows a server to send back to the
235	   client metadata describing the hole and WRITE_PLUS (see Section 14.7)
236	   allows the client to punch holes into a file.  In addition, SEEK (see
237	   Section 14.11) is provided to scan for the next hole or data from a
238	   given location.
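
   The local file system feature that SEEK offers remotely can be
   sketched with the SEEK_DATA and SEEK_HOLE whence values of lseek.
   The Python sketch below is only a local analogue under stated
   assumptions: it uses os.lseek, os.ftruncate, and os.pwrite from the
   standard library (Unix only), the helper names make_sparse_file and
   next_data_and_hole are hypothetical, and the offsets reported depend
   on how the underlying file system tracks holes.

```python
import os
import tempfile


def make_sparse_file(size, payload):
    """Create a temporary file of `size` bytes with `payload` at the very end."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)                       # extend without writing data
    os.pwrite(fd, payload, size - len(payload))  # only the tail holds data
    return fd, path


def next_data_and_hole(fd, offset):
    """Return (next_data, next_hole) at or after `offset`.

    An entry is None when the platform lacks SEEK_DATA/SEEK_HOLE or the
    seek fails (for example, when no data follows `offset`).
    """
    if not (hasattr(os, "SEEK_DATA") and hasattr(os, "SEEK_HOLE")):
        return None, None
    found = []
    for whence in (os.SEEK_DATA, os.SEEK_HOLE):
        try:
            found.append(os.lseek(fd, offset, whence))
        except OSError:
            found.append(None)
    return tuple(found)


if __name__ == "__main__":
    fd, path = make_sparse_file(1 << 20, b"tail")
    try:
        print(next_data_and_hole(fd, 0))
    finally:
        os.close(fd)
        os.unlink(path)
```

   On a file system without hole tracking, the whole file may be
   reported as data, with the only hole at end-of-file.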

240	1.4.4.  Space Reservation

242	   When a file is sparse, one concern applications have is ensuring that
243	   there will always be enough data blocks available for the file during
244	   future writes.  A new attribute, space_reserved (see Section 12.2.4),
245	   provides the client a guarantee that space will be available.

247	1.4.5.  Application Data Hole (ADH) Support

249	   Some applications treat a file as if it were a disk and as such want
250	   to initialize (or format) the file image.  We extend both READ_PLUS
251	   and WRITE_PLUS to understand this metadata as a new form of a hole.

253	1.4.6.  Labeled NFS

255	   While both clients and servers can employ Mandatory Access Control
256	   (MAC) security models to enforce data access, there has been no
257	   protocol support to allow full interoperability.  A new file object
258	   attribute, sec_label (see Section 12.2.2), allows for the server to
259	   store and enforce MAC labels.  The format of the sec_label
260	   accommodates any MAC security system.

262	1.5.  Differences from NFSv4.1

264	   In NFSv4.1, the only way to introduce new variants of an operation
265	   was to introduce a new operation; e.g., a variant of READ becomes
266	   either READ2 or READ_PLUS.  With the use of discriminated unions as
267	   parameters to such operations in NFSv4.2, it is possible to add a new
268	   arm in a subsequent minor version.  It is also possible to move such an
269	   operation from OPTIONAL/RECOMMENDED to REQUIRED.  Forcing an
270	   implementation to adopt each arm of a discriminated union at such a
271	   time does not meet the spirit of the minor versioning rules.  As
272	   such, new arms of a discriminated union MUST follow the same
273	   guidelines for minor versioning as operations in NFSv4.1 - i.e., they
274	   may not be made REQUIRED.  To support this, a new error code,
275	   NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to
276	   communicate to the client that the operation is supported, but the
277	   specific arm of the discriminated union is not.
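
   The distinction between an unsupported operation and an unsupported
   arm can be sketched as follows.  This is a minimal Python
   illustration, not protocol code: the arm names ARM_A and ARM_B, the
   handler name handle_union_op, and the Status enumeration are
   hypothetical, and numeric error values are deliberately omitted.

```python
from enum import Enum


class Status(Enum):
    """Symbolic status names only; numeric values are intentionally left out."""
    NFS4_OK = "NFS4_OK"
    NFS4ERR_UNION_NOTSUPP = "NFS4ERR_UNION_NOTSUPP"


# Hypothetical arms of one operation's discriminated-union argument.
DEFINED_ARMS = {"ARM_A", "ARM_B"}   # arms the minor version defines
SUPPORTED_ARMS = {"ARM_A"}          # arms this toy server implements


def handle_union_op(arm, payload):
    """Dispatch on the union discriminant of a supported operation."""
    if arm not in DEFINED_ARMS:
        # An undefined discriminant is a decoding problem, not a NOTSUPP case.
        raise ValueError("unknown union arm: %r" % (arm,))
    if arm not in SUPPORTED_ARMS:
        # The operation itself is supported, but this arm is not.
        return Status.NFS4ERR_UNION_NOTSUPP, None
    return Status.NFS4_OK, payload
```

   A client receiving NFS4ERR_UNION_NOTSUPP in this sketch can retry
   with a different arm of the same operation.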

279	2.  Minor Versioning

281	   To address the requirement of an NFS protocol that can evolve as the
282	   need arises, the NFSv4 protocol contains the rules and framework to
283	   allow for future minor changes or versioning.

285	   The base assumption with respect to minor versioning is that any
286	   future accepted minor version will be documented in one or more
287	   Standards Track RFCs.  Minor version 0 of the NFSv4 protocol is
288	   represented by [I-D.ietf-nfsv4-rfc3530bis], minor version 1 by
289	   [RFC5661], and minor version 2 by this document.  The COMPOUND and
290	   CB_COMPOUND procedures support the encoding of the minor version
291	   being requested by the client.

293	   The following items represent the basic rules for the development of
294	   minor versions.  Note that a future minor version may modify or add
295	   to the following rules as part of the minor version definition.

297	   1.   Procedures are not added or deleted.

299	        To maintain the general RPC model, NFSv4 minor versions will not
300	        add to or delete procedures from the NFS program.

302	   2.   Minor versions may add operations to the COMPOUND and
303	        CB_COMPOUND procedures.

305	        The addition of operations to the COMPOUND and CB_COMPOUND
306	        procedures does not affect the RPC model.

308	        *  Minor versions may append attributes to the bitmap4 that
309	           represents sets of attributes and to the fattr4 that
310	           represents sets of attribute values.

312	           This allows for the expansion of the attribute model to allow
313	           for future growth or adaptation.

315	        *  Minor version X must append any new attributes after the last
316	           documented attribute.

318	           Since attribute results are specified as an opaque array of
319	           per-attribute, XDR-encoded results, the complexity of adding
320	           new attributes in the midst of the current definitions would
321	           be too burdensome.

323	   3.   Minor versions must not modify the structure of an existing
324	        operation's arguments or results.

326	        Again, the complexity of handling multiple structure definitions
327	        for a single operation is too burdensome.  New operations should
328	        be added instead of modifying existing structures for a minor
329	        version.

331	        This rule does not preclude the following adaptations in a minor
332	        version:

334	        *  adding bits to flag fields, such as new attributes to
335	           GETATTR's bitmap4 data type, and providing corresponding
336	           variants of opaque arrays, such as a notify4 used together
337	           with such bitmaps

339	        *  adding bits to existing attributes like ACLs that have flag
340	           words

342	        *  extending enumerated types (including NFS4ERR_*) with new
343	           values

345	        *  adding cases to a switched union

347	   4.   Note that when adding new cases to a switched union, a minor
348	        version must not make the new cases REQUIRED.  While the
349	        encapsulating operation may be REQUIRED, the new cases (the
350	        specific arms of the discriminated union) are not.  The error code
351	        NFS4ERR_UNION_NOTSUPP is used to notify the client when the
352	        server does not support such a case.

354	   5.   Minor versions must not modify the structure of existing
355	        attributes.

357	   6.   Minor versions must not delete operations.

359	        This prevents the potential reuse of a particular operation
360	        "slot" in a future minor version.

362	   7.   Minor versions must not delete attributes.

364	   8.   Minor versions must not delete flag bits or enumeration values.

366	   9.   Minor versions may declare an operation MUST NOT be implemented.

368	        Specifying that an operation MUST NOT be implemented is
369	        equivalent to obsoleting an operation.  For the client, it means
370	        that the operation MUST NOT be sent to the server.  For the
371	        server, an NFS error can be returned as opposed to "dropping"
372	        the request as an XDR decode error.  This approach allows for
373	        the obsolescence of an operation while maintaining its structure
374	        so that a future minor version can reintroduce the operation.

376	        1.  Minor versions may declare that an attribute MUST NOT be
377	            implemented.

379	        2.  Minor versions may declare that a flag bit or enumeration
380	            value MUST NOT be implemented.

382	   10.  Minor versions may declare an operation to be OBSOLESCENT, which
383	        indicates an intention to remove the operation (i.e., make it
384	        MANDATORY TO NOT implement) in a subsequent minor version.  Such
385	        labeling is separate from the question of whether the operation
386	        is REQUIRED or RECOMMENDED or OPTIONAL in the current minor
387	        version.  An operation may be both REQUIRED for the given minor
388	        version and marked OBSOLESCENT, with the expectation that it
389	        will be MANDATORY TO NOT implement in the next (or other
390	        subsequent) minor version.

392	   11.  Note that the early notification of operation obsolescence is
393	        put in place to mitigate the effects of design and
394	        implementation mistakes, and to allow protocol development to
395	        adapt to unexpected changes in the pace of implementation.  Even
396	        if an operation is marked OBSOLESCENT in a given minor version,
397	        it may end up not being marked MANDATORY TO NOT implement in the
398	        next minor version.  In unusual circumstances, it might not be
399	        marked OBSOLESCENT in a subsequent minor version, and never
400	        become MANDATORY TO NOT implement.

402	   12.  Minor versions may downgrade features from REQUIRED to
403	        RECOMMENDED, from RECOMMENDED to OPTIONAL, or from OPTIONAL to
404	        MANDATORY TO NOT implement.  Also, if a feature was marked as
405	        OBSOLESCENT in the prior minor version, it may be downgraded
406	        from REQUIRED to OPTIONAL, from RECOMMENDED to MANDATORY TO NOT
407	        implement, or from REQUIRED to MANDATORY TO NOT implement.

409	   13.  Minor versions may upgrade features from OPTIONAL to
410	        RECOMMENDED, or RECOMMENDED to REQUIRED.  Also, if a feature was
411	        marked as OBSOLESCENT in the prior minor version, it may be
412	        upgraded to not be OBSOLESCENT.

414	   14.  A client and server that support minor version X SHOULD support
415	        minor versions 0 through X-1 as well.

417	   15.  Except for infrastructural changes, a minor version must not
418	        introduce REQUIRED new features.

420	        This rule allows for the introduction of new functionality and
421	        forces the use of implementation experience before designating a
422	        feature as REQUIRED.  On the other hand, some classes of
423	        features are infrastructural and have broad effects.  Allowing
424	        infrastructural features to be RECOMMENDED or OPTIONAL
425	        complicates implementation of the minor version.

427	   16.  A client MUST NOT attempt to use a stateid, filehandle, or
428	        similar returned object from the COMPOUND procedure with minor
429	        version X for another COMPOUND procedure with minor version Y,
430	        where X != Y.
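
   Rules 12 and 13 above amount to a small table of permitted status
   changes between consecutive minor versions.  The Python sketch below
   encodes just those two rules; the function name transition_allowed
   and the treatment of an unchanged status as permitted are
   assumptions of the sketch.  A False result means only that rules 12
   and 13 do not list the change, not that some other rule forbids it.

```python
REQUIRED = "REQUIRED"
RECOMMENDED = "RECOMMENDED"
OPTIONAL = "OPTIONAL"
MUST_NOT = "MANDATORY TO NOT implement"

# Rule 12, first sentence: single-step downgrades.
DOWNGRADES = {
    (REQUIRED, RECOMMENDED),
    (RECOMMENDED, OPTIONAL),
    (OPTIONAL, MUST_NOT),
}

# Rule 12, second sentence: allowed only if the feature was marked
# OBSOLESCENT in the prior minor version.
OBSOLESCENT_DOWNGRADES = {
    (REQUIRED, OPTIONAL),
    (RECOMMENDED, MUST_NOT),
    (REQUIRED, MUST_NOT),
}

# Rule 13: single-step upgrades.
UPGRADES = {
    (OPTIONAL, RECOMMENDED),
    (RECOMMENDED, REQUIRED),
}


def transition_allowed(old, new, was_obsolescent=False):
    """Return True if rules 12 and 13 list the change from `old` to `new`."""
    if old == new:
        return True  # assumption: leaving a status unchanged is always fine
    step = (old, new)
    if step in DOWNGRADES or step in UPGRADES:
        return True
    return was_obsolescent and step in OBSOLESCENT_DOWNGRADES
```

   For instance, moving a feature from REQUIRED directly to MANDATORY
   TO NOT implement is accepted by this check only when
   was_obsolescent is True.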

432	3.  Server-side Copy

434	3.1.  Introduction

436	   The server-side copy feature provides a mechanism for the NFS client
437	   to perform a file copy on the server without the data being
438	   transmitted back and forth over the network.  Without this feature,
439	   an NFS client copies data from one location to another by reading the
440	   data from the server over the network, and then writing the data back
441	   over the network to the server.  Using this server-side copy
442	   operation, the client is able to instruct the server to copy the data
443	   locally without the data being sent back and forth over the network
444	   unnecessarily.

446	   If the source object and destination object are on different file
447	   servers, the file servers will communicate with one another to
448	   perform the copy operation.  The server-to-server protocol by which
449	   this is accomplished is not defined in this document.
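
   The saving in network traffic can be made concrete with a small
   calculation.  The Python sketch below counts only file data and
   ignores all protocol overhead; the function name
   file_data_on_network and the mode labels are hypothetical, and the
   figure for the inter-server case assumes the data travels once,
   directly between the two servers.

```python
def file_data_on_network(file_size, mode):
    """Return the approximate file-data bytes that cross the network.

    mode is one of:
      "client": traditional copy, read to the client and written back
      "intra":  server-side copy within a single server
      "inter":  server-side copy between two servers
    """
    crossings = {"client": 2, "intra": 0, "inter": 1}
    return crossings[mode] * file_size
```

   For a file of 100 bytes, the three modes give 200, 0, and 100 bytes
   of file data on the network, respectively.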

451	3.2.  Protocol Overview

453	   The server-side copy offload operations support both intra-server and
454	   inter-server file copies.  An intra-server copy is a copy in which
455	   the source file and destination file reside on the same server.  In
456	   an inter-server copy, the source file and destination file are on
457	   different servers.  In both cases, the copy may be performed
458	   synchronously or asynchronously.

460	   Throughout the rest of this document, we refer to the NFS server
461	   containing the source file as the "source server" and the NFS server
462	   to which the file is transferred as the "destination server".  In the
463	   case of an intra-server copy, the source server and destination
464	   server are the same server.  Therefore, in the context of an
465	   intra-server copy, the terms source server and destination server
466	   refer to the single server performing the copy.

468	   The operations described below are designed to copy files.  Other
469	   file system objects can be copied by building on these operations or
470	   using other techniques.  For example, if the user wishes to copy a
471	   directory, the client can synthesize a directory copy by first
472	   creating the destination directory and then copying the source
473	   directory's files to the new destination directory.  If the user
474	   wishes to copy a namespace junction [FEDFS-NSDB] [FEDFS-ADMIN], the
475	   client can use the ONC RPC Federated Filesystem protocol
476	   [FEDFS-ADMIN] to perform the copy.  Specifically, the client can
477	   determine the source junction's attributes using the FEDFS_LOOKUP_FSN
478	   procedure and create a duplicate junction using the
479	   FEDFS_CREATE_JUNCTION procedure.

481	   For the inter-server copy, the operations are defined to be
482	   compatible with the traditional copy authentication approach.  The
483	   client and user are authorized at the source for reading.  Then they
484	   are authorized at the destination for writing.

486	3.2.1.  Overview of Copy Operations

488	   COPY_NOTIFY:  For inter-server copies, the client sends this
489	      operation to the source server to notify it of a future file copy
490	      from a given destination server for the given user.
491	      (Section 14.3)

493	   OFFLOAD_REVOKE:  Also for inter-server copies, the client sends this
494	      operation to the source server to revoke permission to copy a file
495	      for the given user.  (Section 14.4)

497	   COPY:  Used by the client to request a file copy.  (Section 14.1)

499	   OFFLOAD_ABORT:  Used by the client to abort an asynchronous file
500	      copy.  (Section 14.2)

502	   OFFLOAD_STATUS:  Used by the client to poll the status of an
503	      asynchronous file copy.  (Section 14.5)

505	   CB_OFFLOAD:  Used by the destination server to report the results of
506	      an asynchronous file copy to the client.  (Section 15.1)
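
   The operations listed above can be summarized in a small lookup
   table.  The Python sketch below records only what this overview and
   the section titles state: the operation numbers come from the
   titles of Sections 14.1 through 14.5 and 15.1, and a receiver of
   None means the overview does not name one.  The CopyOp type and the
   helper inter_server_only_ops are hypothetical.

```python
from collections import namedtuple

CopyOp = namedtuple("CopyOp", "number sender receiver inter_server_only")

COPY_OPS = {
    "COPY":           CopyOp(59, "client", None, False),
    "OFFLOAD_ABORT":  CopyOp(60, "client", None, False),
    "COPY_NOTIFY":    CopyOp(61, "client", "source server", True),
    "OFFLOAD_REVOKE": CopyOp(62, "client", "source server", True),
    "OFFLOAD_STATUS": CopyOp(63, "client", None, False),
    # CB_OFFLOAD is a callback operation, numbered separately (15).
    "CB_OFFLOAD":     CopyOp(15, "destination server", "client", False),
}


def inter_server_only_ops():
    """Return the names of operations used only for inter-server copies."""
    return sorted(name for name, op in COPY_OPS.items() if op.inter_server_only)
```

   In this table, only COPY_NOTIFY and OFFLOAD_REVOKE are specific to
   inter-server copies; the remaining operations apply to both kinds.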

508	3.2.2.  Locking the Files

510	   Both the source and destination file may need to be locked to protect
511	   the content during the copy operations.  A client can achieve this by
512	   a combination of OPEN and LOCK operations; that is, either share
513	   reservations or byte-range locks might be desired.

515	3.2.3.  Intra-Server Copy

517	   To copy a file on a single server, the client uses a COPY operation.
518	   The server may respond to the copy operation with the final results
519	   of the copy or it may perform the copy asynchronously and deliver the
520	   results using a CB_OFFLOAD operation callback.  If the copy is
521	   performed asynchronously, the client may poll the status of the copy
522	   using OFFLOAD_STATUS or cancel the copy using OFFLOAD_ABORT.
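
   The two ways a server may answer a COPY can be sketched from the
   client's point of view.  The Python sketch below uses a toy
   in-memory MockServer; it is not a real NFS API, and every class,
   method, and field name in it is hypothetical.  The finish method
   stands in for the server completing the copy at some later time and
   delivering the result through a CB_OFFLOAD callback.

```python
import itertools


class MockServer:
    """Toy stand-in for a server that copies files held in a dict."""

    def __init__(self, asynchronous):
        self.asynchronous = asynchronous
        self.files = {}
        self._pending = {}
        self._ids = itertools.count(1)

    def copy(self, src, dst):
        if not self.asynchronous:
            # Synchronous: the reply carries the final result.
            self.files[dst] = self.files[src]
            return {"done": True, "bytes": len(self.files[dst])}
        # Asynchronous: the reply carries a copy stateid instead.
        stateid = next(self._ids)
        self._pending[stateid] = (src, dst)
        return {"done": False, "stateid": stateid}

    def offload_status(self, stateid):
        return "in progress" if stateid in self._pending else "unknown"

    def offload_abort(self, stateid):
        self._pending.pop(stateid, None)

    def finish(self, stateid, cb_offload):
        """Complete a pending copy and report it via the callback."""
        src, dst = self._pending.pop(stateid)
        self.files[dst] = self.files[src]
        cb_offload(stateid, len(self.files[dst]))


def client_copy(server, src, dst):
    """Request a copy and return the number of bytes copied."""
    reply = server.copy(src, dst)
    if reply["done"]:
        return reply["bytes"]
    results = {}
    stateid = reply["stateid"]
    server.offload_status(stateid)  # optional poll while waiting
    # Simulate the server finishing later and calling back.
    server.finish(stateid, lambda sid, count: results.update({sid: count}))
    return results[stateid]
```

   The client code is the same in both cases up to the COPY reply;
   only the asynchronous path needs to keep the copy stateid.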

524	   A synchronous intra-server copy is shown in Figure 1.  In this
525	   example, the NFS server chooses to perform the copy synchronously.
526	   The copy operation is completed, either successfully or
527	   unsuccessfully, before the server replies to the client's request.
528	   The server's reply contains the final result of the operation.

530	     Client                                  Server
531	        +                                      +
532	        |                                      |
533	        |--- OPEN ---------------------------->| Client opens
534	        |<------------------------------------/| the source file
535	        |                                      |
536	        |--- OPEN ---------------------------->| Client opens
537	        |<------------------------------------/| the destination file
538	        |                                      |
539	        |--- COPY ---------------------------->| Client requests
540	        |<------------------------------------/| a file copy
541	        |                                      |
542	        |--- CLOSE --------------------------->| Client closes
543	        |<------------------------------------/| the destination file
544	        |                                      |
545	        |--- CLOSE --------------------------->| Client closes
546	        |<------------------------------------/| the source file
547	        |                                      |
548	        |                                      |

550	                Figure 1: A synchronous intra-server copy.

552	   An asynchronous intra-server copy is shown in Figure 2.  In this
553	   example, the NFS server performs the copy asynchronously.  The
554	   server's reply to the copy request indicates that the copy operation
555	   was initiated and the final result will be delivered at a later time.
556	   The server's reply also contains a copy stateid.  The client may use
557	   this copy stateid to poll for status information (as shown) or to
558	   cancel the copy using an OFFLOAD_ABORT.  When the server completes the
559	   copy, the server performs a callback to the client and reports the
560	   results.

562	     Client                                  Server
563	        +                                      +
564	        |                                      |
565	        |--- OPEN ---------------------------->| Client opens
566	        |<------------------------------------/| the source file
567	        |                                      |
568	        |--- OPEN ---------------------------->| Client opens
569	        |<------------------------------------/| the destination file
570	        |                                      |
571	        |--- COPY ---------------------------->| Client requests
572	        |<------------------------------------/| a file copy
573	        |                                      |
574	        |                                      |
575	        |--- OFFLOAD_STATUS ------------------>| Client may poll
576	        |<------------------------------------/| for status
577	        |                                      |
578	        |                  .                   | Multiple OFFLOAD_STATUS
579	        |                  .                   | operations may be sent.
580	        |                  .                   |
581	        |                                      |
582	        |<-- CB_OFFLOAD -----------------------| Server reports results
583	        |\------------------------------------>|
584	        |                                      |
585	        |--- CLOSE --------------------------->| Client closes
586	        |<------------------------------------/| the destination file
587	        |                                      |
588	        |--- CLOSE --------------------------->| Client closes
589	        |<------------------------------------/| the source file
590	        |                                      |
591	        |                                      |

593	               Figure 2: An asynchronous intra-server copy.
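
   The asynchronous exchange in Figure 2 can be illustrated with a toy
   simulation.  This is only a sketch: FakeServer, step(), and the
   integer stateids are hypothetical stand-ins invented for the
   example, not part of the protocol or of any NFS implementation.

```python
import itertools
from typing import Callable


class FakeServer:
    """Toy stand-in for a server that always copies asynchronously."""

    def __init__(self, cb_offload: Callable[[int, int], None]):
        self._cb_offload = cb_offload      # models the CB_OFFLOAD callback
        self._ids = itertools.count(1)     # hypothetical integer stateids
        self._jobs = {}                    # stateid -> [bytes copied, total]

    def copy(self, length: int) -> int:
        """COPY: start the copy and return a copy stateid immediately."""
        sid = next(self._ids)
        self._jobs[sid] = [0, length]
        return sid

    def offload_status(self, sid: int) -> int:
        """OFFLOAD_STATUS: report how many bytes have been copied so far."""
        return self._jobs[sid][0]

    def offload_abort(self, sid: int) -> None:
        """OFFLOAD_ABORT: cancel the copy and drop its state."""
        self._jobs.pop(sid, None)

    def step(self, sid: int, nbytes: int) -> None:
        """Simulate background progress; call back once the copy is done."""
        job = self._jobs[sid]
        job[0] = min(job[1], job[0] + nbytes)
        if job[0] == job[1]:
            del self._jobs[sid]
            self._cb_offload(sid, job[0])


results = {}
server = FakeServer(lambda sid, n: results.__setitem__(sid, n))
sid = server.copy(300)                 # client requests a file copy
server.step(sid, 100)
polled = server.offload_status(sid)    # client may poll for status
server.step(sid, 200)                  # server finishes, reports results
```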

595	3.2.4.  Inter-Server Copy

597	   A copy may also be performed between two servers.  The copy protocol
598	   is designed to accommodate a variety of network topologies.  As shown
599	   in Figure 3, the client and servers may be connected by multiple
600	   networks.  In particular, the servers may be connected by a
601	   specialized, high-speed network (network 192.0.2.0/24 in the diagram)
602	   that does not include the client.  The protocol allows the client to
603	   set up the copy between the servers (over network 203.0.113.0/24 in
604	   the diagram) and for the servers to communicate on the high-speed
605	   network if they choose to do so.

607	                             192.0.2.0/24
608	                 +-------------------------------------+
609	                 |                                     |
610	                 |                                     |
611	                 | 192.0.2.18                          | 192.0.2.56
612	         +-------+------+                       +------+------+
613	         |     Source   |                       | Destination |
614	         +-------+------+                       +------+------+
615	                 | 203.0.113.18                        | 203.0.113.56
616	                 |                                     |
617	                 |                                     |
618	                 |             203.0.113.0/24          |
619	                 +------------------+------------------+
620	                                    |
621	                                    |
622	                                    | 203.0.113.243
623	                              +-----+-----+
624	                              |   Client  |
625	                              +-----------+

627	            Figure 3: An example inter-server network topology.

629	   For an inter-server copy, the client notifies the source server that
630	   a file will be copied by the destination server using a COPY_NOTIFY
631	   operation.  The client then initiates the copy by sending the COPY
632	   operation to the destination server.  The destination server may
633	   perform the copy synchronously or asynchronously.

635	   A synchronous inter-server copy is shown in Figure 4.  In this case,
636	   the destination server chooses to perform the copy before responding
637	   to the client's COPY request.

639	   An asynchronous copy is shown in Figure 5.  In this case, the
640	   destination server chooses to respond to the client's COPY request
641	   immediately and then perform the copy asynchronously.

643	     Client                Source         Destination
644	        +                    +                 +
645	        |                    |                 |
646	        |--- OPEN        --->|                 | Returns os1
647	        |<------------------/|                 |
648	        |                    |                 |
649	        |--- COPY_NOTIFY --->|                 |
650	        |<------------------/|                 |
651	        |                    |                 |
652	        |--- OPEN ---------------------------->| Returns os2
653	        |<------------------------------------/|
654	        |                    |                 |
655	        |--- COPY ---------------------------->|
656	        |                    |                 |
657	        |                    |                 |
658	        |                    |<----- read -----|
659	        |                    |\--------------->|
660	        |                    |                 |
661	        |                    |        .        | Multiple reads may
662	        |                    |        .        | be necessary
663	        |                    |        .        |
664	        |                    |                 |
665	        |                    |                 |
666	        |<------------------------------------/| Destination replies
667	        |                    |                 | to COPY
668	        |                    |                 |
669	        |--- CLOSE --------------------------->| Release open state
670	        |<------------------------------------/|
671	        |                    |                 |
672	        |--- CLOSE       --->|                 | Release open state
673	        |<------------------/|                 |

675	                Figure 4: A synchronous inter-server copy.

677	     Client                Source         Destination
678	       +                    +                 +
679	       |                    |                 |
680	       |--- OPEN        --->|                 | Returns os1
681	       |<------------------/|                 |
682	       |                    |                 |
683	       |--- LOCK        --->|                 | Optional, could be done
684	       |<------------------/|                 | with a share lock
685	       |                    |                 |
686	       |--- COPY_NOTIFY --->|                 | Need to pass in
687	       |<------------------/|                 | os1 or lock state
688	       |                    |                 |
689	       |                    |                 |
690	       |                    |                 |
691	       |--- OPEN ---------------------------->| Returns os2
692	       |<------------------------------------/|
693	       |                    |                 |
694	       |--- LOCK ---------------------------->| Optional ...
695	       |<------------------------------------/|
696	       |                    |                 |
697	       |--- COPY ---------------------------->| Need to pass in
698	       |<------------------------------------/| os2 or lock state
699	       |                    |                 |
700	       |                    |                 |
701	       |                    |<----- read -----|
702	       |                    |\--------------->|
703	       |                    |                 |
704	       |                    |        .        | Multiple reads may
705	       |                    |        .        | be necessary
706	       |                    |        .        |
707	       |                    |                 |
708	       |                    |                 |
709	       |--- OFFLOAD_STATUS ------------------>| Client may poll
710	       |<------------------------------------/| for status
711	       |                    |                 |
712	       |                    |        .        | Multiple OFFLOAD_STATUS
713	       |                    |        .        | operations may be sent
714	       |                    |        .        |
715	       |                    |                 |
716	       |                    |                 |
717	       |                    |                 |
718	       |<-- CB_OFFLOAD -----------------------| Destination reports
719	       |\------------------------------------>| results
720	       |                    |                 |
721	       |--- LOCKU --------------------------->| Only if LOCK was done
722	       |<------------------------------------/|
723	       |                    |                 |
724	       |--- CLOSE --------------------------->| Release open state
725	       |<------------------------------------/|
726	       |                    |                 |
727	       |--- LOCKU       --->|                 | Only if LOCK was done
728	       |<------------------/|                 |
729	       |                    |                 |
730	       |--- CLOSE       --->|                 | Release open state
731	       |<------------------/|                 |
732	       |                    |                 |

734	               Figure 5: An asynchronous inter-server copy.

736	3.2.5.  Server-to-Server Copy Protocol

738	   The source server and destination server are not required to use a
739	   specific protocol to transfer the file data.  The choice of what
740	   protocol to use is ultimately the destination server's decision.

742	3.2.5.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

744	   The destination server MAY use standard NFSv4.x (where x >= 1)
745	   operations to read the data from the source server.  If NFSv4.x is
746	   used for the server-to-server copy protocol, the destination server
747	   can use the source filehandle provided in the COPY request with
748	   standard NFSv4.x operations to read data from the source server.
749	   Specifically, the destination server may use the NFSv4.x OPEN
750	   operation's CLAIM_FH facility to open the file being copied and
751	   obtain an open stateid.  Using the stateid, the destination server
752	   may then use NFSv4.x READ operations to read the file.

754	3.2.5.2.  Using an Alternative Server-to-Server Copy Protocol

756	   In a homogeneous environment, the source and destination servers
757	   might be able to perform the file copy extremely efficiently using
758	   specialized protocols.  For example, the source and destination
759	   servers might be two nodes sharing a common file system format for
760	   the source and destination file systems.  Thus the source and
761	   destination are in an ideal position to efficiently render the image
762	   of the source file to the destination file by replicating the file
763	   system formats at the block level.  Another possibility is that the
764	   source and destination might be two nodes sharing a common storage
765	   area network, and thus there is no need to copy any data at all, and
766	   instead ownership of the file and its contents might simply be re-
767	   assigned to the destination.  To allow for these possibilities, the
768	   destination server is allowed to use a server-to-server copy protocol
769	   of its choice.

771	   In a heterogeneous environment, using a protocol other than NFSv4.x
772	   (e.g., HTTP [RFC2616] or FTP [RFC0959]) presents some challenges.  In
773	   particular, the destination server is presented with the challenge of
774	   accessing the source file given only an NFSv4.x filehandle.

776	   One option for protocols that identify source files with path names
777	   is to use an ASCII hexadecimal representation of the source
778	   filehandle as the file name.

780	   Another option for the source server is to use URLs to direct the
781	   destination server to a specialized service.  For example, the
782	   response to COPY_NOTIFY could include the URL
783	   ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII
784	   hexadecimal representation of the source filehandle.  When the
785	   destination server receives the source server's URL, it would use
786	   "_FH/0x12345" as the file name to pass to the FTP server listening on
787	   port 9999 of s1.example.com.  On port 9999 there would be a special
788	   instance of the FTP service that understands how to convert NFS
789	   filehandles to an open file descriptor (in many operating systems,
790	   this would require a new system call, one which is the inverse of the
791	   makefh() function that the pre-NFSv4 MOUNT service needs).
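
   The URL convention described above can be sketched as follows.  The
   helper names are hypothetical, and the example filehandle is three
   arbitrary bytes chosen for illustration; a real filehandle is an
   opaque value chosen by the source server.

```python
from urllib.parse import urlsplit


def filehandle_to_name(fh: bytes) -> str:
    """Render an opaque filehandle as an ASCII hexadecimal file name."""
    return "0x" + fh.hex()


def copy_url(host: str, port: int, fh: bytes) -> str:
    """Build the URL a source server might return from COPY_NOTIFY."""
    return f"ftp://{host}:{port}/_FH/{filehandle_to_name(fh)}"


def name_from_url(url: str) -> str:
    """File name the destination would pass to the FTP service."""
    return urlsplit(url).path.lstrip("/")


url = copy_url("s1.example.com", 9999, b"\x12\x34\x56")
name = name_from_url(url)
```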

793	   Authenticating and identifying the destination server to the source
794	   server is also a challenge.  Recommendations for how to accomplish
795	   this are given in Section 3.4.1.3.

797	3.3.  Requirements for Operations

799	   The implementation of server-side copy is OPTIONAL by the client and
800	   the server.  However, in order to successfully copy a file, some
801	   operations MUST be supported by the client and/or server.

803	   If a client desires an intra-server file copy, then it MUST support
804	   the COPY and CB_OFFLOAD operations.  If COPY returns a stateid, then
805	   the client MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations.

807	   If a client desires an inter-server file copy, then it MUST support
808	   the COPY, COPY_NOTIFY, and CB_OFFLOAD operations, and MAY use the
809	   OFFLOAD_REVOKE operation.  If COPY returns a stateid, then the client
810	   MAY use the OFFLOAD_ABORT and OFFLOAD_STATUS operations.

812	   If a server supports intra-server copy, then the server MUST support
813	   the COPY operation.  If a server's COPY operation returns a stateid,
814	   then the server MUST also support these operations: CB_OFFLOAD,
815	   OFFLOAD_ABORT, and OFFLOAD_STATUS.

817	   If a source server supports inter-server copy, then the source server
818	   MUST support all these operations: COPY_NOTIFY and OFFLOAD_REVOKE.
819	   If a destination server supports inter-server copy, then the
820	   destination server MUST support the COPY operation.  If a destination
821	   server's COPY operation returns a stateid, then the destination
822	   server MUST also support these operations: CB_OFFLOAD, OFFLOAD_ABORT,
823	   COPY_NOTIFY, OFFLOAD_REVOKE, and OFFLOAD_STATUS.
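
   The MUST-level requirements above can be restated as a lookup
   table.  This is a sketch for checking an implementation's operation
   list against the text; the role names, table layout, and
   missing_ops() helper are assumptions made for the example.

```python
# Operations that MUST be supported, keyed by (role, copy kind).
REQUIRED_OPS = {
    ("client", "intra"): {"COPY", "CB_OFFLOAD"},
    ("client", "inter"): {"COPY", "COPY_NOTIFY", "CB_OFFLOAD"},
    ("server", "intra"): {"COPY"},
    ("source", "inter"): {"COPY_NOTIFY", "OFFLOAD_REVOKE"},
    ("destination", "inter"): {"COPY"},
}

# Additional operations required when the server's COPY returns a stateid.
STATEID_OPS = {
    ("server", "intra"): {"CB_OFFLOAD", "OFFLOAD_ABORT", "OFFLOAD_STATUS"},
    ("destination", "inter"): {
        "CB_OFFLOAD", "OFFLOAD_ABORT", "COPY_NOTIFY",
        "OFFLOAD_REVOKE", "OFFLOAD_STATUS",
    },
}


def missing_ops(role, kind, supported, returns_stateid=False):
    """Return the required operations absent from the supported set."""
    needed = set(REQUIRED_OPS[(role, kind)])
    if returns_stateid:
        needed |= STATEID_OPS.get((role, kind), set())
    return needed - set(supported)
```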

825	   Each operation is performed in the context of the user identified by
826	   the ONC RPC credential of its containing COMPOUND or CB_COMPOUND
827	   request.  For example, an OFFLOAD_ABORT operation issued by a given
828	   user indicates that a specified COPY operation initiated by the same
829	   user be canceled.  Therefore an OFFLOAD_ABORT MUST NOT interfere with
830	   a copy of the same file initiated by another user.

832	   An NFS server MAY allow an administrative user to monitor or cancel
833	   copy operations using an implementation specific interface.

835	3.3.1.  netloc4 - Network Locations

837	   The server-side copy operations specify network locations using the
838	   netloc4 data type shown below:

840	   enum netloc_type4 {
841	           NL4_NAME        = 0,
842	           NL4_URL         = 1,
843	           NL4_NETADDR     = 2
844	   };
845	   union netloc4 switch (netloc_type4 nl_type) {
846	           case NL4_NAME:          utf8str_cis nl_name;
847	           case NL4_URL:           utf8str_cis nl_url;
848	           case NL4_NETADDR:       netaddr4    nl_addr;
849	   };

851	   If the netloc4 is of type NL4_NAME, the nl_name field MUST be
852	   specified as a UTF-8 string.  The nl_name is expected to be resolved
853	   to a network address via DNS, LDAP, NIS, /etc/hosts, or some other
854	   means.  If the netloc4 is of type NL4_URL, a server URL [RFC3986]
855	   appropriate for the server-to-server copy operation is specified as a
856	   UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr
857	   field MUST contain a valid netaddr4 as defined in Section 3.3.9 of
858	   [RFC5661].
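
   As an illustration, the netloc4 discriminated union could be
   modeled in Python as shown below.  This is a sketch only: the class
   names are hypothetical, and Netaddr4 is assumed to mirror the
   netid and universal-address strings of netaddr4 in [RFC5661].

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Union


class NetlocType4(IntEnum):
    NL4_NAME = 0
    NL4_URL = 1
    NL4_NETADDR = 2


@dataclass(frozen=True)
class Netaddr4:
    netid: str  # e.g. "tcp"
    addr: str   # universal address, e.g. "203.0.113.18.8.1" (port 2049)


@dataclass(frozen=True)
class Netloc4:
    nl_type: NetlocType4
    value: Union[str, Netaddr4]

    def __post_init__(self):
        # NL4_NAME and NL4_URL carry a string; NL4_NETADDR a Netaddr4.
        want = Netaddr4 if self.nl_type == NetlocType4.NL4_NETADDR else str
        if not isinstance(self.value, want):
            raise TypeError(f"{self.nl_type.name} requires {want.__name__}")


by_name = Netloc4(NetlocType4.NL4_NAME, "s1.example.com")
by_addr = Netloc4(NetlocType4.NL4_NETADDR, Netaddr4("tcp", "203.0.113.18.8.1"))
```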

860	   When netloc4 values are used for an inter-server copy as shown in
861	   Figure 3, their values may be evaluated on the source server,
862	   destination server, and client.  The network environment in which
863	   these systems operate should be configured so that the netloc4 values
864	   are interpreted as intended on each system.

866	3.3.2.  Copy Offload Stateids

868	   A server may perform a copy offload operation asynchronously.  An
869	   asynchronous copy is tracked using a copy offload stateid.  Copy
870	   offload stateids are included in the COPY, OFFLOAD_ABORT,
871	   OFFLOAD_STATUS, and CB_OFFLOAD operations.

873	   Section 8.2.4 of [RFC5661] specifies that stateids are valid until
874	   either (A) the client or server restarts or (B) the client returns the
875	   resource.

877	   A copy offload stateid will be valid until either (A) the client or
878	   server restarts or (B) the client returns the resource by issuing a
879	   server restarts or (B) the client returns the resource by issuing an
880	   operation.

882	   A copy offload stateid's seqid MUST NOT be 0.  In the context of a
883	   copy offload operation, it is ambiguous to indicate the most recent
884	   copy offload operation using a stateid with seqid of 0.  Therefore a
885	   copy offload stateid with seqid of 0 MUST be considered invalid.
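
   A receiver might enforce the seqid rule with a check along these
   lines.  This is a minimal sketch; the Stateid4 class is a
   hypothetical model of a stateid as a 32-bit seqid plus 12 opaque
   bytes, and the function name is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid4:
    seqid: int    # unsigned 32-bit sequence number
    other: bytes  # 12 opaque bytes


def is_valid_copy_offload_stateid(sid: Stateid4) -> bool:
    """Reject seqid 0, which is ambiguous for copy offload stateids."""
    return len(sid.other) == 12 and 0 < sid.seqid <= 0xFFFFFFFF
```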

887	3.4.  Security Considerations

889	   The security considerations pertaining to NFSv4
890	   [I-D.ietf-nfsv4-rfc3530bis] apply to this chapter.

892	   The standard security mechanisms provided by NFSv4
893	   [I-D.ietf-nfsv4-rfc3530bis] may be used to secure the protocol
894	   described in this chapter.

896	   NFSv4 clients and servers supporting the inter-server copy operations
897	   described in this chapter are REQUIRED to implement the mechanism
898	   described in Section 3.4.1.2, and to support rejecting COPY_NOTIFY
899	   requests that do not use RPCSEC_GSS with privacy.  This requirement
900	   to implement is not a requirement to use; for example, a server may
901	   depending on configuration also allow COPY_NOTIFY requests that use
902	   only AUTH_SYS.

904	3.4.1.  Inter-Server Copy Security

906	3.4.1.1.  Requirements for Secure Inter-Server Copy

908	   Inter-server copy is driven by several requirements:

910	   o  The specification must not mandate an inter-server copy protocol.
911	      There are many ways to copy data.  Some will be more optimal than
912	      others depending on the identities of the source server and
913	      destination server.  For example, the source and destination
914	      servers might be two nodes sharing a common file system format for
915	      the source and destination file systems.  Thus the source and
916	      destination are in an ideal position to efficiently render the
917	      image of the source file to the destination file by replicating
918	      the file system formats at the block level.  In other cases, the
919	      source and destination might be two nodes sharing a common storage
920	      area network, and thus there is no need to copy any data at all,
921	      and instead ownership of the file and its contents simply gets re-
922	      assigned to the destination.

924	   o  The specification must provide guidance for using NFSv4.x as a
925	      copy protocol.  For those source and destination servers willing
926	      to use NFSv4.x there are specific security considerations that
927	      this specification can and does address.

929	   o  The specification must not mandate pre-configuration between the
930	      source and destination server.  Requiring that the source and
931	      destination first have a "copying relationship" increases the
932	      administrative burden.  However the specification MUST NOT
933	      preclude implementations that require pre-configuration.

935	   o  The specification must not mandate a trust relationship between
936	      the source and destination server.  The NFSv4 security model
937	      requires mutual authentication between a principal on an NFS
938	      client and a principal on an NFS server.  This model MUST continue
939	      with the introduction of COPY.

941	3.4.1.2.  Inter-Server Copy via ONC RPC

943	   In the absence of a strong security mechanism designed for the
944	   purpose, the challenge is how the source server and destination
945	   server identify themselves to each other, especially in the presence
946	   of multi-homed source and destination servers.  In a multi-homed
947	   environment, the destination server might not contact the source
948	   server from the same network address specified by the client in the
949	   COPY_NOTIFY.  This can be overcome using the procedure described
950	   below.

952	   When the client sends the source server the COPY_NOTIFY operation,
953	   the source server may reply to the client with a list of target
954	   addresses, names, and/or URLs and assign them to the unique
955	   quadruple: <random number, source fh, user ID, destination
956	   address>.  If the destination uses one of these target netlocs to contact
957	   the source server, the source server will be able to uniquely
958	   identify the destination server, even if the destination server does
959	   not connect from the address specified by the client in COPY_NOTIFY.
960	   The level of assurance in this identification depends on the
961	   unpredictability, strength and secrecy of the random number.

963	   For example, suppose the network topology is as shown in Figure 3.
964	   If the source filehandle is 0x12345, the source server may respond to
965	   a COPY_NOTIFY for destination 203.0.113.56 with the URLs:

967	      nfs://203.0.113.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/
968	      _FH/0x12345

970	      nfs://192.0.2.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/_FH/
971	      0x12345

973	   The name component after _COPY is 24 characters of base 64, more than
974	   enough to encode a 128 bit random number.
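
   The arithmetic can be checked with a short sketch.  Base 64 carries
   6 bits per character, so 24 characters hold 144 bits, which exceeds
   128.  The function name is hypothetical, and the use of the
   URL-safe base 64 alphabet is an assumption made so that the value
   never contains "/" and can serve as a single path component.

```python
import base64
import secrets


def new_copy_token() -> str:
    """Return 24 base 64 characters encoding 144 random bits."""
    raw = secrets.token_bytes(18)  # 18 bytes * 8 = 144 bits >= 128
    return base64.urlsafe_b64encode(raw).decode("ascii")


token = new_copy_token()
```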

976	   The client will then send these URLs to the destination server in the
977	   COPY operation.  Suppose that the 192.0.2.0/24 network is a high
978	   speed network and the destination server decides to transfer the file
979	   over this network.  If the destination contacts the source server
980	   from 192.0.2.56 over this network using NFSv4.1, it does the
981	   following:

983	   COMPOUND  { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP
984	      "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "203.0.113.56"; LOOKUP "_FH" ;
985	      OPEN "0x12345" ; GETFH }

987	   Provided that the random number is unpredictable and has been kept
988	   secret by the parties involved, the source server will therefore know
989	   that these NFSv4.x operations are being issued by the destination
990	   server identified in the COPY_NOTIFY.  This random number technique
991	   only provides initial authentication of the destination server, and
992	   cannot defend against man-in-the-middle attacks after authentication
993	   or an eavesdropper that observes the random number on the wire.
994	   Other secure communication techniques (e.g., IPsec) are necessary to
995	   block these attacks.
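
   The mapping from one of the example URLs to the operations in the
   COMPOUND shown earlier can be sketched as follows.  The function
   name is hypothetical and the operations are rendered as plain
   strings for readability; a real implementation would build XDR
   arguments instead.

```python
from urllib.parse import urlsplit


def compound_ops(url: str) -> list:
    """Turn the URL path into PUTROOTFH, LOOKUPs, OPEN, and GETFH."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    *dirs, leaf = parts
    ops = ["PUTROOTFH"]
    ops += [f'LOOKUP "{d}"' for d in dirs]
    ops += [f'OPEN "{leaf}"', "GETFH"]
    return ops


ops = compound_ops(
    "nfs://192.0.2.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/203.0.113.56/_FH/0x12345"
)
```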

997	   Servers SHOULD reject COPY_NOTIFY requests that do not use RPCSEC_GSS
998	   with privacy, thus ensuring the URL in the COPY_NOTIFY reply is
999	   encrypted.  For the same reason, clients SHOULD send COPY requests to
1000	   the destination using RPCSEC_GSS with privacy.

1002	3.4.1.3.  Inter-Server Copy without ONC RPC

1004	   The same techniques as Section 3.4.1.2, using unique URLs for each
1005	   destination server, can be used for other protocols (e.g., HTTP
1006	   [RFC2616] and FTP [RFC0959]) as well.

1008	4.  Support for Application IO Hints

1010	   Applications can issue client I/O hints via posix_fadvise()
1011	   [posix_fadvise] to the NFS client.  While this can help the NFS
1012	   client optimize I/O and caching for a file, it does not allow the NFS
1013	   server and its exported file system to do likewise.  We add an
1014	   IO_ADVISE procedure (Section 14.8) to communicate the client file
1015	   access patterns to the NFS server.  The NFS server upon receiving an
1016	   IO_ADVISE operation MAY choose to alter its I/O and caching behavior,
1017	   but is under no obligation to do so.

1019	   Application specific NFS clients such as those used by hypervisors
1020	   and databases can also leverage application hints to communicate
1021	   their specialized requirements.

1023	5.  Sparse Files

1025	5.1.  Introduction

1027	   A sparse file is a common way of representing a large file without
1028	   having to utilize all of the disk space for it.  Consequently, a
1029	   sparse file uses less physical space than its size indicates.  This
1030	   means the file contains 'holes', byte ranges within the file that
1031	   contain no data.  Most modern file systems support sparse files,
1032	   including most UNIX file systems and NTFS, but notably not Apple's
1033	   HFS+.  Common examples of sparse files include Virtual Machine (VM)
1034	   OS/disk images, database files, log files, and even checkpoint
1035	   recovery files most commonly used by the HPC community.

1037	   If an application reads a hole in a sparse file, the file system must
1038	   return all zeros to the application.  For local data access there is
1039	   little penalty, but with NFS these zeroes must be transferred back to
1040	   the client.  If an application uses the NFS client to read data into
1041	   memory, this wastes time and bandwidth as the application waits for
1042	   the zeroes to be transferred.

1044	   A sparse file is typically created by initializing the file to be all
1045	   zeros - nothing is written to the data in the file, instead the hole
1046	   is recorded in the metadata for the file.  So an 8G disk image might
1047	   be represented initially by a couple hundred bits in the inode and
1048	   nothing on the disk.  If the VM then writes 100M to a file in the
1049	   middle of the image, there would now be two holes represented in the
1050	   metadata and 100M in the data.
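
   The disk image example can be worked through with a small sketch
   that derives holes from the list of data extents.  The function
   name and the choice to place the 100M write at the 4G offset are
   assumptions made for illustration.

```python
def holes(size, extents):
    """Return (offset, length) holes given (offset, length) data extents."""
    out = []
    pos = 0
    for off, length in sorted(extents):
        if off > pos:
            out.append((pos, off - pos))  # gap before this extent
        pos = max(pos, off + length)
    if pos < size:
        out.append((pos, size - pos))     # trailing hole
    return out


G = 1 << 30
M = 1 << 20
empty_image = holes(8 * G, [])                   # one hole, no data
after_write = holes(8 * G, [(4 * G, 100 * M)])   # two holes around 100M
```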

1052	   Two new operations WRITE_PLUS (Section 14.7) and READ_PLUS
1053	   (Section 14.10) are introduced.  WRITE_PLUS allows for the creation
1054	   of a sparse file and for hole punching.  An application might want to
1055	   zero out a range of the file.  READ_PLUS supports all the features of
1056	   READ but includes an extension to support sparse pattern files
1057	   (Section 7.1.2).  READ_PLUS is guaranteed to perform no worse than
1058	   READ, and can dramatically improve performance with sparse files.
1059	   READ_PLUS does not depend on pNFS protocol features, but can be used
1060	   by pNFS to support sparse files.

1062	5.2.  Terminology

1064	   Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1066	   Sparse file:  A Regular file that contains one or more Holes.

1068	   Hole:  A byte range within a Sparse file that contains regions of all
1069	      zeroes.  For block-based file systems, this could also be an
1070	      unallocated region of the file.

1072	   Hole Threshold:  The minimum length of a Hole as determined by the
1073	      server.  If a server chooses to define a Hole Threshold, then it
1074	      would not return hole information about holes with a length
1075	      shorter than the Hole Threshold.

1077	5.3.  New Operations

1079	   READ_PLUS and WRITE_PLUS are new variants of the NFSv4.1 READ and
1080	   WRITE operations [RFC5661].  Besides being able to support all of the
1081	   data semantics of those operations, they can also be used by the
1082	   client and server to efficiently transfer both holes and ADHs (see
1083	   Section 7.1.1).  As both READ and WRITE are inefficient for transfer
1084	   of sparse sections of the file, they are marked as OBSOLESCENT in
1085	   NFSv4.2.  Instead, a client should utilize READ_PLUS and WRITE_PLUS.
1086	   Note that as the client has no a priori knowledge of whether either
1087	   an ADH or a hole is present or not, if it supports these operations
1088	   and so does the server, then it should always use these operations.

1090	5.3.1.  READ_PLUS

1092	   For holes, READ_PLUS extends the response to avoid returning data for
1093	   portions of the file which are initialized and contain no backing
1094	   store.  Additionally it will do so if the result would appear to be a
1095	   hole.  I.e., if the result was a data block composed entirely of
1096	   zeros, then it is easier to return a hole.  Returning data blocks of
1097	   uninitialized data wastes computational and network resources, thus
1098	   reducing performance.  For ADHs, READ_PLUS is used to return the
1099	   metadata describing the portions of the file which are initialized
1100	   and contain no backing store.

1102	   If the client sends a READ operation, it is explicitly stating that
1103	   it is neither supporting sparse files nor ADHs.  So if a READ occurs
1104	   on a sparse ADH or file, then the server must expand such data to be
1105	   raw bytes.  If a READ occurs in the middle of a hole or ADH, the
1106	   server can only send back bytes starting from that offset.  In
1107	   contrast, if a READ_PLUS occurs in the middle of a hole or ADH, the
1108	   server can send back a range which starts before the offset and
1109	   extends past the range.
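
   The difference between the two replies can be sketched with a toy
   model.  This is not the wire format: the function names are
   hypothetical, holes are modeled as (offset, length) pairs, and only
   reads that start inside a hole are handled.

```python
from typing import List, Optional, Tuple

Hole = Tuple[int, int]  # (offset, length)


def find_hole(holes: List[Hole], offset: int) -> Optional[Hole]:
    """Return the hole containing offset, if any."""
    for start, length in holes:
        if start <= offset < start + length:
            return (start, length)
    return None


def read_reply(holes: List[Hole], offset: int, count: int) -> bytes:
    """READ: the hole is expanded to raw zero bytes starting at offset."""
    hole = find_hole(holes, offset)
    if hole is None:
        raise ValueError("toy model only handles reads inside a hole")
    end = min(offset + count, hole[0] + hole[1])
    return bytes(end - offset)


def read_plus_reply(holes: List[Hole], offset: int, count: int) -> Hole:
    """READ_PLUS: the server may describe the whole hole instead."""
    hole = find_hole(holes, offset)
    if hole is None:
        raise ValueError("toy model only handles reads inside a hole")
    return hole
```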

1111	5.3.2.  WRITE_PLUS

1113	   WRITE_PLUS can be used to either hole punch or initialize ADHs.  For
1114	   either purpose, the client can avoid the transfer of a repetitive
1115	   pattern across the network.  If the filesystem on the server does not
1116	   support sparse files, the WRITE_PLUS operation may return the result
1117	   asynchronously via the CB_OFFLOAD operation.  As a hole punch may
1118	   entail deallocating data blocks, even if the filesystem supports
1119	   sparse files, it may still have to return the result via CB_OFFLOAD.

1121	6.  Space Reservation

1123	6.1.  Introduction

1125	   Applications such as hypervisors want to be able to reserve space for
1126	   a file, report the amount of actual disk space a file occupies, and
1127	   free-up the backing space of a file when it is not required.  In
1128	   virtualized environments, virtual disk files are often stored on NFS
1129	   mounted volumes.  Since virtual disk files represent the hard disks
1130	   of virtual machines, hypervisors often have to guarantee certain
1131	   properties for the file.

1133	   One such example is space reservation.  When a hypervisor creates a
1134	   virtual disk file, it often tries to preallocate the space for the
1135	   file so that there are no future allocation related errors during the
1136	   operation of the virtual machine.  Such errors prevent a virtual
1137	   machine from continuing execution and result in downtime.

1139	   Currently, in order to achieve such a guarantee, applications zero
1140	   the entire file.  The initial zeroing allocates the backing blocks
1141	   and all subsequent writes are overwrites of already allocated blocks.
1142	   This approach is not only inefficient in terms of the amount of I/O
1143	   done, it is also not guaranteed to work on file systems that are log
1144	   structured or deduplicated.  An efficient way of guaranteeing space
1145	   reservation would be beneficial to such applications.

1147	   We define a "reservation" as being the combination of the
1148	   space_reserved attribute (see Section 12.2.4) and the size attribute
1149	   (see Section 5.8.1.5 of [RFC5661]).  If the space_reserved attribute is
1150	   set on a file, it is guaranteed that writes that do not grow the file
1151	   past the size will not fail with NFS4ERR_NOSPC.  Once the size is
1152	   changed, then the reservation is changed to that new size.

1154	   Another useful feature is the ability to report the number of blocks
1155	   that would be freed when a file is deleted.  Currently, NFS reports
1156	   two size attributes:

1158	   size  The logical file size of the file.

1160	   space_used  The size in bytes that the file occupies on disk

1162	   While these attributes are sufficient for space accounting in
1163	   traditional file systems, they prove to be inadequate in modern file
1164	   systems that support block sharing.  In such file systems, multiple
1165	   inodes can point to a single block with a block reference count to
1166	   guard against premature freeing.  Having a way to tell the number of
1167	   blocks that would be freed if the file was deleted would be useful to
1168	   applications that wish to migrate files when a volume is low on
1169	   space.

1171	   Since virtual disks represent a hard drive in a virtual machine, a
1172	   virtual disk can be viewed as a file system within a file.  Since not
1173	   all blocks within a file system are in use, there is an opportunity
1174	   to reclaim blocks that are no longer in use.  A call to deallocate
1175	   blocks could result in better space efficiency.  Less space may be
1176	   consumed for backups after block deallocation.

1178	   The following operations and attributes can be used to resolve these
1179	   issues:

1181	   space_reserved  This attribute specifies that writes to the reserved
1182	      area of the file will not fail with NFS4ERR_NOSPC.

1184	   space_freed  This attribute specifies the space freed when a file is
1185	      deleted, taking block sharing into consideration.

1187	   WRITE_PLUS  This operation zeroes and/or deallocates the blocks
1188	      backing a region of the file.

1190	   If space_used of a file is interpreted to mean the size in bytes of
1191	   all disk blocks pointed to by the inode of the file, then shared
1192	   blocks get double counted, over-reporting the space utilization.
1193	   This also has the adverse effect that the deletion of a file with
1194	   shared blocks frees up less than space_used bytes.

1196	   On the other hand, if space_used is interpreted to mean the size in
1197	   bytes of those disk blocks unique to the inode of the file, then
1198	   shared blocks are not counted in any file, resulting in under-
1199	   reporting of the space utilization.

1201	   For example, two files A and B have 10 blocks each.  Let 6 of these
1202	   blocks be shared between them.  Thus, the combined space utilized by
1203	   the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1204	   combined space utilization of the two files would be reported as 20 *
1205	   BLOCK_SIZE.  However, deleting either would only result in 4 *
1206	   BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1207	   report that the space utilization is only 8 * BLOCK_SIZE.

1209	   Adding another size attribute, space_freed (see Section 12.2.5), is
1210	   helpful in solving this problem.  space_freed is the number of blocks
1211	   that are allocated to the given file that would be freed on its
1212	   deletion.  In the example, both A and B would report space_freed as 4
1213	   * BLOCK_SIZE and space_used as 10 * BLOCK_SIZE.  If A is deleted, B
1214	   will report space_freed as 10 * BLOCK_SIZE as the deletion of B would
1215	   result in the deallocation of all 10 blocks.
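
   The accounting in this example can be checked with a short sketch.
   It is illustrative only: it models each file as a set of block
   identifiers, which is an assumption made here and not something
   the protocol defines, and the helper names space_used and
   space_freed are borrowed from the attribute names purely for
   readability.  All results are in units of BLOCK_SIZE.

```python
def space_used(name, files):
    """Blocks referenced by the file, whether shared or not."""
    return len(files[name])


def space_freed(name, files):
    """Blocks that would be released if the named file were deleted."""
    referenced_elsewhere = set()
    for other, blocks in files.items():
        if other != name:
            referenced_elsewhere |= blocks
    return len(files[name] - referenced_elsewhere)


# Files A and B hold 10 blocks each; blocks 4..9 are shared by both.
files = {"A": set(range(0, 10)), "B": set(range(4, 14))}

combined = len(files["A"] | files["B"])               # blocks really in use
reported = sum(space_used(n, files) for n in files)   # sum of space_used
```

   With these inputs, combined is 14 and reported is 20, and
   space_freed yields 4 for either file.  Removing A from the
   dictionary and asking again for B yields 10.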

1217	   The addition of this attribute does not solve the problem of space
1218	   being over-reported.  However, over-reporting is better than under-
1219	   reporting.

1221	7.  Application Data Hole Support

1223	   At the OS level, files are contained on disk blocks.  Applications
1224	   are also free to impose structure on the data contained in a file and
1225	   we can define an Application Data Block (ADB) to be such a structure.
1226	   From the application's viewpoint, it only wants to handle ADBs and
1227	   not raw bytes (see [Strohm11]).  An ADB is typically comprised of two
1228	   sections: a header and data.  The header describes the
1229	   characteristics of the block and can provide a means to detect
1230	   corruption in the data payload.  The data section is typically
1231	   initialized to all zeros.

1233	   The format of the header is application specific, but there are two
1234	   main components typically encountered:

1236	   1.  A logical block number which allows the application to determine
1237	       which data block is being referenced.  This is useful when the
1238	       client is not storing the blocks in contiguous memory.

1240	   2.  Fields to describe the state of the ADB and a means to detect
1241	       block corruption.  For both pieces of data, a useful property is
1242	       that allowed values be unique in that if passed across the
1243	       network, corruption due to translation between big and little
1244	       endian architectures is detectable.  For example, 0xF0DEDEF0 has
1245	       the same bit pattern in both architectures.
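
   The byte-order property mentioned in the second item can be tested
   mechanically.  The following is a minimal sketch, assuming 32-bit
   values; the function name is hypothetical.

```python
import struct


def is_byte_order_invariant(value):
    """True if a 32-bit value has the same byte layout in both byte orders."""
    return struct.pack(">I", value) == struct.pack("<I", value)


invariant = is_byte_order_invariant(0xF0DEDEF0)    # bytes F0 DE DE F0
ordinary = is_byte_order_invariant(0x12345678)     # bytes differ when swapped
```

   A value passes only when its byte sequence reads the same in both
   directions, so 0xF0DEDEF0 passes and 0x12345678 does not.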

1247	   Applications already impose structures on files [Strohm11] and detect
1248	   corruption in data blocks [Ashdown08].  What they are not able to do
1249	   is efficiently transfer and store ADBs.  To initialize a file with
1250	   ADBs, the client must send the full ADB to the server and that must
1251	   be stored on the server.

1253	   In this section, we are going to define an Application Data Hole
1254	   (ADH), which is a generic framework for transferring the ADB, present
1255	   one approach to detecting corruption in a given ADH implementation,
1256	   and describe the model for how the client and server can support
1257	   efficient initialization of ADHs, reading of ADH holes, punching ADH
1258	   holes in a file, and space reservation.  We define the ADHN to be the
1259	   Application Data Hole Number, which is the logical block number
1260	   discussed earlier.

1262	7.1.  Generic Framework

1264	   We want the representation of the ADH to be flexible enough to
1265	   support many different applications.  The most basic approach is no
1266	   imposition of a block at all, which means we are working with the raw
1267	   bytes.  Such an approach would be useful for storing holes, punching
1268	   holes, etc.  In more complex deployments, a server might be
1269	   supporting multiple applications, each with their own definition of
1270	   the ADH.  One might store the ADHN at the start of the block and then
1271	   have a guard pattern to detect corruption [McDougall07].  The next
1272	   might store the ADHN at an offset of 100 bytes within the block and
1273	   have no guard pattern at all, i.e., existing applications might
1274	   already have well defined formats for their data blocks.

1276	   The guard pattern can be used to represent the state of the block, to
1277	   protect against corruption, or both.  Again, it needs to be able to
1278	   be placed anywhere within the ADH.

1280	   We need to be able to represent the starting offset of the block and
1281	   the size of the block.  Note that nothing prevents the application
1282	   from defining different sized blocks in a file.

1284	7.1.1.  Data Hole Representation

1286	   struct app_data_hole4 {
1287	           offset4         adh_offset;
1288	           length4         adh_block_size;
1289	           length4         adh_block_count;
1290	           length4         adh_reloff_blocknum;
1291	           count4          adh_block_num;
1292	           length4         adh_reloff_pattern;
1293	           opaque          adh_pattern<>;
1294	   };

1296	   The app_data_hole4 structure captures the abstraction presented for
1297	   the ADH.  The additional fields present are to allow the transmission
1298	   of adh_block_count ADHs at one time.  We also use adh_block_num to
1299	   convey the ADHN of the first block in the sequence.  Each ADH will
1300	   contain the same adh_pattern string.

1302	   As both adh_block_num and adh_pattern are optional, if either
1303	   adh_reloff_pattern or adh_reloff_blocknum is set to NFS4_UINT64_MAX,
1304	   then the corresponding field is not set in any of the ADHs.
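
   To make the compression of the representation concrete, the
   following sketch expands one app_data_hole4 into the bytes it
   describes.  Several points are assumptions made for illustration
   and are not stated by the structure itself: the block number is
   written as an 8-byte big-endian integer, it increases by one for
   each successive block, and every byte not covered by the block
   number or the pattern is zero.  The class and function names are
   hypothetical.

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF


@dataclass
class AppDataHole:
    adh_offset: int
    adh_block_size: int
    adh_block_count: int
    adh_reloff_blocknum: int
    adh_block_num: int
    adh_reloff_pattern: int
    adh_pattern: bytes


def expand(adh, blocknum_width=8):
    """Return the bytes the described ADHs occupy, starting at adh_offset."""
    out = bytearray()
    for i in range(adh.adh_block_count):
        block = bytearray(adh.adh_block_size)  # zero-filled block
        if adh.adh_reloff_blocknum != NFS4_UINT64_MAX:
            start = adh.adh_reloff_blocknum
            # Assumption: consecutive blocks carry consecutive numbers.
            number = (adh.adh_block_num + i).to_bytes(blocknum_width, "big")
            block[start:start + blocknum_width] = number
        if adh.adh_reloff_pattern != NFS4_UINT64_MAX:
            start = adh.adh_reloff_pattern
            block[start:start + len(adh.adh_pattern)] = adh.adh_pattern
        out += block
    return bytes(out)


# 100 blocks of 4k, block number at offset 0, pattern at offset 8.
adh = AppDataHole(0, 4096, 100, 0, 0, 8, bytes.fromhex("feedface"))
data = expand(adh)
```

   Here a description of a few dozen bytes stands in for 409600 bytes
   of file content, which is the saving the ADH is meant to provide.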

1306	7.1.2.  Data Content

1308	   /*
1309	    * Use an enum such that we can extend new types.
1310	    */
1311	   enum data_content4 {
1312	           NFS4_CONTENT_DATA = 0,
1313	           NFS4_CONTENT_APP_DATA_HOLE = 1,
1314	           NFS4_CONTENT_HOLE = 2
1315	   };

1317	   New operations might need to differentiate between wanting to access
1318	   data versus an ADH.  Also, future minor versions might want to
1319	   introduce new data formats.  This enumeration allows that to occur.

1321	7.2.  An Example of Detecting Corruption

1323	   In this section, we define an ADH format in which corruption can be
1324	   detected.  Note that this is just one possible format and means to
1325	   detect corruption.

1327	   Consider a very basic implementation of an operating system's disk
1328	   blocks.  A block is either data or it is an indirect block which
1329	   allows for files to be larger than one block.  It is desired to be
1330	   able to initialize a block.  Lastly, to quickly unlink a file, a
1331	   block can be marked invalid.  The contents remain intact - which
1332	   would enable this OS application to undelete a file.

1334	   The application defines 4k sized data blocks, with an 8 byte block
1335	   counter occurring at offset 0 in the block, and with the guard
1336	   pattern occurring at offset 8 inside the block.  Furthermore, the
1337	   guard pattern can take one of four states:

1339	   0xfeedface -   This is the FREE state and indicates that the ADH
1340	      format has been applied.

1342	   0xcafedead -   This is the DATA state and indicates that real data
1343	      has been written to this block.

1345	   0xe4e5c001 -   This is the INDIRECT state and indicates that the
1346	      block contains block counter numbers that are chained off of this
1347	      block.

1349	   0xba1ed4a3 -   This is the INVALID state and indicates that the block
1350	      contains data whose contents are garbage.

1352	   Finally, it also defines an 8 byte checksum [Baira08] starting at
1353	   byte 16 which applies to the remaining contents of the block.  If the
1354	   state is FREE, then that checksum is trivially zero.  As such, the
1355	   application has no need to transfer the checksum implicitly inside
1356	   the ADH - it need not make the transfer layer aware of the fact that
1357	   there is a checksum (see [Ashdown08] for an example of checksums used
1358	   to detect corruption in application data blocks).

1360	   Corruption in each ADH can thus be detected:

1362	   o  If the guard pattern is anything other than one of the allowed
1363	      values, including all zeros.

1365	   o  If the guard pattern is FREE and any other byte in the remainder
1366	      of the ADH is anything other than zero.

1368	   o  If the guard pattern is anything other than FREE, then if the
1369	      stored checksum does not match the computed checksum.

1371	   o  If the guard pattern is INDIRECT and one of the stored indirect
1372	      block numbers has a value greater than the number of ADHs in the
1373	      file.

1375	   o  If the guard pattern is INDIRECT and one of the stored indirect
1376	      block numbers is a duplicate of another stored indirect block
1377	      number.

1379	   As can be seen, the application can detect errors based on the
1380	   combination of the guard pattern state and the checksum.  But also,
1381	   the application can detect corruption based on the state and the
1382	   contents of the ADH.  This last point is important in validating the
1383	   minimum amount of data we incorporated into our generic framework.
1384	   I.e., the guard pattern is sufficient in allowing applications to
1385	   design their own corruption detection.
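
   The first three checks above can be sketched as follows.  This is
   an illustration under stated assumptions, not a definitive
   implementation: the guard pattern is taken to be 4 bytes stored in
   big-endian order, and CRC-32 widened to 8 bytes stands in for the
   checksum, whose algorithm is not specified here.  The two INDIRECT
   checks are left out because the layout of the stored block numbers
   is not specified either.  All names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096
GUARD_OFF, CSUM_OFF, PAYLOAD_OFF = 8, 16, 24
FREE, DATA, INDIRECT, INVALID = 0xfeedface, 0xcafedead, 0xe4e5c001, 0xba1ed4a3
ALLOWED = {FREE, DATA, INDIRECT, INVALID}


def checksum(payload):
    # Assumption: CRC-32 stands in for the unspecified 8-byte checksum.
    return zlib.crc32(payload).to_bytes(8, "big")


def is_corrupt(block):
    """Apply the guard, FREE, and checksum checks to one 4k block."""
    guard = int.from_bytes(block[GUARD_OFF:GUARD_OFF + 4], "big")
    if guard not in ALLOWED:
        return True  # covers the all-zeros guard as well
    if guard == FREE:
        # Every byte after the guard must be zero in a FREE block.
        return any(block[GUARD_OFF + 4:])
    return bytes(block[CSUM_OFF:PAYLOAD_OFF]) != checksum(block[PAYLOAD_OFF:])


def make_block(guard, payload=b""):
    """Build a well-formed block for experimenting with is_corrupt."""
    block = bytearray(BLOCK_SIZE)
    block[GUARD_OFF:GUARD_OFF + 4] = guard.to_bytes(4, "big")
    block[PAYLOAD_OFF:PAYLOAD_OFF + len(payload)] = payload
    if guard != FREE:
        block[CSUM_OFF:PAYLOAD_OFF] = checksum(block[PAYLOAD_OFF:])
    return block
```

   A block built with make_block passes the checks, and changing any
   payload byte afterwards, without recomputing the checksum, causes
   is_corrupt to report corruption.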

1387	   Finally, it is important to note that none of these corruption checks
1388	   occur in the transport layer.  The server and client components are
1389	   totally unaware of the file format and might report everything as
1390	   being transferred correctly even in the case the application detects
1391	   corruption.

1393	7.3.  Example of READ_PLUS

1395	   The hypothetical application presented in Section 7.2 can be used to
1396	   illustrate how READ_PLUS would return an array of results.  A file is
1397	   created and initialized with 100 4k ADHs in the FREE state:

1399	      WRITE_PLUS {0, 4k, 100, 0, 0, 8, 0xfeedface}

1401	   Further, assume the application writes a single ADH at 16k, changing
1402	   the guard pattern to 0xcafedead, we would then have in memory:

1404	      0 -> (16k - 1)   : 4k, 4, 0, 0, 8, 0xfeedface
1405	      16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX
1406	      20k -> 400k      : 4k, 95, 0, 6, 0xfeedface

1408	   And when the client did a READ_PLUS of 64k at the start of the file,
1409	   it would get back a result of an ADH, some data, and a final ADH:

1411	      ADH {0, 4k, 4, 0, 0, 8, 0xfeedface}
1412	      data 4k
1413	      ADH {20k, 4k, 11, 0, 6, 0xfeedface}

1415	8.  Labeled NFS

1417	8.1.  Introduction

1419	   Access control models such as Unix permissions or Access Control
1420	   Lists are commonly referred to as Discretionary Access Control (DAC)
1421	   models.  These systems base their access decisions on user identity
1422	   and resource ownership.  In contrast, Mandatory Access Control (MAC)
1423	   models base their access control decisions on the label on the
1424	   subject (usually a process) and the object it wishes to access
1425	   [Haynes13].  These labels may contain user identity information but
1426	   usually contain additional information.  In DAC systems users are
1427	   free to specify the access rules for resources that they own.  MAC
1428	   models base their security decisions on a system wide policy
1429	   established by an administrator or organization which the users do
1430	   not have the ability to override.  In this section, we add a MAC
1431	   model to NFSv4.2.

1433	   The first change necessary is to devise a method for transporting and
1434	   storing security label data on NFSv4 file objects.  Security labels
1435	   have several semantics that are met by NFSv4 recommended attributes
1436	   such as the ability to set the label value upon object creation.
1437	   Access control on these attributes is done through a combination of
1438	   two mechanisms.  As with other recommended attributes on file objects
1439	   the usual DAC checks (ACLs and permission bits) will be performed to
1440	   ensure that proper file ownership is enforced.  In addition a MAC
1441	   system MAY be employed on the client, server, or both to enforce
1442	   additional policy on what subjects may modify security label
1443	   information.

1445	   The second change is to provide methods for the client to determine
1446	   if the security label has changed.  A client which needs to know if a
1447	   label is going to change SHOULD request a delegation on that file.
1448	   In order to change the security label, the server will have to recall
1449	   all delegations.  This will inform the client of the change.  If a
1450	   client wants to detect if the label has changed, it MAY use VERIFY
1451	   and NVERIFY on FATTR4_CHANGE_SEC_LABEL to detect that the
1452	   FATTR4_SEC_LABEL has been modified.

1454	   An additional useful change would be modification to the RPC layer
1455	   used in NFSv4 to allow RPC calls to carry security labels.  Such
1456	   modifications are outside the scope of this document.

1458	8.2.  Definitions

1460	   Label Format Specifier (LFS):  is an identifier used by the client to
1461	      establish the syntactic format of the security label and the
1462	      semantic meaning of its components.  These specifiers exist in a
1463	      registry associated with documents describing the format and
1464	      semantics of the label.

1466	   Label Format Registry:  is the IANA registry containing all
1467	      registered LFS along with references to the documents that
1468	      describe the syntactic format and semantics of the security label.

1470	   Policy Identifier (PI):  is an optional part of the definition of a
1471	      Label Format Specifier which allows for clients and servers to
1472	      identify specific security policies.

1474	   Object:  is a passive resource within the system that we wish to be
1475	      protected.  Objects can be entities such as files, directories,
1476	      pipes, sockets, and many other system resources relevant to the
1477	      protection of the system state.

1479	   Subject:  is an active entity, usually a process, which is requesting
1480	      access to an object.

1482	   MAC-Aware:  is a server which can transmit and store object labels.

1484	   MAC-Functional:  is a client or server which is Labeled NFS enabled.
1485	      Such a system can interpret labels and apply policies based on the
1486	      security system.

1488	   Multi-Level Security (MLS):  is a traditional model where objects are
1489	      given a sensitivity level (Unclassified, Secret, Top Secret, etc)
1490	      and a category set [MLS].

1492	8.3.  MAC Security Attribute

1494	   MAC models base access decisions on security attributes bound to
1495	   subjects and objects.  This information can range from a user
1496	   identity for an identity based MAC model, sensitivity levels for
1497	   Multi-level security, or a type for Type Enforcement.  These models
1498	   base their decisions on different criteria but the semantics of the
1499	   security attribute remain the same.  The semantics required by the
1500	   security attributes are listed below:

1502	   o  MUST provide flexibility with respect to the MAC model.

1504	   o  MUST provide the ability to atomically set security information
1505	      upon object creation.

1507	   o  MUST provide the ability to enforce access control decisions both
1508	      on the client and the server.

1510	   o  MUST NOT expose an object to either the client or server name
1511	      space before its security information has been bound to it.

1513	   NFSv4 implements the security attribute as a recommended attribute.
1514	   These attributes have a fixed format and semantics, which conflicts
1515	   with the flexible nature of the security attribute.  To resolve this
1516	   the security attribute consists of two components.  The first
1517	   component is a LFS as defined in [Quigley11] to allow for
1518	   interoperability between MAC mechanisms.  The second component is an
1519	   opaque field which is the actual security attribute data.  To allow
1520	   for various MAC models, NFSv4 should be used solely as a transport
1521	   mechanism for the security attribute.  It is the responsibility of
1522	   the endpoints to consume the security attribute and make access
1523	   decisions based on their respective models.  In addition, creation of
1524	   objects through OPEN and CREATE allows for the security attribute to
1525	   be specified upon creation.  By providing an atomic create and set
1526	   operation for the security attribute it is possible to enforce the
1527	   second and fourth requirements.  The recommended attribute
1528	   FATTR4_SEC_LABEL (see Section 12.2.2) will be used to satisfy this
1529	   requirement.

1531	8.3.1.  Delegations

1533	   In the event that a security attribute is changed on the server while
1534	   a client holds a delegation on the file, both the server and the
1535	   client MUST follow the NFSv4.1 protocol (see Chapter 10 of [RFC5661])
1536	   with respect to attribute changes.  The client SHOULD flush all changes
1537	   back to the server and relinquish the delegation.

1539	8.3.2.  Permission Checking

1541	   It is not feasible to enumerate all possible MAC models and even
1542	   levels of protection within a subset of these models.  This means
1543	   that the NFSv4 client and servers cannot be expected to directly make
1544	   access control decisions based on the security attribute.  Instead
1545	   NFSv4 should defer permission checking on this attribute to the host
1546	   system.  These checks are performed in addition to existing DAC and
1547	   ACL checks outlined in the NFSv4 protocol.  Section 8.6 gives a
1548	   specific example of how the security attribute is handled under a
1549	   particular MAC model.

1551	8.3.3.  Object Creation

1553	   When creating files in NFSv4 the OPEN and CREATE operations are used.
1554	   One of the parameters to these operations is an fattr4 structure
1555	   containing the attributes the file is to be created with.  This
1556	   allows NFSv4 to atomically set the security attribute of files upon
1557	   creation.  When a client is MAC-Functional it must always provide the
1558	   initial security attribute upon file creation.  In the event that the
1559	   server is MAC-Functional as well, it should determine by policy
1560	   whether it will accept the attribute from the client or instead make
1561	   the determination itself.  If the client is not MAC-Functional, then
1562	   the MAC-Functional server must decide on a default label.  A more in
1563	   depth explanation can be found in Section 8.6.

1565	8.3.4.  Existing Objects

1567	   Note that under the MAC model, all objects must have labels.
1568	   Therefore, if an existing server is upgraded to include Labeled NFS
1569	   support, then it is the responsibility of the security system to
1570	   define the behavior for existing objects.

1572	8.3.5.  Label Changes

1574	   If there are open delegations on the file belonging to a client other
1575	   than the one making the label change, then the process described in
1576	   Section 8.3.1 must be followed.  In short, the delegation will be
1577	   recalled, which effectively notifies the client of the change.

1579	   Consider a system in which the clients enforce MAC checks and the
1580	   server has a very simple security system which just stores the
1581	   labels.  In this system, the MAC label check always allows access,
1582	   regardless of the subject label.

1584	   The way in which MAC labels are enforced is by the client.  The
1585	   security policies on the client can be such that the client does not
1586	   have access to the file unless it has a delegation.  The recall of
1587	   the delegation will force the client to flush any cached content of
1588	   the file.  The clients could also be configured to periodically
1589	   VERIFY/NVERIFY the FATTR4_CHANGE_SEC_LABEL attribute to determine
1590	   when the label has changed.  When a change is detected, then the
1591	   client could take the costlier action of retrieving the
1592	   FATTR4_SEC_LABEL.

1594	8.4.  pNFS Considerations

1596	   The new FATTR4_SEC_LABEL attribute is metadata information and as
1597	   such the DS is not aware of the value contained on the MDS.
1598	   Fortunately, the NFSv4.1 protocol [RFC5661] already has provisions
1599	   for doing access level checks from the DS to the MDS.  In order for
1600	   the DS to validate the subject label presented by the client, it
1601	   SHOULD utilize this mechanism.

1603	8.5.  Discovery of Server Labeled NFS Support

1605	   The server can easily determine that a client supports Labeled NFS
1606	   when it queries for the FATTR4_SEC_LABEL label for an object.  The
1607	   client might need to discover which LFS the server supports.

1609	   The following compound MUST NOT be denied by any MAC label check:

1611	        PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}

1613	   Note that the server might have imposed a security flavor on the root
1614	   that precludes such access.  I.e., if the server requires kerberized
1615	   access and the client presents a compound with AUTH_SYS, then the
1616	   server is allowed to return NFS4ERR_WRONGSEC in this case.  But if
1617	   the client presents a correct security flavor, then the server MUST
1618	   return the FATTR4_SEC_LABEL attribute with the supported LFS filled
1619	   in.

1621	8.6.  MAC Security NFS Modes of Operation

1623	   A system using Labeled NFS may operate in two modes.  The first mode
1624	   provides the most protection and is called "full mode".  In this mode
1625	   both the client and server implement a MAC model allowing each end to
1626	   make an access control decision.  The remaining mode is called the
1627	   "guest mode" and in this mode one end of the connection is not
1628	   implementing a MAC model and thus offers less protection than full
1629	   mode.

1631	8.6.1.  Full Mode

1633	   Full mode environments consist of MAC-Functional NFSv4 servers and
1634	   clients and may be composed of mixed MAC models and policies.  The
1635	   system requires that both the client and server have an opportunity
1636	   to perform an access control check based on all relevant information
1637	   within the network.  The file object security attribute is provided
1638	   using the mechanism described in Section 8.3.

1640	   Fully MAC-Functional NFSv4 servers are not possible in the absence of
1641	   RPC layer modifications to support subject label transport.  However,
1642	   servers may make decisions based on the RPC credential information
1643	   available and future specifications may provide subject label
1644	   transport.

1646	8.6.1.1.  Initial Labeling and Translation

1648	   The ability to create a file is an action that a MAC model may wish
1649	   to mediate.  The client is given the responsibility to determine the
1650	   initial security attribute to be placed on a file.  This allows the
1651	   client to make a decision as to the acceptable security attributes to
1652	   create a file with before sending the request to the server.  Once
1653	   the server receives the creation request from the client it may
1654	   choose to evaluate if the security attribute is acceptable.

1656	   Security attributes on the client and server may vary based on MAC
1657	   model and policy.  To handle this the security attribute field has an
1658	   LFS component.  This component is a mechanism for the host to
1659	   identify the format and meaning of the opaque portion of the security
1660	   attribute.  A full mode environment may contain hosts operating in
1661	   several different LFSs.  In this case a mechanism for translating the
1662	   opaque portion of the security attribute is needed.  The actual
1663	   translation function will vary based on MAC model and policy and is
1664	   out of the scope of this document.  If a translation is unavailable
1665	   for a given LFS then the request MUST be denied.  Another recourse is
1666	   to allow the host to provide a fallback mapping for unknown security
1667	   attributes.
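
   The handling of labels that arrive in a foreign LFS can be
   sketched as below.  This is an illustration only: the translation
   functions themselves are out of scope, so the translators table,
   the fallback argument, the LabelDenied exception, and the LFS
   numbers used with them are all hypothetical.

```python
class LabelDenied(Exception):
    """Raised when a label cannot be interpreted and no fallback exists."""


def translate_label(lfs, data, local_lfs, translators, fallback=None):
    """Return label data usable under local_lfs, or deny the request."""
    if lfs == local_lfs:
        return data  # already in the local format
    translator = translators.get((lfs, local_lfs))
    if translator is not None:
        return translator(data)
    if fallback is not None:
        return fallback  # host-provided mapping for unknown attributes
    raise LabelDenied("no translation from LFS %d to LFS %d" % (lfs, local_lfs))
```

   A host that configures no fallback denies every request whose LFS
   it cannot translate; a host that configures one accepts the
   request but applies the fallback label instead.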

1669	8.6.1.2.  Policy Enforcement

1671	   In full mode access control decisions are made by both the clients
1672	   and servers.  When a client makes a request it takes the security
1673	   attribute from the requesting process and makes an access control
1674	   decision based on that attribute and the security attribute of the
1675	   object it is trying to access.  If the client denies that access an
1676	   RPC call to the server is never made.  If however the access is
1677	   allowed the client will make a call to the NFS server.

1679	   When the server receives the request from the client it uses any
1680	   credential information conveyed in the RPC request and the attributes
1681	   of the object the client is trying to access to make an access
1682	   control decision.  If the server's policy allows this access it will
1683	   fulfill the client's request, otherwise it will return
1684	   NFS4ERR_ACCESS.
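
   The two-stage decision described above can be sketched as follows.
   The policy callables, the rpc_log list that stands in for the
   network, and the function names are assumptions made for
   illustration; only the status names come from the protocol.

```python
NFS4_OK = 0
NFS4ERR_ACCESS = 13


def server_check(credential, object_label, server_policy):
    """Server side: decide from the RPC credential and the object's label."""
    if server_policy(credential, object_label):
        return NFS4_OK
    return NFS4ERR_ACCESS


def client_request(subject_label, credential, object_label,
                   client_policy, server_policy, rpc_log):
    """Client side: check locally first; only allowed requests are sent."""
    if not client_policy(subject_label, object_label):
        return None  # denied on the client, so no RPC call is made
    rpc_log.append((credential, object_label))  # stands in for the RPC call
    return server_check(credential, object_label, server_policy)
```

   Note that the server is handed the credential and not the subject
   label, reflecting that the RPC request does not carry the subject
   label.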

1686	   Future protocol extensions may also allow the server to factor into
1687	   the decision a security label extracted from the RPC request.

1689	   Implementations MAY validate security attributes supplied over the
1690	   network to ensure that they are within a set of attributes permitted
1691	   from a specific peer, and if not, reject them.  Note that a system
1692	   may permit a different set of attributes to be accepted from each
1693	   peer.

1695	8.6.1.3.  Limited Server

1697	   A Limited Server mode (see Section 3.5.2 of [Haynes13]) consists of a
1698	   server which is label aware, but does not enforce policies.  Such a
1699	   server will store and retrieve all object labels presented by
1700	   clients, utilize the methods described in Section 8.3.5 to allow the
1701	   clients to detect changing labels, but may not factor the label into
1702	   access decisions.  Instead, it will expect the clients to enforce all
1703	   such access locally.

1705	8.6.2.  Guest Mode

1707	   Guest mode implies that either the client or the server does not
1708	   handle labels.  If the client is not Labeled NFS aware, then it will
1709	   not offer subject labels to the server.  The server is the only
1710	   entity enforcing policy, and may selectively provide standard NFS
1711	   services to clients based on their authentication credentials and/or
1712	   associated network attributes (e.g., IP address, network interface).
1713	   The level of trust and access extended to a client in this mode is
1714	   configuration-specific.  If the server is not Labeled NFS aware, then
1715	   it will not return object labels to the client.  Clients in this
1716	   environment may consist of groups implementing different MAC
1717	   model policies.  The system requires that all clients in the
1718	   environment be responsible for access control checks.

1720	8.7.  Security Considerations

1722	   This entire chapter deals with security issues.

1724	   Depending on the level of protection the MAC system offers there may
1725	   be a requirement to tightly bind the security attribute to the data.

1727	   When only one of the client or server enforces labels, it is
1728	   important to realize that the other side is not enforcing MAC
1729	   protections.  Alternate methods might be in use to handle the lack of
1730	   MAC support and care should be taken to identify and mitigate threats
1731	   from possible tampering outside of these methods.

1733	   An example of this is that a server that modifies READDIR or LOOKUP
1734	   results based on the client's subject label might want to always
1735	   construct the same subject label for a client which does not present
1736	   one.  This will prevent a non-Labeled NFS client from mixing entries
1737	   in the directory cache.

1739	9.  Sharing change attribute implementation details with NFSv4 clients

1741	9.1.  Introduction

1743	   Although both the NFSv4 [I-D.ietf-nfsv4-rfc3530bis] and NFSv4.1
1744	   [RFC5661] protocols define the change attribute as being mandatory to
1745	   implement, there is little in the way of guidance.  The only mandated
1746	   feature is that the value must change whenever the file data or
1747	   metadata change.

1749	   While this allows for a wide range of implementations, it also leaves
1750	   the client with a conundrum: how does it determine which is the most
1751	   recent value for the change attribute in a case where several RPC
1752	   calls have been issued in parallel?  In other words if two COMPOUNDs,
1753	   both containing WRITE and GETATTR requests for the same file, have
1754	   been issued in parallel, how does the client determine which of the
1755	   two change attribute values returned in the replies to the GETATTR
1756	   requests corresponds to the most recent state of the file?  In some
1757	   cases, the only recourse may be to send another COMPOUND containing a
1758	   third GETATTR that is fully serialized with the first two.

1760	   NFSv4.2 avoids this kind of inefficiency by allowing the server to
1761	   share details about how the change attribute is expected to evolve,
1762	   so that the client may immediately determine which, out of the
1763	   several change attribute values returned by the server, is the most
1764	   recent.  change_attr_type is defined as a new recommended attribute
1765	   (see Section 12.2.1), and is per file system.

1767	10.  Security Considerations

1769	   NFSv4.2 has all of the security concerns present in NFSv4.1 (see
1770	   Section 21 of [RFC5661]) and those present in the Server-side Copy
1771	   (see Section 3.4) and in Labeled NFS (see Section 8.7).

1773	11.  Error Values

1775	   NFS error numbers are assigned to failed operations within a Compound
1776	   (COMPOUND or CB_COMPOUND) request.  A Compound request contains a
1777	   number of NFS operations that have their results encoded in sequence
1778	   in a Compound reply.  The results of successful operations will
1779	   consist of an NFS4_OK status followed by the encoded results of the
1780	   operation.  If an NFS operation fails, an error status will be
1781	   entered in the reply and the Compound request will be terminated.
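
   As an illustration only, a client could sanity-check the shape of a
   Compound reply against the rule above: every result before the last
   carries NFS4_OK, and a reply shorter than the request ends in an
   error.  The following C sketch assumes the per-operation status codes
   have already been decoded into an array; the function name
   compound_reply_consistent() is hypothetical and is not defined by
   this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NFS4_OK 0u

/*
 * status: decoded per-operation status codes from the reply
 * nres:   number of results present in the reply
 * nops:   number of operations sent in the request
 *
 * Assumption: a reply with zero results to a non-empty request is
 * treated as inconsistent here.  A real client would also have to
 * handle requests rejected before any operation is evaluated.
 */
bool compound_reply_consistent(const uint32_t *status, size_t nres,
                               size_t nops)
{
    if (nres > nops)
        return false;

    /* Every result before the last one must be a success. */
    for (size_t i = 0; i + 1 < nres; i++) {
        if (status[i] != NFS4_OK)
            return false;
    }

    /* A truncated reply must end with the failing operation. */
    if (nres < nops)
        return nres > 0 && status[nres - 1] != NFS4_OK;

    return true;
}
```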

1783	11.1.  Error Definitions

1785	                        Protocol Error Definitions

1787	         +--------------------------+--------+------------------+
1788	         | Error                    | Number | Description      |
1789	         +--------------------------+--------+------------------+
1790	         | NFS4ERR_BADLABEL         | 10093  | Section 11.1.3.1 |
1791	         | NFS4ERR_METADATA_NOTSUPP | 10090  | Section 11.1.2.1 |
1792	         | NFS4ERR_OFFLOAD_DENIED   | 10091  | Section 11.1.2.2 |
1793	         | NFS4ERR_PARTNER_NO_AUTH  | 10089  | Section 11.1.2.3 |
1794	         | NFS4ERR_PARTNER_NOTSUPP  | 10088  | Section 11.1.2.4 |
1795	         | NFS4ERR_UNION_NOTSUPP    | 10094  | Section 11.1.1.1 |
1796	         | NFS4ERR_WRONG_LFS        | 10092  | Section 11.1.3.2 |
1797	         +--------------------------+--------+------------------+

1799	                                  Table 1
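
   The assignments in Table 1 could be mirrored in an implementation as
   C constants.  The following is a minimal sketch: the numeric values
   are copied from Table 1, while the enum tag nfsstat4_v42 and the
   helper nfs4err_v42_name() are hypothetical names used only for
   illustration and are not defined by this document or by [4.2xdr].

```c
#include <stddef.h>

/* Error numbers copied from Table 1, in numeric order. */
enum nfsstat4_v42 {
    NFS4ERR_PARTNER_NOTSUPP  = 10088,
    NFS4ERR_PARTNER_NO_AUTH  = 10089,
    NFS4ERR_METADATA_NOTSUPP = 10090,
    NFS4ERR_OFFLOAD_DENIED   = 10091,
    NFS4ERR_WRONG_LFS        = 10092,
    NFS4ERR_BADLABEL         = 10093,
    NFS4ERR_UNION_NOTSUPP    = 10094
};

/*
 * Hypothetical helper: return the symbolic name of one of the error
 * numbers listed in Table 1, or NULL for any other status value.
 */
const char *nfs4err_v42_name(int status)
{
    switch (status) {
    case NFS4ERR_PARTNER_NOTSUPP:  return "NFS4ERR_PARTNER_NOTSUPP";
    case NFS4ERR_PARTNER_NO_AUTH:  return "NFS4ERR_PARTNER_NO_AUTH";
    case NFS4ERR_METADATA_NOTSUPP: return "NFS4ERR_METADATA_NOTSUPP";
    case NFS4ERR_OFFLOAD_DENIED:   return "NFS4ERR_OFFLOAD_DENIED";
    case NFS4ERR_WRONG_LFS:        return "NFS4ERR_WRONG_LFS";
    case NFS4ERR_BADLABEL:         return "NFS4ERR_BADLABEL";
    case NFS4ERR_UNION_NOTSUPP:    return "NFS4ERR_UNION_NOTSUPP";
    default:                       return NULL;
    }
}
```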

1801	11.1.1.  General Errors

1803	   This section deals with errors that are applicable to a broad set of
1804	   different purposes.

1806	11.1.1.1.  NFS4ERR_UNION_NOTSUPP (Error Code 10094)

1808	   One of the arguments to the operation is a discriminated union and
1809	   while the server supports the given operation, it does not support
1810	   the selected arm of the discriminated union.  For an example, see
1811	   READ_PLUS (Section 14.10).

1813	11.1.2.  Server to Server Copy Errors

1815	   These errors deal with the interaction between servers during
1816	   server-to-server copies.

1818	11.1.2.1.  NFS4ERR_METADATA_NOTSUPP (Error Code 10090)

1820	   The destination file cannot support the same metadata as the source
1821	   file.

1823	11.1.2.2.  NFS4ERR_OFFLOAD_DENIED (Error Code 10091)

1825	   The copy offload operation is supported by both the source and the
1826	   destination, but the destination is not allowing it for this file.
1827	   If the client sees this error, it should fall back to the normal copy
1828	   semantics.

1830	11.1.2.3.  NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)

1832	   The source server does not authorize a server-to-server copy offload
1833	   operation.  This may be due to the client's failure to send the
1834	   COPY_NOTIFY operation to the source server, the source server
1835	   receiving a server-to-server copy offload request after the copy
1836	   lease time expired, or for some other permission problem.

1838	11.1.2.4.  NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)

1840	   The remote server does not support the server-to-server copy offload
1841	   protocol.

1843	11.1.3.  Labeled NFS Errors

1845	   These errors are used in Labeled NFS.

1847	11.1.3.1.  NFS4ERR_BADLABEL (Error Code 10093)

1849	   The label specified is invalid in some manner.

1851	11.1.3.2.  NFS4ERR_WRONG_LFS (Error Code 10092)

1853	   The LFS specified in the subject label is not compatible with the LFS
1854	   in the object label.

1856	11.2.  New Operations and Their Valid Errors

1858	   This section contains a table that gives the valid error returns for
1859	   each new NFSv4.2 protocol operation.  The error code NFS4_OK
1860	   (indicating no error) is not listed but should be understood to be
1861	   returnable by all new operations.  The error values for all other
1862	   operations are defined in Section 15.2 of [RFC5661].

1864	            Valid Error Returns for Each New Protocol Operation

1866	   +----------------+--------------------------------------------------+
1867	   | Operation      | Errors                                           |
1868	   +----------------+--------------------------------------------------+
1869	   | COPY           | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
1870	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
1871	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1872	   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,            |
1873	   |                | NFS4ERR_EXPIRED, NFS4ERR_FBIG,                   |
1874	   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
1875	   |                | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED,       |
1876	   |                | NFS4ERR_METADATA_NOTSUPP, NFS4ERR_MOVED,         |
1877	   |                | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,             |
1878	   |                | NFS4ERR_OFFLOAD_DENIED, NFS4ERR_OLD_STATEID,     |
1879	   |                | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION,     |
1880	   |                | NFS4ERR_PARTNER_NO_AUTH,                         |
1881	   |                | NFS4ERR_PARTNER_NOTSUPP, NFS4ERR_PNFS_IO_HOLE,   |
1882	   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
1883	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1884	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1885	   |                | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT,               |
1886	   |                | NFS4ERR_STALE, NFS4ERR_SYMLINK,                  |
1887	   |                | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE         |
1888	   | COPY_NOTIFY    | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
1889	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
1890	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1891	   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED,          |
1892	   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
1893	   |                | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED,       |
1894	   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
1895	   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,           |
1896	   |                | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, |
1897	   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
1898	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1899	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1900	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,              |
1901	   |                | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,           |
1902	   |                | NFS4ERR_WRONG_TYPE                               |
1903	   | OFFLOAD_ABORT  | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,           |
1904	   |                | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY,   |
1905	   |                | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED,            |
1906	   |                | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP,   |
1907	   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION,  |
1908	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS        |
1909	   | OFFLOAD_REVOKE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,           |
1910	   |                | NFS4ERR_COMPLETE_ALREADY, NFS4ERR_DELAY,         |
1911	   |                | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_MOVED,     |
1912	   |                | NFS4ERR_NOTSUPP, NFS4ERR_OP_NOT_IN_SESSION,      |
1913	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS        |
1914	   | OFFLOAD_STATUS | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,           |
1915	   |                | NFS4ERR_BAD_STATEID, NFS4ERR_COMPLETE_ALREADY,   |
1916	   |                | NFS4ERR_DEADSESSION, NFS4ERR_EXPIRED,            |
1917	   |                | NFS4ERR_DELAY, NFS4ERR_GRACE, NFS4ERR_NOTSUPP,   |
1918	   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION,  |
1919	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS        |
1920	   | READ_PLUS      | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
1921	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
1922	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1923	   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED,          |
1924	   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
1925	   |                | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED,       |
1926	   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
1927	   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,           |
1928	   |                | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, |
1929	   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
1930	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1931	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1932	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,              |
1933	   |                | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,           |
1934	   |                | NFS4ERR_UNION_NOTSUPP, NFS4ERR_WRONG_TYPE        |
1935	   | SEEK           | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
1936	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
1937	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1938	   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED,          |
1939	   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
1940	   |                | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_LOCKED,       |
1941	   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
1942	   |                | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,           |
1943	   |                | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PNFS_IO_HOLE, |
1944	   |                | NFS4ERR_PNFS_NO_LAYOUT, NFS4ERR_REP_TOO_BIG,     |
1945	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1946	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1947	   |                | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,              |
1948	   |                | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,           |
1949	   |                | NFS4ERR_UNION_NOTSUPP, NFS4ERR_WRONG_TYPE        |
1950	   | SEQUENCE       | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT,             |
1951	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,           |
1952	   |                | NFS4ERR_CONN_NOT_BOUND_TO_SESSION,               |
1953	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1954	   |                | NFS4ERR_REP_TOO_BIG,                             |
1955	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1956	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1957	   |                | NFS4ERR_SEQUENCE_POS, NFS4ERR_SEQ_FALSE_RETRY,   |
1958	   |                | NFS4ERR_SEQ_MISORDERED, NFS4ERR_TOO_MANY_OPS     |
1959	   | WRITE_PLUS     | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,           |
1960	   |                | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,             |
1961	   |                | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,              |
1962	   |                | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,            |
1963	   |                | NFS4ERR_EXPIRED, NFS4ERR_FBIG,                   |
1964	   |                | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, NFS4ERR_INVAL, |
1965	   |                | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_LOCKED,       |
1966	   |                | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,             |
1967	   |                | NFS4ERR_NOSPC, NFS4ERR_OLD_STATEID,              |
1968	   |                | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION,     |
1969	   |                | NFS4ERR_PNFS_IO_HOLE, NFS4ERR_PNFS_NO_LAYOUT,    |
1970	   |                | NFS4ERR_REP_TOO_BIG,                             |
1971	   |                | NFS4ERR_REP_TOO_BIG_TO_CACHE,                    |
1972	   |                | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
1973	   |                | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT,               |
1974	   |                | NFS4ERR_STALE, NFS4ERR_SYMLINK,                  |
1975	   |                | NFS4ERR_TOO_MANY_OPS, NFS4ERR_UNION_NOTSUPP,     |
1976	   |                | NFS4ERR_WRONG_TYPE                               |
1977	   +----------------+--------------------------------------------------+

1979	                                  Table 2

1981	11.3.  New Callback Operations and Their Valid Errors

1983	   This section contains a table that gives the valid error returns for
1984	   each new NFSv4.2 callback operation.  The error code NFS4_OK
1985	   (indicating no error) is not listed but should be understood to be
1986	   returnable by all new callback operations.  The error values for all
1987	   other callback operations are defined in Section 15.3 of [RFC5661].

1989	       Valid Error Returns for Each New Protocol Callback Operation

1991	   +------------+------------------------------------------------------+
1992	   | Callback   | Errors                                               |
1993	   | Operation  |                                                      |
1994	   +------------+------------------------------------------------------+
1995	   | CB_OFFLOAD | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,                   |
1996	   |            | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,                  |
1997	   |            | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_REP_TOO_BIG,      |
1998	   |            | NFS4ERR_REP_TOO_BIG_TO_CACHE, NFS4ERR_REQ_TOO_BIG,   |
1999	   |            | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SERVERFAULT,     |
2000	   |            | NFS4ERR_TOO_MANY_OPS                                 |
2001	   +------------+------------------------------------------------------+

2003	                                  Table 3

2005	12.  New File Attributes

2007	12.1.  New RECOMMENDED Attributes - List and Definition References

2009	   The list of new RECOMMENDED attributes appears in Table 4.  The
2010	   meanings of the columns of the table are:

2012	   Name:  The name of the attribute.

2014	   Id:  The number assigned to the attribute.  In the event of conflicts
2015	      between the assigned number and [4.2xdr], the latter is likely
2016	      authoritative, but should be resolved with Errata to this document
2017	      and/or [4.2xdr].  See [IESG08] for the Errata process.

2019	   Data Type:  The XDR data type of the attribute.

2021	   Acc:  Access allowed to the attribute.

2023	      R  means read-only (GETATTR may retrieve, SETATTR may not set).

2025	      W  means write-only (SETATTR may set, GETATTR may not retrieve).

2027	      R W   means read/write (GETATTR may retrieve, SETATTR may set).

2029	   Defined in:  The section of this specification that describes the
2030	      attribute.

2032	   +------------------+----+-------------------+-----+----------------+
2033	   | Name             | Id | Data Type         | Acc | Defined in     |
2034	   +------------------+----+-------------------+-----+----------------+
2035	   | change_attr_type | 79 | change_attr_type4 | R   | Section 12.2.1 |
2036	   | sec_label        | 80 | sec_label4        | R W | Section 12.2.2 |
2037	   | change_sec_label | 81 | change_sec_label4 | R   | Section 12.2.3 |
2038	   | space_reserved   | 77 | boolean           | R W | Section 12.2.4 |
2039	   | space_freed      | 78 | length4           | R   | Section 12.2.5 |
2040	   +------------------+----+-------------------+-----+----------------+

2042	                                  Table 4

2044	12.2.  Attribute Definitions

2046	12.2.1.  Attribute 79: change_attr_type

2048	   enum change_attr_type4 {
2049	              NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
2050	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
2051	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
2052	              NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
2053	              NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
2054	   };

2056	   change_attr_type is a per file system attribute which enables the
2057	   NFSv4.2 server to provide additional information about how it expects
2058	   the change attribute value to evolve after the file data or metadata
2059	   has changed.  While Section 5.4 of [RFC5661] discusses per file
2060	   system attributes, it is expected that the value of change_attr_type
2061	   not depend on the value of "homogeneous" and only changes in the
2062	   event of a migration.

2064	   NFS4_CHANGE_TYPE_IS_UNDEFINED:  The change attribute does not take
2065	      values that fit into any of these categories.

2067	   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value MUST
2068	      monotonically increase for every atomic change to the file
2069	      attributes, data, or directory contents.

2071	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:  The change attribute value MUST
2072	      be incremented by one unit for every atomic change to the file
2073	      attributes, data, or directory contents.  This property is
2074	      preserved when writing to pNFS data servers.

2076	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:  The change attribute
2077	      value MUST be incremented by one unit for every atomic change to
2078	      the file attributes, data, or directory contents.  In the case
2079	      where the client is writing to pNFS data servers, the number of
2080	      increments is not guaranteed to exactly match the number of
2081	      writes.

2083	   NFS4_CHANGE_TYPE_IS_TIME_METADATA:  The change attribute is
2084	      implemented as suggested in the NFSv4 spec
2085	      [I-D.ietf-nfsv4-rfc3530bis] in terms of the time_metadata
2086	      attribute.

2088	   If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
2089	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
2090	   NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at
2091	   the very least that the change attribute is monotonically increasing,
2092	   which is sufficient to resolve the question of which value is the
2093	   most recent.
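
   The rule in the preceding paragraph can be sketched in C as follows.
   This is an illustration under stated assumptions, not a definitive
   client implementation: the function name most_recent_change() is
   hypothetical, change attribute values are taken to be unsigned
   64-bit integers, and wraparound of the value is ignored.  Only the
   three types named in the paragraph are treated as ordered.

```c
#include <stdbool.h>
#include <stdint.h>

enum change_attr_type4 {
    NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
    NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
    NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
    NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
    NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
};

/*
 * Given two change attribute values returned by parallel GETATTRs,
 * store the more recent one in *latest and return true when the
 * change_attr_type lets the client order them.  Return false when the
 * client cannot tell and needs a further, serialized GETATTR.
 */
bool most_recent_change(enum change_attr_type4 type,
                        uint64_t a, uint64_t b, uint64_t *latest)
{
    switch (type) {
    case NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:
    case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:
    case NFS4_CHANGE_TYPE_IS_TIME_METADATA:
        /* Monotonically increasing: the larger value is newer. */
        *latest = (a > b) ? a : b;
        return true;
    default:
        /* Types not named in the paragraph above are left undecided. */
        return false;
    }
}
```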

2095	   If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then
2096	   by inspecting the value of the 'time_delta' attribute it additionally
2097	   has the option of detecting rogue server implementations that use
2098	   time_metadata in violation of the spec.

2100	   If the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it has the
2101	   ability to predict what the resulting change attribute value should
2102	   be after a COMPOUND containing a SETATTR, WRITE, or CREATE.  This
2103	   again allows it to detect changes made in parallel by another client.
2104	   The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits the
2105	   same, but only if the client is not doing pNFS WRITEs.
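
   For a server reporting NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, the
   prediction described above might look like the following C sketch.
   It assumes that the client knows how many atomic changes its own
   COMPOUND caused and that the counter does not wrap; the function
   name change_matches_prediction() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * before:      change attribute value seen before the COMPOUND
 * after:       change attribute value returned after the COMPOUND
 * own_changes: number of atomic changes made by this client's COMPOUND
 *
 * Returns true when the new value is exactly the predicted one.  A
 * false result suggests that another client changed the file in
 * parallel.
 */
bool change_matches_prediction(uint64_t before, uint64_t after,
                               uint64_t own_changes)
{
    return after == before + own_changes;
}
```

   A client receiving a false result would typically treat its cached
   data and attributes for the file as suspect and revalidate them.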

2107	   Finally, if the server does not support change_attr_type or if
2108	   NFS4_CHANGE_TYPE_IS_UNDEFINED is set, then the server SHOULD make an
2109	   effort to implement the change attribute in terms of the
2110	   time_metadata attribute.

2112	12.2.2.  Attribute 80: sec_label

2114	   typedef uint32_t  policy4;

2116	   struct labelformat_spec4 {
2117	           policy4 lfs_lfs;
2118	           policy4 lfs_pi;
2119	   };

2121	   struct sec_label4 {
2122	           labelformat_spec4       slai_lfs;
2123	           opaque                  slai_data<>;
2124	   };

2126	   The FATTR4_SEC_LABEL contains an array of two components with the
2127	   first component being an LFS.  It serves to provide the receiving end
2128	   with the information necessary to translate the security attribute
2129	   into a form that is usable by the endpoint.  Label Formats assigned
2130	   an LFS may optionally choose to include a Policy Identifier field to
2131	   allow for complex policy deployments.  The LFS and Label Format
2132	   Registry are described in detail in [Quigley11].  The translation
2133	   used to interpret the security attribute is not specified as part of
2134	   the protocol as it may depend on various factors.  The second
2135	   component is an opaque section which contains the data of the
2136	   attribute.  This component is dependent on the MAC model to interpret
2137	   and enforce.

2139	   In particular, it is the responsibility of the LFS specification to
2140	   define a maximum size for the opaque section, slai_data<>.  When
2141	   creating or modifying a label for an object, the client needs to be
2142	   guaranteed that the server will accept a label that is sized
2143	   correctly.  Because both client and server are part of a specific MAC
2144	   model, the client will be aware of the size.

2146	   If a server supports sec_label, then it MUST also support
2147	   change_sec_label.  Any modification to sec_label MUST modify the
2148	   value for change_sec_label.

2150	12.2.3.  Attribute 81: change_sec_label

2152	   The change_sec_label attribute is a read-only attribute per file.  If
2153	   the value of sec_label for a file is not the same at two disparate
2154	   times then the values of change_sec_label at those times MUST be
2155	   different as well.  The value of change_sec_label MAY change at other
2156	   times as well, but this should be rare, as that will require the
2157	   client to abort any operation in progress, re-read the label, and
2158	   retry the operation.  As the sec_label is not bounded by size, this
2159	   attribute allows for VERIFY and NVERIFY to quickly determine if the
2160	   sec_label has been modified.

2162	12.2.4.  Attribute 77: space_reserved

2164	   The space_reserved attribute is a read/write attribute of type
2165	   boolean.  It is a per file attribute and applies during the lifetime
2166	   of the file or until it is turned off.  When the space_reserved
2167	   attribute is set via SETATTR, the server must ensure that there is
2168	   disk space to accommodate every byte in the file before it can return
2169	   success.  If the server cannot guarantee this, it must return
2170	   NFS4ERR_NOSPC.

2172	   If the client tries to grow a file which has the space_reserved
2173	   attribute set, the server must guarantee that there is disk space to
2174	   accommodate every byte in the file with the new size before it can
2175	   return success.  If the server cannot guarantee this, it must return
2176	   NFS4ERR_NOSPC.

2178	   It is not required that the server allocate the space to the file
2179	   before returning success.  The allocation can be deferred; however,
2180	   it must be guaranteed that it will not fail for lack of space.

2182	   The value of space_reserved can be obtained at any time through
2183	   GETATTR.  If the size is retrieved at the same time, the client can
2184	   determine the size of the reservation.

2186	   In order to avoid ambiguity, the space_reserved bit cannot be set
2187	   along with the size bit in SETATTR.  Increasing the size of a file
2188	   with space_reserved set will fail if space reservation cannot be
2189	   guaranteed for the new size.  If the file size is decreased, space
2190	   reservation is only guaranteed for the new size.  If a hole is
2191	   punched into the file, then the reservation is not changed.
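
   The restriction on combining space_reserved with size in a single
   SETATTR can be checked against the attribute bitmap.  The C sketch
   below assumes the NFSv4.1 bitmap4 layout, in which attribute n is bit
   (n mod 32) of 32-bit word (n div 32).  The macro name
   FATTR4_SPACE_RESERVED and both function names are hypothetical; the
   attribute number 77 comes from Table 4.  The sketch only detects the
   combination and does not choose an error to return.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FATTR4_SIZE             4u  /* attribute number from RFC 5661 */
#define FATTR4_SPACE_RESERVED  77u  /* attribute number from Table 4  */

/* Test one attribute bit in a bitmap4 held as an array of words. */
static bool bitmap4_isset(const uint32_t *words, size_t nwords,
                          uint32_t attr)
{
    size_t word = attr / 32;

    if (word >= nwords)
        return false;
    return ((words[word] >> (attr % 32)) & 1u) != 0;
}

/* True when a SETATTR bitmap names both size and space_reserved. */
bool setattr_mask_is_ambiguous(const uint32_t *words, size_t nwords)
{
    return bitmap4_isset(words, nwords, FATTR4_SIZE) &&
           bitmap4_isset(words, nwords, FATTR4_SPACE_RESERVED);
}
```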

2193	12.2.5.  Attribute 78: space_freed

2195	   space_freed gives the number of bytes freed if the file is deleted.
2196	   This attribute is read only and is of type length4.  It is a per file
2197	   attribute.

2199	13.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL

2201	   The following tables summarize the operations of the NFSv4.2 protocol
2202	   and the corresponding designation of REQUIRED, RECOMMENDED, and
2203	   OPTIONAL to implement or either OBSOLESCENT or MUST NOT implement.
2204	   The designation of OBSOLESCENT is reserved for those operations which
2205	   are defined in either NFSv4.0 or NFSv4.1 and are intended to be
2206	   classified as MUST NOT be implemented in NFSv4.3.  The designation of
2207	   MUST NOT implement is reserved for those operations that were defined
2208	   in either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2.

2210	   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
2211	   for operations sent by the client is for the server implementation.
2212	   The client is generally required to implement the operations needed
2213	   for the operating environment that it serves.  For example, a
2214	   read-only NFSv4.2 client would have no need to implement the WRITE
2215	   operation and is not required to do so.

2217	   The REQUIRED or OPTIONAL designation for callback operations sent by
2218	   the server is for both the client and server.  Generally, the client
2219	   has the option of creating the backchannel and sending the operations
2220	   on the fore channel that will be a catalyst for the server sending
2221	   callback operations.  A partial exception is CB_RECALL_SLOT; the only
2222	   way the client can avoid supporting this operation is by not creating
2223	   a backchannel.

2225	   Since this is a summary of the operations and their designation,
2226	   there are subtleties that are not presented here.  Therefore, if
2227	   there is a question of the requirements of implementation, the
2228	   operation descriptions themselves must be consulted along with other
2229	   relevant explanatory text within either this specification or that of
2230	   NFSv4.1 [RFC5661].

2232	   The abbreviations used in the second and third columns of the table
2233	   are defined as follows.

2235	   REQ  REQUIRED to implement

2237	   REC  RECOMMENDED to implement

2239	   OPT  OPTIONAL to implement

2241	   MNI  MUST NOT implement

2243	   OBS  Also OBSOLESCENT for future versions.

2245	   For the NFSv4.2 features that are OPTIONAL, the operations that
2246	   support those features are OPTIONAL, and the server would return
2247	   NFS4ERR_NOTSUPP in response to the client's use of those operations.
2248	   If an OPTIONAL feature is supported, it is possible that a set of
2249	   operations related to the feature become REQUIRED to implement.  The
2250	   third column of the table designates the feature(s) and if the
2251	   operation is REQUIRED or OPTIONAL in the presence of support for the
2252	   feature.
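
   A server might gate operations that belong to OPTIONAL features as
   in the following C sketch.  The feature flag names and the function
   optional_op_status() are hypothetical.  The sketch covers only
   operations marked REQ in the third column, treating an operation tied
   to several features as available when any one of them is supported;
   operations marked OPT there may remain unsupported even when the
   feature is present.

```c
#include <stdint.h>

#define NFS4_OK          0u
#define NFS4ERR_NOTSUPP  10004u   /* value assigned in RFC 5661 */

/* Hypothetical bit flags, one per OPTIONAL feature. */
enum nfs42_feature {
    FEAT_PNFS  = 1 << 0,   /* Parallel NFS           */
    FEAT_FDELG = 1 << 1,   /* File Delegations       */
    FEAT_DDELG = 1 << 2,   /* Directory Delegations  */
    FEAT_COPY  = 1 << 3,   /* Server Side Copy       */
    FEAT_ADH   = 1 << 4    /* Application Data Holes */
};

/*
 * op_features: features that make the operation REQUIRED
 * supported:   features this server implements
 *
 * Returns NFS4_OK when the operation must be serviced, or
 * NFS4ERR_NOTSUPP when none of its features are supported.
 */
uint32_t optional_op_status(unsigned op_features, unsigned supported)
{
    return (op_features & supported) ? NFS4_OK : NFS4ERR_NOTSUPP;
}
```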

2254	   The OPTIONAL features identified and their abbreviations are as
2255	   follows:

2257	   pNFS  Parallel NFS

2259	   FDELG  File Delegations

2261	   DDELG  Directory Delegations

2263	   COPY  Server Side Copy

2265	   ADH  Application Data Holes

2267	                                Operations

2269	   +----------------------+---------------------+----------------------+
2270	   | Operation            | OBS, REQ, REC, OPT, | Feature (REQ, REC,   |
2271	   |                      | or MNI              | or OPT)              |
2272	   +----------------------+---------------------+----------------------+
2273	   | ACCESS               | REQ                 |                      |
2274	   | BACKCHANNEL_CTL      | REQ                 |                      |
2275	   | BIND_CONN_TO_SESSION | REQ                 |                      |
2276	   | CLOSE                | REQ                 |                      |
2277	   | COMMIT               | REQ                 |                      |
2278	   | COPY                 | OPT                 | COPY (REQ)           |
2279	   | OFFLOAD_ABORT        | OPT                 | COPY (REQ)           |
2280	   | COPY_NOTIFY          | OPT                 | COPY (REQ)           |
2281	   | OFFLOAD_REVOKE       | OPT                 | COPY (REQ)           |
2282	   | OFFLOAD_STATUS       | OPT                 | COPY (REQ)           |
2283	   | CREATE               | REQ                 |                      |
2284	   | CREATE_SESSION       | REQ                 |                      |
2285	   | DELEGPURGE           | OPT                 | FDELG (REQ)          |
2286	   | DELEGRETURN          | OPT                 | FDELG, DDELG, pNFS   |
2287	   |                      |                     | (REQ)                |
2288	   | DESTROY_CLIENTID     | REQ                 |                      |
2289	   | DESTROY_SESSION      | REQ                 |                      |
2290	   | EXCHANGE_ID          | REQ                 |                      |
2291	   | FREE_STATEID         | REQ                 |                      |
2292	   | GETATTR              | REQ                 |                      |
2293	   | GETDEVICEINFO        | OPT                 | pNFS (REQ)           |
2294	   | GETDEVICELIST        | OPT                 | pNFS (OPT)           |
2295	   | GETFH                | REQ                 |                      |
2296	   | WRITE_PLUS           | OPT                 | ADH (REQ)            |
2297	   | GET_DIR_DELEGATION   | OPT                 | DDELG (REQ)          |
2298	   | LAYOUTCOMMIT         | OPT                 | pNFS (REQ)           |
2299	   | LAYOUTGET            | OPT                 | pNFS (REQ)           |
2300	   | LAYOUTRETURN         | OPT                 | pNFS (REQ)           |
2301	   | LINK                 | OPT                 |                      |
2302	   | LOCK                 | REQ                 |                      |
2303	   | LOCKT                | REQ                 |                      |
2304	   | LOCKU                | REQ                 |                      |
2305	   | LOOKUP               | REQ                 |                      |
2306	   | LOOKUPP              | REQ                 |                      |
2307	   | NVERIFY              | REQ                 |                      |
2308	   | OPEN                 | REQ                 |                      |
2309	   | OPENATTR             | OPT                 |                      |
2310	   | OPEN_CONFIRM         | MNI                 |                      |
2311	   | OPEN_DOWNGRADE       | REQ                 |                      |
2312	   | PUTFH                | REQ                 |                      |
2313	   | PUTPUBFH             | REQ                 |                      |
2314	   | PUTROOTFH            | REQ                 |                      |
2315	   | READ                 | REQ (OBS)           |                      |
2316	   | READDIR              | REQ                 |                      |
2317	   | READLINK             | OPT                 |                      |
2318	   | READ_PLUS            | OPT                 | ADH (REQ)            |
2319	   | RECLAIM_COMPLETE     | REQ                 |                      |
2320	   | RELEASE_LOCKOWNER    | MNI                 |                      |
2321	   | REMOVE               | REQ                 |                      |
2322	   | RENAME               | REQ                 |                      |
2323	   | RENEW                | MNI                 |                      |
2324	   | RESTOREFH            | REQ                 |                      |
2325	   | SAVEFH               | REQ                 |                      |
2326	   | SECINFO              | REQ                 |                      |
2327	   | SECINFO_NO_NAME      | REC                 | pNFS file layout     |
2328	   |                      |                     | (REQ)                |
2329	   | SEQUENCE             | REQ                 |                      |
2330	   | SETATTR              | REQ                 |                      |
2331	   | SETCLIENTID          | MNI                 |                      |
2332	   | SETCLIENTID_CONFIRM  | MNI                 |                      |
2333	   | SET_SSV              | REQ                 |                      |
2334	   | TEST_STATEID         | REQ                 |                      |
2335	   | VERIFY               | REQ                 |                      |
2336	   | WANT_DELEGATION      | OPT                 | FDELG (OPT)          |
2337	   | WRITE                | REQ (OBS)           |                      |
2338	   +----------------------+---------------------+----------------------+

2340	                            Callback Operations

2342	   +-------------------------+-------------------+---------------------+
2343	   | Operation               | REQ, REC, OPT, or | Feature (REQ, REC,  |
2344	   |                         | MNI               | or OPT)             |
2345	   +-------------------------+-------------------+---------------------+
2346	   | CB_OFFLOAD              | OPT               | COPY (REQ)          |
2347	   | CB_GETATTR              | OPT               | FDELG (REQ)         |
2348	   | CB_LAYOUTRECALL         | OPT               | pNFS (REQ)          |
2349	   | CB_NOTIFY               | OPT               | DDELG (REQ)         |
2350	   | CB_NOTIFY_DEVICEID      | OPT               | pNFS (OPT)          |
2351	   | CB_NOTIFY_LOCK          | OPT               |                     |
2352	   | CB_PUSH_DELEG           | OPT               | FDELG (OPT)         |
2353	   | CB_RECALL               | OPT               | FDELG, DDELG, pNFS  |
2354	   |                         |                   | (REQ)               |
2355	   | CB_RECALL_ANY           | OPT               | FDELG, DDELG, pNFS  |
2356	   |                         |                   | (REQ)               |
2357	   | CB_RECALL_SLOT          | REQ               |                     |
2358	   | CB_RECALLABLE_OBJ_AVAIL | OPT               | DDELG, pNFS (REQ)   |
2359	   | CB_SEQUENCE             | OPT               | FDELG, DDELG, pNFS  |
2360	   |                         |                   | (REQ)               |
2361	   | CB_WANTS_CANCELLED      | OPT               | FDELG, DDELG, pNFS  |
2362	   |                         |                   | (REQ)               |
2363	   +-------------------------+-------------------+---------------------+

2365	14.  NFSv4.2 Operations

2367	14.1.  Operation 59: COPY - Initiate a server-side copy

2369	14.1.1.  ARGUMENT

2371	   struct COPY4args {
2372	           /* SAVED_FH: source file */
2373	           /* CURRENT_FH: destination file */
2374	           stateid4        ca_src_stateid;
2375	           stateid4        ca_dst_stateid;
2376	           offset4         ca_src_offset;
2377	           offset4         ca_dst_offset;
2378	           length4         ca_count;
2379	           netloc4         ca_source_server<>;
2380	   };

2382	14.1.2.  RESULT

2384	   union COPY4res switch (nfsstat4 cr_status) {
2385	   case NFS4_OK:
2386	           write_response4 resok4;
2387	   default:
2388	           length4         cr_bytes_copied;
2389	   };

2391	14.1.3.  DESCRIPTION

2393	   The COPY operation is used for both intra-server and inter-server
2394	   copies.  In both cases, the COPY is always sent from the client to
2395	   the destination server of the file copy.  The COPY operation requests
2396	   that a file be copied from the location specified by the SAVED_FH
2397	   value to the location specified by the CURRENT_FH.

2399	   The SAVED_FH must be a regular file.  If SAVED_FH is not a regular
2400	   file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.

2402	   In order to set SAVED_FH to the source file handle, the compound
2403	   procedure requesting the COPY will include a sub-sequence of
2404	   operations such as

2406	      PUTFH source-fh
2407	      SAVEFH

2409	   If the request is for a server-to-server copy, the source-fh is a
2410	   filehandle from the source server and the compound procedure is being
2411	   executed on the destination server.  In this case, the source-fh is a
2412	   foreign filehandle on the server receiving the COPY request.  If
2413	   either PUTFH or SAVEFH checked the validity of the filehandle, the
2414	   operation would likely fail and return NFS4ERR_STALE.

2416	   If a server supports the server-to-server COPY feature, a PUTFH
2417	   followed by a SAVEFH MUST NOT return NFS4ERR_STALE for either
2418	   operation.  These restrictions do not pose substantial difficulties
2419	   for servers.  The CURRENT_FH and SAVED_FH may be validated in the
2420	   context of the operation referencing them and an NFS4ERR_STALE error
2421	   returned for an invalid file handle at that point.

2423	   For an intra-server copy, both the ca_src_stateid and ca_dst_stateid
2424	   MUST refer to either open or locking states provided earlier by the
2425	   server.  If either stateid is invalid, then the operation MUST fail.
2426	   If the request is for an inter-server copy, then the ca_src_stateid
2427	   can be ignored.  If ca_dst_stateid is invalid, then the operation
2428	   MUST fail.

2430	   The CURRENT_FH specifies the destination of the copy operation.  The
2431	   CURRENT_FH MUST be a regular file and not a directory.  Note, the
2432	   file MUST exist before the COPY operation begins.  It is the
2433	   responsibility of the client to create the file if necessary,
2434	   regardless of the actual copy protocol used.  If the file cannot be
2435	   created in the destination file system (due to file name
2436	   restrictions, such as case or length), the COPY operation MUST NOT be
2437	   called.

2439	   The ca_src_offset is the offset within the source file from which the
2440	   data will be read, the ca_dst_offset is the offset within the
2441	   destination file to which the data will be written, and the ca_count
2442	   is the number of bytes that will be copied.  An offset of 0 (zero)
2443	   specifies the start of the file.  A count of 0 (zero) requests that
2444	   all bytes from ca_src_offset through EOF be copied to the
2445	   destination.  If concurrent modifications to the source file overlap
2446	   with the source file region being copied, the data copied may include
2447	   all, some, or none of the modifications.  The client can use standard
2448	   NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory
2449	   byte range locks) to protect against concurrent modifications if the
2450	   client is concerned about this.  If the source file's end of file is
2451	   being modified in parallel with a copy that specifies a count of 0
2452	   (zero) bytes, the amount of data copied is implementation dependent
2453	   (clients may guard against this case by specifying a non-zero count
2454	   value or preventing modification of the source file as mentioned
2455	   above).

2457	   If the source offset or the source offset plus count is greater than
2458	   or equal to the size of the source file, the operation will fail with
2459	   NFS4ERR_INVAL.  The destination offset or destination offset plus
2460	   count may be greater than the size of the destination file.  This
2461	   allows for the client to issue parallel copies to implement
2462	   operations such as "cat file1 file2 file3 file4 > dest".
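   As an illustration only, the offset and count rules above can be
   sketched in Python.  The function name resolve_copy_range and its
   return shape are hypothetical and not part of the protocol.  The
   check follows the wording above literally, so a non-zero count that
   ends exactly at the end of the source file is rejected.

```python
NFS4_OK = 0
NFS4ERR_INVAL = 22


def resolve_copy_range(src_size, ca_src_offset, ca_count):
    """Return (status, bytes_to_copy) for the source side of a COPY.

    Hypothetical helper: models only the source range rules, not
    stateid checks or the destination file.
    """
    # Source offset, or source offset plus count, at or past the source
    # size fails with NFS4ERR_INVAL (taken literally from the text).
    if ca_src_offset >= src_size or ca_src_offset + ca_count >= src_size:
        return NFS4ERR_INVAL, 0
    # A count of 0 requests all bytes from ca_src_offset through EOF.
    if ca_count == 0:
        return NFS4_OK, src_size - ca_src_offset
    return NFS4_OK, ca_count
```

   Under this sketch, a client that wants everything up to EOF would
   pass a count of 0 rather than computing the remaining length itself.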

2464	   If the ca_source_server list is specified, then this is an inter-
2465	   server copy operation and the source file is on a remote server.  The
2466	   client is expected to have previously issued a successful COPY_NOTIFY
2467	   request to the remote source server.  The ca_source_server list MUST
2468	   be the same as the COPY_NOTIFY response's cnr_source_server list.  If
2469	   the client includes the entries from the COPY_NOTIFY response's
2470	   cnr_source_server list in the ca_source_server list, the source
2471	   server can indicate a specific copy protocol for the destination
2472	   server to use by returning a URL, which specifies both a protocol
2473	   service and server name.  Server-to-server copy protocol
2474	   considerations are described in Section 3.2.5 and Section 3.4.1.

2476	   The copying of any and all attributes on the source file is the
2477	   responsibility of both the client and the copy protocol.  Any
2478	   attribute which is both exposed via the NFS protocol on the source
2479	   file and set SHOULD be copied to the destination file.  Any attribute
2480	   supported by the destination server that is not set on the source
2481	   file SHOULD be left unset.  If the client cannot copy an attribute
2482	   from the source to destination, it MAY fail the copy transaction.

2484	   Metadata attributes not exposed via the NFS protocol SHOULD be copied
2485	   to the destination file where appropriate via the copy protocol.
2486	   Note that if the copy protocol is NFSv4.x, then these attributes will
2487	   be lost.

2489	   The destination file's named attributes are not duplicated from the
2490	   source file.  After the copy process completes, the client MAY
2491	   attempt to duplicate named attributes using standard NFSv4
2492	   operations.  However, the destination file's named attribute
2493	   capabilities MAY be different from the source file's named attribute
2494	   capabilities.

2496	   If the operation does not result in an immediate failure, the server
2497	   will return NFS4_OK, and the CURRENT_FH will remain the destination's
2498	   filehandle.

2500	   If an immediate failure does occur, cr_bytes_copied will be set to
2501	   the number of bytes copied to the destination file before the error
2502	   occurred.  The cr_bytes_copied value indicates the number of bytes
2503	   copied but not which specific bytes have been copied.

2505	   A return of NFS4_OK indicates that either the operation is complete
2506	   or the operation was initiated and a callback will be used to deliver
2507	   the final status of the operation.

2509	   If the wr_callback_id is returned, this indicates that the operation
2510	   was initiated and a CB_OFFLOAD callback will deliver the final
2511	   results of the operation.  The wr_callback_id stateid is termed a
2512	   copy stateid in this context.  The server is given the option of
2513	   returning the results in a callback because the data may require a
2514	   relatively long period of time to copy.

2516	   If no wr_callback_id is returned, the operation completed
2517	   synchronously and no callback will be issued by the server.  The
2518	   completion status of the operation is indicated by cr_status.
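   A client-side reading of these rules might look like the following
   Python sketch.  WriteResponse is a hypothetical stand-in for
   write_response4 that keeps only two of its fields, and the returned
   strings are illustrative labels, not protocol values.

```python
from dataclasses import dataclass
from typing import Optional

NFS4_OK = 0


@dataclass
class WriteResponse:
    """Hypothetical, reduced model of write_response4."""
    wr_callback_id: Optional[bytes]  # None models an empty wr_callback_id<1>
    wr_count: int


def classify_copy_reply(cr_status, resok=None):
    """Classify a COPY reply as 'failed', 'async', or 'sync'."""
    if cr_status != NFS4_OK:
        # Immediate failure: only cr_bytes_copied is returned.
        return "failed"
    if resok.wr_callback_id is not None:
        # A copy stateid was returned: CB_OFFLOAD delivers the result.
        return "async"
    # No callback stateid: the copy completed synchronously.
    return "sync"
```

   In the "async" case the client would keep the returned stateid so
   that it can later issue OFFLOAD_STATUS or OFFLOAD_ABORT against it.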

2520	   If the copy completes successfully, either synchronously or
2521	   asynchronously, the data copied from the source file to the
2522	   destination file MUST appear identical to the NFS client.  However,
2523	   the NFS server's on disk representation of the data in the source
2524	   file and destination file MAY differ.  For example, the NFS server
2525	   might encrypt, compress, deduplicate, or otherwise represent the on
2526	   disk data in the source and destination file differently.

2528	14.2.  Operation 60: OFFLOAD_ABORT - Cancel a server-side copy

2530	14.2.1.  ARGUMENT

2532	   struct OFFLOAD_ABORT4args {
2533	           /* CURRENT_FH: destination file */
2534	           stateid4        oaa_stateid;
2535	   };

2537	14.2.2.  RESULT

2539	   struct OFFLOAD_ABORT4res {
2540	           nfsstat4        oar_status;
2541	   };

2543	14.2.3.  DESCRIPTION

2545	   OFFLOAD_ABORT is used for both intra- and inter-server asynchronous
2546	   copies.  The OFFLOAD_ABORT operation allows the client to cancel a
2547	   server-side copy operation that it initiated.  This operation is sent
2548	   in a COMPOUND request from the client to the destination server.
2549	   This operation may be used to cancel a copy when the application that
2550	   requested the copy exits before the operation is completed or for
2551	   some other reason.

2553	   The request contains the filehandle and copy stateid cookies that act
2554	   as the context for the previously initiated copy operation.

2556	   The result's oar_status field indicates whether the cancel was
2557	   successful or not.  A value of NFS4_OK indicates that the copy
2558	   operation was canceled and no callback will be issued by the server.
2559	   A copy operation that is successfully canceled may result in none,
2560	   some, or all of the data and/or metadata copied.

2562	   If the server supports asynchronous copies, the server is REQUIRED to
2563	   support the OFFLOAD_ABORT operation.

2565	14.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future
2566	       copy

2568	14.3.1.  ARGUMENT

2570	   struct COPY_NOTIFY4args {
2571	           /* CURRENT_FH: source file */
2572	           stateid4        cna_src_stateid;
2573	           netloc4         cna_destination_server;
2574	   };

2576	14.3.2.  RESULT

2578	   struct COPY_NOTIFY4resok {
2579	           nfstime4        cnr_lease_time;
2580	           netloc4         cnr_source_server<>;
2581	   };

2583	   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
2584	   case NFS4_OK:
2585	           COPY_NOTIFY4resok       resok4;
2586	   default:
2587	           void;
2588	   };

2590	14.3.3.  DESCRIPTION

2592	   This operation is used for an inter-server copy.  A client sends this
2593	   operation in a COMPOUND request to the source server to authorize a
2594	   destination server identified by cna_destination_server to read the
2595	   file specified by CURRENT_FH on behalf of the given user.

2597	   The cna_src_stateid MUST refer to either open or locking states
2598	   provided earlier by the server.  If it is invalid, then the operation
2599	   MUST fail.

2601	   The cna_destination_server MUST be specified using the netloc4
2602	   network location format.  The server is not required to resolve the
2603	   cna_destination_server address before completing this operation.

2605	   If this operation succeeds, the source server will allow the
2606	   cna_destination_server to copy the specified file on behalf of the
2607	   given user as long as both of the following conditions are met:

2609	   o  The destination server begins reading the source file before the
2610	      cnr_lease_time expires.  If the cnr_lease_time expires while the
2611	      destination server is still reading the source file, the
2612	      destination server is allowed to finish reading the file.

2614	   o  The client has not issued an OFFLOAD_REVOKE for the same
2615	      combination of user, filehandle, and destination server.

2617	   The cnr_lease_time is chosen by the source server.  A cnr_lease_time
2618	   of 0 (zero) indicates an infinite lease.  To avoid the need for
2619	   synchronized clocks, copy lease times are granted by the server as a
2620	   time delta.  To renew the copy lease time the client should resend
2621	   the same copy notification request to the source server.
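   Because the lease is a delta, a client can track it against its own
   clock.  The Python sketch below is one possible way to do so; the
   function names are hypothetical, and the use of a local monotonic
   timestamp (for example, the value of time.monotonic() taken when the
   COPY_NOTIFY reply arrived) is an assumption rather than something
   this document requires.

```python
import math


def copy_lease_deadline(granted_at, lease_seconds, lease_nseconds=0):
    """Convert a cnr_lease_time delta into a local deadline.

    granted_at is a local timestamp in seconds; the two lease fields
    mirror the seconds and nseconds members of nfstime4.
    """
    if lease_seconds == 0 and lease_nseconds == 0:
        # A cnr_lease_time of 0 indicates an infinite lease.
        return math.inf
    return granted_at + lease_seconds + lease_nseconds / 1e9


def may_begin_reading(now, deadline):
    """True if the destination may still begin reading the source."""
    return now < deadline
```

   Renewal then amounts to resending the same COPY_NOTIFY and
   recomputing the deadline from the new reply.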

2623	   A successful response will also contain a list of netloc4 network
2624	   location formats called cnr_source_server, on which the source is
2625	   willing to accept connections from the destination.  These might not
2626	   be reachable from the client and might be located on networks to
2627	   which the client has no connection.

2629	   If the client wishes to perform an inter-server copy, the client MUST
2630	   send a COPY_NOTIFY to the source server.  Therefore, the source
2631	   server MUST support COPY_NOTIFY.

2633	   For a copy only involving one server (the source and destination are
2634	   on the same server), this operation is unnecessary.

2636	14.4.  Operation 62: OFFLOAD_REVOKE - Revoke a destination server's copy
2637	       privileges

2639	14.4.1.  ARGUMENT

2641	   struct OFFLOAD_REVOKE4args {
2642	           /* CURRENT_FH: source file */
2643	           netloc4         ora_destination_server;
2644	   };

2646	14.4.2.  RESULT

2648	   struct OFFLOAD_REVOKE4res {
2649	           nfsstat4        orr_status;
2650	   };

2652	14.4.3.  DESCRIPTION

2654	   This operation is used for an inter-server copy.  A client sends this
2655	   operation in a COMPOUND request to the source server to revoke the
2656	   authorization of a destination server identified by
2657	   ora_destination_server from reading the file specified by CURRENT_FH
2658	   on behalf of the given user.  If the ora_destination_server has
2659	   already begun copying the file, a successful return from this
2660	   operation indicates that further access will be prevented.

2662	   The ora_destination_server MUST be specified using the netloc4
2663	   network location format.  The server is not required to resolve the
2664	   ora_destination_server address before completing this operation.

2666	   The client uses OFFLOAD_ABORT to inform the destination to stop the
2667	   active transfer and OFFLOAD_REVOKE to inform the source to not allow
2668	   any more copy requests from the destination.  The OFFLOAD_REVOKE
2669	   operation is also useful in situations in which the source server
2670	   granted a very long or infinite lease on the destination server's
2671	   ability to read the source file and all copy operations on the source
2672	   file have been completed.

2674	   For a copy only involving one server (the source and destination are
2675	   on the same server), this operation is unnecessary.

2677	   If the server supports COPY_NOTIFY, the server is REQUIRED to support
2678	   the OFFLOAD_REVOKE operation.

2680	14.5.  Operation 63: OFFLOAD_STATUS - Poll for status of a server-side
2681	       copy

2683	14.5.1.  ARGUMENT

2685	   struct OFFLOAD_STATUS4args {
2686	           /* CURRENT_FH: destination file */
2687	           stateid4        osa_stateid;
2688	   };

2690	14.5.2.  RESULT

2692	   struct OFFLOAD_STATUS4resok {
2693	           length4         osr_bytes_copied;
2694	           nfsstat4        osr_complete<1>;
2695	   };

2697	   union OFFLOAD_STATUS4res switch (nfsstat4 osr_status) {
2698	   case NFS4_OK:
2699	           OFFLOAD_STATUS4resok            osr_resok4;
2700	   default:
2701	           void;
2702	   };

2704	14.5.3.  DESCRIPTION

2706	   OFFLOAD_STATUS is used for both intra- and inter-server asynchronous
2707	   copies.  The OFFLOAD_STATUS operation allows the client to poll the
2708	   destination server to determine the status of an asynchronous copy
2709	   operation.

2711	   If this operation is successful, the number of bytes copied is
2712	   returned to the client in the osr_bytes_copied field.  The
2713	   osr_bytes_copied value indicates the number of bytes copied but not
2714	   which specific bytes have been copied.

2716	   If the optional osr_complete field is present, the copy has
2717	   completed.  In this case the status value indicates the result of the
2718	   asynchronous copy operation.  In all cases, the server will also
2719	   deliver the final results of the asynchronous copy in a CB_OFFLOAD
2720	   operation.

2722	   The failure of this operation does not indicate the result of the
2723	   asynchronous copy in any way.

2725	   If the server supports asynchronous copies, the server is REQUIRED to
2726	   support the OFFLOAD_STATUS operation.
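   The polling behavior described above can be sketched as follows.
   This is a minimal Python illustration: send_status is a hypothetical
   callable standing in for the RPC layer, returning (osr_status, resok)
   where resok is (osr_bytes_copied, osr_complete) and osr_complete is a
   list of at most one status, mirroring osr_complete<1>.

```python
import time

NFS4_OK = 0


def poll_offload_status(send_status, stateid, max_polls=5, delay_s=0.0):
    """Poll until the copy reports completion or max_polls is reached.

    Returns (final_copy_status, bytes_copied), or None if the copy had
    not completed; the client then waits for CB_OFFLOAD.
    """
    for _ in range(max_polls):
        osr_status, resok = send_status(stateid)
        if osr_status == NFS4_OK:
            osr_bytes_copied, osr_complete = resok
            if osr_complete:
                # osr_complete present: the copy has finished.
                return osr_complete[0], osr_bytes_copied
        # A failed OFFLOAD_STATUS says nothing about the copy itself,
        # so it is treated the same as "still running".
        time.sleep(delay_s)
    return None
```

   A caller would treat a None result as "unknown", not as a failure of
   the copy, since the final result still arrives via CB_OFFLOAD.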

2728	14.6.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

2730	14.6.1.  ARGUMENT

2732	      /* new */
2733	      const EXCHGID4_FLAG_SUPP_FENCE_OPS      = 0x00000004;

2735	14.6.2.  RESULT

2737	      Unchanged

2739	14.6.3.  MOTIVATION

2741	   Enterprise applications require guarantees that an operation has
2742	   either aborted or completed.  NFSv4.1 provides this guarantee as long
2743	   as the session is alive: simply send a SEQUENCE operation on the same
2744	   slot with a new sequence number, and the successful return of
2745	   SEQUENCE indicates the previous operation has completed.  However, if
2746	   the session is lost, there is no way to know when any in progress
2747	   operations have aborted or completed.  In hindsight, the NFSv4.1
2748	   specification should have mandated that DESTROY_SESSION either abort
2749	   or complete all outstanding operations.

2751	14.6.4.  DESCRIPTION

2753	   A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
2754	   when it sends an EXCHANGE_ID operation.  The server SHOULD set this
2755	   capability in the EXCHANGE_ID reply whether the client requests it or
2756	   not.  It is the server's return that determines whether this
2757	   capability is in effect.  When it is in effect, the following will
2758	   occur:

2760	   o  The server will not reply to any DESTROY_SESSION invoked with the
2761	      client ID until all operations in progress are completed or
2762	      aborted.

2764	   o  The server will not reply to subsequent EXCHANGE_ID invoked on the
2765	      same client owner with a new verifier until all operations in
2766	      progress on the client ID's session are completed or aborted.

2768	   o  The NFS server SHOULD support client ID trunking, and if it does
2769	      and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
2770	      session ID created on one node of the storage cluster MUST be
2771	      destroyable via DESTROY_SESSION.  In addition, DESTROY_CLIENTID
2772	      and an EXCHANGE_ID with a new verifier affect all sessions
2773	      regardless of what node the sessions were created on.
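   The flag negotiation can be illustrated with a short Python sketch.
   The helper names are hypothetical; only the constant comes from
   Section 14.6.1.

```python
EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004


def request_fence_ops(eia_flags):
    """Return EXCHANGE_ID argument flags with the capability requested."""
    return eia_flags | EXCHGID4_FLAG_SUPP_FENCE_OPS


def fence_ops_in_effect(reply_flags):
    """The server's reply, not the client's request, decides."""
    return bool(reply_flags & EXCHGID4_FLAG_SUPP_FENCE_OPS)
```

   Note that the second helper deliberately ignores what the client
   asked for, since a server may set the capability unprompted.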

2775	14.7.  Operation 64: WRITE_PLUS

2776	14.7.1.  ARGUMENT

2778	   struct data_info4 {
2779	           offset4         di_offset;
2780	           length4         di_length;
2781	           bool            di_allocated;
2782	   };

2784	   struct data4 {
2785	           offset4         d_offset;
2786	           bool            d_allocated;
2787	           opaque          d_data<>;
2788	   };

2790	   union write_plus_arg4 switch (data_content4 wpa_content) {
2791	   case NFS4_CONTENT_DATA:
2792	           data4           wpa_data;
2793	   case NFS4_CONTENT_APP_DATA_HOLE:
2794	           app_data_hole4  wpa_adh;
2795	   case NFS4_CONTENT_HOLE:
2796	           data_info4      wpa_hole;
2797	   default:
2798	           void;
2799	   };

2801	   struct WRITE_PLUS4args {
2802	           /* CURRENT_FH: file */
2803	           stateid4        wp_stateid;
2804	           stable_how4     wp_stable;
2805	           write_plus_arg4 wp_data<>;
2806	   };

2808	14.7.2.  RESULT

2810	   struct write_response4 {
2811	           stateid4        wr_callback_id<1>;
2812	           count4          wr_count;
2813	           stable_how4     wr_committed;
2814	           verifier4       wr_writeverf;
2815	   };

2816	   union WRITE_PLUS4res switch (nfsstat4 wp_status) {
2817	   case NFS4_OK:
2818	           write_response4         wp_resok4;
2819	   default:
2820	           void;
2821	   };

2823	14.7.3.  DESCRIPTION

2825	   The WRITE_PLUS operation is an extension of the NFSv4.1 WRITE
2826	   operation (see Section 18.32 of [RFC5661]) and writes data to the
2827	   regular file identified by the current filehandle.  The server MAY
2828	   write fewer bytes than requested by the client.

2830	   The WRITE_PLUS argument includes an array of wp_data entries,
2831	   each of which describes a data_content4 type of data (Section 7.1.2).
2832	   For NFSv4.2, the allowed values are data, ADH, and hole.  The array
2833	   contents MUST be contiguous in the file.  A successful WRITE_PLUS
2834	   will construct a reply for wr_count, wr_committed, and wr_writeverf
2835	   as per the NFSv4.1 WRITE operation results.  If wr_callback_id is
2836	   set, it indicates an asynchronous reply (see Section 14.7.3.4).
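   One reading of the contiguity requirement is that each array entry
   begins where the previous one ends.  The Python sketch below checks
   that property; the interpretation and the function name are
   assumptions.  Each segment is given as (offset, length), where the
   length would be the size of d_data for the data arm or di_length for
   the hole arm.

```python
def is_contiguous(segments):
    """Return True if every segment starts where the previous one ends.

    segments is a list of (offset, length) pairs in array order.
    """
    expected = None
    for offset, length in segments:
        if expected is not None and offset != expected:
            return False
        expected = offset + length
    return True
```

   An empty or single-entry array is trivially contiguous under this
   definition.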

2838	   WRITE_PLUS has to support all of the errors which are returned by
2839	   WRITE plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and
2840	   the server does not support that arm of the discriminated union, but
2841	   does support one or more additional arms, it can signal to the client
2842	   that it supports the operation, but not the arm with
2843	   NFS4ERR_UNION_NOTSUPP.

2845	   If the client supports WRITE_PLUS and any arm of the discriminated
2846	   union other than NFS4_CONTENT_DATA, it MUST support CB_OFFLOAD.

2848	14.7.3.1.  Data

2850	   The d_offset specifies the offset where the data should be written.
2851	   A d_offset of zero specifies that the write should start at the
2852	   beginning of the file.  The d_count, as encoded as part of the opaque
2853	   data parameter, represents the number of bytes of data that are to be
2854	   written.  If the d_count is zero, the WRITE_PLUS will succeed and
2855	   return a d_count of zero subject to permissions checking.

2857	   Note that d_allocated has no meaning for WRITE_PLUS.

2859	   The data MUST be written synchronously and MUST follow the same
2860	   semantics of COMMIT as does the WRITE operation.

2862	14.7.3.2.  Hole punching

2864	   Whenever a client wishes to zero the blocks backing a particular
2865	   region in the file, it calls the WRITE_PLUS operation with the
2866	   current filehandle set to the filehandle of the file in question, and
2867	   the equivalent of start offset and length in bytes of the region set
2868	   in wpa_hole.di_offset and wpa_hole.di_length respectively.  If the
2869	   wpa_hole.di_allocated is set to TRUE, then the blocks will be zeroed
2870	   and if it is set to FALSE, then they will be deallocated.  All
2871	   further reads to this region MUST return zeros until overwritten.
2872	   The filehandle specified must be that of a regular file.

2874	   Situations may arise where di_offset and/or di_offset + di_length
2875	   will not be aligned to a boundary that the server does allocations/
2876	   deallocations in.  For most file systems, this is the block size of
2877	   the file system.  In such a case, the server can deallocate as many
2878	   bytes as it can in the region.  The blocks that cannot be deallocated
2879	   MUST be zeroed.  Except for the block deallocation and maximum hole
2880	   punching capability, a WRITE_PLUS operation is to be treated
2881	   similarly to a write of zeros.
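   The handling of an unaligned range can be sketched as below.  This
   Python fragment is illustrative only: it assumes a single fixed
   allocation unit (block_size) and a server that deallocates every
   whole block inside the range, neither of which this document
   mandates.  The function name and return shape are hypothetical.

```python
def plan_hole_punch(di_offset, di_length, block_size):
    """Split a hole request into ranges to zero and ranges to deallocate.

    Returns a dict with 'zero' and 'deallocate' lists of
    (offset, length) pairs.
    """
    end = di_offset + di_length
    # First block boundary at or after the start (round up).
    first_full = -(-di_offset // block_size) * block_size
    # Last block boundary at or before the end (round down).
    last_full = (end // block_size) * block_size
    if first_full >= last_full:
        # No whole block is covered: everything must be zeroed.
        zero = [(di_offset, di_length)] if di_length else []
        return {"zero": zero, "deallocate": []}
    zero = []
    if di_offset < first_full:
        zero.append((di_offset, first_full - di_offset))
    if last_full < end:
        zero.append((last_full, end - last_full))
    return {"zero": zero,
            "deallocate": [(first_full, last_full - first_full)]}
```

   For example, with 4096-byte blocks a request for offset 1000 and
   length 10000 deallocates one whole block and zeroes the partial
   blocks at either end.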

2883	   The server is not required to complete deallocating the blocks
2884	   specified in the operation before returning.  The server SHOULD
2885	   return an asynchronous result if it can determine the operation will
2886	   be long running (see Section 14.7.3.4).

2888	   If used to hole punch, WRITE_PLUS will result in the space_used
2889	   attribute being decreased by the number of bytes that were
2890	   deallocated.  The space_freed attribute may or may not decrease,
2891	   depending on the support and whether the blocks backing the specified
2892	   range were shared or not.  The size attribute will remain unchanged.

2894	   The WRITE_PLUS operation MUST NOT change the space reservation
2895	   guarantee of the file.  While the server can deallocate the blocks
2896	   specified by di_offset and di_length, future writes to this region
2897	   MUST NOT fail with NFS4ERR_NOSPC.

2899	14.7.3.3.  ADHs

2901	   If the server supports ADHs, then it MUST support the
2902	   NFS4_CONTENT_APP_DATA_HOLE arm of the WRITE_PLUS operation.  The
2903	   server has no concept of the structure imposed by the application.
2904	   Only when the application writes to a section of the file is order
2905	   imposed.  In order to detect corruption even before the
2906	   application utilizes the file, the application will want to
2907	   initialize a range of ADHs using WRITE_PLUS.

2909	   For ADHs, when the client invokes the WRITE_PLUS operation, it has
2910	   two desired results:

2912	   1.  The structure described by the app_data_hole4 be imposed on the
2913	       file.

2915	   2.  The contents described by the app_data_hole4 be sparse.

2917	   If the server supports the WRITE_PLUS operation, it still might not
2918	   support sparse files.  So if it receives the WRITE_PLUS operation,
2919	   then it MUST populate the contents of the file with the initialized
2920	   ADHs.  The server SHOULD return an asynchronous result if it can
2921	   determine the operation will be long running (see Section 14.7.3.4).

2923	   If the data was already initialized, there are two interesting
2924	   scenarios:

2926	   1.  The data blocks are allocated.

2928	   2.  Initializing in the middle of an existing ADH.

2930	   If the data blocks were already allocated, then the WRITE_PLUS is a
2931	   hole punch operation.  If WRITE_PLUS supports sparse files, then the
2932	   data blocks are to be deallocated.  If not, then the data blocks are
2933	   to be rewritten in the indicated ADH format.

2935	   Since the server has no knowledge of ADHs, it should not report
2936	   misaligned creation of ADHs.  Even while it can detect them, it
2937	   cannot disallow them, as the application might be in the process of
2938	   changing the size of the ADHs.  Thus the server must be prepared to
2939	   handle a WRITE_PLUS into an existing ADH.

2941	   This document does not mandate the manner in which the server stores
2942	   ADHs sparsely for a file.  However, if a WRITE_PLUS arrives that
2943	   will force a new ADH to start inside an existing ADH then the server
2944	   will have three ADHs instead of two.  It will have one up to the new
2945	   one for the WRITE_PLUS, one for the WRITE_PLUS, and one for after the
2946	   WRITE_PLUS.  Note that depending on server specific policies for
2947	   block allocation, there may also be some physical blocks allocated to
2948	   align the boundaries.
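   The three-way split can be illustrated with half-open byte ranges.
   The Python sketch below assumes the new ADH lies entirely within the
   existing one and ignores any alignment blocks the server may
   allocate; split_adh is a hypothetical name.

```python
def split_adh(existing, new):
    """Split an existing ADH range around a new one.

    Both arguments are half-open (start, end) byte ranges.  Returns
    the resulting ranges in file order, omitting empty ones.
    """
    e_start, e_end = existing
    n_start, n_end = new
    if not (e_start <= n_start < n_end <= e_end):
        raise ValueError("new ADH must lie within the existing ADH")
    pieces = [(e_start, n_start),  # up to the new ADH
              (n_start, n_end),    # the ADH from the WRITE_PLUS
              (n_end, e_end)]      # after the WRITE_PLUS
    return [(s, e) for s, e in pieces if s < e]
```

   When the new ADH starts strictly inside the existing one and ends
   before its end, the result has three entries, matching the text
   above.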

2950	14.7.3.4.  Asynchronous Transactions

2952	   Both hole punching and ADH initialization may lead the server to
2953	   service the operation asynchronously.  If it decides to do so, it
2954	   sets the stateid in wr_callback_id to be that of the wp_stateid.
2955	   If it does not set the wr_callback_id, then the result is
2956	   synchronous.

2958	   When the client determines that the reply will be given
2959	   asynchronously, it should not assume anything about the contents of
2960	   what it wrote until it is informed by the server that the operation
2961	   is complete.  It can use OFFLOAD_STATUS (Section 14.5) to monitor the
2962	   operation and OFFLOAD_ABORT (Section 14.2) to cancel the operation.
2963	   An example of an asynchronous WRITE_PLUS is shown in Figure 6.  Note
2964	   that as with the COPY operation, WRITE_PLUS must provide a stateid
2965	   for tracking the asynchronous operation.

2967	     Client                                  Server
2968	        +                                      +
2969	        |                                      |
2970	        |--- OPEN ---------------------------->| Client opens
2971	        |<------------------------------------/| the file
2972	        |                                      |
2973	        |--- WRITE_PLUS ---------------------->| Client punches
2974	        |<------------------------------------/| a hole
2975	        |                                      |
2976	        |                                      |
2977	        |--- OFFLOAD_STATUS ------------------>| Client may poll
2978	        |<------------------------------------/| for status
2979	        |                                      |
2980	        |                  .                   | Multiple OFFLOAD_STATUS
2981	        |                  .                   | operations may be sent.
2982	        |                  .                   |
2983	        |                                      |
2984	        |<-- CB_OFFLOAD -----------------------| Server reports results
2985	        |\------------------------------------>|
2986	        |                                      |
2987	        |--- CLOSE --------------------------->| Client closes
2988	        |<------------------------------------/| the file
2989	        |                                      |
2990	        |                                      |

2992	                   Figure 6: An asynchronous WRITE_PLUS.

2994	   When CB_OFFLOAD informs the client of the successful WRITE_PLUS, the
2995	   write_response4 embedded in the operation will provide the necessary
2996	   information that a synchronous WRITE_PLUS would have provided.
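   As a non-normative illustration, the decision a client makes on
   receiving a WRITE_PLUS reply might be sketched as follows.  The
   WriteResponse class, the offload_status callable, and the max_polls
   limit are assumptions made for this sketch and are not defined by
   this document; a real client would normally also wait for CB_OFFLOAD
   rather than rely on polling alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class WriteResponse:
    """Simplified stand-in for write_response4 (hypothetical class)."""
    wr_callback_id: List[bytes] = field(default_factory=list)  # stateid4<1>
    wr_count: int = 0


def is_asynchronous(resp: WriteResponse) -> bool:
    # The reply is asynchronous only if the server set wr_callback_id.
    return len(resp.wr_callback_id) == 1


def wait_for_completion(
    resp: WriteResponse,
    offload_status: Optional[Callable[[bytes], Tuple[bool, int]]],
    max_polls: int = 5,
) -> Optional[int]:
    """Return the byte count once known, or None if still pending.

    offload_status is a hypothetical callable that models OFFLOAD_STATUS
    and returns (done, bytes_so_far) for the given stateid.
    """
    if not is_asynchronous(resp):
        return resp.wr_count          # synchronous: result is in the reply
    stateid = resp.wr_callback_id[0]
    for _ in range(max_polls):
        done, count = offload_status(stateid)
        if done:
            return count
    return None                       # caller keeps waiting for CB_OFFLOAD
```

   Until the function returns a count, the client in this sketch makes
   no assumption about the contents of the written range.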

2998	   Regardless of whether the operation is asynchronous or synchronous,
2999	   it MUST still support the COMMIT operation semantics as outlined in
3000	   Section 18.3 of [RFC5661].  I.e., COMMIT works on one or more WRITE
3001	   operations and the WRITE_PLUS operation can appear as several WRITE
3002	   operations to the server.  The client can use locking operations to
3003	   control the behavior on the server with respect to long running
3004	   asynchronous write operations.

3006	14.8.  Operation 67: IO_ADVISE - Application I/O access pattern hints

3008	14.8.1.  ARGUMENT

3010	   enum IO_ADVISE_type4 {
3011	           IO_ADVISE4_NORMAL                       = 0,
3012	           IO_ADVISE4_SEQUENTIAL                   = 1,
3013	           IO_ADVISE4_SEQUENTIAL_BACKWARDS         = 2,
3014	           IO_ADVISE4_RANDOM                       = 3,
3015	           IO_ADVISE4_WILLNEED                     = 4,
3016	           IO_ADVISE4_WILLNEED_OPPORTUNISTIC       = 5,
3017	           IO_ADVISE4_DONTNEED                     = 6,
3018	           IO_ADVISE4_NOREUSE                      = 7,
3019	           IO_ADVISE4_READ                         = 8,
3020	           IO_ADVISE4_WRITE                        = 9,
3021	           IO_ADVISE4_INIT_PROXIMITY               = 10
3022	   };

3024	   struct IO_ADVISE4args {
3025	           /* CURRENT_FH: file */
3026	           stateid4        iar_stateid;
3027	           offset4         iar_offset;
3028	           length4         iar_count;
3029	           bitmap4         iar_hints;
3030	   };

3032	14.8.2.  RESULT

3034	   struct IO_ADVISE4resok {
3035	           bitmap4 ior_hints;
3036	   };

3038	   union IO_ADVISE4res switch (nfsstat4 _status) {
3039	   case NFS4_OK:
3040	           IO_ADVISE4resok resok4;
3041	   default:
3042	           void;
3043	   };

3045	14.8.3.  DESCRIPTION

3047	   The IO_ADVISE operation sends an I/O access pattern hint to the
3048	   server for the owner of the stateid for a given byte range specified
3049	   by iar_offset and iar_count.  The byte range specified by iar_offset
3050	   and iar_count need not currently exist in the file, but the iar_hints
3051	   will apply to the byte range when it does exist.  If iar_count is 0,
3052	   all data following iar_offset is specified.  The server MAY ignore
3053	   the advice.

3055	   The following are the allowed hints for a stateid holder:

3057	   IO_ADVISE4_NORMAL  There is no advice to give; this is the default
3058	      behavior.

3060	   IO_ADVISE4_SEQUENTIAL  Expects to access the specified data
3061	      sequentially from lower offsets to higher offsets.

3063	   IO_ADVISE4_SEQUENTIAL_BACKWARDS  Expects to access the specified data
3064	      sequentially from higher offsets to lower offsets.

3066	   IO_ADVISE4_RANDOM  Expects to access the specified data in a random
3067	      order.

3069	   IO_ADVISE4_WILLNEED  Expects to access the specified data in the near
3070	      future.

3072	   IO_ADVISE4_WILLNEED_OPPORTUNISTIC  Expects to possibly access the
3073	      data in the near future.  This is a speculative hint, and
3074	      therefore the server should prefetch data or indirect blocks only
3075	      if it can be done at a marginal cost.

3077	   IO_ADVISE4_DONTNEED  Expects that it will not access the specified
3078	      data in the near future.

3080	   IO_ADVISE4_NOREUSE  Expects to access the specified data once and then
3081	      not reuse it thereafter.

3083	   IO_ADVISE4_READ  Expects to read the specified data in the near
3084	      future.

3086	   IO_ADVISE4_WRITE  Expects to write the specified data in the near
3087	      future.

3089	   IO_ADVISE4_INIT_PROXIMITY  Informs the server that the data in the
3090	      byte range remains important to the client.

3092	   Since IO_ADVISE is a hint, a server SHOULD NOT return an error and
3093	   invalidate an entire Compound request if one of the sent hints in
3094	   iar_hints is not supported by the server.  Also, the server MUST NOT
3095	   return an error if the client sends contradictory hints to the
3096	   server, e.g., IO_ADVISE4_SEQUENTIAL and IO_ADVISE4_RANDOM in a single
3097	   IO_ADVISE operation.  In these cases, the server MUST return success
3098	   and an ior_hints value that indicates the hint it intends to
3099	   implement.  This may mean simply returning IO_ADVISE4_NORMAL.
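   One possible way for a server to derive ior_hints from iar_hints is
   sketched below.  The policy of dropping all access-pattern hints
   when they contradict each other, the function names, and the
   "supported" set are assumptions of this sketch; the document only
   requires that the server return success and report what it intends
   to implement.  The bitmap helpers assume the usual bitmap4 layout of
   32-bit words, with bit n stored in word n / 32.

```python
from enum import IntEnum


class IoAdvise(IntEnum):
    NORMAL = 0
    SEQUENTIAL = 1
    SEQUENTIAL_BACKWARDS = 2
    RANDOM = 3
    WILLNEED = 4
    WILLNEED_OPPORTUNISTIC = 5
    DONTNEED = 6
    NOREUSE = 7
    READ = 8
    WRITE = 9
    INIT_PROXIMITY = 10


def to_bitmap4(hints):
    """Encode hints as a list of 32-bit words."""
    words = []
    for h in hints:
        word, bit = divmod(int(h), 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def from_bitmap4(words):
    """Decode a bitmap4, silently ignoring bits this sketch does not know."""
    known = {int(h) for h in IoAdvise}
    hints = set()
    for w, word in enumerate(words):
        for b in range(32):
            n = w * 32 + b
            if (word >> b) & 1 and n in known:
                hints.add(IoAdvise(n))
    return hints


ACCESS_PATTERNS = (IoAdvise.SEQUENTIAL, IoAdvise.SEQUENTIAL_BACKWARDS,
                   IoAdvise.RANDOM)


def resolve_hints(iar_hints, supported):
    """Return the ior_hints to report; never fails on odd input."""
    requested = from_bitmap4(iar_hints) & set(supported)
    patterns = [h for h in ACCESS_PATTERNS if h in requested]
    if len(patterns) > 1:             # contradictory access patterns
        requested -= set(patterns)    # policy chosen for this sketch only
    if not requested:
        requested = {IoAdvise.NORMAL}
    return to_bitmap4(sorted(requested))
```

   With this policy, a request carrying both IO_ADVISE4_SEQUENTIAL and
   IO_ADVISE4_RANDOM succeeds and simply reports IO_ADVISE4_NORMAL.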

3101	   The ior_hints returned by the server is primarily for debugging
3102	   purposes since the server is under no obligation to carry out the
3103	   hints that it describes in the ior_hints result.  In addition, while
3104	   the server may have intended to implement the hints returned in
3105	   ior_hints, as time progresses, the server may need to change its
3106	   handling of a given file due to several reasons including, but not
3107	   limited to, memory pressure, additional IO_ADVISE hints sent by other
3108	   clients, and heuristically detected file access patterns.

3110	   The server MAY return different advice than what the client
3111	   requested.  If it does, then this might be due to one of several
3112	   conditions, including, but not limited to another client advising of
3113	   a different I/O access pattern; a different I/O access pattern from
3114	   another client that the server has heuristically detected; or
3115	   the server is not able to support the requested I/O access pattern,
3116	   perhaps due to a temporary resource limitation.

3118	   Each issuance of the IO_ADVISE operation overrides all previous
3119	   issuances of IO_ADVISE for a given byte range.  This effectively
3120	   follows a strategy of last hint wins for a given stateid and byte
3121	   range.
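   The "last hint wins" rule can be illustrated with a small table of
   byte ranges kept per stateid.  This is only a sketch of one
   bookkeeping approach; the table layout and the apply_advice name are
   assumptions, and a server is free to track hints in any other way or
   not at all.

```python
def apply_advice(table, stateid, offset, count, hints):
    """Record hints for [offset, offset + count) under stateid.

    A count of 0 means "all data following offset".  Older entries are
    trimmed or dropped wherever the new range overlaps them.
    """
    end_of_file = float("inf")
    new_end = offset + count if count else end_of_file
    kept = []
    for start, end, old in table.get(stateid, []):
        if end <= offset or start >= new_end:
            kept.append((start, end, old))      # no overlap, keep as is
            continue
        if start < offset:
            kept.append((start, offset, old))   # part left of the new range
        if end > new_end:
            kept.append((new_end, end, old))    # part right of the new range
    kept.append((offset, new_end, hints))
    table[stateid] = sorted(kept, key=lambda r: r[0])
    return table[stateid]
```

   For example, advising SEQUENTIAL on bytes 0 to 999 and then RANDOM
   on bytes 500 to 699 leaves SEQUENTIAL in force only for 0 to 499 and
   700 to 999.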

3123	   Clients should assume that hints included in an IO_ADVISE operation
3124	   will be forgotten once the file is closed.

3126	14.8.4.  IMPLEMENTATION

3128	   The NFS client may choose to issue an IO_ADVISE operation to the
3129	   server in several different instances.

3131	   The most obvious is in direct response to an application's execution
3132	   of posix_fadvise().  In this case, IO_ADVISE4_WRITE and
3133	   IO_ADVISE4_READ may be set based upon the type of file access
3134	   specified when the file was opened.
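   A client-side translation from posix_fadvise() advice to IO_ADVISE
   hints might look like the sketch below.  The one-to-one mapping by
   name is an assumption of this sketch (the document does not define
   the mapping), as are the function name and the use of strings in
   place of numeric constants.

```python
# Assumed name-for-name mapping; not specified by this document.
FADVISE_TO_IO_ADVISE = {
    "POSIX_FADV_NORMAL": "IO_ADVISE4_NORMAL",
    "POSIX_FADV_SEQUENTIAL": "IO_ADVISE4_SEQUENTIAL",
    "POSIX_FADV_RANDOM": "IO_ADVISE4_RANDOM",
    "POSIX_FADV_WILLNEED": "IO_ADVISE4_WILLNEED",
    "POSIX_FADV_DONTNEED": "IO_ADVISE4_DONTNEED",
    "POSIX_FADV_NOREUSE": "IO_ADVISE4_NOREUSE",
}


def hints_for_fadvise(advice, readable, writable):
    """Build the set of hints for one posix_fadvise() call.

    readable and writable describe how the file was opened and add
    IO_ADVISE4_READ and IO_ADVISE4_WRITE respectively.
    """
    hints = {FADVISE_TO_IO_ADVISE[advice]}
    if readable:
        hints.add("IO_ADVISE4_READ")
    if writable:
        hints.add("IO_ADVISE4_WRITE")
    return hints
```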

3136	14.8.5.  IO_ADVISE4_INIT_PROXIMITY

3138	   The IO_ADVISE4_INIT_PROXIMITY hint is non-POSIX in origin and conveys
3139	   that the client has recently accessed the byte range in its own
3140	   cache.  I.e., it has not accessed it on the server, but it has
3141	   locally.  When the server reaches resource exhaustion, knowing which
3142	   data is more important allows the server to make better choices about
3143	   which data to, for example, purge from a cache or move to secondary
3144	   storage.  It also informs the server which delegations are more
3145	   important, since if delegations are working correctly, once delegated
3146	   to a client and the client has read the content for that byte range,
3147	   a server might never receive another read request for that byte
3148	   range.

3150	   This hint is also useful in the case of NFS clients which are network
3151	   booting from a server.  If the first client to be booted sends this
3152	   hint, then it keeps the cache warm for the remaining clients.

3154	14.8.6.  pNFS File Layout Data Type Considerations

3156	   The IO_ADVISE considerations for pNFS are very similar to the COMMIT
3157	   considerations for pNFS.  That is, as with COMMIT, some NFS server
3158	   implementations prefer IO_ADVISE be done on the DS, and some prefer
3159	   it be done on the MDS.

3161	   So for the file's layout type, it is proposed that NFSv4.2 include an
3162	   additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS which is valid only on
3163	   NFSv4.2 or higher.  Any file's layout obtained with NFSv4.1 MUST NOT
3164	   have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  Any file's layout obtained
3165	   with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  If the
3166	   client does not implement IO_ADVISE, then it MUST ignore
3167	   NFL42_UFLG_IO_ADVISE_THRU_MDS.

3169	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is set, the client MUST send the
3170	   IO_ADVISE operation to the MDS in order for it to be honored by the
3171	   DS.  Once the MDS receives the IO_ADVISE operation, it will
3172	   communicate the advice to each DS.

3174	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then the client SHOULD
3175	   send an IO_ADVISE operation to the appropriate DS for the specified
3176	   byte range.  While the client MAY always send IO_ADVISE to the MDS,
3177	   if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client
3178	   should expect that such an IO_ADVISE is futile.  Note that a client
3179	   SHOULD use the same set of arguments on each IO_ADVISE sent to a DS
3180	   for the same open file reference.

3182	   The server is not required to support different advice for different
3183	   DSes with the same open file reference.

3185	14.8.6.1.  Dense and Sparse Packing Considerations

3187	   The IO_ADVISE operation MUST use the iar_offset and byte range as
3188	   dictated by the presence or absence of NFL4_UFLG_DENSE.

3190	   E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
3191	   for iar_offset 0 really means iar_offset 10000 in the logical file,
3192	   then an IO_ADVISE for iar_offset 0 means iar_offset 10000.

3194	   E.g., if NFL4_UFLG_DENSE is absent, and a READ or WRITE to the DS
3195	   for iar_offset 0 really means iar_offset 0 in the logical file, then
3196	   an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file.

3198	   E.g., if NFL4_UFLG_DENSE is present, the stripe unit is 1000 bytes,
3199	   the stripe count is 10, and the dense DS file is serving
3200	   iar_offset 0, then a READ or WRITE to the DS for iar_offsets 0, 1000,
3201	   2000, and 3000 really means iar_offsets 10000, 20000, 30000, and
3202	   40000 (implying a stripe count of 10 and a stripe unit of 1000).  An
3203	   IO_ADVISE sent to the same DS with an iar_offset of 500 and an
3204	   iar_count of 3000 means that the IO_ADVISE applies to these byte
3205	   ranges of the dense DS file:

3207	     - 500 to 999
3208	     - 1000 to 1999
3209	     - 2000 to 2999
3210	     - 3000 to 3499

3212	   I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.

3214	   It also applies to these byte ranges of the logical file:

3216	     - 10500 to 10999 (500 bytes)
3217	     - 20000 to 20999 (1000 bytes)
3218	     - 30000 to 30999 (1000 bytes)
3219	     - 40000 to 40499 (500 bytes)
3220	     (total            3000 bytes)
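   The dense example can be checked mechanically.  The sketch below
   takes the DS-to-logical offsets exactly as stated in the example
   (it does not derive them from a striping formula) and splits the
   advised range at stripe unit boundaries.  The names DS_TO_LOGICAL
   and dense_ranges are assumptions of this sketch.

```python
STRIPE_UNIT = 1000

# DS-file offset of each stripe unit -> logical-file offset, copied
# from the example text rather than computed.
DS_TO_LOGICAL = {0: 10000, 1000: 20000, 2000: 30000, 3000: 40000}


def dense_ranges(iar_offset, iar_count):
    """Return [((ds_first, ds_last), (log_first, log_last)), ...]."""
    pieces = []
    pos, end = iar_offset, iar_offset + iar_count
    while pos < end:
        unit_start = pos - pos % STRIPE_UNIT
        stop = min(end, unit_start + STRIPE_UNIT)   # stay in this unit
        logical = DS_TO_LOGICAL[unit_start] + (pos - unit_start)
        pieces.append(((pos, stop - 1),
                       (logical, logical + (stop - pos) - 1)))
        pos = stop
    return pieces
```

   Calling dense_ranges(500, 3000) yields the four DS-file ranges and
   the four logical-file ranges listed above, totalling 3000 bytes.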

3222	   E.g., if NFL4_UFLG_DENSE is absent, the stripe unit is 250 bytes, the
3223	   stripe count is 4, and the sparse DS file is serving iar_offset 0,
3224	   then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and
3225	   3000 really means iar_offsets 0, 1000, 2000, and 3000 in the logical
3226	   file, keeping in mind that on the DS file, byte ranges 250 to 999,
3227	   1250 to 1999, 2250 to 2999, and 3250 to 3999 are not accessible.
3228	   An IO_ADVISE sent to the same DS with an iar_offset of 500 and
3229	   an iar_count of 3000 means that the IO_ADVISE applies to these byte
3230	   ranges of the logical file and the sparse DS file:

3232	     - 500 to 999 (500 bytes)   - no effect
3233	     - 1000 to 1249 (250 bytes) - effective
3234	     - 1250 to 1999 (750 bytes) - no effect
3235	     - 2000 to 2249 (250 bytes) - effective
3236	     - 2250 to 2999 (750 bytes) - no effect
3237	     - 3000 to 3249 (250 bytes) - effective
3238	     - 3250 to 3499 (250 bytes) - no effect
3239	     (subtotal      2250 bytes) - no effect
3240	     (subtotal       750 bytes) - effective
3241	     (grand total   3000 bytes) - no effect + effective
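   The sparse example can be reproduced in the same way.  The sketch
   below assumes that the DS in the example serves the first stripe
   unit of every stripe (DS_INDEX 0); the names used are illustrative
   only.

```python
STRIPE_UNIT = 250
STRIPE_COUNT = 4
DS_INDEX = 0   # assumed: this DS holds the first unit of each stripe


def sparse_ranges(iar_offset, iar_count):
    """Return [(first, last, effective), ...] for the advised range."""
    stripe_width = STRIPE_UNIT * STRIPE_COUNT
    served_lo = DS_INDEX * STRIPE_UNIT
    pieces = []
    pos, end = iar_offset, iar_offset + iar_count
    while pos < end:
        within = pos % stripe_width
        stripe_start = pos - within
        if served_lo <= within < served_lo + STRIPE_UNIT:
            # Inside the unit this DS serves.
            stop, effective = stripe_start + served_lo + STRIPE_UNIT, True
        elif within < served_lo:
            # Before the served unit of this stripe.
            stop, effective = stripe_start + served_lo, False
        else:
            # After it: unserved until the next stripe's served unit.
            stop, effective = stripe_start + stripe_width + served_lo, False
        stop = min(stop, end)
        pieces.append((pos, stop - 1, effective))
        pos = stop
    return pieces
```

   For sparse_ranges(500, 3000) the effective pieces add up to 750
   bytes and the remaining 2250 bytes have no effect, matching the
   subtotals in the list above.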

3243	   If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
3244	   NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
3245	   sent to the data server with a byte range that overlaps a stripe unit
3246	   that the data server does not serve MUST NOT result in the status
3247	   NFS4ERR_PNFS_IO_HOLE.  Instead, the response SHOULD be successful and
3248	   if the server applies IO_ADVISE hints on any stripe units that
3249	   overlap with the specified range, those hints SHOULD be indicated in
3250	   the response.

3252	14.9.  Changes to Operation 51: LAYOUTRETURN

3254	14.9.1.  Introduction

3256	   In the pNFS description provided in [RFC5661], the client is not
3257	   able to relay an error code from the DS to the MDS.  In the
3258	   specification of the Objects-Based Layout protocol [RFC5664], use is
3259	   made of the opaque lrf_body field of the LAYOUTRETURN argument to do
3260	   such a relaying of error codes.  In this section, we define a new
3261	   data structure to enable the passing of error codes back to the MDS
3262	   and provide some guidelines on what both the client and MDS should
3263	   expect in such circumstances.

3265	   There are two broad classes of errors, transient and persistent.  The
3266	   client SHOULD strive to only use this new mechanism to report
3267	   persistent errors.  It MUST be able to deal with transient issues by
3268	   itself.  Also, while the client might consider an issue to be
3269	   persistent, it MUST be prepared for the MDS to consider such issues
3270	   to be transient.  A prime example of this is if the MDS fences off a
3271	   client from either a stateid or a filehandle.  The client will get an
3272	   error from the DS and might relay either NFS4ERR_ACCESS or
3273	   NFS4ERR_BAD_STATEID back to the MDS, with the belief that this is a
3274	   hard error.  If the MDS is informed by the client that there is an
3275	   error, it can safely ignore that.  For it, the mission is
3276	   accomplished in that the client has returned a layout that the MDS
3277	   had most likely recalled.

3279	   The client might also need to inform the MDS that it cannot reach one
3280	   or more of the DSes.  While the MDS can detect the connectivity of
3281	   both of these paths:

3283	   o  MDS to DS

3285	   o  MDS to client

3287	   it cannot determine if the client and DS path is working.  As with
3288	   the case of the DS passing errors to the client, the client must be
3289	   prepared for the MDS to consider such outages as being transitory.

3291	   The existing LAYOUTRETURN operation is extended by introducing a new
3292	   data structure to report errors, layoutreturn_device_error4.  Also,
3293	   layoutreturn_error_report4 is introduced to enable an array of errors
3294	   to be reported.

3296	14.9.2.  ARGUMENT

3298	   The ARGUMENT specification of the LAYOUTRETURN operation in section
3299	   18.44.1 of [RFC5661] is augmented by the following XDR code
3300	   [RFC4506]:

3302	   struct layoutreturn_device_error4 {
3303	           deviceid4       lrde_deviceid;
3304	           nfsstat4        lrde_status;
3305	           nfs_opnum4      lrde_opnum;
3306	   };

3308	   struct layoutreturn_error_report4 {
3309	           layoutreturn_device_error4      lrer_errors<>;
3310	   };

3312	14.9.3.  RESULT

3314	   The RESULT of the LAYOUTRETURN operation is unchanged; see section
3315	   18.44.2 of [RFC5661].

3317	14.9.4.  DESCRIPTION

3319	   The following text is added to the end of the LAYOUTRETURN operation
3320	   DESCRIPTION in section 18.44.3 of [RFC5661].

3322	   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
3323	   then if the lrf_body field is NULL, it indicates to the MDS that the
3324	   client experienced no errors.  If lrf_body is non-NULL, then the
3325	   field references error information which is layout type specific.
3326	   I.e., the Objects-Based Layout protocol can continue to utilize
3327	   lrf_body as specified in [RFC5664].  For both Files-Based and Block-
3328	   Based Layouts, the field references a layoutreturn_error_report4,
3329	   which contains an array of layoutreturn_device_error4.

3331	   Each individual layoutreturn_device_error4 describes a single error
3332	   associated with a DS, which is identified via lrde_deviceid.  The
3333	   operation which returned the error is identified via lrde_opnum.
3334	   Finally the NFS error value (nfsstat4) encountered is provided via
3335	   lrde_status and may consist of the following error codes:

3337	   NFS4ERR_NXIO:  The client was unable to establish any communication
3338	      with the DS.

3340	   NFS4ERR_*:  The client was able to establish communication with the
3341	      DS and is returning one of the allowed error codes for the
3342	      operation denoted by lrde_opnum.
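   To make the wire format concrete, the following sketch XDR-encodes a
   layoutreturn_error_report4.  It assumes the RFC 5661 definitions of
   deviceid4 (a fixed 16-byte opaque), NFS4ERR_NXIO (6), and OP_READ
   (25); the function names are illustrative and not part of the
   protocol.

```python
import struct

NFS4_DEVICEID4_SIZE = 16   # deviceid4 size, per RFC 5661
NFS4ERR_NXIO = 6           # nfsstat4 value, per RFC 5661
OP_READ = 25               # nfs_opnum4 value, per RFC 5661


def encode_device_error(deviceid, status, opnum):
    """Encode one layoutreturn_device_error4."""
    if len(deviceid) != NFS4_DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    # Fixed-length opaque (already a multiple of 4 bytes), followed by
    # two XDR enums, each a big-endian 32-bit signed integer.
    return deviceid + struct.pack(">ii", status, opnum)


def encode_error_report(errors):
    """Encode layoutreturn_error_report4: a count, then each element."""
    body = struct.pack(">I", len(errors))
    for deviceid, status, opnum in errors:
        body += encode_device_error(deviceid, status, opnum)
    return body


def encode_lrf_body(errors):
    """Wrap the report as a variable-length opaque, as lrf_body is."""
    report = encode_error_report(errors)
    return struct.pack(">I", len(report)) + report   # no padding needed
```

   A single unreachable DS seen during a READ would be reported as
   encode_lrf_body([(deviceid, NFS4ERR_NXIO, OP_READ)]), which produces
   a 4-byte length followed by a 28-byte report.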

3344	14.9.5.  IMPLEMENTATION

3346	   The following text is added to the end of the LAYOUTRETURN operation
3347	   IMPLEMENTATION in section 18.44.4 of [RFC5661].

3349	   Clients are expected to tolerate transient storage device errors, and
3350	   hence clients SHOULD NOT use the LAYOUTRETURN error handling for
3351	   device access problems that may be transient.  The methods by which a
3352	   client decides whether a device access problem is transient vs.
3353	   persistent are implementation-specific, but may include retrying I/Os
3354	   to a data server under appropriate conditions.

3356	   When an I/O to a storage device fails, the client SHOULD retry the
3357	   failed I/O via the MDS.  In this situation, before retrying the I/O,
3358	   the client SHOULD return the layout, or the affected portion thereof,
3359	   and SHOULD indicate which storage device or devices were problematic.
3360	   The client needs to do this when the DS is being unresponsive in
3361	   order to fence off any failed write attempts, and ensure that they do
3362	   not end up overwriting any later data being written through the MDS.
3363	   If the client does not do this, the MDS MAY issue a layout recall
3364	   callback in order to perform the retried I/O.

3366	   The client needs to be cognizant that since this error handling is
3367	   optional in the MDS, the MDS may silently ignore this functionality.
3368	   Also, as the MDS may consider some issues the client reports to be
3369	   expected (see Section 14.9.1), the client might find it difficult to
3370	   detect an MDS which has not implemented error handling via
3371	   LAYOUTRETURN.

3373	   If an MDS is aware that a storage device is proving problematic to a
3374	   client, the MDS SHOULD NOT include that storage device in any pNFS
3375	   layouts sent to that client.  If the MDS is aware that a storage
3376	   device is affecting many clients, then the MDS SHOULD NOT include
3377	   that storage device in any pNFS layouts sent out.  If a client asks
3378	   for a new layout for the file from the MDS, it MUST be prepared for
3379	   the MDS to return that storage device in the layout.  The MDS might
3380	   not have any choice in using the storage device, i.e., there might
3381	   only be one possible layout for the system.  Also, in the case of
3382	   existing files, the MDS might have no choice in which storage devices
3383	   to hand out to clients.

3385	   The MDS is not required to indefinitely retain per-client storage
3386	   device error information.  An MDS is also not required to
3387	   automatically reinstate use of a previously problematic storage
3388	   device; administrative intervention may be required instead.

3390	14.10.  Operation 65: READ_PLUS

3392	14.10.1.  ARGUMENT

3394	   struct READ_PLUS4args {
3395	           /* CURRENT_FH: file */
3396	           stateid4        rpa_stateid;
3397	           offset4         rpa_offset;
3398	           count4          rpa_count;
3399	   };

3401	14.10.2.  RESULT

3403	   struct data_info4 {
3404	           offset4         di_offset;
3405	           length4         di_length;
3406	           bool            di_allocated;
3407	   };

3409	   struct data4 {
3410	           offset4         d_offset;
3411	           bool            d_allocated;
3412	           opaque          d_data<>;
3413	   };
3414	   union read_plus_content switch (data_content4 rpc_content) {
3415	   case NFS4_CONTENT_DATA:
3416	           data4           rpc_data;
3417	   case NFS4_CONTENT_APP_DATA_HOLE:
3418	           app_data_hole4  rpc_adh;
3419	   case NFS4_CONTENT_HOLE:
3420	           data_info4      rpc_hole;
3421	   default:
3422	           void;
3423	   };

3425	   /*
3426	    * Allow a return of an array of contents.
3427	    */
3428	   struct read_plus_res4 {
3429	           bool                    rpr_eof;
3430	           read_plus_content       rpr_contents<>;
3431	   };

3433	   union READ_PLUS4res switch (nfsstat4 rp_status) {
3434	   case NFS4_OK:
3435	           read_plus_res4  rp_resok4;
3436	   default:
3437	           void;
3438	   };

3440	14.10.3.  DESCRIPTION

3442	   The READ_PLUS operation is based upon the NFSv4.1 READ operation (see
3443	   Section 18.22 of [RFC5661]) and similarly reads data from the regular
3444	   file identified by the current filehandle.

3446	   The client provides an rpa_offset of where the READ_PLUS is to start
3447	   and an rpa_count of how many bytes are to be read.  An rpa_offset of
3448	   zero means to read data starting at the beginning of the file.  If
3449	   rpa_offset is greater than or equal to the size of the file, the
3450	   status NFS4_OK is returned with di_length (the data length) set to
3451	   zero and eof set to TRUE.

3453	   The READ_PLUS result is comprised of an array of rpr_contents, each
3454	   of which describes a data_content4 type of data (Section 7.1.2).  For
3455	   NFSv4.2, the allowed values are data, ADH, and hole.  A server is
3456	   required to support the data type, but not ADH or hole.  Both an
3457	   ADH and a hole must be returned in their entirety - clients must be
3458	   prepared to get more information than they requested.  Both the start
3459	   and the end of the hole may exceed what was requested.  The array
3460	   contents MUST be contiguous in the file.
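   Because a hole or ADH may extend beyond the requested range, a
   client will usually trim the returned array to what it asked for.
   The sketch below also verifies the contiguity requirement.  The
   (kind, offset, length) tuples are a simplification of
   read_plus_content chosen for this sketch, and check_and_clip is a
   hypothetical helper name.

```python
def check_and_clip(contents, rpa_offset, rpa_count):
    """Validate contiguity and trim segments to the requested range.

    contents is a list of (kind, offset, length) tuples in file order.
    """
    req_end = rpa_offset + rpa_count
    clipped = []
    expected = None
    for kind, offset, length in contents:
        if expected is not None and offset != expected:
            raise ValueError("READ_PLUS contents are not contiguous")
        expected = offset + length
        lo, hi = max(offset, rpa_offset), min(expected, req_end)
        if lo < hi:                      # keep only the overlapping part
            clipped.append((kind, lo, hi - lo))
    return clipped
```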

3462	   READ_PLUS has to support all of the errors which are returned by READ
3463	   plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and the
3464	   server does not support that arm of the discriminated union, but does
3465	   support one or more additional arms, it can signal to the client that
3466	   it supports the operation, but not the arm with
3467	   NFS4ERR_UNION_NOTSUPP.

3469	   If the data to be returned is comprised entirely of zeros, then the
3470	   server may elect to return that data as a hole.  The server
3471	   differentiates this to the client by setting di_allocated to TRUE in
3472	   this case.  Note that in such a scenario, the server is not required
3473	   to determine the full extent of the "hole" - it does not need to
3474	   determine where the zeros start and end.  If the server elects to
3475	   return the hole as data, then it can set the d_allocated to FALSE in
3476	   the rpc_data to indicate it is a hole.

3478	   The server may elect to return adjacent elements of the same type.
3479	   For example, the guard pattern or block size of an ADH might change,
3480	   which would require adjacent elements of type ADH.  Likewise if the
3481	   server has a range of data comprised entirely of zeros and then a
3482	   hole, it might want to return two adjacent holes to the client.

3484	   If the client specifies an rpa_count value of zero, the READ_PLUS
3485	   succeeds and returns zero bytes of data.  In all situations, the
3486	   server may choose to return fewer bytes than specified by the client.
3487	   The client needs to check for this condition and handle the condition
3488	   appropriately.

3490	   If the client specifies an rpa_offset and rpa_count value that is
3491	   entirely contained within a hole of the file, then the di_offset and
3492	   di_length returned must be for the entire hole.  This result is
3493	   considered valid until the file is changed (detected via the change
3494	   attribute).  The server MUST provide the same semantics for the hole
3495	   as if the client read the region and received zeroes; the lifetime
3496	   of the implied hole contents MUST be exactly the same as that of any
3497	   other read data.

3499	   If the client specifies an rpa_offset and rpa_count value that begins
3500	   in a non-hole of the file but extends into a hole, the server should
3501	   return an array comprised of both data and a hole.  The client MUST
3502	   be prepared for the server to return a short read describing just the
3503	   data.  The client will then issue another READ_PLUS for the remaining
3504	   bytes, to which the server will respond with information about the
3505	   hole in the file.

3507	   Except when special stateids are used, the stateid value for a
3508	   READ_PLUS request represents a value returned from a previous byte-
3509	   range lock or share reservation request or the stateid associated
3510	   with a delegation.  The stateid identifies the associated owners if
3511	   any and is used by the server to verify that the associated locks are
3512	   still valid (e.g., have not been revoked).

3514	   If the read ended at the end-of-file (formally, in a correctly formed
3515	   READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
3516	   of the file), or the READ_PLUS operation extends beyond the size of
3517	   the file (if rpa_offset + rpa_count is greater than the size of the
3518	   file), eof is returned as TRUE; otherwise, it is FALSE.  A successful
3519	   READ_PLUS of an empty file will always return eof as TRUE.
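   The formal eof rule above reduces to a single comparison.  The
   sketch below is illustrative only; the function name is an
   assumption, and it models a correctly formed request without short
   reads.

```python
def read_plus_eof(file_size, rpa_offset, rpa_count):
    """eof is TRUE when the request reaches or passes the end of file."""
    return rpa_offset + rpa_count >= file_size
```

   For an empty file (file_size of 0) the comparison holds for every
   offset and count, so eof is always TRUE.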

3521	   If the current filehandle is not an ordinary file, an error will be
3522	   returned to the client.  In the case that the current filehandle
3523	   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
3524	   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
3525	   returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
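   The file type check described above can be summarized as follows.
   Type and status names are given as strings to keep the sketch free
   of numeric constants; the function name is an assumption.

```python
def read_plus_type_check(ftype):
    """Map the type of the current filehandle to a READ_PLUS status."""
    if ftype == "NF4REG":
        return "NFS4_OK"              # regular file: proceed with the read
    if ftype == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if ftype == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    return "NFS4ERR_WRONG_TYPE"       # all other object types
```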

3527	   For a READ_PLUS with a stateid value of all bits equal to zero, the
3528	   server MAY allow the READ_PLUS to be serviced subject to mandatory
3529	   byte-range locks or the current share deny modes for the file.  For a
3530	   READ_PLUS with a stateid value of all bits equal to one, the server
3531	   MAY allow READ_PLUS operations to bypass locking checks at the
3532	   server.

3534	   On success, the current filehandle retains its value.

3536	14.10.4.  IMPLEMENTATION

3538	   In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
3539	   [RFC5661] also apply to READ_PLUS.  One delta is that when the owner
3540	   has a locked byte range, the server MUST return an array of
3541	   rpr_contents with values inside that range.

3543	14.10.4.1.  Additional pNFS Implementation Information

3545	   With pNFS, the semantics of using READ_PLUS remain the same.  Any
3546	   data server MAY return a hole or ADH result for a READ_PLUS request
3547	   that it receives.  When a data server chooses to return such a
3548	   result, it has the option of returning information for the data
3549	   stored on that data server (as defined by the data layout), but it
3550	   MUST NOT return results for a byte range that includes data managed
3551	   by another data server.

3553	   A data server should do its best to return as much information about
3554	   an ADH as is feasible without having to contact the metadata server.
3555	   If communication with the metadata server is required, then every
3556	   attempt should be taken to minimize the number of requests.

3558	   If mandatory locking is enforced, then the data server must also
3559	   ensure that it returns only information that is within the owner's
3560	   locked byte range.

3562	14.10.5.  READ_PLUS with Sparse Files Example

3564	   The following table describes a sparse file.  For each byte range,
3565	   the file contains either non-zero data or a hole.  In addition, the
3566	   server in this example uses a Hole Threshold of 32K.

3568	                        +-------------+----------+
3569	                        | Byte-Range  | Contents |
3570	                        +-------------+----------+
3571	                        | 0-15999     | Hole     |
3572	                        | 16K-31999   | Non-Zero |
3573	                        | 32K-255999  | Hole     |
3574	                        | 256K-287999 | Non-Zero |
3575	                        | 288K-353999 | Hole     |
3576	                        | 354K-417999 | Non-Zero |
3577	                        +-------------+----------+

3579	                                  Table 5

3581	   Under the given circumstances, if a client were to read from the file
3582	   with a max read size of 64K, the following will be the results for
3583	   the given READ_PLUS calls.  This assumes the client has already
3584	   opened the file, acquired a valid stateid ('s' in the example), and
3585	   just needs to issue READ_PLUS requests.

3587	   1.  READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K],
3588	       hole[32K,224K]>.  Since the first hole is less than the server's
3589	       Hole Threshold, the first 32K of the file is returned as data
3590	       and the remaining 32K is returned as a hole which actually
3591	       extends to 256K.

3593	   2.  READ_PLUS(s, 32K, 64K) --> NFS4_OK, eof = false, <hole[32K,224K]>.
3594	       The requested range was all zeros, and the current hole begins at
3595	       offset 32K and is 224K in length.  Note that the client should
3596	       not have followed up the previous READ_PLUS request with this one
3597	       as the hole information from the previous call extended past what
3598	       the client was requesting.

3600	   3.  READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false,
3601	       <data[256K,32K], hole[288K,66K]>.  Returns an array of the 32K
3602	       data and the hole which extends to 354K.

3604	   4.  READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true,
3605	       <data[354K,64K]>.  Returns the final 64K of data and informs
3606	       the client there is no more data in the file.
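   The four calls above can be replayed against Table 5 with a small
   model of the server.  The sketch assumes that K means 1000 (the
   table ends its ranges at ...999), that holes shorter than the Hole
   Threshold are returned as zero-filled data merged with adjacent
   data, and that larger holes are returned in their entirety.  Results
   are written as (kind, offset, length) tuples, a notation chosen for
   this sketch.

```python
K = 1000   # assumption: Table 5 uses decimal K

# (start, stop, kind) with stop exclusive, transcribed from Table 5.
EXTENTS = [
    (0, 16 * K, "hole"), (16 * K, 32 * K, "data"),
    (32 * K, 256 * K, "hole"), (256 * K, 288 * K, "data"),
    (288 * K, 354 * K, "hole"), (354 * K, 418 * K, "data"),
]
FILE_SIZE = 418 * K
HOLE_THRESHOLD = 32 * K


def read_plus(offset, count):
    """Return (eof, [(kind, offset, length), ...]) for one READ_PLUS."""
    end = min(offset + count, FILE_SIZE)
    out = []
    for start, stop, kind in EXTENTS:
        if stop <= offset or start >= end:
            continue                      # extent outside the request
        if kind == "hole" and stop - start >= HOLE_THRESHOLD:
            seg = ("hole", start, stop - start)   # whole hole, unclipped
        else:
            lo, hi = max(start, offset), min(stop, end)
            seg = ("data", lo, hi - lo)   # real data or a small hole
        if out and out[-1][0] == "data" and seg[0] == "data":
            prev = out[-1]                # merge adjacent data segments
            out[-1] = ("data", prev[1], prev[2] + seg[2])
        else:
            out.append(seg)
    return offset + count >= FILE_SIZE, out
```

   Under these assumptions, read_plus(0, 64 * K) returns 32K of data
   followed by a hole that starts at 32K and is 224K long, and only the
   last call reports eof.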

3608	14.11.  Operation 66: SEEK

3610	   SEEK is an operation that allows a client to determine the location
3611	   of the next data_content4 in a file.  It allows an implementation of
3612	   the emerging extension to lseek(2) to allow clients to determine
3613	   SEEK_HOLE and SEEK_DATA.

3615	14.11.1.  ARGUMENT

3617	   struct SEEK4args {
3618	           /* CURRENT_FH: file */
3619	           stateid4        sa_stateid;
3620	           offset4         sa_offset;
3621	           data_content4   sa_what;
3622	   };

3624	14.11.2.  RESULT

3626	   union seek_content switch (data_content4 content) {
3627	   case NFS4_CONTENT_DATA:
3628	           data_info4      sc_data;
3629	   case NFS4_CONTENT_APP_DATA_HOLE:
3630	           app_data_hole4  sc_adh;
3631	   case NFS4_CONTENT_HOLE:
3632	           data_info4      sc_hole;
3633	   default:
3634	           void;
3635	   };

3637	   struct seek_res4 {
3638	           bool                    sr_eof;
3639	           seek_content            sr_contents;
3640	   };

3642	   union SEEK4res switch (nfsstat4 status) {
3643	   case NFS4_OK:
3644	           seek_res4       resok4;
3645	   default:
3646	           void;
3647	   };

3649	14.11.3.  DESCRIPTION

3651	   From the given sa_offset, find the next data_content4 of type sa_what
3652	   in the file.  For either a hole or ADH, this must return the
3653	   data_content4 in its entirety.  For data, it must not return the
3654	   actual data.

3656	   SEEK must follow the same rules for stateids as READ_PLUS
3657	   (Section 14.10.3).

3659	   If the server cannot find a corresponding sa_what, the status is
3660	   still NFS4_OK, but sr_eof is TRUE.  In that case, sr_contents
3661	   contains zeroed-out content of the appropriate type.
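
   The lookup rule above can be illustrated with a minimal sketch.  It
   is not part of the protocol definition: the Extent class, the seek()
   helper, the string labels, and the sample layout are all hypothetical
   names chosen for illustration, and the file is modeled as an
   offset-sorted list of extents rather than real server state.

```python
from dataclasses import dataclass

# Labels standing in for data_content4 values (illustrative strings,
# not the on-the-wire enum values).
DATA = "NFS4_CONTENT_DATA"
HOLE = "NFS4_CONTENT_HOLE"


@dataclass
class Extent:
    kind: str    # DATA or HOLE
    offset: int  # first byte of the extent
    length: int  # extent length in bytes


def seek(extents, sa_offset, sa_what):
    """Return (sr_eof, content) for the next extent of type sa_what.

    `extents` is a hypothetical, offset-sorted description of the file.
    Only offsets and lengths are returned, never file data.
    """
    for ext in extents:
        if ext.kind == sa_what and ext.offset + ext.length > sa_offset:
            # The whole extent is reported, even if sa_offset falls inside it.
            return False, ext
    # No match: the status would still be NFS4_OK, with sr_eof TRUE and
    # zeroed-out content of the requested type.
    return True, Extent(sa_what, 0, 0)


K = 1024
layout = [
    Extent(DATA, 0, 32 * K),
    Extent(HOLE, 32 * K, 224 * K),
    Extent(DATA, 256 * K, 32 * K),
]
print(seek(layout, 100 * K, HOLE))
print(seek(layout, 288 * K, DATA))
```

   In this sketch, a request that starts inside a hole still reports the
   hole from its true starting offset, and a request past the last data
   extent yields sr_eof = TRUE with an all-zero extent.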

3663	15.  NFSv4.2 Callback Operations

3665	15.1.  Operation 15: CB_OFFLOAD - Report results of an asynchronous
3666	       operation

3668	15.1.1.  ARGUMENT

3670	   struct write_response4 {
3671	           stateid4        wr_callback_id<1>;
3672	           count4          wr_count;
3673	           stable_how4     wr_committed;
3674	           verifier4       wr_writeverf;
3675	   };

3677	   union offload_info4 switch (nfsstat4 coa_status) {
3678	   case NFS4_OK:
3679	           write_response4 coa_resok4;
3680	   default:
3681	           length4         coa_bytes_copied;
3682	   };

3684	   struct CB_OFFLOAD4args {
3685	           nfs_fh4         coa_fh;
3686	           stateid4        coa_stateid;
3687	           offload_info4   coa_offload_info;
3688	   };

3690	15.1.2.  RESULT

3692	   struct CB_OFFLOAD4res {
3693	           nfsstat4        cor_status;
3694	   };

3696	15.1.3.  DESCRIPTION

3698	   CB_OFFLOAD is used to report to the client the results of an
3699	   asynchronous operation, e.g., Server-side Copy or a hole punch.  The
3700	   coa_fh and coa_stateid identify the transaction and the coa_status
3701	   indicates success or failure.  The coa_resok4.wr_callback_id MUST NOT
3702	   be set.  If the transaction failed, then the coa_bytes_copied
3703	   contains the number of bytes copied before the failure occurred.  The
3704	   coa_bytes_copied value indicates the number of bytes copied but not
3705	   which specific bytes have been copied.
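
   As a rough illustration of how a client might interpret the
   offload_info4 union, the following sketch mirrors the XDR above with
   plain Python classes.  The class names, the summarize_offload()
   helper, and the tuple it returns are assumptions made for this
   example; only the field names come from the XDR.

```python
from dataclasses import dataclass
from typing import Optional

NFS4_OK = 0      # nfsstat4 success value
NFS4ERR_IO = 5   # one possible failure status, used in the usage example


@dataclass
class WriteResponse:             # mirrors write_response4
    wr_count: int
    wr_committed: int            # stable_how4, e.g. 2 for FILE_SYNC4
    wr_writeverf: bytes
    wr_callback_id: tuple = ()   # MUST NOT be set in a CB_OFFLOAD


@dataclass
class OffloadInfo:               # mirrors offload_info4
    coa_status: int
    coa_resok4: Optional[WriteResponse] = None
    coa_bytes_copied: int = 0


def summarize_offload(info):
    """Return ("ok", total_bytes) or ("failed", bytes_before_failure)."""
    if info.coa_status == NFS4_OK:
        res = info.coa_resok4
        if res is None or res.wr_callback_id:
            raise ValueError("malformed CB_OFFLOAD success arm")
        return ("ok", res.wr_count)
    # On failure only a byte count is available; it does not say which
    # specific bytes were copied.
    return ("failed", info.coa_bytes_copied)


done = OffloadInfo(NFS4_OK, WriteResponse(1048576, 2, b"\x00" * 8))
partial = OffloadInfo(NFS4ERR_IO, coa_bytes_copied=4096)
print(summarize_offload(done))
print(summarize_offload(partial))
```

   Because coa_bytes_copied carries no range information, a client that
   wants to resume after a failure cannot assume from this value alone
   which byte ranges reached the destination.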

3707	   If the client supports either

3709	   1.  the COPY operation

3711	   2.  the WRITE_PLUS operation and any arm of the discriminated union
3712	       other than NFS4_CONTENT_DATA

3714	   then the client is REQUIRED to support the CB_OFFLOAD operation.
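
   The support requirement above reduces to a simple predicate.  The
   sketch below is illustrative only; the function name and the way a
   client's capabilities are represented (a flag plus a set of arm
   names) are assumptions, not something defined by this document.

```python
def must_support_cb_offload(supports_copy, write_plus_arms):
    """Return True if the client is REQUIRED to support CB_OFFLOAD.

    supports_copy:   True if the client supports the COPY operation.
    write_plus_arms: names of the WRITE_PLUS union arms the client
                     supports (an empty set if WRITE_PLUS is unsupported).
    """
    non_data_arms = set(write_plus_arms) - {"NFS4_CONTENT_DATA"}
    return bool(supports_copy) or bool(non_data_arms)


print(must_support_cb_offload(False, {"NFS4_CONTENT_DATA"}))
print(must_support_cb_offload(False, {"NFS4_CONTENT_DATA", "NFS4_CONTENT_HOLE"}))
```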

3716	   There is a potential race between the reply to the original
3717	   transaction on the forechannel and the CB_OFFLOAD callback on the
3718	   backchannel.  Sections 2.10.6.3 and 20.9.3 of [RFC5661] describe how
3719	   to handle this type of issue.

3721	15.1.3.1.  Server-side Copy

3723	   CB_OFFLOAD is used for both intra- and inter-server asynchronous
3724	   copies.  This operation is sent by the destination server to the
3725	   client in a CB_COMPOUND request.  Upon success, the
3726	   coa_resok4.wr_count presents the total number of bytes copied.

3728	15.1.3.2.  WRITE_PLUS

3730	   CB_OFFLOAD is used to report the completion of either a hole punch or
3731	   an ADH initialization.  Upon success, the coa_resok4 will contain the
3732	   same information that a synchronous WRITE_PLUS would have returned.

3734	16.  IANA Considerations

3736	   This section uses terms that are defined in [RFC5226].

3738	17.  References

3740	17.1.  Normative References

3742	   [4.2xdr]   Haynes, T., "Network File System (NFS) Version 4 Minor
3743	              Version 2 External Data Representation Standard (XDR)
3744	              Description", March 2013.

3746	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
3747	              Resource Identifier (URI): Generic Syntax", STD 66,
3748	              RFC 3986, January 2005.

3750	   [RFC5661]  Shepler, S., Eisler, M., and D. Noveck, "Network File
3751	              System (NFS) Version 4 Minor Version 1 Protocol",
3752	              RFC 5661, January 2010.

3754	   [RFC5664]  Halevy, B., Welch, B., and J. Zelenka, "Object-Based
3755	              Parallel NFS (pNFS) Operations", RFC 5664, January 2010.

3757	   [posix_fadvise]
3758	              The Open Group, "Section 'posix_fadvise()' of System
3759	              Interfaces of The Open Group Base Specifications Issue 6,
3760	              IEEE Std 1003.1, 2004 Edition", 2004.

3762	17.2.  Informative References

3764	   [Ashdown08]
3765	              Ashdown, L., "Chapter 15, Validating Database Files and
3766	              Backups, of Oracle Database Backup and Recovery User's
3767	              Guide 11g Release 1 (11.1)", August 2008.

3769	   [Baira08]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
3770	              Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
3771	              Corruption in the Storage Stack", Proceedings of the 6th
3772	              USENIX Symposium on File and Storage Technologies (FAST
3773	              '08), 2008.

3775	   [FEDFS-ADMIN]
3776	              Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
3777	              Naik, "Administration Protocol for Federated Filesystems",
3778	              draft-ietf-nfsv4-federated-fs-admin (Work In Progress),
3779	              2010.

3781	   [FEDFS-NSDB]
3782	              Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M.
3783	              Naik, "NSDB Protocol for Federated Filesystems",
3784	              draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
3785	              2010.

3787	   [Haynes13]
3788	              Haynes, T., "Requirements for Labeled NFS",
3789	              draft-ietf-nfsv4-labreqs-04 (Work In Progress), 2013.

3791	   [I-D.ietf-nfsv4-rfc3530bis]
3792	              Haynes, T. and D. Noveck, "Network File System (NFS)
3793	              version 4 Protocol", draft-ietf-nfsv4-rfc3530bis-25 (Work
3794	              In Progress), February 2013.

3796	   [IESG08]   IESG, "IESG Processing of RFC Errata for the IETF Stream",
3797	              2008.

3799	   [MLS]      "Section 46.6. Multi-Level Security (MLS) of Deployment
3800	              Guide: Deployment, configuration and administration of Red
3801	              Hat Enterprise Linux 5, Edition 6", 2011.

3803	   [McDougall07]
3804	              McDougall, R. and J. Mauro, "Section 11.4.3, Detecting
3805	              Memory Corruption of Solaris Internals", 2007.

3807	   [Quigley11]
3808	              Quigley, D. and J. Lu, "Registry Specification for MAC
3809	              Security Label Formats",
3810	              draft-quigley-label-format-registry (Work In Progress),
3811	              2011.

3813	   [RFC0959]  Postel, J. and J. Reynolds, "File Transfer Protocol",
3814	              STD 9, RFC 959, October 1985.

3816	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
3817	              Requirement Levels", BCP 14, RFC 2119, March 1997.

3819	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
3820	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
3821	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

3823	   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
3824	              RFC 4506, May 2006.

3826	   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
3827	              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
3828	              May 2008.

3830	   [Strohm11]
3831	              Strohm, R., "Chapter 2, Data Blocks, Extents, and
3832	              Segments, of Oracle Database Concepts 11g Release 1
3833	              (11.1)", January 2011.

3835	Appendix A.  Acknowledgments

3837	   For the pNFS Access Permissions Check, the original draft was by
3838	   Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The work
3839	   was influenced by discussions with Benny Halevy and Bruce Fields.  A
3840	   review was done by Tom Haynes.

3842	   For the Sharing change attribute implementation details with NFSv4
3843	   clients, the original draft was by Trond Myklebust.

3845	   For the NFS Server-side Copy, the original draft was by James
3846	   Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
3847	   Iyer.  Tom Talpey co-authored an unpublished version of that
3848	   document.  It was also reviewed by a number of individuals:
3849	   Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
3850	   Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
3851	   and Nico Williams.

3853	   For the NFS space reservation operations, the original draft was by
3854	   Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

3856	   For the sparse file support, the original draft was by Dean
3857	   Hildebrand and Marc Eshel.  Valuable input and advice were received
3858	   from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
3859	   Richard Scheffenegger.

3861	   For the Application IO Hints, the original draft was by Dean
3862	   Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
3863	   early reviewers included Benny Halevy and Pranoop Erasani.

3865	   For Labeled NFS, the original draft was by David Quigley, James
3866	   Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
3867	   Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
3868	   contributed in the final push to get this accepted.

3870	   During the review process, Talia Reyes-Ortiz helped the sessions run
3871	   smoothly.  While many people contributed here and there, the core
3872	   reviewers were Andy Adamson, Pranoop Erasani, Bruce Fields, Chuck
3873	   Lever, Trond Myklebust, David Noveck, Peter Staubach, and Mike
3874	   Kupfer.

3876	Appendix B.  RFC Editor Notes

3878	   [RFC Editor: please remove this section prior to publishing this
3879	   document as an RFC]

3881	   [RFC Editor: prior to publishing this document as an RFC, please
3882	   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
3883	   RFC number of this document]

3885	Author's Address

3887	   Thomas Haynes (editor)
3888	   NetApp
3889	   495 E Java Dr
3890	   Sunnyvale, CA  95054
3891	   USA

3893	   Phone: +1 408 419 3018
3894	   Email: thomas@netapp.com