idnits 2.17.1 

draft-ietf-nfsv4-minorversion2-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 5 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  == There are 5 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     Furthermore, each DS MUST not report to a client either a sparse
     ADB or data which belongs to another DS.  One implication of this
     requirement is that the app_data_block4's adb_block_size MUST be either
     be the stripe width or the stripe width must be an even multiple of it.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     The second change is to provide a method for the server to notify
     the client that the attribute changed on an open file on the server.  If
     the file is closed, then during the open attempt, the client will gather
     the new attribute value.  The server MUST not communicate the new value
     of the attribute, the client MUST query it.  This requirement stems from
     the need for the client to provide sufficient access rights to the
     attribute.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     When a data server chooses to return a hole result, it has the
     option of returning hole information for the data stored on that data
     server (as defined by the data layout), but it MUST not return results
     for a byte range that includes data managed by another data server.  Data
     servers that can obtain hole information for the parts of the file stored
     on that data server, the data server SHOULD return HOLE_INFO and the byte
     range of the hole stored on that data server.

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (April 25, 2012) is 4382 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '0' is mentioned on line 3785, but not defined

  -- Looks like a reference, but probably isn't: '32K' on line 3785

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 5661 (ref. '2') (Obsoleted by RFC 8881)

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  == Outdated reference: A later version (-05) exists of
     draft-ietf-nfsv4-labreqs-00

  ** Downref: Normative reference to an Informational draft:
     draft-ietf-nfsv4-labreqs (ref. '7')

  == Outdated reference: A later version (-35) exists of
     draft-ietf-nfsv4-rfc3530bis-09

  -- Obsolete informational reference (is this intentional?): RFC 2616 (ref.
     '13') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC
     7235)

  -- Obsolete informational reference (is this intentional?): RFC 5226 (ref.
     '24') (Obsoleted by RFC 8126)


     Summary: 2 errors (**), 0 flaws (~~), 10 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NFSv4                                                          T. Haynes
3	Internet-Draft                                                    Editor
4	Intended status: Standards Track                          April 25, 2012
5	Expires: October 27, 2012

7	                     NFS Version 4 Minor Version 2
8	                 draft-ietf-nfsv4-minorversion2-08.txt

10	Abstract

12	   This Internet-Draft describes NFS version 4 minor version 2,
13	   focusing mainly on the protocol extensions made to NFS version 4
14	   minor version 0 and NFS version 4 minor version 1.  Major extensions
15	   introduced in NFS version 4 minor version 2 include: Server-side
16	   Copy, Space Reservations, and Support for Sparse Files.

18	Requirements Language

20	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
21	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
22	   document are to be interpreted as described in RFC 2119 [1].

24	Status of this Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at http://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on October 27, 2012.

41	Copyright Notice

43	   Copyright (c) 2012 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (http://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	   This document may contain material from IETF Documents or IETF
57	   Contributions published or made publicly available before November
58	   10, 2008.  The person(s) controlling the copyright in some of this
59	   material may not have granted the IETF Trust the right to allow
60	   modifications of such material outside the IETF Standards Process.
61	   Without obtaining an adequate license from the person(s) controlling
62	   the copyright in such materials, this document may not be modified
63	   outside the IETF Standards Process, and derivative works of it may
64	   not be created outside the IETF Standards Process, except to format
65	   it for publication as an RFC or to translate it into languages other
66	   than English.

68	Table of Contents

70	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  6
71	     1.1.   The NFS Version 4 Minor Version 2 Protocol  . . . . . . .  6
72	     1.2.   Scope of This Document  . . . . . . . . . . . . . . . . .  6
73	     1.3.   NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . .  6
74	     1.4.   Overview of NFSv4.2 Features  . . . . . . . . . . . . . .  6
75	       1.4.1.  Sparse Files . . . . . . . . . . . . . . . . . . . . .  6
76	       1.4.2.  Application I/O Advise . . . . . . . . . . . . . . . .  7
77	     1.5.   Differences from NFSv4.1  . . . . . . . . . . . . . . . .  7
78	   2.  NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . .  7
79	     2.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . .  7
80	     2.2.   Protocol Overview . . . . . . . . . . . . . . . . . . . .  8
81	       2.2.1.  Intra-Server Copy  . . . . . . . . . . . . . . . . . .  9
82	       2.2.2.  Inter-Server Copy  . . . . . . . . . . . . . . . . . . 11
83	       2.2.3.  Server-to-Server Copy Protocol . . . . . . . . . . . . 13
84	     2.3.   Operations  . . . . . . . . . . . . . . . . . . . . . . . 15
85	       2.3.1.  netloc4 - Network Locations  . . . . . . . . . . . . . 15
86	       2.3.2.  Copy Offload Stateids  . . . . . . . . . . . . . . . . 16
87	     2.4.   Security Considerations . . . . . . . . . . . . . . . . . 16
88	       2.4.1.  Inter-Server Copy Security . . . . . . . . . . . . . . 16
89	   3.  Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 25
90	     3.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 25
91	     3.2.   Terminology . . . . . . . . . . . . . . . . . . . . . . . 25
92	     3.3.   Determining the next hole/data  . . . . . . . . . . . . . 26
93	   4.  Space Reservation  . . . . . . . . . . . . . . . . . . . . . . 26
94	     4.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 26
95	   5.  Support for Application IO Hints . . . . . . . . . . . . . . . 28
96	     5.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 28
97	     5.2.   POSIX Requirements  . . . . . . . . . . . . . . . . . . . 29
98	     5.3.   Additional Requirements . . . . . . . . . . . . . . . . . 30
99	     5.4.   Security Considerations . . . . . . . . . . . . . . . . . 31
100	     5.5.   IANA Considerations . . . . . . . . . . . . . . . . . . . 31
101	   6.  Application Data Block Support . . . . . . . . . . . . . . . . 31
102	     6.1.   Generic Framework . . . . . . . . . . . . . . . . . . . . 32
103	       6.1.1.  Data Block Representation  . . . . . . . . . . . . . . 32
104	       6.1.2.  Data Content . . . . . . . . . . . . . . . . . . . . . 33
105	     6.2.   pNFS Considerations . . . . . . . . . . . . . . . . . . . 33
106	     6.3.   An Example of Detecting Corruption  . . . . . . . . . . . 34
107	     6.4.   Example of READ_PLUS  . . . . . . . . . . . . . . . . . . 35
108	     6.5.   Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36
109	   7.  Labeled NFS  . . . . . . . . . . . . . . . . . . . . . . . . . 36
110	     7.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 36
111	     7.2.   Definitions . . . . . . . . . . . . . . . . . . . . . . . 37
112	     7.3.   MAC Security Attribute  . . . . . . . . . . . . . . . . . 37
113	       7.3.1.  Interpreting FATTR4_SEC_LABEL  . . . . . . . . . . . . 38
114	       7.3.2.  Delegations  . . . . . . . . . . . . . . . . . . . . . 39
115	       7.3.3.  Permission Checking  . . . . . . . . . . . . . . . . . 39
116	       7.3.4.  Object Creation  . . . . . . . . . . . . . . . . . . . 40
117	       7.3.5.  Existing Objects . . . . . . . . . . . . . . . . . . . 40
118	       7.3.6.  Label Changes  . . . . . . . . . . . . . . . . . . . . 40
119	     7.4.   pNFS Considerations . . . . . . . . . . . . . . . . . . . 41
120	     7.5.   Discovery of Server LNFS Support  . . . . . . . . . . . . 41
121	     7.6.   MAC Security NFS Modes of Operation . . . . . . . . . . . 42
122	       7.6.1.  Full Mode  . . . . . . . . . . . . . . . . . . . . . . 42
123	       7.6.2.  Guest Mode . . . . . . . . . . . . . . . . . . . . . . 43
124	     7.7.   Security Considerations . . . . . . . . . . . . . . . . . 44
125	   8.  Sharing change attribute implementation details with NFSv4
126	       clients  . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
127	     8.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 44
128	     8.2.   Definition of the 'change_attr_type' per-file system
129	            attribute . . . . . . . . . . . . . . . . . . . . . . . . 45
130	   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 46
131	   10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 46
132	     10.1.  Error Definitions . . . . . . . . . . . . . . . . . . . . 46
133	       10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 47
134	       10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 47
135	       10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 47
136	   11. File Attributes  . . . . . . . . . . . . . . . . . . . . . . . 48
137	     11.1.  Attribute Definitions . . . . . . . . . . . . . . . . . . 48
138	   12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 48
139	   13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 52
140	     13.1.  Operation 59: COPY - Initiate a server-side copy  . . . . 52
141	     13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy  . . 59
142	     13.3.  Operation 61: COPY_NOTIFY - Notify a source server of
143	            a future copy . . . . . . . . . . . . . . . . . . . . . . 61
144	     13.4.  Operation 62: COPY_REVOKE - Revoke a destination
145	            server's copy privileges  . . . . . . . . . . . . . . . . 63
146	     13.5.  Operation 63: COPY_STATUS - Poll for status of a
147	            server-side copy  . . . . . . . . . . . . . . . . . . . . 64
148	     13.6.  Modification to Operation 42: EXCHANGE_ID -
149	            Instantiate Client ID . . . . . . . . . . . . . . . . . . 65
150	     13.7.  Operation 64: INITIALIZE  . . . . . . . . . . . . . . . . 66
151	     13.8.  Operation 67: IO_ADVISE - Application I/O access
152	            pattern hints . . . . . . . . . . . . . . . . . . . . . . 70
153	     13.9.  Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 76
154	     13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 79
155	     13.11. Operation 66: SEEK  . . . . . . . . . . . . . . . . . . . 85
156	   14. NFSv4.2 Callback Operations  . . . . . . . . . . . . . . . . . 86
157	     14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that
158	            the File's Attributes Changed . . . . . . . . . . . . . . 86
159	     14.2.  Operation 15: CB_COPY - Report results of a
160	            server-side copy  . . . . . . . . . . . . . . . . . . . . 87
161	   15. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 88
162	   16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 88
163	     16.1.  Normative References  . . . . . . . . . . . . . . . . . . 88
164	     16.2.  Informative References  . . . . . . . . . . . . . . . . . 89
165	   Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . . . 90
166	   Appendix B.  RFC Editor Notes  . . . . . . . . . . . . . . . . . . 91
167	   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 91

169	1.  Introduction

171	1.1.  The NFS Version 4 Minor Version 2 Protocol

173	   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
174	   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
175	   version, NFSv4.0, is described in [10] and the second minor version,
176	   NFSv4.1, is described in [2].  NFSv4.2 follows the guidelines for
177	   minor versioning that are listed in Section 11 of [10].

179	   As a minor version, NFSv4.2 is consistent with the overall goals for
180	   NFSv4, but extends the protocol so as to better meet those goals,
181	   based on experiences with NFSv4.1.  In addition, NFSv4.2 has adopted
182	   some additional goals, which motivate some of the major extensions in
183	   NFSv4.2.

185	1.2.  Scope of This Document

187	   This document describes the NFSv4.2 protocol.  With respect to
188	   NFSv4.0 and NFSv4.1, this document does not:

190	   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to
191	      contrast with NFSv4.2.

193	   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

195	   o  clarify the NFSv4.0 or NFSv4.1 protocols.  That is, any
196	      clarifications made here apply to NFSv4.2 and to neither of the
197	      prior protocols.

199	   The full XDR for NFSv4.2 is presented in [3].

201	1.3.  NFSv4.2 Goals

203	   [[Comment.1: This needs fleshing out! --TH]]

205	1.4.  Overview of NFSv4.2 Features

207	   [[Comment.2: This needs fleshing out! --TH]]

209	1.4.1.  Sparse Files

211	   Two new operations are defined to support the reading of sparse files
212	   (READ_PLUS) and the punching of holes to remove backing storage
213	   (INITIALIZE).

215	1.4.2.  Application I/O Advise

217	   We propose a new IO_ADVISE operation for NFSv4.2 that clients can use
218	   to communicate expected I/O behavior to the server.  By communicating
219	   future I/O behavior such as whether a file will be accessed
220	   sequentially or randomly, and whether a file will or will not be
221	   accessed in the near future, servers can optimize future I/O requests
222	   for a file by, for example, prefetching or evicting data.  This
223	   operation can be used to support the posix_fadvise function as well
224	   as other applications such as databases and video editors.

226	1.5.  Differences from NFSv4.1

228	   In NFSv4.1, the only way to introduce new variants of an operation
229	   was to introduce a new operation.  For example, READ becomes either
230	   READ2 or READ_PLUS.  With the use of discriminated unions as
231	   parameters to such operations in NFSv4.2, it is possible to add a
232	   new arm in a subsequent minor version.  It is also possible to move
233	   such an operation from OPTIONAL/RECOMMENDED to REQUIRED.  Forcing an
234	   implementation to adopt each arm of a discriminated union at such a
235	   time does not meet the spirit of the minor versioning rules.  As
236	   such, new arms of a discriminated union MUST follow the same
237	   guidelines for minor versioning as operations in NFSv4.1 - i.e., they
238	   may not be made REQUIRED.  To support this, a new error code,
239	   NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to
240	   communicate to the client that the operation is supported, but the
241	   specific arm of the discriminated union is not.
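
   The rule above can be sketched as follows.  This is a minimal
   illustration, not the protocol's XDR: the arm names, the handler
   table, and the numeric value given to NFS4ERR_UNION_NOTSUPP are
   hypothetical placeholders chosen for the example.

```python
from enum import Enum

NFS4_OK = 0
# Placeholder value for illustration only; the real code is assigned in the XDR.
NFS4ERR_UNION_NOTSUPP = -1


class Arm(Enum):
    """Hypothetical arms of a discriminated-union argument."""
    DATA = 0
    HOLE = 1
    ADB = 2  # imagined arm added by a later minor version


def handle_data(payload):
    return NFS4_OK, ("data", payload)


def handle_hole(payload):
    return NFS4_OK, ("hole", payload)


# This server implements the operation but only two of its arms.
SUPPORTED_ARMS = {Arm.DATA: handle_data, Arm.HOLE: handle_hole}


def dispatch(arm, payload):
    """Run the handler for a union arm, or report that the arm is unsupported."""
    handler = SUPPORTED_ARMS.get(arm)
    if handler is None:
        # The operation exists, but this arm of the union is not implemented.
        return NFS4ERR_UNION_NOTSUPP, None
    return handler(payload)
```

   A client that receives the error for one arm can still use the
   operation with the arms the server does support.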

243	2.  NFS Server-side Copy

245	2.1.  Introduction

247	   This section describes a server-side copy feature for the NFS
248	   protocol.

250	   The server-side copy feature provides a mechanism for the NFS client
251	   to perform a file copy on the server without the data being
252	   transmitted back and forth over the network.

254	   Without this feature, an NFS client copies data from one location to
255	   another by reading the data from the server over the network, and
256	   then writing the data back over the network to the server.  Using
257	   this server-side copy operation, the client is able to instruct the
258	   server to copy the data locally without the data being sent back and
259	   forth over the network unnecessarily.

261	   In general, this feature is useful whenever data is copied from one
262	   location to another on the server.  It is particularly useful when
263	   copying the contents of a file from a backup.  Backup versions of a
264	   file are copied for a number of reasons, including restoring and
265	   cloning data.

267	   If the source object and destination object are on different file
268	   servers, the file servers will communicate with one another to
269	   perform the copy operation.  The server-to-server protocol by which
270	   this is accomplished is not defined in this document.

272	2.2.  Protocol Overview

274	   The server-side copy offload operations support both intra-server and
275	   inter-server file copies.  An intra-server copy is a copy in which
276	   the source file and destination file reside on the same server.  In
277	   an inter-server copy, the source file and destination file are on
278	   different servers.  In both cases, the copy may be performed
279	   synchronously or asynchronously.

281	   Throughout the rest of this document, we refer to the NFS server
282	   containing the source file as the "source server" and the NFS server
283	   to which the file is transferred as the "destination server".  In the
284	   case of an intra-server copy, the source server and destination
285	   server are the same server.  Therefore, in the context of an
286	   intra-server copy, the terms source server and destination server
287	   refer to the single server performing the copy.

289	   The operations described below are designed to copy files.  Other
290	   file system objects can be copied by building on these operations or
291	   using other techniques.  For example, if the user wishes to copy a
292	   directory, the client can synthesize a directory copy by first
293	   creating the destination directory and then copying the source
294	   directory's files to the new destination directory.  If the user
295	   wishes to copy a namespace junction [11] [12], the client can use the
296	   ONC RPC Federated Filesystem protocol [12] to perform the copy.
297	   Specifically, the client can determine the source junction's
298	   attributes using the FEDFS_LOOKUP_FSN procedure and create a
299	   duplicate junction using the FEDFS_CREATE_JUNCTION procedure.
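
   The directory case described above might look like the following
   sketch.  FakeServer is an in-memory stand-in invented for this
   example, and its mkdir, list_files, and copy methods are hypothetical
   names, not a real NFS client API.  Only files directly inside the
   source directory are copied; subdirectories would need a recursive
   variant.

```python
class FakeServer:
    """In-memory stand-in for an NFS server; not a real client API."""

    def __init__(self):
        self.dirs = {"/"}
        self.files = {}  # path -> bytes

    def mkdir(self, path):
        # Stands in for creating a directory on the server.
        self.dirs.add(path)

    def list_files(self, directory):
        # Regular files directly inside `directory`, in sorted order.
        prefix = directory.rstrip("/") + "/"
        return sorted(
            p for p in self.files
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def copy(self, src, dst):
        # Stands in for the COPY operation: data never leaves the server.
        self.files[dst] = self.files[src]


def copy_directory(server, src_dir, dst_dir):
    """Create dst_dir, then copy each file of src_dir into it."""
    server.mkdir(dst_dir)
    copied = []
    for src in server.list_files(src_dir):
        name = src.rsplit("/", 1)[1]
        dst = dst_dir.rstrip("/") + "/" + name
        server.copy(src, dst)
        copied.append(dst)
    return copied
```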

301	   For the inter-server copy protocol, the operations are defined to be
302	   compatible with a server-to-server copy protocol in which the
303	   destination server reads the file data from the source server.  This
304	   model in which the file data is pulled from the source by the
305	   destination has a number of advantages over a model in which the
306	   source pushes the file data to the destination.  The advantages of
307	   the pull model include:

309	   o  The pull model only requires a remote server (i.e., the
310	      destination server) to be granted read access.  A push model
311	      requires a remote server (i.e., the source server) to be granted
312	      write access, which is more privileged.

314	   o  The pull model allows the destination server to stop reading if it
315	      has run out of space.  In a push model, the destination server
316	      must flow control the source server in this situation.

318	   o  The pull model allows the destination server to easily flow
319	      control the data stream by adjusting the size of its read
320	      operations.  In a push model, the destination server does not have
321	      this ability.  The source server in a push model is capable of
322	      writing chunks larger than the destination server has requested in
323	      attributes and session parameters.  In theory, the destination
324	      server could perform a "short" write in this situation, but this
325	      approach is known to behave poorly in practice.
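
   The flow-control advantages listed above can be illustrated with a
   small simulation.  This is a sketch under stated assumptions: the
   source is modeled as a byte string, and pull_copy, free_space, and
   read_size are names invented for the example rather than protocol
   fields.

```python
def pull_copy(source, free_space, read_size):
    """Destination-driven copy: read chunks until done or out of space.

    source     -- bytes held by the (simulated) source server
    free_space -- bytes the destination can still store
    read_size  -- chunk size the destination chooses for each read
    """
    if read_size <= 0:
        raise ValueError("read_size must be positive")
    received = bytearray()
    offset = 0
    while offset < len(source):
        room = free_space - len(received)
        if room <= 0:
            # Out of space: the destination simply stops issuing reads.
            break
        # Flow control: never ask for more than the destination can hold.
        want = min(read_size, room)
        chunk = source[offset:offset + want]
        received += chunk
        offset += len(chunk)
    return bytes(received), offset == len(source)
```

   Because the destination issues every read, it needs no separate
   mechanism to slow down or stop the source.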

327	   The following operations are provided to support server-side copy:

329	   COPY_NOTIFY:  For inter-server copies, the client sends this
330	      operation to the source server to notify it of a future file copy
331	      from a given destination server for the given user.

333	   COPY_REVOKE:  Also for inter-server copies, the client sends this
334	      operation to the source server to revoke permission to copy a file
335	      for the given user.

337	   COPY:  Used by the client to request a file copy.

339	   COPY_ABORT:  Used by the client to abort an asynchronous file copy.

341	   COPY_STATUS:  Used by the client to poll the status of an
342	      asynchronous file copy.

344	   CB_COPY:  Used by the destination server to report the results of an
345	      asynchronous file copy to the client.

347	   These operations are described in detail in Section 2.3.  This
348	   section provides an overview of how these operations are used to
349	   perform server-side copies.

351	2.2.1.  Intra-Server Copy

353	   To copy a file on a single server, the client uses a COPY operation.
354	   The server may respond to the copy operation with the final results
355	   of the copy or it may perform the copy asynchronously and deliver the
356	   results using a CB_COPY operation callback.  If the copy is performed
357	   asynchronously, the client may poll the status of the copy using
358	   COPY_STATUS or cancel the copy using COPY_ABORT.

360	   A synchronous intra-server copy is shown in Figure 1.  In this
361	   example, the NFS server chooses to perform the copy synchronously.
362	   The copy operation is completed, either successfully or
363	   unsuccessfully, before the server replies to the client's request.
364	   The server's reply contains the final result of the operation.

366	     Client                                  Server
367	        +                                      +
368	        |                                      |
369	        |--- COPY ---------------------------->| Client requests
370	        |<------------------------------------/| a file copy
371	        |                                      |
372	        |                                      |

374	                Figure 1: A synchronous intra-server copy.

376	   An asynchronous intra-server copy is shown in Figure 2.  In this
377	   example, the NFS server performs the copy asynchronously.  The
378	   server's reply to the copy request indicates that the copy operation
379	   was initiated and the final result will be delivered at a later time.
380	   The server's reply also contains a copy stateid.  The client may use
381	   this copy stateid to poll for status information (as shown) or to
382	   cancel the copy using a COPY_ABORT.  When the server completes the
383	   copy, the server performs a callback to the client and reports the
384	   results.

386	     Client                                  Server
387	        +                                      +
388	        |                                      |
389	        |--- COPY ---------------------------->| Client requests
390	        |<------------------------------------/| a file copy
391	        |                                      |
392	        |                                      |
393	        |--- COPY_STATUS --------------------->| Client may poll
394	        |<------------------------------------/| for status
395	        |                                      |
396	        |                  .                   | Multiple COPY_STATUS
397	        |                  .                   | operations may be sent.
398	        |                  .                   |
399	        |                                      |
400	        |<-- CB_COPY --------------------------| Server reports results
401	        |\------------------------------------>|
402	        |                                      |

404	               Figure 2: An asynchronous intra-server copy.
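
   The asynchronous exchange in Figure 2 can be modeled with a toy
   server.  CopyServer, its step method, and the integer stateids are
   assumptions made for this sketch.  What COPY_ABORT does to the final
   callback is also a simplification: here an aborted copy delivers
   none.

```python
import itertools


class CopyServer:
    """Toy server that copies asynchronously, one chunk per step()."""

    def __init__(self, chunk=4):
        self.chunk = chunk
        self.copies = {}  # copy stateid -> job dict
        self._ids = itertools.count(1)

    def copy(self, data, on_done):
        # COPY: start the job and reply at once with a copy stateid.
        stateid = next(self._ids)
        self.copies[stateid] = {"data": data, "done": 0, "cb": on_done}
        return stateid

    def copy_status(self, stateid):
        # COPY_STATUS: bytes copied so far for an active copy.
        return self.copies[stateid]["done"]

    def copy_abort(self, stateid):
        # COPY_ABORT: drop the job; no callback is delivered in this sketch.
        del self.copies[stateid]

    def step(self):
        # Advance every active copy; fire the CB_COPY stand-in on completion.
        for stateid, job in list(self.copies.items()):
            job["done"] = min(job["done"] + self.chunk, len(job["data"]))
            if job["done"] == len(job["data"]):
                del self.copies[stateid]
                job["cb"](stateid, job["done"])
```

   The client keeps the stateid returned by copy and may call
   copy_status any number of times before the callback arrives.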

406	2.2.2.  Inter-Server Copy

408	   A copy may also be performed between two servers.  The copy protocol
409	   is designed to accommodate a variety of network topologies.  As shown
410	   in Figure 3, the client and servers may be connected by multiple
411	   networks.  In particular, the servers may be connected by a
412	   specialized, high-speed network (network 192.168.33.0/24 in the
413	   diagram) that does not include the client.  The protocol allows the
414	   client to set up the copy between the servers (over network
415	   10.11.78.0/24 in the diagram) and for the servers to communicate on
416	   the high-speed network if they choose to do so.

418	                             192.168.33.0/24
419	                 +-------------------------------------+
420	                 |                                     |
421	                 |                                     |
422	                 | 192.168.33.18                       | 192.168.33.56
423	         +-------+------+                       +------+------+
424	         |     Source   |                       | Destination |
425	         +-------+------+                       +------+------+
426	                 | 10.11.78.18                         | 10.11.78.56
427	                 |                                     |
428	                 |                                     |
429	                 |             10.11.78.0/24           |
430	                 +------------------+------------------+
431	                                    |
432	                                    |
433	                                    | 10.11.78.243
434	                              +-----+-----+
435	                              |   Client  |
436	                              +-----------+

438	            Figure 3: An example inter-server network topology.
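
   As an illustration of the topology in Figure 3, the sketch below
   works out which network is reachable by both servers but not by the
   client.  The addresses are those in the figure.  The HOSTS table and
   the function names are assumptions made for the example, and the
   protocol does not require servers to select a network this way.

```python
import ipaddress

# Addresses taken from Figure 3.
HOSTS = {
    "source": ["192.168.33.18", "10.11.78.18"],
    "destination": ["192.168.33.56", "10.11.78.56"],
    "client": ["10.11.78.243"],
}
NETWORKS = [
    ipaddress.ip_network("192.168.33.0/24"),
    ipaddress.ip_network("10.11.78.0/24"),
]


def networks_of(host):
    """Return the set of known networks a host has an address on."""
    addrs = [ipaddress.ip_address(a) for a in HOSTS[host]]
    return {net for net in NETWORKS if any(a in net for a in addrs)}


def server_only_networks():
    """Networks shared by both servers that the client is not attached to."""
    shared = networks_of("source") & networks_of("destination")
    return shared - networks_of("client")
```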

440	   For an inter-server copy, the client notifies the source server that
441	   a file will be copied by the destination server using a COPY_NOTIFY
442	   operation.  The client then initiates the copy by sending the COPY
443	   operation to the destination server.  The destination server may
444	   perform the copy synchronously or asynchronously.

446	   A synchronous inter-server copy is shown in Figure 4.  In this case,
447	   the destination server chooses to perform the copy before responding
448	   to the client's COPY request.

450	   An asynchronous copy is shown in Figure 5.  In this case, the
451	   destination server chooses to respond to the client's COPY request
452	   immediately and then perform the copy asynchronously.

454	     Client                Source         Destination
455	        +                    +                 +
456	        |                    |                 |
457	        |--- COPY_NOTIFY --->|                 |
458	        |<------------------/|                 |
459	        |                    |                 |
460	        |                    |                 |
461	        |--- COPY ---------------------------->|
462	        |                    |                 |
463	        |                    |                 |
464	        |                    |<----- read -----|
465	        |                    |\--------------->|
466	        |                    |                 |
467	        |                    |        .        | Multiple reads may
468	        |                    |        .        | be necessary
469	        |                    |        .        |
470	        |                    |                 |
471	        |                    |                 |
472	        |<------------------------------------/| Destination replies
473	        |                    |                 | to COPY

475	                Figure 4: A synchronous inter-server copy.

477	     Client                Source         Destination
478	        +                    +                 +
479	        |                    |                 |
480	        |--- COPY_NOTIFY --->|                 |
481	        |<------------------/|                 |
482	        |                    |                 |
483	        |                    |                 |
484	        |--- COPY ---------------------------->|
485	        |<------------------------------------/|
486	        |                    |                 |
487	        |                    |                 |
488	        |                    |<----- read -----|
489	        |                    |\--------------->|
490	        |                    |                 |
491	        |                    |        .        | Multiple reads may
492	        |                    |        .        | be necessary
493	        |                    |        .        |
494	        |                    |                 |
495	        |                    |                 |
496	        |--- COPY_STATUS --------------------->| Client may poll
497	        |<------------------------------------/| for status
498	        |                    |                 |
499	        |                    |        .        | Multiple COPY_STATUS
500	        |                    |        .        | operations may be sent
501	        |                    |        .        |
502	        |                    |                 |
503	        |                    |                 |
504	        |                    |                 |
505	        |<-- CB_COPY --------------------------| Destination reports
506	        |\------------------------------------>| results
507	        |                    |                 |

509	               Figure 5: An asynchronous inter-server copy.

511	2.2.3.  Server-to-Server Copy Protocol

513	   During an inter-server copy, the destination server reads the file
514	   data from the source server.  The source server and destination
515	   server are not required to use a specific protocol to transfer the
516	   file data.  The choice of what protocol to use is ultimately the
517	   destination server's decision.

519	2.2.3.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

521	   The destination server MAY use standard NFSv4.x (where x >= 1) to
522	   read the data from the source server.  If NFSv4.x is used for the
523	   server-to-server copy protocol, the destination server can use the
524	   filehandle contained in the COPY request with standard NFSv4.x
525	   operations to read data from the source server.  Specifically, the
526	   destination server may use the NFSv4.x OPEN operation's CLAIM_FH
527	   facility to open the file being copied and obtain an open stateid.
528	   Using the stateid, the destination server may then use NFSv4.x READ
529	   operations to read the file.
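
   As a rough illustration of the read loop implied above (and of the
   "Multiple reads may be necessary" note in Figures 4 and 5), the
   sketch below splits a source file into READ-sized requests.  The
   name plan_reads and the max_read parameter are assumptions made for
   this example; they are not defined by NFSv4.x.

```python
def plan_reads(file_size, max_read):
    """Return (offset, count) pairs that together cover file_size bytes.

    max_read is a hypothetical per-READ limit chosen by the destination.
    """
    if max_read <= 0:
        raise ValueError("max_read must be positive")
    requests = []
    offset = 0
    while offset < file_size:
        # The final request may be shorter than max_read.
        count = min(max_read, file_size - offset)
        requests.append((offset, count))
        offset += count
    return requests
```

   Each pair would correspond to one READ issued with the open stateid
   obtained through CLAIM_FH.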

531	2.2.3.2.  Using an alternative Server-to-Server Copy Protocol

533	   In a homogeneous environment, the source and destination servers
534	   might be able to perform the file copy extremely efficiently using
535	   specialized protocols.  For example, the source and destination
536	   servers might be two nodes sharing a common file system format for
537	   the source and destination file systems.  Thus the source and
538	   destination are in an ideal position to efficiently render the image
539	   of the source file to the destination file by replicating the file
540	   system formats at the block level.  Another possibility is that the
541	   source and destination might be two nodes sharing a common storage
542	   area network, and thus there is no need to copy any data at all, and
543	   instead ownership of the file and its contents might simply be re-
544	   assigned to the destination.  To allow for these possibilities, the
545	   destination server is allowed to use a server-to-server copy protocol
546	   of its choice.

548	   In a heterogeneous environment, using a protocol other than NFSv4.x
549	   (e.g., HTTP [13] or FTP [14]) presents some challenges.  In
550	   particular, the destination server is presented with the challenge of
551	   accessing the source file given only an NFSv4.x filehandle.

553	   One option for protocols that identify source files with path names
554	   is to use an ASCII hexadecimal representation of the source
555	   filehandle as the file name.

557	   Another option for the source server is to use URLs to direct the
558	   destination server to a specialized service.  For example, the
559	   response to COPY_NOTIFY could include the URL
560	   ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII
561	   hexadecimal representation of the source filehandle.  When the
562	   destination server receives the source server's URL, it would use
563	   "_FH/0x12345" as the file name to pass to the FTP server listening on
564	   port 9999 of s1.example.com.  On port 9999 there would be a special
565	   instance of the FTP service that understands how to convert NFS
566	   filehandles to an open file descriptor (in many operating systems,
567	   this would require a new system call, one which is the inverse of the
568	   makefh() function that the pre-NFSv4 MOUNT service needs).
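
   The URL handling described above can be sketched as follows.  This
   is a minimal example, assuming the destination server simply splits
   the URL returned by COPY_NOTIFY; the helper name ftp_target is
   hypothetical.

```python
from urllib.parse import urlsplit


def ftp_target(url):
    """Split a copy URL into (host, port, file name).

    The file name is the URL path without its leading slash, which is
    what would be passed to the FTP service.
    """
    parts = urlsplit(url)
    return parts.hostname, parts.port, parts.path.lstrip("/")
```

   For the URL in the example above, this yields the host
   s1.example.com, the port 9999, and the file name "_FH/0x12345".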

570	   Authenticating and identifying the destination server to the source
571	   server is also a challenge.  Recommendations for how to accomplish
572	   this are given in Section 2.4.1.2.4 and Section 2.4.1.4.

574	2.3.  Operations

576	   In the sections that follow, several operations are defined that
577	   together provide the server-side copy feature.  These operations are
578	   intended to be OPTIONAL operations as defined in section 17 of [2].
579	   The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS
580	   operations are designed to be sent within an NFSv4 COMPOUND
581	   procedure.  The CB_COPY operation is designed to be sent within an
582	   NFSv4 CB_COMPOUND procedure.

584	   Each operation is performed in the context of the user identified by
585	   the ONC RPC credential of its containing COMPOUND or CB_COMPOUND
586	   request.  For example, a COPY_ABORT operation issued by a given user
587	   indicates that a specified COPY operation initiated by the same user
588	   be canceled.  Therefore a COPY_ABORT MUST NOT interfere with a copy
589	   of the same file initiated by another user.

591	   An NFS server MAY allow an administrative user to monitor or cancel
592	   copy operations using an implementation-specific interface.

594	2.3.1.  netloc4 - Network Locations

596	   The server-side copy operations specify network locations using the
597	   netloc4 data type shown below:

599	   enum netloc_type4 {
600	           NL4_NAME        = 0,
601	           NL4_URL         = 1,
602	           NL4_NETADDR     = 2
603	   };
604	   union netloc4 switch (netloc_type4 nl_type) {
605	           case NL4_NAME:          utf8str_cis nl_name;
606	           case NL4_URL:           utf8str_cis nl_url;
607	           case NL4_NETADDR:       netaddr4    nl_addr;
608	   };

610	   If the netloc4 is of type NL4_NAME, the nl_name field MUST be
611	   specified as a UTF-8 string.  The nl_name is expected to be resolved
612	   to a network address via DNS, LDAP, NIS, /etc/hosts, or some other
613	   means.  If the netloc4 is of type NL4_URL, a server URL [4]
614	   appropriate for the server-to-server copy operation is specified as a
615	   UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr
616	   field MUST contain a valid netaddr4 as defined in Section 3.3.9 of
617	   [2].
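
   One way to model the netloc4 union in an implementation is a tagged
   value whose payload is checked against its type.  The sketch below
   is illustrative only: the class names are assumptions, and the
   NL4_NETADDR payload is represented as a (netid, universal address)
   pair of strings.

```python
from dataclasses import dataclass
from enum import IntEnum


class NetlocType4(IntEnum):
    NL4_NAME = 0
    NL4_URL = 1
    NL4_NETADDR = 2


@dataclass(frozen=True)
class Netloc4:
    nl_type: NetlocType4
    value: object  # str for NL4_NAME/NL4_URL, (netid, uaddr) for NL4_NETADDR

    def __post_init__(self):
        if self.nl_type in (NetlocType4.NL4_NAME, NetlocType4.NL4_URL):
            if not isinstance(self.value, str):
                raise TypeError("name and URL locations carry a string")
        elif self.nl_type == NetlocType4.NL4_NETADDR:
            ok = (isinstance(self.value, tuple) and len(self.value) == 2
                  and all(isinstance(v, str) for v in self.value))
            if not ok:
                raise TypeError("netaddr locations carry (netid, uaddr)")
        else:
            raise ValueError("unknown netloc_type4")
```

   For example, Netloc4(NetlocType4.NL4_NAME, "s1.example.com") is
   accepted, while passing a string for NL4_NETADDR raises TypeError.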

619	   When netloc4 values are used for an inter-server copy as shown in
620	   Figure 3, their values may be evaluated on the source server,
621	   destination server, and client.  The network environment in which
622	   these systems operate should be configured so that the netloc4 values
623	   are interpreted as intended on each system.

625	2.3.2.  Copy Offload Stateids

627	   A server may perform a copy offload operation asynchronously.  An
628	   asynchronous copy is tracked using a copy offload stateid.  Copy
629	   offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS,
630	   and CB_COPY operations.

632	   Section 8.2.4 of [2] specifies that stateids are valid until either
633	   (A) the client or server restart or (B) the client returns the
634	   resource.

636	   A copy offload stateid will be valid until either (A) the client or
637	   server restarts or (B) the client returns the resource by issuing a
638	   COPY_ABORT operation or the client replies to a CB_COPY operation.

640	   A copy offload stateid's seqid MUST NOT be 0 (zero).  In the context
641	   of a copy offload operation, it is ambiguous to indicate the most
642	   recent copy offload operation using a stateid with seqid of 0 (zero).
643	   Therefore a copy offload stateid with seqid of 0 (zero) MUST be
644	   considered invalid.
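
   The seqid rule can be expressed as a simple check.  This is a
   minimal sketch; the Stateid4 class and the function name are
   assumptions for illustration, and the 12-byte length of the "other"
   field follows the NFSv4 stateid4 layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid4:
    seqid: int
    other: bytes  # 12 opaque bytes in the NFSv4 stateid4 layout


def is_valid_copy_offload_stateid(sid):
    """A copy offload stateid with a seqid of 0 is treated as invalid."""
    return sid.seqid != 0 and len(sid.other) == 12
```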

646	2.4.  Security Considerations

648	   The security considerations pertaining to NFSv4 [10] apply to this
649	   document.

651	   The standard security mechanisms provided by NFSv4 [10] may be used to
652	   secure the protocol described in this document.

654	   NFSv4 clients and servers supporting the inter-server copy
655	   operations described in this document are REQUIRED to implement [5],
656	   including the RPCSEC_GSSv3 privileges copy_from_auth and
657	   copy_to_auth.  If the server-to-server copy protocol is ONC RPC
658	   based, the servers are also REQUIRED to implement the RPCSEC_GSSv3
659	   privilege copy_confirm_auth.  These requirements to implement are not
660	   requirements to use.  NFSv4 clients and servers are RECOMMENDED to
661	   use [5] to secure server-side copy operations.

663	2.4.1.  Inter-Server Copy Security

665	2.4.1.1.  Requirements for Secure Inter-Server Copy

667	   Inter-server copy is driven by several requirements:

669	   o  The specification MUST NOT mandate an inter-server copy protocol.
670	      There are many ways to copy data.  Some will be more optimal than
671	      others depending on the identities of the source server and
672	      destination server.  For example, the source and destination
673	      servers might be two nodes sharing a common file system format for
674	      the source and destination file systems.  Thus the source and
675	      destination are in an ideal position to efficiently render the
676	      image of the source file to the destination file by replicating
677	      the file system formats at the block level.  In other cases, the
678	      source and destination might be two nodes sharing a common storage
679	      area network, and thus there is no need to copy any data at all,
680	      and instead ownership of the file and its contents simply gets re-
681	      assigned to the destination.

683	   o  The specification MUST provide guidance for using NFSv4.x as a
684	      copy protocol.  For those source and destination servers willing
685	      to use NFSv4.x there are specific security considerations that
686	      this specification can and does address.

688	   o  The specification MUST NOT mandate pre-configuration between the
689	      source and destination server.  Requiring that the source and
690	      destination first have a "copying relationship" increases the
691	      administrative burden.  However, the specification MUST NOT
692	      preclude implementations that require pre-configuration.

694	   o  The specification MUST NOT mandate a trust relationship between
695	      the source and destination server.  The NFSv4 security model
696	      requires mutual authentication between a principal on an NFS
697	      client and a principal on an NFS server.  This model MUST continue
698	      with the introduction of COPY.

700	2.4.1.2.  Inter-Server Copy with RPCSEC_GSSv3

702	   When the client sends a COPY_NOTIFY to the source server, telling it to
703	   expect the destination to attempt to copy data from the source, it is
704	   expected that this copy is being done on behalf of the principal
705	   (called the "user principal") that sent the RPC request that encloses
706	   the COMPOUND procedure that contains the COPY_NOTIFY operation.  The
707	   user principal is identified by the RPC credentials.  A mechanism
708	   that allows the user principal to authorize the destination server to
709	   perform the copy in a manner that lets the source server properly
710	   authenticate the destination's copy, and without allowing the
711	   destination to exceed its authorization is necessary.

713	   An approach that sends delegated credentials of the client's user
714	   principal to the destination server is not used for the following
715	   reasons.  If the client's user delegated its credentials, the
716	   destination would authenticate as the user principal.  If the
717	   destination were using the NFSv4 protocol to perform the copy, then
718	   the source server would authenticate the destination server as the
719	   user principal, and the file copy would securely proceed.  However,
720	   this approach would allow the destination server to copy other files.
721	   The user principal would have to trust the destination server to not
722	   do so.  This is counter to the requirements, and therefore is not
723	   considered.  Instead an approach using RPCSEC_GSSv3 [5] privileges is
724	   proposed.

726	   One of the stated applications of the proposed RPCSEC_GSSv3 protocol
727	   is compound client host and user authentication [+ privilege
728	   assertion].  For inter-server file copy, we require compound NFS
729	   server host and user authentication [+ privilege assertion].  The
730	   distinction between the two is one without meaning.

732	   RPCSEC_GSSv3 introduces the notion of privileges.  We define three
733	   privileges:

735	   copy_from_auth:  A user principal is authorizing a source principal
736	      ("nfs@<source>") to allow a destination principal ("nfs@
737	      <destination>") to copy a file from the source to the destination.
738	      This privilege is established on the source server before the user
739	      principal sends a COPY_NOTIFY operation to the source server.

741	   struct copy_from_auth_priv {
742	           secret4             cfap_shared_secret;
743	           netloc4             cfap_destination;
744	           /* the NFSv4 user name that the user principal maps to */
745	           utf8str_mixed       cfap_username;
746	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
747	           unsigned int        cfap_seq_num;
748	   };

750	      cfap_shared_secret is a secret value the user principal generates.

752	   copy_to_auth:  A user principal is authorizing a destination
753	      principal ("nfs@<destination>") to allow it to copy a file from
754	      the source to the destination.  This privilege is established on
755	      the destination server before the user principal sends a COPY
756	      operation to the destination server.

758	   struct copy_to_auth_priv {
759	           /* equal to cfap_shared_secret */
760	           secret4              ctap_shared_secret;
761	           netloc4              ctap_source;
762	           /* the NFSv4 user name that the user principal maps to */
763	           utf8str_mixed        ctap_username;
764	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
765	           unsigned int         ctap_seq_num;
766	   };

768	      ctap_shared_secret is a secret value the user principal generated
769	      and was used to establish the copy_from_auth privilege with the
770	      source principal.

772	   copy_confirm_auth:  A destination principal is confirming with the
773	      source principal that it is authorized to copy data from the
774	      source on behalf of the user principal.  When the inter-server
775	      copy protocol is NFSv4, or for that matter, any protocol capable
776	      of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol),
777	      this privilege is established before the file is copied from the
778	      source to the destination.

780	   struct copy_confirm_auth_priv {
781	           /* equal to GSS_GetMIC() of cfap_shared_secret */
782	           opaque              ccap_shared_secret_mic<>;
783	           /* the NFSv4 user name that the user principal maps to */
784	           utf8str_mixed       ccap_username;
785	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
786	           unsigned int        ccap_seq_num;
787	   };

789	2.4.1.2.1.  Establishing a Security Context

791	   When the user principal wants to COPY a file between two servers, if
792	   it has not established copy_from_auth and copy_to_auth privileges on
793	   the servers, it establishes them:

795	   o  The user principal generates a secret it will share with the two
796	      servers.  This shared secret will be placed in the
797	      cfap_shared_secret and ctap_shared_secret fields of the
798	      appropriate privilege data types, copy_from_auth_priv and
799	      copy_to_auth_priv.

801	   o  An instance of copy_from_auth_priv is filled in with the shared
802	      secret, the destination server, and the NFSv4 user id of the user
803	      principal.  It will be sent with an RPCSEC_GSS3_CREATE procedure,
804	      and so cfap_seq_num is set to the seq_num of the credential of the
805	      RPCSEC_GSS3_CREATE procedure.  Because cfap_shared_secret is a
806	      secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with
807	      privacy) is invoked on copy_from_auth_priv.  The
808	      RPCSEC_GSS3_CREATE procedure's arguments are:

810	      struct {
811	         rpc_gss3_gss_binding    *compound_binding;
812	         rpc_gss3_chan_binding   *chan_binding_mic;
813	         rpc_gss3_assertion      assertions<>;
814	         rpc_gss3_extension      extensions<>;
815	      } rpc_gss3_create_args;

817	      The string "copy_from_auth" is placed in assertions[0].privs.  The
818	      output of GSS_Wrap() is placed in extensions[0].data.  The field
819	      extensions[0].critical is set to TRUE.  The source server calls
820	      GSS_Unwrap() on the privilege, and verifies that the seq_num
821	      matches the credential.  It then verifies that the NFSv4 user id
822	      being asserted matches the source server's mapping of the user
823	      principal.  If it does, the privilege is established on the source
824	      server as: <"copy_from_auth", user id, destination>.  The
825	      successful reply to RPCSEC_GSS3_CREATE has:

827	      struct {
828	         opaque                  handle<>;
829	         rpc_gss3_chan_binding   *chan_binding_mic;
830	         rpc_gss3_assertion      granted_assertions<>;
831	         rpc_gss3_assertion      server_assertions<>;
832	         rpc_gss3_extension      extensions<>;
833	      } rpc_gss3_create_res;

835	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
836	      use on COPY_NOTIFY requests involving the source and destination
837	      server. granted_assertions[0].privs will be equal to
838	      "copy_from_auth".  The server will return a GSS_Wrap() of
839	      copy_to_auth_priv.

841	   o  An instance of copy_to_auth_priv is filled in with the shared
842	      secret, the source server, and the NFSv4 user id.  It will be sent
843	      with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set
844	      to the seq_num of the credential of the RPCSEC_GSS3_CREATE
845	      procedure.  Because ctap_shared_secret is a secret, after XDR
846	      encoding copy_to_auth_priv, GSS_Wrap() is invoked on
847	      copy_to_auth_priv.  The RPCSEC_GSS3_CREATE procedure's arguments
848	      are:

850	      struct {
851	         rpc_gss3_gss_binding    *compound_binding;
852	         rpc_gss3_chan_binding   *chan_binding_mic;
853	         rpc_gss3_assertion      assertions<>;
854	         rpc_gss3_extension      extensions<>;
855	      } rpc_gss3_create_args;

857	      The string "copy_to_auth" is placed in assertions[0].privs.  The
858	      output of GSS_Wrap() is placed in extensions[0].data.  The field
859	      extensions[0].critical is set to TRUE.  After unwrapping,
860	      verifying the seq_num, and the user principal to NFSv4 user ID
861	      mapping, the destination establishes a privilege of
862	      <"copy_to_auth", user id, source>.  The successful reply to
863	      RPCSEC_GSS3_CREATE has:

865	      struct {
866	         opaque                  handle<>;
867	         rpc_gss3_chan_binding   *chan_binding_mic;
868	         rpc_gss3_assertion      granted_assertions<>;
869	         rpc_gss3_assertion      server_assertions<>;
870	         rpc_gss3_extension      extensions<>;
871	      } rpc_gss3_create_res;

873	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
874	      use on COPY requests involving the source and destination server.
875	      The field granted_assertions[0].privs will be equal to
876	      "copy_to_auth".  The server will return a GSS_Wrap() of
877	      copy_to_auth_priv.
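
   The relationship between the two privilege structures built in the
   steps above can be sketched as follows.  This is an illustration
   under stated assumptions, not an implementation of RPCSEC_GSSv3:
   XDR encoding, GSS_Wrap(), and the RPCSEC_GSS3_CREATE exchange are
   omitted, network locations are simplified to strings, and the
   helper name make_privileges and the 32-byte secret length are
   hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyFromAuthPriv:
    cfap_shared_secret: bytes
    cfap_destination: str
    cfap_username: str
    cfap_seq_num: int


@dataclass(frozen=True)
class CopyToAuthPriv:
    ctap_shared_secret: bytes
    ctap_source: str
    ctap_username: str
    ctap_seq_num: int


def make_privileges(source, destination, username,
                    from_seq_num, to_seq_num, secret_len=32):
    """Build both privilege payloads around one freshly generated secret.

    from_seq_num and to_seq_num stand in for the seq_num of the
    credential used on each RPCSEC_GSS3_CREATE call.
    """
    secret = secrets.token_bytes(secret_len)
    cfap = CopyFromAuthPriv(secret, destination, username, from_seq_num)
    ctap = CopyToAuthPriv(secret, source, username, to_seq_num)
    return cfap, ctap
```

   The point of the sketch is that both payloads carry the same secret
   and user name, while each names the opposite server.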

879	2.4.1.2.2.  Starting a Secure Inter-Server Copy

881	   When the client sends a COPY_NOTIFY request to the source server, it
882	   uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle.
883	   cna_destination_server in COPY_NOTIFY MUST be the same as the name of
884	   the destination server specified in copy_from_auth_priv.  Otherwise,
885	   COPY_NOTIFY will fail with NFS4ERR_ACCESS.  The source server
886	   verifies that the privilege <"copy_from_auth", user id, destination>
887	   exists, and annotates it with the source filehandle, if the user
888	   principal has read access to the source file, and if administrative
889	   policies give the user principal and the NFS client read access to
890	   the source file (i.e., if the ACCESS operation would grant read
891	   access).  Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS.

893	   When the client sends a COPY request to the destination server, it
894	   uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle.
895	   ca_source_server in COPY MUST be the same as the name of the source
896	   server specified in copy_to_auth_priv.  Otherwise, COPY will fail
897	   with NFS4ERR_ACCESS.  The destination server verifies that the
898	   privilege <"copy_to_auth", user id, source> exists, and annotates it
899	   with the source and destination filehandles.  If the client has
900	   failed to establish the "copy_to_auth" privilege, the destination
901	   server will reject the request with NFS4ERR_PARTNER_NO_AUTH.

903	   If the client sends a COPY_REVOKE to the source server to rescind the
904	   destination server's copy privilege, it uses the privileged
905	   "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server
906	   in COPY_REVOKE MUST be the same as the name of the destination server
907	   specified in copy_from_auth_priv.  The source server will then delete
908	   the <"copy_from_auth", user id, destination> privilege and fail any
909	   subsequent copy requests sent under the auspices of this privilege
910	   from the destination server.
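
   The source server's bookkeeping for COPY_NOTIFY and COPY_REVOKE
   might look like the sketch below.  The class, its method names, and
   the use of strings for status codes and identities are assumptions
   for illustration; a real server would key privileges off the
   RPCSEC_GSSv3 handle.

```python
class SourcePrivileges:
    """Tracks <"copy_from_auth", user id, destination> privileges."""

    def __init__(self):
        # Maps a privilege triple to its annotated source filehandle.
        self._privs = {}

    def establish(self, user, destination):
        self._privs[("copy_from_auth", user, destination)] = None

    def copy_notify(self, user, destination, source_fh, has_read_access):
        key = ("copy_from_auth", user, destination)
        # Fail if the privilege is missing (for example, the destination
        # does not match) or if read access would not be granted.
        if key not in self._privs or not has_read_access:
            return "NFS4ERR_ACCESS"
        self._privs[key] = source_fh  # annotate with the source filehandle
        return "NFS4_OK"

    def copy_revoke(self, user, destination):
        # Later copy requests under this privilege will no longer match.
        self._privs.pop(("copy_from_auth", user, destination), None)
```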

912	2.4.1.2.3.  Securing ONC RPC Server-to-Server Copy Protocols

914	   After a destination server has a "copy_to_auth" privilege established
915	   on it, and it receives a COPY request, if it knows it will use an ONC
916	   RPC protocol to copy data, it will establish a "copy_confirm_auth"
917	   privilege on the source server, using nfs@<destination> as the
918	   initiator principal, and nfs@<source> as the target principal.

920	   The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of
921	   the shared secret passed in the copy_to_auth privilege.  The field
922	   ccap_username is the mapping of the user principal to an NFSv4 user
923	   name ("user"@"domain" form), and MUST be the same as ctap_username
924	   and cfap_username.  The field ccap_seq_num is the seq_num of the
925	   RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the
926	   destination will send to the source server to establish the
927	   privilege.

929	   The source server verifies the privilege, and establishes a
930	   <"copy_confirm_auth", user id, destination> privilege.  If the source
931	   server fails to verify the privilege, the COPY operation will be
932	   rejected with NFS4ERR_PARTNER_NO_AUTH.  All subsequent ONC RPC
933	   requests sent from the destination to copy data from the source to
934	   the destination will use the RPCSEC_GSSv3 handle returned by the
935	   source's RPCSEC_GSS3_CREATE response.

937	   Note that the use of the "copy_confirm_auth" privilege accomplishes
938	   the following:

940	   o  if a protocol like NFS is being used, with export policies, export
941	      policies can be overridden in case the destination server as-an-
942	      NFS-client is not authorized

944	   o  manual configuration to allow a copy relationship between the
945	      source and destination is not needed.

947	   If the attempt to establish a "copy_confirm_auth" privilege fails,
948	   then when the user principal sends a COPY request to destination, the
949	   destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

951	2.4.1.2.4.  Securing Non ONC RPC Server-to-Server Copy Protocols

953	   If the destination won't be using ONC RPC to copy the data, then the
954	   source and destination are using an unspecified copy protocol.  The
955	   destination could use the shared secret and the NFSv4 user id to
956	   prove to the source server that the user principal has authorized the
957	   copy.

959	   For protocols that authenticate user names with passwords (e.g., HTTP
960	   [13] and FTP [14]), the NFSv4 user id could be used as the user name,
961	   and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
962	   secret could be used as the user password or as input into non-
963	   password authentication methods like CHAP [15].

965	2.4.1.3.  Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

967	   ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
968	   server-side copy offload operations described in this document.  In
969	   particular, host-based ONC RPC security flavors such as AUTH_NONE and
970	   AUTH_SYS MAY be used.  If a host-based security flavor is used, a
971	   minimal level of protection for the server-to-server copy protocol is
972	   possible.

974	   In the absence of strong security mechanisms such as RPCSEC_GSSv3,
975	   the challenge is how the source server and destination server
976	   identify themselves to each other, especially in the presence of
977	   multi-homed source and destination servers.  In a multi-homed
978	   environment, the destination server might not contact the source
979	   server from the same network address specified by the client in the
980	   COPY_NOTIFY.  This can be overcome using the procedure described
981	   below.

983	   When the client sends the source server the COPY_NOTIFY operation,
984	   the source server may reply to the client with a list of target
985	   addresses, names, and/or URLs and assign them to the unique
986	   quadruple: <random number, source fh, user ID, destination address
987	   Y>.  If the destination uses one of these target netlocs to contact
988	   the source server, the source server will be able to uniquely
989	   identify the destination server, even if the destination server does
990	   not connect from the address specified by the client in COPY_NOTIFY.
991	   The level of assurance in this identification depends on the
992	   unpredictability, strength and secrecy of the random number.

994	   For example, suppose the network topology is as shown in Figure 3.
995	   If the source filehandle is 0x12345, the source server may respond to
996	   a COPY_NOTIFY for destination 10.11.78.56 with the URLs:

998	      nfs://10.11.78.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/_FH/
999	      0x12345

1001	      nfs://192.168.33.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/
1002	      _FH/0x12345

1004	   The name component after _COPY is 24 characters of base64, more than
1005	   enough to encode a 128-bit random number.
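
   The size claim can be checked with a short sketch: 18 random bytes
   (144 bits) encode to exactly 24 base64 characters with no padding.
   The function name make_copy_url is hypothetical, and the use of the
   URL-safe base64 alphabet is an assumption made so that the token
   never contains a "/" character.

```python
import base64
import secrets


def make_copy_url(source_addr, destination_addr, fh_hex):
    """Return (token, url) for one source address.

    18 bytes is a multiple of 3, so the encoded token is 24 characters
    long and carries no '=' padding.
    """
    token = base64.urlsafe_b64encode(secrets.token_bytes(18)).decode("ascii")
    url = "nfs://{}//_COPY/{}/{}/_FH/{}".format(
        source_addr, token, destination_addr, fh_hex)
    return token, url
```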

1007	   The client will then send these URLs to the destination server in the
1008	   COPY operation.  Suppose that the 192.168.33.0/24 network is a high
1009	   speed network and the destination server decides to transfer the file
1010	   over this network.  If the destination contacts the source server
1011	   from 192.168.33.56 over this network using NFSv4.1, it does the
1012	   following:

1014	   COMPOUND  { PUTROOTFH ; LOOKUP "_COPY" ; LOOKUP
1015	      "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "10.11.78.56" ; LOOKUP "_FH" ;
1016	      OPEN "0x12345" ; GETFH }

1018	   Provided that the random number is unpredictable and has been kept
1019	   secret by the parties involved, the source server will therefore know
1020	   that these NFSv4.x operations are being issued by the destination
1021	   server identified in the COPY_NOTIFY.  This random number technique
1022	   only provides initial authentication of the destination server, and
1023	   cannot defend against man-in-the-middle attacks after authentication
1024	   or an eavesdropper that observes the random number on the wire.
1025	   Other secure communication techniques (e.g., IPsec) are necessary to
1026	   block these attacks.
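
   On the source server, matching an incoming path against the
   recorded quadruples could be sketched as follows.  The class and
   method names are assumptions for illustration; the comparison of
   the random component uses a constant-time function as a defensive
   choice, which the text above does not require.

```python
import hmac


class CopyTokenTable:
    """Records <random number, source fh, user ID, destination> entries."""

    def __init__(self):
        self._entries = []

    def register(self, token, source_fh, user, destination):
        self._entries.append((token, source_fh, user, destination))

    def identify(self, path):
        """Return the matching entry for a _COPY path, or None."""
        parts = [p for p in path.split("/") if p]
        if len(parts) != 5 or parts[0] != "_COPY" or parts[3] != "_FH":
            return None
        token, destination, fh = parts[1], parts[2], parts[4]
        for entry in self._entries:
            same_token = hmac.compare_digest(entry[0].encode(), token.encode())
            if same_token and entry[1] == fh and entry[3] == destination:
                return entry
        return None
```

   Note that the lookup does not depend on the address the request
   arrives from, which is what lets a multi-homed destination be
   identified.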

1028	2.4.1.4.  Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

1030	   The same techniques as Section 2.4.1.3, using unique URLs for each
1031	   destination server, can be used for other protocols (e.g., HTTP [13]
1032	   and FTP [14]) as well.

1034	3.  Sparse Files

1036	3.1.  Introduction

1038	   A sparse file is a common way of representing a large file without
1039	   having to utilize all of the disk space for it.  Consequently, a
1040	   sparse file uses less physical space than its size indicates.  This
1041	   means the file contains 'holes', byte ranges within the file that
1042	   contain no data.  Most modern file systems support sparse files,
1043	   including most UNIX file systems and NTFS, but notably not Apple's
1044	   HFS+.  Common examples of sparse files include Virtual Machine (VM)
1045	   OS/disk images, database files, log files, and even checkpoint
1046	   recovery files most commonly used by the HPC community.

1048	   If an application reads a hole in a sparse file, the file system must
1049	   return all zeros to the application.  For local data access there is
1050	   little penalty, but with NFS these zeroes must be transferred back to
1051	   the client.  If an application uses the NFS client to read data into
1052	   memory, this wastes time and bandwidth as the application waits for
1053	   the zeroes to be transferred.

1055	   A sparse file is typically created by initializing the file to be all
1056	   zeros - nothing is written to the data in the file, instead the hole
1057	   is recorded in the metadata for the file.  So an 8G disk image might
1058	   be represented initially by a couple hundred bits in the inode and
1059	   nothing on the disk.  If the VM then writes 100M to a file in the
1060	   middle of the image, there would now be two holes represented in the
1061	   metadata and 100M in the data.
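
   The disk image example can be worked through with a small sketch
   that derives the holes from a list of data extents.  The function
   name and the (offset, length) representation are assumptions for
   illustration, and the write is placed at the 4G mark as one
   possible "middle of the image".

```python
def holes(file_size, data_extents):
    """Return (offset, length) holes given (offset, length) data extents."""
    result = []
    cursor = 0
    for start, length in sorted(data_extents):
        if start > cursor:
            result.append((cursor, start - cursor))
        cursor = max(cursor, start + length)
    if cursor < file_size:
        result.append((cursor, file_size - cursor))
    return result


G = 1 << 30
M = 1 << 20

# A freshly created 8G image is one hole.
fresh = holes(8 * G, [])
# After 100M is written at the 4G mark there are two holes.
written = holes(8 * G, [(4 * G, 100 * M)])
```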

1063	   This section introduces a new operation READ_PLUS (Section 13.10)
1064	   which supports all the features of READ but includes an extension to
1065	   support sparse pattern files.  READ_PLUS is guaranteed to perform no
1066	   worse than READ, and can dramatically improve performance with sparse
1067	   files.  READ_PLUS does not depend on pNFS protocol features, but can
1068	   be used by pNFS to support sparse files.

1070	3.2.  Terminology

1072	   Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1074	   Sparse file:  A Regular file that contains one or more Holes.

1076	   Hole:  A byte range within a Sparse file that contains regions of all
1077	      zeroes.  For block-based file systems, this could also be an
1078	      unallocated region of the file.

1080	   Hole Threshold:  The minimum length of a Hole as determined by the
1081	      server.  If a server chooses to define a Hole Threshold, then it
1082	      would not return hole information about holes with a length
1083	      shorter than the Hole Threshold.

1085	3.3.  Determining the next hole/data

1087	   Solaris and ZFS support an extension to lseek(2) that allows
1088	   applications to discover holes in a file.  The values, SEEK_HOLE and
1089	   SEEK_DATA, allow clients to seek to the next hole or beginning of
1090	   data, respectively.
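
As a rough sketch of how an application might walk a file with these lseek values, the following uses Python's os.lseek with os.SEEK_DATA and os.SEEK_HOLE.  It assumes a platform and file system that expose both constants; the helper name data_extents is hypothetical.

```python
import errno
import os
import tempfile


def data_extents(fd: int) -> list:
    """Return [(offset, length), ...] for the data regions of an open file."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no data at or after offset
                break
            raise
        # The end of the file is treated as an implicit hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end - start))
        offset = end
    return extents


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    os.fsync(f.fileno())
    extents = data_extents(f.fileno())
```

A file system that does not track holes may simply report the whole file as one data extent.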

1092	4.  Space Reservation

1094	4.1.  Introduction

1096	   This section describes a set of operations that allow applications
1097	   such as hypervisors to reserve space for a file, report the amount of
1098	   actual disk space a file occupies, and free up the backing space of a
1099	   file when it is not required.  In virtualized environments, virtual
1100	   disk files are often stored on NFS mounted volumes.  Since virtual
1101	   disk files represent the hard disks of virtual machines, hypervisors
1102	   often have to guarantee certain properties for the file.

1104	   One such example is space reservation.  When a hypervisor creates a
1105	   virtual disk file, it often tries to preallocate the space for the
1106	   file so that there are no future allocation related errors during the
1107	   operation of the virtual machine.  Such errors prevent a virtual
1108	   machine from continuing execution and result in downtime.

1110	   Currently, in order to achieve such a guarantee, applications zero
1111	   the entire file.  The initial zeroing allocates the backing blocks
1112	   and all subsequent writes are overwrites of already allocated blocks.
1113	   This approach is not only inefficient in terms of the amount of I/O
1114	   done, it is also not guaranteed to work on filesystems that are log
1115	   structured or deduplicated.  An efficient way of guaranteeing space
1116	   reservation would be beneficial to such applications.

1118	   If the space_reserved attribute is set on a file, it is guaranteed
1119	   that writes that do not grow the file will not fail with
1120	   NFS4ERR_NOSPC.

1122	   Another useful feature would be the ability to report the number of
1123	   blocks that would be freed when a file is deleted.  Currently, NFS
1124	   reports two size attributes:

1126	   size  The logical file size of the file.

1128	   space_used  The size in bytes that the file occupies on disk

1130	   While these attributes are sufficient for space accounting in
1131	   traditional filesystems, they prove to be inadequate in modern
1132	   filesystems that support block sharing.  In such filesystems,
1133	   multiple inodes can point to a single block with a block reference
1134	   count to guard against premature freeing.  Having a way to tell the
1135	   number of blocks that would be freed if the file was deleted would be
1136	   useful to applications that wish to migrate files when a volume is
1137	   low on space.

1139	   Since virtual disks represent a hard drive in a virtual machine, a
1140	   virtual disk can be viewed as a filesystem within a file.  Since not
1141	   all blocks within a filesystem are in use, there is an opportunity to
1142	   reclaim blocks that are no longer in use.  A call to deallocate
1143	   blocks could result in better space efficiency.  Less space may be
1144	   consumed for backups after block deallocation.

1146	   The following operations and attributes can be used to resolve these
1147	   issues:

1149	   space_reserved  This attribute specifies whether the blocks backing
1150	      the file have been preallocated.

1152	   space_freed  This attribute specifies the space freed when a file is
1153	      deleted, taking block sharing into consideration.

1155	   INITIALIZE  This operation zeroes and/or deallocates the blocks
1156	      backing a region of the file.

1158	   If space_used of a file is interpreted to mean the size in bytes of
1159	   all disk blocks pointed to by the inode of the file, then shared
1160	   blocks get double counted, over-reporting the space utilization.
1161	   This also has the adverse effect that the deletion of a file with
1162	   shared blocks frees up less than space_used bytes.

1164	   On the other hand, if space_used is interpreted to mean the size in
1165	   bytes of those disk blocks unique to the inode of the file, then
1166	   shared blocks are not counted in any file, resulting in under-
1167	   reporting of the space utilization.

1169	   For example, two files A and B have 10 blocks each.  Let 6 of these
1170	   blocks be shared between them.  Thus, the combined space utilized by
1171	   the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1172	   combined space utilization of the two files would be reported as 20 *
1173	   BLOCK_SIZE.  However, deleting either would only result in 4 *
1174	   BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1175	   report that the space utilization is only 8 * BLOCK_SIZE.

1177	   Adding another size attribute, space_freed, is helpful in solving
1178	   this problem. space_freed is the number of blocks that are allocated
1179	   to the given file that would be freed on its deletion.  In the
1180	   example, both A and B would report space_freed as 4 * BLOCK_SIZE and
1181	   space_used as 10 * BLOCK_SIZE.  If A is deleted, B will report
1182	   space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
1183	   the deallocation of all 10 blocks.
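
The arithmetic in the A and B example can be checked with a short sketch.  The block size, the block identifiers, and the helper names are assumptions chosen for illustration; only the counts (10 blocks each, 6 shared) come from the text.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only

# Each file maps to the set of block identifiers it references.
files = {
    "A": set(range(0, 10)),   # blocks 0..9
    "B": set(range(4, 14)),   # blocks 4..13; blocks 4..9 are shared with A
}


def space_used(name: str) -> int:
    """Bytes in all blocks the file points to, shared or not."""
    return len(files[name]) * BLOCK_SIZE


def space_freed(name: str) -> int:
    """Bytes in blocks referenced by this file and by no other file."""
    others = set().union(
        *(blocks for other, blocks in files.items() if other != name))
    return len(files[name] - others) * BLOCK_SIZE


actual = len(files["A"] | files["B"]) * BLOCK_SIZE      # 14 blocks
reported = space_used("A") + space_used("B")            # 20 blocks
freed_a, freed_b = space_freed("A"), space_freed("B")   # 4 blocks each

del files["A"]
freed_b_after = space_freed("B")                        # all 10 blocks
```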

1185	   The addition of this attribute doesn't solve the problem of space being
1186	   over-reported.  However, over-reporting is better than under-
1187	   reporting.

1189	5.  Support for Application IO Hints

1191	5.1.  Introduction

1193	   Applications currently have several options for communicating I/O
1194	   access patterns to the NFS client.  While this can help the NFS
1195	   client optimize I/O and caching for a file, it does not allow the NFS
1196	   server and its exported file system to do likewise.  Therefore, here
1197	   we put forth a proposal for the NFSv4.2 protocol to allow
1198	   applications to communicate their expected behavior to the server.

1200	   By communicating expected access pattern, e.g., sequential or random,
1201	   and data re-use behavior, e.g., data range will be read multiple
1202	   times and should be cached, the server will be able to better
1203	   understand what optimizations it should implement for access to a
1204	   file.  For example, if an application indicates it will never read the
1205	   data more than once, then the file system can avoid polluting the
1206	   data cache and not cache the data.

1208	   One interface through which applications can issue I/O hints is the
1209	   posix_fadvise function.  For example, on Linux, when an application
1210	   uses posix_fadvise to specify a file will be read sequentially, Linux
1211	   doubles the readahead buffer size.

1213	   Another instance where applications provide an indication of their
1214	   desired I/O behavior is the use of direct I/O. By specifying direct
1215	   I/O, clients will no longer cache data, but this information is not
1216	   passed to the server, which will continue caching data.

1218	   Application specific NFS clients such as those used by hypervisors
1219	   and databases can also leverage application hints to communicate
1220	   their specialized requirements.

1222	   This section adds a new IO_ADVISE operation to communicate the client
1223	   file access patterns to the NFS server.  The NFS server upon
1224	   receiving an IO_ADVISE operation MAY choose to alter its I/O and
1225	   caching behavior, but is under no obligation to do so.

1227	5.2.  POSIX Requirements

1229	   The first key requirement of the IO_ADVISE operation is to support
1230	   the posix_fadvise function [6], which is supported in Linux and many
1231	   other operating systems.  Examples and guidance on how to use
1232	   posix_fadvise to improve performance can be found here [16].
1233	   posix_fadvise is defined as follows,

1235	      int posix_fadvise(int fd, off_t offset, off_t len, int advice);

1237	   The posix_fadvise() function shall advise the implementation on the
1238	   expected behavior of the application with respect to the data in the
1239	   file associated with the open file descriptor, fd, starting at offset
1240	   and continuing for len bytes.  The specified range need not currently
1241	   exist in the file.  If len is zero, all data following offset is
1242	   specified.  The implementation may use this information to optimize
1243	   handling of the specified data.  The posix_fadvise() function shall
1244	   have no effect on the semantics of other operations on the specified
1245	   data, although it may affect the performance of other operations.

1247	   The advice to be applied to the data is specified by the advice
1248	   parameter and may be one of the following values:

1250	   POSIX_FADV_NORMAL -  Specifies that the application has no advice to
1251	      give on its behavior with respect to the specified data.  It is
1252	      the default characteristic if no advice is given for an open file.

1254	   POSIX_FADV_SEQUENTIAL -  Specifies that the application expects to
1255	      access the specified data sequentially from lower offsets to
1256	      higher offsets.

1258	   POSIX_FADV_RANDOM -  Specifies that the application expects to access
1259	      the specified data in a random order.

1261	   POSIX_FADV_WILLNEED -  Specifies that the application expects to
1262	      access the specified data in the near future.

1264	   POSIX_FADV_DONTNEED -  Specifies that the application expects that it
1265	      will not access the specified data in the near future.

1267	   POSIX_FADV_NOREUSE -  Specifies that the application expects to
1268	      access the specified data once and then not reuse it thereafter.

1270	   Upon successful completion, posix_fadvise() shall return zero;
1271	   otherwise, an error number shall be returned to indicate the error.
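
For illustration, a local application could issue such a hint through Python's os.posix_fadvise, which wraps the function described above.  This is a sketch under the assumption that the platform provides the call; the wrapper name advise and its fallback behavior are hypothetical.

```python
import os
import tempfile


def advise(fd: int, offset: int, length: int, advice_name: str) -> bool:
    """Issue a posix_fadvise hint; return False where it is unavailable."""
    advice = getattr(os, advice_name, None)
    if advice is None or not hasattr(os, "posix_fadvise"):
        return False
    os.posix_fadvise(fd, offset, length, advice)  # raises OSError on failure
    return True


with tempfile.TemporaryFile() as f:
    f.write(b"\0" * 65536)
    f.flush()
    # A length of zero means "all data following offset".
    issued = advise(f.fileno(), 0, 0, "POSIX_FADV_SEQUENTIAL")
```

The hint only reaches the local implementation; nothing in this call carries it to an NFS server.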

1273	5.3.  Additional Requirements

1275	   Many use cases exist for sending application I/O hints to the server
1276	   that cannot utilize the POSIX supported interface.  This is because
1277	   some applications may benefit from additional hints not specified by
1278	   posix_fadvise, and some applications may not use POSIX altogether.

1280	   One use case is "Opportunistic Prefetch", which allows a stateid
1281	   holder to tell the server that it is possible that it will access the
1282	   specified data in the near future.  This is similar to
1283	   POSIX_FADV_WILLNEED, but the client is unsure it will in fact read
1284	   the specified data, so the server should only prefetch the data if it
1285	   can be done at a marginal cost.  For example, when a server receives
1286	   this hint, it could prefetch only the indirect blocks for a file
1287	   instead of all the data.  This would still improve performance if the
1288	   client does read the data, but with less pressure on server memory.

1290	   An example use case for this hint is a database that reads in a
1291	   single record that points to additional records in either other areas
1292	   of the same file or different files located on the same or different
1293	   server.  While it is likely that the application may access the
1294	   additional records, it is far from guaranteed.  Therefore, the
1295	   database may issue an opportunistic prefetch (instead of
1296	   POSIX_FADV_WILLNEED) for the data in the other files pointed to by
1297	   the record.

1299	   Another use case is "Direct I/O", which allows a stateid holder to
1300	   inform the server that it does not wish to cache data.  Today, for
1301	   applications that only intend to read data once, the use of direct
1302	   I/O disables client caching, but does not affect server caching.  By
1303	   caching data that will not be re-read, the server is polluting its
1304	   cache and possibly causing useful cached data to be evicted.  By
1305	   informing the server of its expected I/O access, this situation can
1306	   be avoided.  Direct I/O can be used in Linux and AIX via the open()
1307	   O_DIRECT parameter, in Solaris via the directio() function, and in
1308	   Windows via the CreateFile() FILE_FLAG_NO_BUFFERING flag.

1310	   Another use case is "Backward Sequential Read", which allows a
1311	   stateid holder to inform the server that it intends to read the
1312	   specified data backwards, i.e., from the end to the beginning.  This
1313	   is different from POSIX_FADV_SEQUENTIAL, whose implied intention was
1314	   that data will be read from beginning to end.  This hint allows
1315	   servers to prefetch data at the end of the range first, and then
1316	   prefetch data sequentially in a backwards manner to the start of the
1317	   data range.  One example of an application that can make use of this
1318	   hint is video editing.

1320	5.4.  Security Considerations

1322	   None.

1324	5.5.  IANA Considerations

1326	   The IO_ADVISE_type4 will be extended through an IANA registry.

1328	6.  Application Data Block Support

1330	   At the OS level, files are contained on disk blocks.  Applications
1331	   are also free to impose structure on the data contained in a file and
1332	   we can define an Application Data Block (ADB) to be such a structure.
1333	   From the application's viewpoint, it only wants to handle ADBs and
1334	   not raw bytes (see [17]).  An ADB is typically comprised of two
1335	   sections: a header and data.  The header describes the
1336	   characteristics of the block and can provide a means to detect
1337	   corruption in the data payload.  The data section is typically
1338	   initialized to all zeros.

1340	   The format of the header is application specific, but there are two
1341	   main components typically encountered:

1343	   1.  An ADB Number (ADBN), which allows the application to determine
1344	       which data block is being referenced.  The ADBN is a logical
1345	       block number and is useful when the client is not storing the
1346	       blocks in contiguous memory.

1348	   2.  Fields to describe the state of the ADB and a means to detect
1349	       block corruption.  For both pieces of data, a useful property is
1350	       that allowed values be unique in that if passed across the
1351	       network, corruption due to translation between big and little
1352	       endian architectures is detectable.  For example, 0xF0DEDEF0 has
1353	       the same bit pattern in both architectures.
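
The property claimed for 0xF0DEDEF0 can be checked with a short sketch that packs a 32-bit value in both byte orders and compares the results.  The function name is hypothetical.

```python
import struct


def is_endian_neutral(value: int) -> bool:
    """True if a 32-bit value has the same byte layout in both byte orders."""
    return struct.pack(">I", value) == struct.pack("<I", value)


neutral = is_endian_neutral(0xF0DEDEF0)      # bytes f0 de de f0 either way
not_neutral = is_endian_neutral(0x12345678)  # 12 34 56 78 vs 78 56 34 12
```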

1355	   Applications already impose structures on files [17] and detect
1356	   corruption in data blocks [18].  What they are not able to do is
1357	   efficiently transfer and store ADBs.  To initialize a file with ADBs,
1358	   the client must send the full ADB to the server and that must be
1359	   stored on the server.  When the application is initializing a file to
1360	   have the ADB structure, it could compress the ADBs to just the
1361	   information necessary to later reconstruct the header portion of the
1362	   ADB when the contents are read back.  Using sparse file techniques,
1363	   the disk blocks described by the ADBs would not be allocated.  Unlike
1364	   sparse file techniques, there would be a small cost to store the
1365	   compressed header data.

1367	   In this section, we are going to define a generic framework for an
1368	   ADB, present one approach to detecting corruption in a given ADB
1369	   implementation, and describe the model for how the client and server
1370	   can support efficient initialization of ADBs, reading of ADB holes,
1371	   punching holes in ADBs, and space reservation.  Further, we need to
1372	   be able to extend this model to applications which do not support
1373	   ADBs, but wish to be able to handle sparse files, hole punching, and
1374	   space reservation.

1376	6.1.  Generic Framework

1378	   We want the representation of the ADB to be flexible enough to
1379	   support many different applications.  The most basic approach is no
1380	   imposition of a block at all, which means we are working with the raw
1381	   bytes.  Such an approach would be useful for storing holes, punching
1382	   holes, etc.  In more complex deployments, a server might be
1383	   supporting multiple applications, each with their own definition of
1384	   the ADB.  One might store the ADBN at the start of the block and then
1385	   have a guard pattern to detect corruption [19].  The next might store
1386	   the ADBN at an offset of 100 bytes within the block and have no guard
1387	   pattern at all.  The point is that existing applications might
1388	   already have well defined formats for their data blocks.

1390	   The guard pattern can be used to represent the state of the block, to
1391	   protect against corruption, or both.  Again, it needs to be able to
1392	   be placed anywhere within the ADB.

1394	   We need to be able to represent the starting offset of the block and
1395	   the size of the block.  Note that nothing prevents the application
1396	   from defining different sized blocks in a file.

1398	6.1.1.  Data Block Representation

1400	   struct app_data_block4 {
1401	           offset4         adb_offset;
1402	           length4         adb_block_size;
1403	           length4         adb_block_count;
1404	           length4         adb_reloff_blocknum;
1405	           count4          adb_block_num;
1406	           length4         adb_reloff_pattern;
1407	           opaque          adb_pattern<>;
1408	   };
1409	   The app_data_block4 structure captures the abstraction presented for
1410	   the ADB.  The additional fields present are to allow the transmission
1411	   of adb_block_count ADBs at one time.  We also use adb_block_num to
1412	   convey the ADBN of the first block in the sequence.  Each ADB will
1413	   contain the same adb_pattern string.

1415	   As both adb_block_num and adb_pattern are optional, if either
1416	   adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
1417	   then the corresponding field is not set in any of the ADBs.
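
One way to read the structure is as a compact recipe for a run of blocks.  The sketch below expands such a recipe into bytes.  It is not a normative encoding: storing the ADBN as an 8-byte big-endian integer is an assumption, adb_offset is left out because it only positions the run within the file, and the function name is hypothetical.

```python
import struct

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def expand_adbs(block_size, block_count, reloff_blocknum, block_num,
                reloff_pattern, pattern):
    """Materialize the ADBs described by one app_data_block4-style value."""
    if (reloff_pattern != NFS4_UINT64_MAX
            and reloff_pattern + len(pattern) > block_size):
        raise ValueError("pattern does not fit inside the block")
    blocks = []
    for i in range(block_count):
        block = bytearray(block_size)  # data section starts as all zeros
        if reloff_blocknum != NFS4_UINT64_MAX:
            # Assumption: the ADBN is an 8-byte big-endian integer.
            struct.pack_into(">Q", block, reloff_blocknum, block_num + i)
        if reloff_pattern != NFS4_UINT64_MAX:
            block[reloff_pattern:reloff_pattern + len(pattern)] = pattern
        blocks.append(bytes(block))
    return blocks


# 100 blocks of 4k, ADBN at offset 0, guard pattern at offset 8.
blocks = expand_adbs(4096, 100, 0, 0, 8, bytes.fromhex("feedface"))
# Both relative offsets unset: a plain zero-filled region.
hole = expand_adbs(65536, 1, NFS4_UINT64_MAX, 0, NFS4_UINT64_MAX, b"")
```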

1419	6.1.2.  Data Content

1421	   /*
1422	    * Use an enum such that we can extend new types.
1423	    */
1424	   enum data_content4 {
1425	           NFS4_CONTENT_DATA = 0,
1426	           NFS4_CONTENT_APP_BLOCK = 1,
1427	           NFS4_CONTENT_HOLE = 2
1428	   };

1430	   New operations might need to differentiate between wanting to access
1431	   data versus an ADB.  Also, future minor versions might want to
1432	   introduce new data formats.  This enumeration allows that to occur.

1434	6.2.  pNFS Considerations

1436	   While this document does not mandate how sparse ADBs are recorded on
1437	   the server, it does make the assumption that such information is not
1438	   in the file.  I.e., the information is metadata.  As such, the
1439	   INITIALIZE operation is defined to be not supported by the DS - it
1440	   must be issued to the MDS.  But since the client must not assume a
1441	   priori whether a read is sparse or not, the READ_PLUS operation MUST
1442	   be supported by both the DS and the MDS.  I.e., the client might
1443	   impose on the MDS to asynchronously read the data from the DS.

1445	   Furthermore, each DS MUST NOT report to a client either a sparse ADB
1446	   or data which belongs to another DS.  One implication of this
1447	   requirement is that the app_data_block4's adb_block_size MUST either
1448	   be the stripe width or the stripe width must be an even multiple of
1449	   it.
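
The size constraint can be expressed as a simple divisibility test.  The sketch assumes that ADBs are laid out back to back starting at a stripe boundary, so that a block size dividing the stripe width never straddles two data servers; the function name is hypothetical.

```python
def adb_fits_stripe(adb_block_size: int, stripe_width: int) -> bool:
    """True if the stripe width is a whole multiple of the ADB size."""
    if adb_block_size <= 0 or stripe_width <= 0:
        raise ValueError("sizes must be positive")
    return stripe_width % adb_block_size == 0


ok_equal = adb_fits_stripe(65536, 65536)    # block size equals stripe width
ok_multiple = adb_fits_stripe(4096, 65536)  # 16 ADBs per stripe
straddles = adb_fits_stripe(6000, 65536)    # some ADB would cross a boundary
```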

1451	   The second implication here is that the DS must be able to use the
1452	   Control Protocol to determine from the MDS where the sparse ADBs
1453	   occur.  [[Comment.3: Need to discuss what happens if after the file
1454	   is being written to and an INITIALIZE occurs? --TH]] Perhaps instead
1455	   of the DS pulling from the MDS, the MDS pushes to the DS?  Thus an
1456	   INITIALIZE causes a new push?  [[Comment.4: Still need to consider
1457	   race cases of the DS getting a WRITE and the MDS getting an
1458	   INITIALIZE. --TH]]

1460	6.3.  An Example of Detecting Corruption

1462	   In this section, we define an ADB format in which corruption can be
1463	   detected.  Note that this is just one possible format and means to
1464	   detect corruption.

1466	   Consider a very basic implementation of an operating system's disk
1467	   blocks.  A block is either data or it is an indirect block which
1468	   allows for files to be larger than one block.  It is desired to be
1469	   able to initialize a block.  Lastly, to quickly unlink a file, a
1470	   block can be marked invalid.  The contents remain intact - which
1471	   would enable this OS application to undelete a file.

1473	   The application defines 4k sized data blocks, with an 8 byte block
1474	   counter occurring at offset 0 in the block, and with the guard
1475	   pattern occurring at offset 8 inside the block.  Furthermore, the
1476	   guard pattern can take one of four states:

1478	   0xfeedface -   This is the FREE state and indicates that the ADB
1479	      format has been applied.

1481	   0xcafedead -   This is the DATA state and indicates that real data
1482	      has been written to this block.

1484	   0xe4e5c001 -   This is the INDIRECT state and indicates that the
1485	      block contains block counter numbers that are chained off of this
1486	      block.

1488	   0xba1ed4a3 -   This is the INVALID state and indicates that the block
1489	      contains data whose contents are garbage.

1491	   Finally, it also defines an 8 byte checksum [20] starting at byte 16
1492	   which applies to the remaining contents of the block.  If the state
1493	   is FREE, then that checksum is trivially zero.  As such, the
1494	   application has no need to transfer the checksum implicitly inside
1495	   the ADB - it need not make the transfer layer aware of the fact that
1496	   there is a checksum (see [18] for an example of checksums used to
1497	   detect corruption in application data blocks).

1499	   Corruption in each ADB can be detected as follows:

1501	   o  If the guard pattern is anything other than one of the allowed
1502	      values, including all zeros.

1504	   o  If the guard pattern is FREE and any other byte in the remainder
1505	      of the ADB is anything other than zero.

1507	   o  If the guard pattern is anything other than FREE, then if the
1508	      stored checksum does not match the computed checksum.

1510	   o  If the guard pattern is INDIRECT and one of the stored indirect
1511	      block numbers has a value greater than the number of ADBs in the
1512	      file.

1514	   o  If the guard pattern is INDIRECT and one of the stored indirect
1515	      block numbers is a duplicate of another stored indirect block
1516	      number.

1518	   As can be seen, the application can detect errors based on the
1519	   combination of the guard pattern state and the checksum.  But also,
1520	   the application can detect corruption based on the state and the
1521	   contents of the ADB.  This last point is important in validating the
1522	   minimum amount of data we incorporated into our generic framework.
1523	   I.e., the guard pattern is sufficient in allowing applications to
1524	   design their own corruption detection.
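
The first three checks can be sketched as follows.  Several details are assumptions made for this sketch: big-endian fields, a 4-byte guard pattern, and a byte-sum stand-in for the checksum referenced as [20], chosen only because it is zero for an all-zero payload.  The INDIRECT checks are omitted because the layout of indirect block numbers is not given here.

```python
import struct

BLOCK_SIZE = 4096
FREE, DATA, INDIRECT, INVALID = 0xFEEDFACE, 0xCAFEDEAD, 0xE4E5C001, 0xBA1ED4A3
GUARDS = {FREE, DATA, INDIRECT, INVALID}


def checksum(payload: bytes) -> int:
    """Hypothetical stand-in checksum: byte sum modulo 2**64."""
    return sum(payload) % (1 << 64)


def make_block(number: int, guard: int, payload: bytes = b"") -> bytes:
    """Build a block: counter at 0, guard at 8, checksum at 16, data at 24."""
    block = bytearray(BLOCK_SIZE)
    struct.pack_into(">QI", block, 0, number, guard)
    block[24:24 + len(payload)] = payload
    if guard != FREE:
        struct.pack_into(">Q", block, 16, checksum(bytes(block[24:])))
    return bytes(block)


def is_corrupt(block: bytes) -> bool:
    if len(block) != BLOCK_SIZE:
        return True
    (guard,) = struct.unpack_from(">I", block, 8)
    (stored,) = struct.unpack_from(">Q", block, 16)
    if guard not in GUARDS:          # includes an all-zero guard
        return True
    if guard == FREE:                # everything after the guard must be zero
        return any(block[12:])
    return stored != checksum(block[24:])


good = make_block(4, DATA, b"hello")
damaged = bytearray(good)
damaged[100] ^= 0xFF                 # flip one payload byte
```

Note that none of this runs in the NFS client or server; it is application logic layered on top of the transferred bytes.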

1526	   Finally, it is important to note that none of these corruption checks
1527	   occur in the transport layer.  The server and client components are
1528	   totally unaware of the file format and might report everything as
1529	   being transferred correctly even in the case the application detects
1530	   corruption.

1532	6.4.  Example of READ_PLUS

1534	   The hypothetical application presented in Section 6.3 can be used to
1535	   illustrate how READ_PLUS would return an array of results.  A file is
1536	   created and initialized with 100 4k ADBs in the FREE state:

1538	      INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}

1540	   Further, assume the application writes a single ADB at 16k, changing
1541	   the guard pattern to 0xcafedead, we would then have in memory:

1543	      0 -> (16k - 1)   : 4k, 4, 0, 0, 8, 0xfeedface
1544	      16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX
1545	      20k -> 400k      : 4k, 95, 0, 6, 8, 0xfeedface

1547	   And when the client did a READ_PLUS of 64k at the start of the file,
1548	   it would get back a result of an ADB, some data, and a final ADB:

1550	      ADB {0, 4k, 4, 0, 0, 8, 0xfeedface}
1551	      data 4k
1552	      ADB {20k, 4k, 59, 0, 6, 8, 0xfeedface}

1554	6.5.  Zero Filled Holes

1556	   As applications are free to define the structure of an ADB, it is
1557	   trivial to define an ADB which supports zero filled holes.  Such a
1558	   case would encompass the traditional definitions of a sparse file and
1559	   hole punching.  For example, to punch a 64k hole, starting at 100M,
1560	   into an existing file which has no ADB structure:

1562	      INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
1563	                  0, NFS4_UINT64_MAX, 0x0}

1565	7.  Labeled NFS

1567	7.1.  Introduction

1569	   Access control models such as Unix permissions or Access Control
1570	   Lists are commonly referred to as Discretionary Access Control (DAC)
1571	   models.  These systems base their access decisions on user identity
1572	   and resource ownership.  In contrast Mandatory Access Control (MAC)
1573	   models base their access control decisions on the label on the
1574	   subject (usually a process) and the object it wishes to access [7].
1575	   These labels may contain user identity information but usually
1576	   contain additional information.  In DAC systems users are free to
1577	   specify the access rules for resources that they own.  MAC models
1578	   base their security decisions on a system wide policy established by
1579	   an administrator or organization which the users do not have the
1580	   ability to override.  In this section, we add a MAC model to NFSv4.

1582	   The first change necessary is to devise a method for transporting and
1583	   storing security label data on NFSv4 file objects.  Security labels
1584	   have several semantics that are met by NFSv4 recommended attributes
1585	   such as the ability to set the label value upon object creation.
1586	   Access control on these attributes is done through a combination of
1587	   two mechanisms.  As with other recommended attributes on file objects
1588	   the usual DAC checks (ACLs and permission bits) will be performed to
1589	   ensure that proper file ownership is enforced.  In addition a MAC
1590	   system MAY be employed on the client, server, or both to enforce
1591	   additional policy on what subjects may modify security label
1592	   information.

1594	   The second change is to provide a method for the server to notify the
1595	   client that the attribute changed on an open file on the server.  If
1596	   the file is closed, then during the open attempt, the client will
1597	   gather the new attribute value.  The server MUST NOT communicate the
1598	   new value of the attribute; the client MUST query it.  This
1599	   requirement stems from the need for the client to provide sufficient
1600	   access rights to the attribute.

1602	   The final change necessary is a modification to the RPC layer used in
1603	   NFSv4 in the form of a new version of the RPCSEC_GSS [8] framework.
1604	   In order for an NFSv4 server to apply MAC checks it must obtain
1605	   additional information from the client.  Several methods were
1606	   explored for performing this and it was decided that the best
1607	   approach was to incorporate the ability to make security attribute
1608	   assertions through the RPC mechanism.  RPCSEC_GSSv3 [5] outlines a
1609	   method to assert additional security information such as security
1610	   labels on GSS context creation and have that data bound to all RPC
1611	   requests that make use of that context.

1613	7.2.  Definitions

1615	   Label Format Specifier (LFS):  is an identifier used by the client to
1616	      establish the syntactic format of the security label and the
1617	      semantic meaning of its components.  These specifiers exist in a
1618	      registry associated with documents describing the format and
1619	      semantics of the label.

1621	   Label Format Registry:  is the IANA registry containing all
1622	      registered LFS along with references to the documents that
1623	      describe the syntactic format and semantics of the security label.

1625	   Policy Identifier (PI):  is an optional part of the definition of a
1626	      Label Format Specifier which allows for clients and servers to
1627	      identify specific security policies.

1629	   Object:  is a passive resource within the system that we wish to be
1630	      protected.  Objects can be entities such as files, directories,
1631	      pipes, sockets, and many other system resources relevant to the
1632	      protection of the system state.

1634	   Subject:  is an active entity, usually a process, which is
1635	      requesting access to an object.

1637	   Multi-Level Security (MLS):  is a traditional model where objects are
1638	      given a sensitivity level (Unclassified, Secret, Top Secret, etc.)
1639	      and a category set [21].

1641	7.3.  MAC Security Attribute

1643	   MAC models base access decisions on security attributes bound to
1644	   subjects and objects.  This information can be a user
1645	   identity for an identity-based MAC model, sensitivity levels for
1646	   Multi-Level Security, or a type for Type Enforcement.  These models
1647	   base their decisions on different criteria but the semantics of the
1648	   security attribute remain the same.  The semantics required by the
1649	   security attributes are listed below:

1651	   o  Must provide flexibility with respect to MAC model.

1653	   o  Must provide the ability to atomically set security information
1654	      upon object creation.

1656	   o  Must provide the ability to enforce access control decisions both
1657	      on the client and the server.

1659	   o  Must not expose an object to either the client or server name
1660	      space before its security information has been bound to it.

1662	   NFSv4 implements the security attribute as a recommended attribute.
1663	   These attributes have a fixed format and semantics, which conflicts
1664	   with the flexible nature of the security attribute.  To resolve this
1665	   the security attribute consists of two components.  The first
1666	   component is an LFS as defined in [22] to allow for interoperability
1667	   between MAC mechanisms.  The second component is an opaque field
1668	   which is the actual security attribute data.  To allow for various
1669	   MAC models NFSv4 should be used solely as a transport mechanism for
1670	   the security attribute.  It is the responsibility of the endpoints to
1671	   consume the security attribute and make access decisions based on
1672	   their respective models.  In addition, creation of objects through
1673	   OPEN and CREATE allows for the security attribute to be specified
1674	   upon creation.  By providing an atomic create and set operation for
1675	   the security attribute it is possible to enforce the second and
1676	   fourth requirements.  The recommended attribute FATTR4_SEC_LABEL will
1677	   be used to satisfy this requirement.

1679	7.3.1.  Interpreting FATTR4_SEC_LABEL

1681	   The XDR [23] necessary to implement Labeled NFSv4 is presented below:

1683	   const FATTR4_SEC_LABEL   = 81;

1685	   typedef uint32_t  policy4;

1687	                                 Figure 6

1689	   struct labelformat_spec4 {
1690	           policy4 lfs_lfs;
1691	           policy4 lfs_pi;
1692	   };

1694	   struct sec_label_attr_info {
1695	           labelformat_spec4       slai_lfs;
1696	           opaque                  slai_data<>;
1697	   };

1699	   The FATTR4_SEC_LABEL contains an array of two components with the
1700	   first component being an LFS.  It serves to provide the receiving end
1701	   with the information necessary to translate the security attribute
1702	   into a form that is usable by the endpoint.  Label Formats assigned
1703	   an LFS may optionally choose to include a Policy Identifier field to
1704	   allow for complex policy deployments.  The LFS and Label Format
1705	   Registry are described in detail in [22].  The translation used to
1706	   interpret the security attribute is not specified as part of the
1707	   protocol as it may depend on various factors.  The second component
1708	   is an opaque section which contains the data of the attribute.  This
1709	   component is dependent on the MAC model to interpret and enforce.

1711	   In particular, it is the responsibility of the LFS specification to
1712	   define a maximum size for the opaque section, slai_data<>.  When
1713	   creating or modifying a label for an object, the client needs to be
1714	   guaranteed that the server will accept a label that is sized
1715	   correctly.  Because both client and server are part of a specific MAC
1716	   model, the client will be aware of the size.
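
   As an illustration only, the following Python sketch shows how a
   sec_label_attr_info value could be laid out on the wire, assuming
   the standard XDR rules (big-endian 32-bit integers, and a
   variable-length opaque sent as a length word followed by the data
   padded with zeros to a multiple of four bytes).  The function names
   and the sample label bytes are hypothetical and are not part of the
   protocol.

```python
import struct


def encode_sec_label(lfs: int, pi: int, data: bytes) -> bytes:
    """XDR-encode lfs_lfs, lfs_pi and slai_data<> in that order."""
    pad = (4 - len(data) % 4) % 4          # opaque data is padded to 4 bytes
    header = struct.pack(">III", lfs, pi, len(data))
    return header + data + b"\x00" * pad


def decode_sec_label(buf: bytes):
    """Inverse of encode_sec_label; returns (lfs, pi, data)."""
    lfs, pi, length = struct.unpack_from(">III", buf, 0)
    data = buf[12:12 + length]
    if len(data) != length:
        raise ValueError("truncated slai_data")
    return lfs, pi, data


# Hypothetical label bytes; the real content is defined by the MAC model.
wire = encode_sec_label(0, 0, b"abcde")
```

   A receiver would first inspect the decoded LFS to decide how, or
   whether, the opaque bytes can be interpreted.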

1718	7.3.2.  Delegations

1720	   In the event that a security attribute is changed on the server while
1721	   a client holds a delegation on the file, the client should follow the
1722	   existing protocol with respect to attribute changes.  It should flush
1723	   all changes back to the server and relinquish the delegation.

1725	7.3.3.  Permission Checking

1727	   It is not feasible to enumerate all possible MAC models and even
1728	   levels of protection within a subset of these models.  This means
1729	   that the NFSv4 client and servers cannot be expected to directly make
1730	   access control decisions based on the security attribute.  Instead
1731	   NFSv4 should defer permission checking on this attribute to the host
1732	   system.  These checks are performed in addition to existing DAC and
1733	   ACL checks outlined in the NFSv4 protocol.  Section 7.6 gives a
1734	   specific example of how the security attribute is handled under a
1735	   particular MAC model.

1737	7.3.4.  Object Creation

1739	   When creating files in NFSv4 the OPEN and CREATE operations are used.
1740	   One of the parameters to these operations is an fattr4 structure
1741	   containing the attributes the file is to be created with.  This
1742	   allows NFSv4 to atomically set the security attribute of files upon
1743	   creation.  When a client is MAC aware it must always provide the
1744	   initial security attribute upon file creation.  In the event that the
1745	   server is the only MAC aware entity in the system it should ignore
1746	   the security attribute specified by the client and instead make the
1747	   determination itself.  A more in-depth explanation can be found in
1748	   Section 7.6.

1750	7.3.5.  Existing Objects

1752	   Note that under the MAC model, all objects must have labels.
1753	   Therefore, if an existing server is upgraded to include LNFS support,
1754	   then it is the responsibility of the security system to define the
1755	   behavior for existing objects.  For example, if the security system
1756	   is LFS 0, which means the server just stores and returns labels, then
1757	   existing files should return labels which are set to an empty value.

1759	7.3.6.  Label Changes

1761	   As per the requirements, when a file's security label is modified,
1762	   the server must notify all clients which have the file opened of the
1763	   change in label.  It does so with CB_ATTR_CHANGED.  There are
1764	   preconditions to making an attribute change imposed by NFSv4 and the
1765	   security system might want to impose others.  In the process of
1766	   meeting these preconditions, the server may choose to either serve the
1767	   request in whole or return NFS4ERR_DELAY to the SETATTR operation.

1769	   If there are open delegations on the file belonging to a client other
1770	   than the one making the label change, then the process described in
1771	   Section 7.3.2 must be followed.

1773	   As the server is always presented with the subject label from the
1774	   client, it does not necessarily need to communicate the fact that the
1775	   label has changed to the client.  In the cases where the change
1776	   outright denies the client access, the client will be able to quickly
1777	   determine that there is a new label in effect.  It is in cases where
1778	   the client may share the same object between multiple subjects or a
1779	   security system which is not strictly hierarchical that the
1780	   CB_ATTR_CHANGED callback is very useful.  It allows the server to
1781	   inform the clients that the cached security attribute is now stale.

1783	   Consider a system in which the clients enforce MAC checks and the
1784	   server has a very simple security system which just stores the
1785	   labels.  In this system, the MAC label check always allows access,
1786	   regardless of the subject label.

1788	   The way in which MAC labels are enforced is by the client.  So if
1789	   client A changes a security label on a file, then the server MUST
1790	   inform all clients that have the file opened that the label has
1791	   changed via CB_ATTR_CHANGED.  Then the clients MUST retrieve the new
1792	   label and MUST enforce access via the new attribute values.
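
   The notification rule above can be pictured with a small Python
   model of a label-storing server.  This is a sketch under stated
   assumptions: the class, its methods, and the send_cb callable are
   all hypothetical stand-ins for a server's open-state tracking and
   backchannel, and only the name CB_ATTR_CHANGED comes from the text.

```python
class LabelServer:
    """Toy server that stores labels and notifies openers of changes."""

    def __init__(self, send_cb):
        self.labels = {}      # filehandle -> label bytes
        self.openers = {}     # filehandle -> set of client ids
        self.send_cb = send_cb  # hypothetical backchannel sender

    def open_file(self, client, fh):
        self.openers.setdefault(fh, set()).add(client)

    def set_label(self, fh, label):
        self.labels[fh] = label
        # Every client with the file open is told its cached label is stale.
        for client in sorted(self.openers.get(fh, ())):
            self.send_cb(client, "CB_ATTR_CHANGED", fh)
```

   On receipt, each client would be expected to fetch the new label
   and apply it to subsequent local access checks.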

1794	7.4.  pNFS Considerations

1796	   This section examines the issues in deploying LNFS in a pNFS
1797	   community of servers.

1799	7.4.1.  MAC Label Checks

1801	   The new FATTR4_SEC_LABEL attribute is metadata information and as
1802	   such the DS is not aware of the value contained on the MDS.
1803	   Fortunately, the NFSv4.1 protocol [2] already has provisions for
1804	   doing access level checks from the DS to the MDS.  In order for the
1805	   DS to validate the subject label presented by the client, it SHOULD
1806	   utilize this mechanism.

1808	   If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize
1809	   CB_ATTR_CHANGED to inform the client of that fact.  If the MDS is
1810	   maintaining

1812	7.5.  Discovery of Server LNFS Support

1814	   The server can easily determine that a client supports LNFS when it
1815	   queries for the FATTR4_SEC_LABEL label for an object.  Note that it
1816	   cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS
1817	   support.  The client might need to discover which LFS the server
1818	   supports.

1820	   A server which supports LNFS MUST allow a client with any subject
1821	   label to retrieve the FATTR4_SEC_LABEL attribute for the root
1822	   filehandle, ROOTFH.  The following compound must always succeed as
1823	   far as a MAC label check is concerned:

1825	        PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}

1827	   Note that the server might have imposed a security flavor on the root
1828	   that precludes such access.  E.g., if the server requires kerberized
1829	   access and the client presents a compound with AUTH_SYS, then the
1830	   server is allowed to return NFS4ERR_WRONGSEC in this case.  But if
1831	   the client presents a correct security flavor, then the server MUST
1832	   return the FATTR4_SEC_LABEL attribute with the supported LFS filled
1833	   in.

1835	7.6.  MAC Security NFS Modes of Operation

1837	   A system using Labeled NFS may operate in two modes.  The first mode
1838	   provides the most protection and is called "full mode".  In this mode
1839	   both the client and server implement a MAC model allowing each end to
1840	   make an access control decision.  The remaining mode is called the
1841	   "guest mode" and in this mode one end of the connection is not
1842	   implementing a MAC model and thus offers less protection than full
1843	   mode.

1845	7.6.1.  Full Mode

1847	   Full mode environments consist of MAC aware NFSv4 servers and clients
1848	   and may be composed of mixed MAC models and policies.  The system
1849	   requires that both the client and server have an opportunity to
1850	   perform an access control check based on all relevant information
1851	   within the network.  The file object security attribute is provided
1852	   using the mechanism described in Section 7.3.  The security attribute
1853	   of the subject making the request is transported at the RPC layer
1854	   using the mechanism described in RPCSEC_GSSv3 [5].

1856	7.6.1.1.  Initial Labeling and Translation

1858	   The ability to create a file is an action that a MAC model may wish
1859	   to mediate.  The client is given the responsibility to determine the
1860	   initial security attribute to be placed on a file.  This allows the
1861	   client to make a decision as to the acceptable security attributes to
1862	   create a file with before sending the request to the server.  Once
1863	   the server receives the creation request from the client it may
1864	   choose to evaluate if the security attribute is acceptable.

1866	   Security attributes on the client and server may vary based on MAC
1867	   model and policy.  To handle this the security attribute field has an
1868	   LFS component.  This component is a mechanism for the host to
1869	   identify the format and meaning of the opaque portion of the security
1870	   attribute.  A full mode environment may contain hosts operating in
1871	   several different LFSs.  In this case a mechanism for translating the
1872	   opaque portion of the security attribute is needed.  The actual
1873	   translation function will vary based on MAC model and policy and is
1874	   out of the scope of this document.  If a translation is unavailable
1875	   for a given LFS then the request SHOULD be denied.  Another recourse
1876	   is to allow the host to provide a fallback mapping for unknown
1877	   security attributes.
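
   One possible shape for the translation step is sketched below in
   Python.  The translation functions themselves are out of scope, so
   the table of translators, the function name, and the convention
   that a result of None means "deny the request" are all assumptions
   made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Translator = Callable[[bytes], bytes]


def translate_label(
    local_lfs: int,
    peer_lfs: int,
    data: bytes,
    translators: Dict[Tuple[int, int], Translator],
    fallback: Optional[bytes] = None,
) -> Optional[bytes]:
    """Return the label in the local LFS, or None if it must be denied."""
    if peer_lfs == local_lfs:
        return data                       # same format, nothing to do
    fn = translators.get((peer_lfs, local_lfs))
    if fn is not None:
        return fn(data)                   # a translation is available
    return fallback                       # None unless a fallback is configured
```

   A host that configures a fallback label accepts unknown security
   attributes under that mapping; a host that leaves it unset denies
   them.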

1879	7.6.1.2.  Policy Enforcement

1881	   In full mode access control decisions are made by both the clients
1882	   and servers.  When a client makes a request it takes the security
1883	   attribute from the requesting process and makes an access control
1884	   decision based on that attribute and the security attribute of the
1885	   object it is trying to access.  If the client denies that access an
1886	   RPC call to the server is never made.  If however the access is
1887	   allowed the client will make a call to the NFS server.

1889	   When the server receives the request from the client it extracts the
1890	   security attribute conveyed in the RPC request.  The server then uses
1891	   this security attribute and the attribute of the object the client is
1892	   trying to access to make an access control decision.  If the server's
1893	   policy allows this access it will fulfill the client's request,
1894	   otherwise it will return NFS4ERR_ACCESS.
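
   The two-stage check described above can be sketched as follows.
   This is a minimal Python model, not an implementation: labels are
   reduced to integers, the "dominates" policy is a hypothetical
   stand-in for a real MAC decision, and send_rpc is an assumed
   callable representing the RPC to the server.

```python
NFS4_OK = 0
NFS4ERR_ACCESS = 13


def dominates(subject: int, obj: int) -> bool:
    """Hypothetical policy: access is allowed if subject level >= object."""
    return subject >= obj


def server_handle(subject_label, object_label, policy):
    """Server side: decide from the subject label carried in the request."""
    return NFS4_OK if policy(subject_label, object_label) else NFS4ERR_ACCESS


def client_request(subject_label, cached_object_label, policy, send_rpc):
    """Client side: check locally first; no RPC is sent on a local denial."""
    if not policy(subject_label, cached_object_label):
        return None
    return send_rpc(subject_label)
```

   Because each end applies its own policy to its own view of the
   object label, a request that passes the client check can still be
   refused by the server.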

1896	   Implementations MAY validate security attributes supplied over the
1897	   network to ensure that they are within a set of attributes permitted
1898	   from a specific peer, and if not, reject them.  Note that a system
1899	   may permit a different set of attributes to be accepted from each
1900	   peer.

1902	7.6.1.3.  Label Aware Only Server

1904	   If the LFS is 0, then it indicates a server which is label aware, but
1905	   does not enforce policies.  Such a server will store and retrieve all
1906	   object labels presented by clients, notify the clients of any label
1907	   changes via CB_ATTR_CHANGED, but will not restrict access via the
1908	   subject label.  Instead, it will expect the clients to enforce all
1909	   such access locally.

1911	7.6.2.  Guest Mode

1913	   Guest mode implies that either the client or the server does not
1914	   handle labels.  If the client is not LNFS aware, then it will not
1915	   offer subject labels to the server.  The server is the only entity
1916	   enforcing policy, and may selectively provide standard NFS services
1917	   to clients based on their authentication credentials and/or
1918	   associated network attributes (e.g., IP address, network interface).
1919	   The level of trust and access extended to a client in this mode is
1920	   configuration-specific.  If the server is not LNFS aware, then it
1921	   will not return object labels to the client.  Clients in this
1922	   environment may consist of groups implementing different MAC
1923	   model policies.  The system requires that all clients in the
1924	   environment be responsible for access control checks.

1926	7.7.  Security Considerations

1928	   This entire document deals with security issues.

1930	   Depending on the level of protection the MAC system offers there may
1931	   be a requirement to tightly bind the security attribute to the data.

1933	   When only one of the client or server enforces labels, it is
1934	   important to realize that the other side is not enforcing MAC
1935	   protections.  Alternate methods might be in use to handle the lack of
1936	   MAC support and care should be taken to identify and mitigate threats
1937	   from possible tampering outside of these methods.

1939	   An example of this is that a server that modifies READDIR or LOOKUP
1940	   results based on the client's subject label might want to always
1941	   construct the same subject label for a client which does not present
1942	   one.  This will prevent a non-LNFS client from mixing entries in the
1943	   directory cache.

1945	8.  Sharing change attribute implementation details with NFSv4 clients

1947	8.1.  Introduction

1949	   Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the
1950	   change attribute as being mandatory to implement, there is little in
1951	   the way of guidance.  The only feature that is mandated by them is
1952	   that the value must change whenever the file data or metadata change.

1954	   While this allows for a wide range of implementations, it also leaves
1955	   the client with a conundrum: how does it determine which is the most
1956	   recent value for the change attribute in a case where several RPC
1957	   calls have been issued in parallel?  In other words if two COMPOUNDs,
1958	   both containing WRITE and GETATTR requests for the same file, have
1959	   been issued in parallel, how does the client determine which of the
1960	   two change attribute values returned in the replies to the GETATTR
1961	   requests corresponds to the most recent state of the file?  In some
1962	   cases, the only recourse may be to send another COMPOUND containing a
1963	   third GETATTR that is fully serialised with the first two.

1965	   NFSv4.2 avoids this kind of inefficiency by allowing the server to
1966	   share details about how the change attribute is expected to evolve,
1967	   so that the client may immediately determine which, out of the
1968	   several change attribute values returned by the server, is the most
1969	   recent.

1971	8.2.  Definition of the 'change_attr_type' per-file system attribute

1973	   enum change_attr_typeinfo {
1974	              NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
1975	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
1976	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
1977	              NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
1978	              NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
1979	   };

1981	        +------------------+----+---------------------------+-----+
1982	        | Name             | Id | Data Type                 | Acc |
1983	        +------------------+----+---------------------------+-----+
1984	        | change_attr_type | XX | enum change_attr_typeinfo | R   |
1985	        +------------------+----+---------------------------+-----+

1987	   The solution enables the NFS server to provide additional information
1988	   about how it expects the change attribute value to evolve after the
1989	   file data or metadata has changed. 'change_attr_type' is defined as a
1990	   new recommended attribute, and takes values from enum
1991	   change_attr_typeinfo as follows:

1993	   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value MUST
1994	      monotonically increase for every atomic change to the file
1995	      attributes, data or directory contents.

1997	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:  The change attribute value MUST
1998	      be incremented by one unit for every atomic change to the file
1999	      attributes, data or directory contents.  This property is
2000	      preserved when writing to pNFS data servers.

2002	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:  The change attribute
2003	      value MUST be incremented by one unit for every atomic change to
2004	      the file attributes, data or directory contents.  In the case
2005	      where the client is writing to pNFS data servers, the number of
2006	      increments is not guaranteed to exactly match the number of
2007	      writes.

2009	   NFS4_CHANGE_TYPE_IS_TIME_METADATA:  The change attribute is
2010	      implemented as suggested in the NFSv4 spec [10] in terms of the
2011	      time_metadata attribute.

2013	   NFS4_CHANGE_TYPE_IS_UNDEFINED:  The change attribute does not take
2014	      values that fit into any of these categories.

2016	   If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
2017	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
2018	   NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at
2019	   the very least that the change attribute is monotonically increasing,
2020	   which is sufficient to resolve the question of which value is the
2021	   most recent.

2023	   If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then
2024	   by inspecting the value of the 'time_delta' attribute it additionally
2025	   has the option of detecting rogue server implementations that use
2026	   time_metadata in violation of the spec.

2028	   Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
2029	   has the ability to predict what the resulting change attribute value
2030	   should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
2031	   This again allows it to detect changes made in parallel by another
2032	   client.  The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
2033	   the same, but only if the client is not doing pNFS WRITEs.
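
   A client might use 'change_attr_type' along the following lines.
   The Python sketch below is illustrative only: the helper names are
   hypothetical, the number of atomic changes caused by a COMPOUND is
   passed in by the caller as an assumption, and wraparound of the
   change attribute is ignored.

```python
from enum import IntEnum


class ChangeAttrType(IntEnum):
    MONOTONIC_INCR = 0
    VERSION_COUNTER = 1
    VERSION_COUNTER_NOPNFS = 2
    TIME_METADATA = 3
    UNDEFINED = 4


# Types for which the text says the value is known to increase.
ORDERED = {
    ChangeAttrType.MONOTONIC_INCR,
    ChangeAttrType.VERSION_COUNTER,
    ChangeAttrType.TIME_METADATA,
}


def most_recent(kind, values):
    """Pick the newest change value from parallel replies, if decidable."""
    if kind in ORDERED:
        return max(values)
    return None  # undecidable here; a serialised GETATTR would be needed


def predicted_change(kind, before, n_changes, pnfs_writes=False):
    """Expected value after n_changes atomic changes, or None if unknown."""
    if kind == ChangeAttrType.VERSION_COUNTER:
        return before + n_changes
    if kind == ChangeAttrType.VERSION_COUNTER_NOPNFS and not pnfs_writes:
        return before + n_changes
    return None
```

   If the value returned by the server differs from the prediction,
   the client can infer that another client changed the file in
   parallel.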

2035	9.  Security Considerations

2037	10.  Error Values

2039	   NFS error numbers are assigned to failed operations within a Compound
2040	   (COMPOUND or CB_COMPOUND) request.  A Compound request contains a
2041	   number of NFS operations that have their results encoded in sequence
2042	   in a Compound reply.  The results of successful operations will
2043	   consist of an NFS4_OK status followed by the encoded results of the
2044	   operation.  If an NFS operation fails, an error status will be
2045	   entered in the reply and the Compound request will be terminated.
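
   The stop-at-first-error behavior can be modelled in a few lines of
   Python.  This is a sketch under assumptions: each operation is
   represented by a hypothetical callable returning a (status, result)
   pair, and no real encoding is performed.

```python
NFS4_OK = 0


def run_compound(ops):
    """Evaluate operations in order; stop after the first failure.

    ops is a list of (name, callable) pairs; each callable returns
    (status, result).  The reply holds one entry per operation that
    was attempted, including the failing one.
    """
    reply = []
    for name, fn in ops:
        status, result = fn()
        reply.append((name, status, result if status == NFS4_OK else None))
        if status != NFS4_OK:
            break  # later operations are never evaluated
    return reply
```

   A client reading the reply therefore sees NFS4_OK entries followed
   by at most one error entry.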

2047	10.1.  Error Definitions

2049	                        Protocol Error Definitions

2051	         +--------------------------+--------+------------------+
2052	         | Error                    | Number | Description      |
2053	         +--------------------------+--------+------------------+
2054	         | NFS4ERR_BADLABEL         | 10093  | Section 10.1.3.1 |
2055	         | NFS4ERR_MAC_ACCESS       | 10094  | Section 10.1.3.2 |
2056	         | NFS4ERR_METADATA_NOTSUPP | 10090  | Section 10.1.2.1 |
2057	         | NFS4ERR_OFFLOAD_DENIED   | 10091  | Section 10.1.2.2 |
2058	         | NFS4ERR_PARTNER_NO_AUTH  | 10089  | Section 10.1.2.3 |
2059	         | NFS4ERR_PARTNER_NOTSUPP  | 10088  | Section 10.1.2.4 |
2060	         | NFS4ERR_UNION_NOTSUPP    | 10095  | Section 10.1.1.1 |
2061	         | NFS4ERR_WRONG_LFS        | 10092  | Section 10.1.3.3 |
2062	         +--------------------------+--------+------------------+

2064	                                  Table 1

2066	10.1.1.  General Errors

2068	   This section deals with errors that are applicable to a broad set of
2069	   different purposes.

2071	10.1.1.1.  NFS4ERR_UNION_NOTSUPP (Error Code 10095)

2073	   One of the arguments to the operation is a discriminated union and
2074	   while the server supports the given operation, it does not support
2075	   the selected arm of the discriminated union.  For an example, see
2076	   READ_PLUS (Section 13.10).

2078	10.1.2.  Server to Server Copy Errors

2080	   These errors deal with the interaction between server to server
2081	   copies.

2083	10.1.2.1.  NFS4ERR_METADATA_NOTSUPP (Error Code 10090)

2085	   The destination file cannot support the same metadata as the source
2086	   file.

2088	10.1.2.2.  NFS4ERR_OFFLOAD_DENIED (Error Code 10091)

2090	   The copy offload operation is supported by both the source and the
2091	   destination, but the destination is not allowing it for this file.
2092	   If the client sees this error, it should fall back to the normal copy
2093	   semantics.

2095	10.1.2.3.  NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)

2097	   The remote server does not authorize a server-to-server copy offload
2098	   operation.  This may be due to the client's failure to send the
2099	   COPY_NOTIFY operation to the remote server, the remote server
2100	   receiving a server-to-server copy offload request after the copy
2101	   lease time expired, or for some other permission problem.

2103	10.1.2.4.  NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)

2105	   The remote server does not support the server-to-server copy offload
2106	   protocol.

2108	10.1.3.  Labeled NFS Errors

2110	   These errors are used in LNFS.

2112	10.1.3.1.  NFS4ERR_BADLABEL (Error Code 10093)
2113	10.1.3.2.  NFS4ERR_MAC_ACCESS (Error Code 10094)

2115	10.1.3.3.  NFS4ERR_WRONG_LFS (Error Code 10092)

2117	11.  File Attributes

2119	11.1.  Attribute Definitions

2121	11.1.1.  Attribute 77: space_reserved

2123	   The space_reserved attribute is a read/write attribute of type
2124	   boolean.  It is a per file attribute.  When the space_reserved
2125	   attribute is set via SETATTR, the server must ensure that there is
2126	   disk space to accommodate every byte in the file before it can return
2127	   success.  If the server cannot guarantee this, it must return
2128	   NFS4ERR_NOSPC.

2130	   If the client tries to grow a file which has the space_reserved
2131	   attribute set, the server must guarantee that there is disk space to
2132	   accommodate every byte in the file with the new size before it can
2133	   return success.  If the server cannot guarantee this, it must return
2134	   NFS4ERR_NOSPC.

2136	   It is not required that the server allocate the space to the file
2137	   before returning success.  The allocation can be deferred; however,
2138	   it must be guaranteed that it will not fail for lack of space.

2140	   The value of space_reserved can be obtained at any time through
2141	   GETATTR.

2143	   In order to avoid ambiguity, the space_reserved bit cannot be set
2144	   along with the size bit in SETATTR.  Increasing the size of a file
2145	   with space_reserved set will fail if space reservation cannot be
2146	   guaranteed for the new size.  If the file size is decreased, space
2147	   reservation is only guaranteed for the new size and the extra blocks
2148	   backing the file can be released.
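
   The reservation rules can be illustrated with a toy accounting
   model in Python.  It is a sketch only: the Volume and File classes
   are hypothetical, and it assumes for simplicity that none of the
   file's bytes are already backed by allocated blocks, so the full
   size must be covered by free space.

```python
from dataclasses import dataclass

NFS4_OK = 0
NFS4ERR_NOSPC = 28


@dataclass
class Volume:
    free: int  # bytes not yet promised to any file


@dataclass
class File:
    size: int = 0
    reserved: bool = False


def set_space_reserved(vol: Volume, f: File, flag: bool) -> int:
    """SETATTR of space_reserved: guarantee space for every byte."""
    if flag and not f.reserved:
        if f.size > vol.free:
            return NFS4ERR_NOSPC
        vol.free -= f.size
    elif not flag and f.reserved:
        vol.free += f.size
    f.reserved = flag
    return NFS4_OK


def set_size(vol: Volume, f: File, new_size: int) -> int:
    """Grow or shrink a file, keeping any reservation consistent."""
    if f.reserved:
        delta = new_size - f.size
        if delta > vol.free:
            return NFS4ERR_NOSPC   # growth cannot be guaranteed
        vol.free -= delta          # a negative delta releases space
    f.size = new_size
    return NFS4_OK
```

   Note that the model only tracks the guarantee; actual block
   allocation may still be deferred by the server.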

2150	11.1.2.  Attribute 78: space_freed

2152	   space_freed gives the number of bytes freed if the file is deleted.
2153	   This attribute is read only and is of type length4.  It is a per file
2154	   attribute.

2156	12.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL

2158	   The following tables summarize the operations of the NFSv4.2 protocol
2159	   and the corresponding designation of REQUIRED, RECOMMENDED, and
2160	   OPTIONAL to implement or MUST NOT implement.  The designation of MUST
2161	   NOT implement is reserved for those operations that were defined in
2162	   either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2.

2164	   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
2165	   for operations sent by the client is for the server implementation.
2166	   The client is generally required to implement the operations needed
2167	   for the operating environment for which it serves.  For example, a
2168	   read-only NFSv4.2 client would have no need to implement the WRITE
2169	   operation and is not required to do so.

2171	   The REQUIRED or OPTIONAL designation for callback operations sent by
2172	   the server is for both the client and server.  Generally, the client
2173	   has the option of creating the backchannel and sending the operations
2174	   on the fore channel that will be a catalyst for the server sending
2175	   callback operations.  A partial exception is CB_RECALL_SLOT; the only
2176	   way the client can avoid supporting this operation is by not creating
2177	   a backchannel.

2179	   Since this is a summary of the operations and their designation,
2180	   there are subtleties that are not presented here.  Therefore, if
2181	   there is a question of the requirements of implementation, the
2182	   operation descriptions themselves must be consulted along with other
2183	   relevant explanatory text within either this specification or that of
2184	   NFSv4.1 [2].

2186	   The abbreviations used in the second and third columns of the table
2187	   are defined as follows.

2189	   REQ  REQUIRED to implement

2191	   REC  RECOMMENDED to implement

2193	   OPT  OPTIONAL to implement

2195	   MNI  MUST NOT implement

2197	   For the NFSv4.2 features that are OPTIONAL, the operations that
2198	   support those features are OPTIONAL, and the server would return
2199	   NFS4ERR_NOTSUPP in response to the client's use of those operations.
2200	   If an OPTIONAL feature is supported, it is possible that a set of
2201	   operations related to the feature become REQUIRED to implement.  The
2202	   third column of the table designates the feature(s) and if the
2203	   operation is REQUIRED or OPTIONAL in the presence of support for the
2204	   feature.

2206	   The OPTIONAL features identified and their abbreviations are as
2207	   follows:

2209	   pNFS  Parallel NFS

2211	   FDELG  File Delegations

2213	   DDELG  Directory Delegations

2215	   COPY  Server Side Copy

2217	   ADB  Application Data Blocks
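
   The relationship between OPTIONAL features and their operations
   might be checked by a server roughly as follows.  This Python
   sketch is an assumption-laden illustration: the OP_FEATURES table
   is a small hypothetical excerpt keyed by the feature abbreviations
   above, and operations absent from it are simply treated as not
   feature-gated.

```python
NFS4_OK = 0
NFS4ERR_NOTSUPP = 10004

# Excerpt only: operation -> features, any one of which makes it REQUIRED.
OP_FEATURES = {
    "COPY": {"COPY"},
    "COPY_NOTIFY": {"COPY"},
    "LAYOUTGET": {"pNFS"},
    "DELEGRETURN": {"FDELG", "DDELG", "pNFS"},
    "GET_DIR_DELEGATION": {"DDELG"},
}


def check_op(op: str, enabled: set) -> int:
    """Return NFS4ERR_NOTSUPP if no feature backing the op is enabled."""
    needed = OP_FEATURES.get(op)
    if needed is None:
        return NFS4_OK                 # not feature-gated in this sketch
    return NFS4_OK if needed & enabled else NFS4ERR_NOTSUPP
```

   A server that enables only pNFS would thus accept LAYOUTGET and
   DELEGRETURN but answer COPY with NFS4ERR_NOTSUPP.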

2219	                                Operations

2221	   +----------------------+--------------------+-----------------------+
2222	   | Operation            | REQ, REC, OPT, or  | Feature (REQ, REC, or |
2223	   |                      | MNI                | OPT)                  |
2224	   +----------------------+--------------------+-----------------------+
2225	   | ACCESS               | REQ                |                       |
2226	   | BACKCHANNEL_CTL      | REQ                |                       |
2227	   | BIND_CONN_TO_SESSION | REQ                |                       |
2228	   | CLOSE                | REQ                |                       |
2229	   | COMMIT               | REQ                |                       |
2230	   | COPY                 | OPT                | COPY (REQ)            |
2231	   | COPY_ABORT           | OPT                | COPY (REQ)            |
2232	   | COPY_NOTIFY          | OPT                | COPY (REQ)            |
2233	   | COPY_REVOKE          | OPT                | COPY (REQ)            |
2234	   | COPY_STATUS          | OPT                | COPY (REQ)            |
2235	   | CREATE               | REQ                |                       |
2236	   | CREATE_SESSION       | REQ                |                       |
2237	   | DELEGPURGE           | OPT                | FDELG (REQ)           |
2238	   | DELEGRETURN          | OPT                | FDELG, DDELG, pNFS    |
2239	   |                      |                    | (REQ)                 |
2240	   | DESTROY_CLIENTID     | REQ                |                       |
2241	   | DESTROY_SESSION      | REQ                |                       |
2242	   | EXCHANGE_ID          | REQ                |                       |
2243	   | FREE_STATEID         | REQ                |                       |
2244	   | GETATTR              | REQ                |                       |
2245	   | GETDEVICEINFO        | OPT                | pNFS (REQ)            |
2246	   | GETDEVICELIST        | OPT                | pNFS (OPT)            |
2247	   | GETFH                | REQ                |                       |
2248	   | INITIALIZE           | OPT                | ADB (REQ)             |
2249	   | GET_DIR_DELEGATION   | OPT                | DDELG (REQ)           |
2250	   | LAYOUTCOMMIT         | OPT                | pNFS (REQ)            |
2251	   | LAYOUTGET            | OPT                | pNFS (REQ)            |
2252	   | LAYOUTRETURN         | OPT                | pNFS (REQ)            |
2253	   | LINK                 | OPT                |                       |
2254	   | LOCK                 | REQ                |                       |
2255	   | LOCKT                | REQ                |                       |
2256	   | LOCKU                | REQ                |                       |
2257	   | LOOKUP               | REQ                |                       |
2258	   | LOOKUPP              | REQ                |                       |
2259	   | NVERIFY              | REQ                |                       |
2260	   | OPEN                 | REQ                |                       |
2261	   | OPENATTR             | OPT                |                       |
2262	   | OPEN_CONFIRM         | MNI                |                       |
2263	   | OPEN_DOWNGRADE       | REQ                |                       |
2264	   | PUTFH                | REQ                |                       |
2265	   | PUTPUBFH             | REQ                |                       |
2266	   | PUTROOTFH            | REQ                |                       |
2267	   | READ                 | OPT                |                       |
2268	   | READDIR              | REQ                |                       |
2269	   | READLINK             | OPT                |                       |
2270	   | READ_PLUS            | OPT                | ADB (REQ)             |
2271	   | RECLAIM_COMPLETE     | REQ                |                       |
2272	   | RELEASE_LOCKOWNER    | MNI                |                       |
2273	   | REMOVE               | REQ                |                       |
2274	   | RENAME               | REQ                |                       |
2275	   | RENEW                | MNI                |                       |
2276	   | RESTOREFH            | REQ                |                       |
2277	   | SAVEFH               | REQ                |                       |
2278	   | SECINFO              | REQ                |                       |
2279	   | SECINFO_NO_NAME      | REC                | pNFS file layout      |
2280	   |                      |                    | (REQ)                 |
2281	   | SEQUENCE             | REQ                |                       |
2282	   | SETATTR              | REQ                |                       |
2283	   | SETCLIENTID          | MNI                |                       |
2284	   | SETCLIENTID_CONFIRM  | MNI                |                       |
2285	   | SET_SSV              | REQ                |                       |
2286	   | TEST_STATEID         | REQ                |                       |
2287	   | VERIFY               | REQ                |                       |
2288	   | WANT_DELEGATION      | OPT                | FDELG (OPT)           |
2289	   | WRITE                | REQ                |                       |
2290	   +----------------------+--------------------+-----------------------+

2291	                            Callback Operations

2293	   +-------------------------+-------------------+---------------------+
2294	   | Operation               | REQ, REC, OPT, or | Feature (REQ, REC,  |
2295	   |                         | MNI               | or OPT)             |
2296	   +-------------------------+-------------------+---------------------+
2297	   | CB_COPY                 | OPT               | COPY (REQ)          |
2298	   | CB_GETATTR              | OPT               | FDELG (REQ)         |
2299	   | CB_LAYOUTRECALL         | OPT               | pNFS (REQ)          |
2300	   | CB_NOTIFY               | OPT               | DDELG (REQ)         |
2301	   | CB_NOTIFY_DEVICEID      | OPT               | pNFS (OPT)          |
2302	   | CB_NOTIFY_LOCK          | OPT               |                     |
2303	   | CB_PUSH_DELEG           | OPT               | FDELG (OPT)         |
2304	   | CB_RECALL               | OPT               | FDELG, DDELG, pNFS  |
2305	   |                         |                   | (REQ)               |
2306	   | CB_RECALL_ANY           | OPT               | FDELG, DDELG, pNFS  |
2307	   |                         |                   | (REQ)               |
2308	   | CB_RECALL_SLOT          | REQ               |                     |
2309	   | CB_RECALLABLE_OBJ_AVAIL | OPT               | DDELG, pNFS (REQ)   |
2310	   | CB_SEQUENCE             | OPT               | FDELG, DDELG, pNFS  |
2311	   |                         |                   | (REQ)               |
2312	   | CB_WANTS_CANCELLED      | OPT               | FDELG, DDELG, pNFS  |
2313	   |                         |                   | (REQ)               |
2314	   +-------------------------+-------------------+---------------------+

2316	13.  NFSv4.2 Operations

2318	13.1.  Operation 59: COPY - Initiate a server-side copy

2320	13.1.1.  ARGUMENT

2322	   const COPY4_GUARDED     = 0x00000001;
2323	   const COPY4_METADATA    = 0x00000002;

2325	   struct COPY4args {
2326	           /* SAVED_FH: source file */
2327	           /* CURRENT_FH: destination file or */
2328	           /*             directory           */
2329	           offset4         ca_src_offset;
2330	           offset4         ca_dst_offset;
2331	           length4         ca_count;
2332	           uint32_t        ca_flags;
2333	           component4      ca_destination;
2334	           netloc4         ca_source_server<>;
2335	   };

2337	13.1.2.  RESULT

2339	   union COPY4res switch (nfsstat4 cr_status) {
2340	           case NFS4_OK:
2341	                   stateid4        cr_callback_id<1>;
2342	           default:
2343	                   length4         cr_bytes_copied;
2344	   };

2346	13.1.3.  DESCRIPTION

2348	   The COPY operation is used for both intra-server and inter-server
2349	   copies.  In both cases, the COPY is always sent from the client to
2350	   the destination server of the file copy.  The COPY operation requests
2351	   that a file be copied from the location specified by the SAVED_FH
2352	   value to the location specified by the combination of CURRENT_FH and
2353	   ca_destination.

2355	   The SAVED_FH must be a regular file.  If SAVED_FH is not a regular
2356	   file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.

2358	   In order to set SAVED_FH to the source file handle, the compound
2359	   procedure requesting the COPY will include a sub-sequence of
2360	   operations such as

2362	      PUTFH source-fh
2363	      SAVEFH

2365	   If the request is for a server-to-server copy, the source-fh is a
2366	   filehandle from the source server and the compound procedure is being
2367	   executed on the destination server.  In this case, the source-fh is a
2368	   foreign filehandle on the server receiving the COPY request.  If
2369	   either PUTFH or SAVEFH checked the validity of the filehandle, the
2370	   operation would likely fail and return NFS4ERR_STALE.

2372	   In order to avoid this problem, the minor version incorporating the
2373	   COPY operations will need to make a few small changes in the handling
2374	   of existing operations.  If a server supports the server-to-server
2375	   COPY feature, a PUTFH followed by a SAVEFH MUST NOT return
2376	   NFS4ERR_STALE for either operation.  These restrictions do not pose
2377	   substantial difficulties for servers.  The CURRENT_FH and SAVED_FH
2378	   may be validated in the context of the operation referencing them and
2379	   an NFS4ERR_STALE error returned for an invalid file handle at that
2380	   point.

2382	   The CURRENT_FH and ca_destination together specify the destination of
2383	   the copy operation.  If ca_destination is of 0 (zero) length, then
2384	   CURRENT_FH specifies the target file.  In this case, CURRENT_FH MUST
2385	   be a regular file and not a directory.  If ca_destination is not of 0
2386	   (zero) length, the ca_destination argument specifies the file name to
2387	   which the data will be copied within the directory identified by
2388	   CURRENT_FH.  In this case, CURRENT_FH MUST be a directory and not a
2389	   regular file.
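
   The destination rule above can be sketched as follows.  This is a
   minimal illustration, not part of the protocol: the helper name, the
   "REG"/"DIR" type labels, and the choice of error code for each
   mismatch are assumptions, since the text only requires that
   CURRENT_FH have the right type.

```python
# Hypothetical helper; type labels and error choices are assumptions.
def resolve_copy_destination(current_fh_type, ca_destination):
    """Return (status, target) for the destination of a COPY.

    current_fh_type: "REG" or "DIR" (illustrative labels).
    ca_destination:  component name, possibly of zero length.
    """
    if len(ca_destination) == 0:
        # Zero-length name: CURRENT_FH itself is the target file and
        # must be a regular file, not a directory.
        if current_fh_type == "DIR":
            return ("NFS4ERR_ISDIR", None)       # assumed error code
        if current_fh_type != "REG":
            return ("NFS4ERR_WRONG_TYPE", None)  # assumed error code
        return ("NFS4_OK", "CURRENT_FH")
    # Non-empty name: CURRENT_FH must be the containing directory.
    if current_fh_type != "DIR":
        return ("NFS4ERR_NOTDIR", None)          # assumed error code
    return ("NFS4_OK", "CURRENT_FH/" + ca_destination)
```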

2391	   If the file named by ca_destination does not exist and the operation
2392	   completes successfully, the file will be visible in the file system
2393	   namespace.  If the file does not exist and the operation fails, the
2394	   file MAY be visible in the file system namespace depending on when
2395	   the failure occurs and on the implementation of the NFS server
2396	   receiving the COPY operation.  If the ca_destination name cannot be
2397	   created in the destination file system (due to file name
2398	   restrictions, such as case or length), the operation MUST fail.

2400	   The ca_src_offset is the offset within the source file from which the
2401	   data will be read, the ca_dst_offset is the offset within the
2402	   destination file to which the data will be written, and the ca_count
2403	   is the number of bytes that will be copied.  An offset of 0 (zero)
2404	   specifies the start of the file.  A count of 0 (zero) requests that
2405	   all bytes from ca_src_offset through EOF be copied to the
2406	   destination.  If concurrent modifications to the source file overlap
2407	   with the source file region being copied, the data copied may include
2408	   all, some, or none of the modifications.  The client can use standard
2409	   NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory
2410	   byte range locks) to protect against concurrent modifications if the
2411	   client is concerned about this.  If the source file's end of file is
2412	   being modified in parallel with a copy that specifies a count of 0
2413	   (zero) bytes, the amount of data copied is implementation dependent
2414	   (clients may guard against this case by specifying a non-zero count
2415	   value or preventing modification of the source file as mentioned
2416	   above).

2418	   If the source offset or the source offset plus count is greater than
2419	   or equal to the size of the source file, the operation will fail with
2420	   NFS4ERR_INVAL.  The destination offset or destination offset plus
2421	   count may be greater than the size of the destination file.  This
2422	   allows for the client to issue parallel copies to implement
2423	   operations such as "cat file1 file2 file3 file4 > dest".
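
   As a rough sketch, a literal reading of the source range check and
   of the "count of 0 means through EOF" rule might look like the
   following.  The function names are hypothetical.  Note that, read
   literally, the check also rejects a non-zero count that ends exactly
   at EOF; an implementation may interpret the rule differently.

```python
# Hypothetical helpers illustrating a literal reading of the text.
def check_copy_source_range(src_size, ca_src_offset, ca_count):
    """Return the status implied by the source offset and count."""
    if ca_src_offset >= src_size:
        return "NFS4ERR_INVAL"
    if ca_src_offset + ca_count >= src_size:
        return "NFS4ERR_INVAL"
    return "NFS4_OK"


def bytes_requested(src_size, ca_src_offset, ca_count):
    """A count of 0 requests everything from the offset through EOF."""
    if ca_count == 0:
        return src_size - ca_src_offset
    return ca_count
```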

2425	   If the destination file is created as a result of this command, the
2426	   destination file's size will be equal to the number of bytes
2427	   successfully copied.  If the destination file already existed, the
2428	   destination file's size may increase as a result of this operation
2429	   (e.g., if ca_dst_offset plus ca_count is greater than the
2430	   destination's initial size).

2432	   If the ca_source_server list is specified, then this is an inter-
2433	   server copy operation and the source file is on a remote server.  The
2434	   client is expected to have previously issued a successful COPY_NOTIFY
2435	   request to the remote source server.  The ca_source_server list
2436	   SHOULD be the same as the COPY_NOTIFY response's cnr_source_server
2437	   list.  If the client includes the entries from the COPY_NOTIFY
2438	   response's cnr_source_server list in the ca_source_server list, the
2439	   source server can indicate a specific copy protocol for the
2440	   destination server to use by returning a URL, which specifies both a
2441	   protocol service and server name.  Server-to-server copy protocol
2442	   considerations are described in Section 2.2.3 and Section 2.4.1.

2444	   The ca_flags argument allows the copy operation to be customized in
2445	   the following ways using the guarded flag (COPY4_GUARDED) and the
2446	   metadata flag (COPY4_METADATA).

2448	   If the guarded flag is set and the destination exists on the server,
2449	   this operation will fail with NFS4ERR_EXIST.

2451	   If the guarded flag is not set and the destination exists on the
2452	   server, the behavior is implementation dependent.

2454	   If the metadata flag is set and the client is requesting a whole file
2455	   copy (i.e., ca_count is 0 (zero)), a subset of the destination file's
2456	   attributes MUST be the same as the source file's corresponding
2457	   attributes and a subset of the destination file's attributes SHOULD
2458	   be the same as the source file's corresponding attributes.  The
2459	   attributes in the MUST and SHOULD copy subsets will be defined for
2460	   each NFS version.

2462	   For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED
2463	   attributes respectively.  A "MUST" in the "Copy to destination file?"
2464	   column indicates that the attribute is part of the MUST copy set.  A
2465	   "SHOULD" in the "Copy to destination file?" column indicates that the
2466	   attribute is part of the SHOULD copy set.

2468	          +--------------------+----+---------------------------+
2469	          | Name               | Id | Copy to destination file? |
2470	          +--------------------+----+---------------------------+
2471	          | supported_attrs    | 0  | no                        |
2472	          | type               | 1  | MUST                      |
2473	          | fh_expire_type     | 2  | no                        |
2474	          | change             | 3  | SHOULD                    |
2475	          | size               | 4  | MUST                      |
2476	          | link_support       | 5  | no                        |
2477	          | symlink_support    | 6  | no                        |
2478	          | named_attr         | 7  | no                        |
2479	          | fsid               | 8  | no                        |
2480	          | unique_handles     | 9  | no                        |
2481	          | lease_time         | 10 | no                        |
2482	          | rdattr_error       | 11 | no                        |
2483	          | filehandle         | 19 | no                        |
2484	          | suppattr_exclcreat | 75 | no                        |
2485	          +--------------------+----+---------------------------+

2487	                                  Table 2

2489	          +--------------------+----+---------------------------+
2490	          | Name               | Id | Copy to destination file? |
2491	          +--------------------+----+---------------------------+
2492	          | acl                | 12 | MUST                      |
2493	          | aclsupport         | 13 | no                        |
2494	          | archive            | 14 | no                        |
2495	          | cansettime         | 15 | no                        |
2496	          | case_insensitive   | 16 | no                        |
2497	          | case_preserving    | 17 | no                        |
2498	          | change_policy      | 60 | no                        |
2499	          | chown_restricted   | 18 | MUST                      |
2500	          | dacl               | 58 | MUST                      |
2501	          | dir_notif_delay    | 56 | no                        |
2502	          | dirent_notif_delay | 57 | no                        |
2503	          | fileid             | 20 | no                        |
2504	          | files_avail        | 21 | no                        |
2505	          | files_free         | 22 | no                        |
2506	          | files_total        | 23 | no                        |
2507	          | fs_charset_cap     | 76 | no                        |
2508	          | fs_layout_type     | 62 | no                        |
2509	          | fs_locations       | 24 | no                        |
2510	          | fs_locations_info  | 67 | no                        |
2511	          | fs_status          | 61 | no                        |
2512	          | hidden             | 25 | MUST                      |
2513	          | homogeneous        | 26 | no                        |
2514	          | layout_alignment   | 66 | no                        |
2515	          | layout_blksize     | 65 | no                        |
2516	          | layout_hint        | 63 | no                        |
2517	          | layout_type        | 64 | no                        |
2518	          | maxfilesize        | 27 | no                        |
2519	          | maxlink            | 28 | no                        |
2520	          | maxname            | 29 | no                        |
2521	          | maxread            | 30 | no                        |
2522	          | maxwrite           | 31 | no                        |
2523	          | mdsthreshold       | 68 | no                        |
2524	          | mimetype           | 32 | MUST                      |
2525	          | mode               | 33 | MUST                      |
2526	          | mode_set_masked    | 74 | no                        |
2527	          | mounted_on_fileid  | 55 | no                        |
2528	          | no_trunc           | 34 | no                        |
2529	          | numlinks           | 35 | no                        |
2530	          | owner              | 36 | MUST                      |
2531	          | owner_group        | 37 | MUST                      |
2532	          | quota_avail_hard   | 38 | no                        |
2533	          | quota_avail_soft   | 39 | no                        |
2534	          | quota_used         | 40 | no                        |
2535	          | rawdev             | 41 | no                        |
2536	          | retentevt_get      | 71 | MUST                      |
2537	          | retentevt_set      | 72 | no                        |
2538	          | retention_get      | 69 | MUST                      |
2539	          | retention_hold     | 73 | MUST                      |
2540	          | retention_set      | 70 | no                        |
2541	          | sacl               | 59 | MUST                      |
2542	          | space_avail        | 42 | no                        |
2543	          | space_free         | 43 | no                        |
2544	          | space_freed        | 78 | no                        |
2545	          | space_reserved     | 77 | MUST                      |
2546	          | space_total        | 44 | no                        |
2547	          | space_used         | 45 | no                        |
2548	          | system             | 46 | MUST                      |
2549	          | time_access        | 47 | MUST                      |
2550	          | time_access_set    | 48 | no                        |
2551	          | time_backup        | 49 | no                        |
2552	          | time_create        | 50 | MUST                      |
2553	          | time_delta         | 51 | no                        |
2554	          | time_metadata      | 52 | SHOULD                    |
2555	          | time_modify        | 53 | MUST                      |
2556	          | time_modify_set    | 54 | no                        |
2557	          +--------------------+----+---------------------------+

2559	                                  Table 3

2561	   [NOTE: The source file's attribute values will take precedence over
2562	   any attribute values inherited by the destination file.]

2563	   In the case of an inter-server copy or an intra-server copy between
2564	   file systems, the attributes supported for the source file and
2565	   destination file could be different.  By definition, the REQUIRED
2566	   attributes will be supported in all cases.  If the metadata flag is
2567	   set and the source file has a RECOMMENDED attribute that is not
2568	   supported for the destination file, the copy MUST fail with
2569	   NFS4ERR_ATTRNOTSUPP.
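
   The NFS4ERR_ATTRNOTSUPP rule can be illustrated with a short sketch.
   The function name and the use of plain attribute-name sets are
   assumptions made for the example; a real server would work from the
   attribute bitmaps.

```python
COPY4_METADATA = 0x00000002


# Hypothetical helper; attribute sets stand in for attribute bitmaps.
def check_metadata_copy(ca_flags, src_recommended_attrs, dst_supported_attrs):
    """Fail if a RECOMMENDED attribute on the source is unsupported
    for the destination and the metadata flag is set."""
    if not (ca_flags & COPY4_METADATA):
        return "NFS4_OK"
    unsupported = set(src_recommended_attrs) - set(dst_supported_attrs)
    if unsupported:
        return "NFS4ERR_ATTRNOTSUPP"
    return "NFS4_OK"
```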

2571	   Any attribute supported by the destination server that is not set on
2572	   the source file SHOULD be left unset.

2574	   Metadata attributes not exposed via the NFS protocol SHOULD be copied
2575	   to the destination file where appropriate.

2577	   The destination file's named attributes are not duplicated from the
2578	   source file.  After the copy process completes, the client MAY
2579	   attempt to duplicate named attributes using standard NFSv4
2580	   operations.  However, the destination file's named attribute
2581	   capabilities MAY be different from the source file's named attribute
2582	   capabilities.

2584	   If the metadata flag is not set and the client is requesting a whole
2585	   file copy (i.e., ca_count is 0 (zero)), the destination file's
2586	   metadata is implementation dependent.

2588	   If the client is requesting a partial file copy (i.e., ca_count is
2589	   not 0 (zero)), the client SHOULD NOT set the metadata flag and the
2590	   server MUST ignore the metadata flag.
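
   A minimal sketch of how a server might decode ca_flags under the
   rules above follows.  The helper name is hypothetical; the constants
   are those defined in the ARGUMENT section.

```python
COPY4_GUARDED = 0x00000001
COPY4_METADATA = 0x00000002


# Hypothetical helper returning the flags a server would act on.
def effective_copy_flags(ca_flags, ca_count):
    """Return (guarded, metadata) as the server should honor them."""
    guarded = bool(ca_flags & COPY4_GUARDED)
    # The metadata flag only applies to a whole file copy
    # (ca_count == 0); for a partial copy the server ignores it.
    metadata = bool(ca_flags & COPY4_METADATA) and ca_count == 0
    return (guarded, metadata)
```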

2592	   If the operation does not result in an immediate failure, the server
2593	   will return NFS4_OK, and the CURRENT_FH will remain the destination's
2594	   filehandle.

2596	   If an immediate failure does occur, cr_bytes_copied will be set to
2597	   the number of bytes copied to the destination file before the error
2598	   occurred.  The cr_bytes_copied value indicates the number of bytes
2599	   copied but not which specific bytes have been copied.

2601	   A return of NFS4_OK indicates that either the operation is complete
2602	   or the operation was initiated and a callback will be used to deliver
2603	   the final status of the operation.

2605	   If the cr_callback_id is returned, this indicates that the operation
2606	   was initiated and a CB_COPY callback will deliver the final results
2607	   of the operation.  The cr_callback_id stateid is termed a copy
2608	   stateid in this context.  The server is given the option of returning
2609	   the results in a callback because the data may require a relatively
2610	   long period of time to copy.

2612	   If no cr_callback_id is returned, the operation completed
2613	   synchronously and no callback will be issued by the server.  The
2614	   completion status of the operation is indicated by cr_status.
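
   From the client's side, the three possible outcomes of a COPY reply
   could be told apart roughly as follows.  This is a sketch only: the
   function name, the outcome labels, and the modeling of the optional
   cr_callback_id<1> as a Python sequence of zero or one elements are
   assumptions.

```python
# Hypothetical client-side helper; outcome labels are illustrative.
def interpret_copy_result(cr_status, cr_callback_id=(), cr_bytes_copied=None):
    """Classify a COPY4res as failed, asynchronous, or complete."""
    if cr_status != "NFS4_OK":
        # Immediate failure: only a byte count is available.
        return ("failed", cr_bytes_copied)
    if len(cr_callback_id) == 1:
        # Copy stateid returned: wait for CB_COPY (or poll).
        return ("async", cr_callback_id[0])
    # No copy stateid: the copy completed synchronously.
    return ("complete", None)
```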

2616	   If the copy completes successfully, either synchronously or
2617	   asynchronously, the data copied from the source file to the
2618	   destination file MUST appear identical to the NFS client.  However,
2619	   the NFS server's on-disk representation of the data in the source
2620	   file and destination file MAY differ.  For example, the NFS server
2621	   might encrypt, compress, deduplicate, or otherwise represent the
2622	   on-disk data in the source and destination file differently.

2624	   In the event of a failure, the state of the destination file is
2625	   implementation dependent.  The COPY operation may fail for the
2626	   following reasons (this is a partial list):

2628	   o  NFS4ERR_MOVED

2630	   o  NFS4ERR_NOTSUPP

2632	   o  NFS4ERR_PARTNER_NOTSUPP

2634	   o  NFS4ERR_OFFLOAD_DENIED

2636	   o  NFS4ERR_PARTNER_NO_AUTH

2638	   o  NFS4ERR_FBIG

2640	   o  NFS4ERR_NOTDIR

2642	   o  NFS4ERR_WRONG_TYPE

2644	   o  NFS4ERR_ISDIR

2646	   o  NFS4ERR_INVAL

2648	   o  NFS4ERR_DELAY

2650	   o  NFS4ERR_METADATA_NOTSUPP

2652	   o  NFS4ERR_WRONGSEC

2654	13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy

2655	13.2.1.  ARGUMENT

2657	   struct COPY_ABORT4args {
2658	           /* CURRENT_FH: destination file */
2659	           stateid4        caa_stateid;
2660	   };

2662	13.2.2.  RESULT

2664	   struct COPY_ABORT4res {
2665	           nfsstat4        car_status;
2666	   };

2668	13.2.3.  DESCRIPTION

2670	   COPY_ABORT is used for both intra- and inter-server asynchronous
2671	   copies.  The COPY_ABORT operation allows the client to cancel a
2672	   server-side copy operation that it initiated.  This operation is sent
2673	   in a COMPOUND request from the client to the destination server.
2674	   This operation may be used to cancel a copy when the application that
2675	   requested the copy exits before the operation is completed or for
2676	   some other reason.

2678	   The request contains the filehandle and copy stateid cookies that act
2679	   as the context for the previously initiated copy operation.

2681	   The result's car_status field indicates whether the cancel was
2682	   successful or not.  A value of NFS4_OK indicates that the copy
2683	   operation was canceled and no callback will be issued by the server.
2684	   A copy operation that is successfully canceled may result in none,
2685	   some, or all of the data copied.

2687	   If the server supports asynchronous copies, the server is REQUIRED to
2688	   support the COPY_ABORT operation.

2690	   The COPY_ABORT operation may fail for the following reasons (this is
2691	   a partial list):

2693	   o  NFS4ERR_NOTSUPP

2695	   o  NFS4ERR_RETRY

2697	   o  NFS4ERR_COMPLETE_ALREADY

2699	   o  NFS4ERR_SERVERFAULT

2701	13.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future
2702	       copy

2704	13.3.1.  ARGUMENT

2706	   struct COPY_NOTIFY4args {
2707	           /* CURRENT_FH: source file */
2708	           netloc4         cna_destination_server;
2709	   };

2711	13.3.2.  RESULT

2713	   struct COPY_NOTIFY4resok {
2714	           nfstime4        cnr_lease_time;
2715	           netloc4         cnr_source_server<>;
2716	   };

2718	   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
2719	           case NFS4_OK:
2720	                   COPY_NOTIFY4resok       resok4;
2721	           default:
2722	                   void;
2723	   };

2725	13.3.3.  DESCRIPTION

2727	   This operation is used for an inter-server copy.  A client sends this
2728	   operation in a COMPOUND request to the source server to authorize a
2729	   destination server identified by cna_destination_server to read the
2730	   file specified by CURRENT_FH on behalf of the given user.

2732	   The cna_destination_server MUST be specified using the netloc4
2733	   network location format.  The server is not required to resolve the
2734	   cna_destination_server address before completing this operation.

2736	   If this operation succeeds, the source server will allow the
2737	   cna_destination_server to copy the specified file on behalf of the
2738	   given user.  If COPY_NOTIFY succeeds, the destination server is
2739	   granted permission to read the file as long as both of the following
2740	   conditions are met:

2742	   o  The destination server begins reading the source file before the
2743	      cnr_lease_time expires.  If the cnr_lease_time expires while the
2744	      destination server is still reading the source file, the
2745	      destination server is allowed to finish reading the file.

2747	   o  The client has not issued a COPY_REVOKE for the same combination
2748	      of user, filehandle, and destination server.

2750	   The cnr_lease_time is chosen by the source server.  A cnr_lease_time
2751	   of 0 (zero) indicates an infinite lease.  To renew the copy lease
2752	   time the client should resend the same copy notification request to
2753	   the source server.
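
   The two conditions, together with the infinite-lease rule, could be
   checked by a source server along these lines.  This is a sketch
   under stated assumptions: the function and parameter names are
   hypothetical, and cnr_lease_time (an nfstime4) is reduced to whole
   seconds for brevity.

```python
# Hypothetical source-server check; times are simplified to seconds.
def destination_may_read(lease_seconds, seconds_since_notify,
                         read_started_in_lease, revoked):
    """Decide whether the destination server may read the source file."""
    if revoked:
        # A COPY_REVOKE for this user, filehandle, and destination
        # server withdraws the authorization.
        return False
    if lease_seconds == 0:
        # A cnr_lease_time of 0 indicates an infinite lease.
        return True
    if read_started_in_lease:
        # Reading began before expiry, so it is allowed to finish.
        return True
    return seconds_since_notify < lease_seconds
```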

2755	   To avoid the need for synchronized clocks, copy lease times are
2756	   granted by the server as a time delta.  However, there is a
2757	   requirement that the client and server clocks do not drift
2758	   excessively over the duration of the lease.  There is also the issue
2759	   of propagation delay across the network which could easily be several
2760	   hundred milliseconds as well as the possibility that requests will be
2761	   lost and need to be retransmitted.

2763	   To take propagation delay into account, the client should subtract it
2764	   from copy lease times (e.g., if the client estimates the one-way
2765	   propagation delay as 200 milliseconds, then it can assume that the
2766	   lease is already 200 milliseconds old when it gets it).  In addition,
2767	   it will take another 200 milliseconds to get a response back to the
2768	   server.  So the client must send a lease renewal or send the copy
2769	   offload request to the cna_destination_server at least 400
2770	   milliseconds before the copy lease would expire.  If the propagation
2771	   delay varies over the life of the lease (e.g., the client is on a
2772	   mobile host), the client will need to continuously subtract the
2773	   increase in propagation delay from the copy lease times.
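
   The arithmetic in the example above can be written out as a small
   sketch.  The function name is hypothetical, and times are expressed
   in milliseconds measured from the moment the client receives the
   COPY_NOTIFY reply.

```python
# Hypothetical client-side helper for the propagation-delay rule.
def latest_send_time_ms(lease_ms, one_way_delay_ms):
    """Return how long after receiving the lease the client may wait
    before sending a renewal or the copy offload request.

    Returns None for an infinite lease (lease_ms == 0).
    """
    if lease_ms == 0:
        return None
    # The lease is already one delay old on receipt, and the next
    # request needs one more delay to arrive.
    return max(0, lease_ms - 2 * one_way_delay_ms)
```

   With a 90-second lease and a 200 millisecond one-way delay, the
   client would need to send within 89600 milliseconds of receiving
   the reply, i.e., 400 milliseconds before the nominal expiry.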

2775	   The server's copy lease period configuration should take into account
2776	   the network distance of the clients that will be accessing the
2777	   server's resources.  It is expected that the lease period will take
2778	   into account the network propagation delays and other network delay
2779	   factors for the client population.  Since the protocol does not allow
2780	   for an automatic method to determine an appropriate copy lease
2781	   period, the server's administrator may have to tune the copy lease
2782	   period.

2784	   A successful response will also contain a list of names, addresses,
2785	   and URLs called cnr_source_server, on which the source is willing to
2786	   accept connections from the destination.  These might not be
2787	   reachable from the client and might be located on networks to which
2788	   the client has no connection.

2790	   If the client wishes to perform an inter-server copy, the client MUST
2791	   send a COPY_NOTIFY to the source server.  Therefore, the source
2792	   server MUST support COPY_NOTIFY.

2794	   For a copy only involving one server (the source and destination are
2795	   on the same server), this operation is unnecessary.

2797	   The COPY_NOTIFY operation may fail for the following reasons (this is
2798	   a partial list):

2800	   o  NFS4ERR_MOVED

2802	   o  NFS4ERR_NOTSUPP

2804	   o  NFS4ERR_WRONGSEC

2806	13.4.  Operation 62: COPY_REVOKE - Revoke a destination server's copy
2807	       privileges

2809	13.4.1.  ARGUMENT

2811	   struct COPY_REVOKE4args {
2812	           /* CURRENT_FH: source file */
2813	           netloc4         cra_destination_server;
2814	   };

2816	13.4.2.  RESULT

2818	   struct COPY_REVOKE4res {
2819	           nfsstat4        crr_status;
2820	   };

2822	13.4.3.  DESCRIPTION

2824	   This operation is used for an inter-server copy.  A client sends this
2825	   operation in a COMPOUND request to the source server to revoke the
2826	   authorization of a destination server identified by
2827	   cra_destination_server from reading the file specified by CURRENT_FH
2828	   on behalf of the given user.  If the cra_destination_server has
2829	   already begun copying the file, a successful return from this
2830	   operation indicates that further access will be prevented.

2832	   The cra_destination_server MUST be specified using the netloc4
2833	   network location format.  The server is not required to resolve the
2834	   cra_destination_server address before completing this operation.

2836	   The COPY_REVOKE operation is useful in situations in which the source
2837	   server granted a very long or infinite lease on the destination
2838	   server's ability to read the source file and all copy operations on
2839	   the source file have been completed.

2841	   For a copy only involving one server (the source and destination are
2842	   on the same server), this operation is unnecessary.

2844	   If the server supports COPY_NOTIFY, the server is REQUIRED to support
2845	   the COPY_REVOKE operation.

2847	   The COPY_REVOKE operation may fail for the following reasons (this is
2848	   a partial list):

2850	   o  NFS4ERR_MOVED

2852	   o  NFS4ERR_NOTSUPP

2854	13.5.  Operation 63: COPY_STATUS - Poll for status of a server-side copy

2856	13.5.1.  ARGUMENT

2858	   struct COPY_STATUS4args {
2859	           /* CURRENT_FH: destination file */
2860	           stateid4        csa_stateid;
2861	   };

2863	13.5.2.  RESULT

2865	   struct COPY_STATUS4resok {
2866	           length4         csr_bytes_copied;
2867	           nfsstat4        csr_complete<1>;
2868	   };

2870	   union COPY_STATUS4res switch (nfsstat4 csr_status) {
2871	           case NFS4_OK:
2872	                   COPY_STATUS4resok       resok4;
2873	           default:
2874	                   void;
2875	   };

2877	13.5.3.  DESCRIPTION

2879	   COPY_STATUS is used for both intra- and inter-server asynchronous
2880	   copies.  The COPY_STATUS operation allows the client to poll the
2881	   server to determine the status of an asynchronous copy operation.
2882	   This operation is sent by the client to the destination server.

2884	   If this operation is successful, the number of bytes copied is
2885	   returned to the client in the csr_bytes_copied field.  The
2886	   csr_bytes_copied value indicates the number of bytes copied but not
2887	   which specific bytes have been copied.

2889	   If the optional csr_complete field is present, the copy has
2890	   completed.  In this case the status value indicates the result of the
2891	   asynchronous copy operation.  In all cases, the server will also
2892	   deliver the final results of the asynchronous copy in a CB_COPY
2893	   operation.

2895	   The failure of this operation does not indicate the result of the
2896	   asynchronous copy in any way.
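
   A client polling with COPY_STATUS might interpret the reply as in
   the following sketch.  The function name and state labels are
   assumptions, and the optional csr_complete<1> is modeled as a
   sequence holding zero or one status values.

```python
# Hypothetical client-side helper; state labels are illustrative.
def interpret_copy_status(csr_status, csr_bytes_copied=None, csr_complete=()):
    """Return (state, bytes_copied, final_status) for a poll reply."""
    if csr_status != "NFS4_OK":
        # The poll itself failed; this says nothing about the copy.
        return ("unknown", None, None)
    if len(csr_complete) == 1:
        # The copy has finished; csr_complete carries its result.
        return ("done", csr_bytes_copied, csr_complete[0])
    return ("in-progress", csr_bytes_copied, None)
```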

2898	   If the server supports asynchronous copies, the server is REQUIRED to
2899	   support the COPY_STATUS operation.

2901	   The COPY_STATUS operation may fail for the following reasons (this is
2902	   a partial list):

2904	   o  NFS4ERR_NOTSUPP

2906	   o  NFS4ERR_BAD_STATEID

2908	   o  NFS4ERR_EXPIRED

2910	13.6.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

2912	13.6.1.  ARGUMENT

2914	      /* new */
2915	      const EXCHGID4_FLAG_SUPP_FENCE_OPS      = 0x00000004;

2917	13.6.2.  RESULT

2919	      Unchanged

2921	13.6.3.  MOTIVATION

2923	   Enterprise applications require guarantees that an operation has
2924	   either aborted or completed.  NFSv4.1 provides this guarantee as long
2925	   as the session is alive: simply send a SEQUENCE operation on the same
2926	   slot with a new sequence number, and the successful return of
2927	   SEQUENCE indicates the previous operation has completed.  However, if
2928	   the session is lost, there is no way to know when any in-progress
2929	   operations have aborted or completed.  In hindsight, the NFSv4.1
2930	   specification should have mandated that DESTROY_SESSION abort/
2931	   complete all outstanding operations.

2933	13.6.4.  DESCRIPTION

2935	   A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
2936	   when it sends an EXCHANGE_ID operation.  The server SHOULD set this
2937	   capability in the EXCHANGE_ID reply whether the client requests it or
2938	   not.  If the client ID is created with this capability then the
2939	   following will occur:

2941	   o  The server will not reply to DESTROY_SESSION until all operations
2942	      in progress are completed or aborted.

2944	   o  The server will not reply to subsequent EXCHANGE_ID invoked on the
2945	      same Client Owner with a new verifier until all operations in
2946	      progress on the Client ID's session are completed or aborted.

2948	   o  When DESTROY_CLIENTID is invoked, any sessions (both idle and
2949	      non-idle), opens, locks, delegations, layouts, and/or wants
2950	      (Section 18.49) associated with the client ID are removed.
2951	      Pending operations will be completed or aborted before the
2952	      sessions, opens, locks, delegations, layouts, and/or wants are
2953	      deleted.

2955	   o  The NFS server SHOULD support client ID trunking, and if it does
2956	      and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
2957	      session ID created on one node of the storage cluster MUST be
2958	      destroyable via DESTROY_SESSION.  In addition, DESTROY_CLIENTID
2959	      and an EXCHANGE_ID with a new verifier affect all sessions
2960	      regardless of which node the sessions were created on.
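
   The fencing behavior in the list above can be illustrated with a toy,
   single-threaded model.  This is only a sketch: the class name
   FencedSession and its methods are hypothetical, and refusing new
   operations once a destroy is pending is an assumption of this sketch
   rather than something the text above requires.

```python
class FencedSession:
    """Toy model of one session whose client ID has the fence capability."""

    def __init__(self):
        self.in_flight = set()       # operations started but not yet finished
        self.destroy_pending = False
        self.destroyed = False       # True once the DESTROY_SESSION reply may go out

    def start_op(self, op_id):
        # Assumption of this sketch: no new work once a destroy is pending.
        if self.destroy_pending or self.destroyed:
            raise RuntimeError("session is being destroyed")
        self.in_flight.add(op_id)

    def finish_op(self, op_id):
        # An operation has completed or aborted.
        self.in_flight.discard(op_id)
        self._maybe_reply()

    def destroy_session(self):
        # The reply is withheld until every in-progress operation is done.
        self.destroy_pending = True
        self._maybe_reply()
        return self.destroyed

    def _maybe_reply(self):
        if self.destroy_pending and not self.in_flight:
            self.destroyed = True
```

   In this model, destroy_session() returns False while an operation is
   still outstanding, and the session only becomes destroyed after the
   last finish_op() call.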

2962	13.7.  Operation 64: INITIALIZE

2964	   This operation can be used to initialize the structure imposed by an
2965	   application onto a file, i.e., ADBs, and to punch a hole into a file.

2967	13.7.1.  ARGUMENT

2969	   /*
2970	    * We use data_content4 in case we wish to
2971	    * extend new types later. Note that we
2972	    * are explicitly disallowing data.
2973	    */
2974	   union initialize_arg4 switch (data_content4 content) {
2975	   case NFS4_CONTENT_APP_BLOCK:
2976	           app_data_block4 ia_adb;
2977	   case NFS4_CONTENT_HOLE:
2978	           data_info4      ia_hole;
2979	   default:
2980	           void;
2981	   };

2983	   struct INITIALIZE4args {
2984	           /* CURRENT_FH: file */
2985	           stateid4        ia_stateid;
2986	           stable_how4     ia_stable;
2987	           initialize_arg4 ia_data<>;
2988	   };

2990	13.7.2.  RESULT

2992	   struct INITIALIZE4resok {
2993	           count4          ir_count;
2994	           stable_how4     ir_committed;
2995	           verifier4       ir_writeverf;
2996	           data_content4   ir_sparse;
2997	   };

2999	   union INITIALIZE4res switch (nfsstat4 status) {
3000	   case NFS4_OK:
3001	           INITIALIZE4resok        resok4;
3002	   default:
3003	           void;
3004	   };

3006	13.7.3.  DESCRIPTION
3007	13.7.3.1.  Hole punching

3009	   Whenever a client wishes to zero the blocks backing a particular
3010	   region in the file, it calls the INITIALIZE operation with the
3011	   current filehandle set to the filehandle of the file in question, and
3012	   the equivalent of start offset and length in bytes of the region set
3013	   in ia_hole.di_offset and ia_hole.di_length respectively.  If the
3014	   ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed
3015	   and if it is set to FALSE, then they will be deallocated.  All
3016	   further reads to this region MUST return zeros until overwritten.
3017	   The filehandle specified must be that of a regular file.

3019	   Situations may arise where di_offset and/or di_offset + di_length
3020	   will not be aligned to a boundary that the server does allocations/
3021	   deallocations in.  For most filesystems, this is the block size of
3022	   the file system.  In such a case, the server can deallocate as many
3023	   bytes as it can in the region.  The blocks that cannot be deallocated
3024	   MUST be zeroed.  Except for the block deallocation and maximum hole
3025	   punching capability, an INITIALIZE operation is to be treated
3026	   similarly to a write of zeroes.
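
   The alignment rule above can be sketched as follows.  The function
   name split_hole and the half-open (start, end) range convention are
   assumptions made for illustration; a real server is free to
   deallocate less than this.

```python
def split_hole(offset, length, block_size):
    """Split a hole-punch request into a deallocatable range and zeroed edges.

    Returns (dealloc, zero): dealloc is a half-open (start, end) range of
    whole blocks, or None if no whole block is covered; zero is a list of
    half-open ranges that cannot be deallocated and must be zeroed instead.
    """
    end = offset + length
    first = -(-offset // block_size) * block_size   # round offset up to a block
    last = (end // block_size) * block_size         # round end down to a block
    if first >= last:
        # No complete block lies inside the region: zero all of it.
        zero = [(offset, end)] if length > 0 else []
        return None, zero
    zero = []
    if offset < first:
        zero.append((offset, first))   # leading partial block
    if last < end:
        zero.append((last, end))       # trailing partial block
    return (first, last), zero
```

   For example, with a 4096-byte block size, a request for offset 1000
   and length 10000 lets the server deallocate bytes 4096 through 8191
   and requires it to zero the two partial blocks on either side.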

3028	   The server is not required to complete deallocating the blocks
3029	   specified in the operation before returning.  It is acceptable to
3030	   have the deallocation be deferred.  In fact, INITIALIZE is merely a
3031	   hint; it is valid for a server to return success without ever doing
3032	   anything towards deallocating the blocks backing the region
3033	   specified.  However, any future reads to the region MUST return
3034	   zeroes.

3036	   If used to hole punch, INITIALIZE will result in the space_used
3037	   attribute being decreased by the number of bytes that were
3038	   deallocated.  The space_freed attribute may or may not decrease,
3039	   depending on the support and whether the blocks backing the specified
3040	   range were shared or not.  The size attribute will remain unchanged.

3042	   The INITIALIZE operation MUST NOT change the space reservation
3043	   guarantee of the file.  While the server can deallocate the blocks
3044	   specified by di_offset and di_length, future writes to this region
3045	   MUST NOT fail with NFS4ERR_NOSPC.

3047	   The INITIALIZE operation may fail for the following reasons (this is
3048	   a partial list):

3050	   NFS4ERR_NOTSUPP  Hole punch operations are not supported by the
3051	      NFS server receiving this request.

3053	   NFS4ERR_DIR  The current filehandle is of type NF4DIR.

3055	   NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

3057	   NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
3058	      ordinary file.

3060	13.7.3.2.  ADBs

3062	   If the server supports ADBs, then it MUST support the
3063	   NFS4_CONTENT_APP_BLOCK arm of the INITIALIZE operation.  The server
3064	   has no concept of the structure imposed by the application.  It is
3065	   only when the application writes to a section of the file that order
3066	   gets imposed.  In order to detect corruption even before the
3067	   application utilizes the file, the application will want to
3068	   initialize a range of ADBs using INITIALIZE.

3070	   For ADBs, when the client invokes the INITIALIZE operation, it has
3071	   two desired results:

3073	   1.  The structure described by the app_data_block4 be imposed on the
3074	       file.

3076	   2.  The contents described by the app_data_block4 be sparse.

3078	   If the server supports the INITIALIZE operation, it still might not
3079	   support sparse files.  So if it receives the INITIALIZE operation,
3080	   then it MUST populate the contents of the file with the initialized
3081	   ADBs.

3083	   If the data was already initialized, there are two interesting
3084	   scenarios:

3086	   1.  The data blocks are allocated.

3088	   2.  Initializing in the middle of an existing ADB.

3090	   If the data blocks were already allocated, then the INITIALIZE is a
3091	   hole punch operation.  If the server supports sparse files, then the
3092	   data blocks are to be deallocated.  If not, then the data blocks are
3093	   to be rewritten in the indicated ADB format.

3095	   Since the server has no knowledge of ADBs, it should not report
3096	   misaligned creation of ADBs.  Even though it can detect them, it
3097	   cannot disallow them, as the application might be in the process of
3098	   changing the size of the ADBs.  Thus the server must be prepared to
3099	   handle an INITIALIZE into an existing ADB.

3101	   This document does not mandate the manner in which the server stores
3102	   ADBs sparsely for a file.  It does assume that if ADBs are stored
3103	   sparsely, then the server can detect when an INITIALIZE arrives that
3104	   will force a new ADB to start inside an existing ADB.  For example,
3105	   assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
3106	   starts 1k inside ADBi.  The server should [[Comment.5: Need to flesh
3107	   this out. --TH]]

3109	13.8.  Operation 67: IO_ADVISE - Application I/O access pattern hints

3111	   This section introduces a new operation, named IO_ADVISE, which
3112	   allows NFS clients to communicate application I/O access pattern
3113	   hints to the NFS server.  This new operation will allow hints to be
3114	   sent to the server when applications use posix_fadvise, direct I/O,
3115	   or at any other point that the client finds useful.

3117	13.8.1.  ARGUMENT

3119	   enum IO_ADVISE_type4 {
3120	           IO_ADVISE4_NORMAL                       = 0,
3121	           IO_ADVISE4_SEQUENTIAL                   = 1,
3122	           IO_ADVISE4_SEQUENTIAL_BACKWARDS         = 2,
3123	           IO_ADVISE4_RANDOM                       = 3,
3124	           IO_ADVISE4_WILLNEED                     = 4,
3125	           IO_ADVISE4_WILLNEED_OPPORTUNISTIC       = 5,
3126	           IO_ADVISE4_DONTNEED                     = 6,
3127	           IO_ADVISE4_NOREUSE                      = 7,
3128	           IO_ADVISE4_READ                         = 8,
3129	           IO_ADVISE4_WRITE                        = 9,
3130	           IO_ADVISE4_INIT_PROXIMITY               = 10
3131	   };

3133	   struct IO_ADVISE4args {
3134	           /* CURRENT_FH: file */
3135	           stateid4        iar_stateid;
3136	           offset4         iar_offset;
3137	           length4         iar_count;
3138	           bitmap4         iar_hints;
3139	   };

3141	13.8.2.  RESULT

3143	   struct IO_ADVISE4resok {
3144	           bitmap4 ior_hints;
3145	   };

3147	   union IO_ADVISE4res switch (nfsstat4 _status) {
3148	   case NFS4_OK:
3149	           IO_ADVISE4resok resok4;
3150	   default:
3151	           void;
3152	   };

3154	13.8.3.  DESCRIPTION

3156	   The IO_ADVISE operation sends an I/O access pattern hint to the
3157	   server for the owner of the stateid for a given byte range specified
3158	   by iar_offset and iar_count.  The byte range specified by iar_offset
3159	   and iar_count need not currently exist in the file, but the
3160	   iar_hints will apply to the byte range when it does exist.  If
3161	   iar_count is 0, all data following iar_offset is specified.  The
3162	   server MAY ignore the advice.

3164	   The following are the possible hints:

3166	   IO_ADVISE4_NORMAL  Specifies that the application has no advice to
3167	      give on its behavior with respect to the specified data.  It is
3168	      the default characteristic if no advice is given.

3170	   IO_ADVISE4_SEQUENTIAL  Specifies that the stateid holder expects to
3171	      access the specified data sequentially from lower offsets to
3172	      higher offsets.

3174	   IO_ADVISE4_SEQUENTIAL_BACKWARDS  Specifies that the stateid holder
3175	      expects to access the specified data sequentially from higher
3176	      offsets to lower offsets.

3178	   IO_ADVISE4_RANDOM  Specifies that the stateid holder expects to
3179	      access the specified data in a random order.

3181	   IO_ADVISE4_WILLNEED  Specifies that the stateid holder expects to
3182	      access the specified data in the near future.

3184	   IO_ADVISE4_WILLNEED_OPPORTUNISTIC  Specifies that the stateid holder
3185	      expects to possibly access the data in the near future.  This is a
3186	      speculative hint, and therefore the server should prefetch data or
3187	      indirect blocks only if it can be done at a marginal cost.

3189	   IO_ADVISE4_DONTNEED  Specifies that the stateid holder expects that
3190	      it will not access the specified data in the near future.

3192	   IO_ADVISE4_NOREUSE  Specifies that the stateid holder expects to
3193	      access the specified data once and then not reuse it thereafter.

3195	   IO_ADVISE4_READ  Specifies that the stateid holder expects to read
3196	      the specified data in the near future.

3198	   IO_ADVISE4_WRITE  Specifies that the stateid holder expects to write
3199	      the specified data in the near future.

3201	   IO_ADVISE4_INIT_PROXIMITY  The client has recently accessed the byte
3202	      range in its own cache.  This informs the server that the data in
3203	      the byte range remains important to the client.  When the server
3204	      reaches resource exhaustion, knowing which data is more important
3205	      allows the server to make better choices about which data to, for
3206	      example, purge from a cache or move to secondary storage.  It also
3207	      informs the server which delegations are more important, since if
3208	      delegations are working correctly, once delegated to a client, a
3209	      server might never receive another I/O request for the file.

3211	   The server will return success if the operation is properly formed,
3212	   otherwise the server will return an error.  The server MUST NOT
3213	   return an error if it does not recognize or does not support the
3214	   requested advice.  This is also true even if the client sends
3215	   contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and
3216	   IO_ADVISE4_RANDOM in a single IO_ADVISE operation.  In this case, the
3217	   server MUST return success and an ior_hints value that indicates the
3218	   hint it intends to optimize.  For contradictory hints, this may mean
3219	   simply returning IO_ADVISE4_NORMAL for example.
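
   A minimal sketch of the rules above, assuming the usual bitmap4
   convention that hint value n is carried as bit (n mod 32) of word
   (n div 32).  The helper names to_bitmap4, from_bitmap4, and
   server_reply are hypothetical, and falling back to IO_ADVISE4_NORMAL
   on contradictory access patterns is just the one option the text
   mentions, not a required policy.

```python
IO_ADVISE4_NORMAL = 0
IO_ADVISE4_SEQUENTIAL = 1
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2
IO_ADVISE4_RANDOM = 3

# Access-pattern hints that contradict one another when combined.
PATTERN_HINTS = {IO_ADVISE4_NORMAL, IO_ADVISE4_SEQUENTIAL,
                 IO_ADVISE4_SEQUENTIAL_BACKWARDS, IO_ADVISE4_RANDOM}


def to_bitmap4(hints):
    """Encode a set of hint numbers as a list of 32-bit words."""
    words = []
    for hint in hints:
        word, bit = divmod(hint, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def from_bitmap4(words):
    """Decode a list of 32-bit words into a set of hint numbers."""
    return {i * 32 + bit
            for i, word in enumerate(words)
            for bit in range(32)
            if (word >> bit) & 1}


def server_reply(iar_hints, supported):
    """Return ior_hints; never fail on unknown or contradictory hints."""
    chosen = from_bitmap4(iar_hints) & supported   # drop unsupported hints
    if len(chosen & PATTERN_HINTS) > 1:
        # Contradictory patterns: fall back to NORMAL (one possible policy).
        chosen = (chosen - PATTERN_HINTS) | {IO_ADVISE4_NORMAL}
    return to_bitmap4(chosen)
```

   With this policy, a request carrying both IO_ADVISE4_SEQUENTIAL and
   IO_ADVISE4_RANDOM succeeds and reports only IO_ADVISE4_NORMAL, and a
   request carrying only an unrecognized hint succeeds with an empty
   ior_hints.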

3221	   The ior_hints returned by the server is primarily for debugging
3222	   purposes since the server is under no obligation to carry out the
3223	   hints that it describes in the ior_hints result.  In addition, while
3224	   the server may have intended to implement the hints returned in
3225	   ior_hints, as time progresses, the server may need to change its
3226	   handling of a given file due to several reasons including, but not
3227	   limited to, memory pressure, additional IO_ADVISE hints sent by other
3228	   clients, and heuristically detected file access patterns.

3230	   The server MAY return different advice than what the client
3231	   requested.  If it does, then this might be due to one of several
3232	   conditions, including, but not limited to: another client advising
3233	   of a different I/O access pattern; a different I/O access pattern
3234	   from another client that the server has heuristically detected; or
3235	   the server is not able to support the requested I/O access pattern,
3236	   perhaps due to a temporary resource limitation.

3238	   Each issuance of the IO_ADVISE operation overrides all previous
3239	   issuances of IO_ADVISE for a given byte range.  This effectively
3240	   follows a strategy of last hint wins for a given stateid and byte
3241	   range.

3243	   Clients should assume that hints included in an IO_ADVISE operation
3244	   will be forgotten once the file is closed.

3246	13.8.4.  IMPLEMENTATION

3248	   The NFS client may choose to issue an IO_ADVISE operation to the
3249	   server in several different instances.

3251	   The most obvious is in direct response to an application's execution
3252	   of posix_fadvise.  In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
3253	   may be set based upon the type of file access specified when the file
3254	   was opened.

3256	   Another useful point would be when an application indicates it is
3257	   using direct I/O.  Direct I/O may be specified at file open, in which
3258	   case an IO_ADVISE may be included in the same compound as the OPEN
3259	   operation with the IO_ADVISE4_NOREUSE flag set.  Direct I/O may also
3260	   be specified separately, in which case an IO_ADVISE operation can be
3261	   sent to the server separately.  As above, IO_ADVISE4_WRITE and
3262	   IO_ADVISE4_READ may be set based upon the type of file access
3263	   specified when the file was opened.

3265	13.8.5.  pNFS File Layout Data Type Considerations

3267	   The IO_ADVISE considerations for pNFS are very similar to the COMMIT
3268	   considerations for pNFS.  That is, as with COMMIT, some NFS server
3269	   implementations prefer IO_ADVISE be done on the DS, and some prefer
3270	   it be done on the MDS.

3272	   So for the file's layout type, it is proposed that NFSv4.2 include an
3273	   additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS which is valid only on
3274	   NFSv4.2 or higher.  Any file's layout obtained with NFSv4.1 MUST NOT
3275	   have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  Any file's layout obtained
3276	   with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  If the
3277	   client does not implement IO_ADVISE, then it MUST ignore
3278	   NFL42_UFLG_IO_ADVISE_THRU_MDS.

3280	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is set and the client implements
3281	   IO_ADVISE, then if it wants the DS to honor IO_ADVISE, the client
3282	   MUST send the operation to the MDS, and the server will communicate
3283	   the advice back to each DS.  If the client sends IO_ADVISE to the
3284	   DS, then the server MAY return NFS4ERR_NOTSUPP.

3286	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then this indicates to
3287	   the client that if it wants to inform the server via IO_ADVISE of
3288	   the client's intended use of the file, then the client SHOULD send an
3289	   IO_ADVISE to each DS.  While the client MAY always send IO_ADVISE to
3290	   the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the
3291	   client should expect that such an IO_ADVISE is futile.  Note that a
3292	   client SHOULD use the same set of arguments on each IO_ADVISE sent to
3293	   a DS for the same open file reference.

3295	   The server is not required to support different advice for different
3296	   DS's with the same open file reference.

3298	13.8.5.1.  Dense and Sparse Packing Considerations

3300	   The IO_ADVISE operation MUST use the iar_offset and byte range as
3301	   dictated by the presence or absence of NFL4_UFLG_DENSE.

3303	   E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
3304	   for iar_offset 0 really means iar_offset 10000 in the logical file,
3305	   then an IO_ADVISE for iar_offset 0 means iar_offset 10000.

3307	   E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
3308	   for iar_offset 0 really means iar_offset 0 in the logical file, and
3309	   an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file.

3311	   E.g., suppose NFL4_UFLG_DENSE is present, the stripe unit is 1000
3312	   bytes, the stripe count is 10, and the dense DS file is serving
3313	   iar_offset 0.  If READs or WRITEs to the DS for iar_offsets 0, 1000,
3314	   2000, and 3000 really mean iar_offsets 10000, 20000, 30000, and
3315	   40000 (implying a stripe count of 10 and a stripe unit of 1000), then
3316	   an IO_ADVISE sent to the same DS with an iar_offset of 500 and an
3317	   iar_count of 3000 means that the IO_ADVISE applies to these byte
3318	   ranges of the dense DS file:

3320	     - 500 to 999
3321	     - 1000 to 1999
3322	     - 2000 to 2999
3323	     - 3000 to 3499

3325	   I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.

3327	   It also applies to these byte ranges of the logical file:

3329	     - 10500 to 10999 (500 bytes)
3330	     - 20000 to 20999 (1000 bytes)
3331	     - 30000 to 30999 (1000 bytes)
3332	     - 40000 to 40499 (500 bytes)
3333	     (total            3000 bytes)
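
   The dense-packing arithmetic in this example can be reproduced with a
   short sketch.  The function name dense_to_logical is hypothetical, and
   the first_logical parameter (10000 here) is taken directly from the
   offsets quoted in the example rather than derived from a layout.

```python
def dense_to_logical(ds_offset, count, stripe_unit, stripe_count, first_logical):
    """Map a byte range of a dense DS file to inclusive logical-file ranges.

    first_logical is the logical offset that DS-file offset 0 corresponds to.
    """
    stride = stripe_unit * stripe_count   # logical distance between DS units
    ranges = []
    pos, end = ds_offset, ds_offset + count
    while pos < end:
        unit_index, within = divmod(pos, stripe_unit)
        chunk = min(stripe_unit - within, end - pos)   # stay inside one unit
        start = first_logical + unit_index * stride + within
        ranges.append((start, start + chunk - 1))
        pos += chunk
    return ranges
```

   Calling dense_to_logical(500, 3000, 1000, 10, 10000) walks the four
   stripe units touched by the IO_ADVISE and yields the four logical
   ranges listed above.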

3335	   E.g., suppose NFL4_UFLG_DENSE is absent, the stripe unit is 250
3336	   bytes, the stripe count is 4, and the sparse DS file is serving
3337	   iar_offset 0.  Then READs or WRITEs to the DS for iar_offsets 0,
3338	   1000, 2000, and 3000 really mean iar_offsets 0, 1000, 2000, and 3000
3339	   in the logical file, keeping in mind that on the DS file, byte ranges
3340	   250 to 999, 1250 to 1999, 2250 to 2999, and 3250 to 3999 are not
3341	   accessible.  An IO_ADVISE sent to the same DS with an iar_offset of
3342	   500 and an iar_count of 3000 means that the IO_ADVISE applies to
3343	   these byte ranges of the logical file and the sparse DS file:

3345	     - 500 to 999 (500 bytes)   - no effect
3346	     - 1000 to 1249 (250 bytes) - effective
3347	     - 1250 to 1999 (750 bytes) - no effect
3348	     - 2000 to 2249 (250 bytes) - effective
3349	     - 2250 to 2999 (750 bytes) - no effect
3350	     - 3000 to 3249 (250 bytes) - effective
3351	     - 3250 to 3499 (250 bytes) - no effect
3352	     (subtotal      2250 bytes) - no effect
3353	     (subtotal       750 bytes) - effective
3354	     (grand total   3000 bytes) - no effect + effective
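
   The table above can be regenerated with a small sketch.  The function
   name sparse_effect and the ds_index parameter (the position of this DS
   within the stripe, 0 in the example) are assumptions introduced for
   illustration.

```python
def sparse_effect(offset, count, stripe_unit, stripe_count, ds_index):
    """Split an IO_ADVISE range on a sparse DS file into segments.

    Returns a list of (first_byte, last_byte, effective) tuples, where
    effective is True only for bytes inside a stripe unit this DS serves.
    """
    stride = stripe_unit * stripe_count
    segments = []
    pos, end = offset, offset + count
    while pos < end:
        row_start = (pos // stride) * stride
        own_start = row_start + ds_index * stripe_unit   # unit served by this DS
        own_end = own_start + stripe_unit
        if pos < own_start:
            nxt, effective = min(own_start, end), False
        elif pos < own_end:
            nxt, effective = min(own_end, end), True
        else:
            nxt, effective = min(row_start + stride, end), False
        segments.append((pos, nxt - 1, effective))
        pos = nxt
    return segments
```

   For the example's parameters, sparse_effect(500, 3000, 250, 4, 0)
   produces the seven segments listed above, with 750 effective bytes
   and 2250 bytes on which the hint has no effect.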

3356	   If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
3357	   NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
3358	   sent to the data server with a byte range that overlaps a stripe unit
3359	   that the data server does not serve MUST NOT result in the status
3360	   NFS4ERR_PNFS_IO_HOLE.  Instead, the response SHOULD be successful and
3361	   if the server applies IO_ADVISE hints on any stripe units that
3362	   overlap with the specified range, those hints SHOULD be indicated in
3363	   the response.

3365	13.8.6.  Number of Supported File Segments

3367	   In theory IO_ADVISE allows a client and server to support multiple
3368	   file segments, meaning that different, possibly overlapping, byte
3369	   ranges of the same open file reference will support different hints.
3370	   This is not practical, and in general the server will support just
3371	   one set of hints, and these will apply to the entire file.  However,
3372	   there are some hints that are very ephemeral and essentially amount
3373	   to one-time instructions to the NFS server, which will be forgotten
3374	   momentarily after IO_ADVISE is executed.

3376	   The following hints will always apply to the entire file, regardless
3377	   of the specified byte range:

3379	   o  IO_ADVISE4_NORMAL

3381	   o  IO_ADVISE4_SEQUENTIAL

3383	   o  IO_ADVISE4_SEQUENTIAL_BACKWARDS

3385	   o  IO_ADVISE4_RANDOM

3387	   The following hints will always apply to the specified byte range,
3388	   and will be treated as one-time instructions:

3390	   o  IO_ADVISE4_WILLNEED

3392	   o  IO_ADVISE4_WILLNEED_OPPORTUNISTIC

3394	   o  IO_ADVISE4_DONTNEED

3396	   o  IO_ADVISE4_NOREUSE

3398	   The following hints are modifiers to all other hints, and will apply
3399	   to the entire file and/or to a one time instruction on the specified
3400	   byte range:

3402	   o  IO_ADVISE4_READ

3404	   o  IO_ADVISE4_WRITE
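
   The three groups above could be captured in a lookup such as the
   following sketch.  The function name classify_hints and the
   "unclassified" bucket are assumptions; the latter simply collects
   hints, such as IO_ADVISE4_INIT_PROXIMITY, that this section does not
   place in any group.

```python
WHOLE_FILE = frozenset({
    "IO_ADVISE4_NORMAL",
    "IO_ADVISE4_SEQUENTIAL",
    "IO_ADVISE4_SEQUENTIAL_BACKWARDS",
    "IO_ADVISE4_RANDOM",
})
ONE_TIME = frozenset({
    "IO_ADVISE4_WILLNEED",
    "IO_ADVISE4_WILLNEED_OPPORTUNISTIC",
    "IO_ADVISE4_DONTNEED",
    "IO_ADVISE4_NOREUSE",
})
MODIFIERS = frozenset({"IO_ADVISE4_READ", "IO_ADVISE4_WRITE"})


def classify_hints(hints):
    """Group hint names by how a server is expected to scope them."""
    hints = set(hints)
    return {
        "whole_file": sorted(hints & WHOLE_FILE),   # byte range is ignored
        "one_time": sorted(hints & ONE_TIME),       # applies to the byte range once
        "modifiers": sorted(hints & MODIFIERS),     # qualify the other hints
        "unclassified": sorted(hints - WHOLE_FILE - ONE_TIME - MODIFIERS),
    }
```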

3406	13.9.  Changes to Operation 51: LAYOUTRETURN

3408	13.9.1.  Introduction

3410	   In the pNFS description provided in [2], the client is not enabled to
3411	   relay an error code from the DS to the MDS.  In the specification of
3412	   the Objects-Based Layout protocol [9], use is made of the opaque
3413	   lrf_body field of the LAYOUTRETURN argument to do such a relaying of
3414	   error codes.  In this section, we define a new data structure to
3415	   enable the passing of error codes back to the MDS and provide some
3416	   guidelines on what both the client and MDS should expect in such
3417	   circumstances.

3419	   There are two broad classes of errors, transient and persistent.  The
3420	   client SHOULD strive to only use this new mechanism to report
3421	   persistent errors.  It MUST be able to deal with transient issues by
3422	   itself.  Also, while the client might consider an issue to be
3423	   persistent, it MUST be prepared for the MDS to consider such issues
3424	   to be transient.  A prime example of this is if the MDS fences off a
3425	   client from either a stateid or a filehandle.  The client will get an
3426	   error from the DS and might relay either NFS4ERR_ACCESS or
3427	   NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a
3428	   hard error.  The MDS on the other hand, is waiting for the client to
3429	   report such an error.  For it, the mission is accomplished in that
3430	   the client has returned a layout that the MDS had most likely
3431	   recalled.

3433	   The existing LAYOUTRETURN operation is extended by introducing a new
3434	   data structure to report errors, layoutreturn_device_error4.  Also,
3435	   layoutreturn_error_report4 is introduced to enable an array of errors
3436	   to be reported.

3438	13.9.2.  ARGUMENT

3440	   The ARGUMENT specification of the LAYOUTRETURN operation in section
3441	   18.44.1 of [2] is augmented by the following XDR code [23]:

3443	   struct layoutreturn_device_error4 {
3444	           deviceid4       lrde_deviceid;
3445	           nfsstat4        lrde_status;
3446	           nfs_opnum4      lrde_opnum;
3447	   };

3449	   struct layoutreturn_error_report4 {
3450	           layoutreturn_device_error4      lrer_errors<>;
3451	   };

3453	13.9.3.  RESULT

3455	   The RESULT of the LAYOUTRETURN operation is unchanged; see section
3456	   18.44.2 of [2].

3458	13.9.4.  DESCRIPTION

3460	   The following text is added to the end of the LAYOUTRETURN operation
3461	   DESCRIPTION in section 18.44.3 of [2].

3463	   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
3464	   then if the lrf_body field is NULL, it indicates to the MDS that the
3465	   client experienced no errors.  If lrf_body is non-NULL, then the
3466	   field references error information which is layout type specific.
3467	   I.e., the Objects-Based Layout protocol can continue to utilize
3468	   lrf_body as specified in [9].  For Files-Based Layouts, the
3469	   field references a layoutreturn_error_report4, which contains an
3470	   array of layoutreturn_device_error4.

3472	   Each individual layoutreturn_device_error4 describes a single error
3473	   associated with a DS, which is identified via lrde_deviceid.  The
3474	   operation which returned the error is identified via lrde_opnum.
3475	   Finally the NFS error value (nfsstat4) encountered is provided via
3476	   lrde_status and may consist of the following error codes:

3478	   NFS4_OK:  No issues were found for this device.

3480	   NFS4ERR_NXIO:  The client was unable to establish any communication
3481	      with the DS.

3483	   NFS4ERR_*:  The client was able to establish communication with the
3484	      DS and is returning one of the allowed error codes for the
3485	      operation denoted by lrde_opnum.
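
   As an illustration, the XDR encoding of a layoutreturn_error_report4
   might be produced as sketched below.  The helper names are
   hypothetical.  The sketch assumes deviceid4 is a 16-byte fixed opaque
   and uses the numeric values NFS4ERR_NXIO = 6 and OP_READ = 25 from
   the NFSv4.1 XDR in [2].  How the result is wrapped inside the opaque
   lrf_body is not shown.

```python
import struct

NFS4_OK = 0
NFS4ERR_NXIO = 6      # value assumed from the NFSv4.1 XDR
OP_READ = 25          # value assumed from the NFSv4.1 XDR
DEVICEID4_SIZE = 16   # deviceid4 is a fixed-length opaque


def encode_device_error(deviceid, status, opnum):
    """XDR-encode one layoutreturn_device_error4."""
    if len(deviceid) != DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    # Fixed opaque[16] needs no padding; enums are big-endian 32-bit values.
    return struct.pack(">16sII", deviceid, status, opnum)


def encode_error_report(errors):
    """XDR-encode a layoutreturn_error_report4 (variable-length array)."""
    body = b"".join(encode_device_error(*err) for err in errors)
    return struct.pack(">I", len(errors)) + body   # element count, then elements
```

   A report with a single entry, for example a READ that failed with
   NFS4ERR_NXIO, encodes to 28 bytes: a 4-byte element count followed by
   one 24-byte layoutreturn_device_error4.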

3487	13.9.5.  IMPLEMENTATION

3489	   The following text is added to the end of the LAYOUTRETURN operation
3490	   IMPLEMENTATION in section 18.44.4 of [2].

3492	   A client that expects to use pNFS for a mounted filesystem SHOULD
3493	   check for pNFS support at mount time.  This check SHOULD be performed
3494	   by sending a GETDEVICELIST operation, followed by layout-type-
3495	   specific checks for accessibility of each storage device returned by
3496	   GETDEVICELIST.  If the NFS server does not support pNFS, the
3497	   GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
3498	   error; in this situation it is up to the client to determine whether
3499	   it is acceptable to proceed with NFS-only access.

3501	   Clients are expected to tolerate transient storage device errors, and
3502	   hence clients SHOULD NOT use the LAYOUTRETURN error handling for
3503	   device access problems that may be transient.  The methods by which a
3504	   client decides whether an access problem is transient vs. persistent
3505	   are implementation-specific, but may include retrying I/Os to a data
3506	   server under appropriate conditions.

3508	   When an I/O fails to a storage device, the client SHOULD retry the
3509	   failed I/O via the MDS.  In this situation, before retrying the I/O,
3510	   the client SHOULD return the layout, or the affected portion thereof,
3511	   and SHOULD indicate which storage device or devices were problematic.
3512	   If the client does not do this, the MDS may issue a layout recall
3513	   callback in order to perform the retried I/O.

3515	   The client needs to be cognizant that since this error handling is
3516	   optional in the MDS, the MDS may silently ignore this functionality.
3517	   Also, as the MDS may consider some issues the client reports to be
3518	   expected (see Section 13.9.1), the client might find it difficult to
3519	   detect an MDS which has not implemented error handling via
3520	   LAYOUTRETURN.

3522	   If an MDS is aware that a storage device is proving problematic to a
3523	   client, the MDS SHOULD NOT include that storage device in any pNFS
3524	   layouts sent to that client.  If the MDS is aware that a storage
3525	   device is affecting many clients, then the MDS SHOULD NOT include
3526	   that storage device in any pNFS layouts sent out.  Clients must still
3527	   be aware that the MDS might not have any choice in using the storage
3528	   device, i.e., there might only be one possible layout for the system.

3530	   Another interesting complication is that for existing files, the MDS
3531	   might have no choice in which storage devices to hand out to clients.
3532	   The MDS might try to restripe a file across a different storage
3533	   device, but clients need to be aware that not all implementations
3534	   have restriping support.

3536	   An MDS SHOULD react to a client return of layouts with errors by not
3537	   using the problematic storage devices in layouts for that client, but
3538	   the MDS is not required to indefinitely retain per-client storage
3539	   device error information.  An MDS is also not required to
3540	   automatically reinstate use of a previously problematic storage
3541	   device; administrative intervention may be required instead.

3543	   A client MAY perform I/O via the MDS even when the client holds a
3544	   layout that covers the I/O; servers MUST support this client
3545	   behavior, and MAY recall layouts as needed to complete I/Os.

3547	13.10.  Operation 65: READ_PLUS

3549	   READ_PLUS is a new read operation which allows NFS clients to avoid
3550	   reading holes in a sparse file and to efficiently transfer ADBs.
3551	   READ_PLUS supports all the features of the existing NFSv4.1 READ
3552	   operation [2] but also extends the response to avoid returning data
3553	   for portions of the file which are either initialized and contain no
3554	   backing store or if the result would appear to be so.  I.e., if the
3555	   result was a data block composed entirely of zeros, then it is easier
3556	   to return a hole.  Returning data blocks of uninitialized data wastes
3557	   computational and network resources, thus reducing performance.
3558	   READ_PLUS uses a new result structure that tells the client that the
3559	   result is all zeroes AND the byte-range of the hole in which the
3560	   request was made.

3562	   If the client sends a READ operation, it is explicitly stating that
3563	   it is neither supporting sparse files nor ADBs.  So if a READ occurs
3564	   on a sparse ADB or file, then the server must expand such data to be
3565	   raw bytes.  If a READ occurs in the middle of a hole or ADB, the
3566	   server can only send back bytes starting from that offset.

3568	   Such an operation is inefficient for transfer of sparse sections of
3569	   the file.  As such, READ is marked as OBSOLETE in NFSv4.2.  Instead,
3570	   a client should issue READ_PLUS.  Note that as the client has no a
3571	   priori knowledge of whether either an ADB or a hole is present or
3572	   not, it should always use READ_PLUS.

3574	13.10.1.  ARGUMENT

3576	   struct READ_PLUS4args {
3577	           /* CURRENT_FH: file */
3578	           stateid4        rpa_stateid;
3579	           offset4         rpa_offset;
3580	           count4          rpa_count;
3581	   };

3583	13.10.2.  RESULT

3585	   union read_plus_content switch (data_content4 content) {
3586	   case NFS4_CONTENT_DATA:
3587	           opaque          rpc_data<>;
3588	   case NFS4_CONTENT_APP_BLOCK:
3589	           app_data_block4 rpc_block;
3590	   case NFS4_CONTENT_HOLE:
3591	           data_info4      rpc_hole;
3592	   default:
3593	           void;
3594	   };

3596	   /*
3597	    * Allow a return of an array of contents.
3598	    */
3599	   struct read_plus_res4 {
3600	           bool                    rpr_eof;
3601	           read_plus_content       rpr_contents<>;
3602	   };

3604	   union READ_PLUS4res switch (nfsstat4 status) {
3605	   case NFS4_OK:
3606	           read_plus_res4  resok4;
3607	   default:
3608	           void;
3609	   };

3611	13.10.3.  DESCRIPTION

3613	   The READ_PLUS operation is based upon the NFSv4.1 READ operation [2]
3614	   and similarly reads data from the regular file identified by the
3615	   current filehandle.

3617	   The client provides an rpa_offset of where the READ_PLUS is to start
3618	   and an rpa_count of how many bytes are to be read.  An rpa_offset of
3619	   zero means to read data starting at the beginning of the file.  If
3620	   rpa_offset is greater than or equal to the size of the file, the
3621	   status NFS4_OK is returned with di_length (the data length) set to
3622	   zero and eof set to TRUE.  READ_PLUS is subject to access permissions
3623	   checking.

3625	   The READ_PLUS result is comprised of an array of rpr_contents, each
3626	   of which describes a data_content4 type of data.  For NFSv4.2, the
3627	   allowed values are data, ADB, and hole.  A server is required to
3628	   support the data type, but need not support ADB or hole.  Both an
3629	   ADB and a hole must be returned in their entirety - clients must be
3630	   prepared to get more information than they requested.

3632	   READ_PLUS has to support all of the errors which are returned by READ
3633	   plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and the
3634	   server does not support that arm of the discriminated union, but does
3635	   support one or more additional arms, it can signal to the client that
3636	   it supports the operation, but not the arm with
3637	   NFS4ERR_UNION_NOTSUPP.

3639	   If the data to be returned is comprised entirely of zeros, then the
3640	   server may elect to return that data as a hole.  The server
3641	   differentiates this to the client by setting di_allocated to TRUE in
3642	   this case.  Note that in such a scenario, the server is not required
3643	   to determine the full extent of the "hole" - it does not need to
3644	   determine where the zeros start and end.
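
   As a minimal sketch of the choice described above (not part of the
   protocol definition), a server could report an allocated, all-zero
   range as a hole with di_allocated set to TRUE.  The function name
   encode_zero_aware and the dictionary standing in for data_info4 are
   hypothetical; only di_offset, di_length, and di_allocated come from
   the text.

```python
from typing import Tuple, Union


def encode_zero_aware(offset: int, data: bytes) -> Tuple[str, Union[bytes, dict]]:
    """Pick the READ_PLUS arm for an allocated byte range.

    Assumption: `data` holds bytes that are allocated on disk, so an
    all-zero range is reported as a hole with di_allocated = TRUE.
    """
    if data and not any(data):
        # Every byte is zero: the server may describe the range as a hole.
        return ("NFS4_CONTENT_HOLE", {
            "di_offset": offset,
            "di_length": len(data),
            "di_allocated": True,
        })
    # Otherwise return the raw bytes unchanged.
    return ("NFS4_CONTENT_DATA", data)
```

   Note that the sketch only describes the range it was handed; it
   makes no attempt to find where the run of zeros starts or ends.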

3646	   The server may elect to return adjacent elements of the same type.
3647	   For example, the guard pattern or block size of an ADB might change,
3648	   which would require adjacent elements of type ADB.  Likewise if the
3649	   server has a range of data comprised entirely of zeros and then a
3650	   hole, it might want to return two adjacent holes to the client.

3652	   If the client specifies an rpa_count value of zero, the READ_PLUS
3653	   succeeds and returns zero bytes of data, again subject to access
3654	   permissions checking.  In all situations, the server may choose to
3655	   return fewer bytes than specified by the client.  The client needs to
3656	   check for this condition and handle the condition appropriately.

3658	   If the client specifies an rpa_offset and rpa_count value that is
3659	   entirely contained within a hole of the file, then the di_offset and
3660	   di_length returned must be for the entire hole.  This result is
3661	   considered valid until the file is changed (detected via the change
3662	   attribute).  The server MUST provide the same semantics for the hole
3663	   as if the client read the region and received zeroes; the lifetime of
3664	   the implied hole contents MUST be exactly the same as that of any
3665	   other read data.

3667	   If the client specifies an rpa_offset and rpa_count value that begins
3668	   in a non-hole of the file but extends into a hole, the server should
3669	   return an array comprised of both data and a hole.  The client MUST
3670	   be prepared for the server to return a short read describing just the
3671	   data.  The client will then issue another READ_PLUS for the remaining
3672	   bytes, to which the server will respond with information about the
3673	   hole in the file.

3675	   Except when special stateids are used, the stateid value for a
3676	   READ_PLUS request represents a value returned from a previous byte-
3677	   range lock or share reservation request or the stateid associated
3678	   with a delegation.  The stateid identifies the associated owners if
3679	   any and is used by the server to verify that the associated locks are
3680	   still valid (e.g., have not been revoked).

3682	   If the read ended at the end-of-file (formally, in a correctly formed
3683	   READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
3684	   of the file), or the READ_PLUS operation extends beyond the size of
3685	   the file (if rpa_offset + rpa_count is greater than the size of the
3686	   file), eof is returned as TRUE; otherwise, it is FALSE.  A successful
3687	   READ_PLUS of an empty file will always return eof as TRUE.
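
   The formal eof rule above can be written as a one-line predicate.
   This is a sketch only: the function name read_plus_eof is
   hypothetical, and it assumes the server returned everything that
   was requested (a short read is not modeled).

```python
def read_plus_eof(rpa_offset: int, rpa_count: int, file_size: int) -> bool:
    """Return the eof flag for a correctly formed READ_PLUS request."""
    # eof is TRUE when the request ends at, or extends beyond, the
    # end of the file; otherwise it is FALSE.
    return rpa_offset + rpa_count >= file_size
```

   For an empty file, file_size is zero, so the predicate is TRUE for
   every request, which agrees with the last sentence above.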

3689	   If the current filehandle is not an ordinary file, an error will be
3690	   returned to the client.  In the case that the current filehandle
3691	   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
3692	   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
3693	   returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
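
   The file type checks above amount to a small lookup.  In the sketch
   below the function name is hypothetical, file types are passed as
   the names of the nfs_ftype4 values, and None stands for "no type
   error".

```python
from typing import Optional


def read_plus_type_error(file_type: str) -> Optional[str]:
    """Map the type of the current filehandle to a READ_PLUS error."""
    if file_type == "NF4REG":
        return None  # ordinary file: the operation may proceed
    if file_type == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if file_type == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    # Every other object type is rejected in the same way.
    return "NFS4ERR_WRONG_TYPE"
```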

3695	   For a READ_PLUS with a stateid value of all bits equal to zero, the
3696	   server MAY allow the READ_PLUS to be serviced subject to mandatory
3697	   byte-range locks or the current share deny modes for the file.  For a
3698	   READ_PLUS with a stateid value of all bits equal to one, the server
3699	   MAY allow READ_PLUS operations to bypass locking checks at the
3700	   server.
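
   A server has to recognize the two special stateids before applying
   the rules above.  The sketch below assumes the NFSv4.1 stateid4
   layout of a 32-bit seqid followed by a 12-byte opaque "other"
   field; the function name and the returned labels are hypothetical.

```python
def classify_stateid(seqid: int, other: bytes) -> str:
    """Classify a stateid4 given as (seqid, other)."""
    if len(other) != 12:
        raise ValueError("stateid4 'other' field must be 12 bytes")
    if seqid == 0 and other == bytes(12):
        # All bits zero: the server MAY service the request subject
        # to mandatory byte-range locks and share deny modes.
        return "all-zeros"
    if seqid == 0xFFFFFFFF and other == b"\xff" * 12:
        # All bits one: the server MAY bypass locking checks.
        return "all-ones"
    # Anything else came from a lock, open, or delegation.
    return "ordinary"
```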

3702	   On success, the current filehandle retains its value.

3704	13.10.4.  IMPLEMENTATION

3706	   If the server returns a short read, then the client should send
3707	   another READ_PLUS to get the remaining data.  A server may return
3708	   less data than requested under several circumstances.  The file may
3709	   have been truncated by another client or perhaps on the server
3710	   itself, changing the file size from what the requesting client
3711	   believes to be the case.  This would reduce the actual amount of data
3712	   available to the client.  It is possible that the server reduced the
3713	   transfer size and so returned a short read result.  Server resource
3714	   exhaustion may also result in a short read.
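
   A client-side loop for handling short reads might look like the
   following sketch.  The read_plus callable is a hypothetical
   stand-in for sending one READ_PLUS; it is assumed to return the eof
   flag and a list of (kind, offset, length) tuples.  Neither the
   callable nor the tuple shape is defined by this document.

```python
from typing import Callable, List, Tuple

Content = Tuple[str, int, int]  # (kind, offset, length); hypothetical shape
ReadPlus = Callable[[int, int], Tuple[bool, List[Content]]]


def read_range(read_plus: ReadPlus, offset: int, count: int) -> List[Content]:
    """Reissue READ_PLUS until [offset, offset + count) is covered or eof."""
    end = offset + count
    collected: List[Content] = []
    while offset < end:
        eof, contents = read_plus(offset, end - offset)
        collected.extend(contents)
        # A hole may extend past the request, so resume after the
        # furthest byte that has been described so far.
        new_offset = max((o + n for _, o, n in contents), default=offset)
        if eof or new_offset <= offset:
            break  # end of file reached, or no progress was made
        offset = new_offset
    return collected
```

   The progress check guards against looping forever if a reply
   describes no new bytes and does not set eof.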

3716	   If mandatory byte-range locking is in effect for the file, and if the
3717	   byte-range corresponding to the data to be read from the file is
3718	   WRITE_LT locked by an owner not associated with the stateid, the
3719	   server will return the NFS4ERR_LOCKED error.  The client should try
3720	   to get the appropriate READ_LT via the LOCK operation before re-
3721	   attempting the READ_PLUS.  When the READ_PLUS completes, the client
3722	   should release the byte-range lock via LOCKU.  In addition, the
3723	   server MUST return an array of rpr_contents with values that are
3724	   within the owner's locked byte range.

3726	   If another client has an OPEN_DELEGATE_WRITE delegation for the file
3727	   being read, the delegation must be recalled, and the operation cannot
3728	   proceed until that delegation is returned or revoked.  Except where
3729	   this happens very quickly, one or more NFS4ERR_DELAY errors will be
3730	   returned to requests made while the delegation remains outstanding.
3731	   Normally, delegations will not be recalled as a result of a READ_PLUS
3732	   operation since the recall will occur as a result of an earlier OPEN.
3733	   However, since it is possible for a READ_PLUS to be done with a
3734	   special stateid, the server needs to check for this case even though
3735	   the client should have done an OPEN previously.

3737	13.10.4.1.  Additional pNFS Implementation Information

3739	   With pNFS, the semantics of using READ_PLUS remains the same.  Any
3740	   data server MAY return a hole or ADB result for a READ_PLUS request
3741	   that it receives.

3743	   When a data server chooses to return a hole result, it has the option
3744	   of returning hole information for the data stored on that data server
3745	   (as defined by the data layout), but it MUST NOT return results for a
3746	   byte range that includes data managed by another data server.  If a
3747	   data server can obtain hole information for the parts of the file
3748	   stored on that data server, it SHOULD return HOLE_INFO and the byte
3749	   range of the hole stored on that data server.
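
   One way a data server could honor this restriction is to clip a
   hole to the stripe unit that contains the read offset.  The sketch
   below assumes a simple striped layout in which each stripe unit of
   stripe_unit bytes belongs to a single data server; the function
   name and this layout model are assumptions, not part of the
   protocol.

```python
from typing import Tuple


def clip_hole_to_stripe_unit(hole_offset: int, hole_length: int,
                             read_offset: int, stripe_unit: int) -> Tuple[int, int]:
    """Clip a hole to the stripe unit that contains read_offset.

    Returns (offset, length) of the part of the hole that this data
    server may report.
    """
    unit_start = (read_offset // stripe_unit) * stripe_unit
    unit_end = unit_start + stripe_unit
    start = max(hole_offset, unit_start)
    end = min(hole_offset + hole_length, unit_end)
    if end <= start:
        raise ValueError("hole does not overlap the local stripe unit")
    return start, end - start
```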

3751	   A data server should do its best to return as much information about
3752	   a hole as is feasible without having to contact the metadata server.
3753	   If communication with the metadata server is required, then every
3754	   attempt should be taken to minimize the number of requests.

3756	   If mandatory locking is enforced, then the data server must also
3757	   ensure that it returns only information for a hole that is within the
3758	   owner's locked byte range.

3760	13.10.5.  READ_PLUS with Sparse Files Example

3762	   The following table describes a sparse file.  For each byte range,
3763	   the file contains either non-zero data or a hole.  In addition, the
3764	   server in this example uses a Hole Threshold of 32K.

3766	                        +-------------+----------+
3767	                        | Byte-Range  | Contents |
3768	                        +-------------+----------+
3769	                        | 0-15999     | Hole     |
3770	                        | 16K-31999   | Non-Zero |
3771	                        | 32K-255999  | Hole     |
3772	                        | 256K-287999 | Non-Zero |
3773	                        | 288K-353999 | Hole     |
3774	                        | 354K-417999 | Non-Zero |
3775	                        +-------------+----------+

3777	                                  Table 4

3779	   Under the given circumstances, if a client was to read from the file
3780	   with a max read size of 64K, the following will be the results for
3781	   the given READ_PLUS calls.  This assumes the client has already
3782	   opened the file, acquired a valid stateid ('s' in the example), and
3783	   just needs to issue READ_PLUS requests.

3785	   1.  READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K],
3786	       hole[32K,224K]>.  Since the first hole is less than the server's
3787	       Hole Threshold, the first 32K of the file is returned as data
3788	       and the remaining 32K is returned as a hole which actually
3789	       extends to 256K.

3791	   2.  READ_PLUS(s, 32K, 64K) --> NFS4_OK, eof = false,
3792	       <hole[32K,224K]>.  The requested range was all zeros, and the
3793	       current hole begins at offset 32K and is 224K in length.  Note
3794	       that the client should not have followed up the previous
3795	       READ_PLUS request with this one as the hole information from the
3796	       previous call extended past what the client was requesting.

3798	   3.  READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false, <data[256K,
3799	       288K], hole[288K, 354K]>.  Returns an array of the 32K data and
3800	       the hole which extends to 354K.

3802	   4.  READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true, <data[354K,
3803	       418K]>.  Returns the final 64K of data and informs the client
3804	       there is no more data in the file.
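
   The four calls above can be reproduced with a small model of the
   file in Table 4.  This is a sketch under stated assumptions: K is
   taken to be 1000 (as the table's byte ranges suggest), a hole
   shorter than the Hole Threshold is expanded to zero-filled data,
   and each returned element is written as (kind, offset, length).
   With that notation the hole in call 3 appears as offset 288K and
   length 66K, which is the same extent that ends at 354K.

```python
from typing import List, Tuple

K = 1000  # assumption: Table 4 uses K = 1000
HOLE_THRESHOLD = 32 * K

# (start, end, is_hole) with end exclusive, taken from Table 4.
SEGMENTS = [
    (0, 16 * K, True),
    (16 * K, 32 * K, False),
    (32 * K, 256 * K, True),
    (256 * K, 288 * K, False),
    (288 * K, 354 * K, True),
    (354 * K, 418 * K, False),
]
FILE_SIZE = SEGMENTS[-1][1]


def read_plus(offset: int, count: int) -> Tuple[bool, List[Tuple[str, int, int]]]:
    """Return (eof, contents) for one modeled READ_PLUS call."""
    end = min(offset + count, FILE_SIZE)
    contents: List[Tuple[str, int, int]] = []
    for start, stop, is_hole in SEGMENTS:
        if stop <= offset or start >= end:
            continue  # segment lies outside the requested range
        if is_hole and stop - start >= HOLE_THRESHOLD:
            # Large holes are reported in their entirety.
            contents.append(("hole", start, stop - start))
            continue
        # Data, or a hole below the threshold that is expanded to zeros.
        lo, hi = max(start, offset), min(stop, end)
        if contents and contents[-1][0] == "data":
            kind, d_off, d_len = contents[-1]
            contents[-1] = (kind, d_off, d_len + (hi - lo))  # merge
        else:
            contents.append(("data", lo, hi - lo))
    return offset + count >= FILE_SIZE, contents
```

   Calling read_plus(0, 64 * K) in this model yields one data element
   covering the first 32K followed by the 224K hole, matching call 1.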

3806	13.11.  Operation 66: SEEK

3808	   SEEK is an operation that allows a client to determine the location
3809	   of the next data_content4 in a file.

3811	13.11.1.  ARGUMENT

3813	   struct SEEK4args {
3814	           /* CURRENT_FH: file */
3815	           stateid4        sa_stateid;
3816	           offset4         sa_offset;
3817	           data_content4   sa_what;
3818	   };

3820	13.11.2.  RESULT

3822	   union seek_content switch (data_content4 content) {
3823	   case NFS4_CONTENT_DATA:
3824	           data_info4      sc_data;
3825	   case NFS4_CONTENT_APP_BLOCK:
3826	           app_data_block4 sc_block;
3827	   case NFS4_CONTENT_HOLE:
3828	           data_info4      sc_hole;
3829	   default:
3830	           void;
3831	   };

3833	   struct seek_res4 {
3834	           bool                    sr_eof;
3835	           seek_content            sr_contents;
3836	   };

3838	   union SEEK4res switch (nfsstat4 status) {
3839	   case NFS4_OK:
3840	           seek_res4       resok4;
3841	   default:
3842	           void;
3843	   };

3845	13.11.3.  DESCRIPTION

3847	   From the given sa_offset, find the next data_content4 of type sa_what
3848	   in the file.  For either a hole or ADB, this must return the
3849	   data_content4 in its entirety.  For data, it must not return the
3850	   actual data.

3852	   SEEK must follow the same rules for stateids as READ_PLUS
3853	   (Section 13.10.3).

3855	   If the server cannot find a corresponding sa_what, then the status
3856	   is still NFS4_OK, but sr_eof is TRUE.  In that case, the sr_contents
3857	   contains zeroed-out content of the appropriate type.
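
   The SEEK rules above can be sketched over a sorted list of file
   segments.  The segment list, the function name, and the
   (kind, offset, length) tuples are hypothetical; the sketch models
   only the data and hole types, and it assumes sr_eof is FALSE
   whenever a match is found, which the text does not state.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (kind, offset, length); hypothetical shape


def seek(segments: List[Segment], sa_offset: int,
         sa_what: str) -> Tuple[bool, Segment]:
    """Return (sr_eof, sr_contents) for the next segment of type sa_what."""
    for kind, offset, length in segments:
        # The first matching segment that ends after sa_offset is the
        # next one; it is returned in its entirety, without any data.
        if kind == sa_what and offset + length > sa_offset:
            return False, (kind, offset, length)
    # Nothing found: sr_eof is TRUE with zeroed-out contents.
    return True, (sa_what, 0, 0)
```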

3859	14.  NFSv4.2 Callback Operations

3861	14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
3862	       Attributes Changed

3864	14.1.1.  ARGUMENTS

3866	   struct CB_ATTR_CHANGED4args {
3867	           nfs_fh4         acca_fh;
3868	           bitmap4         acca_critical;
3869	           bitmap4         acca_info;
3870	   };

3872	14.1.2.  RESULTS

3874	   struct CB_ATTR_CHANGED4res {
3875	           nfsstat4        accr_status;
3876	   };

3878	14.1.3.  DESCRIPTION

3880	   The CB_ATTR_CHANGED callback operation is used by the server to
3881	   indicate to the client that the file's attributes have been modified
3882	   on the server.  The server does not convey how the attributes have
3883	   changed, just that they have been modified.  The server can inform
3884	   the client about both critical and informational attribute changes in
3885	   the bitmask arguments.  The client SHOULD query the server about all
3886	   attributes set in acca_critical.  For all changes reflected in
3887	   acca_info, the client can decide whether or not it wants to poll the
3888	   server.
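
   A client might turn the two bitmaps into a list of attributes to
   fetch as in the sketch below.  It relies on the NFSv4 bitmap4
   encoding, in which attribute n is bit (n mod 32) of word (n / 32).
   The function names and the wanted_info parameter, which holds the
   informational attributes this client chooses to poll for, are
   hypothetical.

```python
from typing import Iterable, List, Set


def bitmap4_to_attrs(bitmap: Iterable[int]) -> List[int]:
    """Expand a bitmap4 (array of 32-bit words) into attribute numbers."""
    attrs = []
    for word_index, word in enumerate(bitmap):
        for bit in range(32):
            if word & (1 << bit):
                # Attribute n lives in word n // 32 at bit n % 32.
                attrs.append(word_index * 32 + bit)
    return attrs


def attrs_to_query(acca_critical: Iterable[int], acca_info: Iterable[int],
                   wanted_info: Set[int]) -> List[int]:
    """Pick the attributes to re-fetch after a CB_ATTR_CHANGED."""
    # Every critical attribute is queried; informational ones are
    # queried only if this client cares about them.
    critical = set(bitmap4_to_attrs(acca_critical))
    optional = set(bitmap4_to_attrs(acca_info)) & wanted_info
    return sorted(critical | optional)
```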

3890	   The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
3891	   in acca_critical is the method used by the server to indicate that
3892	   the MAC label for the file referenced by acca_fh has changed.  In
3893	   many ways, the server does not care about the result returned by the
3894	   client.

3896	14.2.  Operation 15: CB_COPY - Report results of a server-side copy

3898	14.2.1.  ARGUMENT

3900	   union copy_info4 switch (nfsstat4 cca_status) {
3901	           case NFS4_OK:
3902	                   void;
3903	           default:
3904	                   length4         cca_bytes_copied;
3905	   };

3907	   struct CB_COPY4args {
3908	           nfs_fh4         cca_fh;
3909	           stateid4        cca_stateid;
3910	           copy_info4      cca_copy_info;
3911	   };

3913	14.2.2.  RESULT

3915	   struct CB_COPY4res {
3916	           nfsstat4        ccr_status;
3917	   };

3919	14.2.3.  DESCRIPTION

3921	   CB_COPY is used for both intra- and inter-server asynchronous copies.
3922	   The CB_COPY callback informs the client of the result of an
3923	   asynchronous server-side copy.  This operation is sent by the
3924	   destination server to the client in a CB_COMPOUND request.  The copy
3925	   is identified by the filehandle and stateid arguments.  The result is
3926	   indicated by the status field.  If the copy failed, cca_bytes_copied
3927	   contains the number of bytes copied before the failure occurred.  The
3928	   cca_bytes_copied value indicates the number of bytes copied but not
3929	   which specific bytes have been copied.
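
   A client handler for the copy_info4 union could be sketched as
   follows.  The CopyInfo class and the describe_copy_result function
   are hypothetical; they mirror the union above, in which
   cca_bytes_copied is present only when cca_status is not NFS4_OK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CopyInfo:
    cca_status: str
    # Present only in the default (failure) arm of the union.
    cca_bytes_copied: Optional[int] = None


def describe_copy_result(info: CopyInfo) -> str:
    """Summarize the outcome reported by a CB_COPY callback."""
    if info.cca_status == "NFS4_OK":
        return "copy completed"
    # The count says how many bytes were copied, not which ones.
    return (f"copy failed with {info.cca_status} "
            f"after {info.cca_bytes_copied} bytes")
```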

3931	   In the absence of an established backchannel, the server cannot
3932	   signal the completion of the COPY via a CB_COPY callback.  The loss
3933	   of a callback channel would be indicated by the server setting the
3934	   SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the
3935	   SEQUENCE operation.  The client must re-establish the callback
3936	   channel to receive the status of the COPY operation.  Prolonged loss
3937	   of the callback channel could result in the server dropping the COPY
3938	   operation state and invalidating the copy stateid.

3940	   If the client supports the COPY operation, the client is REQUIRED to
3941	   support the CB_COPY operation.

3943	   The CB_COPY operation may fail for the following reasons (this is a
3944	   partial list):

3946	   NFS4ERR_NOTSUPP:  The copy offload operation is not supported by the
3947	      NFS client receiving this request.

3949	15.  IANA Considerations

3951	   This section uses terms that are defined in [24].

3953	16.  References

3955	16.1.  Normative References

3957	   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
3958	         Levels", BCP 14, RFC 2119, March 1997.

3960	   [2]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
3961	         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
3962	         January 2010.

3964	   [3]   Haynes, T., "Network File System (NFS) Version 4 Minor Version
3965	         2 External Data Representation Standard (XDR) Description",
3966	         March 2011.

3968	   [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
3969	         Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
3970	         January 2005.

3972	   [5]   Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
3973	         Security Version 3", draft-williams-rpcsecgssv3 (work in
3974	         progress), 2011.

3976	   [6]   The Open Group, "Section 'posix_fadvise()' of System Interfaces
3977	         of The Open Group Base Specifications Issue 6, IEEE Std 1003.1,
3978	         2004 Edition", 2004.

3980	   [7]   Haynes, T., "Requirements for Labeled NFS",
3981	         draft-ietf-nfsv4-labreqs-00 (work in progress).

3983	   [8]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
3984	         Specification", RFC 2203, September 1997.

3986	   [9]   Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
3987	         NFS (pNFS) Operations", RFC 5664, January 2010.

3989	16.2.  Informative References

3991	   [10]  Haynes, T. and D. Noveck, "Network File System (NFS) version 4
3992	         Protocol", draft-ietf-nfsv4-rfc3530bis-09 (work in progress),
3993	         March 2011.

3995	   [11]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
3996	         "NSDB Protocol for Federated Filesystems",
3997	         draft-ietf-nfsv4-federated-fs-protocol (work in progress),
3998	         2010.

4000	   [12]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
4001	         "Administration Protocol for Federated Filesystems",
4002	         draft-ietf-nfsv4-federated-fs-admin (work in progress), 2010.

4004	   [13]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
4005	         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
4006	         HTTP/1.1", RFC 2616, June 1999.

4008	   [14]  Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
4009	         RFC 959, October 1985.

4011	   [15]  Simpson, W., "PPP Challenge Handshake Authentication Protocol
4012	         (CHAP)", RFC 1994, August 1996.

4014	   [16]  VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
4015	         Overhead with Application-Directed Prefetching", Proceedings of
4016	         USENIX Annual Technical Conference, June 2009.

4018	   [17]  Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
4019	         Oracle Database Concepts 11g Release 1 (11.1)", January 2011.

4021	   [18]  Ashdown, L., "Chapter 15, Validating Database Files and
4022	         Backups, of Oracle Database Backup and Recovery User's Guide
4023	         11g Release 1 (11.1)", August 2008.

4025	   [19]  McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
4026	         Corruption of Solaris Internals", 2007.

4028	   [20]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
4029	         Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
4030	         Corruption in the Storage Stack", Proceedings of the 6th USENIX
4031	         Symposium on File and Storage Technologies (FAST '08), 2008.

4033	   [21]  "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
4034	         Deployment, configuration and administration of Red Hat
4035	         Enterprise Linux 5, Edition 6", 2011.

4037	   [22]  Quigley, D. and J. Lu, "Registry Specification for MAC Security
4038	         Label Formats", draft-quigley-label-format-registry (work in
4039	         progress), 2011.

4041	   [23]  Eisler, M., "XDR: External Data Representation Standard",
4042	         RFC 4506, May 2006.

4044	   [24]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
4045	         Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

4047	Appendix A.  Acknowledgments

4049	   For the pNFS Access Permissions Check, the original draft was by
4050	   Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The work
4051	   was influenced by discussions with Benny Halevy and Bruce Fields.  A
4052	   review was done by Tom Haynes.

4054	   For the Sharing change attribute implementation details with NFSv4
4055	   clients, the original draft was by Trond Myklebust.

4057	   For the NFS Server-side Copy, the original draft was by James
4058	   Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
4059	   Iyer.  Tom Talpey co-authored an unpublished version of that
4060	   document.  It was also reviewed by a number of individuals:
4061	   Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
4062	   Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
4063	   and Nico Williams.

4065	   For the NFS space reservation operations, the original draft was by
4066	   Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

4068	   For the sparse file support, the original draft was by Dean
4069	   Hildebrand and Marc Eshel.  Valuable input and advice was received
4070	   from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
4071	   Richard Scheffenegger.

4073	   For the Application IO Hints, the original draft was by Dean
4074	   Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
4075	   early reviewers included Benny Halevy and Pranoop Erasani.

4077	   For Labeled NFS, the original draft was by David Quigley, James
4078	   Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
4079	   Sorin Faibish, Nico Williams, and David Black also contributed in
4080	   the final push to get this accepted.

4082	Appendix B.  RFC Editor Notes

4084	   [RFC Editor: please remove this section prior to publishing this
4085	   document as an RFC]

4087	   [RFC Editor: prior to publishing this document as an RFC, please
4088	   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
4089	   RFC number of this document]

4091	Author's Address

4093	   Thomas Haynes
4094	   NetApp
4095	   9110 E 66th St
4096	   Tulsa, OK  74133
4097	   USA

4099	   Phone: +1 918 307 1415
4100	   Email: thomas@netapp.com
4101	   URI:   http://www.tulsalabs.com