idnits 2.17.1 

draft-ietf-nfsv4-minorversion2-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 5 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  == There are 5 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     Furthermore, each DS MUST not report to a client either a sparse
     ADB or data which belongs to another DS.  One implication of this
     requirement is that the app_data_block4's adb_block_size MUST be either
     be the stripe width or the stripe width must be an even multiple of it.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     The second change is to provide a method for the server to notify
     the client that the attribute changed on an open file on the server.  If
     the file is closed, then during the open attempt, the client will gather
     the new attribute value.  The server MUST not communicate the new value
     of the attribute, the client MUST query it.  This requirement stems from
     the need for the client to provide sufficient access rights to the
     attribute.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     When a data server chooses to return a hole result, it has the
     option of returning hole information for the data stored on that data
     server (as defined by the data layout), but it MUST not return results
     for a byte range that includes data managed by another data server.  Data
     servers that can obtain hole information for the parts of the file stored
     on that data server, the data server SHOULD return HOLE_INFO and the byte
     range of the hole stored on that data server.

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 02, 2012) is 4370 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '0' is mentioned on line 3756, but not defined

  -- Looks like a reference, but probably isn't: '32K' on line 3756

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 5661 (ref. '2') (Obsoleted by RFC 8881)

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  == Outdated reference: A later version (-05) exists of
     draft-ietf-nfsv4-labreqs-00

  ** Downref: Normative reference to an Informational draft:
     draft-ietf-nfsv4-labreqs (ref. '7')

  == Outdated reference: A later version (-35) exists of
     draft-ietf-nfsv4-rfc3530bis-09

  -- Obsolete informational reference (is this intentional?): RFC 2616 (ref.
     '13') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC
     7235)

  -- Obsolete informational reference (is this intentional?): RFC 5226 (ref.
     '24') (Obsoleted by RFC 8126)


     Summary: 2 errors (**), 0 flaws (~~), 10 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NFSv4                                                          T. Haynes
3	Internet-Draft                                                    Editor
4	Intended status: Standards Track                            May 02, 2012
5	Expires: November 3, 2012

7	                     NFS Version 4 Minor Version 2
8	                 draft-ietf-nfsv4-minorversion2-09.txt

10	Abstract

12	   This Internet-Draft describes NFS version 4 minor version 2,
13	   focusing mainly on the protocol extensions made from NFS version 4
14	   minor version 0 and NFS version 4 minor version 1.  Major extensions
15	   introduced in NFS version 4 minor version 2 include Server-side
16	   Copy, Space Reservations, and Support for Sparse Files.

18	Requirements Language

20	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
21	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
22	   document are to be interpreted as described in RFC 2119 [1].

24	Status of this Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at http://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on November 3, 2012.

41	Copyright Notice

43	   Copyright (c) 2012 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (http://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	   This document may contain material from IETF Documents or IETF
57	   Contributions published or made publicly available before November
58	   10, 2008.  The person(s) controlling the copyright in some of this
59	   material may not have granted the IETF Trust the right to allow
60	   modifications of such material outside the IETF Standards Process.
61	   Without obtaining an adequate license from the person(s) controlling
62	   the copyright in such materials, this document may not be modified
63	   outside the IETF Standards Process, and derivative works of it may
64	   not be created outside the IETF Standards Process, except to format
65	   it for publication as an RFC or to translate it into languages other
66	   than English.

68	Table of Contents

70	   1.  Introduction
71	     1.1.   The NFS Version 4 Minor Version 2 Protocol
72	     1.2.   Scope of This Document
73	     1.3.   NFSv4.2 Goals
74	     1.4.   Overview of NFSv4.2 Features
75	       1.4.1.  Sparse Files
76	       1.4.2.  Application I/O Advise
77	     1.5.   Differences from NFSv4.1
78	   2.  NFS Server-side Copy
79	     2.1.   Introduction
80	     2.2.   Protocol Overview
81	       2.2.1.  Intra-Server Copy
82	       2.2.2.  Inter-Server Copy
83	       2.2.3.  Server-to-Server Copy Protocol
84	     2.3.   Operations
85	       2.3.1.  netloc4 - Network Locations
86	       2.3.2.  Copy Offload Stateids
87	     2.4.   Security Considerations
88	       2.4.1.  Inter-Server Copy Security
89	   3.  Sparse Files
90	     3.1.   Introduction
91	     3.2.   Terminology
92	   4.  Space Reservation
93	     4.1.   Introduction
94	   5.  Support for Application IO Hints
95	     5.1.   Introduction
96	     5.2.   POSIX Requirements
97	     5.3.   Additional Requirements
98	     5.4.   Security Considerations
99	     5.5.   IANA Considerations
100	   6.  Application Data Block Support
101	     6.1.   Generic Framework
102	       6.1.1.  Data Block Representation
103	       6.1.2.  Data Content
104	     6.2.   pNFS Considerations
105	     6.3.   An Example of Detecting Corruption
106	     6.4.   Example of READ_PLUS
107	     6.5.   Zero Filled Holes
108	   7.  Labeled NFS
109	     7.1.   Introduction
110	     7.2.   Definitions
111	     7.3.   MAC Security Attribute
112	       7.3.1.  Interpreting FATTR4_SEC_LABEL
113	       7.3.2.  Delegations
114	       7.3.3.  Permission Checking
115	       7.3.4.  Object Creation
116	       7.3.5.  Existing Objects
117	       7.3.6.  Label Changes
118	     7.4.   pNFS Considerations
119	     7.5.   Discovery of Server LNFS Support
120	     7.6.   MAC Security NFS Modes of Operation
121	       7.6.1.  Full Mode
122	       7.6.2.  Guest Mode
123	     7.7.   Security Considerations
124	   8.  Sharing change attribute implementation details with NFSv4
125	       clients
126	     8.1.   Introduction
127	     8.2.   Definition of the 'change_attr_type' per-file system
128	            attribute
129	   9.  Security Considerations
130	   10. Error Values
131	     10.1.  Error Definitions
132	       10.1.1. General Errors
133	       10.1.2. Server to Server Copy Errors
134	       10.1.3. Labeled NFS Errors
135	   11. File Attributes
136	     11.1.  Attribute Definitions
137	   12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
138	   13. NFSv4.2 Operations
139	     13.1.  Operation 59: COPY - Initiate a server-side copy
140	     13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy
141	     13.3.  Operation 61: COPY_NOTIFY - Notify a source server of
142	            a future copy
143	     13.4.  Operation 62: COPY_REVOKE - Revoke a destination
144	            server's copy privileges
145	     13.5.  Operation 63: COPY_STATUS - Poll for status of a
146	            server-side copy
147	     13.6.  Modification to Operation 42: EXCHANGE_ID -
148	            Instantiate Client ID
149	     13.7.  Operation 64: INITIALIZE
150	     13.8.  Operation 67: IO_ADVISE - Application I/O access
151	            pattern hints
152	     13.9.  Changes to Operation 51: LAYOUTRETURN
153	     13.10. Operation 65: READ_PLUS
154	     13.11. Operation 66: SEEK
155	   14. NFSv4.2 Callback Operations
156	     14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that
157	            the File's Attributes Changed
158	     14.2.  Operation 15: CB_COPY - Report results of a
159	            server-side copy
160	   15. IANA Considerations
161	   16. References
162	     16.1.  Normative References
163	     16.2.  Informative References

165	   Appendix A.  Acknowledgments
166	   Appendix B.  RFC Editor Notes
167	   Author's Address

169	1.  Introduction

171	1.1.  The NFS Version 4 Minor Version 2 Protocol

173	   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
174	   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
175	   version, NFSv4.0, is described in [10] and the second minor version,
176	   NFSv4.1, is described in [2].  It follows the guidelines for minor
177	   versioning that are listed in Section 11 of [10].

179	   As a minor version, NFSv4.2 is consistent with the overall goals for
180	   NFSv4, but extends the protocol so as to better meet those goals,
181	   based on experiences with NFSv4.1.  In addition, NFSv4.2 has adopted
182	   some additional goals, which motivate some of the major extensions in
183	   NFSv4.2.

185	1.2.  Scope of This Document

187	   This document describes the NFSv4.2 protocol.  With respect to
188	   NFSv4.0 and NFSv4.1, this document does not:

190	   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to
191	      contrast with NFSv4.2.

193	   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

195	   o  clarify the NFSv4.0 or NFSv4.1 protocols.  That is, any
196	      clarifications made here apply to NFSv4.2 and to neither of the
197	      prior protocols.

199	   The full XDR for NFSv4.2 is presented in [3].

201	1.3.  NFSv4.2 Goals

203	   [[Comment.1: This needs fleshing out! --TH]]

205	1.4.  Overview of NFSv4.2 Features

207	   [[Comment.2: This needs fleshing out! --TH]]

209	1.4.1.  Sparse Files

211	   Two new operations are defined to support the reading of sparse files
212	   (READ_PLUS) and the punching of holes to remove backing storage
213	   (INITIALIZE).

215	1.4.2.  Application I/O Advise

217	   We propose a new IO_ADVISE operation for NFSv4.2 that clients can use
218	   to communicate expected I/O behavior to the server.  By communicating
219	   future I/O behavior such as whether a file will be accessed
220	   sequentially or randomly, and whether a file will or will not be
221	   accessed in the near future, servers can optimize future I/O requests
222	   for a file by, for example, prefetching or evicting data.  This
223	   operation can be used to support the posix_fadvise function as well
224	   as other applications such as databases and video editors.

226	1.5.  Differences from NFSv4.1

228	   In NFSv4.1, the only way to introduce new variants of an operation
229	   was to introduce a new operation.  For example, READ becomes either
230	   READ2 or READ_PLUS.  With the use of discriminated unions as
231	   parameters to such operations in NFSv4.2, it is possible to add a
232	   new arm in a subsequent minor version.  It is also possible to move
233	   such an operation from OPTIONAL/RECOMMENDED to REQUIRED.  Forcing an
234	   implementation to adopt each arm of a discriminated union at such a
235	   time does not meet the spirit of the minor versioning rules.  As
236	   such, new arms of a discriminated union MUST follow the same
237	   guidelines for minor versioning as operations in NFSv4.1; that is,
238	   they may not be made REQUIRED.  To support this, a new error code,
239	   NFS4ERR_UNION_NOTSUPP, is introduced, which allows the server to
240	   communicate to the client that the operation is supported, but the
241	   specific arm of the discriminated union is not.
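
   As an illustration of the rule above, the following minimal sketch
   shows how a server might distinguish an unsupported operation from
   an unsupported arm of a discriminated union.  It is not part of the
   protocol definition.  The table SUPPORTED_ARMS, the arm names, and
   the function check_request are hypothetical, and the status codes
   are symbolic only (no wire values are implied).

```python
from enum import Enum, auto


class Status(Enum):
    # Symbolic status codes only; numeric wire values are deliberately omitted.
    NFS4_OK = auto()
    NFS4ERR_NOTSUPP = auto()
    NFS4ERR_UNION_NOTSUPP = auto()


# Hypothetical server configuration: operation name -> union arms implemented.
# The arm names are illustrative and are not taken from the XDR.
SUPPORTED_ARMS = {
    "READ_PLUS": {"DATA"},
}


def check_request(op, arm):
    """Return the status a server might send for `op` invoked with union arm `arm`."""
    arms = SUPPORTED_ARMS.get(op)
    if arms is None:
        # The operation itself is not implemented.
        return Status.NFS4ERR_NOTSUPP
    if arm not in arms:
        # The operation is implemented, but this arm of its union is not.
        return Status.NFS4ERR_UNION_NOTSUPP
    return Status.NFS4_OK
```

   A client receiving NFS4ERR_UNION_NOTSUPP in this sketch knows that
   the operation is available and can retry with a different arm.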

243	2.  NFS Server-side Copy

245	2.1.  Introduction

247	   This section describes a server-side copy feature for the NFS
248	   protocol.

250	   The server-side copy feature provides a mechanism for the NFS client
251	   to perform a file copy on the server without the data being
252	   transmitted back and forth over the network.

254	   Without this feature, an NFS client copies data from one location to
255	   another by reading the data from the server over the network, and
256	   then writing the data back over the network to the server.  Using
257	   this server-side copy operation, the client is able to instruct the
258	   server to copy the data locally without the data being sent back and
259	   forth over the network unnecessarily.

261	   In general, this feature is useful whenever data is copied from one
262	   location to another on the server.  It is particularly useful when
263	   copying the contents of a file from a backup.  Backup versions of a
264	   file are copied for a number of reasons, including restoring and
265	   cloning data.

267	   If the source object and destination object are on different file
268	   servers, the file servers will communicate with one another to
269	   perform the copy operation.  The server-to-server protocol by which
270	   this is accomplished is not defined in this document.

272	2.2.  Protocol Overview

274	   The server-side copy offload operations support both intra-server and
275	   inter-server file copies.  An intra-server copy is a copy in which
276	   the source file and destination file reside on the same server.  In
277	   an inter-server copy, the source file and destination file are on
278	   different servers.  In both cases, the copy may be performed
279	   synchronously or asynchronously.

281	   Throughout the rest of this document, we refer to the NFS server
282	   containing the source file as the "source server" and the NFS server
283	   to which the file is transferred as the "destination server".  In the
284	   case of an intra-server copy, the source server and destination
285	   server are the same server.  Therefore, in the context of an
286	   intra-server copy, the terms source server and destination server
287	   refer to the single server performing the copy.

289	   The operations described below are designed to copy files.  Other
290	   file system objects can be copied by building on these operations or
291	   using other techniques.  For example, if the user wishes to copy a
292	   directory, the client can synthesize a directory copy by first
293	   creating the destination directory and then copying the source
294	   directory's files to the new destination directory.  If the user
295	   wishes to copy a namespace junction [11] [12], the client can use the
296	   ONC RPC Federated Filesystem protocol [12] to perform the copy.
297	   Specifically, the client can determine the source junction's
298	   attributes using the FEDFS_LOOKUP_FSN procedure and create a
299	   duplicate junction using the FEDFS_CREATE_JUNCTION procedure.
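
   The directory case described above can be sketched as follows.
   This is a local-file-system analogy under stated assumptions, not
   client code: copy_directory is a hypothetical helper, and the
   copy_file callable stands in for whatever mechanism the client uses
   to issue a per-file COPY.  Only regular files directly inside the
   source directory are handled; subdirectories are skipped.

```python
import os


def copy_directory(src_dir, dst_dir, copy_file):
    """Create dst_dir, then copy each regular file in src_dir into it.

    copy_file(src_path, dst_path) is a hypothetical stand-in for a
    per-file server-side COPY request.
    Returns the names of the files that were copied, in sorted order.
    """
    # Step 1: create the destination directory.
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    # Step 2: copy the source directory's files one by one.
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if os.path.isfile(src_path):
            copy_file(src_path, os.path.join(dst_dir, name))
            copied.append(name)
    return copied
```

   For local experimentation, shutil.copyfile can be passed as the
   copy_file argument.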

301	   For the inter-server copy protocol, the operations are defined to be
302	   compatible with a server-to-server copy protocol in which the
303	   destination server reads the file data from the source server.  This
304	   model in which the file data is pulled from the source by the
305	   destination has a number of advantages over a model in which the
306	   source pushes the file data to the destination.  The advantages of
307	   the pull model include:

309	   o  The pull model only requires a remote server (i.e., the
310	      destination server) to be granted read access.  A push model
311	      requires a remote server (i.e., the source server) to be granted
312	      write access, which is more privileged.

314	   o  The pull model allows the destination server to stop reading if it
315	      has run out of space.  In a push model, the destination server
316	      must flow control the source server in this situation.

318	   o  The pull model allows the destination server to easily flow
319	      control the data stream by adjusting the size of its read
320	      operations.  In a push model, the destination server does not have
321	      this ability.  The source server in a push model is capable of
322	      writing chunks larger than the destination server has requested in
323	      attributes and session parameters.  In theory, the destination
324	      server could perform a "short" write in this situation, but this
325	      approach is known to behave poorly in practice.
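
   The flow-control advantages listed above can be illustrated with a
   minimal sketch of a destination-side pull loop.  The names
   pull_copy, read_from_source, free_space, and read_size are
   hypothetical; read_from_source stands in for a read request sent to
   the source server, and the real server-to-server protocol is not
   defined here.

```python
def pull_copy(read_from_source, free_space, read_size):
    """Pull data from the source until end of file or until space runs out.

    read_from_source(offset, count) -> bytes is a hypothetical stand-in
    for a read sent to the source server; it returns b"" at end of file.
    Returns (data, saw_eof), where saw_eof is True only if end of file
    was actually observed.
    """
    data = bytearray()
    offset = 0
    while True:
        # The destination picks the size of every read, so it never
        # receives more than it asked for or has room to store.
        count = min(read_size, free_space - len(data))
        if count <= 0:
            # Out of space: the destination simply stops reading.
            return bytes(data), False
        chunk = read_from_source(offset, count)
        if not chunk:
            return bytes(data), True
        data += chunk
        offset += len(chunk)
```

   In this sketch the destination needs no cooperation from the source
   to limit the transfer; it only stops issuing reads.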

327	   The following operations are provided to support server-side copy:

329	   COPY_NOTIFY:  For inter-server copies, the client sends this
330	      operation to the source server to notify it of a future file copy
331	      from a given destination server for the given user.

333	   COPY_REVOKE:  Also for inter-server copies, the client sends this
334	      operation to the source server to revoke permission to copy a file
335	      for the given user.

337	   COPY:  Used by the client to request a file copy.

339	   COPY_ABORT:  Used by the client to abort an asynchronous file copy.

341	   COPY_STATUS:  Used by the client to poll the status of an
342	      asynchronous file copy.

344	   CB_COPY:  Used by the destination server to report the results of an
345	      asynchronous file copy to the client.

347	   These operations are described in detail in Section 2.3.  This
348	   section provides an overview of how these operations are used to
349	   perform server-side copies.

351	2.2.1.  Intra-Server Copy

353	   To copy a file on a single server, the client uses a COPY operation.
354	   The server may respond to the copy operation with the final results
355	   of the copy or it may perform the copy asynchronously and deliver the
356	   results using a CB_COPY operation callback.  If the copy is performed
357	   asynchronously, the client may poll the status of the copy using
358	   COPY_STATUS or cancel the copy using COPY_ABORT.

360	   A synchronous intra-server copy is shown in Figure 1.  In this
361	   example, the NFS server chooses to perform the copy synchronously.
362	   The copy operation is completed, either successfully or
363	   unsuccessfully, before the server replies to the client's request.
364	   The server's reply contains the final result of the operation.

366	     Client                                  Server
367	        +                                      +
368	        |                                      |
369	        |--- COPY ---------------------------->| Client requests
370	        |<------------------------------------/| a file copy
371	        |                                      |
372	        |                                      |

374	                Figure 1: A synchronous intra-server copy.

376	   An asynchronous intra-server copy is shown in Figure 2.  In this
377	   example, the NFS server performs the copy asynchronously.  The
378	   server's reply to the copy request indicates that the copy operation
379	   was initiated and the final result will be delivered at a later time.
380	   The server's reply also contains a copy stateid.  The client may use
381	   this copy stateid to poll for status information (as shown) or to
382	   cancel the copy using a COPY_ABORT.  When the server completes the
383	   copy, the server performs a callback to the client and reports the
384	   results.

386	     Client                                  Server
387	        +                                      +
388	        |                                      |
389	        |--- COPY ---------------------------->| Client requests
390	        |<------------------------------------/| a file copy
391	        |                                      |
392	        |                                      |
393	        |--- COPY_STATUS --------------------->| Client may poll
394	        |<------------------------------------/| for status
395	        |                                      |
396	        |                  .                   | Multiple COPY_STATUS
397	        |                  .                   | operations may be sent.
398	        |                  .                   |
399	        |                                      |
400	        |<-- CB_COPY --------------------------| Server reports results
401	        |\------------------------------------>|
402	        |                                      |

404	               Figure 2: An asynchronous intra-server copy.
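
   The exchanges in Figures 1 and 2 can be sketched from the client's
   point of view as follows.  This is a toy model under stated
   assumptions: FakeServer, its dictionary-shaped replies, the stateid
   string, and client_copy are all hypothetical, and the fake server
   advances its copy by one step per COPY_STATUS poll purely so that
   the example terminates.

```python
class FakeServer:
    """Toy in-memory server (hypothetical); not a real NFS implementation."""

    def __init__(self, asynchronous, ticks_needed=2):
        self.asynchronous = asynchronous
        self.ticks_left = ticks_needed
        self.cb_copy = None  # callback registered by the client for CB_COPY

    def copy(self, src, dst):
        if not self.asynchronous:
            # Synchronous case: the reply carries the final result.
            return {"final": True, "result": "ok"}
        # Asynchronous case: the reply carries a copy stateid instead.
        return {"final": False, "stateid": "copy-stateid-1"}

    def copy_status(self, stateid):
        # Each poll advances the toy copy; on completion the server
        # reports the result to the client through the CB_COPY callback.
        self.ticks_left -= 1
        if self.ticks_left == 0:
            self.cb_copy(stateid, "ok")
        return {"in_progress": self.ticks_left > 0}


def client_copy(server, src, dst, max_polls=10):
    """Issue COPY, then poll until CB_COPY arrives or max_polls is reached.

    Returns (result, number_of_polls); result is "pending" if no
    CB_COPY was received within max_polls polls.
    """
    results = {}
    server.cb_copy = lambda stateid, result: results.update({stateid: result})
    reply = server.copy(src, dst)
    if reply["final"]:
        return reply["result"], 0
    stateid = reply["stateid"]
    polls = 0
    while stateid not in results and polls < max_polls:
        server.copy_status(stateid)
        polls += 1
    return results.get(stateid, "pending"), polls
```

   A client that gives up waiting in this model would use the same
   copy stateid to cancel the copy with COPY_ABORT.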

406	2.2.2.  Inter-Server Copy

408	   A copy may also be performed between two servers.  The copy protocol
409	   is designed to accommodate a variety of network topologies.  As shown
410	   in Figure 3, the client and servers may be connected by multiple
411	   networks.  In particular, the servers may be connected by a
412	   specialized, high-speed network (network 192.168.33.0/24 in the
413	   diagram) that does not include the client.  The protocol allows the
414	   client to set up the copy between the servers (over network
415	   10.11.78.0/24 in the diagram) and for the servers to communicate on
416	   the high-speed network if they choose to do so.

418	                             192.168.33.0/24
419	                 +-------------------------------------+
420	                 |                                     |
421	                 |                                     |
422	                 | 192.168.33.18                       | 192.168.33.56
423	         +-------+------+                       +------+------+
424	         |     Source   |                       | Destination |
425	         +-------+------+                       +------+------+
426	                 | 10.11.78.18                         | 10.11.78.56
427	                 |                                     |
428	                 |                                     |
429	                 |             10.11.78.0/24           |
430	                 +------------------+------------------+
431	                                    |
432	                                    |
433	                                    | 10.11.78.243
434	                              +-----+-----+
435	                              |   Client  |
436	                              +-----------+

438	            Figure 3: An example inter-server network topology.

440	   For an inter-server copy, the client notifies the source server that
441	   a file will be copied by the destination server using a COPY_NOTIFY
442	   operation.  The client then initiates the copy by sending the COPY
443	   operation to the destination server.  The destination server may
444	   perform the copy synchronously or asynchronously.

446	   A synchronous inter-server copy is shown in Figure 4.  In this case,
447	   the destination server chooses to perform the copy before responding
448	   to the client's COPY request.

450	   An asynchronous inter-server copy is shown in Figure 5.  In this
451	   case, the destination server chooses to respond to the client's COPY
452	   request immediately and then perform the copy asynchronously.

454	     Client                Source         Destination
455	        +                    +                 +
456	        |                    |                 |
457	        |--- COPY_NOTIFY --->|                 |
458	        |<------------------/|                 |
459	        |                    |                 |
460	        |                    |                 |
461	        |--- COPY ---------------------------->|
462	        |                    |                 |
463	        |                    |                 |
464	        |                    |<----- read -----|
465	        |                    |\--------------->|
466	        |                    |                 |
467	        |                    |        .        | Multiple reads may
468	        |                    |        .        | be necessary
469	        |                    |        .        |
470	        |                    |                 |
471	        |                    |                 |
472	        |<------------------------------------/| Destination replies
473	        |                    |                 | to COPY

475	                Figure 4: A synchronous inter-server copy.

477	     Client                Source         Destination
478	        +                    +                 +
479	        |                    |                 |
480	        |--- COPY_NOTIFY --->|                 |
481	        |<------------------/|                 |
482	        |                    |                 |
483	        |                    |                 |
484	        |--- COPY ---------------------------->|
485	        |<------------------------------------/|
486	        |                    |                 |
487	        |                    |                 |
488	        |                    |<----- read -----|
489	        |                    |\--------------->|
490	        |                    |                 |
491	        |                    |        .        | Multiple reads may
492	        |                    |        .        | be necessary
493	        |                    |        .        |
494	        |                    |                 |
495	        |                    |                 |
496	        |--- COPY_STATUS --------------------->| Client may poll
497	        |<------------------------------------/| for status
498	        |                    |                 |
499	        |                    |        .        | Multiple COPY_STATUS
500	        |                    |        .        | operations may be sent
501	        |                    |        .        |
502	        |                    |                 |
503	        |                    |                 |
504	        |                    |                 |
505	        |<-- CB_COPY --------------------------| Destination reports
506	        |\------------------------------------>| results
507	        |                    |                 |

509	               Figure 5: An asynchronous inter-server copy.

511	2.2.3.  Server-to-Server Copy Protocol

513	   During an inter-server copy, the destination server reads the file
514	   data from the source server.  The source server and destination
515	   server are not required to use a specific protocol to transfer the
516	   file data.  The choice of what protocol to use is ultimately the
517	   destination server's decision.

519	2.2.3.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

521	   The destination server MAY use standard NFSv4.x (where x >= 1) to
522	   read the data from the source server.  If NFSv4.x is used for the
523	   server-to-server copy protocol, the destination server can use the
524	   filehandle contained in the COPY request with standard NFSv4.x
525	   operations to read data from the source server.  Specifically, the
526	   destination server may use the NFSv4.x OPEN operation's CLAIM_FH
527	   facility to open the file being copied and obtain an open stateid.
528	   Using the stateid, the destination server may then use NFSv4.x READ
529	   operations to read the file.

531	2.2.3.2.  Using an Alternative Server-to-Server Copy Protocol

533	   In a homogeneous environment, the source and destination servers
534	   might be able to perform the file copy extremely efficiently using
535	   specialized protocols.  For example, the source and destination
536	   servers might be two nodes sharing a common file system format for
537	   the source and destination file systems.  Thus the source and
538	   destination are in an ideal position to efficiently render the image
539	   of the source file to the destination file by replicating the file
540	   system formats at the block level.  Another possibility is that the
541	   source and destination might be two nodes sharing a common storage
542	   area network, and thus there is no need to copy any data at all, and
543	   instead ownership of the file and its contents might simply be re-
544	   assigned to the destination.  To allow for these possibilities, the
545	   destination server is allowed to use a server-to-server copy protocol
546	   of its choice.

548	   In a heterogeneous environment, using a protocol other than NFSv4.x
549	   (e.g., HTTP [13] or FTP [14]) presents some challenges.  In
550	   particular, the destination server is presented with the challenge of
551	   accessing the source file given only an NFSv4.x filehandle.

553	   One option for protocols that identify source files with path names
554	   is to use an ASCII hexadecimal representation of the source
555	   filehandle as the file name.

557	   Another option for the source server is to use URLs to direct the
558	   destination server to a specialized service.  For example, the
559	   response to COPY_NOTIFY could include the URL
560	   ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII
561	   hexadecimal representation of the source filehandle.  When the
562	   destination server receives the source server's URL, it would use
563	   "_FH/0x12345" as the file name to pass to the FTP server listening on
564	   port 9999 of s1.example.com.  On port 9999 there would be a special
565	   instance of the FTP service that understands how to convert NFS
566	   filehandles to an open file descriptor (in many operating systems,
567	   this would require a new system call, one which is the inverse of the
568	   makefh() function that the pre-NFSv4 MOUNT service needs).
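
   The filehandle-to-name mapping described above can be sketched as
   follows.  This is a minimal illustration, not part of the protocol:
   the helper names fh_to_hex_name, hex_name_to_fh, and copy_url are
   hypothetical, and the sketch assumes two hexadecimal digits per
   filehandle byte (so a three-byte filehandle renders as "0x012345"
   rather than the abbreviated "0x12345" used in the text).

```python
def fh_to_hex_name(fh: bytes) -> str:
    """Render an opaque NFSv4 filehandle as an ASCII hexadecimal name."""
    # Assumption: two hex digits per byte, prefixed with "0x".
    return "0x" + fh.hex()


def hex_name_to_fh(name: str) -> bytes:
    """Inverse mapping, as a specialized service on the source would need."""
    if not name.startswith("0x"):
        raise ValueError("not a hexadecimal filehandle name")
    return bytes.fromhex(name[2:])


def copy_url(scheme: str, host: str, port: int, fh: bytes) -> str:
    """Build a URL of the form used in the example, with the name under _FH/."""
    return f"{scheme}://{host}:{port}/_FH/{fh_to_hex_name(fh)}"
```

   With these helpers, copy_url("ftp", "s1.example.com", 9999, fh)
   yields a URL that the source server could return in its COPY_NOTIFY
   response, and hex_name_to_fh() recovers the filehandle bytes on the
   receiving side.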

570	   Authenticating and identifying the destination server to the source
571	   server is also a challenge.  Recommendations for how to accomplish
572	   this are given in Section 2.4.1.2.4 and Section 2.4.1.4.

574	2.3.  Operations

576	   In the sections that follow, several operations are defined that
577	   together provide the server-side copy feature.  These operations are
578	   intended to be OPTIONAL operations as defined in section 17 of [2].
579	   The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS
580	   operations are designed to be sent within an NFSv4 COMPOUND
581	   procedure.  The CB_COPY operation is designed to be sent within an
582	   NFSv4 CB_COMPOUND procedure.

584	   Each operation is performed in the context of the user identified by
585	   the ONC RPC credential of its containing COMPOUND or CB_COMPOUND
586	   request.  For example, a COPY_ABORT operation issued by a given user
587	   indicates that a specified COPY operation initiated by the same user
588	   be canceled.  Therefore a COPY_ABORT MUST NOT interfere with a copy
589	   of the same file initiated by another user.

591	   An NFS server MAY allow an administrative user to monitor or cancel
592	   copy operations using an implementation-specific interface.

594	2.3.1.  netloc4 - Network Locations

596	   The server-side copy operations specify network locations using the
597	   netloc4 data type shown below:

599	   enum netloc_type4 {
600	           NL4_NAME        = 0,
601	           NL4_URL         = 1,
602	           NL4_NETADDR     = 2
603	   };
604	   union netloc4 switch (netloc_type4 nl_type) {
605	           case NL4_NAME:          utf8str_cis nl_name;
606	           case NL4_URL:           utf8str_cis nl_url;
607	           case NL4_NETADDR:       netaddr4    nl_addr;
608	   };

610	   If the netloc4 is of type NL4_NAME, the nl_name field MUST be
611	   specified as a UTF-8 string.  The nl_name is expected to be resolved
612	   to a network address via DNS, LDAP, NIS, /etc/hosts, or some other
613	   means.  If the netloc4 is of type NL4_URL, a server URL [4]
614	   appropriate for the server-to-server copy operation is specified as a
615	   UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr
616	   field MUST contain a valid netaddr4 as defined in Section 3.3.9 of
617	   [2].

619	   When netloc4 values are used for an inter-server copy as shown in
620	   Figure 3, their values may be evaluated on the source server,
621	   destination server, and client.  The network environment in which
622	   these systems operate should be configured so that the netloc4 values
623	   are interpreted as intended on each system.
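
   To make the wire form of netloc4 concrete, the following sketch
   encodes the union using the usual XDR rules (4-byte big-endian
   discriminant, length-prefixed strings padded to a multiple of four
   bytes).  The function names are hypothetical.  The sketch assumes
   that netaddr4 consists of two strings, na_r_netid and na_r_addr, as
   defined in [2], and passes them as a (netid, addr) tuple.

```python
import struct

NL4_NAME, NL4_URL, NL4_NETADDR = 0, 1, 2


def xdr_uint(value: int) -> bytes:
    """Encode a 32-bit unsigned integer in XDR (big-endian) form."""
    return struct.pack(">I", value)


def xdr_opaque(data: bytes) -> bytes:
    """Encode variable-length data: length, bytes, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return xdr_uint(len(data)) + data + b"\x00" * pad


def encode_netloc4(nl_type: int, value) -> bytes:
    """Encode a netloc4 union; value is a str, or (netid, addr) for NL4_NETADDR."""
    if nl_type in (NL4_NAME, NL4_URL):
        return xdr_uint(nl_type) + xdr_opaque(value.encode("utf-8"))
    if nl_type == NL4_NETADDR:
        netid, addr = value
        return (xdr_uint(nl_type)
                + xdr_opaque(netid.encode("ascii"))
                + xdr_opaque(addr.encode("ascii")))
    raise ValueError("unknown netloc_type4")
```

   For example, encode_netloc4(NL4_NETADDR, ("tcp", "10.11.78.18.8.1"))
   encodes a universal address for port 2049 on the source server from
   Figure 3.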

625	2.3.2.  Copy Offload Stateids

627	   A server may perform a copy offload operation asynchronously.  An
628	   asynchronous copy is tracked using a copy offload stateid.  Copy
629	   offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS,
630	   and CB_COPY operations.

632	   Section 8.2.4 of [2] specifies that stateids are valid until either
633	   (A) the client or server restarts or (B) the client returns the
634	   resource.

636	   A copy offload stateid will be valid until either (A) the client or
637	   server restarts or (B) the client returns the resource by issuing a
638	   COPY_ABORT operation or the client replies to a CB_COPY operation.

640	   A copy offload stateid's seqid MUST NOT be 0 (zero).  In the context
641	   of a copy offload operation, it is ambiguous to indicate the most
642	   recent copy offload operation using a stateid with seqid of 0 (zero).
643	   Therefore a copy offload stateid with seqid of 0 (zero) MUST be
644	   considered invalid.
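
   A receiver could enforce the seqid rule with a check along these
   lines.  This is only a sketch: the Stateid4 class is a hypothetical
   stand-in for the stateid4 type of [2], assumed here to be a 32-bit
   seqid followed by 12 opaque bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid4:
    seqid: int    # 32-bit sequence number
    other: bytes  # assumed to be 12 opaque bytes, as in [2]


def is_valid_copy_offload_stateid(sid: Stateid4) -> bool:
    """Reject a copy offload stateid whose seqid is 0 (zero)."""
    return sid.seqid != 0 and len(sid.other) == 12
```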

646	2.4.  Security Considerations

648	   The security considerations pertaining to NFSv4 [10] apply to this
649	   document.

651	   The standard security mechanisms provided by NFSv4 [10] may be used to
652	   secure the protocol described in this document.

654	   NFSv4 clients and servers supporting the inter-server copy
655	   operations described in this document are REQUIRED to implement [5],
656	   including the RPCSEC_GSSv3 privileges copy_from_auth and
657	   copy_to_auth.  If the server-to-server copy protocol is ONC RPC
658	   based, the servers are also REQUIRED to implement the RPCSEC_GSSv3
659	   privilege copy_confirm_auth.  These requirements to implement are not
660	   requirements to use.  NFSv4 clients and servers are RECOMMENDED to
661	   use [5] to secure server-side copy operations.

663	2.4.1.  Inter-Server Copy Security

665	2.4.1.1.  Requirements for Secure Inter-Server Copy

667	   Inter-server copy is driven by several requirements:

669	   o  The specification MUST NOT mandate an inter-server copy protocol.
670	      There are many ways to copy data.  Some will be more optimal than
671	      others depending on the identities of the source server and
672	      destination server.  For example, the source and destination
673	      servers might be two nodes sharing a common file system format for
674	      the source and destination file systems.  Thus the source and
675	      destination are in an ideal position to efficiently render the
676	      image of the source file to the destination file by replicating
677	      the file system formats at the block level.  In other cases, the
678	      source and destination might be two nodes sharing a common storage
679	      area network, and thus there is no need to copy any data at all,
680	      and instead ownership of the file and its contents simply gets re-
681	      assigned to the destination.

683	   o  The specification MUST provide guidance for using NFSv4.x as a
684	      copy protocol.  For those source and destination servers willing
685	      to use NFSv4.x there are specific security considerations that
686	      this specification can and does address.

688	   o  The specification MUST NOT mandate pre-configuration between the
689	      source and destination server.  Requiring that the source and
690	      destination first have a "copying relationship" increases the
691	      administrative burden.  However the specification MUST NOT
692	      preclude implementations that require pre-configuration.

694	   o  The specification MUST NOT mandate a trust relationship between
695	      the source and destination server.  The NFSv4 security model
696	      requires mutual authentication between a principal on an NFS
697	      client and a principal on an NFS server.  This model MUST continue
698	      with the introduction of COPY.

700	2.4.1.2.  Inter-Server Copy with RPCSEC_GSSv3

702	   When the client sends a COPY_NOTIFY to the source server to expect
703	   the destination to attempt to copy data from the source server, it is
704	   expected that this copy is being done on behalf of the principal
705	   (called the "user principal") that sent the RPC request that encloses
706	   the COMPOUND procedure that contains the COPY_NOTIFY operation.  The
707	   user principal is identified by the RPC credentials.  A mechanism is
708	   needed that allows the user principal to authorize the destination
709	   server to perform the copy, in a manner that lets the source server
710	   properly authenticate the destination's copy and without allowing
711	   the destination to exceed its authorization.

713	   An approach that sends delegated credentials of the client's user
714	   principal to the destination server is not used for the following
715	   reasons.  If the client's user delegated its credentials, the
716	   destination would authenticate as the user principal.  If the
717	   destination were using the NFSv4 protocol to perform the copy, then
718	   the source server would authenticate the destination server as the
719	   user principal, and the file copy would securely proceed.  However,
720	   this approach would allow the destination server to copy other files.
721	   The user principal would have to trust the destination server to not
722	   do so.  This is counter to the requirements, and therefore is not
723	   considered.  Instead an approach using RPCSEC_GSSv3 [5] privileges is
724	   proposed.

726	   One of the stated applications of the proposed RPCSEC_GSSv3 protocol
727	   is compound client host and user authentication [+ privilege
728	   assertion].  For inter-server file copy, we require compound NFS
729	   server host and user authentication [+ privilege assertion].  The
730	   distinction between the two is one without meaning.

732	   RPCSEC_GSSv3 introduces the notion of privileges.  We define three
733	   privileges:

735	   copy_from_auth:  A user principal is authorizing a source principal
736	      ("nfs@<source>") to allow a destination principal ("nfs@
737	      <destination>") to copy a file from the source to the destination.
738	      This privilege is established on the source server before the user
739	      principal sends a COPY_NOTIFY operation to the source server.

741	   struct copy_from_auth_priv {
742	           secret4             cfap_shared_secret;
743	           netloc4             cfap_destination;
744	           /* the NFSv4 user name that the user principal maps to */
745	           utf8str_mixed       cfap_username;
746	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
747	           unsigned int        cfap_seq_num;
748	   };

750	      cfap_shared_secret is a secret value the user principal generates.

752	   copy_to_auth:  A user principal is authorizing a destination
753	      principal ("nfs@<destination>") to allow it to copy a file from
754	      the source to the destination.  This privilege is established on
755	      the destination server before the user principal sends a COPY
756	      operation to the destination server.

758	   struct copy_to_auth_priv {
759	           /* equal to cfap_shared_secret */
760	           secret4              ctap_shared_secret;
761	           netloc4              ctap_source;
762	           /* the NFSv4 user name that the user principal maps to */
763	           utf8str_mixed        ctap_username;
764	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
765	           unsigned int         ctap_seq_num;
766	   };

768	      ctap_shared_secret is a secret value the user principal generated
769	      and was used to establish the copy_from_auth privilege with the
770	      source principal.

772	   copy_confirm_auth:  A destination principal is confirming with the
773	      source principal that it is authorized to copy data from the
774	      source on behalf of the user principal.  When the inter-server
775	      copy protocol is NFSv4, or for that matter, any protocol capable
776	      of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol),
777	      this privilege is established before the file is copied from the
778	      source to the destination.

780	   struct copy_confirm_auth_priv {
781	           /* equal to GSS_GetMIC() of cfap_shared_secret */
782	           opaque              ccap_shared_secret_mic<>;
783	           /* the NFSv4 user name that the user principal maps to */
784	           utf8str_mixed       ccap_username;
785	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
786	           unsigned int        ccap_seq_num;
787	   };

789	2.4.1.2.1.  Establishing a Security Context

791	   When the user principal wants to COPY a file between two servers, if
792	   it has not established copy_from_auth and copy_to_auth privileges on
793	   the servers, it establishes them:

795	   o  The user principal generates a secret it will share with the two
796	      servers.  This shared secret will be placed in the
797	      cfap_shared_secret and ctap_shared_secret fields of the
798	      appropriate privilege data types, copy_from_auth_priv and
799	      copy_to_auth_priv.

801	   o  An instance of copy_from_auth_priv is filled in with the shared
802	      secret, the destination server, and the NFSv4 user id of the user
803	      principal.  It will be sent with an RPCSEC_GSS3_CREATE procedure,
804	      and so cfap_seq_num is set to the seq_num of the credential of the
805	      RPCSEC_GSS3_CREATE procedure.  Because cfap_shared_secret is a
806	      secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with
807	      privacy) is invoked on copy_from_auth_priv.  The
808	      RPCSEC_GSS3_CREATE procedure's arguments are:

810	      struct {
811	         rpc_gss3_gss_binding    *compound_binding;
812	         rpc_gss3_chan_binding   *chan_binding_mic;
813	         rpc_gss3_assertion      assertions<>;
814	         rpc_gss3_extension      extensions<>;
815	      } rpc_gss3_create_args;

817	      The string "copy_from_auth" is placed in assertions[0].privs.  The
818	      output of GSS_Wrap() is placed in extensions[0].data.  The field
819	      extensions[0].critical is set to TRUE.  The source server calls
820	      GSS_Unwrap() on the privilege, and verifies that the seq_num
821	      matches the credential.  It then verifies that the NFSv4 user id
822	      being asserted matches the source server's mapping of the user
823	      principal.  If it does, the privilege is established on the source
824	      server as: <"copy_from_auth", user id, destination>.  The
825	      successful reply to RPCSEC_GSS3_CREATE has:

827	      struct {
828	         opaque                  handle<>;
829	         rpc_gss3_chan_binding   *chan_binding_mic;
830	         rpc_gss3_assertion      granted_assertions<>;
831	         rpc_gss3_assertion      server_assertions<>;
832	         rpc_gss3_extension      extensions<>;
833	      } rpc_gss3_create_res;

835	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
836	      use on COPY_NOTIFY requests involving the source and destination
837	      server.  The field granted_assertions[0].privs will be equal to
838	      "copy_from_auth".  The server will return a GSS_Wrap() of
839	      copy_from_auth_priv.

841	   o  An instance of copy_to_auth_priv is filled in with the shared
842	      secret, the source server, and the NFSv4 user id.  It will be sent
843	      with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set
844	      to the seq_num of the credential of the RPCSEC_GSS3_CREATE
845	      procedure.  Because ctap_shared_secret is a secret, after XDR
846	      encoding copy_to_auth_priv, GSS_Wrap() is invoked on
847	      copy_to_auth_priv.  The RPCSEC_GSS3_CREATE procedure's arguments
848	      are:

850	      struct {
851	         rpc_gss3_gss_binding    *compound_binding;
852	         rpc_gss3_chan_binding   *chan_binding_mic;
853	         rpc_gss3_assertion      assertions<>;
854	         rpc_gss3_extension      extensions<>;
855	      } rpc_gss3_create_args;

857	      The string "copy_to_auth" is placed in assertions[0].privs.  The
858	      output of GSS_Wrap() is placed in extensions[0].data.  The field
859	      extensions[0].critical is set to TRUE.  After unwrapping,
860	      verifying the seq_num, and the user principal to NFSv4 user ID
861	      mapping, the destination establishes a privilege of
862	      <"copy_to_auth", user id, source>.  The successful reply to
863	      RPCSEC_GSS3_CREATE has:

865	      struct {
866	         opaque                  handle<>;
867	         rpc_gss3_chan_binding   *chan_binding_mic;
868	         rpc_gss3_assertion      granted_assertions<>;
869	         rpc_gss3_assertion      server_assertions<>;
870	         rpc_gss3_extension      extensions<>;
871	      } rpc_gss3_create_res;

873	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
874	      use on COPY requests involving the source and destination server.
875	      The field granted_assertions[0].privs will be equal to
876	      "copy_to_auth".  The server will return a GSS_Wrap() of
877	      copy_to_auth_priv.
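
   The client-side preparation in the steps above can be sketched as
   follows.  This is a simplified model, not an implementation: it only
   generates the shared secret and fills in the two privilege
   structures.  XDR encoding, GSS_Wrap(), and the RPCSEC_GSS3_CREATE
   exchange are omitted.  The netloc4 fields are reduced to plain
   strings, the 32-byte secret length is an assumption, and the
   function name build_copy_privs is hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyFromAuthPriv:
    cfap_shared_secret: bytes
    cfap_destination: str   # simplified stand-in for netloc4
    cfap_username: str
    cfap_seq_num: int


@dataclass(frozen=True)
class CopyToAuthPriv:
    ctap_shared_secret: bytes
    ctap_source: str        # simplified stand-in for netloc4
    ctap_username: str
    ctap_seq_num: int


def build_copy_privs(source, destination, username,
                     from_seq_num, to_seq_num, secret_len=32):
    """Generate one secret and place it in both privilege structures."""
    secret = secrets.token_bytes(secret_len)  # assumed length, not from the draft
    from_priv = CopyFromAuthPriv(secret, destination, username, from_seq_num)
    to_priv = CopyToAuthPriv(secret, source, username, to_seq_num)
    return from_priv, to_priv
```

   Each seq_num argument is expected to match the seq_num of the
   credential used for the corresponding RPCSEC_GSS3_CREATE procedure.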

879	2.4.1.2.2.  Starting a Secure Inter-Server Copy

881	   When the client sends a COPY_NOTIFY request to the source server, it
882	   uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle.
883	   cna_destination_server in COPY_NOTIFY MUST be the same as the name of
884	   the destination server specified in copy_from_auth_priv.  Otherwise,
885	   COPY_NOTIFY will fail with NFS4ERR_ACCESS.  The source server
886	   verifies that the privilege <"copy_from_auth", user id, destination>
887	   exists, and annotates it with the source filehandle, if the user
888	   principal has read access to the source file, and if administrative
889	   policies give the user principal and the NFS client read access to
890	   the source file (i.e., if the ACCESS operation would grant read
891	   access).  Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS.

893	   When the client sends a COPY request to the destination server, it
894	   uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle.
895	   ca_source_server in COPY MUST be the same as the name of the source
896	   server specified in copy_to_auth_priv.  Otherwise, COPY will fail
897	   with NFS4ERR_ACCESS.  The destination server verifies that the
898	   privilege <"copy_to_auth", user id, source> exists, and annotates it
899	   with the source and destination filehandles.  If the client has
900	   failed to establish the "copy_to_auth" privilege, the destination
901	   server will reject the request with NFS4ERR_PARTNER_NO_AUTH.

903	   If the client sends a COPY_REVOKE to the source server to rescind the
904	   destination server's copy privilege, it uses the privileged
905	   "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server
906	   in COPY_REVOKE MUST be the same as the name of the destination server
907	   specified in copy_from_auth_priv.  The source server will then delete
908	   the <"copy_from_auth", user id, destination> privilege and fail any
909	   subsequent copy requests sent under the auspices of this privilege
910	   from the destination server.
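
   The source server's bookkeeping for the "copy_from_auth" privilege
   might look like the sketch below.  The class and method names are
   hypothetical, results are returned as strings for readability, and
   the read-access decision is passed in as a boolean rather than
   computed.

```python
class SourcePrivileges:
    """Tracks <"copy_from_auth", user id, destination> privileges."""

    def __init__(self):
        # (user, destination) -> annotated source filehandle, or None
        self._privs = {}

    def establish(self, user, destination):
        """Record a privilege after a successful RPCSEC_GSS3_CREATE."""
        self._privs[(user, destination)] = None

    def copy_notify(self, user, destination, source_fh, has_read_access):
        """Annotate the privilege with the source filehandle, or deny."""
        key = (user, destination)
        if key not in self._privs or not has_read_access:
            return "NFS4ERR_ACCESS"
        self._privs[key] = source_fh
        return "NFS4_OK"

    def copy_revoke(self, user, destination):
        """Delete the privilege; later copy requests under it will fail."""
        self._privs.pop((user, destination), None)

    def may_copy(self, user, destination, source_fh):
        """True only if the privilege exists and names this filehandle."""
        if source_fh is None:
            return False
        return self._privs.get((user, destination)) == source_fh
```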

912	2.4.1.2.3.  Securing ONC RPC Server-to-Server Copy Protocols

914	   After a destination server has a "copy_to_auth" privilege established
915	   on it, and it receives a COPY request, if it knows it will use an ONC
916	   RPC protocol to copy data, it will establish a "copy_confirm_auth"
917	   privilege on the source server, using nfs@<destination> as the
918	   initiator principal, and nfs@<source> as the target principal.

920	   The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of
921	   the shared secret passed in the copy_to_auth privilege.  The field
922	   ccap_username is the mapping of the user principal to an NFSv4 user
923	   name ("user"@"domain" form), and MUST be the same as ctap_username
924	   and cfap_username.  The field ccap_seq_num is the seq_num of the
925	   RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the
926	   destination will send to the source server to establish the
927	   privilege.

929	   The source server verifies the privilege, and establishes a
930	   <"copy_confirm_auth", user id, destination> privilege.  If the source
931	   server fails to verify the privilege, the COPY operation will be
932	   rejected with NFS4ERR_PARTNER_NO_AUTH.  All subsequent ONC RPC
933	   requests sent from the destination to copy data from the source to
934	   the destination will use the RPCSEC_GSSv3 handle returned by the
935	   source's RPCSEC_GSS3_CREATE response.

937	   Note that the use of the "copy_confirm_auth" privilege accomplishes
938	   the following:

940	   o  if a protocol like NFS is being used, with export policies, export
941	      policies can be overridden in case the destination server as-an-
942	      NFS-client is not authorized

944	   o  manual configuration to allow a copy relationship between the
945	      source and destination is not needed.

947	   If the attempt to establish a "copy_confirm_auth" privilege fails,
948	   then when the user principal sends a COPY request to destination, the
949	   destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

951	2.4.1.2.4.  Securing Non ONC RPC Server-to-Server Copy Protocols

953	   If the destination won't be using ONC RPC to copy the data, then the
954	   source and destination are using an unspecified copy protocol.  The
955	   destination could use the shared secret and the NFSv4 user id to
956	   prove to the source server that the user principal has authorized the
957	   copy.

959	   For protocols that authenticate user names with passwords (e.g., HTTP
960	   [13] and FTP [14]), the NFSv4 user id could be used as the user name,
961	   and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
962	   secret could be used as the user password or as input into non-
963	   password authentication methods like CHAP [15].

965	2.4.1.3.  Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

967	   ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
968	   server-side copy offload operations described in this document.  In
969	   particular, host-based ONC RPC security flavors such as AUTH_NONE and
970	   AUTH_SYS MAY be used.  If a host-based security flavor is used, a
971	   minimal level of protection for the server-to-server copy protocol is
972	   possible.

974	   In the absence of strong security mechanisms such as RPCSEC_GSSv3,
975	   the challenge is how the source server and destination server
976	   identify themselves to each other, especially in the presence of
977	   multi-homed source and destination servers.  In a multi-homed
978	   environment, the destination server might not contact the source
979	   server from the same network address specified by the client in the
980	   COPY_NOTIFY.  This can be overcome using the procedure described
981	   below.

983	   When the client sends the source server the COPY_NOTIFY operation,
984	   the source server may reply to the client with a list of target
985	   addresses, names, and/or URLs and assign them to the unique
986	   quadruple: <random number, source fh, user ID, destination address
987	   Y>.  If the destination uses one of these target netlocs to contact
988	   the source server, the source server will be able to uniquely
989	   identify the destination server, even if the destination server does
990	   not connect from the address specified by the client in COPY_NOTIFY.
991	   The level of assurance in this identification depends on the
992	   unpredictability, strength and secrecy of the random number.

994	   For example, suppose the network topology is as shown in Figure 3.
995	   If the source filehandle is 0x12345, the source server may respond to
996	   a COPY_NOTIFY for destination 10.11.78.56 with the URLs:

998	      nfs://10.11.78.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/_FH/
999	      0x12345

1001	      nfs://192.168.33.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/
1002	      _FH/0x12345

1004	   The name component after _COPY is 24 characters of base 64, more than
1005	   enough to encode a 128 bit random number.
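
   One way a source server might generate such a name component and the
   corresponding URLs is sketched below.  The function names are
   hypothetical.  The sketch draws 18 random bytes (144 bits), which
   encode to exactly 24 base 64 characters with no padding, and it
   assumes the URL-safe base 64 alphabet so that the token never
   contains a "/" that would split the path component.

```python
import base64
import secrets


def new_copy_token() -> str:
    """Return 24 base 64 characters carrying 144 random bits."""
    return base64.urlsafe_b64encode(secrets.token_bytes(18)).decode("ascii")


def copy_urls(source_addrs, destination, fh_name, token):
    """Build one nfs:// URL per source address, in the layout shown above."""
    return [
        f"nfs://{addr}//_COPY/{token}/{destination}/_FH/{fh_name}"
        for addr in source_addrs
    ]
```

   Calling copy_urls(["10.11.78.18", "192.168.33.18"], "10.11.78.56",
   "0x12345", token) produces a pair of URLs with the same layout as
   the example.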

1007	   The client will then send these URLs to the destination server in the
1008	   COPY operation.  Suppose that the 192.168.33.0/24 network is a high
1009	   speed network and the destination server decides to transfer the file
1010	   over this network.  If the destination contacts the source server
1011	   from 192.168.33.56 over this network using NFSv4.1, it does the
1012	   following:

1014	   COMPOUND  { PUTROOTFH ; LOOKUP "_COPY" ; LOOKUP
1015	      "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "10.11.78.56" ; LOOKUP "_FH" ;
1016	      OPEN "0x12345" ; GETFH }
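
   The translation from one of these URLs to the operation sequence
   above can be sketched as follows.  The function name url_to_compound
   is hypothetical, and operations are modeled as simple tuples rather
   than encoded NFSv4.1 arguments.

```python
from urllib.parse import urlsplit


def url_to_compound(url: str):
    """Map each path component to a LOOKUP, and the last one to an OPEN."""
    # Empty components (from the leading "//") are dropped.
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if not parts:
        raise ValueError("URL has no path components")
    ops = [("PUTROOTFH",)]
    ops += [("LOOKUP", name) for name in parts[:-1]]
    ops += [("OPEN", parts[-1]), ("GETFH",)]
    return ops
```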

1018	   Provided that the random number is unpredictable and has been kept
1019	   secret by the parties involved, the source server will therefore know
1020	   that these NFSv4.x operations are being issued by the destination
1021	   server identified in the COPY_NOTIFY.  This random number technique
1022	   only provides initial authentication of the destination server, and
1023	   cannot defend against man-in-the-middle attacks after authentication
1024	   or an eavesdropper that observes the random number on the wire.
1025	   Other secure communication techniques (e.g., IPsec) are necessary to
1026	   block these attacks.

1028	2.4.1.4.  Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

1030	   The same techniques as Section 2.4.1.3, using unique URLs for each
1031	   destination server, can be used for other protocols (e.g., HTTP [13]
1032	   and FTP [14]) as well.

1034	3.  Sparse Files

1036	3.1.  Introduction

1038	   A sparse file is a common way of representing a large file without
1039	   having to utilize all of the disk space for it.  Consequently, a
1040	   sparse file uses less physical space than its size indicates.  This
1041	   means the file contains 'holes', byte ranges within the file that
1042	   contain no data.  Most modern file systems support sparse files,
1043	   including most UNIX file systems and NTFS, but notably not Apple's
1044	   HFS+.  Common examples of sparse files include Virtual Machine (VM)
1045	   OS/disk images, database files, log files, and even checkpoint
1046	   recovery files most commonly used by the HPC community.

1048	   If an application reads a hole in a sparse file, the file system must
1049	   return all zeros to the application.  For local data access there is
1050	   little penalty, but with NFS these zeroes must be transferred back to
1051	   the client.  If an application uses the NFS client to read data into
1052	   memory, this wastes time and bandwidth as the application waits for
1053	   the zeroes to be transferred.

1055	   A sparse file is typically created by initializing the file to be all
1056	   zeros: nothing is written to the data in the file; instead, the hole
1057	   is recorded in the metadata for the file.  So an 8G disk image might
1058	   be represented initially by a couple hundred bits in the inode and
1059	   nothing on the disk.  If the VM then writes 100M to a file in the
1060	   middle of the image, there would now be two holes represented in the
1061	   metadata and 100M in the data.
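
   The disk image example can be worked through with a small sketch
   that derives the holes from a file's size and its allocated data
   extents.  The function name and the (offset, length) tuple
   representation are assumptions made for illustration; real file
   systems record this information in their own metadata formats.

```python
def holes(file_size, extents):
    """Return (offset, length) holes given allocated (offset, length) extents."""
    result = []
    pos = 0  # end of the data seen so far
    for start, length in sorted(extents):
        if start > pos:
            result.append((pos, start - pos))
        pos = max(pos, start + length)
    if pos < file_size:
        result.append((pos, file_size - pos))
    return result


G = 1 << 30
M = 1 << 20

# An 8G image with 100M written in the middle has a hole on each side.
image_holes = holes(8 * G, [(4 * G, 100 * M)])
```

   Before the write, the same function reports a single hole covering
   the entire file.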

1063	   This section introduces a new operation READ_PLUS (Section 13.10)
1064	   which supports all the features of READ but includes an extension to
1065	   support sparse pattern files.  READ_PLUS is guaranteed to perform no
1066	   worse than READ, and can dramatically improve performance with sparse
1067	   files.  READ_PLUS does not depend on pNFS protocol features, but can
1068	   be used by pNFS to support sparse files.

1070	3.2.  Terminology

1072	   Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1074	   Sparse file:  A Regular file that contains one or more Holes.

1076	   Hole:  A byte range within a Sparse file that contains regions of all
1077	      zeroes.  For block-based file systems, this could also be an
1078	      unallocated region of the file.

1080	   Hole Threshold:  The minimum length of a Hole as determined by the
1081	      server.  If a server chooses to define a Hole Threshold, then it
1082	      would not return hole information about holes with a length
1083	      shorter than the Hole Threshold.
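
   As a rough illustration of the Hole Threshold, a server that tracks
   holes as (offset, length) pairs might filter them before reporting.
   The function name and the representation are hypothetical.

```python
def reportable_holes(holes, hole_threshold=0):
    """Keep only the holes that are at least hole_threshold bytes long."""
    return [(offset, length) for offset, length in holes
            if length >= hole_threshold]


holes = [(0, 512), (8192, 65536), (131072, 4096)]
# With a 4k threshold the 512-byte hole is not reported as a hole.
reported = reportable_holes(holes, hole_threshold=4096)
```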

1085	4.  Space Reservation

1087	4.1.  Introduction

1089	   This section describes a set of operations that allow applications
1090	   such as hypervisors to reserve space for a file, report the amount of
1091	   actual disk space a file occupies, and free up the backing space of a
1092	   file when it is not required.  In virtualized environments, virtual
1093	   disk files are often stored on NFS mounted volumes.  Since virtual
1094	   disk files represent the hard disks of virtual machines, hypervisors
1095	   often have to guarantee certain properties for the file.

1097	   One such example is space reservation.  When a hypervisor creates a
1098	   virtual disk file, it often tries to preallocate the space for the
1099	   file so that there are no future allocation related errors during the
1100	   operation of the virtual machine.  Such errors prevent a virtual
1101	   machine from continuing execution and result in downtime.

1103	   Currently, in order to achieve such a guarantee, applications zero
1104	   the entire file.  The initial zeroing allocates the backing blocks
1105	   and all subsequent writes are overwrites of already allocated blocks.
1106	   This approach is not only inefficient in terms of the amount of I/O
1107	   done, it is also not guaranteed to work on filesystems that are log
1108	   structured or deduplicated.  An efficient way of guaranteeing space
1109	   reservation would be beneficial to such applications.

1111	   If the space_reserved attribute is set on a file, it is guaranteed
1112	   that writes that do not grow the file will not fail with
1113	   NFS4ERR_NOSPC.

1115	   Another useful feature would be the ability to report the number of
1116	   blocks that would be freed when a file is deleted.  Currently, NFS
1117	   reports two size attributes:

1119	   size  The logical file size of the file.

1121	   space_used  The size in bytes that the file occupies on disk

1123	   While these attributes are sufficient for space accounting in
1124	   traditional filesystems, they prove to be inadequate in modern
1125	   filesystems that support block sharing.  In such filesystems,
1126	   multiple inodes can point to a single block with a block reference
1127	   count to guard against premature freeing.  Having a way to tell the
1128	   number of blocks that would be freed if the file was deleted would be
1129	   useful to applications that wish to migrate files when a volume is
1130	   low on space.

1132	   Since virtual disks represent a hard drive in a virtual machine, a
1133	   virtual disk can be viewed as a filesystem within a file.  Since not
1134	   all blocks within a filesystem are in use, there is an opportunity to
1135	   reclaim blocks that are no longer in use.  A call to deallocate
1136	   blocks could result in better space efficiency.  Less space might be
1137	   consumed for backups after block deallocation.

1139	   The following operations and attributes can be used to resolve these
1140	   issues:

1142	   space_reserved  This attribute specifies whether the blocks backing
1143	      the file have been preallocated.

1145	   space_freed  This attribute specifies the space freed when a file is
1146	      deleted, taking block sharing into consideration.

1148	   INITIALIZE  This operation zeroes and/or deallocates the blocks
1149	      backing a region of the file.

1151	   If space_used of a file is interpreted to mean the size in bytes of
1152	   all disk blocks pointed to by the inode of the file, then shared
1153	   blocks get double counted, over-reporting the space utilization.
1154	   This also has the adverse effect that the deletion of a file with
1155	   shared blocks frees up less than space_used bytes.

1157	   On the other hand, if space_used is interpreted to mean the size in
1158	   bytes of those disk blocks unique to the inode of the file, then
1159	   shared blocks are not counted in any file, resulting in under-
1160	   reporting of the space utilization.

1162	   For example, two files A and B have 10 blocks each.  Let 6 of these
1163	   blocks be shared between them.  Thus, the combined space utilized by
1164	   the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1165	   combined space utilization of the two files would be reported as 20 *
1166	   BLOCK_SIZE.  However, deleting either would only result in 4 *
1167	   BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1168	   report that the space utilization is only 8 * BLOCK_SIZE.

1170	   Adding another size attribute, space_freed, is helpful in solving
1171	   this problem. space_freed is the number of blocks that are allocated
1172	   to the given file that would be freed on its deletion.  In the
1173	   example, both A and B would report space_freed as 4 * BLOCK_SIZE and
1174	   space_used as 10 * BLOCK_SIZE.  If A is deleted, B will report
1175	   space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
1176	   the deallocation of all 10 blocks.
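
   The arithmetic of this example can be checked with a small model.
   This is only a sketch: files are modelled as sets of block
   identifiers, and BLOCK_SIZE and the helper names are assumptions.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def space_used(files, name):
    """Bytes in all blocks referenced by the file, shared or not."""
    return len(files[name]) * BLOCK_SIZE


def space_freed(files, name):
    """Bytes in blocks referenced by this file and by no other file."""
    refs = Counter(block for blocks in files.values() for block in blocks)
    return sum(1 for block in files[name] if refs[block] == 1) * BLOCK_SIZE


shared = set(range(6))                      # 6 blocks shared by A and B
files = {
    "A": shared | set(range(100, 104)),     # plus 4 blocks unique to A
    "B": shared | set(range(200, 204)),     # plus 4 blocks unique to B
}
unique_blocks = set().union(*files.values())
```

   Here space_used sums to 20 blocks across A and B although only 14
   distinct blocks exist, while space_freed for B alone rises from 4
   to 10 blocks once A is no longer present.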

1178	   The addition of this attribute doesn't solve the problem of space
1179	   being over-reported.  However, over-reporting is better than under-
1180	   reporting.

1182	5.  Support for Application IO Hints

1184	5.1.  Introduction

1186	   Applications currently have several options for communicating I/O
1187	   access patterns to the NFS client.  While this can help the NFS
1188	   client optimize I/O and caching for a file, it does not allow the NFS
1189	   server and its exported file system to do likewise.  Therefore, here
1190	   we put forth a proposal for the NFSv4.2 protocol to allow
1191	   applications to communicate their expected behavior to the server.

1193	   By communicating expected access pattern, e.g., sequential or random,
1194	   and data re-use behavior, e.g., data range will be read multiple
1195	   times and should be cached, the server will be able to better
1196	   understand what optimizations it should implement for access to a
1197	   file.  For example, if an application indicates it will never read the
1198	   data more than once, then the file system can avoid polluting the
1199	   data cache and not cache the data.

1201	   The first way applications can issue client I/O hints is the
1202	   posix_fadvise function.  For example, on Linux, when an application
1203	   uses posix_fadvise to specify a file will be read sequentially, Linux
1204	   doubles the readahead buffer size.

1206	   Another instance where applications provide an indication of their
1207	   desired I/O behavior is the use of direct I/O. By specifying direct
1208	   I/O, clients will no longer cache data, but this information is not
1209	   passed to the server, which will continue caching data.

1211	   Application specific NFS clients such as those used by hypervisors
1212	   and databases can also leverage application hints to communicate
1213	   their specialized requirements.

1215	   This section adds a new IO_ADVISE operation to communicate the client
1216	   file access patterns to the NFS server.  The NFS server upon
1217	   receiving an IO_ADVISE operation MAY choose to alter its I/O and
1218	   caching behavior, but is under no obligation to do so.

1220	5.2.  POSIX Requirements

1222	   The first key requirement of the IO_ADVISE operation is to support
1223	   the posix_fadvise function [6], which is supported in Linux and many
1224	   other operating systems.  Examples and guidance on how to use
1225	   posix_fadvise to improve performance can be found here [16].
1226	   posix_fadvise is defined as follows,

1228	      int posix_fadvise(int fd, off_t offset, off_t len, int advice);

1230	   The posix_fadvise() function shall advise the implementation on the
1231	   expected behavior of the application with respect to the data in the
1232	   file associated with the open file descriptor, fd, starting at offset
1233	   and continuing for len bytes.  The specified range need not currently
1234	   exist in the file.  If len is zero, all data following offset is
1235	   specified.  The implementation may use this information to optimize
1236	   handling of the specified data.  The posix_fadvise() function shall
1237	   have no effect on the semantics of other operations on the specified
1238	   data, although it may affect the performance of other operations.

1240	   The advice to be applied to the data is specified by the advice
1241	   parameter and may be one of the following values:

1243	   POSIX_FADV_NORMAL -  Specifies that the application has no advice to
1244	      give on its behavior with respect to the specified data.  It is
1245	      the default characteristic if no advice is given for an open file.

1247	   POSIX_FADV_SEQUENTIAL -  Specifies that the application expects to
1248	      access the specified data sequentially from lower offsets to
1249	      higher offsets.

1251	   POSIX_FADV_RANDOM -  Specifies that the application expects to access
1252	      the specified data in a random order.

1254	   POSIX_FADV_WILLNEED -  Specifies that the application expects to
1255	      access the specified data in the near future.

1257	   POSIX_FADV_DONTNEED -  Specifies that the application expects that it
1258	      will not access the specified data in the near future.

1260	   POSIX_FADV_NOREUSE -  Specifies that the application expects to
1261	      access the specified data once and then not reuse it thereafter.

1263	   Upon successful completion, posix_fadvise() shall return zero;
1264	   otherwise, an error number shall be returned to indicate the error.
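
   For illustration, an application might issue the sequential hint
   before streaming a file.  The sketch below uses Python's
   os.posix_fadvise wrapper, which is only present on platforms that
   provide the underlying call; the helper name read_sequentially and
   the temporary file are assumptions of the example.

```python
import os
import tempfile


def read_sequentially(path, chunk_size=64 * 1024):
    """Read a whole file after advising the OS of sequential access."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset 0 with len 0 covers all data following offset 0.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # the advice is optional, so carry on without it
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 200_000)
bytes_read = read_sequentially(tmp.name)
os.unlink(tmp.name)
```

   The hint never changes what is read, only how the implementation
   may schedule the I/O, which is why a failure to apply it is ignored.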

1266	5.3.  Additional Requirements

1268	   Many use cases exist for sending application I/O hints to the server
1269	   that cannot utilize the POSIX supported interface.  This is because
1270	   some applications may benefit from additional hints not specified by
1271	   posix_fadvise, and some applications may not use POSIX altogether.

1273	   One use case is "Opportunistic Prefetch", which allows a stateid
1274	   holder to tell the server that it is possible that it will access the
1275	   specified data in the near future.  This is similar to
1276	   POSIX_FADV_WILLNEED, but the client is unsure it will in fact read
1277	   the specified data, so the server should only prefetch the data if it
1278	   can be done at a marginal cost.  For example, when a server receives
1279	   this hint, it could prefetch only the indirect blocks for a file
1280	   instead of all the data.  This would still improve performance if the
1281	   client does read the data, but with less pressure on server memory.

1283	   An example use case for this hint is a database that reads in a
1284	   single record that points to additional records in either other areas
1285	   of the same file or different files located on the same or different
1286	   server.  While it is likely that the application may access the
1287	   additional records, it is far from guaranteed.  Therefore, the
1288	   database may issue an opportunistic prefetch (instead of
1289	   POSIX_FADV_WILLNEED) for the data in the other files pointed to by
1290	   the record.

1292	   Another use case is "Direct I/O", which allows a stateid holder to
1293	   inform the server that it does not wish to cache data.  Today, for
1294	   applications that only intend to read data once, the use of direct
1295	   I/O disables client caching, but does not affect server caching.  By
1296	   caching data that will not be re-read, the server is polluting its
1297	   cache and possibly causing useful cached data to be evicted.  By
1298	   informing the server of its expected I/O access, this situation can
1299	   be avoided.  Direct I/O can be used in Linux and AIX via the open()
1300	   O_DIRECT parameter, in Solaris via the directio() function, and in
1301	   Windows via the CreateFile() FILE_FLAG_NO_BUFFERING flag.

1303	   Another use case is "Backward Sequential Read", which allows a
1304	   stateid holder to inform the server that it intends to read the
1305	   specified data backwards, i.e., from the end to the beginning.  This
1306	   is different from POSIX_FADV_SEQUENTIAL, whose implied intention was
1307	   that data will be read from beginning to end.  This hint allows
1308	   servers to prefetch data at the end of the range first, and then
1309	   prefetch data sequentially in a backwards manner to the start of the
1310	   data range.  One example of an application that can make use of this
1311	   hint is video editing.
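
   One way a server could order its prefetches for a backward
   sequential hint is sketched below.  The function name and the
   fixed chunk size are hypothetical, and a real server is free to
   choose any strategy.

```python
def backward_prefetch_order(offset, length, chunk_size):
    """Yield (offset, length) chunks from the end of the range to its start."""
    end = offset + length
    while end > offset:
        start = max(offset, end - chunk_size)
        yield (start, end - start)
        end = start


# A 10-byte range prefetched in chunks of at most 4 bytes.
order = list(backward_prefetch_order(0, 10, 4))
```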

1313	5.4.  Security Considerations

1315	   None.

1317	5.5.  IANA Considerations

1319	   The IO_ADVISE_type4 will be extended through an IANA registry.

1321	6.  Application Data Block Support

1323	   At the OS level, files are contained on disk blocks.  Applications
1324	   are also free to impose structure on the data contained in a file and
1325	   we can define an Application Data Block (ADB) to be such a structure.
1326	   From the application's viewpoint, it only wants to handle ADBs and
1327	   not raw bytes (see [17]).  An ADB is typically comprised of two
1328	   sections: a header and data.  The header describes the
1329	   characteristics of the block and can provide a means to detect
1330	   corruption in the data payload.  The data section is typically
1331	   initialized to all zeros.

1333	   The format of the header is application specific, but there are two
1334	   main components typically encountered:

1336	   1.  An ADB Number (ADBN), which allows the application to determine
1337	       which data block is being referenced.  The ADBN is a logical
1338	       block number and is useful when the client is not storing the
1339	       blocks in contiguous memory.

1341	   2.  Fields to describe the state of the ADB and a means to detect
1342	       block corruption.  For both pieces of data, a useful property is
1343	       that allowed values be unique in that if passed across the
1344	       network, corruption due to translation between big and little
1345	       endian architectures is detectable.  For example, 0xF0DEDEF0 has
1346	       the same bit pattern in both architectures.
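
   The byte-order property mentioned in the second item can be tested
   mechanically.  The helper below is a sketch with a hypothetical
   name; it compares the big-endian and little-endian encodings of a
   value of the given width.

```python
def is_endian_neutral(value, width=4):
    """True if the value has identical big- and little-endian encodings."""
    return value.to_bytes(width, "big") == value.to_bytes(width, "little")


neutral = is_endian_neutral(0xF0DEDEF0)       # bytes F0 DE DE F0
not_neutral = is_endian_neutral(0xFEEDFACE)   # bytes FE ED FA CE
```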

1348	   Applications already impose structures on files [17] and detect
1349	   corruption in data blocks [18].  What they are not able to do is
1350	   efficiently transfer and store ADBs.  To initialize a file with ADBs,
1351	   the client must send the full ADB to the server and that must be
1352	   stored on the server.  When the application is initializing a file to
1353	   have the ADB structure, it could compress the ADBs to just the
1354	   information necessary to later reconstruct the header portion of
1355	   the ADB when the contents are read back.  Using sparse file
1356	   techniques, the disk blocks described by the ADBs would not be allocated.
1357	   Unlike sparse file techniques, there would be a small cost to store
1358	   the compressed header data.

1360	   In this section, we are going to define a generic framework for an
1361	   ADB, present one approach to detecting corruption in a given ADB
1362	   implementation, and describe the model for how the client and server
1363	   can support efficient initialization of ADBs, reading of ADB holes,
1364	   punching holes in ADBs, and space reservation.  Further, we need to
1365	   be able to extend this model to applications which do not support
1366	   ADBs, but wish to be able to handle sparse files, hole punching, and
1367	   space reservation.

1369	6.1.  Generic Framework

1371	   We want the representation of the ADB to be flexible enough to
1372	   support many different applications.  The most basic approach is no
1373	   imposition of a block at all, which means we are working with the raw
1374	   bytes.  Such an approach would be useful for storing holes, punching
1375	   holes, etc.  In more complex deployments, a server might be
1376	   supporting multiple applications, each with their own definition of
1377	   the ADB.  One might store the ADBN at the start of the block and then
1378	   have a guard pattern to detect corruption [19].  The next might store
1379	   the ADBN at an offset of 100 bytes within the block and have no guard
1380	   pattern at all.  The point is that existing applications might
1381	   already have well defined formats for their data blocks.

1383	   The guard pattern can be used to represent the state of the block, to
1384	   protect against corruption, or both.  Again, it needs to be able to
1385	   be placed anywhere within the ADB.

1387	   We need to be able to represent the starting offset of the block and
1388	   the size of the block.  Note that nothing prevents the application
1389	   from defining different sized blocks in a file.

1391	6.1.1.  Data Block Representation

1393	   struct app_data_block4 {
1394	           offset4         adb_offset;
1395	           length4         adb_block_size;
1396	           length4         adb_block_count;
1397	           length4         adb_reloff_blocknum;
1398	           count4          adb_block_num;
1399	           length4         adb_reloff_pattern;
1400	           opaque          adb_pattern<>;
1401	   };

1403	   The app_data_block4 structure captures the abstraction presented for
1404	   the ADB.  The additional fields present are to allow the transmission
1405	   of adb_block_count ADBs at one time.  We also use adb_block_num to
1406	   convey the ADBN of the first block in the sequence.  Each ADB will
1407	   contain the same adb_pattern string.

1409	   As both adb_block_num and adb_pattern are optional, if either
1410	   adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
1411	   then the corresponding field is not set in any of the ADBs.
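
   To make the compression concrete, the sketch below expands an
   app_data_block4 description into the bytes it stands for.  Several
   points are assumptions rather than statements of this document: the
   block number is written as an 8-byte big-endian value, it increases
   by one for each successive block, and the names AppDataBlock4 and
   expand are invented for the example.

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
BLOCKNUM_WIDTH = 8  # assumed on-disk width of the block number


@dataclass
class AppDataBlock4:
    adb_offset: int
    adb_block_size: int
    adb_block_count: int
    adb_reloff_blocknum: int
    adb_block_num: int
    adb_reloff_pattern: int
    adb_pattern: bytes


def expand(adb):
    """Return the bytes that adb_block_count initialized ADBs would hold."""
    out = bytearray()
    for i in range(adb.adb_block_count):
        block = bytearray(adb.adb_block_size)  # data section is all zeros
        if adb.adb_reloff_blocknum != NFS4_UINT64_MAX:
            start = adb.adb_reloff_blocknum
            number = (adb.adb_block_num + i).to_bytes(BLOCKNUM_WIDTH, "big")
            block[start:start + BLOCKNUM_WIDTH] = number
        if adb.adb_reloff_pattern != NFS4_UINT64_MAX:
            start = adb.adb_reloff_pattern
            block[start:start + len(adb.adb_pattern)] = adb.adb_pattern
        out += block
    return bytes(out)


# 100 blocks of 4k, block number at offset 0, guard pattern at offset 8.
free_blocks = expand(AppDataBlock4(0, 4096, 100, 0, 0, 8,
                                   bytes.fromhex("feedface")))
# Neither field set: the description stands for plain zeros.
plain_zeros = expand(AppDataBlock4(0, 65536, 1, NFS4_UINT64_MAX, 0,
                                   NFS4_UINT64_MAX, b"\x00"))
```

   The description of 100 blocks is a few dozen bytes on the wire,
   while the expanded form is 400k.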

1413	6.1.2.  Data Content

1415	   /*
1416	    * Use an enum such that we can extend new types.
1417	    */
1418	   enum data_content4 {
1419	           NFS4_CONTENT_DATA = 0,
1420	           NFS4_CONTENT_APP_BLOCK = 1,
1421	           NFS4_CONTENT_HOLE = 2
1422	   };

1424	   New operations might need to differentiate between wanting to access
1425	   data versus an ADB.  Also, future minor versions might want to
1426	   introduce new data formats.  This enumeration allows that to occur.

1428	6.2.  pNFS Considerations

1430	   While this document does not mandate how sparse ADBs are recorded on
1431	   the server, it does make the assumption that such information is not
1432	   in the file.  I.e., the information is metadata.  As such, the
1433	   INITIALIZE operation is defined to be not supported by the DS - it
1434	   must be issued to the MDS.  But since the client must not assume a
1435	   priori whether a read is sparse or not, the READ_PLUS operation MUST
1436	   be supported by both the DS and the MDS.  I.e., the client might
1437	   impose on the MDS to asynchronously read the data from the DS.

1439	   Furthermore, each DS MUST NOT report to a client either a sparse ADB
1440	   or data which belongs to another DS.  One implication of this
1441	   requirement is that the app_data_block4's adb_block_size MUST
1442	   either be the stripe width or the stripe width must be an even
1443	   multiple of it.
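
   Reading "even multiple" as "exact multiple", the block size
   constraint reduces to a divisibility test.  The function name is
   hypothetical and the interpretation is an assumption.

```python
def block_size_fits_stripe(adb_block_size, stripe_width):
    """True if no ADB of this size can straddle two stripe units."""
    return adb_block_size > 0 and stripe_width % adb_block_size == 0


fits_equal = block_size_fits_stripe(65536, 65536)   # equal to the width
fits_divisor = block_size_fits_stripe(4096, 65536)  # width is 16 blocks
fits_odd = block_size_fits_stripe(6144, 65536)      # blocks would straddle
```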

1445	   The second implication here is that the DS must be able to use the
1446	   Control Protocol to determine from the MDS where the sparse ADBs
1447	   occur.  [[Comment.3: Need to discuss what happens if after the file
1448	   is being written to and an INITIALIZE occurs? --TH]] Perhaps instead
1449	   of the DS pulling from the MDS, the MDS pushes to the DS?  Thus an
1450	   INITIALIZE causes a new push?  [[Comment.4: Still need to consider
1451	   race cases of the DS getting a WRITE and the MDS getting an
1452	   INITIALIZE. --TH]]

1454	6.3.  An Example of Detecting Corruption

1456	   In this section, we define an ADB format in which corruption can be
1457	   detected.  Note that this is just one possible format and means to
1458	   detect corruption.

1460	   Consider a very basic implementation of an operating system's disk
1461	   blocks.  A block is either data or it is an indirect block which
1462	   allows for files to be larger than one block.  It is desired to be
1463	   able to initialize a block.  Lastly, to quickly unlink a file, a
1464	   block can be marked invalid.  The contents remain intact - which
1465	   would enable this OS application to undelete a file.

1467	   The application defines 4k sized data blocks, with an 8 byte block
1468	   counter occurring at offset 0 in the block, and with the guard
1469	   pattern occurring at offset 8 inside the block.  Furthermore, the
1470	   guard pattern can take one of four states:

1472	   0xfeedface -   This is the FREE state and indicates that the ADB
1473	      format has been applied.

1475	   0xcafedead -   This is the DATA state and indicates that real data
1476	      has been written to this block.

1478	   0xe4e5c001 -   This is the INDIRECT state and indicates that the
1479	      block contains block counter numbers that are chained off of this
1480	      block.

1482	   0xba1ed4a3 -   This is the INVALID state and indicates that the block
1483	      contains data whose contents are garbage.

1485	   Finally, it also defines an 8 byte checksum [20] starting at byte 16
1486	   which applies to the remaining contents of the block.  If the state
1487	   is FREE, then that checksum is trivially zero.  As such, the
1488	   application has no need to transfer the checksum implicitly inside
1489	   the ADB - it need not make the transfer layer aware of the fact that
1490	   there is a checksum (see [18] for an example of checksums used to
1491	   detect corruption in application data blocks).

1493	   Corruption in each ADB can be detected thusly:

1495	   o  If the guard pattern is anything other than one of the allowed
1496	      values, including all zeros.

1498	   o  If the guard pattern is FREE and any other byte in the remainder
1499	      of the ADB is anything other than zero.

1501	   o  If the guard pattern is anything other than FREE, then if the
1502	      stored checksum does not match the computed checksum.

1504	   o  If the guard pattern is INDIRECT and one of the stored indirect
1505	      block numbers has a value greater than the number of ADBs in the
1506	      file.

1508	   o  If the guard pattern is INDIRECT and one of the stored indirect
1509	      block numbers is a duplicate of another stored indirect block
1510	      number.
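
   The first three checks can be sketched as follows.  The checksum
   algorithm is not specified here, so CRC-32 widened to 8 bytes is
   used purely as a stand-in; the helper names and the zero padding at
   bytes 12 to 15 are likewise assumptions.  The two INDIRECT checks
   are omitted because the layout of the stored block numbers is not
   defined in this example.

```python
import zlib

BLOCK_SIZE = 4096
FREE, DATA, INDIRECT, INVALID = 0xFEEDFACE, 0xCAFEDEAD, 0xE4E5C001, 0xBA1ED4A3
STATES = {FREE, DATA, INDIRECT, INVALID}


def checksum(payload):
    """Stand-in 8-byte checksum over the contents after byte 24."""
    return zlib.crc32(payload).to_bytes(8, "big")


def make_block(number, state, payload=b""):
    """Build a block: counter at 0, guard at 8, checksum at 16, then data."""
    body = payload.ljust(BLOCK_SIZE - 24, b"\x00")
    stored = bytes(8) if state == FREE else checksum(body)
    return (number.to_bytes(8, "big") + state.to_bytes(4, "big")
            + bytes(4) + stored + body)


def is_corrupt(block):
    guard = int.from_bytes(block[8:12], "big")
    if guard not in STATES:
        return True                      # unknown guard, including zeros
    if guard == FREE:
        return any(block[12:])           # the rest of a FREE block is zero
    return block[16:24] != checksum(block[24:])


good_free = make_block(0, FREE)
good_data = make_block(4, DATA, b"hello")
flipped = bytearray(good_data)
flipped[100] ^= 0x01                     # flip one bit in the payload
```

   As in the text, all of this runs in the application; the transfer
   layer sees only opaque bytes.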

1512	   As can be seen, the application can detect errors based on the
1513	   combination of the guard pattern state and the checksum.  But also,
1514	   the application can detect corruption based on the state and the
1515	   contents of the ADB.  This last point is important in validating the
1516	   minimum amount of data we incorporated into our generic framework.
1517	   I.e., the guard pattern is sufficient in allowing applications to
1518	   design their own corruption detection.

1520	   Finally, it is important to note that none of these corruption checks
1521	   occur in the transport layer.  The server and client components are
1522	   totally unaware of the file format and might report everything as
1523	   being transferred correctly even in the case the application detects
1524	   corruption.

1526	6.4.  Example of READ_PLUS

1528	   The hypothetical application presented in Section 6.3 can be used to
1529	   illustrate how READ_PLUS would return an array of results.  A file is
1530	   created and initialized with 100 4k ADBs in the FREE state:

1532	      INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}

1534	   Further, assume the application writes a single ADB at 16k, changing
1535	   the guard pattern to 0xcafedead, we would then have in memory:

1537	      0 -> (16k - 1)   : 4k, 4, 0, 0, 8, 0xfeedface
1538	      16k -> (20k - 1) : 00 00 00 00 00 00 00 04 ca fe de ad XX ... XX
1539	      20k -> (400k - 1): 4k, 95, 0, 5, 8, 0xfeedface

1541	   And when the client did a READ_PLUS of 64k at the start of the file,
1542	   it would get back a result of an ADB, some data, and a final ADB:

1544	      ADB {0, 4k, 4, 0, 0, 8, 0xfeedface}
1545	      data 4k
1546	      ADB {20k, 4k, 11, 0, 5, 8, 0xfeedface}
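
   A server-side view of how such a result array might be assembled is
   sketched below.  It assumes block-aligned reads, numbers blocks
   from zero, and uses an invented dictionary layout for the segments;
   none of this is mandated by the protocol.

```python
def read_plus(offset, length, block_size, block_count, written, pattern):
    """Coalesce the blocks in [offset, offset + length) into segments."""
    first = offset // block_size
    last = min(block_count, (offset + length + block_size - 1) // block_size)
    segments = []
    for n in range(first, last):
        if n in written:
            # Blocks holding real data are returned as data.
            segments.append({"kind": "data", "offset": n * block_size,
                             "length": block_size})
        elif segments and segments[-1]["kind"] == "ADB":
            # Extend the current run of still-initialized blocks.
            segments[-1]["block_count"] += 1
        else:
            segments.append({"kind": "ADB", "offset": n * block_size,
                             "block_size": block_size, "block_count": 1,
                             "block_num": n, "pattern": pattern})
    return segments


k = 1024
# 100 blocks of 4k, with only the block at 16k written.
segments = read_plus(0, 64 * k, 4 * k, 100, {4}, 0xFEEDFACE)
summary = [(s["kind"], s["offset"], s.get("block_count")) for s in segments]
```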

1548	6.5.  Zero Filled Holes

1550	   As applications are free to define the structure of an ADB, it is
1551	   trivial to define an ADB which supports zero filled holes.  Such a
1552	   case would encompass the traditional definitions of a sparse file and
1553	   hole punching.  For example, to punch a 64k hole, starting at 100M,
1554	   into an existing file which has no ADB structure:

1556	      INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
1557	                  0, NFS4_UINT64_MAX, 0x0}

1559	7.  Labeled NFS

1561	7.1.  Introduction

1563	   Access control models such as Unix permissions or Access Control
1564	   Lists are commonly referred to as Discretionary Access Control (DAC)
1565	   models.  These systems base their access decisions on user identity
1566	   and resource ownership.  In contrast Mandatory Access Control (MAC)
1567	   models base their access control decisions on the label on the
1568	   subject (usually a process) and the object it wishes to access [7].
1569	   These labels may contain user identity information but usually
1570	   contain additional information.  In DAC systems users are free to
1571	   specify the access rules for resources that they own.  MAC models
1572	   base their security decisions on a system wide policy established by
1573	   an administrator or organization which the users do not have the
1574	   ability to override.  In this section, we add a MAC model to NFSv4.

1576	   The first change necessary is to devise a method for transporting and
1577	   storing security label data on NFSv4 file objects.  Security labels
1578	   have several semantics that are met by NFSv4 recommended attributes
1579	   such as the ability to set the label value upon object creation.
1580	   Access control on these attributes is done through a combination of
1581	   two mechanisms.  As with other recommended attributes on file objects
1582	   the usual DAC checks (ACLs and permission bits) will be performed to
1583	   ensure that proper file ownership is enforced.  In addition a MAC
1584	   system MAY be employed on the client, server, or both to enforce
1585	   additional policy on what subjects may modify security label
1586	   information.

1588	   The second change is to provide a method for the server to notify the
1589	   client that the attribute changed on an open file on the server.  If
1590	   the file is closed, then during the open attempt, the client will
1591	   gather the new attribute value.  The server MUST NOT communicate the
1592	   new value of the attribute; the client MUST query it.  This
1593	   requirement stems from the need for the client to provide sufficient
1594	   access rights to the attribute.

1596	   The final change necessary is a modification to the RPC layer used in
1597	   NFSv4 in the form of a new version of the RPCSEC_GSS [8] framework.
1598	   In order for an NFSv4 server to apply MAC checks it must obtain
1599	   additional information from the client.  Several methods were
1600	   explored for performing this and it was decided that the best
1601	   approach was to incorporate the ability to make security attribute
1602	   assertions through the RPC mechanism.  RPCSECGSSv3 [5] outlines a
1603	   method to assert additional security information such as security
1604	   labels on gss context creation and have that data bound to all RPC
1605	   requests that make use of that context.

1607	7.2.  Definitions

1609	   Label Format Specifier (LFS):  is an identifier used by the client to
1610	      establish the syntactic format of the security label and the
1611	      semantic meaning of its components.  These specifiers exist in a
1612	      registry associated with documents describing the format and
1613	      semantics of the label.

1615	   Label Format Registry:  is the IANA registry containing all
1616	      registered LFS along with references to the documents that
1617	      describe the syntactic format and semantics of the security label.

1619	   Policy Identifier (PI):  is an optional part of the definition of a
1620	      Label Format Specifier which allows for clients and servers to
1621	      identify specific security policies.

1623	   Object:  is a passive resource within the system that we wish to be
1624	      protected.  Objects can be entities such as files, directories,
1625	      pipes, sockets, and many other system resources relevant to the
1626	      protection of the system state.

1628	   Subject:  is an active entity, usually a process, which is
1629	      requesting access to an object.

1631	   Multi-Level Security (MLS):  is a traditional model where objects are
1632	      given a sensitivity level (Unclassified, Secret, Top Secret, etc.)
1633	      and a category set [21].

1635	7.3.  MAC Security Attribute

1637	   MAC models base access decisions on security attributes bound to
1638	   subjects and objects.  This information can range from a user
1639	   identity for an identity based MAC model, sensitivity levels for
1640	   Multi-level security, or a type for Type Enforcement.  These models
1641	   base their decisions on different criteria but the semantics of the
1642	   security attribute remain the same.  The semantics required by the
1643	   security attributes are listed below:

1645	   o  Must provide flexibility with respect to MAC model.

1647	   o  Must provide the ability to atomically set security information
1648	      upon object creation.

1650	   o  Must provide the ability to enforce access control decisions both
1651	      on the client and the server.

1653	   o  Must not expose an object to either the client or server name
1654	      space before its security information has been bound to it.

1656	   NFSv4 implements the security attribute as a recommended attribute.
1657	   These attributes have a fixed format and semantics, which conflicts
1658	   with the flexible nature of the security attribute.  To resolve this
1659	   the security attribute consists of two components.  The first
1660	   component is an LFS as defined in [22] to allow for interoperability
1661	   between MAC mechanisms.  The second component is an opaque field
1662	   which is the actual security attribute data.  To allow for various
1663	   MAC models NFSv4 should be used solely as a transport mechanism for
1664	   the security attribute.  It is the responsibility of the endpoints to
1665	   consume the security attribute and make access decisions based on
1666	   their respective models.  In addition, creation of objects through
1667	   OPEN and CREATE allows for the security attribute to be specified
1668	   upon creation.  By providing an atomic create and set operation for
1669	   the security attribute it is possible to enforce the second and
1670	   fourth requirements.  The recommended attribute FATTR4_SEC_LABEL will
1671	   be used to satisfy this requirement.

1673	7.3.1.  Interpreting FATTR4_SEC_LABEL

1675	   The XDR [23] necessary to implement Labeled NFSv4 is presented below:

1677	   const FATTR4_SEC_LABEL   = 81;

1679	   typedef uint32_t  policy4;

1681	                                 Figure 6

1683	   struct labelformat_spec4 {
1684	           policy4 lfs_lfs;
1685	           policy4 lfs_pi;
1686	   };

1688	   struct sec_label_attr_info {
1689	           labelformat_spec4       slai_lfs;
1690	           opaque                  slai_data<>;
1691	   };
1692	   The FATTR4_SEC_LABEL contains an array of two components with the
1693	   first component being an LFS.  It serves to provide the receiving end
1694	   with the information necessary to translate the security attribute
1695	   into a form that is usable by the endpoint.  Label Formats assigned
1696	   an LFS may optionally choose to include a Policy Identifier field to
1697	   allow for complex policy deployments.  The LFS and Label Format
1698	   Registry are described in detail in [22].  The translation used to
1699	   interpret the security attribute is not specified as part of the
1700	   protocol as it may depend on various factors.  The second component
1701	   is an opaque section which contains the data of the attribute.  This
1702	   component is dependent on the MAC model to interpret and enforce.

1704	   In particular, it is the responsibility of the LFS specification to
1705	   define a maximum size for the opaque section, slai_data<>.  When
1706	   creating or modifying a label for an object, the client needs to be
1707	   guaranteed that the server will accept a label that is sized
1708	   correctly.  Because both client and server are part of a specific
1709	   MAC model, the client will be aware of the size.
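
   As an illustration of the structure above, the following is a minimal
   Python sketch of how a sec_label_attr_info value could be laid out on
   the wire, assuming standard XDR rules (big-endian 32-bit integers, and
   a variable-length opaque sent as a length word followed by the data
   padded to a multiple of four bytes).  The function names and the
   max_len parameter are hypothetical; max_len stands in for whatever
   maximum size the LFS specification defines.

```python
import struct

def encode_sec_label(lfs, pi, data, max_len=None):
    """Encode (lfs_lfs, lfs_pi, slai_data<>) using XDR layout rules.

    max_len is a hypothetical stand-in for the maximum label size that
    the LFS specification is expected to define.
    """
    if max_len is not None and len(data) > max_len:
        raise ValueError("label exceeds the maximum size for this LFS")
    pad = (-len(data)) % 4  # XDR pads opaque data to a 4-byte boundary
    header = struct.pack(">III", lfs, pi, len(data))
    return header + data + b"\x00" * pad

def decode_sec_label(buf):
    """Decode the bytes produced by encode_sec_label()."""
    if len(buf) < 12:
        raise ValueError("truncated sec_label_attr_info")
    lfs, pi, length = struct.unpack_from(">III", buf, 0)
    end = 12 + length
    if end > len(buf):
        raise ValueError("opaque length runs past end of buffer")
    # The opaque portion is returned untouched; interpreting it is left
    # to the MAC model identified by the LFS.
    return lfs, pi, buf[12:end]
```

   The decoder deliberately does not interpret the opaque portion, which
   matches the text above: translation depends on the MAC model and is
   not part of the protocol.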

1711	7.3.2.  Delegations

1713	   In the event that a security attribute is changed on the server while
1714	   a client holds a delegation on the file, the client should follow the
1715	   existing protocol with respect to attribute changes.  It should flush
1716	   all changes back to the server and relinquish the delegation.

1718	7.3.3.  Permission Checking

1720	   It is not feasible to enumerate all possible MAC models and even
1721	   levels of protection within a subset of these models.  This means
1722	   that the NFSv4 client and servers cannot be expected to directly make
1723	   access control decisions based on the security attribute.  Instead
1724	   NFSv4 should defer permission checking on this attribute to the host
1725	   system.  These checks are performed in addition to existing DAC and
1726	   ACL checks outlined in the NFSv4 protocol.  Section 7.6 gives a
1727	   specific example of how the security attribute is handled under a
1728	   particular MAC model.

1730	7.3.4.  Object Creation

1732	   When creating files in NFSv4 the OPEN and CREATE operations are used.
1733	   One of the parameters to these operations is an fattr4 structure
1734	   containing the attributes the file is to be created with.  This
1735	   allows NFSv4 to atomically set the security attribute of files upon
1736	   creation.  When a client is MAC aware it must always provide the
1737	   initial security attribute upon file creation.  In the event that the
1738	   server is the only MAC aware entity in the system it should ignore
1739	   the security attribute specified by the client and instead make the
1740	   determination itself.  A more in depth explanation can be found in
1741	   Section 7.6.

1743	7.3.5.  Existing Objects

1745	   Note that under the MAC model, all objects must have labels.
1746	   Therefore, if an existing server is upgraded to include LNFS support,
1747	   then it is the responsibility of the security system to define the
1748	   behavior for existing objects.  For example, if the security system
1749	   is LFS 0, which means the server just stores and returns labels, then
1750	   existing files should return labels which are set to an empty value.

1752	7.3.6.  Label Changes

1754	   As per the requirements, when a file's security label is modified,
1755	   the server must notify all clients which have the file opened of the
1756	   change in label.  It does so with CB_ATTR_CHANGED.  There are
1757	   preconditions to making an attribute change imposed by NFSv4 and the
1758	   security system might want to impose others.  In the process of
1759	   meeting these preconditions, the server may choose to either serve the
1760	   request in whole or return NFS4ERR_DELAY to the SETATTR operation.

1762	   If there are open delegations on the file belonging to a client other
1763	   than the one making the label change, then the process described in
1764	   Section 7.3.2 must be followed.

1766	   As the server is always presented with the subject label from the
1767	   client, it does not necessarily need to communicate the fact that the
1768	   label has changed to the client.  In the cases where the change
1769	   outright denies the client access, the client will be able to quickly
1770	   determine that there is a new label in effect.  It is in cases where
1771	   the client may share the same object between multiple subjects or a
1772	   security system which is not strictly hierarchical that the
1773	   CB_ATTR_CHANGED callback is very useful.  It allows the server to
1774	   inform the clients that the cached security attribute is now stale.

1776	   Consider a system in which the clients enforce MAC checks and the
1777	   server has a very simple security system which just stores the
1778	   labels.  In this system, the MAC label check always allows access,
1779	   regardless of the subject label.

1781	   The way in which MAC labels are enforced is by the client.  So if
1782	   client A changes a security label on a file, then the server MUST
1783	   inform all clients that have the file opened that the label has
1784	   changed via CB_ATTR_CHANGED.  Then the clients MUST retrieve the new
1785	   label and MUST enforce access via the new attribute values.

1787	7.4.  pNFS Considerations

1789	   This section examines the issues in deploying LNFS in a pNFS
1790	   community of servers.

1792	7.4.1.  MAC Label Checks

1794	   The new FATTR4_SEC_LABEL attribute is metadata information and as
1795	   such the DS is not aware of the value contained on the MDS.
1796	   Fortunately, the NFSv4.1 protocol [2] already has provisions for
1797	   doing access level checks from the DS to the MDS.  In order for the
1798	   DS to validate the subject label presented by the client, it SHOULD
1799	   utilize this mechanism.

1801	   If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize
1802	   CB_ATTR_CHANGED to inform the client of that fact.  If the MDS is
1803	   maintaining

1805	7.5.  Discovery of Server LNFS Support

1807	   The server can easily determine that a client supports LNFS when it
1808	   queries for the FATTR4_SEC_LABEL label for an object.  Note that it
1809	   cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS
1810	   support.  The client might need to discover which LFS the server
1811	   supports.

1813	   A server which supports LNFS MUST allow a client with any subject
1814	   label to retrieve the FATTR4_SEC_LABEL attribute for the root
1815	   filehandle, ROOTFH.  The following compound must always succeed as
1816	   far as a MAC label check is concerned:

1818	        PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}

1820	   Note that the server might have imposed a security flavor on the root
1821	   that precludes such access.  For example, if the server requires
1822	   Kerberized access and the client presents a compound with AUTH_SYS,
1823	   then the server is allowed to return NFS4ERR_WRONGSEC.  But if
1824	   the client presents a correct security flavor, then the server MUST
1825	   return the FATTR4_SEC_LABEL attribute with the supported LFS filled
1826	   in.

1828	7.6.  MAC Security NFS Modes of Operation

1830	   A system using Labeled NFS may operate in two modes.  The first mode
1831	   provides the most protection and is called "full mode".  In this mode
1832	   both the client and server implement a MAC model allowing each end to
1833	   make an access control decision.  The remaining mode is called the
1834	   "guest mode" and in this mode one end of the connection is not
1835	   implementing a MAC model and thus offers less protection than full
1836	   mode.

1838	7.6.1.  Full Mode

1840	   Full mode environments consist of MAC aware NFSv4 servers and clients
1841	   and may be composed of mixed MAC models and policies.  The system
1842	   requires that both the client and server have an opportunity to
1843	   perform an access control check based on all relevant information
1844	   within the network.  The file object security attribute is provided
1845	   using the mechanism described in Section 7.3.  The security attribute
1846	   of the subject making the request is transported at the RPC layer
1847	   using the mechanism described in RPCSECGSSv3 [5].

1849	7.6.1.1.  Initial Labeling and Translation

1851	   The ability to create a file is an action that a MAC model may wish
1852	   to mediate.  The client is given the responsibility to determine the
1853	   initial security attribute to be placed on a file.  This allows the
1854	   client to make a decision as to the acceptable security attributes to
1855	   create a file with before sending the request to the server.  Once
1856	   the server receives the creation request from the client it may
1857	   choose to evaluate if the security attribute is acceptable.

1859	   Security attributes on the client and server may vary based on MAC
1860	   model and policy.  To handle this the security attribute field has an
1861	   LFS component.  This component is a mechanism for the host to
1862	   identify the format and meaning of the opaque portion of the security
1863	   attribute.  A full mode environment may contain hosts operating in
1864	   several different LFSs.  In this case a mechanism for translating the
1865	   opaque portion of the security attribute is needed.  The actual
1866	   translation function will vary based on MAC model and policy and is
1867	   out of the scope of this document.  If a translation is unavailable
1868	   for a given LFS then the request SHOULD be denied.  Another recourse
1869	   is to allow the host to provide a fallback mapping for unknown
1870	   security attributes.

1872	7.6.1.2.  Policy Enforcement

1874	   In full mode access control decisions are made by both the clients
1875	   and servers.  When a client makes a request it takes the security
1876	   attribute from the requesting process and makes an access control
1877	   decision based on that attribute and the security attribute of the
1878	   object it is trying to access.  If the client denies that access, an
1879	   RPC call to the server is never made.  If, however, the access is
1880	   allowed, the client will make a call to the NFS server.

1882	   When the server receives the request from the client it extracts the
1883	   security attribute conveyed in the RPC request.  The server then uses
1884	   this security attribute and the attribute of the object the client is
1885	   trying to access to make an access control decision.  If the server's
1886	   policy allows this access it will fulfill the client's request,
1887	   otherwise it will return NFS4ERR_ACCESS.

1889	   Implementations MAY validate security attributes supplied over the
1890	   network to ensure that they are within a set of attributes permitted
1891	   from a specific peer, and if not, reject them.  Note that a system
1892	   may permit a different set of attributes to be accepted from each
1893	   peer.
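
   The two-stage check described in this section can be sketched as
   follows.  This is only an illustration: the integer "levels" and the
   toy_policy function are hypothetical stand-ins for a real MAC model,
   and the function names are not part of the protocol.  The only value
   taken from NFSv4 is the NFS4ERR_ACCESS error code (13).

```python
NFS4_OK = 0
NFS4ERR_ACCESS = 13

def toy_policy(subject_level, object_level):
    """Hypothetical MAC rule: a subject may access objects at or below
    its own level.  Real MAC models are far richer than this."""
    return subject_level >= object_level

def server_handle(subject_level, object_level, policy=toy_policy):
    """Server side: check the subject label carried in the RPC against
    the label the server holds for the object."""
    return NFS4_OK if policy(subject_level, object_level) else NFS4ERR_ACCESS

def client_access(subject_level, cached_object_level, server_object_level,
                  rpc_log, policy=toy_policy):
    """Client side: check locally first; only send the RPC if allowed.

    Returns None when the client denies access itself (no RPC is sent),
    otherwise the status returned by the server.
    """
    if not policy(subject_level, cached_object_level):
        return None
    rpc_log.append(subject_level)  # records that an RPC was issued
    return server_handle(subject_level, server_object_level, policy)
```

   The cached and server-side object levels are passed separately so
   that the sketch can show a case where the client allows a request
   that the server then rejects, for instance because the client's
   cached label is stale.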

1895	7.6.1.3.  Label Aware Only Server

1897	   If the LFS is 0, then it indicates a server which is label aware, but
1898	   does not enforce policies.  Such a server will store and retrieve all
1899	   object labels presented by clients, notify the clients of any label
1900	   changes via CB_ATTR_CHANGED, but will not restrict access via the
1901	   subject label.  Instead, it will expect the clients to enforce all
1902	   such access locally.

1904	7.6.2.  Guest Mode

1906	   Guest mode implies that either the client or the server does not
1907	   handle labels.  If the client is not LNFS aware, then it will not
1908	   offer subject labels to the server.  The server is the only entity
1909	   enforcing policy, and may selectively provide standard NFS services
1910	   to clients based on their authentication credentials and/or
1911	   associated network attributes (e.g., IP address, network interface).
1912	   The level of trust and access extended to a client in this mode is
1913	   configuration-specific.  If the server is not LNFS aware, then it
1914	   will not return object labels to the client.  Clients in this
1915	   environment may consist of groups implementing different MAC
1916	   model policies.  The system requires that all clients in the
1917	   environment be responsible for access control checks.

1919	7.7.  Security Considerations

1921	   This entire document deals with security issues.

1923	   Depending on the level of protection the MAC system offers there may
1924	   be a requirement to tightly bind the security attribute to the data.

1926	   When only one of the client or server enforces labels, it is
1927	   important to realize that the other side is not enforcing MAC
1928	   protections.  Alternate methods might be in use to handle the lack of
1929	   MAC support and care should be taken to identify and mitigate threats
1930	   from possible tampering outside of these methods.

1932	   An example of this is that a server that modifies READDIR or LOOKUP
1933	   results based on the client's subject label might want to always
1934	   construct the same subject label for a client which does not present
1935	   one.  This will prevent a non-LNFS client from mixing entries in the
1936	   directory cache.

1938	8.  Sharing change attribute implementation details with NFSv4 clients

1940	8.1.  Introduction

1942	   Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the
1943	   change attribute as being mandatory to implement, there is little in
1944	   the way of guidance.  The only feature that is mandated by them is
1945	   that the value must change whenever the file data or metadata change.

1947	   While this allows for a wide range of implementations, it also leaves
1948	   the client with a conundrum: how does it determine which is the most
1949	   recent value for the change attribute in a case where several RPC
1950	   calls have been issued in parallel?  In other words if two COMPOUNDs,
1951	   both containing WRITE and GETATTR requests for the same file, have
1952	   been issued in parallel, how does the client determine which of the
1953	   two change attribute values returned in the replies to the GETATTR
1954	   requests corresponds to the most recent state of the file?  In some
1955	   cases, the only recourse may be to send another COMPOUND containing a
1956	   third GETATTR that is fully serialised with the first two.

1958	   NFSv4.2 avoids this kind of inefficiency by allowing the server to
1959	   share details about how the change attribute is expected to evolve,
1960	   so that the client may immediately determine which, out of the
1961	   several change attribute values returned by the server, is the most
1962	   recent.

1964	8.2.  Definition of the 'change_attr_type' per-file system attribute

1966	   enum change_attr_typeinfo {
1967	              NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
1968	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
1969	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
1970	              NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
1971	              NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
1972	   };

1974	        +------------------+----+---------------------------+-----+
1975	        | Name             | Id | Data Type                 | Acc |
1976	        +------------------+----+---------------------------+-----+
1977	        | change_attr_type | XX | enum change_attr_typeinfo | R   |
1978	        +------------------+----+---------------------------+-----+

1980	   The solution enables the NFS server to provide additional information
1981	   about how it expects the change attribute value to evolve after the
1982	   file data or metadata has changed. 'change_attr_type' is defined as a
1983	   new recommended attribute, and takes values from enum
1984	   change_attr_typeinfo as follows:

1986	   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value MUST
1987	      monotonically increase for every atomic change to the file
1988	      attributes, data or directory contents.

1990	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:  The change attribute value MUST
1991	      be incremented by one unit for every atomic change to the file
1992	      attributes, data or directory contents.  This property is
1993	      preserved when writing to pNFS data servers.

1995	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:  The change attribute
1996	      value MUST be incremented by one unit for every atomic change to
1997	      the file attributes, data or directory contents.  In the case
1998	      where the client is writing to pNFS data servers, the number of
1999	      increments is not guaranteed to exactly match the number of
2000	      writes.

2002	   NFS4_CHANGE_TYPE_IS_TIME_METADATA:  The change attribute is
2003	      implemented as suggested in the NFSv4 spec [10] in terms of the
2004	      time_metadata attribute.

2006	   NFS4_CHANGE_TYPE_IS_UNDEFINED:  The change attribute does not take
2007	      values that fit into any of these categories.

2009	   If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
2010	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
2011	   NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at
2012	   the very least that the change attribute is monotonically increasing,
2013	   which is sufficient to resolve the question of which value is the
2014	   most recent.

2016	   If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then
2017	   by inspecting the value of the 'time_delta' attribute it additionally
2018	   has the option of detecting rogue server implementations that use
2019	   time_metadata in violation of the spec.

2021	   Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
2022	   has the ability to predict what the resulting change attribute value
2023	   should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
2024	   This again allows it to detect changes made in parallel by another
2025	   client.  The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
2026	   the same, but only if the client is not doing pNFS WRITEs.
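
   The client-side reasoning above can be sketched in Python as shown
   below.  The enum values mirror change_attr_typeinfo; the helper names
   most_recent and predict_after are hypothetical.  The sketch follows
   the text literally, so only the three types named above are treated
   as ordered, and prediction under VERSION_COUNTER_NOPNFS is refused
   whenever pNFS WRITEs are in use.

```python
from enum import IntEnum

class ChangeAttrType(IntEnum):
    MONOTONIC_INCR = 0
    VERSION_COUNTER = 1
    VERSION_COUNTER_NOPNFS = 2
    TIME_METADATA = 3
    UNDEFINED = 4

# Types for which the text says the value is known to increase.
_ORDERED = {
    ChangeAttrType.MONOTONIC_INCR,
    ChangeAttrType.VERSION_COUNTER,
    ChangeAttrType.TIME_METADATA,
}

def most_recent(kind, a, b):
    """Pick the newer of two change attribute values from parallel replies.

    Returns None when the type gives no ordering, in which case the
    client would need a further, serialised GETATTR.
    """
    if kind in _ORDERED:
        return max(a, b)
    return None

def predict_after(kind, before, n_changes, pnfs_writes=False):
    """Predict the change attribute after n_changes atomic changes made
    by this client, or return None if no prediction is possible."""
    if kind == ChangeAttrType.VERSION_COUNTER:
        return before + n_changes
    if kind == ChangeAttrType.VERSION_COUNTER_NOPNFS and not pnfs_writes:
        return before + n_changes
    return None
```

   A client could compare the value returned by the server with the
   result of predict_after; a mismatch would suggest that another client
   changed the file in the meantime.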

2028	9.  Security Considerations

2030	10.  Error Values

2032	   NFS error numbers are assigned to failed operations within a Compound
2033	   (COMPOUND or CB_COMPOUND) request.  A Compound request contains a
2034	   number of NFS operations that have their results encoded in sequence
2035	   in a Compound reply.  The results of successful operations will
2036	   consist of an NFS4_OK status followed by the encoded results of the
2037	   operation.  If an NFS operation fails, an error status will be
2038	   entered in the reply and the Compound request will be terminated.
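
   The termination rule described above can be illustrated with a short
   Python sketch.  The run_compound function and the shape of the
   handlers are assumptions made for illustration; only the status
   values NFS4_OK and NFS4ERR_UNION_NOTSUPP come from the protocol.

```python
NFS4_OK = 0
NFS4ERR_UNION_NOTSUPP = 10094

def run_compound(ops):
    """Evaluate operations in order and stop at the first failure.

    ops is a list of (name, handler) pairs; each handler is a
    hypothetical callable returning (status, payload).  The reply holds
    one entry per operation that was attempted, including the failing
    one, and nothing for the operations that follow it.
    """
    results = []
    for name, handler in ops:
        status, payload = handler()
        results.append((name, status, payload))
        if status != NFS4_OK:
            break  # the Compound request is terminated here
    return results
```

   For example, in a three-operation request whose second operation
   fails, the reply would carry two results and the third operation
   would never be evaluated.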

2040	10.1.  Error Definitions

2042	                        Protocol Error Definitions

2044	         +--------------------------+--------+------------------+
2045	         | Error                    | Number | Description      |
2046	         +--------------------------+--------+------------------+
2047	         | NFS4ERR_BADLABEL         | 10093  | Section 10.1.3.1 |
2048	         | NFS4ERR_METADATA_NOTSUPP | 10090  | Section 10.1.2.1 |
2049	         | NFS4ERR_OFFLOAD_DENIED   | 10091  | Section 10.1.2.2 |
2050	         | NFS4ERR_PARTNER_NO_AUTH  | 10089  | Section 10.1.2.3 |
2051	         | NFS4ERR_PARTNER_NOTSUPP  | 10088  | Section 10.1.2.4 |
2052	         | NFS4ERR_UNION_NOTSUPP    | 10094  | Section 10.1.1.1 |
2053	         | NFS4ERR_WRONG_LFS        | 10092  | Section 10.1.3.2 |
2054	         +--------------------------+--------+------------------+

2056	                                  Table 1

2058	10.1.1.  General Errors

2060	   This section deals with errors that are applicable to a broad set of
2061	   different purposes.

2063	10.1.1.1.  NFS4ERR_UNION_NOTSUPP (Error Code 10094)

2065	   One of the arguments to the operation is a discriminated union and
2066	   while the server supports the given operation, it does not support
2067	   the selected arm of the discriminated union.  For an example, see
2068	   READ_PLUS (Section 13.10).

2070	10.1.2.  Server to Server Copy Errors

2072	   These errors deal with the interaction between server to server
2073	   copies.

2075	10.1.2.1.  NFS4ERR_METADATA_NOTSUPP (Error Code 10090)

2077	   The destination file cannot support the same metadata as the source
2078	   file.

2080	10.1.2.2.  NFS4ERR_OFFLOAD_DENIED (Error Code 10091)

2082	   The copy offload operation is supported by both the source and the
2083	   destination, but the destination is not allowing it for this file.
2084	   If the client sees this error, it should fall back to the normal copy
2085	   semantics.

2087	10.1.2.3.  NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)

2089	   The remote server does not authorize a server-to-server copy offload
2090	   operation.  This may be due to the client's failure to send the
2091	   COPY_NOTIFY operation to the remote server, the remote server
2092	   receiving a server-to-server copy offload request after the copy
2093	   lease time expired, or for some other permission problem.

2095	10.1.2.4.  NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)

2097	   The remote server does not support the server-to-server copy offload
2098	   protocol.

2100	10.1.3.  Labeled NFS Errors

2102	   These errors are used in LNFS.

2104	10.1.3.1.  NFS4ERR_BADLABEL (Error Code 10093)

2106	   The label specified is invalid in some manner.

2108	10.1.3.2.  NFS4ERR_WRONG_LFS (Error Code 10092)

2110	   The LFS specified in the subject label is not compatible with the LFS
2111	   in the object label.

2113	11.  File Attributes

2115	11.1.  Attribute Definitions

2117	11.1.1.  Attribute 77: space_reserved

2119	   The space_reserved attribute is a read/write attribute of type
2120	   boolean.  It is a per file attribute.  When the space_reserved
2121	   attribute is set via SETATTR, the server must ensure that there is
2122	   disk space to accommodate every byte in the file before it can return
2123	   success.  If the server cannot guarantee this, it must return
2124	   NFS4ERR_NOSPC.

2126	   If the client tries to grow a file which has the space_reserved
2127	   attribute set, the server must guarantee that there is disk space to
2128	   accommodate every byte in the file with the new size before it can
2129	   return success.  If the server cannot guarantee this, it must return
2130	   NFS4ERR_NOSPC.

2132	   It is not required that the server allocate the space to the file
2133	   before returning success.  The allocation can be deferred; however,
2134	   it must be guaranteed that it will not fail for lack of space.

2136	   The value of space_reserved can be obtained at any time through
2137	   GETATTR.

2139	   In order to avoid ambiguity, the space_reserved bit cannot be set
2140	   along with the size bit in SETATTR.  Increasing the size of a file
2141	   with space_reserved set will fail if space reservation cannot be
2142	   guaranteed for the new size.  If the file size is decreased, space
2143	   reservation is only guaranteed for the new size and the extra blocks
2144	   backing the file can be released.
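
   The reservation semantics above can be modelled with a toy accounting
   sketch in Python.  The ToyVolume class is hypothetical and greatly
   simplified: it treats every file as having no blocks allocated yet,
   tracks only byte counts, and ignores real allocation.  The only
   protocol value used is the NFS4ERR_NOSPC error code (28).

```python
NFS4_OK = 0
NFS4ERR_NOSPC = 28

class ToyVolume:
    """Hypothetical model of a server volume that only tracks how many
    bytes are free and how many are promised to each reserved file."""

    def __init__(self, free_bytes):
        self.free = free_bytes
        self.reserved = {}  # file name -> bytes guaranteed to that file

    def reserve(self, name, size):
        """Guarantee space for a file of the given size.

        Covers both setting space_reserved on a file and later changing
        the size of a file that already has it set.
        """
        delta = size - self.reserved.get(name, 0)
        if delta > self.free:
            return NFS4ERR_NOSPC  # guarantee cannot be met; no change
        self.free -= delta        # a negative delta releases space
        self.reserved[name] = size
        return NFS4_OK
```

   Note that success only records a guarantee; as the text says, the
   actual allocation may be deferred as long as it cannot later fail for
   lack of space.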

2146	11.1.2.  Attribute 78: space_freed

2148	   space_freed gives the number of bytes freed if the file is deleted.
2149	   This attribute is read only and is of type length4.  It is a per file
2150	   attribute.

2152	12.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL

2154	   The following tables summarize the operations of the NFSv4.2 protocol
2155	   and the corresponding designation of REQUIRED, RECOMMENDED, and
2156	   OPTIONAL to implement or MUST NOT implement.  The designation of MUST
2157	   NOT implement is reserved for those operations that were defined in
2158	   either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2.

2160	   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
2161	   for operations sent by the client is for the server implementation.
2162	   The client is generally required to implement the operations needed
2163	   for the operating environment for which it serves.  For example, a
2164	   read-only NFSv4.2 client would have no need to implement the WRITE
2165	   operation and is not required to do so.

2167	   The REQUIRED or OPTIONAL designation for callback operations sent by
2168	   the server is for both the client and server.  Generally, the client
2169	   has the option of creating the backchannel and sending the operations
2170	   on the fore channel that will be a catalyst for the server sending
2171	   callback operations.  A partial exception is CB_RECALL_SLOT; the only
2172	   way the client can avoid supporting this operation is by not creating
2173	   a backchannel.

2175	   Since this is a summary of the operations and their designation,
2176	   there are subtleties that are not presented here.  Therefore, if
2177	   there is a question of the requirements of implementation, the
2178	   operation descriptions themselves must be consulted along with other
2179	   relevant explanatory text within either this specification or that of
2180	   NFSv4.1 [2].

2182	   The abbreviations used in the second and third columns of the table
2183	   are defined as follows.

2185	   REQ  REQUIRED to implement

2187	   REC  RECOMMENDED to implement

2189	   OPT  OPTIONAL to implement

2191	   MNI  MUST NOT implement

2193	   For the NFSv4.2 features that are OPTIONAL, the operations that
2194	   support those features are OPTIONAL, and the server would return
2195	   NFS4ERR_NOTSUPP in response to the client's use of those operations.
2196	   If an OPTIONAL feature is supported, it is possible that a set of
2197	   operations related to the feature become REQUIRED to implement.  The
2198	   third column of the table designates the feature(s) and if the
2199	   operation is REQUIRED or OPTIONAL in the presence of support for the
2200	   feature.

2202	   The OPTIONAL features identified and their abbreviations are as
2203	   follows:

2205	   pNFS  Parallel NFS

2207	   FDELG  File Delegations

2209	   DDELG  Directory Delegations

2211	   COPY  Server Side Copy

2213	   ADB  Application Data Blocks

2215	                                Operations

2217	   +----------------------+--------------------+-----------------------+
2218	   | Operation            | REQ, REC, OPT, or  | Feature (REQ, REC, or |
2219	   |                      | MNI                | OPT)                  |
2220	   +----------------------+--------------------+-----------------------+
2221	   | ACCESS               | REQ                |                       |
2222	   | BACKCHANNEL_CTL      | REQ                |                       |
2223	   | BIND_CONN_TO_SESSION | REQ                |                       |
2224	   | CLOSE                | REQ                |                       |
2225	   | COMMIT               | REQ                |                       |
2226	   | COPY                 | OPT                | COPY (REQ)            |
2227	   | COPY_ABORT           | OPT                | COPY (REQ)            |
2228	   | COPY_NOTIFY          | OPT                | COPY (REQ)            |
2229	   | COPY_REVOKE          | OPT                | COPY (REQ)            |
2230	   | COPY_STATUS          | OPT                | COPY (REQ)            |
2231	   | CREATE               | REQ                |                       |
2232	   | CREATE_SESSION       | REQ                |                       |
2233	   | DELEGPURGE           | OPT                | FDELG (REQ)           |
2234	   | DELEGRETURN          | OPT                | FDELG, DDELG, pNFS    |
2235	   |                      |                    | (REQ)                 |
2236	   | DESTROY_CLIENTID     | REQ                |                       |
2237	   | DESTROY_SESSION      | REQ                |                       |
2238	   | EXCHANGE_ID          | REQ                |                       |
2239	   | FREE_STATEID         | REQ                |                       |
2240	   | GETATTR              | REQ                |                       |
2241	   | GETDEVICEINFO        | OPT                | pNFS (REQ)            |
2242	   | GETDEVICELIST        | OPT                | pNFS (OPT)            |
2243	   | GETFH                | REQ                |                       |
2244	   | INITIALIZE           | OPT                | ADB (REQ)             |
2245	   | GET_DIR_DELEGATION   | OPT                | DDELG (REQ)           |
2246	   | LAYOUTCOMMIT         | OPT                | pNFS (REQ)            |
2247	   | LAYOUTGET            | OPT                | pNFS (REQ)            |
2248	   | LAYOUTRETURN         | OPT                | pNFS (REQ)            |
2249	   | LINK                 | OPT                |                       |
2250	   | LOCK                 | REQ                |                       |
2251	   | LOCKT                | REQ                |                       |
2252	   | LOCKU                | REQ                |                       |
2253	   | LOOKUP               | REQ                |                       |
2254	   | LOOKUPP              | REQ                |                       |
2255	   | NVERIFY              | REQ                |                       |
2256	   | OPEN                 | REQ                |                       |
2257	   | OPENATTR             | OPT                |                       |
2258	   | OPEN_CONFIRM         | MNI                |                       |
2259	   | OPEN_DOWNGRADE       | REQ                |                       |
2260	   | PUTFH                | REQ                |                       |
2261	   | PUTPUBFH             | REQ                |                       |
2262	   | PUTROOTFH            | REQ                |                       |
2263	   | READ                 | OPT                |                       |
2264	   | READDIR              | REQ                |                       |
2265	   | READLINK             | OPT                |                       |
2266	   | READ_PLUS            | OPT                | ADB (REQ)             |
2267	   | RECLAIM_COMPLETE     | REQ                |                       |
2268	   | RELEASE_LOCKOWNER    | MNI                |                       |
2269	   | REMOVE               | REQ                |                       |
2270	   | RENAME               | REQ                |                       |
2271	   | RENEW                | MNI                |                       |
2272	   | RESTOREFH            | REQ                |                       |
2273	   | SAVEFH               | REQ                |                       |
2274	   | SECINFO              | REQ                |                       |
2275	   | SECINFO_NO_NAME      | REC                | pNFS file layout      |
2276	   |                      |                    | (REQ)                 |
2277	   | SEQUENCE             | REQ                |                       |
2278	   | SETATTR              | REQ                |                       |
2279	   | SETCLIENTID          | MNI                |                       |
2280	   | SETCLIENTID_CONFIRM  | MNI                |                       |
2281	   | SET_SSV              | REQ                |                       |
2282	   | TEST_STATEID         | REQ                |                       |
2283	   | VERIFY               | REQ                |                       |
2284	   | WANT_DELEGATION      | OPT                | FDELG (OPT)           |
2285	   | WRITE                | REQ                |                       |
2286	   +----------------------+--------------------+-----------------------+

2288	                            Callback Operations

2290	   +-------------------------+-------------------+---------------------+
2291	   | Operation               | REQ, REC, OPT, or | Feature (REQ, REC,  |
2292	   |                         | MNI               | or OPT)             |
2293	   +-------------------------+-------------------+---------------------+
2294	   | CB_COPY                 | OPT               | COPY (REQ)          |
2295	   | CB_GETATTR              | OPT               | FDELG (REQ)         |
2296	   | CB_LAYOUTRECALL         | OPT               | pNFS (REQ)          |
2297	   | CB_NOTIFY               | OPT               | DDELG (REQ)         |
2298	   | CB_NOTIFY_DEVICEID      | OPT               | pNFS (OPT)          |
2299	   | CB_NOTIFY_LOCK          | OPT               |                     |
2300	   | CB_PUSH_DELEG           | OPT               | FDELG (OPT)         |
2301	   | CB_RECALL               | OPT               | FDELG, DDELG, pNFS  |
2302	   |                         |                   | (REQ)               |
2303	   | CB_RECALL_ANY           | OPT               | FDELG, DDELG, pNFS  |
2304	   |                         |                   | (REQ)               |
2305	   | CB_RECALL_SLOT          | REQ               |                     |
2306	   | CB_RECALLABLE_OBJ_AVAIL | OPT               | DDELG, pNFS (REQ)   |
2307	   | CB_SEQUENCE             | OPT               | FDELG, DDELG, pNFS  |
2308	   |                         |                   | (REQ)               |
2309	   | CB_WANTS_CANCELLED      | OPT               | FDELG, DDELG, pNFS  |
2310	   |                         |                   | (REQ)               |
2311	   +-------------------------+-------------------+---------------------+

2313	13.  NFSv4.2 Operations

2315	13.1.  Operation 59: COPY - Initiate a server-side copy

2317	13.1.1.  ARGUMENT

2319	   const COPY4_GUARDED     = 0x00000001;
2320	   const COPY4_METADATA    = 0x00000002;

2322	   struct COPY4args {
2323	           /* SAVED_FH: source file */
2324	           /* CURRENT_FH: destination file or */
2325	           /*             directory           */
2326	           offset4         ca_src_offset;
2327	           offset4         ca_dst_offset;
2328	           length4         ca_count;
2329	           uint32_t        ca_flags;
2330	           component4      ca_destination;
2331	           netloc4         ca_source_server<>;
2332	   };

2334	13.1.2.  RESULT

2336	   union COPY4res switch (nfsstat4 cr_status) {
2337	           case NFS4_OK:
2338	                   stateid4        cr_callback_id<1>;
2339	           default:
2340	                   length4         cr_bytes_copied;
2341	   };

2343	13.1.3.  DESCRIPTION

2345	   The COPY operation is used for both intra-server and inter-server
2346	   copies.  In both cases, the COPY is always sent from the client to
2347	   the destination server of the file copy.  The COPY operation requests
2348	   that a file be copied from the location specified by the SAVED_FH
2349	   value to the location specified by the combination of CURRENT_FH and
2350	   ca_destination.

2352	   The SAVED_FH must be a regular file.  If SAVED_FH is not a regular
2353	   file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.

2355	   In order to set SAVED_FH to the source file handle, the compound
2356	   procedure requesting the COPY will include a sub-sequence of
2357	   operations such as
2358	      PUTFH source-fh
2359	      SAVEFH

2361	   If the request is for a server-to-server copy, the source-fh is a
2362	   filehandle from the source server and the compound procedure is being
2363	   executed on the destination server.  In this case, the source-fh is a
2364	   foreign filehandle on the server receiving the COPY request.  If
2365	   either PUTFH or SAVEFH checked the validity of the filehandle, the
2366	   operation would likely fail and return NFS4ERR_STALE.

2368	   In order to avoid this problem, the minor version incorporating the
2369	   COPY operations will need to make a few small changes in the handling
2370	   of existing operations.  If a server supports the server-to-server
2371	   COPY feature, a PUTFH followed by a SAVEFH MUST NOT return
2372	   NFS4ERR_STALE for either operation.  These restrictions do not pose
2373	   substantial difficulties for servers.  The CURRENT_FH and SAVED_FH
2374	   may be validated in the context of the operation referencing them and
2375	   an NFS4ERR_STALE error returned for an invalid file handle at that
2376	   point.

2378	   The CURRENT_FH and ca_destination together specify the destination of
2379	   the copy operation.  If ca_destination is of 0 (zero) length, then
2380	   CURRENT_FH specifies the target file.  In this case, CURRENT_FH MUST
2381	   be a regular file and not a directory.  If ca_destination is not of 0
2382	   (zero) length, the ca_destination argument specifies the file name to
2383	   which the data will be copied within the directory identified by
2384	   CURRENT_FH.  In this case, CURRENT_FH MUST be a directory and not a
2385	   regular file.
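
   The two destination cases above can be sketched as follows.  This
   is an illustrative Python sketch only: the function name, the
   FileType enum, and the use of ValueError are assumptions made for
   the example, and the text above does not say which NFS error a
   server returns for each mismatch.

```python
from enum import Enum


class FileType(Enum):
    REGULAR = "regular"
    DIRECTORY = "directory"


def resolve_copy_destination(current_fh_type, ca_destination):
    """Return (mode, name) describing where the copied data will land.

    Raises ValueError when CURRENT_FH has the wrong type for the given
    ca_destination (hypothetical error handling).
    """
    if len(ca_destination) == 0:
        # Zero-length name: CURRENT_FH itself is the target file.
        if current_fh_type is not FileType.REGULAR:
            raise ValueError("CURRENT_FH must be a regular file")
        return ("current_fh", None)
    # Non-empty name: CURRENT_FH is the directory that holds the target.
    if current_fh_type is not FileType.DIRECTORY:
        raise ValueError("CURRENT_FH must be a directory")
    return ("name_in_directory", ca_destination)
```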

2387	   If the file named by ca_destination does not exist and the operation
2388	   completes successfully, the file will be visible in the file system
2389	   namespace.  If the file does not exist and the operation fails, the
2390	   file MAY be visible in the file system namespace depending on when
2391	   the failure occurs and on the implementation of the NFS server
2392	   receiving the COPY operation.  If the ca_destination name cannot be
2393	   created in the destination file system (due to file name
2394	   restrictions, such as case or length), the operation MUST fail.

2396	   The ca_src_offset is the offset within the source file from which the
2397	   data will be read, the ca_dst_offset is the offset within the
2398	   destination file to which the data will be written, and the ca_count
2399	   is the number of bytes that will be copied.  An offset of 0 (zero)
2400	   specifies the start of the file.  A count of 0 (zero) requests that
2401	   all bytes from ca_src_offset through EOF be copied to the
2402	   destination.  If concurrent modifications to the source file overlap
2403	   with the source file region being copied, the data copied may include
2404	   all, some, or none of the modifications.  The client can use standard
2405	   NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory
2406	   byte range locks) to protect against concurrent modifications if the
2407	   client is concerned about this.  If the source file's end of file is
2408	   being modified in parallel with a copy that specifies a count of 0
2409	   (zero) bytes, the amount of data copied is implementation dependent
2410	   (clients may guard against this case by specifying a non-zero count
2411	   value or preventing modification of the source file as mentioned
2412	   above).

2414	   If the source offset or the source offset plus count is greater than
2415	   or equal to the size of the source file, the operation will fail with
2416	   NFS4ERR_INVAL.  The destination offset or destination offset plus
2417	   count may be greater than the size of the destination file.  This
2418	   allows for the client to issue parallel copies to implement
2419	   operations such as "cat file1 file2 file3 file4 > dest".
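
   A literal reading of the source-range rules can be sketched as
   below.  The function name and the use of a string for the error are
   assumptions for illustration; the check deliberately follows the
   "greater than or equal to" wording of the text, and a ca_count of
   zero is treated as "through EOF".

```python
def source_range(ca_src_offset, ca_count, src_size):
    """Return the (start, end) byte range to read, end exclusive,
    or the string "NFS4ERR_INVAL" when the request is out of range."""
    if ca_src_offset >= src_size:
        return "NFS4ERR_INVAL"
    if ca_count == 0:
        # A count of zero requests everything from the offset to EOF.
        return (ca_src_offset, src_size)
    if ca_src_offset + ca_count >= src_size:
        # Literal reading of the rule stated in the text.
        return "NFS4ERR_INVAL"
    return (ca_src_offset, ca_src_offset + ca_count)
```

   No comparable check applies to the destination, since the
   destination offset may lie beyond the destination's current size.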

2421	   If the destination file is created as a result of this command, the
2422	   destination file's size will be equal to the number of bytes
2423	   successfully copied.  If the destination file already existed, the
2424	   destination file's size may increase as a result of this operation
2425	   (e.g., if ca_dst_offset plus ca_count is greater than the
2426	   destination's initial size).

2428	   If the ca_source_server list is specified, then this is an inter-
2429	   server copy operation and the source file is on a remote server.  The
2430	   client is expected to have previously issued a successful COPY_NOTIFY
2431	   request to the remote source server.  The ca_source_server list
2432	   SHOULD be the same as the COPY_NOTIFY response's cnr_source_server
2433	   list.  If the client includes the entries from the COPY_NOTIFY
2434	   response's cnr_source_server list in the ca_source_server list, the
2435	   source server can indicate a specific copy protocol for the
2436	   destination server to use by returning a URL, which specifies both a
2437	   protocol service and server name.  Server-to-server copy protocol
2438	   considerations are described in Section 2.2.3 and Section 2.4.1.

2440	   The ca_flags argument allows the copy operation to be customized in
2441	   the following ways using the guarded flag (COPY4_GUARDED) and the
2442	   metadata flag (COPY4_METADATA).

2444	   If the guarded flag is set and the destination exists on the server,
2445	   this operation will fail with NFS4ERR_EXIST.

2447	   If the guarded flag is not set and the destination exists on the
2448	   server, the behavior is implementation dependent.
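
   As a minimal sketch of the guarded-flag rule, assuming a
   hypothetical helper that is told whether the destination already
   exists:

```python
COPY4_GUARDED = 0x00000001
COPY4_METADATA = 0x00000002


def guarded_check(ca_flags, destination_exists):
    """Return "NFS4ERR_EXIST" when a guarded copy targets an existing
    destination; return None otherwise.

    None does not mean the copy overwrites the destination: without
    the guarded flag the behavior is implementation dependent.
    """
    if destination_exists and (ca_flags & COPY4_GUARDED):
        return "NFS4ERR_EXIST"
    return None
```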

2450	   If the metadata flag is set and the client is requesting a whole file
2451	   copy (i.e., ca_count is 0 (zero)), a subset of the destination file's
2452	   attributes MUST be the same as the source file's corresponding
2453	   attributes and a subset of the destination file's attributes SHOULD
2454	   be the same as the source file's corresponding attributes.  The
2455	   attributes in the MUST and SHOULD copy subsets will be defined for
2456	   each NFS version.

2458	   For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED
2459	   attributes respectively.  A "MUST" in the "Copy to destination file?"
2460	   column indicates that the attribute is part of the MUST copy set.  A
2461	   "SHOULD" in the "Copy to destination file?" column indicates that the
2462	   attribute is part of the SHOULD copy set.

2464	          +--------------------+----+---------------------------+
2465	          | Name               | Id | Copy to destination file? |
2466	          +--------------------+----+---------------------------+
2467	          | supported_attrs    | 0  | no                        |
2468	          | type               | 1  | MUST                      |
2469	          | fh_expire_type     | 2  | no                        |
2470	          | change             | 3  | SHOULD                    |
2471	          | size               | 4  | MUST                      |
2472	          | link_support       | 5  | no                        |
2473	          | symlink_support    | 6  | no                        |
2474	          | named_attr         | 7  | no                        |
2475	          | fsid               | 8  | no                        |
2476	          | unique_handles     | 9  | no                        |
2477	          | lease_time         | 10 | no                        |
2478	          | rdattr_error       | 11 | no                        |
2479	          | filehandle         | 19 | no                        |
2480	          | suppattr_exclcreat | 75 | no                        |
2481	          +--------------------+----+---------------------------+

2483	                                  Table 2

2485	          +--------------------+----+---------------------------+
2486	          | Name               | Id | Copy to destination file? |
2487	          +--------------------+----+---------------------------+
2488	          | acl                | 12 | MUST                      |
2489	          | aclsupport         | 13 | no                        |
2490	          | archive            | 14 | no                        |
2491	          | cansettime         | 15 | no                        |
2492	          | case_insensitive   | 16 | no                        |
2493	          | case_preserving    | 17 | no                        |
2494	          | change_policy      | 60 | no                        |
2495	          | chown_restricted   | 18 | MUST                      |
2496	          | dacl               | 58 | MUST                      |
2497	          | dir_notif_delay    | 56 | no                        |
2498	          | dirent_notif_delay | 57 | no                        |
2499	          | fileid             | 20 | no                        |
2500	          | files_avail        | 21 | no                        |
2501	          | files_free         | 22 | no                        |
2502	          | files_total        | 23 | no                        |
2503	          | fs_charset_cap     | 76 | no                        |
2504	          | fs_layout_type     | 62 | no                        |
2505	          | fs_locations       | 24 | no                        |
2506	          | fs_locations_info  | 67 | no                        |
2507	          | fs_status          | 61 | no                        |
2508	          | hidden             | 25 | MUST                      |
2509	          | homogeneous        | 26 | no                        |
2510	          | layout_alignment   | 66 | no                        |
2511	          | layout_blksize     | 65 | no                        |
2512	          | layout_hint        | 63 | no                        |
2513	          | layout_type        | 64 | no                        |
2514	          | maxfilesize        | 27 | no                        |
2515	          | maxlink            | 28 | no                        |
2516	          | maxname            | 29 | no                        |
2517	          | maxread            | 30 | no                        |
2518	          | maxwrite           | 31 | no                        |
2519	          | mdsthreshold       | 68 | no                        |
2520	          | mimetype           | 32 | MUST                      |
2521	          | mode               | 33 | MUST                      |
2522	          | mode_set_masked    | 74 | no                        |
2523	          | mounted_on_fileid  | 55 | no                        |
2524	          | no_trunc           | 34 | no                        |
2525	          | numlinks           | 35 | no                        |
2526	          | owner              | 36 | MUST                      |
2527	          | owner_group        | 37 | MUST                      |
2528	          | quota_avail_hard   | 38 | no                        |
2529	          | quota_avail_soft   | 39 | no                        |
2530	          | quota_used         | 40 | no                        |
2531	          | rawdev             | 41 | no                        |
2532	          | retentevt_get      | 71 | MUST                      |
2533	          | retentevt_set      | 72 | no                        |
2534	          | retention_get      | 69 | MUST                      |
2535	          | retention_hold     | 73 | MUST                      |
2536	          | retention_set      | 70 | no                        |
2537	          | sacl               | 59 | MUST                      |
2538	          | space_avail        | 42 | no                        |
2539	          | space_free         | 43 | no                        |
2540	          | space_freed        | 78 | no                        |
2541	          | space_reserved     | 77 | MUST                      |
2542	          | space_total        | 44 | no                        |
2543	          | space_used         | 45 | no                        |
2544	          | system             | 46 | MUST                      |
2545	          | time_access        | 47 | MUST                      |
2546	          | time_access_set    | 48 | no                        |
2547	          | time_backup        | 49 | no                        |
2548	          | time_create        | 50 | MUST                      |
2549	          | time_delta         | 51 | no                        |
2550	          | time_metadata      | 52 | SHOULD                    |
2551	          | time_modify        | 53 | MUST                      |
2552	          | time_modify_set    | 54 | no                        |
2553	          +--------------------+----+---------------------------+

2555	                                  Table 3

2557	   [NOTE: The source file's attribute values will take precedence over
2558	   any attribute values inherited by the destination file.]

2560	   In the case of an inter-server copy or an intra-server copy between
2561	   file systems, the attributes supported for the source file and
2562	   destination file could be different.  By definition, the REQUIRED
2563	   attributes will be supported in all cases.  If the metadata flag is
2564	   set and the source file has a RECOMMENDED attribute that is not
2565	   supported for the destination file, the copy MUST fail with
2566	   NFS4ERR_ATTRNOTSUPP.
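
   The attribute-support rule above amounts to a set difference.  In
   the sketch below the function name and its parameters are
   hypothetical; attributes are represented by name only.

```python
def check_metadata_support(metadata_flag_set,
                           source_recommended_attrs,
                           destination_supported_attrs):
    """Return "NFS4ERR_ATTRNOTSUPP" if the metadata flag is set and the
    source file has a RECOMMENDED attribute that the destination does
    not support; return None otherwise."""
    if not metadata_flag_set:
        return None
    missing = set(source_recommended_attrs) - set(destination_supported_attrs)
    return "NFS4ERR_ATTRNOTSUPP" if missing else None
```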

2568	   Any attribute supported by the destination server that is not set on
2569	   the source file SHOULD be left unset.

2571	   Metadata attributes not exposed via the NFS protocol SHOULD be copied
2572	   to the destination file where appropriate.

2574	   The destination file's named attributes are not duplicated from the
2575	   source file.  After the copy process completes, the client MAY
2576	   attempt to duplicate named attributes using standard NFSv4
2577	   operations.  However, the destination file's named attribute
2578	   capabilities MAY be different from the source file's named attribute
2579	   capabilities.

2581	   If the metadata flag is not set and the client is requesting a whole
2582	   file copy (i.e., ca_count is 0 (zero)), the destination file's
2583	   metadata is implementation dependent.

2585	   If the client is requesting a partial file copy (i.e., ca_count is
2586	   not 0 (zero)), the client SHOULD NOT set the metadata flag and the
2587	   server MUST ignore the metadata flag.
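
   One way a server might decide whether the metadata flag takes
   effect is sketched below.  The function name is an assumption; the
   flag value is the COPY4_METADATA constant from the ARGUMENT section.

```python
COPY4_METADATA = 0x00000002


def metadata_copy_in_effect(ca_flags, ca_count):
    """Return True only for a whole-file copy (ca_count == 0) that has
    the metadata flag set; the flag is ignored for partial copies."""
    whole_file = (ca_count == 0)
    return whole_file and bool(ca_flags & COPY4_METADATA)
```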

2589	   If the operation does not result in an immediate failure, the server
2590	   will return NFS4_OK, and the CURRENT_FH will remain the destination's
2591	   filehandle.

2593	   If an immediate failure does occur, cr_bytes_copied will be set to
2594	   the number of bytes copied to the destination file before the error
2595	   occurred.  The cr_bytes_copied value indicates the number of bytes
2596	   copied but not which specific bytes have been copied.

2598	   A return of NFS4_OK indicates that either the operation is complete
2599	   or the operation was initiated and a callback will be used to deliver
2600	   the final status of the operation.

2602	   If the cr_callback_id is returned, this indicates that the operation
2603	   was initiated and a CB_COPY callback will deliver the final results
2604	   of the operation.  The cr_callback_id stateid is termed a copy
2605	   stateid in this context.  The server is given the option of returning
2606	   the results in a callback because the data may require a relatively
2607	   long period of time to copy.

2609	   If no cr_callback_id is returned, the operation completed
2610	   synchronously and no callback will be issued by the server.  The
2611	   completion status of the operation is indicated by cr_status.
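
   A client-side reading of the COPY result might look like the sketch
   below.  The function name, the returned labels, and the modeling of
   the optional cr_callback_id<1> as a sequence of zero or one
   elements are assumptions for illustration.

```python
def classify_copy_result(cr_status, cr_callback_id=()):
    """Classify a COPY reply as "failed", "async", or "sync_complete".

    cr_callback_id models the XDR array cr_callback_id<1>: an empty
    sequence means no copy stateid was returned.
    """
    if cr_status != "NFS4_OK":
        return "failed"
    if len(cr_callback_id) == 1:
        # A copy stateid was returned; CB_COPY delivers the final result.
        return "async"
    # No stateid: the copy completed synchronously, no callback follows.
    return "sync_complete"
```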

2613	   If the copy completes successfully, either synchronously or
2614	   asynchronously, the data copied from the source file to the
2615	   destination file MUST appear identical to the NFS client.  However,
2616	   the NFS server's on disk representation of the data in the source
2617	   file and destination file MAY differ.  For example, the NFS server
2618	   might encrypt, compress, deduplicate, or otherwise represent the on
2619	   disk data in the source and destination file differently.

2621	   In the event of a failure, the state of the destination file is
2622	   implementation dependent.  The COPY operation may fail for the
2623	   following reasons (this is a partial list):

2625	   o  NFS4ERR_MOVED

2627	   o  NFS4ERR_NOTSUPP

2629	   o  NFS4ERR_PARTNER_NOTSUPP

2631	   o  NFS4ERR_OFFLOAD_DENIED

2633	   o  NFS4ERR_PARTNER_NO_AUTH

2635	   o  NFS4ERR_FBIG

2637	   o  NFS4ERR_NOTDIR

2639	   o  NFS4ERR_WRONG_TYPE

2641	   o  NFS4ERR_ISDIR

2643	   o  NFS4ERR_INVAL

2644	   o  NFS4ERR_DELAY

2646	   o  NFS4ERR_METADATA_NOTSUPP

2648	   o  NFS4ERR_WRONGSEC

2650	13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy

2652	13.2.1.  ARGUMENT

2654	   struct COPY_ABORT4args {
2655	           /* CURRENT_FH: destination file */
2656	           stateid4        caa_stateid;
2657	   };

2659	13.2.2.  RESULT

2661	   struct COPY_ABORT4res {
2662	           nfsstat4        car_status;
2663	   };

2665	13.2.3.  DESCRIPTION

2667	   COPY_ABORT is used for both intra- and inter-server asynchronous
2668	   copies.  The COPY_ABORT operation allows the client to cancel a
2669	   server-side copy operation that it initiated.  This operation is sent
2670	   in a COMPOUND request from the client to the destination server.
2671	   This operation may be used to cancel a copy when the application that
2672	   requested the copy exits before the operation is completed or for
2673	   some other reason.

2675	   The request contains the filehandle and copy stateid cookies that act
2676	   as the context for the previously initiated copy operation.

2678	   The result's car_status field indicates whether the cancel was
2679	   successful or not.  A value of NFS4_OK indicates that the copy
2680	   operation was canceled and no callback will be issued by the server.
2681	   A copy operation that is successfully canceled may result in none,
2682	   some, or all of the data copied.

2684	   If the server supports asynchronous copies, the server is REQUIRED to
2685	   support the COPY_ABORT operation.

2687	   The COPY_ABORT operation may fail for the following reasons (this is
2688	   a partial list):

2690	   o  NFS4ERR_NOTSUPP

2692	   o  NFS4ERR_RETRY

2694	   o  NFS4ERR_COMPLETE_ALREADY

2696	   o  NFS4ERR_SERVERFAULT

2698	13.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future
2699	       copy

2701	13.3.1.  ARGUMENT

2703	   struct COPY_NOTIFY4args {
2704	           /* CURRENT_FH: source file */
2705	           netloc4         cna_destination_server;
2706	   };

2708	13.3.2.  RESULT

2710	   struct COPY_NOTIFY4resok {
2711	           nfstime4        cnr_lease_time;
2712	           netloc4         cnr_source_server<>;
2713	   };

2715	   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
2716	           case NFS4_OK:
2717	                   COPY_NOTIFY4resok       resok4;
2718	           default:
2719	                   void;
2720	   };

2722	13.3.3.  DESCRIPTION

2724	   This operation is used for an inter-server copy.  A client sends this
2725	   operation in a COMPOUND request to the source server to authorize a
2726	   destination server identified by cna_destination_server to read the
2727	   file specified by CURRENT_FH on behalf of the given user.

2729	   The cna_destination_server MUST be specified using the netloc4
2730	   network location format.  The server is not required to resolve the
2731	   cna_destination_server address before completing this operation.

2733	   If this operation succeeds, the source server will allow the
2734	   cna_destination_server to copy the specified file on behalf of the
2735	   given user.  If COPY_NOTIFY succeeds, the destination server is
2736	   granted permission to read the file as long as both of the following
2737	   conditions are met:

2739	   o  The destination server begins reading the source file before the
2740	      cnr_lease_time expires.  If the cnr_lease_time expires while the
2741	      destination server is still reading the source file, the
2742	      destination server is allowed to finish reading the file.

2744	   o  The client has not issued a COPY_REVOKE for the same combination
2745	      of user, filehandle, and destination server.

2747	   The cnr_lease_time is chosen by the source server.  A cnr_lease_time
2748	   of 0 (zero) indicates an infinite lease.  To renew the copy lease
2749	   time the client should resend the same copy notification request to
2750	   the source server.
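
   The two conditions and the zero-means-infinite rule can be combined
   in a small check.  The sketch below is hypothetical: it reduces
   cnr_lease_time to a number of seconds and assumes the source server
   tracks when the lease was granted and whether a COPY_REVOKE has
   been received.

```python
def may_begin_reading(now, granted_at, lease_seconds, revoked):
    """Return True if the destination server may begin reading.

    This governs only the start of a read; per the text, a read that
    began before the lease expired may run to completion.
    """
    if revoked:
        return False
    if lease_seconds == 0:
        # A cnr_lease_time of zero indicates an infinite lease.
        return True
    return now < granted_at + lease_seconds
```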

2752	   To avoid the need for synchronized clocks, copy lease times are
2753	   granted by the server as a time delta.  However, there is a
2754	   requirement that the client and server clocks do not drift
2755	   excessively over the duration of the lease.  There is also the issue
2756	   of propagation delay across the network which could easily be several
2757	   hundred milliseconds as well as the possibility that requests will be
2758	   lost and need to be retransmitted.

2760	   To take propagation delay into account, the client should subtract it
2761	   from copy lease times (e.g., if the client estimates the one-way
2762	   propagation delay as 200 milliseconds, then it can assume that the
2763	   lease is already 200 milliseconds old when it gets it).  In addition,
2764	   it will take another 200 milliseconds to get a response back to the
2765	   server.  So the client must send a lease renewal or send the copy
2766	   offload request to the cna_destination_server at least 400
2767	   milliseconds before the copy lease would expire.  If the propagation
2768	   delay varies over the life of the lease (e.g., the client is on a
2769	   mobile host), the client will need to continuously subtract the
2770	   increase in propagation delay from the copy lease times.
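
   The arithmetic in the example above can be written out as follows.
   The function name is an assumption, and all times are integer
   milliseconds on the client's own clock.

```python
def latest_send_time_ms(received_at_ms, lease_ms, one_way_delay_ms):
    """Return the latest client time at which a renewal (or the copy
    offload request) should be sent.

    The lease is already one_way_delay_ms old on arrival, and the next
    request needs another one_way_delay_ms to reach its target.
    """
    return received_at_ms + lease_ms - 2 * one_way_delay_ms
```

   With a 90-second lease received at time 0 and an estimated one-way
   delay of 200 milliseconds, the result is 89600, which is 400
   milliseconds before the nominal expiry.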

2772	   The server's copy lease period configuration should take into account
2773	   the network distance of the clients that will be accessing the
2774	   server's resources.  It is expected that the lease period will take
2775	   into account the network propagation delays and other network delay
2776	   factors for the client population.  Since the protocol does not allow
2777	   for an automatic method to determine an appropriate copy lease
2778	   period, the server's administrator may have to tune the copy lease
2779	   period.

2781	   A successful response will also contain a list of names, addresses,
2782	   and URLs called cnr_source_server, on which the source is willing to
2783	   accept connections from the destination.  These might not be
2784	   reachable from the client and might be located on networks to which
2785	   the client has no connection.

2787	   If the client wishes to perform an inter-server copy, the client MUST
2788	   send a COPY_NOTIFY to the source server.  Therefore, the source
2789	   server MUST support COPY_NOTIFY.

2791	   For a copy only involving one server (the source and destination are
2792	   on the same server), this operation is unnecessary.

2794	   The COPY_NOTIFY operation may fail for the following reasons (this is
2795	   a partial list):

2797	   o  NFS4ERR_MOVED

2799	   o  NFS4ERR_NOTSUPP

2801	   o  NFS4ERR_WRONGSEC

2803	13.4.  Operation 62: COPY_REVOKE - Revoke a destination server's copy
2804	       privileges

2806	13.4.1.  ARGUMENT

2808	   struct COPY_REVOKE4args {
2809	           /* CURRENT_FH: source file */
2810	           netloc4         cra_destination_server;
2811	   };

2813	13.4.2.  RESULT

2815	   struct COPY_REVOKE4res {
2816	           nfsstat4        crr_status;
2817	   };

2819	13.4.3.  DESCRIPTION

2821	   This operation is used for an inter-server copy.  A client sends this
2822	   operation in a COMPOUND request to the source server to revoke the
2823	   authorization of a destination server identified by
2824	   cra_destination_server from reading the file specified by CURRENT_FH
2825	   on behalf of the given user.  If the cra_destination_server has already
2826	   begun copying the file, a successful return from this operation
2827	   indicates that further access will be prevented.

2829	   The cra_destination_server MUST be specified using the netloc4
2830	   network location format.  The server is not required to resolve the
2831	   cra_destination_server address before completing this operation.

2833	   The COPY_REVOKE operation is useful in situations in which the source
2834	   server granted a very long or infinite lease on the destination
2835	   server's ability to read the source file and all copy operations on
2836	   the source file have been completed.

2838	   For a copy only involving one server (the source and destination are
2839	   on the same server), this operation is unnecessary.

2841	   If the server supports COPY_NOTIFY, the server is REQUIRED to support
2842	   the COPY_REVOKE operation.

2844	   The COPY_REVOKE operation may fail for the following reasons (this is
2845	   a partial list):

2847	   o  NFS4ERR_MOVED

2849	   o  NFS4ERR_NOTSUPP

2851	13.5.  Operation 63: COPY_STATUS - Poll for status of a server-side copy

2853	13.5.1.  ARGUMENT

2855	   struct COPY_STATUS4args {
2856	           /* CURRENT_FH: destination file */
2857	           stateid4        csa_stateid;
2858	   };

2860	13.5.2.  RESULT

2862	   struct COPY_STATUS4resok {
2863	           length4         csr_bytes_copied;
2864	           nfsstat4        csr_complete<1>;
2865	   };

2867	   union COPY_STATUS4res switch (nfsstat4 csr_status) {
2868	           case NFS4_OK:
2869	                   COPY_STATUS4resok       resok4;
2870	           default:
2871	                   void;
2872	   };

2874	13.5.3.  DESCRIPTION

2876	   COPY_STATUS is used for both intra- and inter-server asynchronous
2877	   copies.  The COPY_STATUS operation allows the client to poll the
2878	   server to determine the status of an asynchronous copy operation.
2879	   This operation is sent by the client to the destination server.

2881	   If this operation is successful, the number of bytes copied is
2882	   returned to the client in the csr_bytes_copied field.  The
2883	   csr_bytes_copied value indicates the number of bytes copied but not
2884	   which specific bytes have been copied.

2886	   If the optional csr_complete field is present, the copy has
2887	   completed.  In this case the status value indicates the result of the
2888	   asynchronous copy operation.  In all cases, the server will also
2889	   deliver the final results of the asynchronous copy in a CB_COPY
2890	   operation.

2892	   The failure of this operation does not indicate the result of the
2893	   asynchronous copy in any way.
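
   A client might interpret a COPY_STATUS reply along the lines of the
   sketch below.  The function name and the returned tuple are
   assumptions; csr_complete<1> is modeled as a sequence of zero or
   one status values.

```python
def interpret_copy_status(csr_status, csr_bytes_copied=0, csr_complete=()):
    """Return (state, bytes_copied, final_status).

    state is "poll_failed", "in_progress", or "finished".  A failed
    poll carries no information about the copy itself.
    """
    if csr_status != "NFS4_OK":
        return ("poll_failed", None, None)
    if len(csr_complete) == 1:
        # The optional field is present: the copy has completed and the
        # value is the result of the asynchronous copy.
        return ("finished", csr_bytes_copied, csr_complete[0])
    return ("in_progress", csr_bytes_copied, None)
```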

2895	   If the server supports asynchronous copies, the server is REQUIRED to
2896	   support the COPY_STATUS operation.

2898	   The COPY_STATUS operation may fail for the following reasons (this is
2899	   a partial list):

2901	   o  NFS4ERR_NOTSUPP

2903	   o  NFS4ERR_BAD_STATEID

2905	   o  NFS4ERR_EXPIRED

2907	13.6.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

2909	13.6.1.  ARGUMENT

2911	      /* new */
2912	      const EXCHGID4_FLAG_SUPP_FENCE_OPS      = 0x00000004;

2914	13.6.2.  RESULT

2916	      Unchanged

2918	13.6.3.  MOTIVATION

2920	   Enterprise applications require guarantees that an operation has
2921	   either aborted or completed.  NFSv4.1 provides this guarantee as long
2922	   as the session is alive: simply send a SEQUENCE operation on the same
2923	   slot with a new sequence number, and the successful return of
2924	   SEQUENCE indicates the previous operation has completed.  However, if
2925	   the session is lost, there is no way to know when any in progress
2926	   operations have aborted or completed.  In hindsight, the NFSv4.1
2927	   specification should have mandated that DESTROY_SESSION abort/
2928	   complete all outstanding operations.

2930	13.6.4.  DESCRIPTION

2932	   A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
2933	   when it sends an EXCHANGE_ID operation.  The server SHOULD set this
2934	   capability in the EXCHANGE_ID reply whether the client requests it or
2935	   not.  If the client ID is created with this capability then the
2936	   following will occur:

2938	   o  The server will not reply to DESTROY_SESSION until all operations
2939	      in progress are completed or aborted.

2941	   o  The server will not reply to subsequent EXCHANGE_ID invoked on the
2942	      same Client Owner with a new verifier until all operations in
2943	      progress on the Client ID's session are completed or aborted.

2945	   o  When DESTROY_CLIENTID is invoked, any sessions (both idle and
2946	      non-idle), opens, locks, delegations, layouts, and/or wants
2947	      (Section 18.49) associated with the client ID are removed.
2948	      Pending operations will be completed or aborted before the
2949	      sessions, opens, locks, delegations, layouts, and/or wants are
2950	      deleted.

2952	   o  The NFS server SHOULD support client ID trunking, and if it does
2953	      and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
2954	      session ID created on one node of the storage cluster MUST be
2955	      destroyable via DESTROY_SESSION.  In addition, DESTROY_CLIENTID
2956	      and an EXCHANGE_ID with a new verifier affect all sessions
2957	      regardless of what node the sessions were created on.
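
	   The draining behavior in the bullets above can be modeled as
	   follows.  This is only a sketch under stated assumptions: the
	   FencedSession class and its method names are hypothetical, and
	   the timeout exists only so that the example can observe a
	   reply being withheld.

```python
import threading


class FencedSession:
    """Hypothetical server-side session that counts in-progress operations."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_progress = 0
        self.closing = False

    def begin_op(self):
        # Refuse new work once DESTROY_SESSION has been received.
        with self._cond:
            if self.closing:
                raise RuntimeError("session is being destroyed")
            self._in_progress += 1

    def end_op(self):
        # Called when an operation has either completed or aborted.
        with self._cond:
            self._in_progress -= 1
            self._cond.notify_all()

    def destroy(self, timeout=None):
        # Returns True once the reply to DESTROY_SESSION may be sent,
        # i.e., no operation is still in progress; False on timeout.
        with self._cond:
            self.closing = True
            return self._cond.wait_for(
                lambda: self._in_progress == 0, timeout)


session = FencedSession()
session.begin_op()
reply_allowed_early = session.destroy(timeout=0.01)  # one op still running
session.end_op()
reply_allowed_late = session.destroy(timeout=0.01)   # all ops drained
```

	   A real server would not use a timeout here; it would simply
	   hold the reply until every operation has completed or aborted.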

2959	13.7.  Operation 64: INITIALIZE

2961	   This operation can be used to initialize the structure imposed by an
2962	   application onto a file, i.e., ADBs, and to punch a hole into a file.

2964	13.7.1.  ARGUMENT

2966	   /*
2967	    * We use data_content4 in case we wish to
2968	    * extend new types later. Note that we
2969	    * are explicitly disallowing data.
2970	    */
2971	   union initialize_arg4 switch (data_content4 content) {
2972	   case NFS4_CONTENT_APP_BLOCK:
2973	           app_data_block4 ia_adb;
2974	   case NFS4_CONTENT_HOLE:
2975	           data_info4      ia_hole;
2976	   default:
2977	           void;
2978	   };

2980	   struct INITIALIZE4args {
2981	           /* CURRENT_FH: file */
2982	           stateid4        ia_stateid;
2983	           stable_how4     ia_stable;
2984	           initialize_arg4 ia_data<>;
2985	   };

2987	13.7.2.  RESULT

2989	   struct INITIALIZE4resok {
2990	           count4          ir_count;
2991	           stable_how4     ir_committed;
2992	           verifier4       ir_writeverf;
2993	           data_content4   ir_sparse;
2994	   };

2996	   union INITIALIZE4res switch (nfsstat4 status) {
2997	   case NFS4_OK:
2998	           INITIALIZE4resok        resok4;
2999	   default:
3000	           void;
3001	   };

3003	13.7.3.  DESCRIPTION
3004	13.7.3.1.  Hole punching

3006	   Whenever a client wishes to zero the blocks backing a particular
3007	   region in the file, it calls the INITIALIZE operation with the
3008	   current filehandle set to the filehandle of the file in question, and
3009	   the equivalent of start offset and length in bytes of the region set
3010	   in ia_hole.di_offset and ia_hole.di_length respectively.  If the
3011	   ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed
3012	   and if it is set to FALSE, then they will be deallocated.  All
3013	   further reads to this region MUST return zeros until overwritten.
3014	   The filehandle specified must be that of a regular file.

3016	   Situations may arise where di_offset and/or di_offset + di_length
3017	   will not be aligned to a boundary at which the server performs
3018	   allocations or deallocations.  For most filesystems, this is the
3019	   block size of the file system.  In such a case, the server can
3020	   deallocate as many bytes as it can in the region.  The blocks that
3021	   cannot be deallocated MUST be zeroed.  Except for the block
3022	   deallocation and maximum hole punching capability, an INITIALIZE
3023	   operation is to be treated similarly to a write of zeroes.
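
	   The split between deallocated and zeroed bytes for a
	   misaligned request can be sketched as below.  The function
	   name and the block_size parameter are illustrative
	   assumptions; a server is free to use any allocation unit.

```python
def split_hole(offset, length, block_size):
    """Return (zero_ranges, dealloc_range) for a hole punch request.

    Ranges are half-open (start, end) byte intervals.  Only whole
    blocks inside [offset, offset + length) can be deallocated; the
    partial blocks at either end are zeroed instead.
    """
    end = offset + length
    first = -(-offset // block_size) * block_size  # round offset up
    last = (end // block_size) * block_size        # round end down
    if first >= last:
        # No whole block is covered, so the entire range is zeroed.
        return ([(offset, end)] if length else [], None)
    zero = []
    if offset < first:
        zero.append((offset, first))
    if last < end:
        zero.append((last, end))
    return (zero, (first, last))


zeroed, deallocated = split_hole(1000, 10000, 4096)
```

	   With a 4096-byte block, the request above covers bytes 1000
	   through 10999; only the single block 4096 to 8191 can be
	   deallocated, and the two partial ends are zeroed.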

3025	   The server is not required to complete deallocating the blocks
3026	   specified in the operation before returning.  It is acceptable to
3027	   have the deallocation be deferred.  In fact, INITIALIZE is merely a
3028	   hint; it is valid for a server to return success without ever doing
3029	   anything towards deallocating the blocks backing the region
3030	   specified.  However, any future reads to the region MUST return
3031	   zeroes.

3033	   If used to hole punch, INITIALIZE will result in the space_used
3034	   attribute being decreased by the number of bytes that were
3035	   deallocated.  The space_freed attribute may or may not decrease,
3036	   depending on the support and whether the blocks backing the specified
3037	   range were shared or not.  The size attribute will remain unchanged.

3039	   The INITIALIZE operation MUST NOT change the space reservation
3040	   guarantee of the file.  While the server can deallocate the blocks
3041	   specified by di_offset and di_length, future writes to this region
3042	   MUST NOT fail with NFS4ERR_NOSPC.

3044	   The INITIALIZE operation may fail for the following reasons (this is
3045	   a partial list):

3047	   NFS4ERR_NOTSUPP  Hole punch operations are not supported by the
3048	      NFS server receiving this request.

3050	   NFS4ERR_DIR  The current filehandle is of type NF4DIR.

3052	   NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

3054	   NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
3055	      ordinary file.

3057	13.7.3.2.  ADBs

3059	   If the server supports ADBs, then it MUST support the
3060	   NFS4_CONTENT_APP_BLOCK arm of the INITIALIZE operation.  The server
3061	   has no concept of the structure imposed by the application.  Only
3062	   when the application writes to a section of the file does order
3063	   get imposed.  In order to detect corruption even before the
3064	   application utilizes the file, the application will want to
3065	   initialize a range of ADBs using INITIALIZE.

3067	   For ADBs, when the client invokes the INITIALIZE operation, it has
3068	   two desired results:

3070	   1.  The structure described by the app_data_block4 be imposed on the
3071	       file.

3073	   2.  The contents described by the app_data_block4 be sparse.

3075	   If the server supports the INITIALIZE operation, it still might not
3076	   support sparse files.  So if it receives the INITIALIZE operation,
3077	   then it MUST populate the contents of the file with the initialized
3078	   ADBs.

3080	   If the data was already initialized, there are two interesting
3081	   scenarios:

3083	   1.  The data blocks are allocated.

3085	   2.  Initializing in the middle of an existing ADB.

3087	   If the data blocks were already allocated, then the INITIALIZE is a
3088	   hole punch operation.  If the server supports sparse files, then the
3089	   data blocks are to be deallocated.  If not, then the data blocks are
3090	   to be rewritten in the indicated ADB format.

3092	   Since the server has no knowledge of ADBs, it should not report
3093	   misaligned creation of ADBs.  Even if it can detect them, it
3094	   cannot disallow them, as the application might be in the process of
3095	   changing the size of the ADBs.  Thus the server must be prepared to
3096	   handle an INITIALIZE into an existing ADB.

3098	   This document does not mandate the manner in which the server stores
3099	   ADBs sparsely for a file.  It does assume that if ADBs are stored
3100	   sparsely, then the server can detect when an INITIALIZE arrives that
3101	   will force a new ADB to start inside an existing ADB.  For example,
3102	   assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
3103	   starts 1k inside ADBi.  The server should [[Comment.5: Need to flesh
3104	   this out. --TH]]

3106	13.8.  Operation 67: IO_ADVISE - Application I/O access pattern hints

3108	   This section introduces a new operation, named IO_ADVISE, which
3109	   allows NFS clients to communicate application I/O access pattern
3110	   hints to the NFS server.  This new operation will allow hints to be
3111	   sent to the server when applications use posix_fadvise, direct I/O,
3112	   or at any other point the client finds useful.

3114	13.8.1.  ARGUMENT

3116	   enum IO_ADVISE_type4 {
3117	           IO_ADVISE4_NORMAL                       = 0,
3118	           IO_ADVISE4_SEQUENTIAL                   = 1,
3119	           IO_ADVISE4_SEQUENTIAL_BACKWARDS         = 2,
3120	           IO_ADVISE4_RANDOM                       = 3,
3121	           IO_ADVISE4_WILLNEED                     = 4,
3122	           IO_ADVISE4_WILLNEED_OPPORTUNISTIC       = 5,
3123	           IO_ADVISE4_DONTNEED                     = 6,
3124	           IO_ADVISE4_NOREUSE                      = 7,
3125	           IO_ADVISE4_READ                         = 8,
3126	           IO_ADVISE4_WRITE                        = 9,
3127	           IO_ADVISE4_INIT_PROXIMITY               = 10
3128	   };

3130	   struct IO_ADVISE4args {
3131	           /* CURRENT_FH: file */
3132	           stateid4        iar_stateid;
3133	           offset4         iar_offset;
3134	           length4         iar_count;
3135	           bitmap4         iar_hints;
3136	   };

3138	13.8.2.  RESULT

3140	   struct IO_ADVISE4resok {
3141	           bitmap4 ior_hints;
3142	   };

3144	   union IO_ADVISE4res switch (nfsstat4 status) {
3145	   case NFS4_OK:
3146	           IO_ADVISE4resok resok4;
3147	   default:
3148	           void;
3149	   };

3151	13.8.3.  DESCRIPTION

3153	   The IO_ADVISE operation sends an I/O access pattern hint to the
3154	   server for the owner of the stateid, covering the byte range
3155	   specified by iar_offset and iar_count.  That byte range need not
3156	   currently exist in the file, but the iar_hints will apply to the
3157	   byte range when it does exist.  If iar_count is 0, all data
3158	   following iar_offset is specified.  The server MAY ignore the
3159	   advice.
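
	   As an illustration of how the arguments fit together, the
	   sketch below packs IO_ADVISE_type4 values into a bitmap4 and
	   interprets the byte range.  It assumes that each hint number
	   is a bit position and that bitmap4 uses the 32-bit word
	   layout of [2]; the helper names are hypothetical.

```python
from enum import IntEnum


class IoAdviseType4(IntEnum):
    IO_ADVISE4_NORMAL = 0
    IO_ADVISE4_SEQUENTIAL = 1
    IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2
    IO_ADVISE4_RANDOM = 3
    IO_ADVISE4_WILLNEED = 4
    IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5
    IO_ADVISE4_DONTNEED = 6
    IO_ADVISE4_NOREUSE = 7
    IO_ADVISE4_READ = 8
    IO_ADVISE4_WRITE = 9
    IO_ADVISE4_INIT_PROXIMITY = 10


def hints_to_bitmap4(hints):
    """Encode hint numbers as a list of 32-bit words (assumed layout)."""
    words = []
    for hint in hints:
        word, bit = divmod(int(hint), 32)
        words.extend([0] * (word + 1 - len(words)))
        words[word] |= 1 << bit
    return words


def bitmap4_to_hints(words):
    """Decode a bitmap4, silently skipping bits that name no known hint."""
    known = {int(h) for h in IoAdviseType4}
    found = []
    for index, word in enumerate(words):
        for bit in range(32):
            number = index * 32 + bit
            if word & (1 << bit) and number in known:
                found.append(IoAdviseType4(number))
    return found


def advised_range(iar_offset, iar_count):
    """Return (start, end); end is None when iar_count is 0 (to EOF)."""
    if iar_count == 0:
        return (iar_offset, None)
    return (iar_offset, iar_offset + iar_count)


iar_hints = hints_to_bitmap4([IoAdviseType4.IO_ADVISE4_SEQUENTIAL,
                              IoAdviseType4.IO_ADVISE4_READ])
```

	   Skipping unknown bits on decode mirrors the rule that
	   unrecognized advice is not an error.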

3161	   The following are the possible hints:

3163	   IO_ADVISE4_NORMAL  Specifies that the application has no advice to
3164	      give on its behavior with respect to the specified data.  It is
3165	      the default characteristic if no advice is given.

3167	   IO_ADVISE4_SEQUENTIAL  Specifies that the stateid holder expects to
3168	      access the specified data sequentially from lower offsets to
3169	      higher offsets.

3171	   IO_ADVISE4_SEQUENTIAL_BACKWARDS  Specifies that the stateid holder
3172	      expects to access the specified data sequentially from higher
3173	      offsets to lower offsets.

3175	   IO_ADVISE4_RANDOM  Specifies that the stateid holder expects to
3176	      access the specified data in a random order.

3178	   IO_ADVISE4_WILLNEED  Specifies that the stateid holder expects to
3179	      access the specified data in the near future.

3181	   IO_ADVISE4_WILLNEED_OPPORTUNISTIC  Specifies that the stateid holder
3182	      expects to possibly access the data in the near future.  This is a
3183	      speculative hint, and therefore the server should prefetch data or
3184	      indirect blocks only if it can be done at a marginal cost.

3186	   IO_ADVISE4_DONTNEED  Specifies that the stateid holder expects that
3187	      it will not access the specified data in the near future.

3189	   IO_ADVISE4_NOREUSE  Specifies that the stateid holder expects to
3190	      access the specified data once and then not reuse it thereafter.

3192	   IO_ADVISE4_READ  Specifies that the stateid holder expects to read
3193	      the specified data in the near future.

3195	   IO_ADVISE4_WRITE  Specifies that the stateid holder expects to write
3196	      the specified data in the near future.

3198	   IO_ADVISE4_INIT_PROXIMITY  The client has recently accessed the byte
3199	      range in its own cache.  This informs the server that the data in
3200	      the byte range remains important to the client.  When the server
3201	      reaches resource exhaustion, knowing which data is more important
3202	      allows the server to make better choices about which data to, for
3203	      example, purge from a cache or move to secondary storage.  It also
3204	      informs the server which delegations are more important, since if
3205	      delegations are working correctly, once delegated to a client, a
3206	      server might never receive another I/O request for the file.

3208	   The server will return success if the operation is properly formed,
3209	   otherwise the server will return an error.  The server MUST NOT
3210	   return an error if it does not recognize or does not support the
3211	   requested advice.  This is also true even if the client sends
3212	   contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and
3213	   IO_ADVISE4_RANDOM in a single IO_ADVISE operation.  In this case, the
3214	   server MUST return success and an ior_hints value that indicates the
3215	   hint it intends to optimize.  For contradictory hints, this may mean
3216	   simply returning IO_ADVISE4_NORMAL.
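
	   One possible server policy for computing ior_hints is
	   sketched below.  It is an assumption for illustration, not a
	   required algorithm: unknown hints are dropped without error,
	   and conflicting access-pattern hints collapse to
	   IO_ADVISE4_NORMAL.  The set names are hypothetical.

```python
# Hint numbers taken from IO_ADVISE_type4.
IO_ADVISE4_NORMAL = 0
IO_ADVISE4_SEQUENTIAL = 1
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2
IO_ADVISE4_RANDOM = 3

# Hypothetical: this server recognizes every hint defined in the enum.
SUPPORTED = frozenset(range(11))

# Access-pattern hints that contradict one another when combined.
PATTERNS = frozenset({IO_ADVISE4_SEQUENTIAL,
                      IO_ADVISE4_SEQUENTIAL_BACKWARDS,
                      IO_ADVISE4_RANDOM})


def resolve_hints(requested):
    """Return the sorted hint numbers to report back in ior_hints."""
    accepted = {h for h in requested if h in SUPPORTED}
    if len(accepted & PATTERNS) > 1:
        # Contradictory patterns: fall back to the default behavior.
        accepted -= PATTERNS
        accepted.add(IO_ADVISE4_NORMAL)
    return sorted(accepted)


contradictory = resolve_hints({IO_ADVISE4_SEQUENTIAL, IO_ADVISE4_RANDOM})
```

	   The function never raises for unrecognized input, matching
	   the rule that unsupported advice is not an error.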

3218	   The ior_hints returned by the server is primarily for debugging
3219	   purposes since the server is under no obligation to carry out the
3220	   hints that it describes in the ior_hints result.  In addition, while
3221	   the server may have intended to implement the hints returned in
3222	   ior_hints, as time progresses, the server may need to change its
3223	   handling of a given file due to several reasons including, but not
3224	   limited to, memory pressure, additional IO_ADVISE hints sent by other
3225	   clients, and heuristically detected file access patterns.

3227	   The server MAY return different advice than what the client
3228	   requested.  If it does, then this might be due to one of several
3229	   conditions, including, but not limited to, another client advising
3230	   of a different I/O access pattern; a different I/O access pattern
3231	   from another client that the server has heuristically detected; or
3232	   the server is not able to support the requested I/O access pattern,
3233	   perhaps due to a temporary resource limitation.

3235	   Each issuance of the IO_ADVISE operation overrides all previous
3236	   issuances of IO_ADVISE for a given byte range.  This effectively
3237	   follows a strategy of last hint wins for a given stateid and byte
3238	   range.

3240	   Clients should assume that hints included in an IO_ADVISE operation
3241	   will be forgotten once the file is closed.

3243	13.8.4.  IMPLEMENTATION

3245	   The NFS client may choose to issue an IO_ADVISE operation to the
3246	   server in several different instances.

3248	   The most obvious is in direct response to an application's execution
3249	   of posix_fadvise.  In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
3250	   may be set based upon the type of file access specified when the file
3251	   was opened.

3253	   Another useful point would be when an application indicates it is
3254	   using direct I/O.  Direct I/O may be specified at file open, in
3255	   which case an IO_ADVISE may be included in the same compound as the
3256	   OPEN operation with the IO_ADVISE4_NOREUSE flag set.  Direct I/O may
3257	   also be specified separately, in which case an IO_ADVISE operation
3258	   can be sent to the server separately.  As above, IO_ADVISE4_WRITE
3259	   and IO_ADVISE4_READ may be set based upon the type of file access
3260	   specified when the file was opened.

3262	13.8.5.  pNFS File Layout Data Type Considerations

3264	   The IO_ADVISE considerations for pNFS are very similar to the COMMIT
3265	   considerations for pNFS.  That is, as with COMMIT, some NFS server
3266	   implementations prefer IO_ADVISE be done on the DS, and some prefer
3267	   it be done on the MDS.

3269	   So for the file's layout type, it is proposed that NFSv4.2 include an
3270	   additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS which is valid only on
3271	   NFSv4.2 or higher.  Any file's layout obtained with NFSv4.1 MUST NOT
3272	   have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  Any file's layout obtained
3273	   with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  If the
3274	   client does not implement IO_ADVISE, then it MUST ignore
3275	   NFL42_UFLG_IO_ADVISE_THRU_MDS.

3277	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is set and the client implements
3278	   IO_ADVISE, then a client that wants the DS to honor IO_ADVISE MUST
3279	   send the operation to the MDS, and the server will communicate the
3280	   advice back to each DS.  If the client sends IO_ADVISE to the DS,
3281	   then the server MAY return NFS4ERR_NOTSUPP.

3283	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then this indicates to
3284	   the client that if it wants to inform the server via IO_ADVISE of
3285	   the client's intended use of the file, then the client SHOULD send
3286	   an IO_ADVISE to each DS.  While the client MAY always send IO_ADVISE
3287	   to the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS,
3288	   the client should expect that such an IO_ADVISE is futile.  Note
3289	   that a client SHOULD use the same set of arguments on each IO_ADVISE
3290	   sent to a DS for the same open file reference.

3292	   The server is not required to support different advice for different
3293	   DS's with the same open file reference.
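
	   The client-side routing decision described in this section
	   can be sketched as follows.  The function and the string
	   labels for the targets are hypothetical; the flag is passed
	   as a boolean because its numeric value is not given here.

```python
def plan_io_advise(thru_mds, data_servers, args):
    """Return a list of (target, args) pairs for one open file.

    thru_mds is True when NFL42_UFLG_IO_ADVISE_THRU_MDS is set in
    the layout.  The same args object is reused for every DS, since
    a client SHOULD send identical arguments to each DS.
    """
    if thru_mds:
        # The MDS relays the advice to each DS on the client's behalf.
        return [("MDS", args)]
    # Otherwise advice sent to the MDS is expected to be futile.
    return [(ds, args) for ds in data_servers]


args = {"iar_offset": 0, "iar_count": 0, "iar_hints": [2]}
via_mds = plan_io_advise(True, ["ds1", "ds2"], args)
via_ds = plan_io_advise(False, ["ds1", "ds2"], args)
```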

3295	13.8.5.1.  Dense and Sparse Packing Considerations

3297	   The IO_ADVISE operation MUST use the iar_offset and byte range as
3298	   dictated by the presence or absence of NFL4_UFLG_DENSE.

3300	   E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
3301	   for iar_offset 0 really means iar_offset 10000 in the logical file,
3302	   then an IO_ADVISE for iar_offset 0 means iar_offset 10000.

3304	   E.g., if NFL4_UFLG_DENSE is absent, and a READ or WRITE to the DS
3305	   for iar_offset 0 really means iar_offset 0 in the logical file, then
3306	   an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file.

3308	   E.g., suppose NFL4_UFLG_DENSE is present, the stripe unit is 1000
3309	   bytes, the stripe count is 10, and the dense DS file is serving
3310	   iar_offset 0.  A READ or WRITE to the DS for iar_offsets 0, 1000,
3311	   2000, and 3000 really means iar_offsets 10000, 20000, 30000, and
3312	   40000 (implying a stripe count of 10 and a stripe unit of 1000).
3313	   Then an IO_ADVISE sent to the same DS with an iar_offset of 500 and
3314	   an iar_count of 3000 means that the IO_ADVISE applies to these byte
3315	   ranges of the dense DS file:

3317	     - 500 to 999
3318	     - 1000 to 1999
3319	     - 2000 to 2999
3320	     - 3000 to 3499

3322	   I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.

3324	   It also applies to these byte ranges of the logical file:

3326	     - 10500 to 10999 (500 bytes)
3327	     - 20000 to 20999 (1000 bytes)
3328	     - 30000 to 30999 (1000 bytes)
3329	     - 40000 to 40499 (500 bytes)
3330	     (total            3000 bytes)
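
	   The logical-file ranges listed above can be reproduced with
	   the sketch below.  The function name is hypothetical, and
	   the first_logical parameter is an assumption taken from the
	   example's statement that DS offset 0 corresponds to logical
	   offset 10000.

```python
def dense_to_logical(ds_offset, count, stripe_unit, stripe_count,
                     first_logical):
    """Map a byte range of a dense DS file to logical-file ranges.

    Returns a list of (start, end) pairs with inclusive ends, one
    per stripe unit touched by [ds_offset, ds_offset + count).
    """
    ranges = []
    pos, end = ds_offset, ds_offset + count
    while pos < end:
        unit_index, within = divmod(pos, stripe_unit)
        chunk = min(stripe_unit - within, end - pos)
        # Consecutive units of this DS are one full stripe apart.
        start = (first_logical
                 + unit_index * stripe_unit * stripe_count
                 + within)
        ranges.append((start, start + chunk - 1))
        pos += chunk
    return ranges


ranges = dense_to_logical(500, 3000, 1000, 10, 10000)
total = sum(end - start + 1 for start, end in ranges)
```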

3332	   E.g., suppose NFL4_UFLG_DENSE is absent, the stripe unit is 250
3333	   bytes, the stripe count is 4, and the sparse DS file is serving
3334	   iar_offset 0.  Then a READ or WRITE to the DS for iar_offsets 0,
3335	   1000, 2000, and 3000 really means iar_offsets 0, 1000, 2000, and 3000
3336	   in the logical file, keeping in mind that on the DS file, byte ranges
3337	   250 to 999, 1250 to 1999, 2250 to 2999, and 3250 to 3999 are not
3338	   accessible.  Then an IO_ADVISE sent to the same DS with an iar_offset
3339	   of 500 and an iar_count of 3000 means that the IO_ADVISE applies to
3340	   these byte ranges of the logical file and the sparse DS file:

3342	     - 500 to 999 (500 bytes)   - no effect
3343	     - 1000 to 1249 (250 bytes) - effective
3344	     - 1250 to 1999 (750 bytes) - no effect
3345	     - 2000 to 2249 (250 bytes) - effective
3346	     - 2250 to 2999 (750 bytes) - no effect
3347	     - 3000 to 3249 (250 bytes) - effective
3348	     - 3250 to 3499 (250 bytes) - no effect
3349	     (subtotal      2250 bytes) - no effect
3350	     (subtotal       750 bytes) - effective
3351	     (grand total   3000 bytes) - no effect + effective
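
	   The effective and no-effect ranges in the table above can be
	   derived as sketched below.  The function name and the
	   ds_index parameter are hypothetical; ds_index 0 stands for
	   the DS that serves iar_offset 0.

```python
def sparse_effect(offset, count, stripe_unit, stripe_count, ds_index=0):
    """Split [offset, offset + count) of a sparse DS file into pieces.

    Each piece is (start, end, served) with an inclusive end; served
    is True when the stripe unit belongs to the DS at ds_index.
    Adjacent pieces with the same served value are merged.
    """
    pieces = []
    pos, end = offset, offset + count
    while pos < end:
        unit_index, within = divmod(pos, stripe_unit)
        served = unit_index % stripe_count == ds_index
        chunk = min(stripe_unit - within, end - pos)
        if pieces and pieces[-1][2] == served:
            pieces[-1] = (pieces[-1][0], pos + chunk - 1, served)
        else:
            pieces.append((pos, pos + chunk - 1, served))
        pos += chunk
    return pieces


pieces = sparse_effect(500, 3000, 250, 4)
effective = sum(e - s + 1 for s, e, served in pieces if served)
no_effect = sum(e - s + 1 for s, e, served in pieces if not served)
```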

3353	   If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
3354	   NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
3355	   sent to the data server with a byte range that overlaps a stripe unit
3356	   that the data server does not serve MUST NOT result in the status
3357	   NFS4ERR_PNFS_IO_HOLE.  Instead, the response SHOULD be successful and
3358	   if the server applies IO_ADVISE hints on any stripe units that
3359	   overlap with the specified range, those hints SHOULD be indicated in
3360	   the response.

3362	13.8.6.  Number of Supported File Segments

3364	   In theory IO_ADVISE allows a client and server to support multiple
3365	   file segments, meaning that different, possibly overlapping, byte
3366	   ranges of the same open file reference will support different hints.
3367	   This is not practical, and in general the server will support just
3368	   one set of hints, and these will apply to the entire file.  However,
3369	   there are some hints that are very ephemeral and essentially amount
3370	   to one-time instructions to the NFS server, which will be forgotten
3371	   momentarily after IO_ADVISE is executed.

3373	   The following hints will always apply to the entire file, regardless
3374	   of the specified byte range:

3376	   o  IO_ADVISE4_NORMAL

3378	   o  IO_ADVISE4_SEQUENTIAL

3380	   o  IO_ADVISE4_SEQUENTIAL_BACKWARDS

3382	   o  IO_ADVISE4_RANDOM

3384	   The following hints will always apply to the specified byte range and
3385	   will be treated as one-time instructions:

3387	   o  IO_ADVISE4_WILLNEED

3389	   o  IO_ADVISE4_WILLNEED_OPPORTUNISTIC

3391	   o  IO_ADVISE4_DONTNEED

3393	   o  IO_ADVISE4_NOREUSE

3395	   The following hints are modifiers to all other hints, and will apply
3396	   to the entire file and/or to a one time instruction on the specified
3397	   byte range:

3399	   o  IO_ADVISE4_READ

3401	   o  IO_ADVISE4_WRITE
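
	   The three groups above can be captured in a small lookup, as
	   sketched here.  The set and function names are hypothetical.
	   IO_ADVISE4_INIT_PROXIMITY is not assigned to any group in
	   this section, so the sketch reports it as unspecified.

```python
WHOLE_FILE = frozenset({
    "IO_ADVISE4_NORMAL",
    "IO_ADVISE4_SEQUENTIAL",
    "IO_ADVISE4_SEQUENTIAL_BACKWARDS",
    "IO_ADVISE4_RANDOM",
})
ONE_TIME = frozenset({
    "IO_ADVISE4_WILLNEED",
    "IO_ADVISE4_WILLNEED_OPPORTUNISTIC",
    "IO_ADVISE4_DONTNEED",
    "IO_ADVISE4_NOREUSE",
})
MODIFIERS = frozenset({"IO_ADVISE4_READ", "IO_ADVISE4_WRITE"})


def hint_scope(hint):
    """Classify a hint name by how the server is expected to apply it."""
    if hint in WHOLE_FILE:
        return "entire file"
    if hint in ONE_TIME:
        return "one-time, specified byte range"
    if hint in MODIFIERS:
        return "modifier"
    return "unspecified"


def applied_range(hint, iar_offset, iar_count):
    """Return None for whole-file hints, else the requested range."""
    if hint in WHOLE_FILE:
        return None
    return (iar_offset, iar_count)
```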

3403	13.9.  Changes to Operation 51: LAYOUTRETURN

3405	13.9.1.  Introduction

3407	   In the pNFS description provided in [2], the client is not enabled to
3408	   relay an error code from the DS to the MDS.  In the specification of
3409	   the Objects-Based Layout protocol [9], use is made of the opaque
3410	   lrf_body field of the LAYOUTRETURN argument to do such a relaying of
3411	   error codes.  In this section, we define a new data structure to
3412	   enable the passing of error codes back to the MDS and provide some
3413	   guidelines on what both the client and MDS should expect in such
3414	   circumstances.

3416	   There are two broad classes of errors, transient and persistent.  The
3417	   client SHOULD strive to only use this new mechanism to report
3418	   persistent errors.  It MUST be able to deal with transient issues by
3419	   itself.  Also, while the client might consider an issue to be
3420	   persistent, it MUST be prepared for the MDS to consider such issues
3421	   to be transient.  A prime example of this is if the MDS fences off a
3422	   client from either a stateid or a filehandle.  The client will get an
3423	   error from the DS and might relay either NFS4ERR_ACCESS or
3424	   NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a
3425	   hard error.  The MDS, on the other hand, is waiting for the client to
3426	   report such an error.  For it, the mission is accomplished in that
3427	   the client has returned a layout that the MDS had most likely
3428	   recalled.

3430	   The existing LAYOUTRETURN operation is extended by introducing a new
3431	   data structure to report errors, layoutreturn_device_error4.  Also,
3432	   layoutreturn_error_report4 is introduced to enable an array of errors
3433	   to be reported.

3435	13.9.2.  ARGUMENT

3437	   The ARGUMENT specification of the LAYOUTRETURN operation in section
3438	   18.44.1 of [2] is augmented by the following XDR code [23]:

3440	   struct layoutreturn_device_error4 {
3441	           deviceid4       lrde_deviceid;
3442	           nfsstat4        lrde_status;
3443	           nfs_opnum4      lrde_opnum;
3444	   };

3446	   struct layoutreturn_error_report4 {
3447	           layoutreturn_device_error4      lrer_errors<>;
3448	   };

3450	13.9.3.  RESULT

3452	   The RESULT of the LAYOUTRETURN operation is unchanged; see section
3453	   18.44.2 of [2].

3455	13.9.4.  DESCRIPTION

3457	   The following text is added to the end of the LAYOUTRETURN operation
3458	   DESCRIPTION in section 18.44.3 of [2].

3460	   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
3461	   then if the lrf_body field is NULL, it indicates to the MDS that the
3462	   client experienced no errors.  If lrf_body is non-NULL, then the
3463	   field references error information which is layout type specific.
3464	   I.e., the Objects-Based Layout protocol can continue to utilize
3465	   lrf_body as specified in [9].  For Files-Based Layouts, the field
3466	   references a layoutreturn_error_report4, which contains an array of
3467	   layoutreturn_device_error4.

3469	   Each individual layoutreturn_device_error4 describes a single error
3470	   associated with a DS, which is identified via lrde_deviceid.  The
3471	   operation which returned the error is identified via lrde_opnum.
3472	   Finally the NFS error value (nfsstat4) encountered is provided via
3473	   lrde_status and may consist of the following error codes:

3475	   NFS4_OK:  No issues were found for this device.

3477	   NFS4ERR_NXIO:  The client was unable to establish any communication
3478	      with the DS.

3480	   NFS4ERR_*:  The client was able to establish communication with the
3481	      DS and is returning one of the allowed error codes for the
3482	      operation denoted by lrde_opnum.
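
	   A client-side sketch of assembling the error report follows.
	   The Python class names mirror the XDR structures but are
	   otherwise hypothetical, as is the build_report helper and
	   its "persistent" input flag.  The numeric values shown for
	   NFS4ERR_NXIO, NFS4ERR_DELAY, OP_READ, and OP_WRITE are the
	   NFSv4 values as the editor understands them and should be
	   checked against [2].

```python
from dataclasses import dataclass, field
from typing import List

NFS4ERR_NXIO = 6       # assumed value; verify against [2]
NFS4ERR_DELAY = 10008  # assumed value; used here as a transient error
OP_READ = 25           # assumed value; verify against [2]
OP_WRITE = 38          # assumed value; verify against [2]


@dataclass(frozen=True)
class LayoutreturnDeviceError4:
    lrde_deviceid: bytes
    lrde_status: int
    lrde_opnum: int


@dataclass
class LayoutreturnErrorReport4:
    lrer_errors: List[LayoutreturnDeviceError4] = field(
        default_factory=list)


def build_report(observed):
    """Build the lrf_body payload from (deviceid, status, opnum,
    persistent) tuples.  Transient errors are left out, and None
    stands for a NULL lrf_body (no errors to report)."""
    errors = [LayoutreturnDeviceError4(dev, status, opnum)
              for dev, status, opnum, persistent in observed
              if persistent]
    return LayoutreturnErrorReport4(errors) if errors else None


report = build_report([
    (b"\x01" * 16, NFS4ERR_NXIO, OP_READ, True),
    (b"\x02" * 16, NFS4ERR_DELAY, OP_WRITE, False),
])
```

	   How a client decides that an error is persistent is left to
	   the implementation; the flag here simply records that
	   decision.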

3484	13.9.5.  IMPLEMENTATION

3486	   The following text is added to the end of the LAYOUTRETURN operation
3487	   IMPLEMENTATION in section 18.44.4 of [2].

3489	   A client that expects to use pNFS for a mounted filesystem SHOULD
3490	   check for pNFS support at mount time.  This check SHOULD be performed
3491	   by sending a GETDEVICELIST operation, followed by layout-type-
3492	   specific checks for accessibility of each storage device returned by
3493	   GETDEVICELIST.  If the NFS server does not support pNFS, the
3494	   GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
3495	   error; in this situation it is up to the client to determine whether
3496	   it is acceptable to proceed with NFS-only access.

3498	   Clients are expected to tolerate transient storage device errors, and
3499	   hence clients SHOULD NOT use the LAYOUTRETURN error handling for
3500	   device access problems that may be transient.  The methods by which a
3501	   client decides whether an access problem is transient vs. persistent
3502	   are implementation-specific, but may include retrying I/Os to a data
3503	   server under appropriate conditions.

3505	   When an I/O to a storage device fails, the client SHOULD retry the
3506	   failed I/O via the MDS.  In this situation, before retrying the I/O,
3507	   the client SHOULD return the layout, or the affected portion thereof,
3508	   and SHOULD indicate which storage device or devices were problematic.
3509	   If the client does not do this, the MDS may issue a layout recall
3510	   callback in order to perform the retried I/O.

3512	   The client needs to be cognizant that since this error handling is
3513	   optional in the MDS, the MDS may silently ignore this functionality.
3514	   Also, as the MDS may consider some issues the client reports to be
3515	   expected (see Section 13.9.1), the client might find it difficult to
3516	   detect an MDS which has not implemented error handling via
3517	   LAYOUTRETURN.

3519	   If an MDS is aware that a storage device is proving problematic to a
3520	   client, the MDS SHOULD NOT include that storage device in any pNFS
3521	   layouts sent to that client.  If the MDS is aware that a storage
3522	   device is affecting many clients, then the MDS SHOULD NOT include
3523	   that storage device in any pNFS layouts sent out.  Clients must still
3524	   be aware that the MDS might not have any choice in using the storage
3525	   device, i.e., there might only be one possible layout for the system.

3527	   Another interesting complication is that for existing files, the MDS
3528	   might have no choice in which storage devices to hand out to clients.
3529	   The MDS might try to restripe a file across a different storage
3530	   device, but clients need to be aware that not all implementations
3531	   have restriping support.

3533	   An MDS SHOULD react to a client return of layouts with errors by not
3534	   using the problematic storage devices in layouts for that client, but
3535	   the MDS is not required to indefinitely retain per-client storage
3536	   device error information.  An MDS is also not required to
3537	   automatically reinstate use of a previously problematic storage
3538	   device; administrative intervention may be required instead.

3540	   A client MAY perform I/O via the MDS even when the client holds a
3541	   layout that covers the I/O; servers MUST support this client
3542	   behavior, and MAY recall layouts as needed to complete I/Os.

3544	13.10.  Operation 65: READ_PLUS

3546	   READ_PLUS is a new read operation which allows NFS clients to avoid
3547	   reading holes in a sparse file and to efficiently transfer ADBs.
3548	   READ_PLUS supports all the features of the existing NFSv4.1 READ
3549	   operation [2] but also extends the response to avoid returning data
3550	   for portions of the file which either are initialized and contain no
3551	   backing store or would appear to be so in the result.  I.e., if the
3552	   result was a data block composed entirely of zeros, then it is easier
3553	   to return a hole.  Returning data blocks of uninitialized data wastes
3554	   computational and network resources, thus reducing performance.
3555	   READ_PLUS uses a new result structure that tells the client that the
3556	   result is all zeroes and gives the byte range of the hole in which
3557	   the request was made.

3559	   If the client sends a READ operation, it is explicitly stating that
3560	   it supports neither sparse files nor ADBs.  So if a READ occurs
3561	   on a sparse ADB or file, then the server must expand such data to be
3562	   raw bytes.  If a READ occurs in the middle of a hole or ADB, the
3563	   server can only send back bytes starting from that offset.

3565	   Such an operation is inefficient for transfer of sparse sections of
3566	   the file.  As such, READ is marked as OBSOLETE in NFSv4.2.  Instead,
3567	   a client should issue READ_PLUS.  Note that as the client has no a
3568	   priori knowledge of whether either an ADB or a hole is present or
3569	   not, it should always use READ_PLUS.

3571	13.10.1.  ARGUMENT

3573	   struct READ_PLUS4args {
3574	           /* CURRENT_FH: file */
3575	           stateid4        rpa_stateid;
3576	           offset4         rpa_offset;
3577	           count4          rpa_count;
3578	   };

3580	13.10.2.  RESULT

3582	   union read_plus_content switch (data_content4 content) {
3583	   case NFS4_CONTENT_DATA:
3584	           opaque          rpc_data<>;
3585	   case NFS4_CONTENT_APP_BLOCK:
3586	           app_data_block4 rpc_block;
3587	   case NFS4_CONTENT_HOLE:
3588	           data_info4      rpc_hole;
3589	   default:
3590	           void;
3591	   };

3593	   /*
3594	    * Allow a return of an array of contents.
3595	    */
3596	   struct read_plus_res4 {
3597	           bool                    rpr_eof;
3598	           read_plus_content       rpr_contents<>;
3599	   };

3601	   union READ_PLUS4res switch (nfsstat4 status) {
3602	   case NFS4_OK:
3603	           read_plus_res4  resok4;
3604	   default:
3605	           void;
3606	   };

3608	13.10.3.  DESCRIPTION

3610	   The READ_PLUS operation is based upon the NFSv4.1 READ operation [2]
3611	   and similarly reads data from the regular file identified by the
3612	   current filehandle.

3614	   The client provides an rpa_offset of where the READ_PLUS is to start
3615	   and an rpa_count of how many bytes are to be read.  An rpa_offset of
3616	   zero means to read data starting at the beginning of the file.  If
3617	   rpa_offset is greater than or equal to the size of the file, the
3618	   status NFS4_OK is returned with di_length (the data length) set to
3619	   zero and rpr_eof set to TRUE.  READ_PLUS is subject to access
3620	   permissions checking.

3622	   The READ_PLUS result is comprised of an array of rpr_contents, each
3623	   of which describes a data_content4 type of data.  For NFSv4.2, the
3624	   allowed values are data, ADB, and hole.  A server is required to
3625	   support the data type, but not the ADB or hole types.  An ADB or a
3626	   hole must be returned in its entirety - clients must be prepared to
3627	   get more information than they requested.
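
   As a non-normative illustration of the paragraph above, the following
   Python sketch shows one way a client might copy an rpr_contents array
   into a buffer that covers only the range it asked for, even when a
   returned hole extends past that range.  The Data and Hole classes and
   the reassemble() helper are hypothetical names, not protocol
   elements.  The sketch assumes the client tracks the offset of each
   data segment itself, since rpc_data carries no offset.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Data:
    offset: int  # assumption: tracked by the client, not sent in rpc_data
    data: bytes


@dataclass
class Hole:
    offset: int  # di_offset
    length: int  # di_length


def reassemble(req_offset: int, req_count: int,
               contents: List[Union[Data, Hole]]) -> bytes:
    """Return the bytes of [req_offset, req_offset + req_count) that the
    reply covers.  Segments are assumed to be ordered and contiguous."""
    req_end = req_offset + req_count
    buf = bytearray(req_count)  # zero-filled, so holes need no copying
    covered_end = req_offset
    for seg in contents:
        seg_len = len(seg.data) if isinstance(seg, Data) else seg.length
        lo = max(seg.offset, req_offset)
        hi = min(seg.offset + seg_len, req_end)
        if lo >= hi:
            continue  # segment lies wholly outside the requested range
        if isinstance(seg, Data):
            buf[lo - req_offset:hi - req_offset] = \
                seg.data[lo - seg.offset:hi - seg.offset]
        covered_end = max(covered_end, hi)
    return bytes(buf[:covered_end - req_offset])
```

   A reply that covers less than the request yields a shorter buffer,
   which the caller can treat as a short read.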

3629	   READ_PLUS has to support all of the errors which are returned by READ
3630	   plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and the
3631	   server does not support that arm of the discriminated union, but does
3632	   support one or more additional arms, it can signal to the client that
3633	   it supports the operation, but not the arm with
3634	   NFS4ERR_UNION_NOTSUPP.

3636	   If the data to be returned is comprised entirely of zeros, then the
3637	   server may elect to return that data as a hole.  The server
3638	   differentiates this to the client by setting di_allocated to TRUE in
3639	   this case.  Note that in such a scenario, the server is not required
3640	   to determine the full extent of the "hole" - it does not need to
3641	   determine where the zeros start and end.

3643	   The server may elect to return adjacent elements of the same type.
3644	   For example, the guard pattern or block size of an ADB might change,
3645	   which would require adjacent elements of type ADB.  Likewise if the
3646	   server has a range of data comprised entirely of zeros and then a
3647	   hole, it might want to return two adjacent holes to the client.

3649	   If the client specifies a rpa_count value of zero, the READ_PLUS
3650	   succeeds and returns zero bytes of data, again subject to access
3651	   permissions checking.  In all situations, the server may choose to
3652	   return fewer bytes than specified by the client.  The client needs to
3653	   check for this condition and handle the condition appropriately.

3655	   If the client specifies an rpa_offset and rpa_count value that is
3656	   entirely contained within a hole of the file, then the di_offset and
3657	   di_length returned must be for the entire hole.  This result is
3658	   considered valid until the file is changed (detected via the change
3659	   attribute).  The server MUST provide the same semantics for the hole
3660	   as if the client read the region and received zeroes; the implied
3661	   hole's contents lifetime MUST be exactly the same as any other read
3662	   data.

3664	   If the client specifies an rpa_offset and rpa_count value that begins
3665	   in a non-hole of the file but extends into a hole, the server should
3666	   return an array comprised of both data and a hole.  The client MUST
3667	   be prepared for the server to return a short read describing just the
3668	   data.  The client will then issue another READ_PLUS for the remaining
3669	   bytes, to which the server will respond with information about the
3670	   hole in the file.
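
   The short-read behavior described above could be handled by a client
   loop along the following lines.  This is a minimal sketch, not a
   definitive implementation: read_plus is a hypothetical callable
   standing in for the RPC, and each segment is modeled as a
   ("data", offset, bytes) or ("hole", offset, length) tuple chosen for
   this example only.

```python
def read_range(read_plus, offset, count):
    """Read [offset, offset + count), reissuing READ_PLUS after short reads.

    read_plus(offset, count) -> (eof, segments) is a hypothetical hook.
    """
    out = bytearray()
    pos, end = offset, offset + count
    while pos < end:
        eof, segments = read_plus(pos, end - pos)
        progressed = False
        for kind, seg_off, payload in segments:
            seg_len = len(payload) if kind == "data" else payload
            seg_end = min(seg_off + seg_len, end)
            if seg_end <= pos:
                continue  # nothing new in this segment
            if seg_off > pos:
                break  # unexpected gap; stop rather than guess
            if kind == "data":
                out += payload[pos - seg_off:seg_end - seg_off]
            else:
                out += bytes(seg_end - pos)  # a hole reads as zeros
            pos = seg_end
            progressed = True
        if eof or not progressed:
            break
    return bytes(out)
```

   The loop stops on eof or when a reply makes no progress, so a
   misbehaving server cannot make the client spin.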

3672	   Except when special stateids are used, the stateid value for a
3673	   READ_PLUS request represents a value returned from a previous byte-
3674	   range lock or share reservation request or the stateid associated
3675	   with a delegation.  The stateid identifies the associated owners if
3676	   any and is used by the server to verify that the associated locks are
3677	   still valid (e.g., have not been revoked).

3679	   If the read ended at the end-of-file (formally, in a correctly formed
3680	   READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
3681	   of the file), or the READ_PLUS operation extends beyond the size of
3682	   the file (if rpa_offset + rpa_count is greater than the size of the
3683	   file), eof is returned as TRUE; otherwise, it is FALSE.  A successful
3684	   READ_PLUS of an empty file will always return eof as TRUE.
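
   The eof rule above reduces to a single comparison.  The sketch below
   is illustrative only; read_plus_eof is a hypothetical helper name and
   file_size stands for the file size as seen by the server.

```python
def read_plus_eof(file_size, rpa_offset, rpa_count):
    """TRUE when the request reaches or extends beyond the end of file."""
    return rpa_offset + rpa_count >= file_size
```

   Because the comparison uses "greater than or equal", a read of an
   empty file and a read that starts at or past the end of the file
   both report eof as TRUE.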

3686	   If the current filehandle is not an ordinary file, an error will be
3687	   returned to the client.  In the case that the current filehandle
3688	   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
3689	   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
3690	   returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
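
   The file type checks above can be summarized as a small lookup.  In
   this sketch the type and status names are plain strings;
   read_plus_type_check is a hypothetical helper, and NF4REG and NF4LNK
   are the NFSv4.1 type names for regular files and symbolic links.

```python
def read_plus_type_check(ftype):
    """Map the current filehandle's type to the READ_PLUS status."""
    if ftype == "NF4REG":
        return "NFS4_OK"  # regular file: the read may proceed
    if ftype == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if ftype == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    return "NFS4ERR_WRONG_TYPE"  # all other object types
```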

3692	   For a READ_PLUS with a stateid value of all bits equal to zero, the
3693	   server MAY allow the READ_PLUS to be serviced subject to mandatory
3694	   byte-range locks or the current share deny modes for the file.  For a
3695	   READ_PLUS with a stateid value of all bits equal to one, the server
3696	   MAY allow READ_PLUS operations to bypass locking checks at the
3697	   server.

3699	   On success, the current filehandle retains its value.

3701	13.10.4.  IMPLEMENTATION

3703	   In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
3704	   [2] also apply to READ_PLUS.  One delta is that when the owner has a
3705	   locked byte range, the server MUST return an array of rpr_contents
3706	   with values inside that range.

3708	13.10.4.1.  Additional pNFS Implementation Information

3710	   With pNFS, the semantics of using READ_PLUS remain the same.  Any
3711	   data server MAY return a hole or ADB result for a READ_PLUS request
3712	   that it receives.

3714	   When a data server chooses to return a hole result, it has the option
3715	   of returning hole information for the data stored on that data server
3716	   (as defined by the data layout), but it MUST NOT return results for a
3717	   byte range that includes data managed by another data server.  If a
3718	   data server can obtain hole information for the parts of the file
3719	   stored on that data server, it SHOULD return HOLE_INFO and the byte
3720	   range of the hole stored on that data server.

3722	   A data server should do its best to return as much information about
3723	   a hole as is feasible without having to contact the metadata server.
3724	   If communication with the metadata server is required, then every
3725	   attempt should be made to minimize the number of requests.

3727	   If mandatory locking is enforced, then the data server must also
3728	   ensure that it returns only information for a hole that is within
3729	   the owner's locked byte range.
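
   One way a data server might keep a hole result from covering bytes
   managed by another data server is to clip the hole at the end of the
   stripe unit in which it starts.  The sketch below assumes a simple
   layout in which consecutive stripe units belong to different data
   servers; clip_hole_to_stripe_unit and stripe_unit are hypothetical
   names, and real layouts may permit a less conservative clip.

```python
def clip_hole_to_stripe_unit(hole_offset, hole_length, stripe_unit):
    """Return (offset, length) of the hole, cut at the stripe unit end."""
    unit_end = (hole_offset // stripe_unit + 1) * stripe_unit
    return hole_offset, min(hole_length, unit_end - hole_offset)
```

   For example, with a 64-byte stripe unit a hole reported at offset 10
   with length 100 would be clipped to length 54, ending at offset 64.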

3731	13.10.5.  READ_PLUS with Sparse Files Example

3733	   The following table describes a sparse file.  For each byte range,
3734	   the file contains either non-zero data or a hole.  In addition, the
3735	   server in this example uses a Hole Threshold of 32K.

3737	                        +-------------+----------+
3738	                        | Byte-Range  | Contents |
3739	                        +-------------+----------+
3740	                        | 0-15999     | Hole     |
3741	                        | 16K-31999   | Non-Zero |
3742	                        | 32K-255999  | Hole     |
3743	                        | 256K-287999 | Non-Zero |
3744	                        | 288K-353999 | Hole     |
3745	                        | 354K-417999 | Non-Zero |
3746	                        +-------------+----------+

3748	                                  Table 4

3750	   Under the given circumstances, if a client were to read from the file
3751	   with a max read size of 64K, the following will be the results for
3752	   the given READ_PLUS calls.  This assumes the client has already
3753	   opened the file, acquired a valid stateid ('s' in the example), and
3754	   just needs to issue READ_PLUS requests.

3756	   1.  READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K],
3757	       hole[32K,224K]>.  Since the first hole is less than the server's
3758	       Hole Threshold, the first 32K of the file is returned as data
3759	       and the remaining 32K is returned as a hole which actually
3760	       extends to 256K.

3762	   2.  READ_PLUS(s, 32K, 64K) --> NFS4_OK, eof = false,
3763	       <hole[32K,224K]>.  The requested range was all zeros, and the
3764	       current hole begins at offset 32K and is 224K in length.  Note
3765	       that the client should not have followed up the previous
3766	       READ_PLUS request with this one, as the hole information from
3767	       the previous call extended past what the client was requesting.

3769	   3.  READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false, <data[256K,
3770	       32K], hole[288K,66K]>.  Returns an array of the 32K of data and
3771	       the hole, which extends to 354K.

3773	   4.  READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true, <data[354K,
3774	       64K]>.  Returns the final 64K of data and informs the client
3775	       there is no more data in the file.
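
   The four calls above can be reproduced with a small simulation of the
   example server.  This is a sketch under stated assumptions, not a
   server implementation: K is taken as 1000 because the table's ranges
   (for example 16K-31999) imply it, segments are reported as
   (kind, offset, length) tuples, and a hole shorter than the Hole
   Threshold is folded into adjacent data.  EXTENTS, read_plus, and the
   other names are hypothetical.

```python
K = 1000  # assumption: implied by ranges such as 16K-31999 in the table

# (start, end, kind) with end exclusive, taken from the table above.
EXTENTS = [
    (0, 16 * K, "hole"),
    (16 * K, 32 * K, "data"),
    (32 * K, 256 * K, "hole"),
    (256 * K, 288 * K, "data"),
    (288 * K, 354 * K, "hole"),
    (354 * K, 418 * K, "data"),
]
FILE_SIZE = 418 * K
HOLE_THRESHOLD = 32 * K


def read_plus(offset, count):
    """Return (eof, contents) for the example file."""
    end = min(offset + count, FILE_SIZE)
    contents = []
    for start, stop, kind in EXTENTS:
        if stop <= offset or start >= end:
            continue  # extent does not overlap the request
        if kind == "hole" and stop - start >= HOLE_THRESHOLD:
            # A reported hole is always returned in its entirety.
            contents.append(("hole", start, stop - start))
            continue
        # Real data, or a small hole returned as zero-filled data.
        lo, hi = max(start, offset), min(stop, end)
        last = contents[-1] if contents else None
        if last and last[0] == "data" and last[1] + last[2] == lo:
            contents[-1] = ("data", last[1], last[2] + hi - lo)
        else:
            contents.append(("data", lo, hi - lo))
    return offset + count >= FILE_SIZE, contents
```

   Calling read_plus(0, 64 * K) yields a 32K data segment followed by
   the full 224K hole, matching the first call in the list.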

3777	13.11.  Operation 66: SEEK

3779	   SEEK is an operation that allows a client to determine the location
3780	   of the next data_content4 in a file.  It allows an implementation of
3781	   the emerging extension to lseek(2) to allow clients to determine
3782	   SEEK_HOLE and SEEK_DATA.

3784	13.11.1.  ARGUMENT

3786	   struct SEEK4args {
3787	           /* CURRENT_FH: file */
3788	           stateid4        sa_stateid;
3789	           offset4         sa_offset;
3790	           data_content4   sa_what;
3791	   };

3793	13.11.2.  RESULT

3795	   union seek_content switch (data_content4 content) {
3796	   case NFS4_CONTENT_DATA:
3797	           data_info4      sc_data;
3798	   case NFS4_CONTENT_APP_BLOCK:
3799	           app_data_block4 sc_block;
3800	   case NFS4_CONTENT_HOLE:
3801	           data_info4      sc_hole;
3802	   default:
3803	           void;
3804	   };

3806	   struct seek_res4 {
3807	           bool                    sr_eof;
3808	           seek_content            sr_contents;
3809	   };

3811	   union SEEK4res switch (nfsstat4 status) {
3812	   case NFS4_OK:
3813	           seek_res4       resok4;
3814	   default:
3815	           void;
3816	   };

3818	13.11.3.  DESCRIPTION

3820	   From the given sa_offset, find the next data_content4 of type sa_what
3821	   in the file.  For either a hole or ADB, this must return the
3822	   data_content4 in its entirety.  For data, it must not return the
3823	   actual data.

3825	   SEEK must follow the same rules for stateids as READ_PLUS
3826	   (Section 13.10.3).

3828	   If the server could not find a corresponding sa_what, then the status
3829	   would still be NFS4_OK, but sr_eof would be TRUE.  The sr_contents
3830	   would contain zeroed-out content of the appropriate type.
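
   The search described above might look as follows over an ordered
   extent list.  This is a non-normative sketch: the (start, end, kind)
   extent tuples and the seek() helper are hypothetical, the result is
   reduced to (sr_eof, (offset, length)), and returning sr_eof as FALSE
   when a match is found is an assumption of the sketch.

```python
def seek(extents, sa_offset, sa_what):
    """Find the next extent of kind sa_what at or after sa_offset.

    extents is a sorted list of (start, end, kind), end exclusive.
    """
    for start, stop, kind in extents:
        if kind == sa_what and stop > sa_offset:
            # Report the whole extent, even if sa_offset is inside it.
            return False, (start, stop - start)
    return True, (0, 0)  # not found: zeroed content, sr_eof TRUE
```

   Only offsets and lengths are returned, so a search for data never
   carries the data itself.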

3832	14.  NFSv4.2 Callback Operations

3834	14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
3835	       Attributes Changed

3837	14.1.1.  ARGUMENTS

3839	   struct CB_ATTR_CHANGED4args {
3840	           nfs_fh4         acca_fh;
3841	           bitmap4         acca_critical;
3842	           bitmap4         acca_info;
3843	   };

3845	14.1.2.  RESULTS

3847	   struct CB_ATTR_CHANGED4res {
3848	           nfsstat4        accr_status;
3849	   };

3851	14.1.3.  DESCRIPTION

3853	   The CB_ATTR_CHANGED callback operation is used by the server to
3854	   indicate to the client that the file's attributes have been modified
3855	   on the server.  The server does not convey how the attributes have
3856	   changed, just that they have been modified.  The server can inform
3857	   the client about both critical and informational attribute changes in
3858	   the bitmask arguments.  The client SHOULD query the server about all
3859	   attributes set in acca_critical.  For all changes reflected in
3860	   acca_info, the client can decide whether or not it wants to poll the
3861	   server.
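
   A client could act on the two bitmaps roughly as sketched below.  A
   bitmap4 is modeled as a list of 32-bit words in which attribute n is
   bit (n mod 32) of word (n div 32), as in NFSv4.1.  The helper names
   and the "interesting" set of informational attributes the client
   cares about are assumptions made for this example.

```python
def bitmap4_attrs(bitmap):
    """Return the attribute numbers set in a bitmap4 (list of uint32)."""
    return [w * 32 + b
            for w, word in enumerate(bitmap)
            for b in range(32)
            if (word >> b) & 1]


def attrs_to_refetch(acca_critical, acca_info, interesting):
    """All critical attributes, plus informational ones the client wants."""
    critical = bitmap4_attrs(acca_critical)
    optional = [a for a in bitmap4_attrs(acca_info)
                if a in interesting and a not in critical]
    return critical + optional
```

   The returned list could then drive a GETATTR for just those
   attributes.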

3863	   The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
3864	   in acca_critical is the method used by the server to indicate that
3865	   the MAC label for the file referenced by acca_fh has changed.  In
3866	   many ways, the server does not care about the result returned by the
3867	   client.

3869	14.2.  Operation 15: CB_COPY - Report results of a server-side copy
3870	14.2.1.  ARGUMENT

3872	   union copy_info4 switch (nfsstat4 cca_status) {
3873	           case NFS4_OK:
3874	                   void;
3875	           default:
3876	                   length4         cca_bytes_copied;
3877	   };

3879	   struct CB_COPY4args {
3880	           nfs_fh4         cca_fh;
3881	           stateid4        cca_stateid;
3882	           copy_info4      cca_copy_info;
3883	   };

3885	14.2.2.  RESULT

3887	   struct CB_COPY4res {
3888	           nfsstat4        ccr_status;
3889	   };

3891	14.2.3.  DESCRIPTION

3893	   CB_COPY is used for both intra- and inter-server asynchronous copies.
3894	   The CB_COPY callback informs the client of the result of an
3895	   asynchronous server-side copy.  This operation is sent by the
3896	   destination server to the client in a CB_COMPOUND request.  The copy
3897	   is identified by the filehandle and stateid arguments.  The result is
3898	   indicated by the status field.  If the copy failed, cca_bytes_copied
3899	   contains the number of bytes copied before the failure occurred.  The
3900	   cca_bytes_copied value indicates the number of bytes copied but not
3901	   which specific bytes have been copied.
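
   A client-side handler for this callback might be sketched as follows.
   The pending table keyed by (filehandle, stateid), the handler name,
   and the use of strings for status codes are all assumptions of the
   example.  Replying with NFS4ERR_BAD_STATEID for an unknown copy is
   likewise a choice made for this sketch, not something this document
   specifies.

```python
def handle_cb_copy(pending, cca_fh, cca_stateid, cca_status,
                   cca_bytes_copied=None):
    """Record the outcome of an asynchronous copy and return ccr_status."""
    key = (cca_fh, cca_stateid)
    if key not in pending:
        return "NFS4ERR_BAD_STATEID"  # assumption: not given by the text
    result = {"done": True, "status": cca_status}
    if cca_status != "NFS4_OK":
        # Only the failure arm of copy_info4 carries a byte count.
        result["bytes_copied"] = cca_bytes_copied
    pending[key] = result
    return "NFS4_OK"
```

   Because cca_bytes_copied does not say which bytes were copied, a
   client that wants to resume after a failure cannot rely on the count
   alone to know which ranges are intact.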

3903	   In the absence of an established backchannel, the server cannot
3904	   signal the completion of the COPY via a CB_COPY callback.  The loss
3905	   of a callback channel would be indicated by the server setting the
3906	   SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the
3907	   SEQUENCE operation.  The client must re-establish the callback
3908	   channel to receive the status of the COPY operation.  Prolonged loss
3909	   of the callback channel could result in the server dropping the COPY
3910	   operation state and invalidating the copy stateid.

3912	   If the client supports the COPY operation, the client is REQUIRED to
3913	   support the CB_COPY operation.

3915	   The CB_COPY operation may fail for the following reasons (this is a
3916	   partial list):

3918	   NFS4ERR_NOTSUPP:  The copy offload operation is not supported by the
3919	      NFS client receiving this request.

3921	15.  IANA Considerations

3923	   This section uses terms that are defined in [24].

3925	16.  References

3927	16.1.  Normative References

3929	   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
3930	         Levels", BCP 14, RFC 2119, March 1997.

3932	   [2]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
3933	         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
3934	         January 2010.

3936	   [3]   Haynes, T., "Network File System (NFS) Version 4 Minor Version
3937	         2 External Data Representation Standard (XDR) Description",
3938	         March 2011.

3940	   [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
3941	         Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
3942	         January 2005.

3944	   [5]   Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
3945	         Security Version 3", draft-williams-rpcsecgssv3 (work in
3946	         progress), 2011.

3948	   [6]   The Open Group, "Section 'posix_fadvise()' of System Interfaces
3949	         of The Open Group Base Specifications Issue 6, IEEE Std 1003.1,
3950	         2004 Edition", 2004.

3952	   [7]   Haynes, T., "Requirements for Labeled NFS",
3953	         draft-ietf-nfsv4-labreqs-00 (work in progress).

3955	   [8]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
3956	         Specification", RFC 2203, September 1997.

3958	   [9]   Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
3959	         NFS (pNFS) Operations", RFC 5664, January 2010.

3961	16.2.  Informative References

3963	   [10]  Haynes, T. and D. Noveck, "Network File System (NFS) version 4
3964	         Protocol", draft-ietf-nfsv4-rfc3530bis-09 (work in progress),
3965	         March 2011.

3967	   [11]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
3968	         "NSDB Protocol for Federated Filesystems",
3969	         draft-ietf-nfsv4-federated-fs-protocol (work in progress),
3970	         2010.

3972	   [12]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
3973	         "Administration Protocol for Federated Filesystems",
3974	         draft-ietf-nfsv4-federated-fs-admin (work in progress), 2010.

3976	   [13]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
3977	         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
3978	         HTTP/1.1", RFC 2616, June 1999.

3980	   [14]  Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
3981	         RFC 959, October 1985.

3983	   [15]  Simpson, W., "PPP Challenge Handshake Authentication Protocol
3984	         (CHAP)", RFC 1994, August 1996.

3986	   [16]  VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
3987	         Overhead with Application-Directed Prefetching", Proceedings of
3988	         USENIX Annual Technical Conference, June 2009.

3990	   [17]  Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
3991	         Oracle Database Concepts 11g Release 1 (11.1)", January 2011.

3993	   [18]  Ashdown, L., "Chapter 15, Validating Database Files and
3994	         Backups, of Oracle Database Backup and Recovery User's Guide
3995	         11g Release 1 (11.1)", August 2008.

3997	   [19]  McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
3998	         Corruption of Solaris Internals", 2007.

4000	   [20]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
4001	         Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
4002	         Corruption in the Storage Stack", Proceedings of the 6th USENIX
4003	         Symposium on File and Storage Technologies (FAST '08), 2008.

4005	   [21]  "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
4006	         Deployment, configuration and administration of Red Hat
4007	         Enterprise Linux 5, Edition 6", 2011.

4009	   [22]  Quigley, D. and J. Lu, "Registry Specification for MAC Security
4010	         Label Formats", draft-quigley-label-format-registry (work in
4011	         progress), 2011.

4013	   [23]  Eisler, M., "XDR: External Data Representation Standard",
4014	         RFC 4506, May 2006.

4016	   [24]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
4017	         Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

4019	Appendix A.  Acknowledgments

4021	   For the pNFS Access Permissions Check, the original draft was by
4022	   Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The work
4023	   was influenced by discussions with Benny Halevy and Bruce Fields.  A
4024	   review was done by Tom Haynes.

4026	   For the Sharing change attribute implementation details with NFSv4
4027	   clients, the original draft was by Trond Myklebust.

4029	   For the NFS Server-side Copy, the original draft was by James
4030	   Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
4031	   Iyer.  Tom Talpey co-authored an unpublished version of that
4032	   document.  It was also reviewed by a number of individuals:
4033	   Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
4034	   Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
4035	   and Nico Williams.

4037	   For the NFS space reservation operations, the original draft was by
4038	   Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

4040	   For the sparse file support, the original draft was by Dean
4041	   Hildebrand and Marc Eshel.  Valuable input and advice were received
4042	   from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
4043	   Richard Scheffenegger.

4045	   For the Application IO Hints, the original draft was by Dean
4046	   Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
4047	   early reviewers included Benny Halevy and Pranoop Erasani.

4049	   For Labeled NFS, the original draft was by David Quigley, James
4050	   Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
4051	   Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
4052	   contributed in the final push to get this accepted.

4054	Appendix B.  RFC Editor Notes

4056	   [RFC Editor: please remove this section prior to publishing this
4057	   document as an RFC]

4059	   [RFC Editor: prior to publishing this document as an RFC, please
4060	   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
4061	   RFC number of this document]

4063	Author's Address

4065	   Thomas Haynes
4066	   NetApp
4067	   9110 E 66th St
4068	   Tulsa, OK  74133
4069	   USA

4071	   Phone: +1 918 307 1415
4072	   Email: thomas@netapp.com
4073	   URI:   http://www.tulsalabs.com