idnits 2.17.1 

draft-ietf-nfsv4-minorversion2-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 5 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  == There are 5 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     Furthermore, each DS MUST not report to a client a sparse ADB which
     belongs to another DS.  One implication of this requirement is that the
     app_data_block4's adb_block_size MUST be either be the stripe width or
     the stripe width must be an even multiple of it.  The second implication
     here is that the DS must be able to use the Control Protocol to determine
     from the MDS where the sparse ADBs occur.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     The second change is to provide a method for the server to notify
     the client that the attribute changed on an open file on the server.  If
     the file is closed, then during the open attempt, the client will gather
     the new attribute value.  The server MUST not communicate the new value
     of the attribute, the client MUST query it.  This requirement stems from
     the need for the client to provide sufficient access rights to the
     attribute.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     With pNFS, the semantics of using READ_PLUS remains the same.  Any
     data server MAY return a hole or ADB result for a READ_PLUS request that
     it receives.  When a data server chooses to return such a result, it has
     the option of returning information for the data stored on that data
     server (as defined by the data layout), but it MUST not return results
     for a byte range that includes data managed by another data server.

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 08, 2012) is 4370 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '0' is mentioned on line 3787, but not defined

  -- Looks like a reference, but probably isn't: '32K' on line 3787

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 5661 (ref. '2') (Obsoleted by RFC 8881)

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  == Outdated reference: A later version (-05) exists of
     draft-ietf-nfsv4-labreqs-00

  ** Downref: Normative reference to an Informational draft:
     draft-ietf-nfsv4-labreqs (ref. '7')

  == Outdated reference: A later version (-35) exists of
     draft-ietf-nfsv4-rfc3530bis-09

  -- Obsolete informational reference (is this intentional?): RFC 2616 (ref.
     '13') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC
     7235)

  -- Obsolete informational reference (is this intentional?): RFC 5226 (ref.
     '25') (Obsoleted by RFC 8126)


     Summary: 2 errors (**), 0 flaws (~~), 10 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NFSv4                                                          T. Haynes
3	Internet-Draft                                                    Editor
4	Intended status: Standards Track                            May 08, 2012
5	Expires: November 9, 2012

7	                     NFS Version 4 Minor Version 2
8	                 draft-ietf-nfsv4-minorversion2-10.txt

10	Abstract

12	   This Internet-Draft describes NFS version 4 minor version two,
13	   focusing mainly on the protocol extensions made from NFS version 4
14	   minor version 0 and NFS version 4 minor version 1.  Major extensions
15	   introduced in NFS version 4 minor version two include: Server-side
16	   Copy, Application I/O Advise, Space Reservations, Sparse Files,
17	   Application Data Blocks, and Labeled NFS.

19	Requirements Language

21	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
22	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
23	   document are to be interpreted as described in RFC 2119 [1].

25	Status of this Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at http://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on November 9, 2012.

42	Copyright Notice

44	   Copyright (c) 2012 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	   This document may contain material from IETF Documents or IETF
58	   Contributions published or made publicly available before November
59	   10, 2008.  The person(s) controlling the copyright in some of this
60	   material may not have granted the IETF Trust the right to allow
61	   modifications of such material outside the IETF Standards Process.
62	   Without obtaining an adequate license from the person(s) controlling
63	   the copyright in such materials, this document may not be modified
64	   outside the IETF Standards Process, and derivative works of it may
65	   not be created outside the IETF Standards Process, except to format
66	   it for publication as an RFC or to translate it into languages other
67	   than English.

69	Table of Contents

71	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  6
72	     1.1.   The NFS Version 4 Minor Version 2 Protocol  . . . . . . .  6
73	     1.2.   Scope of This Document  . . . . . . . . . . . . . . . . .  6
74	     1.3.   NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . .  6
75	     1.4.   Overview of NFSv4.2 Features  . . . . . . . . . . . . . .  7
76	       1.4.1.  Sparse Files . . . . . . . . . . . . . . . . . . . . .  7
77	       1.4.2.  Application I/O Advise . . . . . . . . . . . . . . . .  7
78	     1.5.   Differences from NFSv4.1  . . . . . . . . . . . . . . . .  7
79	   2.  NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . .  7
80	     2.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . .  7
81	     2.2.   Protocol Overview . . . . . . . . . . . . . . . . . . . .  8
82	       2.2.1.  Intra-Server Copy  . . . . . . . . . . . . . . . . . . 10
83	       2.2.2.  Inter-Server Copy  . . . . . . . . . . . . . . . . . . 11
84	       2.2.3.  Server-to-Server Copy Protocol . . . . . . . . . . . . 14
85	     2.3.   Operations  . . . . . . . . . . . . . . . . . . . . . . . 16
86	       2.3.1.  netloc4 - Network Locations  . . . . . . . . . . . . . 16
87	       2.3.2.  Copy Offload Stateids  . . . . . . . . . . . . . . . . 17
88	     2.4.   Security Considerations . . . . . . . . . . . . . . . . . 17
89	       2.4.1.  Inter-Server Copy Security . . . . . . . . . . . . . . 17
90	   3.  Support for Application IO Hints . . . . . . . . . . . . . . . 26
91	     3.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 26
92	     3.2.   POSIX Requirements  . . . . . . . . . . . . . . . . . . . 26
93	     3.3.   Additional Requirements . . . . . . . . . . . . . . . . . 27
94	     3.4.   Security Considerations . . . . . . . . . . . . . . . . . 28
95	     3.5.   IANA Considerations . . . . . . . . . . . . . . . . . . . 28
96	   4.  Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 28
97	     4.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 29
98	     4.2.   Terminology . . . . . . . . . . . . . . . . . . . . . . . 29
99	   5.  Space Reservation  . . . . . . . . . . . . . . . . . . . . . . 30
100	     5.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 30
101	   6.  Application Data Block Support . . . . . . . . . . . . . . . . 32
102	     6.1.   Generic Framework . . . . . . . . . . . . . . . . . . . . 33
103	       6.1.1.  Data Block Representation  . . . . . . . . . . . . . . 33
104	       6.1.2.  Data Content . . . . . . . . . . . . . . . . . . . . . 34
105	     6.2.   pNFS Considerations . . . . . . . . . . . . . . . . . . . 34
106	     6.3.   An Example of Detecting Corruption  . . . . . . . . . . . 34
107	     6.4.   Example of READ_PLUS  . . . . . . . . . . . . . . . . . . 36
108	     6.5.   Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36
109	   7.  Labeled NFS  . . . . . . . . . . . . . . . . . . . . . . . . . 36
110	     7.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 37
111	     7.2.   Definitions . . . . . . . . . . . . . . . . . . . . . . . 38
112	     7.3.   MAC Security Attribute  . . . . . . . . . . . . . . . . . 38
113	       7.3.1.  Delegations  . . . . . . . . . . . . . . . . . . . . . 39
114	       7.3.2.  Permission Checking  . . . . . . . . . . . . . . . . . 39
115	       7.3.3.  Object Creation  . . . . . . . . . . . . . . . . . . . 39
116	       7.3.4.  Existing Objects . . . . . . . . . . . . . . . . . . . 40
117	       7.3.5.  Label Changes  . . . . . . . . . . . . . . . . . . . . 40
118	     7.4.   pNFS Considerations . . . . . . . . . . . . . . . . . . . 40
119	     7.5.   Discovery of Server LNFS Support  . . . . . . . . . . . . 41
120	     7.6.   MAC Security NFS Modes of Operation . . . . . . . . . . . 41
121	       7.6.1.  Full Mode  . . . . . . . . . . . . . . . . . . . . . . 42
122	       7.6.2.  Guest Mode . . . . . . . . . . . . . . . . . . . . . . 43
123	     7.7.   Security Considerations . . . . . . . . . . . . . . . . . 43
124	   8.  Sharing change attribute implementation details with NFSv4
125	       clients  . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
126	     8.1.   Introduction  . . . . . . . . . . . . . . . . . . . . . . 44
127	   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 44
128	   10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 44
129	     10.1.  Error Definitions . . . . . . . . . . . . . . . . . . . . 45
130	       10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 45
131	       10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 45
132	       10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 46
133	   11. New File Attributes  . . . . . . . . . . . . . . . . . . . . . 46
134	     11.1.  New RECOMMENDED Attributes - List and Definition
135	            References  . . . . . . . . . . . . . . . . . . . . . . . 46
136	     11.2.  Attribute Definitions . . . . . . . . . . . . . . . . . . 47
137	   12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 50
138	   13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 53
139	     13.1.  Operation 59: COPY - Initiate a server-side copy  . . . . 53
140	     13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy  . . 61
141	     13.3.  Operation 61: COPY_NOTIFY - Notify a source server of
142	            a future copy . . . . . . . . . . . . . . . . . . . . . . 62
143	     13.4.  Operation 62: COPY_REVOKE - Revoke a destination
144	            server's copy privileges  . . . . . . . . . . . . . . . . 64
145	     13.5.  Operation 63: COPY_STATUS - Poll for status of a
146	            server-side copy  . . . . . . . . . . . . . . . . . . . . 65
147	     13.6.  Modification to Operation 42: EXCHANGE_ID -
148	            Instantiate Client ID . . . . . . . . . . . . . . . . . . 66
149	     13.7.  Operation 64: INITIALIZE  . . . . . . . . . . . . . . . . 67
150	     13.8.  Operation 67: IO_ADVISE - Application I/O access
151	            pattern hints . . . . . . . . . . . . . . . . . . . . . . 71
152	     13.9.  Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 77
153	     13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 80
154	     13.11. Operation 66: SEEK  . . . . . . . . . . . . . . . . . . . 85
155	   14. NFSv4.2 Callback Operations  . . . . . . . . . . . . . . . . . 86
156	     14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that
157	            the File's Attributes Changed . . . . . . . . . . . . . . 86
158	     14.2.  Operation 15: CB_COPY - Report results of a
159	            server-side copy  . . . . . . . . . . . . . . . . . . . . 87
160	   15. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 89
161	   16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 89
162	     16.1.  Normative References  . . . . . . . . . . . . . . . . . . 89
163	     16.2.  Informative References  . . . . . . . . . . . . . . . . . 90
164	   Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . . . 91
165	   Appendix B.  RFC Editor Notes  . . . . . . . . . . . . . . . . . . 92
166	   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 92

168	1.  Introduction

170	1.1.  The NFS Version 4 Minor Version 2 Protocol

172	   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
173	   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
174	   version, NFSv4.0, is described in [10] and the second minor version,
175	   NFSv4.1, is described in [2].  NFSv4.2 follows the guidelines for
176	   minor versioning that are listed in Section 11 of [10].

178	   As a minor version, NFSv4.2 is consistent with the overall goals for
179	   NFSv4, but extends the protocol so as to better meet those goals,
180	   based on experiences with NFSv4.1.  In addition, NFSv4.2 has adopted
181	   some additional goals, which motivate some of the major extensions in
182	   NFSv4.2.

184	1.2.  Scope of This Document

186	   This document describes the NFSv4.2 protocol.  With respect to
187	   NFSv4.0 and NFSv4.1, this document does not:

189	   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to
190	      contrast with NFSv4.2.

192	   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

194	   o  clarify the NFSv4.0 or NFSv4.1 protocols.  That is, any
195	      clarifications made here apply to NFSv4.2 and to neither of the
196	      prior protocols.

198	   The full XDR for NFSv4.2 is presented in [3].

200	1.3.  NFSv4.2 Goals

202	   The goal of the design of NFSv4.2 is to take common local filesystem
203	   features and offer them remotely.  These features might

205	   o  already be available on the servers, e.g., sparse files

207	   o  be under development as a new standard, e.g., SEEK_HOLE and
208	      SEEK_DATA

210	   o  be used by clients with the servers via some proprietary means,
211	      e.g., Labeled NFS

213	   but the clients are not able to leverage them on the server within
214	   the confines of the NFS protocol.

216	1.4.  Overview of NFSv4.2 Features

218	   [[Comment.1: This needs fleshing out! --TH]]

220	1.4.1.  Sparse Files

222	   Two new operations are defined to support the reading of sparse files
223	   (READ_PLUS) and the punching of holes to remove backing storage
224	   (INITIALIZE).
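
   The idea behind these two operations can be sketched as follows.
   This is a minimal model, not the wire format: the names
   read_segments, punch_hole, and payload_bytes and the block size
   are hypothetical, and an all-zero block stands in for a hole,
   whereas a real server would consult its allocation map.

```python
from typing import List

BLOCK = 4  # hypothetical block size in bytes, kept small for illustration


def read_segments(content: bytes, block: int = BLOCK) -> List[tuple]:
    """Describe `content` as a list of data and hole segments.

    An all-zero block is reported as ("hole", offset, length); any other
    block is reported as ("data", offset, payload).  Adjacent segments
    of the same kind are merged.
    """
    segments: List[tuple] = []
    for off in range(0, len(content), block):
        chunk = content[off:off + block]
        is_hole = not any(chunk)
        if segments and segments[-1][0] == "hole" and is_hole:
            _, start, length = segments[-1]
            segments[-1] = ("hole", start, length + len(chunk))
        elif segments and segments[-1][0] == "data" and not is_hole:
            _, start, payload = segments[-1]
            segments[-1] = ("data", start, payload + chunk)
        elif is_hole:
            segments.append(("hole", off, len(chunk)))
        else:
            segments.append(("data", off, chunk))
    return segments


def punch_hole(content: bytes, offset: int, length: int) -> bytes:
    """Return a copy of `content` with [offset, offset + length) zeroed."""
    if offset >= len(content):
        return content
    end = min(offset + length, len(content))
    return content[:offset] + bytes(end - offset) + content[end:]


def payload_bytes(segments: List[tuple]) -> int:
    """Count the bytes that would actually have to be transferred."""
    return sum(len(seg[2]) for seg in segments if seg[0] == "data")
```

   In this model, a reader that receives hole descriptions only pays
   for the data segments, and punching a hole shrinks the payload of
   later reads.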

226	1.4.2.  Application I/O Advise

228	   We propose a new IO_ADVISE operation for NFSv4.2 that clients can use
229	   to communicate expected I/O behavior to the server.  By communicating
230	   future I/O behavior such as whether a file will be accessed
231	   sequentially or randomly, and whether a file will or will not be
232	   accessed in the near future, servers can optimize future I/O requests
233	   for a file by, for example, prefetching or evicting data.  This
234	   operation can be used to support the posix_fadvise function as well
235	   as other applications such as databases and video editors.
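
   For orientation, the local call that this operation is meant to
   support can be sketched with Python's os.posix_fadvise wrapper.
   The helper name advise_sequential is hypothetical, and the sketch
   only issues the hint to the local kernel; how a client would turn
   that hint into an IO_ADVISE request is not shown here.

```python
import os


def advise_sequential(path: str) -> bool:
    """Hint to the local kernel that `path` will be read sequentially.

    Returns True if the hint was issued and False if this platform
    does not expose posix_fadvise.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # An offset of 0 with a length of 0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    finally:
        os.close(fd)
    return True
```

   Other advice values, such as os.POSIX_FADV_RANDOM or
   os.POSIX_FADV_DONTNEED, correspond to the random-access and
   not-needed-soon cases mentioned above.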

237	1.5.  Differences from NFSv4.1

239	   In NFSv4.1, the only way to introduce new variants of an operation
240	   was to introduce a new operation.  For example, READ becomes either
241	   READ2 or READ_PLUS.  With the use of discriminated unions as
242	   parameters to such operations in NFSv4.2, it is possible to add a new
243	   arm in a subsequent minor version.  It is also possible to move such
244	   an operation from OPTIONAL/RECOMMENDED to REQUIRED.  Forcing an
245	   implementation to adopt each arm of a discriminated union at such a
246	   time does not meet the spirit of the minor versioning rules.  As
247	   such, new arms of a discriminated union MUST follow the same
248	   guidelines for minor versioning as operations in NFSv4.1 - i.e., they
249	   may not be made REQUIRED.  To support this, a new error code,
250	   NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to
251	   communicate to the client that the operation is supported, but the
252	   specific arm of the discriminated union is not.
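
   The distinction between an unsupported operation and an
   unsupported arm can be sketched as follows.  This is an
   illustrative model only: the class ArmAwareServer, its dispatch
   method, and the arm names "DATA", "HOLE", and "ADB" are
   hypothetical, and the status values are symbolic rather than the
   numeric codes assigned in the XDR.

```python
from enum import Enum
from typing import Dict, Set


class Status(Enum):
    NFS4_OK = "NFS4_OK"
    NFS4ERR_NOTSUPP = "NFS4ERR_NOTSUPP"              # operation not supported
    NFS4ERR_UNION_NOTSUPP = "NFS4ERR_UNION_NOTSUPP"  # arm not supported


class ArmAwareServer:
    """Toy server that knows which union arms it implements per operation."""

    def __init__(self, supported: Dict[str, Set[str]]):
        # Maps an operation name to the set of arm names it implements.
        self.supported = supported

    def dispatch(self, op: str, arm: str) -> Status:
        if op not in self.supported:
            return Status.NFS4ERR_NOTSUPP
        if arm not in self.supported[op]:
            # The operation exists, but this arm of its union does not.
            return Status.NFS4ERR_UNION_NOTSUPP
        return Status.NFS4_OK
```

   A client seeing NFS4ERR_UNION_NOTSUPP in this model can keep using
   the operation with a different arm, which it could not conclude
   from NFS4ERR_NOTSUPP.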

254	2.  NFS Server-side Copy

256	2.1.  Introduction

258	   This section describes a server-side copy feature for the NFS
259	   protocol.

261	   The server-side copy feature provides a mechanism for the NFS client
262	   to perform a file copy on the server without the data being
263	   transmitted back and forth over the network.

265	   Without this feature, an NFS client copies data from one location to
266	   another by reading the data from the server over the network, and
267	   then writing the data back over the network to the server.  Using
268	   this server-side copy operation, the client is able to instruct the
269	   server to copy the data locally without the data being sent back and
270	   forth over the network unnecessarily.
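
   The difference in network traffic can be illustrated with a toy
   model.  The class CountingServer and the function
   client_side_copy are hypothetical names; the model counts only
   file data crossing the client/server link and ignores request and
   reply overhead.

```python
class CountingServer:
    """Toy file server that counts file data exchanged with the client."""

    def __init__(self):
        self.files = {}
        self.wire_bytes = 0  # file data sent to or received from the client

    def read(self, name: str, offset: int, count: int) -> bytes:
        data = self.files[name][offset:offset + count]
        self.wire_bytes += len(data)
        return data

    def write(self, name: str, offset: int, data: bytes) -> None:
        self.wire_bytes += len(data)
        current = self.files.get(name, b"")
        self.files[name] = current[:offset] + data + current[offset + len(data):]

    def copy(self, src: str, dst: str) -> None:
        # Server-local copy: no file data crosses the wire.
        self.files[dst] = self.files[src]


def client_side_copy(server: CountingServer, src: str, dst: str,
                     chunk: int = 1024) -> None:
    """Copy by reading each chunk to the client and writing it back."""
    offset = 0
    while True:
        data = server.read(src, offset, chunk)
        if not data:
            break
        server.write(dst, offset, data)
        offset += len(data)
```

   In this model, the client-side copy moves every byte of the file
   across the network twice, while the server-local copy moves none.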

272	   In general, this feature is useful whenever data is copied from one
273	   location to another on the server.  It is particularly useful when
274	   copying the contents of a file from a backup.  Backup versions of a
275	   file are copied for a number of reasons, including restoring and
276	   cloning data.

278	   If the source object and destination object are on different file
279	   servers, the file servers will communicate with one another to
280	   perform the copy operation.  The server-to-server protocol by which
281	   this is accomplished is not defined in this document.

283	2.2.  Protocol Overview

285	   The server-side copy offload operations support both intra-server and
286	   inter-server file copies.  An intra-server copy is a copy in which
287	   the source file and destination file reside on the same server.  In
288	   an inter-server copy, the source file and destination file are on
289	   different servers.  In both cases, the copy may be performed
290	   synchronously or asynchronously.

292	   Throughout the rest of this document, we refer to the NFS server
293	   containing the source file as the "source server" and the NFS server
294	   to which the file is transferred as the "destination server".  In the
295	   case of an intra-server copy, the source server and destination
296	   server are the same server.  Therefore, in the context of an intra-
297	   server copy, the terms source server and destination server refer to
298	   the single server performing the copy.

300	   The operations described below are designed to copy files.  Other
301	   file system objects can be copied by building on these operations or
302	   using other techniques.  For example, if the user wishes to copy a
303	   directory, the client can synthesize a directory copy by first
304	   creating the destination directory and then copying the source
305	   directory's files to the new destination directory.  If the user
306	   wishes to copy a namespace junction [11] [12], the client can use the
307	   ONC RPC Federated Filesystem protocol [12] to perform the copy.
308	   Specifically, the client can determine the source junction's
309	   attributes using the FEDFS_LOOKUP_FSN procedure and create a
310	   duplicate junction using the FEDFS_CREATE_JUNCTION procedure.
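
   The directory case can be sketched as follows, assuming a toy
   namespace that maps each directory path to its files.  The names
   copy_directory and server_side_copy are hypothetical, the
   per-file copy stands in for a COPY operation, and subdirectories
   are not handled.

```python
from typing import Callable, Dict

# Toy namespace: directory path -> {file name -> file contents}
Namespace = Dict[str, Dict[str, bytes]]


def server_side_copy(ns: Namespace, src_dir: str, dst_dir: str,
                     name: str) -> None:
    """Stand-in for one per-file copy request."""
    ns[dst_dir][name] = ns[src_dir][name]


def copy_directory(ns: Namespace, src_dir: str, dst_dir: str,
                   copy_file: Callable[[Namespace, str, str, str], None]) -> int:
    """Synthesize a directory copy from per-file copies.

    Returns the number of files copied.
    """
    if dst_dir in ns:
        raise FileExistsError(dst_dir)
    # Step 1: create the destination directory.
    ns[dst_dir] = {}
    # Step 2: copy each of the source directory's files into it.
    names = sorted(ns[src_dir])
    for name in names:
        copy_file(ns, src_dir, dst_dir, name)
    return len(names)
```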

312	   For the inter-server copy protocol, the operations are defined to be
313	   compatible with a server-to-server copy protocol in which the
314	   destination server reads the file data from the source server.  This
315	   model in which the file data is pulled from the source by the
316	   destination has a number of advantages over a model in which the
317	   source pushes the file data to the destination.  The advantages of
318	   the pull model include:

320	   o  The pull model only requires a remote server (i.e., the
321	      destination server) to be granted read access.  A push model
322	      requires a remote server (i.e., the source server) to be granted
323	      write access, which is more privileged.

325	   o  The pull model allows the destination server to stop reading if it
326	      has run out of space.  In a push model, the destination server
327	      must flow control the source server in this situation.

329	   o  The pull model allows the destination server to easily flow
330	      control the data stream by adjusting the size of its read
331	      operations.  In a push model, the destination server does not have
332	      this ability.  The source server in a push model is capable of
333	      writing chunks larger than the destination server has requested in
334	      attributes and session parameters.  In theory, the destination
335	      server could perform a "short" write in this situation, but this
336	      approach is known to behave poorly in practice.
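
   The second and third advantages can be illustrated with a small
   simulation.  The classes PullSource and PullDestination and their
   parameters are hypothetical; the sketch only shows that, in a
   pull model, the destination chooses every read size and can stop
   without any cooperation from the source.

```python
class PullSource:
    """Toy source server that answers reads and records their sizes."""

    def __init__(self, data: bytes):
        self.data = data
        self.requested_sizes = []

    def read(self, offset: int, count: int) -> bytes:
        self.requested_sizes.append(count)
        return self.data[offset:offset + count]


class PullDestination:
    """Toy destination server with limited space and a chosen read size."""

    def __init__(self, capacity: int, read_size: int):
        self.capacity = capacity
        self.read_size = read_size
        self.stored = bytearray()

    def pull(self, source: PullSource) -> bool:
        """Pull the file; return True if all of it was copied."""
        offset = 0
        while True:
            free = self.capacity - len(self.stored)
            if free == 0:
                # Out of space: stop reading.  Probe one byte only to
                # learn whether the copy happened to be complete.
                return source.read(offset, 1) == b""
            chunk = source.read(offset, min(self.read_size, free))
            if not chunk:
                return True
            self.stored += chunk
            offset += len(chunk)
```

   Because the destination issues the reads, the source never
   delivers more data in one reply than the destination asked for.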

338	   The following operations are provided to support server-side copy:

340	   COPY_NOTIFY:  For inter-server copies, the client sends this
341	      operation to the source server to notify it of a future file copy
342	      from a given destination server for the given user.

344	   COPY_REVOKE:  Also for inter-server copies, the client sends this
345	      operation to the source server to revoke permission to copy a file
346	      for the given user.

348	   COPY:  Used by the client to request a file copy.

350	   COPY_ABORT:  Used by the client to abort an asynchronous file copy.

352	   COPY_STATUS:  Used by the client to poll the status of an
353	      asynchronous file copy.

355	   CB_COPY:  Used by the destination server to report the results of an
356	      asynchronous file copy to the client.

358	   These operations are described in detail in Section 2.3.  This
359	   section provides an overview of how these operations are used to
360	   perform server-side copies.
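
   As a compact summary, the operations above can be tabulated in
   code.  The operation numbers are those given in the section
   titles of this document; the CopyOp type and the helper
   operations_sent_by are hypothetical and exist only for
   illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CopyOp:
    name: str
    number: int   # CB_COPY is numbered in the callback operation space
    sender: str
    purpose: str


COPY_OPERATIONS = [
    CopyOp("COPY", 59, "client", "request a file copy"),
    CopyOp("COPY_ABORT", 60, "client", "abort an asynchronous file copy"),
    CopyOp("COPY_NOTIFY", 61, "client",
           "notify the source server of a future file copy"),
    CopyOp("COPY_REVOKE", 62, "client",
           "revoke permission to copy a file"),
    CopyOp("COPY_STATUS", 63, "client",
           "poll the status of an asynchronous file copy"),
    CopyOp("CB_COPY", 15, "destination server",
           "report the results of an asynchronous file copy"),
]


def operations_sent_by(sender: str) -> List[str]:
    """List the names of the operations issued by `sender`."""
    return [op.name for op in COPY_OPERATIONS if op.sender == sender]
```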

362	2.2.1.  Intra-Server Copy

364	   To copy a file on a single server, the client uses a COPY operation.
365	   The server may respond to the copy operation with the final results
366	   of the copy or it may perform the copy asynchronously and deliver the
367	   results using a CB_COPY operation callback.  If the copy is performed
368	   asynchronously, the client may poll the status of the copy using
369	   COPY_STATUS or cancel the copy using COPY_ABORT.

371	   A synchronous intra-server copy is shown in Figure 1.  In this
372	   example, the NFS server chooses to perform the copy synchronously.
373	   The copy operation is completed, either successfully or
374	   unsuccessfully, before the server replies to the client's request.
375	   The server's reply contains the final result of the operation.

377	     Client                                  Server
378	        +                                      +
379	        |                                      |
380	        |--- COPY ---------------------------->| Client requests
381	        |<------------------------------------/| a file copy
382	        |                                      |
383	        |                                      |

385	                Figure 1: A synchronous intra-server copy.

387	   An asynchronous intra-server copy is shown in Figure 2.  In this
388	   example, the NFS server performs the copy asynchronously.  The
389	   server's reply to the copy request indicates that the copy operation
390	   was initiated and the final result will be delivered at a later time.
391	   The server's reply also contains a copy stateid.  The client may use
392	   this copy stateid to poll for status information (as shown) or to
393	   cancel the copy using a COPY_ABORT.  When the server completes the
394	   copy, the server performs a callback to the client and reports the
395	   results.

397	     Client                                  Server
398	        +                                      +
399	        |                                      |
400	        |--- COPY ---------------------------->| Client requests
401	        |<------------------------------------/| a file copy
402	        |                                      |
403	        |                                      |
404	        |--- COPY_STATUS --------------------->| Client may poll
405	        |<------------------------------------/| for status
406	        |                                      |
407	        |                  .                   | Multiple COPY_STATUS
408	        |                  .                   | operations may be sent.
409	        |                  .                   |
410	        |                                      |
411	        |<-- CB_COPY --------------------------| Server reports results
412	        |\------------------------------------>|
413	        |                                      |

415	               Figure 2: An asynchronous intra-server copy.
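
   The exchanges in Figures 1 and 2 can be modeled with a toy server.
   Everything named here is an assumption made for illustration: the
   class ToyCopyServer, the integer copy stateids, the reply
   dictionaries, and the do_background_work method that stands in
   for the server's asynchronous progress.

```python
import itertools


class ToyCopyServer:
    """Toy server that copies synchronously or in background steps."""

    def __init__(self, files, step=4, synchronous=False):
        self.files = files
        self.step = step                # bytes copied per background step
        self.synchronous = synchronous
        self.jobs = {}                  # copy stateid -> job description
        self._ids = itertools.count(1)

    def copy(self, src, dst, callback=None):
        """COPY: reply with the final result or with a copy stateid."""
        if self.synchronous:
            self.files[dst] = self.files[src]
            return {"done": True, "bytes_copied": len(self.files[src])}
        stateid = next(self._ids)
        self.files[dst] = b""
        self.jobs[stateid] = {"src": src, "dst": dst, "callback": callback}
        return {"done": False, "stateid": stateid}

    def copy_status(self, stateid):
        """COPY_STATUS: report how many bytes have been copied so far."""
        return len(self.files[self.jobs[stateid]["dst"]])

    def copy_abort(self, stateid):
        """COPY_ABORT: cancel an outstanding asynchronous copy."""
        self.jobs.pop(stateid)

    def do_background_work(self):
        """Advance every outstanding copy by one step."""
        for stateid, job in list(self.jobs.items()):
            src = self.files[job["src"]]
            done = len(self.files[job["dst"]])
            self.files[job["dst"]] = src[:done + self.step]
            if len(self.files[job["dst"]]) == len(src):
                del self.jobs[stateid]
                if job["callback"] is not None:
                    # CB_COPY: report the results to the client.
                    job["callback"](stateid, len(src))
```

   A client in this model inspects the "done" field of the COPY
   reply: if it is true, the reply carries the final result, and
   otherwise the client keeps the stateid for COPY_STATUS or
   COPY_ABORT and waits for the callback.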

417	2.2.2.  Inter-Server Copy

419	   A copy may also be performed between two servers.  The copy protocol
420	   is designed to accommodate a variety of network topologies.  As shown
421	   in Figure 3, the client and servers may be connected by multiple
422	   networks.  In particular, the servers may be connected by a
423	   specialized, high-speed network (network 192.168.33.0/24 in the
424	   diagram) that does not include the client.  The protocol allows the
425	   client to set up the copy between the servers (over network
426	   10.11.78.0/24 in the diagram) and for the servers to communicate on
427	   the high-speed network if they choose to do so.

429	                             192.168.33.0/24
430	                 +-------------------------------------+
431	                 |                                     |
432	                 |                                     |
433	                 | 192.168.33.18                       | 192.168.33.56
434	         +-------+------+                       +------+------+
435	         |     Source   |                       | Destination |
436	         +-------+------+                       +------+------+
437	                 | 10.11.78.18                         | 10.11.78.56
438	                 |                                     |
439	                 |                                     |
440	                 |             10.11.78.0/24           |
441	                 +------------------+------------------+
442	                                    |
443	                                    |
444	                                    | 10.11.78.243
445	                              +-----+-----+
446	                              |   Client  |
447	                              +-----------+

449	            Figure 3: An example inter-server network topology.

451	   For an inter-server copy, the client notifies the source server that
452	   a file will be copied by the destination server using a COPY_NOTIFY
453	   operation.  The client then initiates the copy by sending the COPY
454	   operation to the destination server.  The destination server may
455	   perform the copy synchronously or asynchronously.

457	   A synchronous inter-server copy is shown in Figure 4.  In this case,
458	   the destination server chooses to perform the copy before responding
459	   to the client's COPY request.

461	   An asynchronous copy is shown in Figure 5.  In this case, the
462	   destination server chooses to respond to the client's COPY request
463	   immediately and then perform the copy asynchronously.

465	     Client                Source         Destination
466	        +                    +                 +
467	        |                    |                 |
468	        |--- COPY_NOTIFY --->|                 |
469	        |<------------------/|                 |
470	        |                    |                 |
471	        |                    |                 |
472	        |--- COPY ---------------------------->|
473	        |                    |                 |
474	        |                    |                 |
475	        |                    |<----- read -----|
476	        |                    |\--------------->|
477	        |                    |                 |
478	        |                    |        .        | Multiple reads may
479	        |                    |        .        | be necessary
480	        |                    |        .        |
481	        |                    |                 |
482	        |                    |                 |
483	        |<------------------------------------/| Destination replies
484	        |                    |                 | to COPY

486	                Figure 4: A synchronous inter-server copy.

488	     Client                Source         Destination
489	        +                    +                 +
490	        |                    |                 |
491	        |--- COPY_NOTIFY --->|                 |
492	        |<------------------/|                 |
493	        |                    |                 |
494	        |                    |                 |
495	        |--- COPY ---------------------------->|
496	        |<------------------------------------/|
497	        |                    |                 |
498	        |                    |                 |
499	        |                    |<----- read -----|
500	        |                    |\--------------->|
501	        |                    |                 |
502	        |                    |        .        | Multiple reads may
503	        |                    |        .        | be necessary
504	        |                    |        .        |
505	        |                    |                 |
506	        |                    |                 |
507	        |--- COPY_STATUS --------------------->| Client may poll
508	        |<------------------------------------/| for status
509	        |                    |                 |
510	        |                    |        .        | Multiple COPY_STATUS
511	        |                    |        .        | operations may be sent
512	        |                    |        .        |
513	        |                    |                 |
514	        |                    |                 |
515	        |                    |                 |
516	        |<-- CB_COPY --------------------------| Destination reports
517	        |\------------------------------------>| results
518	        |                    |                 |

520	               Figure 5: An asynchronous inter-server copy.

522	2.2.3.  Server-to-Server Copy Protocol

524	   During an inter-server copy, the destination server reads the file
525	   data from the source server.  The source server and destination
526	   server are not required to use a specific protocol to transfer the
527	   file data.  The choice of what protocol to use is ultimately the
528	   destination server's decision.

530	2.2.3.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

532	   The destination server MAY use standard NFSv4.x (where x >= 1) to
533	   read the data from the source server.  If NFSv4.x is used for the
534	   server-to-server copy protocol, the destination server can use the
535	   filehandle contained in the COPY request with standard NFSv4.x
536	   operations to read data from the source server.  Specifically, the
537	   destination server may use the NFSv4.x OPEN operation's CLAIM_FH
538	   facility to open the file being copied and obtain an open stateid.
539	   Using the stateid, the destination server may then use NFSv4.x READ
540	   operations to read the file.

542	2.2.3.2.  Using an Alternative Server-to-Server Copy Protocol

544	   In a homogeneous environment, the source and destination servers
545	   might be able to perform the file copy extremely efficiently using
546	   specialized protocols.  For example the source and destination
547	   servers might be two nodes sharing a common file system format for
548	   the source and destination file systems.  Thus the source and
549	   destination are in an ideal position to efficiently render the image
550	   of the source file to the destination file by replicating the file
551	   system formats at the block level.  Another possibility is that the
552	   source and destination might be two nodes sharing a common storage
553	   area network, and thus there is no need to copy any data at all, and
554	   instead ownership of the file and its contents might simply be re-
555	   assigned to the destination.  To allow for these possibilities, the
556	   destination server is allowed to use a server-to-server copy protocol
557	   of its choice.

559	   In a heterogeneous environment, using a protocol other than NFSv4.x
560	   (e.g., HTTP [13] or FTP [14]) presents some challenges.  In
561	   particular, the destination server is presented with the challenge of
562	   accessing the source file given only an NFSv4.x filehandle.

564	   One option for protocols that identify source files with path names
565	   is to use an ASCII hexadecimal representation of the source
566	   filehandle as the file name.

568	   Another option for the source server is to use URLs to direct the
569	   destination server to a specialized service.  For example, the
570	   response to COPY_NOTIFY could include the URL
571	   ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII
572	   hexadecimal representation of the source filehandle.  When the
573	   destination server receives the source server's URL, it would use
574	   "_FH/0x12345" as the file name to pass to the FTP server listening on
575	   port 9999 of s1.example.com.  On port 9999 there would be a special
576	   instance of the FTP service that understands how to convert NFS
577	   filehandles to an open file descriptor (in many operating systems,
578	   this would require a new system call, one which is the inverse of the
579	   makefh() function that the pre-NFSv4 MOUNT service needs).
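
   The URL-based option above can be sketched as follows.  This is a
   minimal illustration, not part of the protocol: the helper names
   (fh_to_name, name_to_fh, copy_url, ftp_target) are hypothetical, and
   the sketch encodes whole bytes, so the filehandle shown in the text
   as 0x12345 appears here as 0x012345.

```python
from typing import Tuple
from urllib.parse import urlsplit


def fh_to_name(fh: bytes) -> str:
    """Return an ASCII hexadecimal name for a filehandle."""
    return "0x" + fh.hex()


def name_to_fh(name: str) -> bytes:
    """Inverse of fh_to_name(); raises ValueError on malformed input."""
    if not name.startswith("0x"):
        raise ValueError("missing 0x prefix")
    return bytes.fromhex(name[2:])


def copy_url(host: str, port: int, fh: bytes) -> str:
    """Build a URL that a source server might return from COPY_NOTIFY."""
    return f"ftp://{host}:{port}/_FH/{fh_to_name(fh)}"


def ftp_target(url: str) -> Tuple[str, int, str]:
    """Split the URL into (host, port, file name) for an FTP client."""
    parts = urlsplit(url)
    return parts.hostname, parts.port, parts.path.lstrip("/")
```

   The destination server would pass the third element returned by
   ftp_target() as the file name; how the special FTP instance maps
   that name back to an open file is left to the implementation.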

581	   Authenticating and identifying the destination server to the source
582	   server is also a challenge.  Recommendations for how to accomplish
583	   this are given in Section 2.4.1.2.4 and Section 2.4.1.4.

585	2.3.  Operations

587	   In the sections that follow, several operations are defined that
588	   together provide the server-side copy feature.  These operations are
589	   intended to be OPTIONAL operations as defined in section 17 of [2].
590	   The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS
591	   operations are designed to be sent within an NFSv4 COMPOUND
592	   procedure.  The CB_COPY operation is designed to be sent within an
593	   NFSv4 CB_COMPOUND procedure.

595	   Each operation is performed in the context of the user identified by
596	   the ONC RPC credential of its containing COMPOUND or CB_COMPOUND
597	   request.  For example, a COPY_ABORT operation issued by a given user
598	   indicates that a specified COPY operation initiated by the same user
599	   be canceled.  Therefore a COPY_ABORT MUST NOT interfere with a copy
600	   of the same file initiated by another user.

602	   An NFS server MAY allow an administrative user to monitor or cancel
603	   copy operations using an implementation-specific interface.

605	2.3.1.  netloc4 - Network Locations

607	   The server-side copy operations specify network locations using the
608	   netloc4 data type shown below:

610	   enum netloc_type4 {
611	           NL4_NAME        = 0,
612	           NL4_URL         = 1,
613	           NL4_NETADDR     = 2
614	   };
615	   union netloc4 switch (netloc_type4 nl_type) {
616	           case NL4_NAME:          utf8str_cis nl_name;
617	           case NL4_URL:           utf8str_cis nl_url;
618	           case NL4_NETADDR:       netaddr4    nl_addr;
619	   };

621	   If the netloc4 is of type NL4_NAME, the nl_name field MUST be
622	   specified as a UTF-8 string.  The nl_name is expected to be resolved
623	   to a network address via DNS, LDAP, NIS, /etc/hosts, or some other
624	   means.  If the netloc4 is of type NL4_URL, a server URL [4]
625	   appropriate for the server-to-server copy operation is specified as a
626	   UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr
627	   field MUST contain a valid netaddr4 as defined in Section 3.3.9 of
628	   [2].

630	   When netloc4 values are used for an inter-server copy as shown in
631	   Figure 3, their values may be evaluated on the source server,
632	   destination server, and client.  The network environment in which
633	   these systems operate should be configured so that the netloc4 values
634	   are interpreted as intended on each system.
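
   As an illustration of the netloc4 union on the wire, the sketch
   below encodes each arm using the usual XDR rules (4-byte big-endian
   discriminant, variable-length strings padded to a multiple of four
   bytes).  It assumes netaddr4 is a pair of strings (network id and
   universal address); the function names are hypothetical.

```python
import struct

NL4_NAME, NL4_URL, NL4_NETADDR = 0, 1, 2


def xdr_opaque(data: bytes) -> bytes:
    """Encode variable-length opaque data: length, bytes, zero padding."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_netloc4(nl_type: int, value) -> bytes:
    """Encode a netloc4; value is a str, or (netid, addr) for NL4_NETADDR."""
    head = struct.pack(">I", nl_type)
    if nl_type in (NL4_NAME, NL4_URL):
        return head + xdr_opaque(value.encode("utf-8"))
    if nl_type == NL4_NETADDR:
        netid, addr = value
        return (head + xdr_opaque(netid.encode("ascii"))
                + xdr_opaque(addr.encode("ascii")))
    raise ValueError(f"unknown netloc_type4 {nl_type}")
```

   For example, encode_netloc4(NL4_NETADDR, ("tcp", "10.11.78.56.8.1"))
   would describe the destination server of Figure 3 at TCP port 2049.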

636	2.3.2.  Copy Offload Stateids

638	   A server may perform a copy offload operation asynchronously.  An
639	   asynchronous copy is tracked using a copy offload stateid.  Copy
640	   offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS,
641	   and CB_COPY operations.

643	   Section 8.2.4 of [2] specifies that stateids are valid until either
644	   (A) the client or server restarts or (B) the client returns the
645	   resource.

647	   A copy offload stateid will be valid until either (A) the client or
648	   server restarts or (B) the client returns the resource by issuing a
649	   COPY_ABORT operation or the client replies to a CB_COPY operation.

651	   A copy offload stateid's seqid MUST NOT be 0 (zero).  In the context
652	   of a copy offload operation, it is ambiguous to indicate the most
653	   recent copy offload operation using a stateid with seqid of 0 (zero).
654	   Therefore a copy offload stateid with seqid of 0 (zero) MUST be
655	   considered invalid.
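
   The lifetime and seqid rules above could be tracked by a server
   along the following lines.  This is a sketch under assumptions: the
   Stateid4 and CopyOffloadTable names are hypothetical, and a real
   server would tie this state to the client and session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid4:
    seqid: int    # 32-bit sequence number
    other: bytes  # 12 opaque bytes


class CopyOffloadTable:
    """Hypothetical table of copy offload stateids that are still valid."""

    def __init__(self) -> None:
        self._live = set()

    def issue(self, sid: Stateid4) -> None:
        if sid.seqid == 0:
            raise ValueError("copy offload stateid seqid MUST NOT be 0")
        self._live.add(sid)

    def is_valid(self, sid: Stateid4) -> bool:
        # A seqid of 0 is always invalid for copy offload stateids.
        return sid.seqid != 0 and sid in self._live

    def release(self, sid: Stateid4) -> None:
        """Called for COPY_ABORT or once the client replies to CB_COPY."""
        self._live.discard(sid)

    def restart(self) -> None:
        """A server restart invalidates every outstanding stateid."""
        self._live.clear()
```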

657	2.4.  Security Considerations

659	   The security considerations pertaining to NFSv4 [10] apply to this
660	   document.

662	   The standard security mechanisms provided by NFSv4 [10] may be used to
663	   secure the protocol described in this document.

665	   NFSv4 clients and servers supporting the inter-server copy
666	   operations described in this document are REQUIRED to implement [5],
667	   including the RPCSEC_GSSv3 privileges copy_from_auth and
668	   copy_to_auth.  If the server-to-server copy protocol is ONC RPC
669	   based, the servers are also REQUIRED to implement the RPCSEC_GSSv3
670	   privilege copy_confirm_auth.  These requirements to implement are not
671	   requirements to use.  NFSv4 clients and servers are RECOMMENDED to
672	   use [5] to secure server-side copy operations.

674	2.4.1.  Inter-Server Copy Security

676	2.4.1.1.  Requirements for Secure Inter-Server Copy

678	   Inter-server copy is driven by several requirements:

680	   o  The specification MUST NOT mandate an inter-server copy protocol.
681	      There are many ways to copy data.  Some will be more optimal than
682	      others depending on the identities of the source server and
683	      destination server.  For example the source and destination
684	      servers might be two nodes sharing a common file system format for
685	      the source and destination file systems.  Thus the source and
686	      destination are in an ideal position to efficiently render the
687	      image of the source file to the destination file by replicating
688	      the file system formats at the block level.  In other cases, the
689	      source and destination might be two nodes sharing a common storage
690	      area network, and thus there is no need to copy any data at all,
691	      and instead ownership of the file and its contents simply gets re-
692	      assigned to the destination.

694	   o  The specification MUST provide guidance for using NFSv4.x as a
695	      copy protocol.  For those source and destination servers willing
696	      to use NFSv4.x there are specific security considerations that
697	      this specification can and does address.

699	   o  The specification MUST NOT mandate pre-configuration between the
700	      source and destination server.  Requiring that the source and
701	      destination first have a "copying relationship" increases the
702	      administrative burden.  However the specification MUST NOT
703	      preclude implementations that require pre-configuration.

705	   o  The specification MUST NOT mandate a trust relationship between
706	      the source and destination server.  The NFSv4 security model
707	      requires mutual authentication between a principal on an NFS
708	      client and a principal on an NFS server.  This model MUST continue
709	      with the introduction of COPY.

711	2.4.1.2.  Inter-Server Copy with RPCSEC_GSSv3

713	   When the client sends a COPY_NOTIFY to the source server to expect
714	   the destination to attempt to copy data from the source server, it is
715	   expected that this copy is being done on behalf of the principal
716	   (called the "user principal") that sent the RPC request that encloses
717	   the COMPOUND procedure that contains the COPY_NOTIFY operation.  The
718	   user principal is identified by the RPC credentials.  What is needed
719	   is a mechanism that allows the user principal to authorize the
720	   destination server to perform the copy, in a manner that lets the
721	   source server properly authenticate the destination's copy and does
722	   not allow the destination to exceed its authorization.

724	   An approach that sends delegated credentials of the client's user
725	   principal to the destination server is not used for the following
726	   reasons.  If the client's user delegated its credentials, the
727	   destination would authenticate as the user principal.  If the
728	   destination were using the NFSv4 protocol to perform the copy, then
729	   the source server would authenticate the destination server as the
730	   user principal, and the file copy would securely proceed.  However,
731	   this approach would allow the destination server to copy other files.
732	   The user principal would have to trust the destination server to not
733	   do so.  This is counter to the requirements, and therefore is not
734	   considered.  Instead an approach using RPCSEC_GSSv3 [5] privileges is
735	   proposed.

737	   One of the stated applications of the proposed RPCSEC_GSSv3 protocol
738	   is compound client host and user authentication [+ privilege
739	   assertion].  For inter-server file copy, we require compound NFS
740	   server host and user authentication [+ privilege assertion].  The
741	   distinction between the two is one without meaning.

743	   RPCSEC_GSSv3 introduces the notion of privileges.  We define three
744	   privileges:

746	   copy_from_auth:  A user principal is authorizing a source principal
747	      ("nfs@<source>") to allow a destination principal ("nfs@
748	      <destination>") to copy a file from the source to the destination.
749	      This privilege is established on the source server before the user
750	      principal sends a COPY_NOTIFY operation to the source server.

752	   struct copy_from_auth_priv {
753	           secret4             cfap_shared_secret;
754	           netloc4             cfap_destination;
755	           /* the NFSv4 user name that the user principal maps to */
756	           utf8str_mixed       cfap_username;
757	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
758	           unsigned int        cfap_seq_num;
759	   };

761	      cfap_shared_secret is a secret value the user principal generates.

763	   copy_to_auth:  A user principal is authorizing a destination
764	      principal ("nfs@<destination>") to allow it to copy a file from
765	      the source to the destination.  This privilege is established on
766	      the destination server before the user principal sends a COPY
767	      operation to the destination server.

769	   struct copy_to_auth_priv {
770	           /* equal to cfap_shared_secret */
771	           secret4              ctap_shared_secret;
772	           netloc4              ctap_source;
773	           /* the NFSv4 user name that the user principal maps to */
774	           utf8str_mixed        ctap_username;
775	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
776	           unsigned int         ctap_seq_num;
777	   };

779	      ctap_shared_secret is a secret value the user principal generated
780	      and was used to establish the copy_from_auth privilege with the
781	      source principal.

783	   copy_confirm_auth:  A destination principal is confirming with the
784	      source principal that it is authorized to copy data from the
785	      source on behalf of the user principal.  When the inter-server
786	      copy protocol is NFSv4, or for that matter, any protocol capable
787	      of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol),
788	      this privilege is established before the file is copied from the
789	      source to the destination.

791	   struct copy_confirm_auth_priv {
792	           /* equal to GSS_GetMIC() of cfap_shared_secret */
793	           opaque              ccap_shared_secret_mic<>;
794	           /* the NFSv4 user name that the user principal maps to */
795	           utf8str_mixed       ccap_username;
796	           /* equal to seq_num of rpc_gss_cred_vers_3_t */
797	           unsigned int        ccap_seq_num;
798	   };

800	2.4.1.2.1.  Establishing a Security Context

802	   When the user principal wants to COPY a file between two servers, if
803	   it has not established copy_from_auth and copy_to_auth privileges on
804	   the servers, it establishes them:

806	   o  The user principal generates a secret it will share with the two
807	      servers.  This shared secret will be placed in the
808	      cfap_shared_secret and ctap_shared_secret fields of the
809	      appropriate privilege data types, copy_from_auth_priv and
810	      copy_to_auth_priv.

812	   o  An instance of copy_from_auth_priv is filled in with the shared
813	      secret, the destination server, and the NFSv4 user id of the user
814	      principal.  It will be sent with an RPCSEC_GSS3_CREATE procedure,
815	      and so cfap_seq_num is set to the seq_num of the credential of the
816	      RPCSEC_GSS3_CREATE procedure.  Because cfap_shared_secret is a
817	      secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with
818	      privacy) is invoked on copy_from_auth_priv.  The
819	      RPCSEC_GSS3_CREATE procedure's arguments are:

821	      struct {
822	         rpc_gss3_gss_binding    *compound_binding;
823	         rpc_gss3_chan_binding   *chan_binding_mic;
824	         rpc_gss3_assertion      assertions<>;
825	         rpc_gss3_extension      extensions<>;
826	      } rpc_gss3_create_args;

828	      The string "copy_from_auth" is placed in assertions[0].privs.  The
829	      output of GSS_Wrap() is placed in extensions[0].data.  The field
830	      extensions[0].critical is set to TRUE.  The source server calls
831	      GSS_Unwrap() on the privilege, and verifies that the seq_num
832	      matches the credential.  It then verifies that the NFSv4 user id
833	      being asserted matches the source server's mapping of the user
834	      principal.  If it does, the privilege is established on the source
835	      server as: <"copy_from_auth", user id, destination>.  The
836	      successful reply to RPCSEC_GSS3_CREATE has:

838	      struct {
839	         opaque                  handle<>;
840	         rpc_gss3_chan_binding   *chan_binding_mic;
841	         rpc_gss3_assertion      granted_assertions<>;
842	         rpc_gss3_assertion      server_assertions<>;
843	         rpc_gss3_extension      extensions<>;
844	      } rpc_gss3_create_res;

846	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
847	      use on COPY_NOTIFY requests involving the source and destination
848	      server.  The field granted_assertions[0].privs will be equal to
849	      "copy_from_auth".  The server will return a GSS_Wrap() of
850	      copy_to_auth_priv.

852	   o  An instance of copy_to_auth_priv is filled in with the shared
853	      secret, the source server, and the NFSv4 user id.  It will be sent
854	      with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set
855	      to the seq_num of the credential of the RPCSEC_GSS3_CREATE
856	      procedure.  Because ctap_shared_secret is a secret, after XDR
857	      encoding copy_to_auth_priv, GSS_Wrap() is invoked on
858	      copy_to_auth_priv.  The RPCSEC_GSS3_CREATE procedure's arguments
859	      are:

861	      struct {
862	         rpc_gss3_gss_binding    *compound_binding;
863	         rpc_gss3_chan_binding   *chan_binding_mic;
864	         rpc_gss3_assertion      assertions<>;
865	         rpc_gss3_extension      extensions<>;
866	      } rpc_gss3_create_args;

868	      The string "copy_to_auth" is placed in assertions[0].privs.  The
869	      output of GSS_Wrap() is placed in extensions[0].data.  The field
870	      extensions[0].critical is set to TRUE.  After unwrapping,
871	      verifying the seq_num, and the user principal to NFSv4 user ID
872	      mapping, the destination establishes a privilege of
873	      <"copy_to_auth", user id, source>.  The successful reply to
874	      RPCSEC_GSS3_CREATE has:

876	      struct {
877	         opaque                  handle<>;
878	         rpc_gss3_chan_binding   *chan_binding_mic;
879	         rpc_gss3_assertion      granted_assertions<>;
880	         rpc_gss3_assertion      server_assertions<>;
881	         rpc_gss3_extension      extensions<>;
882	      } rpc_gss3_create_res;

884	      The field "handle" is the RPCSEC_GSSv3 handle that the client will
885	      use on COPY requests involving the source and destination server.
886	      The field granted_assertions[0].privs will be equal to
887	      "copy_to_auth".  The server will return a GSS_Wrap() of
888	      copy_to_auth_priv.

890	2.4.1.2.2.  Starting a Secure Inter-Server Copy

892	   When the client sends a COPY_NOTIFY request to the source server, it
893	   uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle.
894	   cna_destination_server in COPY_NOTIFY MUST be the same as the name of
895	   the destination server specified in copy_from_auth_priv.  Otherwise,
896	   COPY_NOTIFY will fail with NFS4ERR_ACCESS.  The source server
897	   verifies that the privilege <"copy_from_auth", user id, destination>
898	   exists, and annotates it with the source filehandle, if the user
899	   principal has read access to the source file, and if administrative
900	   policies give the user principal and the NFS client read access to
901	   the source file (i.e., if the ACCESS operation would grant read
902	   access).  Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS.

904	   When the client sends a COPY request to the destination server, it
905	   uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle.
906	   ca_source_server in COPY MUST be the same as the name of the source
907	   server specified in copy_to_auth_priv.  Otherwise, COPY will fail
908	   with NFS4ERR_ACCESS.  The destination server verifies that the
909	   privilege <"copy_to_auth", user id, source> exists, and annotates it
910	   with the source and destination filehandles.  If the client has
911	   failed to establish the "copy_to_auth" privilege, the destination
912	   server will reject the request with NFS4ERR_PARTNER_NO_AUTH.

914	   If the client sends a COPY_REVOKE to the source server to rescind the
915	   destination server's copy privilege, it uses the privileged
916	   "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server
917	   in COPY_REVOKE MUST be the same as the name of the destination server
918	   specified in copy_from_auth_priv.  The source server will then delete
919	   the <"copy_from_auth", user id, destination> privilege and fail any
920	   subsequent copy requests sent under the auspices of this privilege
921	   from the destination server.
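
   The source server's bookkeeping for the "copy_from_auth" privilege
   can be modeled roughly as below.  This is an in-memory sketch only:
   the SourceServer class and its methods are hypothetical, status
   codes are represented as strings, the read-access check is reduced
   to a set lookup, and the binding of the privilege to an RPCSEC_GSSv3
   handle is omitted.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_ACCESS = "NFS4ERR_ACCESS"


class SourceServer:
    """Toy model of <"copy_from_auth", user id, destination> privileges."""

    def __init__(self, readable):
        # (user, destination) -> set of filehandles annotated on the privilege
        self._privs = {}
        # set of (user, filehandle) pairs; stands in for the ACCESS check
        self._readable = set(readable)

    def establish_copy_from_auth(self, user, destination):
        self._privs.setdefault((user, destination), set())

    def copy_notify(self, user, cna_destination_server, fh):
        annotated = self._privs.get((user, cna_destination_server))
        if annotated is None or (user, fh) not in self._readable:
            return NFS4ERR_ACCESS
        annotated.add(fh)  # annotate the privilege with the source filehandle
        return NFS4_OK

    def copy_revoke(self, user, cra_destination_server):
        # Deleting the privilege makes later copy requests under it fail.
        self._privs.pop((user, cra_destination_server), None)

    def may_copy(self, user, destination, fh):
        return fh in self._privs.get((user, destination), ())
```

   In this model a COPY_NOTIFY naming a destination other than the one
   in the privilege, or a file the user cannot read, yields
   NFS4ERR_ACCESS, and may_copy() returns False once COPY_REVOKE has
   been processed.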

923	2.4.1.2.3.  Securing ONC RPC Server-to-Server Copy Protocols

925	   After a destination server has a "copy_to_auth" privilege established
926	   on it, and it receives a COPY request, if it knows it will use an ONC
927	   RPC protocol to copy data, it will establish a "copy_confirm_auth"
928	   privilege on the source server, using nfs@<destination> as the
929	   initiator principal, and nfs@<source> as the target principal.

931	   The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of
932	   the shared secret passed in the copy_to_auth privilege.  The field
933	   ccap_username is the mapping of the user principal to an NFSv4 user
934	   name ("user"@"domain" form), and MUST be the same as ctap_username
935	   and cfap_username.  The field ccap_seq_num is the seq_num of the
936	   RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the
937	   destination will send to the source server to establish the
938	   privilege.

940	   The source server verifies the privilege, and establishes a
941	   <"copy_confirm_auth", user id, destination> privilege.  If the source
942	   server fails to verify the privilege, the COPY operation will be
943	   rejected with NFS4ERR_PARTNER_NO_AUTH.  All subsequent ONC RPC
944	   requests sent from the destination to copy data from the source to
945	   the destination will use the RPCSEC_GSSv3 handle returned by the
946	   source's RPCSEC_GSS3_CREATE response.

948	   Note that the use of the "copy_confirm_auth" privilege accomplishes
949	   the following:

951	   o  if a protocol like NFS is being used, with export policies, export
952	      policies can be overridden in case the destination server as-an-
953	      NFS-client is not authorized

955	   o  manual configuration to allow a copy relationship between the
956	      source and destination is not needed.

958	   If the attempt to establish a "copy_confirm_auth" privilege fails,
959	   then when the user principal sends a COPY request to destination, the
960	   destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

962	2.4.1.2.4.  Securing Non ONC RPC Server-to-Server Copy Protocols

964	   If the destination won't be using ONC RPC to copy the data, then the
965	   source and destination are using an unspecified copy protocol.  The
966	   destination could use the shared secret and the NFSv4 user id to
967	   prove to the source server that the user principal has authorized the
968	   copy.

970	   For protocols that authenticate user names with passwords (e.g., HTTP
971	   [13] and FTP [14]), the NFSv4 user id could be used as the user name,
972	   and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
973	   secret could be used as the user password or as input into non-
974	   password authentication methods like CHAP [15].
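
   One way the user name and password might be derived is sketched
   below.  The function names are hypothetical, and the use of an HTTP
   Basic Authorization header is an assumption made purely for
   illustration; the text above does not mandate any particular
   authentication scheme.

```python
import base64
from typing import Tuple


def copy_login(nfs4_user: str, shared_secret: bytes) -> Tuple[str, str]:
    """Return (user name, password) derived from the copy privileges."""
    # The password is the ASCII hexadecimal form of the shared secret.
    return nfs4_user, shared_secret.hex()


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

   Because such a header exposes the secret to anyone who can observe
   the connection, it would only be reasonable over a protected
   transport.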

976	2.4.1.3.  Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

978	   ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
979	   server-side copy offload operations described in this document.  In
980	   particular, host-based ONC RPC security flavors such as AUTH_NONE and
981	   AUTH_SYS MAY be used.  If a host-based security flavor is used, a
982	   minimal level of protection for the server-to-server copy protocol is
983	   possible.

985	   In the absence of strong security mechanisms such as RPCSEC_GSSv3,
986	   the challenge is how the source server and destination server
987	   identify themselves to each other, especially in the presence of
988	   multi-homed source and destination servers.  In a multi-homed
989	   environment, the destination server might not contact the source
990	   server from the same network address specified by the client in the
991	   COPY_NOTIFY.  This can be overcome using the procedure described
992	   below.

994	   When the client sends the source server the COPY_NOTIFY operation,
995	   the source server may reply to the client with a list of target
996	   addresses, names, and/or URLs and assign them to the unique
997	   quadruple: <random number, source fh, user ID, destination address
998	   Y>.  If the destination uses one of these target netlocs to contact
999	   the source server, the source server will be able to uniquely
1000	   identify the destination server, even if the destination server does
1001	   not connect from the address specified by the client in COPY_NOTIFY.
1002	   The level of assurance in this identification depends on the
1003	   unpredictability, strength and secrecy of the random number.

1005	   For example, suppose the network topology is as shown in Figure 3.
1006	   If the source filehandle is 0x12345, the source server may respond to
1007	   a COPY_NOTIFY for destination 10.11.78.56 with the URLs:

1009	      nfs://10.11.78.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/_FH/
1010	      0x12345

1012	      nfs://192.168.33.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/
1013	      _FH/0x12345

1015	   The name component after _COPY is 24 characters of base64, more than
1016	   enough to encode a 128-bit random number.

1018	   The client will then send these URLs to the destination server in the
1019	   COPY operation.  Suppose that the 192.168.33.0/24 network is a high
1020	   speed network and the destination server decides to transfer the file
1021	   over this network.  If the destination contacts the source server
1022	   from 192.168.33.56 over this network using NFSv4.1, it does the
1023	   following:

1025	   COMPOUND  { PUTROOTFH ; LOOKUP "_COPY" ; LOOKUP
1026	      "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "10.11.78.56" ; LOOKUP "_FH" ;
1027	      OPEN "0x12345" ; GETFH }

1029	   Provided that the random number is unpredictable and has been kept
1030	   secret by the parties involved, the source server will therefore know
1031	   that these NFSv4.x operations are being issued by the destination
1032	   server identified in the COPY_NOTIFY.  This random number technique
1033	   only provides initial authentication of the destination server, and
1034	   cannot defend against man-in-the-middle attacks after authentication
1035	   or an eavesdropper that observes the random number on the wire.
1036	   Other secure communication techniques (e.g., IPsec) are necessary to
1037	   block these attacks.
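
   The random number technique can be sketched as follows.  The helper
   names are hypothetical, and the choice of 18 random bytes with the
   URL-safe base64 alphabet is an assumption: it yields 24 characters
   without padding and avoids "/" inside a path component.

```python
import base64
import secrets


def new_copy_token(nbytes: int = 18) -> str:
    """Return an unpredictable token; 18 bytes encode to 24 characters."""
    return base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode("ascii")


def copy_urls(source_addrs, token, destination, fh_name):
    """Build one URL per source address for the COPY_NOTIFY response."""
    return [f"nfs://{addr}//_COPY/{token}/{destination}/_FH/{fh_name}"
            for addr in source_addrs]


def lookup_components(url):
    """Split a URL into (host, names to LOOKUP, name to OPEN)."""
    rest = url[len("nfs://"):]
    host, _, path = rest.partition("//")
    *dirs, leaf = path.split("/")
    return host, dirs, leaf
```

   Applied to the second URL in the example above, lookup_components()
   returns the host 192.168.33.18, the four names used in the LOOKUP
   operations, and "0x12345" as the name passed to OPEN.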

1039	2.4.1.4.  Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

1041	   The same techniques as Section 2.4.1.3, using unique URLs for each
1042	   destination server, can be used for other protocols (e.g., HTTP [13]
1043	   and FTP [14]) as well.

1045	3.  Support for Application IO Hints

1047	3.1.  Introduction

1049	   Applications currently have several options for communicating I/O
1050	   access patterns to the NFS client.  While this can help the NFS
1051	   client optimize I/O and caching for a file, it does not allow the NFS
1052	   server and its exported file system to do likewise.  Therefore, here
1053	   we put forth a proposal for the NFSv4.2 protocol to allow
1054	   applications to communicate their expected behavior to the server.

1056	   Given the expected access pattern (e.g., sequential or random) and
1057	   data re-use behavior (e.g., a data range will be read multiple times
1058	   and should be cached), the server will be able to better understand
1059	   what optimizations it should implement for access to a file.  For
1060	   example, if an application indicates it will never read the data
1061	   more than once, then the file system can avoid polluting the data
1062	   cache and not cache the data.

1064	   The first interface through which applications can issue I/O hints
1065	   is the posix_fadvise function.  For example, on Linux, when an
1066	   application uses posix_fadvise to specify that a file will be read
1067	   sequentially, Linux doubles the readahead buffer size.

1069	   Another instance where applications provide an indication of their
1070	   desired I/O behavior is the use of direct I/O. By specifying direct
1071	   I/O, clients will no longer cache data, but this information is not
1072	   passed to the server, which will continue caching data.

1074	   Application-specific NFS clients such as those used by hypervisors
1075	   and databases can also leverage application hints to communicate
1076	   their specialized requirements.

1078	   This section adds a new IO_ADVISE operation to communicate the client
1079	   file access patterns to the NFS server.  The NFS server upon
1080	   receiving an IO_ADVISE operation MAY choose to alter its I/O and
1081	   caching behavior, but is under no obligation to do so.

1083	3.2.  POSIX Requirements

1085	   The first key requirement of the IO_ADVISE operation is to support
1086	   the posix_fadvise function [6], which is supported in Linux and many
1087	   other operating systems.  Examples and guidance on how to use
1088	   posix_fadvise to improve performance can be found in [16].
1089	   posix_fadvise is defined as follows:

1091	      int posix_fadvise(int fd, off_t offset, off_t len, int advice);

1093	   The posix_fadvise() function shall advise the implementation on the
1094	   expected behavior of the application with respect to the data in the
1095	   file associated with the open file descriptor, fd, starting at offset
1096	   and continuing for len bytes.  The specified range need not currently
1097	   exist in the file.  If len is zero, all data following offset is
1098	   specified.  The implementation may use this information to optimize
1099	   handling of the specified data.  The posix_fadvise() function shall
1100	   have no effect on the semantics of other operations on the specified
1101	   data, although it may affect the performance of other operations.

1103	   The advice to be applied to the data is specified by the advice
1104	   parameter and may be one of the following values:

1106	   POSIX_FADV_NORMAL -  Specifies that the application has no advice to
1107	      give on its behavior with respect to the specified data.  It is
1108	      the default characteristic if no advice is given for an open file.

1110	   POSIX_FADV_SEQUENTIAL -  Specifies that the application expects to
1111	      access the specified data sequentially from lower offsets to
1112	      higher offsets.

1114	   POSIX_FADV_RANDOM -  Specifies that the application expects to access
1115	      the specified data in a random order.

1117	   POSIX_FADV_WILLNEED -  Specifies that the application expects to
1118	      access the specified data in the near future.

1120	   POSIX_FADV_DONTNEED -  Specifies that the application expects that it
1121	      will not access the specified data in the near future.

1123	   POSIX_FADV_NOREUSE -  Specifies that the application expects to
1124	      access the specified data once and then not reuse it thereafter.

1126	   Upon successful completion, posix_fadvise() shall return zero;
1127	   otherwise, an error number shall be returned to indicate the error.
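
   As an illustration only, the call described above can be exercised
   from Python, whose os module wraps posix_fadvise on platforms that
   provide it.  The helper name advise_sequential and the scratch file
   are hypothetical; the sketch assumes a POSIX system and simply
   reports whether the hint was accepted.

```python
import os
import tempfile


def advise_sequential(path: str) -> bool:
    """Hint that the whole file at `path` will be read sequentially.

    Returns True if the hint was accepted, False if the platform lacks
    posix_fadvise or the call was rejected.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # platform without posix_fadvise()
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "all data following offset".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


# Usage: create a scratch file, advise, then read it front to back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 65536)
    scratch_path = tmp.name

hint_applied = advise_sequential(scratch_path)
with open(scratch_path, "rb") as f:
    total_read = sum(len(chunk) for chunk in iter(lambda: f.read(8192), b""))
os.unlink(scratch_path)
```

   The hint never changes what the reads return; it can only affect
   how quickly they complete.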

1129	3.3.  Additional Requirements

1131	   Many use cases exist for sending application I/O hints to the server
1132	   that cannot utilize the POSIX supported interface.  This is because
1133	   some applications may benefit from additional hints not specified by
1134	   posix_fadvise, and some applications may not use POSIX at all.

1136	   One use case is "Opportunistic Prefetch", which allows a stateid
1137	   holder to tell the server that it is possible that it will access the
1138	   specified data in the near future.  This is similar to
1139	   POSIX_FADV_WILLNEED, but the client is unsure it will in fact read
1140	   the specified data, so the server should only prefetch the data if it
1141	   can be done at a marginal cost.  For example, when a server receives
1142	   this hint, it could prefetch only the indirect blocks for a file
1143	   instead of all the data.  This would still improve performance if the
1144	   client does read the data, but with less pressure on server memory.

1146	   An example use case for this hint is a database that reads in a
1147	   single record that points to additional records in either other areas
1148	   of the same file or different files located on the same or a
1149	   different server.  While it is likely that the application will
1150	   access the additional records, it is far from guaranteed.  Therefore,
1151	   the database may issue an opportunistic prefetch (instead of
1152	   POSIX_FADV_WILLNEED) for the data in the other files pointed to by
1153	   the record.

1155	   Another use case is "Direct I/O", which allows a stateid holder to
1156	   inform the server that it does not wish to cache data.  Today, for
1157	   applications that only intend to read data once, the use of direct
1158	   I/O disables client caching, but does not affect server caching.  By
1159	   caching data that will not be re-read, the server is polluting its
1160	   cache and possibly causing useful cached data to be evicted.  By
1161	   informing the server of its expected I/O access, this situation can
1162	   be avoided.  Direct I/O can be used in Linux and AIX via the open()
1163	   O_DIRECT parameter, in Solaris via the directio() function, and in
1164	   Windows via the CreateFile() FILE_FLAG_NO_BUFFERING flag.

1166	   Another use case is "Backward Sequential Read", which allows a
1167	   stateid holder to inform the server that it intends to read the
1168	   specified data backwards, i.e., from the end to the beginning.  This
1169	   is different than POSIX_FADV_SEQUENTIAL, whose implied intention was
1170	   that data will be read from beginning to end.  This hint allows
1171	   servers to prefetch data at the end of the range first, and then
1172	   prefetch data sequentially in a backwards manner to the start of the
1173	   data range.  One example of an application that can make use of this
1174	   hint is video editing.
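
   One possible reading of the backward prefetch order is sketched
   below.  The function name backward_prefetch_order and the chunk
   size are assumptions made for illustration; a server is free to
   schedule its prefetching differently.

```python
from typing import Iterator, Tuple


def backward_prefetch_order(offset: int, length: int,
                            chunk: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) chunks covering [offset, offset + length),
    starting with the chunk at the end of the range and moving toward
    the start of the range."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    end = offset + length
    while end > offset:
        start = max(offset, end - chunk)
        yield (start, end - start)
        end = start


# A 10-byte range prefetched in 4-byte chunks, last chunk first.
order = list(backward_prefetch_order(offset=0, length=10, chunk=4))
```

   For the range above, the chunks are produced as (6, 4), (2, 4),
   and (0, 2), so the data nearest the end of the range is fetched
   first.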

1176	3.4.  Security Considerations

1178	   None.

1180	3.5.  IANA Considerations

1182	   The IO_ADVISE_type4 will be extended through an IANA registry.

1184	4.  Sparse Files

1185	4.1.  Introduction

1187	   A sparse file is a common way of representing a large file without
1188	   having to utilize all of the disk space for it.  Consequently, a
1189	   sparse file uses less physical space than its size indicates.  This
1190	   means the file contains 'holes', byte ranges within the file that
1191	   contain no data.  Most modern file systems support sparse files,
1192	   including most UNIX file systems and NTFS, but notably not Apple's
1193	   HFS+.  Common examples of sparse files include Virtual Machine (VM)
1194	   OS/disk images, database files, log files, and even checkpoint
1195	   recovery files most commonly used by the HPC community.

1197	   If an application reads a hole in a sparse file, the file system must
1198	   return all zeros to the application.  For local data access there is
1199	   little penalty, but with NFS these zeroes must be transferred back to
1200	   the client.  If an application uses the NFS client to read data into
1201	   memory, this wastes time and bandwidth as the application waits for
1202	   the zeroes to be transferred.

1204	   A sparse file is typically created by initializing the file to be all
1205	   zeros; nothing is written to the data in the file, and the hole is
1206	   instead recorded in the metadata for the file.  So an 8G disk image
1207	   might be represented initially by a couple hundred bits in the inode
1208	   and nothing on the disk.  If the VM then writes 100M to a file in the
1209	   middle of the image, there would now be two holes represented in the
1210	   metadata and 100M in the data.
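
   The disk image example can be sketched as a small extent
   calculation.  The representation of extents as (offset, length)
   pairs and the choice of 4G as the write location are assumptions
   made for illustration; real file systems record holes in their own
   metadata formats.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (offset, length) in bytes

G = 1024 ** 3
M = 1024 ** 2


def holes(file_size: int, data: List[Extent]) -> List[Extent]:
    """Return the byte ranges of the file not covered by a data extent."""
    result = []
    cursor = 0
    for off, length in sorted(data):
        if off > cursor:
            result.append((cursor, off - cursor))
        cursor = max(cursor, off + length)
    if cursor < file_size:
        result.append((cursor, file_size - cursor))
    return result


image_size = 8 * G
empty_image = holes(image_size, [])                  # one hole, whole file
after_write = holes(image_size, [(4 * G, 100 * M)])  # hole, data, hole
```

   Before the write the whole image is a single hole; after it, the
   100M of data splits that hole in two.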

1212	   Two new operations INITIALIZE (Section 13.7) and READ_PLUS
1213	   (Section 13.10) are introduced.  INITIALIZE allows for the creation
1214	   of a sparse file and for hole punching.  An application might want to
1215	   zero out a range of the file.  READ_PLUS supports all the features of
1216	   READ but includes an extension to support sparse pattern files
1217	   (Section 6.1.2).  READ_PLUS is guaranteed to perform no worse than
1218	   READ, and can dramatically improve performance with sparse files.
1219	   READ_PLUS does not depend on pNFS protocol features, but can be used
1220	   by pNFS to support sparse files.

1222	4.2.  Terminology

1224	   Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1226	   Sparse file:  A Regular file that contains one or more Holes.

1228	   Hole:  A byte range within a Sparse file that contains regions of all
1229	      zeroes.  For block-based file systems, this could also be an
1230	      unallocated region of the file.

1232	   Hole Threshold:  The minimum length of a Hole as determined by the
1233	      server.  If a server chooses to define a Hole Threshold, then it
1234	      would not return hole information about holes with a length
1235	      shorter than the Hole Threshold.
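
   The effect of a Hole Threshold can be sketched as a filter over the
   holes a server knows about.  The function name reportable_holes and
   the (offset, length) representation are hypothetical.

```python
from typing import List, Tuple


def reportable_holes(holes: List[Tuple[int, int]],
                     hole_threshold: int) -> List[Tuple[int, int]]:
    """Keep only holes at least as long as the server's Hole Threshold.

    Shorter holes are not reported as holes; their bytes would instead
    be returned as ordinary zero-filled data.
    """
    return [(off, length) for off, length in holes
            if length >= hole_threshold]


known_holes = [(0, 512), (4096, 8192), (65536, 4096)]
reported = reportable_holes(known_holes, hole_threshold=4096)
```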

1237	5.  Space Reservation

1239	5.1.  Introduction

1241	   This section describes a set of operations that allow applications
1242	   such as hypervisors to reserve space for a file, report the amount of
1243	   actual disk space a file occupies and free up the backing space of a
1244	   file when it is not required.  In virtualized environments, virtual
1245	   disk files are often stored on NFS mounted volumes.  Since virtual
1246	   disk files represent the hard disks of virtual machines, hypervisors
1247	   often have to guarantee certain properties for the file.

1249	   One such example is space reservation.  When a hypervisor creates a
1250	   virtual disk file, it often tries to preallocate the space for the
1251	   file so that there are no future allocation related errors during the
1252	   operation of the virtual machine.  Such errors prevent a virtual
1253	   machine from continuing execution and result in downtime.

1255	   Currently, in order to achieve such a guarantee, applications zero
1256	   the entire file.  The initial zeroing allocates the backing blocks
1257	   and all subsequent writes are overwrites of already allocated blocks.
1258	   This approach is not only inefficient in terms of the amount of I/O
1259	   done, it is also not guaranteed to work on filesystems that are log
1260	   structured or deduplicated.  An efficient way of guaranteeing space
1261	   reservation would be beneficial to such applications.

1263	   If the space_reserved attribute (see Section 11.2.3) is set on a
1264	   file, it is guaranteed that writes that do not grow the file will not
1265	   fail with NFS4ERR_NOSPC.

1267	   Another useful feature would be the ability to report the number of
1268	   blocks that would be freed when a file is deleted.  Currently, NFS
1269	   reports two size attributes:

1271	   size  The logical file size of the file.

1273	   space_used  The size in bytes that the file occupies on disk

1275	   While these attributes are sufficient for space accounting in
1276	   traditional filesystems, they prove to be inadequate in modern
1277	   filesystems that support block sharing.  In such filesystems,
1278	   multiple inodes can point to a single block with a block reference
1279	   count to guard against premature freeing.  Having a way to tell the
1280	   number of blocks that would be freed if the file was deleted would be
1281	   useful to applications that wish to migrate files when a volume is
1282	   low on space.

1284	   Since virtual disks represent a hard drive in a virtual machine, a
1285	   virtual disk can be viewed as a filesystem within a file.  Since not
1286	   all blocks within a filesystem are in use, there is an opportunity to
1287	   reclaim blocks that are no longer in use.  A call to deallocate
1288	   blocks could result in better space efficiency.  Less space may be
1289	   consumed for backups after block deallocation.

1291	   The following operations and attributes can be used to resolve these
1292	   issues:

1294	   space_reserved  This attribute specifies whether the blocks backing
1295	      the file have been preallocated.

1297	   space_freed  This attribute specifies the space freed when a file is
1298	      deleted, taking block sharing into consideration.

1300	   INITIALIZE  This operation zeroes and/or deallocates the blocks
1301	      backing a region of the file.

1303	   If space_used of a file is interpreted to mean the size in bytes of
1304	   all disk blocks pointed to by the inode of the file, then shared
1305	   blocks get double counted, over-reporting the space utilization.
1306	   This also has the adverse effect that the deletion of a file with
1307	   shared blocks frees up less than space_used bytes.

1309	   On the other hand, if space_used is interpreted to mean the size in
1310	   bytes of those disk blocks unique to the inode of the file, then
1311	   shared blocks are not counted in any file, resulting in under-
1312	   reporting of the space utilization.

1314	   For example, two files A and B have 10 blocks each.  Let 6 of these
1315	   blocks be shared between them.  Thus, the combined space utilized by
1316	   the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1317	   combined space utilization of the two files would be reported as 20 *
1318	   BLOCK_SIZE.  However, deleting either would only result in 4 *
1319	   BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1320	   report that the space utilization is only 8 * BLOCK_SIZE.

1322	   Adding another size attribute, space_freed (see Section 11.2.4), is
1323	   helpful in solving this problem. space_freed is the number of blocks
1324	   that are allocated to the given file that would be freed on its
1325	   deletion.  In the example, both A and B would report space_freed as 4
1326	   * BLOCK_SIZE and space_used as 10 * BLOCK_SIZE.  If A is deleted, B
1327	   will report space_freed as 10 * BLOCK_SIZE as the deletion of B would
1328	   result in the deallocation of all 10 blocks.
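
   The arithmetic in this example can be checked with a short sketch.
   Files are modeled as sets of block numbers, and BLOCK_SIZE, the
   block numbering, and the helper names are assumptions made for
   illustration.

```python
from collections import Counter
from typing import Dict, Set

BLOCK_SIZE = 4096  # assumed block size in bytes


def reference_counts(files: Dict[str, Set[int]]) -> Counter:
    """Count how many files point at each block number."""
    counts: Counter = Counter()
    for blocks in files.values():
        counts.update(blocks)
    return counts


def space_used(files: Dict[str, Set[int]], name: str) -> int:
    """Bytes in all blocks the file points to, shared or not."""
    return len(files[name]) * BLOCK_SIZE


def space_freed(files: Dict[str, Set[int]], name: str) -> int:
    """Bytes in blocks that only this file points to."""
    counts = reference_counts(files)
    unshared = [b for b in files[name] if counts[b] == 1]
    return len(unshared) * BLOCK_SIZE


# A owns blocks 0-9 and B owns blocks 4-13, so blocks 4-9 are shared.
files = {"A": set(range(0, 10)), "B": set(range(4, 14))}
actual_total = len(reference_counts(files)) * BLOCK_SIZE    # 14 blocks
double_counted = sum(space_used(files, n) for n in files)   # 20 blocks
unique_only = sum(space_freed(files, n) for n in files)     # 8 blocks
freed_a = space_freed(files, "A")                           # 4 blocks
del files["A"]
freed_b_after = space_freed(files, "B")                     # 10 blocks
```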

1330	   The addition of this attribute doesn't solve the problem of space
1331	   being over-reported.  However, over-reporting is better than under-
1332	   reporting.

1334	6.  Application Data Block Support

1336	   At the OS level, files are contained on disk blocks.  Applications
1337	   are also free to impose structure on the data contained in a file and
1338	   we can define an Application Data Block (ADB) to be such a structure.
1339	   From the application's viewpoint, it only wants to handle ADBs and
1340	   not raw bytes (see [17]).  An ADB is typically comprised of two
1341	   sections: a header and data.  The header describes the
1342	   characteristics of the block and can provide a means to detect
1343	   corruption in the data payload.  The data section is typically
1344	   initialized to all zeros.

1346	   The format of the header is application specific, but there are two
1347	   main components typically encountered:

1349	   1.  An ADB Number (ADBN), which allows the application to determine
1350	       which data block is being referenced.  The ADBN is a logical
1351	       block number and is useful when the client is not storing the
1352	       blocks in contiguous memory.

1354	   2.  Fields to describe the state of the ADB and a means to detect
1355	       block corruption.  For both pieces of data, a useful property is
1356	       that allowed values be chosen so that, if passed across the
1357	       network, corruption due to translation between big and little
1358	       endian architectures is detectable.  For example, 0xF0DEDEF0 has
1359	       the same bit pattern in both architectures, making it unsuitable.
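
   Whether a 32-bit value is affected by a byte-order translation can
   be checked directly.  The helper names below are hypothetical.  A
   value whose four bytes read the same forwards and backwards, such
   as 0xF0DEDEF0, is unchanged by the swap, so a mistaken translation
   of that value would go unnoticed.

```python
def byteswap32(value: int) -> int:
    """Reinterpret a 32-bit value with its four bytes reversed."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def unchanged_by_byteswap(value: int) -> bool:
    """True if the value looks identical in both byte orders."""
    return byteswap32(value) == value


symmetric = unchanged_by_byteswap(0xF0DEDEF0)    # bytes f0 de de f0
asymmetric = unchanged_by_byteswap(0xFEEDFACE)   # bytes fe ed fa ce
swapped = byteswap32(0xFEEDFACE)
```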

1361	   Applications already impose structures on files [17] and detect
1362	   corruption in data blocks [18].  What they are not able to do is
1363	   efficiently transfer and store ADBs.  To initialize a file with ADBs,
1364	   the client must send the full ADB to the server and that must be
1365	   stored on the server.  When the application is initializing a file to
1366	   have the ADB structure, it could compress the ADBs to just the
1367	   information necessary to later reconstruct the header portion of
1368	   the ADB when the contents are read back.  Using sparse file
1369	   techniques, the disk blocks described by the ADBs would not be
1370	   allocated.  Unlike sparse file techniques, there would be a small
1371	   cost to store the compressed header data.

1373	   In this section, we are going to define a generic framework for an
1374	   ADB, present one approach to detecting corruption in a given ADB
1375	   implementation, and describe the model for how the client and server
1376	   can support efficient initialization of ADBs, reading of ADB holes,
1377	   punching holes in ADBs, and space reservation.

1379	6.1.  Generic Framework

1381	   We want the representation of the ADB to be flexible enough to
1382	   support many different applications.  The most basic approach is no
1383	   imposition of a block at all, which means we are working with the raw
1384	   bytes.  Such an approach would be useful for storing holes, punching
1385	   holes, etc.  In more complex deployments, a server might be
1386	   supporting multiple applications, each with their own definition of
1387	   the ADB.  One might store the ADBN at the start of the block and then
1388	   have a guard pattern to detect corruption [19].  The next might store
1389	   the ADBN at an offset of 100 bytes within the block and have no guard
1390	   pattern at all.  I.e., existing applications might already have well
1391	   defined formats for their data blocks.

1393	   The guard pattern can be used to represent the state of the block, to
1394	   protect against corruption, or both.  Again, it needs to be able to
1395	   be placed anywhere within the ADB.

1397	   We need to be able to represent the starting offset of the block and
1398	   the size of the block.  Note that nothing prevents the application
1399	   from defining different sized blocks in a file.

1401	6.1.1.  Data Block Representation

1403	   struct app_data_block4 {
1404	           offset4         adb_offset;
1405	           length4         adb_block_size;
1406	           length4         adb_block_count;
1407	           length4         adb_reloff_blocknum;
1408	           count4          adb_block_num;
1409	           length4         adb_reloff_pattern;
1410	           opaque          adb_pattern<>;
1411	   };

1413	   The app_data_block4 structure captures the abstraction presented for
1414	   the ADB.  The additional fields present are to allow the transmission
1415	   of adb_block_count ADBs at one time.  We also use adb_block_num to
1416	   convey the ADBN of the first block in the sequence.  Each ADB will
1417	   contain the same adb_pattern string.

1419	   As both adb_block_num and adb_pattern are optional, if either
1420	   adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
1421	   then the corresponding field is not set in any of the ADBs.
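
   A minimal sketch of how one app_data_block4 could be expanded into
   the blocks it describes follows.  The class and method names are
   hypothetical.  The sketch also assumes that the ADBN is stored as
   an 8-byte big-endian integer and that it increases by one for each
   successive block; the structure itself fixes neither choice.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF


@dataclass
class AppDataBlock4:
    adb_offset: int
    adb_block_size: int
    adb_block_count: int
    adb_reloff_blocknum: int
    adb_block_num: int
    adb_reloff_pattern: int
    adb_pattern: bytes

    def materialize(self) -> Iterator[Tuple[int, bytes]]:
        """Yield (file offset, block contents) for each described ADB."""
        for i in range(self.adb_block_count):
            block = bytearray(self.adb_block_size)
            if self.adb_reloff_blocknum != NFS4_UINT64_MAX:
                pos = self.adb_reloff_blocknum
                # Assumed encoding: 8-byte big-endian block number.
                number = self.adb_block_num + i
                block[pos:pos + 8] = number.to_bytes(8, "big")
            if self.adb_reloff_pattern != NFS4_UINT64_MAX:
                pos = self.adb_reloff_pattern
                block[pos:pos + len(self.adb_pattern)] = self.adb_pattern
            yield self.adb_offset + i * self.adb_block_size, bytes(block)


# 100 blocks of 4k, block number at offset 0, pattern at offset 8.
adb = AppDataBlock4(0, 4096, 100, 0, 0, 8, bytes.fromhex("feedface"))
blocks = list(adb.materialize())
```

   Setting both relative offsets to NFS4_UINT64_MAX yields blocks that
   are entirely zero, which is the raw-bytes case with no block
   structure imposed.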

1423	6.1.2.  Data Content

1425	   /*
1426	    * Use an enum such that we can extend new types.
1427	    */
1428	   enum data_content4 {
1429	           NFS4_CONTENT_DATA = 0,
1430	           NFS4_CONTENT_APP_BLOCK = 1,
1431	           NFS4_CONTENT_HOLE = 2
1432	   };

1434	   New operations might need to differentiate between wanting to access
1435	   data versus an ADB.  Also, future minor versions might want to
1436	   introduce new data formats.  This enumeration allows that to occur.

1438	6.2.  pNFS Considerations

1440	   While this document does not mandate how sparse ADBs are recorded on
1441	   the server, it does make the assumption that such information is not
1442	   in the file.  I.e., the information is metadata.  As such, the
1443	   INITIALIZE operation is defined to be not supported by the DS - it
1444	   must be issued to the MDS.  But since the client must not assume a
1445	   priori whether a read is sparse or not, the READ_PLUS operation MUST
1446	   be supported by both the DS and the MDS.  I.e., the client might
1447	   impose on the MDS to asynchronously read the data from the DS.

1449	   Furthermore, each DS MUST NOT report to a client a sparse ADB which
1450	   belongs to another DS.  One implication of this requirement is that
1451	   the app_data_block4's adb_block_size MUST either be the stripe
1452	   width or the stripe width must be an even multiple of it.  The second
1453	   implication here is that the DS must be able to use the Control
1454	   Protocol to determine from the MDS where the sparse ADBs occur.
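
   The first implication can be expressed as a divisibility check.
   The function name is hypothetical, and the sketch reads "even
   multiple" as "whole-number multiple", which is an interpretation
   rather than something the text states outright.

```python
def block_size_fits_stripe(adb_block_size: int, stripe_width: int) -> bool:
    """True if the stripe width is a whole-number multiple of the ADB
    block size, so that no ADB straddles two data servers."""
    if adb_block_size <= 0 or stripe_width <= 0:
        return False
    return stripe_width % adb_block_size == 0


equal_sizes = block_size_fits_stripe(4096, 4096)
many_per_stripe = block_size_fits_stripe(4096, 65536)
straddles = block_size_fits_stripe(4096, 6144)
block_too_large = block_size_fits_stripe(8192, 4096)
```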

1456	6.3.  An Example of Detecting Corruption

1458	   In this section, we define an ADB format in which corruption can be
1459	   detected.  Note that this is just one possible format and means to
1460	   detect corruption.

1462	   Consider a very basic implementation of an operating system's disk
1463	   blocks.  A block is either data or it is an indirect block which
1464	   allows for files to be larger than one block.  It is desired to be
1465	   able to initialize a block.  Lastly, to quickly unlink a file, a
1466	   block can be marked invalid.  The contents remain intact - which
1467	   would enable this OS application to undelete a file.

1469	   The application defines 4k sized data blocks, with an 8 byte block
1470	   counter occurring at offset 0 in the block, and with the guard
1471	   pattern occurring at offset 8 inside the block.  Furthermore, the
1472	   guard pattern can take one of four states:

1474	   0xfeedface -   This is the FREE state and indicates that the ADB
1475	      format has been applied.

1477	   0xcafedead -   This is the DATA state and indicates that real data
1478	      has been written to this block.

1480	   0xe4e5c001 -   This is the INDIRECT state and indicates that the
1481	      block contains block counter numbers that are chained off of this
1482	      block.

1484	   0xba1ed4a3 -   This is the INVALID state and indicates that the block
1485	      contains data whose contents are garbage.

1487	   Finally, it also defines an 8 byte checksum [20] starting at byte 16
1488	   which applies to the remaining contents of the block.  If the state
1489	   is FREE, then that checksum is trivially zero.  As such, the
1490	   application has no need to transfer the checksum implicitly inside
1491	   the ADB - it need not make the transfer layer aware of the fact that
1492	   there is a checksum (see [18] for an example of checksums used to
1493	   detect corruption in application data blocks).

1495	   Corruption in each ADB can be detected thusly:

1497	   o  If the guard pattern is anything other than one of the allowed
1498	      values, including all zeros.

1500	   o  If the guard pattern is FREE and any other byte in the remainder
1501	      of the ADB is anything other than zero.

1503	   o  If the guard pattern is anything other than FREE, then if the
1504	      stored checksum does not match the computed checksum.

1506	   o  If the guard pattern is INDIRECT and one of the stored indirect
1507	      block numbers has a value greater than the number of ADBs in the
1508	      file.

1510	   o  If the guard pattern is INDIRECT and one of the stored indirect
1511	      block numbers is a duplicate of another stored indirect block
1512	      number.
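
   The first three checks in the list above can be sketched as
   follows.  The checksum algorithm is not specified here, so the
   sketch substitutes a CRC-32 value zero-extended to 8 bytes purely
   as a stand-in.  It also assumes the checksum covers the bytes from
   offset 24 to the end of the block, and it omits the two INDIRECT
   checks because the layout of the stored block numbers is not
   defined.  All function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096
GUARD_OFFSET = 8       # guard pattern follows the 8-byte block counter
CHECKSUM_OFFSET = 16   # 8-byte checksum
PAYLOAD_OFFSET = 24    # assumed start of the checksummed contents

FREE = bytes.fromhex("feedface")
DATA = bytes.fromhex("cafedead")
INDIRECT = bytes.fromhex("e4e5c001")
INVALID = bytes.fromhex("ba1ed4a3")
ALLOWED = {FREE, DATA, INDIRECT, INVALID}


def checksum8(payload: bytes) -> bytes:
    """Stand-in checksum (assumption): CRC-32 zero-extended to 8 bytes."""
    return zlib.crc32(payload).to_bytes(8, "big")


def is_corrupt(block: bytes) -> bool:
    guard = bytes(block[GUARD_OFFSET:GUARD_OFFSET + 4])
    if guard not in ALLOWED:
        return True   # unknown guard, including all zeros
    if guard == FREE:
        # Everything after the guard must still be zero.
        return any(block[GUARD_OFFSET + 4:])
    stored = bytes(block[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 8])
    return stored != checksum8(bytes(block[PAYLOAD_OFFSET:]))


def make_block(number: int, guard: bytes, payload: bytes = b"") -> bytes:
    """Build a well-formed block for testing the checks above."""
    block = bytearray(BLOCK_SIZE)
    block[0:8] = number.to_bytes(8, "big")
    block[GUARD_OFFSET:GUARD_OFFSET + 4] = guard
    if guard != FREE:
        block[PAYLOAD_OFFSET:PAYLOAD_OFFSET + len(payload)] = payload
        digest = checksum8(bytes(block[PAYLOAD_OFFSET:]))
        block[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 8] = digest
    return bytes(block)


free_block = make_block(0, FREE)
data_block = make_block(4, DATA, b"application payload")
```

   Note that these checks run entirely in the application; nothing in
   the sketch involves the client or server transfer path.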

1514	   As can be seen, the application can detect errors based on the
1515	   combination of the guard pattern state and the checksum.  But also,
1516	   the application can detect corruption based on the state and the
1517	   contents of the ADB.  This last point is important in validating the
1518	   minimum amount of data we incorporated into our generic framework.

1520	   I.e., the guard pattern is sufficient in allowing applications to
1521	   design their own corruption detection.

1523	   Finally, it is important to note that none of these corruption checks
1524	   occur in the transport layer.  The server and client components are
1525	   totally unaware of the file format and might report everything as
1526	   being transferred correctly even in the case the application detects
1527	   corruption.

1529	6.4.  Example of READ_PLUS

1531	   The hypothetical application presented in Section 6.3 can be used to
1532	   illustrate how READ_PLUS would return an array of results.  A file is
1533	   created and initialized with 100 4k ADBs in the FREE state:

1535	      INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}

1537	   Further, assume the application writes a single ADB at 16k, changing
1538	   the guard pattern to 0xcafedead, we would then have in memory:

1540	      0 -> (16k - 1)    : 4k, 4, 0, 0, 8, 0xfeedface
1541	      16k -> (20k - 1)  : 00 00 00 00 00 00 00 04 ca fe de ad XX ... XX
1542	      20k -> (400k - 1) : 4k, 95, 0, 5, 8, 0xfeedface

1544	   And when the client did a READ_PLUS of 64k at the start of the file,
1545	   it would get back a result of an ADB, some data, and a final ADB:

1547	      ADB {0, 4k, 4, 0, 0, 8, 0xfeedface}
1548	      data 4k
1549	      ADB {20k, 4k, 11, 0, 5, 8, 0xfeedface}
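
   The block counts in such a result follow from clipping the stored
   runs to the requested range.  The sketch below is illustrative
   only: the tuple layout and the function name read_plus are
   hypothetical, reads are assumed to be aligned to 4k, and block
   numbers are assumed to count up from the adb_block_num of 0 given
   to INITIALIZE.

```python
from typing import List, Tuple

K = 1024
BLOCK = 4 * K

# (kind, offset, length in bytes, number of the first block)
Segment = Tuple[str, int, int, int]

file_state: List[Segment] = [
    ("adb", 0, 16 * K, 0),          # four FREE blocks
    ("data", 16 * K, 4 * K, 4),     # the block the application wrote
    ("adb", 20 * K, 380 * K, 5),    # the remaining 95 FREE blocks
]


def read_plus(state: List[Segment], offset: int,
              length: int) -> List[Tuple[str, int, int, int]]:
    """Return (kind, offset, block count, first block number) for each
    stored run that overlaps [offset, offset + length)."""
    end = offset + length
    results = []
    for kind, seg_off, seg_len, first_num in state:
        lo = max(offset, seg_off)
        hi = min(end, seg_off + seg_len)
        if lo >= hi:
            continue  # run lies entirely outside the read
        skipped = (lo - seg_off) // BLOCK
        results.append((kind, lo, (hi - lo) // BLOCK, first_num + skipped))
    return results


result_64k = read_plus(file_state, 0, 64 * K)
```

   A 64k read covers sixteen 4k blocks: four in the leading ADB run,
   one of data, and the remainder in the trailing ADB run.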

1551	6.5.  Zero Filled Holes

1553	   As applications are free to define the structure of an ADB, it is
1554	   trivial to define an ADB which supports zero filled holes.  Such a
1555	   case would encompass the traditional definitions of a sparse file and
1556	   hole punching.  For example, to punch a 64k hole, starting at 100M,
1557	   into an existing file which has no ADB structure:

1559	      INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
1560	                  0, NFS4_UINT64_MAX, 0x0}

1562	7.  Labeled NFS

1563	7.1.  Introduction

1565	   Access control models such as Unix permissions or Access Control
1566	   Lists are commonly referred to as Discretionary Access Control (DAC)
1567	   models.  These systems base their access decisions on user identity
1568	   and resource ownership.  In contrast Mandatory Access Control (MAC)
1569	   models base their access control decisions on the label on the
1570	   subject (usually a process) and the object it wishes to access [7].
1571	   These labels may contain user identity information but usually
1572	   contain additional information.  In DAC systems users are free to
1573	   specify the access rules for resources that they own.  MAC models
1574	   base their security decisions on a system wide policy established by
1575	   an administrator or organization which the users do not have the
1576	   ability to override.  In this section, we add a MAC model to NFSv4.

1578	   The first change necessary is to devise a method for transporting and
1579	   storing security label data on NFSv4 file objects.  Security labels
1580	   have several semantics that are met by NFSv4 recommended attributes
1581	   such as the ability to set the label value upon object creation.
1582	   Access control on these attributes is done through a combination of
1583	   two mechanisms.  As with other recommended attributes on file objects
1584	   the usual DAC checks (ACLs and permission bits) will be performed to
1585	   ensure that proper file ownership is enforced.  In addition a MAC
1586	   system MAY be employed on the client, server, or both to enforce
1587	   additional policy on what subjects may modify security label
1588	   information.

1590	   The second change is to provide a method for the server to notify the
1591	   client that the attribute changed on an open file on the server.  If
1592	   the file is closed, then during the open attempt, the client will
1593	   gather the new attribute value.  The server MUST NOT communicate the
1594	   new value of the attribute; the client MUST query it.  This
1595	   requirement stems from the need for the client to provide sufficient
1596	   access rights to the attribute.

1598	   The final change necessary is a modification to the RPC layer used in
1599	   NFSv4 in the form of a new version of the RPCSEC_GSS [8] framework.
1600	   In order for an NFSv4 server to apply MAC checks it must obtain
1601	   additional information from the client.  Several methods were
1602	   explored for performing this and it was decided that the best
1603	   approach was to incorporate the ability to make security attribute
1604	   assertions through the RPC mechanism.  RPCSEC_GSSv3 [5] outlines a
1605	   method to assert additional security information such as security
1606	   labels on GSS context creation and have that data bound to all RPC
1607	   requests that make use of that context.

1609	7.2.  Definitions

1611	   Label Format Specifier (LFS):  is an identifier used by the client to
1612	      establish the syntactic format of the security label and the
1613	      semantic meaning of its components.  These specifiers exist in a
1614	      registry associated with documents describing the format and
1615	      semantics of the label.

1617	   Label Format Registry:  is the IANA registry containing all
1618	      registered LFS along with references to the documents that
1619	      describe the syntactic format and semantics of the security label.

1621	   Policy Identifier (PI):  is an optional part of the definition of a
1622	      Label Format Specifier which allows for clients and servers to
1623	      identify specific security policies.

1625	   Object:  is a passive resource within the system that we wish to be
1626	      protected.  Objects can be entities such as files, directories,
1627	      pipes, sockets, and many other system resources relevant to the
1628	      protection of the system state.

1630	   Subject:  is an active entity, usually a process, which is
1631	      requesting access to an object.

1633	   Multi-Level Security (MLS):  is a traditional model where objects are
1634	      given a sensitivity level (Unclassified, Secret, Top Secret, etc.)
1635	      and a category set [21].

1637	7.3.  MAC Security Attribute

1639	   MAC models base access decisions on security attributes bound to
1640	   subjects and objects.  This information can range from a user
1641	   identity for an identity based MAC model, to sensitivity levels for
1642	   Multi-Level Security, to a type for Type Enforcement.  These models
1643	   base their decisions on different criteria but the semantics of the
1644	   security attribute remain the same.  The semantics required by the
1645	   security attributes are listed below:

1647	   o  Must provide flexibility with respect to MAC model.

1649	   o  Must provide the ability to atomically set security information
1650	      upon object creation.

1652	   o  Must provide the ability to enforce access control decisions both
1653	      on the client and the server.

1655	   o  Must not expose an object to either the client or server name
1656	      space before its security information has been bound to it.

1658	   NFSv4 implements the security attribute as a recommended attribute.
1659	   These attributes have a fixed format and semantics, which conflicts
1660	   with the flexible nature of the security attribute.  To resolve this
1661	   the security attribute consists of two components.  The first
1662	   component is a LFS as defined in [22] to allow for interoperability
1663	   between MAC mechanisms.  The second component is an opaque field
1664	   which is the actual security attribute data.  To allow for various
1665	   MAC models NFSv4 should be used solely as a transport mechanism for
1666	   the security attribute.  It is the responsibility of the endpoints to
1667	   consume the security attribute and make access decisions based on
1668	   their respective models.  In addition, creation of objects through
1669	   OPEN and CREATE allows for the security attribute to be specified
1670	   upon creation.  By providing an atomic create and set operation for
1671	   the security attribute it is possible to enforce the second and
1672	   fourth requirements.  The recommended attribute FATTR4_SEC_LABEL (see
1673	   Section 11.2.2) will be used to satisfy this requirement.

1675	7.3.1.  Delegations

1677	   In the event that a security attribute is changed on the server while
1678	   a client holds a delegation on the file, the client should follow the
1679	   existing protocol with respect to attribute changes.  It should flush
1680	   all changes back to the server and relinquish the delegation.

1682	7.3.2.  Permission Checking

1684	   It is not feasible to enumerate all possible MAC models and even
1685	   levels of protection within a subset of these models.  This means
1686	   that the NFSv4 client and servers cannot be expected to directly make
1687	   access control decisions based on the security attribute.  Instead
1688	   NFSv4 should defer permission checking on this attribute to the host
1689	   system.  These checks are performed in addition to existing DAC and
1690	   ACL checks outlined in the NFSv4 protocol.  Section 7.6 gives a
1691	   specific example of how the security attribute is handled under a
1692	   particular MAC model.

1694	7.3.3.  Object Creation

1696	   When creating files in NFSv4 the OPEN and CREATE operations are used.
1697	   One of the parameters to these operations is an fattr4 structure
1698	   containing the attributes the file is to be created with.  This
1699	   allows NFSv4 to atomically set the security attribute of files upon
1700	   creation.  When a client is MAC aware it must always provide the
1701	   initial security attribute upon file creation.  In the event that the
1702	   server is the only MAC aware entity in the system it should ignore
1703	   the security attribute specified by the client and instead make the
1704	   determination itself.  A more in depth explanation can be found in
1705	   Section 7.6.

1707	7.3.4.  Existing Objects

1709	   Note that under the MAC model, all objects must have labels.
1710	   Therefore, if an existing server is upgraded to include LNFS support,
1711	   then it is the responsibility of the security system to define the
1712	   behavior for existing objects.  For example, if the security system
1713	   is LFS 0, which means the server just stores and returns labels, then
1714	   existing files should return labels which are set to an empty value.

1716	7.3.5.  Label Changes

1718	   As per the requirements, when a file's security label is modified,
1719	   the server must notify all clients which have the file opened of the
1720	   change in label.  It does so with CB_ATTR_CHANGED.  There are
1721	   preconditions to making an attribute change imposed by NFSv4 and the
1722	   security system might want to impose others.  In the process of
1723	   meeting these preconditions, the server may either serve the
1724	   request in whole or return NFS4ERR_DELAY to the SETATTR operation.

1726	   If there are open delegations on the file belonging to a client other
1727	   than the one making the label change, then the process described in
1728	   Section 7.3.1 must be followed.

1730	   As the server is always presented with the subject label from the
1731	   client, it does not necessarily need to communicate the fact that the
1732	   label has changed to the client.  In the cases where the change
1733	   outright denies the client access, the client will be able to quickly
1734	   determine that there is a new label in effect.  It is in cases where
1735	   the client may share the same object between multiple subjects or a
1736	   security system which is not strictly hierarchical that the
1737	   CB_ATTR_CHANGED callback is very useful.  It allows the server to
1738	   inform the clients that the cached security attribute is now stale.

1740	   Consider a system in which the clients enforce MAC checks and the
1741	   server has a very simple security system which just stores the
1742	   labels.  In this system, the MAC label check always allows access,
1743	   regardless of the subject label.

1745	   In such a system, MAC labels are enforced by the clients.  So if
1746	   client A changes a security label on a file, then the server MUST
1747	   inform all clients that have the file opened that the label has
1748	   changed via CB_ATTR_CHANGED.  Then the clients MUST retrieve the new
1749	   label and MUST enforce access via the new attribute values.

1751	7.4.  pNFS Considerations

1753	   This section examines the issues in deploying LNFS in a pNFS
1754	   community of servers.

1756	7.4.1.  MAC Label Checks

1758	   The new FATTR4_SEC_LABEL attribute is metadata information and as
1759	   such the DS is not aware of the value contained on the MDS.
1760	   Fortunately, the NFSv4.1 protocol [2] already has provisions for
1761	   doing access level checks from the DS to the MDS.  In order for the
1762	   DS to validate the subject label presented by the client, it SHOULD
1763	   utilize this mechanism.

1765	   If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize
1766	   CB_ATTR_CHANGED to inform the client of that fact.  If the MDS is
1767	   maintaining

1769	7.5.  Discovery of Server LNFS Support

1771	   The server can easily determine that a client supports LNFS when it
1772	   queries for the FATTR4_SEC_LABEL label for an object.  Note that it
1773	   cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS
1774	   support.  The client might need to discover which LFS the server
1775	   supports.

1777	   A server which supports LNFS MUST allow a client with any subject
1778	   label to retrieve the FATTR4_SEC_LABEL attribute for the root
1779	   filehandle, ROOTFH.  The following compound must always succeed as
1780	   far as a MAC label check is concerned:

1782	        PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}

1784	   Note that the server might have imposed a security flavor on the root
1785	   that precludes such access.  For example, if the server requires
1786	   Kerberized access and the client presents a compound with AUTH_SYS,
1787	   then the server is allowed to return NFS4ERR_WRONGSEC.  But if the
1788	   client presents a correct security flavor, then the server MUST
1789	   return the FATTR4_SEC_LABEL attribute with the supported LFS filled
1790	   in.

1792	7.6.  MAC Security NFS Modes of Operation

1794	   A system using Labeled NFS may operate in two modes.  The first mode
1795	   provides the most protection and is called "full mode".  In this mode
1796	   both the client and server implement a MAC model allowing each end to
1797	   make an access control decision.  The second mode is called "guest
1798	   mode"; in this mode one end of the connection does not implement a
1799	   MAC model, and the system thus offers less protection than in full
1800	   mode.

1802	7.6.1.  Full Mode

1804	   Full mode environments consist of MAC aware NFSv4 servers and clients
1805	   and may be composed of mixed MAC models and policies.  The system
1806	   requires that both the client and server have an opportunity to
1807	   perform an access control check based on all relevant information
1808	   within the network.  The file object security attribute is provided
1809	   using the mechanism described in Section 7.3.  The security attribute
1810	   of the subject making the request is transported at the RPC layer
1811	   using the mechanism described in RPCSEC_GSSv3 [5].

1813	7.6.1.1.  Initial Labeling and Translation

1815	   The ability to create a file is an action that a MAC model may wish
1816	   to mediate.  The client is given the responsibility to determine the
1817	   initial security attribute to be placed on a file.  This allows the
1818	   client to make a decision as to the acceptable security attributes to
1819	   create a file with before sending the request to the server.  Once
1820	   the server receives the creation request from the client it may
1821	   choose to evaluate if the security attribute is acceptable.

1823	   Security attributes on the client and server may vary based on MAC
1824	   model and policy.  To handle this the security attribute field has an
1825	   LFS component.  This component is a mechanism for the host to
1826	   identify the format and meaning of the opaque portion of the security
1827	   attribute.  A full mode environment may contain hosts operating in
1828	   several different LFSs.  In this case a mechanism for translating the
1829	   opaque portion of the security attribute is needed.  The actual
1830	   translation function will vary based on MAC model and policy and is
1831	   out of the scope of this document.  If a translation is unavailable
1832	   for a given LFS then the request SHOULD be denied.  Another recourse
1833	   is to allow the host to provide a fallback mapping for unknown
1834	   security attributes.

1836	7.6.1.2.  Policy Enforcement

1838	   In full mode access control decisions are made by both the clients
1839	   and servers.  When a client makes a request it takes the security
1840	   attribute from the requesting process and makes an access control
1841	   decision based on that attribute and the security attribute of the
1842	   object it is trying to access.  If the client denies that access an
1843	   RPC call to the server is never made.  If however the access is
1844	   allowed the client will make a call to the NFS server.

1846	   When the server receives the request from the client it extracts the
1847	   security attribute conveyed in the RPC request.  The server then uses
1848	   this security attribute and the attribute of the object the client is
1849	   trying to access to make an access control decision.  If the server's
1850	   policy allows this access it will fulfill the client's request,
1851	   otherwise it will return NFS4ERR_ACCESS.

1853	   Implementations MAY validate security attributes supplied over the
1854	   network to ensure that they are within a set of attributes permitted
1855	   from a specific peer, and if not, reject them.  Note that a system
1856	   may permit a different set of attributes to be accepted from each
1857	   peer.
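
	   As a non-normative illustration of the two-stage check described in
	   this section, the following Python sketch models the client-side and
	   server-side decisions.  The function full_mode_request and the toy
	   policy dominates (integer levels, where a subject may access objects
	   at or below its own level) are hypothetical; real MAC models define
	   their own policy functions.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_ACCESS = "NFS4ERR_ACCESS"


def dominates(subject_label, object_label):
    """Toy policy (an assumption): higher integer levels dominate lower."""
    return subject_label >= object_label


def full_mode_request(subject_label, object_label,
                      client_allows, server_allows):
    """Return (rpc_sent, status) for one access attempt in full mode."""
    # Client-side check: on denial, no RPC call is made at all.
    if not client_allows(subject_label, object_label):
        return (False, None)
    # Server-side check on the subject label conveyed in the RPC request.
    if server_allows(subject_label, object_label):
        return (True, NFS4_OK)
    return (True, NFS4ERR_ACCESS)


# The client and the server may apply different policies.
strict_server = lambda subj, obj: subj > obj

denied_locally = full_mode_request(1, 2, dominates, strict_server)
denied_by_server = full_mode_request(2, 2, dominates, strict_server)
allowed = full_mode_request(3, 2, dominates, strict_server)
```

	   In this sketch the first request never reaches the server, the
	   second is rejected by the server's stricter policy, and the third is
	   allowed by both ends.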

1859	7.6.1.3.  Label Aware Only Server

1861	   If the LFS is 0, then it indicates a server which is label aware, but
1862	   does not enforce policies.  Such a server will store and retrieve all
1863	   object labels presented by clients, notify the clients of any label
1864	   changes via CB_ATTR_CHANGED, but will not restrict access via the
1865	   subject label.  Instead, it will expect the clients to enforce all
1866	   such access locally.

1868	7.6.2.  Guest Mode

1870	   Guest mode implies that either the client or the server does not
1871	   handle labels.  If the client is not LNFS aware, then it will not
1872	   offer subject labels to the server.  The server is the only entity
1873	   enforcing policy, and may selectively provide standard NFS services
1874	   to clients based on their authentication credentials and/or
1875	   associated network attributes (e.g., IP address, network interface).
1876	   The level of trust and access extended to a client in this mode is
1877	   configuration-specific.  If the server is not LNFS aware, then it
1878	   will not return object labels to the client.  Clients in this
1879	   environment may consist of groups implementing different MAC
1880	   model policies.  The system requires that all clients in the
1881	   environment be responsible for access control checks.

1883	7.7.  Security Considerations

1885	   This entire document deals with security issues.

1887	   Depending on the level of protection the MAC system offers there may
1888	   be a requirement to tightly bind the security attribute to the data.

1890	   When only one of the client or server enforces labels, it is
1891	   important to realize that the other side is not enforcing MAC
1892	   protections.  Alternate methods might be in use to handle the lack of
1893	   MAC support and care should be taken to identify and mitigate threats
1894	   from possible tampering outside of these methods.

1896	   An example of this is that a server that modifies READDIR or LOOKUP
1897	   results based on the client's subject label might want to always
1898	   construct the same subject label for a client which does not present
1899	   one.  This will prevent a non-LNFS client from mixing entries in the
1900	   directory cache.

1902	8.  Sharing change attribute implementation details with NFSv4 clients

1904	8.1.  Introduction

1906	   Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the
1907	   change attribute as being mandatory to implement, there is little in
1908	   the way of guidance.  The only feature that is mandated by them is
1909	   that the value must change whenever the file data or metadata change.

1911	   While this allows for a wide range of implementations, it also leaves
1912	   the client with a conundrum: how does it determine which is the most
1913	   recent value for the change attribute in a case where several RPC
1914	   calls have been issued in parallel?  In other words if two COMPOUNDs,
1915	   both containing WRITE and GETATTR requests for the same file, have
1916	   been issued in parallel, how does the client determine which of the
1917	   two change attribute values returned in the replies to the GETATTR
1918	   requests correspond to the most recent state of the file?  In some
1919	   cases, the only recourse may be to send another COMPOUND containing a
1920	   third GETATTR that is fully serialised with the first two.

1922	   NFSv4.2 avoids this kind of inefficiency by allowing the server to
1923	   share details about how the change attribute is expected to evolve,
1924	   so that the client may immediately determine which, out of the
1925	   several change attribute values returned by the server, is the most
1926	   recent.  change_attr_type is defined as a new recommended attribute
1927	   (see Section 11.2.1), and is per filesystem.

1929	9.  Security Considerations

1931	10.  Error Values

1933	   NFS error numbers are assigned to failed operations within a Compound
1934	   (COMPOUND or CB_COMPOUND) request.  A Compound request contains a
1935	   number of NFS operations that have their results encoded in sequence
1936	   in a Compound reply.  The results of successful operations will
1937	   consist of an NFS4_OK status followed by the encoded results of the
1938	   operation.  If an NFS operation fails, an error status will be
1939	   entered in the reply and the Compound request will be terminated.
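
	   As a non-normative illustration of the stop-on-first-error behavior
	   described above, a minimal Python sketch follows.  The function
	   run_compound and the (name, handler) pairs are hypothetical and are
	   not part of the protocol.

```python
NFS4_OK = "NFS4_OK"


def run_compound(ops):
    """Evaluate operations in order and stop at the first failure.

    ops is a list of (name, handler) pairs; each handler returns a
    (status, result) tuple.  Both shapes are illustrative assumptions.
    """
    replies = []
    for name, handler in ops:
        status, result = handler()
        replies.append((name, status, result))
        if status != NFS4_OK:
            break  # the remaining operations are not evaluated
    return replies


replies = run_compound([
    ("PUTROOTFH", lambda: (NFS4_OK, None)),
    ("SETATTR", lambda: ("NFS4ERR_BADLABEL", None)),
    ("GETATTR", lambda: (NFS4_OK, {"sec_label": b""})),
])
print([(name, status) for name, status, _ in replies])
```

	   Here the reply carries results for PUTROOTFH and SETATTR only; the
	   trailing GETATTR is never evaluated.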

1941	10.1.  Error Definitions

1943	                        Protocol Error Definitions

1945	         +--------------------------+--------+------------------+
1946	         | Error                    | Number | Description      |
1947	         +--------------------------+--------+------------------+
1948	         | NFS4ERR_BADLABEL         | 10093  | Section 10.1.3.1 |
1949	         | NFS4ERR_METADATA_NOTSUPP | 10090  | Section 10.1.2.1 |
1950	         | NFS4ERR_OFFLOAD_DENIED   | 10091  | Section 10.1.2.2 |
1951	         | NFS4ERR_PARTNER_NO_AUTH  | 10089  | Section 10.1.2.3 |
1952	         | NFS4ERR_PARTNER_NOTSUPP  | 10088  | Section 10.1.2.4 |
1953	         | NFS4ERR_UNION_NOTSUPP    | 10094  | Section 10.1.1.1 |
1954	         | NFS4ERR_WRONG_LFS        | 10092  | Section 10.1.3.2 |
1955	         +--------------------------+--------+------------------+

1957	                                  Table 1

1959	10.1.1.  General Errors

1961	   This section deals with errors that are applicable to a broad set of
1962	   different purposes.

1964	10.1.1.1.  NFS4ERR_UNION_NOTSUPP (Error Code 10094)

1966	   One of the arguments to the operation is a discriminated union and
1967	   while the server supports the given operation, it does not support
1968	   the selected arm of the discriminated union.  For an example, see
1969	   READ_PLUS (Section 13.10).

1971	10.1.2.  Server to Server Copy Errors

1973	   These errors deal with the interaction between server to server
1974	   copies.

1976	10.1.2.1.  NFS4ERR_METADATA_NOTSUPP (Error Code 10090)

1978	   The destination file cannot support the same metadata as the source
1979	   file.

1981	10.1.2.2.  NFS4ERR_OFFLOAD_DENIED (Error Code 10091)

1983	   The copy offload operation is supported by both the source and the
1984	   destination, but the destination is not allowing it for this file.
1985	   If the client sees this error, it should fall back to the normal copy
1986	   semantics.

1988	10.1.2.3.  NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)

1990	   The remote server does not authorize a server-to-server copy offload
1991	   operation.  This may be due to the client's failure to send the
1992	   COPY_NOTIFY operation to the remote server, the remote server
1993	   receiving a server-to-server copy offload request after the copy
1994	   lease time expired, or for some other permission problem.

1996	10.1.2.4.  NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)

1998	   The remote server does not support the server-to-server copy offload
1999	   protocol.

2001	10.1.3.  Labeled NFS Errors

2003	   These errors are used in LNFS.

2005	10.1.3.1.  NFS4ERR_BADLABEL (Error Code 10093)

2007	   The label specified is invalid in some manner.

2009	10.1.3.2.  NFS4ERR_WRONG_LFS (Error Code 10092)

2011	   The LFS specified in the subject label is not compatible with the LFS
2012	   in the object label.

2014	11.  New File Attributes

2016	11.1.  New RECOMMENDED Attributes - List and Definition References

2018	   The list of new RECOMMENDED attributes appears in Table 2.  The
2019	   meanings of the columns of the table are:

2021	   Name:  The name of the attribute.

2023	   Id:  The number assigned to the attribute.  In the event of conflicts
2024	      between the assigned number and [3], the latter is likely
2025	      authoritative, but should be resolved with Errata to this document
2026	      and/or [3].  See [23] for the Errata process.

2028	   Data Type:  The XDR data type of the attribute.

2030	   Acc:  Access allowed to the attribute.

2032	      R  means read-only (GETATTR may retrieve, SETATTR may not set).

2034	      W  means write-only (SETATTR may set, GETATTR may not retrieve).

2036	      R W   means read/write (GETATTR may retrieve, SETATTR may set).

2038	   Defined in:  The section of this specification that describes the
2039	      attribute.

2041	   +------------------+----+-------------------+-----+----------------+
2042	   | Name             | Id | Data Type         | Acc | Defined in     |
2043	   +------------------+----+-------------------+-----+----------------+
2044	   | change_attr_type | 79 | change_attr_type4 | R   | Section 11.2.1 |
2045	   | sec_label        | 80 | sec_label4        | R W | Section 11.2.2 |
2046	   | space_reserved   | 77 | boolean           | R W | Section 11.2.3 |
2047	   | space_freed      | 78 | length4           | R   | Section 11.2.4 |
2048	   +------------------+----+-------------------+-----+----------------+

2050	                                  Table 2

2052	11.2.  Attribute Definitions

2054	11.2.1.  Attribute 79: change_attr_type

2056	   enum change_attr_type4 {
2057	              NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
2058	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
2059	              NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
2060	              NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
2061	              NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
2062	   };

2064	   change_attr_type is a per filesystem attribute which enables the
2065	   NFSv4.2 server to provide additional information about how it expects
2066	   the change attribute value to evolve after the file data or metadata
2067	   has changed.

2069	   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value MUST
2070	      monotonically increase for every atomic change to the file
2071	      attributes, data or directory contents.

2073	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:  The change attribute value MUST
2074	      be incremented by one unit for every atomic change to the file
2075	      attributes, data or directory contents.  This property is
2076	      preserved when writing to pNFS data servers.

2078	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:  The change attribute
2079	      value MUST be incremented by one unit for every atomic change to
2080	      the file attributes, data or directory contents.  In the case
2081	      where the client is writing to pNFS data servers, the number of
2082	      increments is not guaranteed to exactly match the number of
2083	      writes.

2085	   NFS4_CHANGE_TYPE_IS_TIME_METADATA:  The change attribute is
2086	      implemented as suggested in the NFSv4 spec [10] in terms of the
2087	      time_metadata attribute.

2089	   NFS4_CHANGE_TYPE_IS_UNDEFINED:  The change attribute does not take
2090	      values that fit into any of these categories.

2092	   If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
2093	   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
2094	   NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at
2095	   the very least that the change attribute is monotonically increasing,
2096	   which is sufficient to resolve the question of which value is the
2097	   most recent.

2099	   If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then
2100	   by inspecting the value of the 'time_delta' attribute it additionally
2101	   has the option of detecting rogue server implementations that use
2102	   time_metadata in violation of the spec.

2104	   Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
2105	   has the ability to predict what the resulting change attribute value
2106	   should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
2107	   This again allows it to detect changes made in parallel by another
2108	   client.  The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
2109	   the same, but only if the client is not doing pNFS WRITEs.
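
	   As a non-normative illustration, the following Python sketch shows
	   how a client might use change_attr_type to pick the most recent of
	   several change attribute values, and to predict the value expected
	   after its own changes.  The function names most_recent and predicted
	   are hypothetical, and wraparound of the change attribute is ignored
	   for simplicity.

```python
from enum import IntEnum


class ChangeAttrType(IntEnum):
    """Mirror of the change_attr_type4 values defined above."""
    MONOTONIC_INCR = 0
    VERSION_COUNTER = 1
    VERSION_COUNTER_NOPNFS = 2
    TIME_METADATA = 3
    UNDEFINED = 4


# Types the text above names as sufficient to order values.
ORDERED = {
    ChangeAttrType.MONOTONIC_INCR,
    ChangeAttrType.VERSION_COUNTER,
    ChangeAttrType.TIME_METADATA,
}


def most_recent(attr_type, values):
    """Return the newest change value, or None if it cannot be inferred."""
    if attr_type in ORDERED:
        return max(values)
    return None  # caller falls back to a serialised GETATTR


def predicted(attr_type, before, changes, pnfs_writes=False):
    """Predict the post-COMPOUND value, or None if no prediction holds."""
    if attr_type == ChangeAttrType.VERSION_COUNTER:
        return before + changes
    if attr_type == ChangeAttrType.VERSION_COUNTER_NOPNFS and not pnfs_writes:
        return before + changes
    return None


print(most_recent(ChangeAttrType.VERSION_COUNTER, [41, 43, 42]))
print(predicted(ChangeAttrType.VERSION_COUNTER_NOPNFS, 41, 2, pnfs_writes=True))
```

	   A returned value that differs from the prediction suggests that
	   another client changed the file in parallel.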

2111	11.2.2.  Attribute 80: sec_label

2113	   typedef uint32_t  policy4;

2115	   struct labelformat_spec4 {
2116	           policy4 lfs_lfs;
2117	           policy4 lfs_pi;
2118	   };

2120	   struct sec_label4 {
2121	           labelformat_spec4       slai_lfs;
2122	           opaque                  slai_data<>;
2123	   };
2124	   The FATTR4_SEC_LABEL contains an array of two components with the
2125	   first component being an LFS.  It serves to provide the receiving end
2126	   with the information necessary to translate the security attribute
2127	   into a form that is usable by the endpoint.  Label Formats assigned
2128	   an LFS may optionally choose to include a Policy Identifier field to
2129	   allow for complex policy deployments.  The LFS and Label Format
2130	   Registry are described in detail in [22].  The translation used to
2131	   interpret the security attribute is not specified as part of the
2132	   protocol as it may depend on various factors.  The second component
2133	   is an opaque section which contains the data of the attribute.  This
2134	   component is dependent on the MAC model to interpret and enforce.

2136	   In particular, it is the responsibility of the LFS specification to
2137	   define a maximum size for the opaque section, slai_data<>.  When
2138	   creating or modifying a label for an object, the client needs to be
2139	   guaranteed that the server will accept a label that is sized
2140	   correctly.  By both client and server being part of a specific MAC
2141	   model, the client will be aware of the size.
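
	   As a non-normative illustration of the wire layout implied by the
	   XDR above, the following Python sketch encodes and decodes a
	   sec_label4 value.  It assumes the standard XDR rules: 32-bit
	   big-endian integers, and variable-length opaque data sent as a
	   length word followed by the bytes, zero-padded to a multiple of
	   four.  The helper names are hypothetical.

```python
import struct


def encode_sec_label4(lfs, pi, data):
    """Encode lfs_lfs, lfs_pi and slai_data<> as XDR bytes."""
    pad = (4 - len(data) % 4) % 4
    header = struct.pack(">III", lfs, pi, len(data))
    return header + data + b"\x00" * pad


def decode_sec_label4(buf):
    """Decode XDR bytes back into (lfs, pi, data)."""
    lfs, pi, length = struct.unpack_from(">III", buf, 0)
    data = buf[12:12 + length]
    if len(data) != length:
        raise ValueError("truncated slai_data")
    return lfs, pi, data


wire = encode_sec_label4(0, 0, b"label")
print(len(wire), decode_sec_label4(wire))
```

	   The LFS value 0 and the payload b"label" are placeholders only;
	   the format of the opaque data is defined by the LFS in use.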

2143	11.2.3.  Attribute 77: space_reserved

2145	   The space_reserved attribute is a read/write attribute of type
2146	   boolean.  It is a per file attribute.  When the space_reserved
2147	   attribute is set via SETATTR, the server must ensure that there is
2148	   disk space to accommodate every byte in the file before it can return
2149	   success.  If the server cannot guarantee this, it must return
2150	   NFS4ERR_NOSPC.

2152	   If the client tries to grow a file which has the space_reserved
2153	   attribute set, the server must guarantee that there is disk space to
2154	   accommodate every byte in the file with the new size before it can
2155	   return success.  If the server cannot guarantee this, it must return
2156	   NFS4ERR_NOSPC.

2158	   It is not required that the server allocate the space to the file
2159	   before returning success.  The allocation can be deferred; however,
2160	   it must be guaranteed that it will not fail for lack of space.

2162	   The value of space_reserved can be obtained at any time through
2163	   GETATTR.

2165	   In order to avoid ambiguity, the space_reserved bit cannot be set
2166	   along with the size bit in SETATTR.  Increasing the size of a file
2167	   with space_reserved set will fail if space reservation cannot be
2168	   guaranteed for the new size.  If the file size is decreased, space
2169	   reservation is only guaranteed for the new size and the extra blocks
2170	   backing the file can be released.
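
	   As a non-normative illustration of the reservation rules above, the
	   following Python sketch models a server's space accounting.  The
	   classes ToyVolume and ToyFile are hypothetical.  The model assumes
	   that a file has no blocks allocated before it is reserved, so that
	   a reservation costs the full file size.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_NOSPC = "NFS4ERR_NOSPC"


class ToyFile:
    def __init__(self, size=0):
        self.size = size
        self.space_reserved = False


class ToyVolume:
    def __init__(self, free_bytes):
        self.free_bytes = free_bytes  # bytes not yet promised to any file

    def set_space_reserved(self, f):
        """SETATTR of space_reserved: promise space for every byte."""
        if f.space_reserved:
            return NFS4_OK
        if f.size > self.free_bytes:
            return NFS4ERR_NOSPC
        self.free_bytes -= f.size
        f.space_reserved = True
        return NFS4_OK

    def resize(self, f, new_size):
        """Grow or shrink a file, keeping the reservation in step."""
        if f.space_reserved:
            delta = new_size - f.size
            if delta > self.free_bytes:
                return NFS4ERR_NOSPC  # size is left unchanged
            self.free_bytes -= delta  # a negative delta releases space
        f.size = new_size
        return NFS4_OK


vol = ToyVolume(free_bytes=100)
f = ToyFile(size=60)
print(vol.set_space_reserved(f), vol.free_bytes)
print(vol.resize(f, 120), f.size)
print(vol.resize(f, 10), vol.free_bytes)
```

	   The sketch tracks promises only; as noted above, the actual
	   allocation may be deferred as long as it cannot fail for lack of
	   space.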

2172	11.2.4.  Attribute 78: space_freed

2174	   space_freed gives the number of bytes freed if the file is deleted.
2175	   This attribute is read only and is of type length4.  It is a per file
2176	   attribute.

2178	12.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL

2180	   The following tables summarize the operations of the NFSv4.2 protocol
2181	   and the corresponding designation of REQUIRED, RECOMMENDED, or
2182	   OPTIONAL to implement, OBSOLETE if implemented, or MUST NOT
2183	   implement.  The designation of OBSOLETE if implemented is reserved
2184	   for those operations which are defined in either NFSv4.0 or NFSv4.1,
2185	   can be implemented in NFSv4.2, and are intended to become MUST NOT
2186	   implement in NFSv4.3.  The designation of MUST NOT implement is
2187	   reserved for those operations that were defined in either NFSv4.0 or
2188	   NFSv4.1 and MUST NOT be implemented in NFSv4.2.

2190	   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
2191	   for operations sent by the client is for the server implementation.
2192	   The client is generally required to implement the operations needed
2193	   for the operating environment for which it serves.  For example, a
2194	   read-only NFSv4.2 client would have no need to implement the WRITE
2195	   operation and is not required to do so.

2197	   The REQUIRED or OPTIONAL designation for callback operations sent by
2198	   the server is for both the client and server.  Generally, the client
2199	   has the option of creating the backchannel and sending the operations
2200	   on the fore channel that will be a catalyst for the server sending
2201	   callback operations.  A partial exception is CB_RECALL_SLOT; the only
2202	   way the client can avoid supporting this operation is by not creating
2203	   a backchannel.

2205	   Since this is a summary of the operations and their designation,
2206	   there are subtleties that are not presented here.  Therefore, if
2207	   there is a question of the requirements of implementation, the
2208	   operation descriptions themselves must be consulted along with other
2209	   relevant explanatory text within either this specification or that of
2210	   NFSv4.1 [2].

2212	   The abbreviations used in the second and third columns of the table
2213	   are defined as follows.

2215	   REQ  REQUIRED to implement

2217	   REC  RECOMMENDED to implement

2219	   OPT  OPTIONAL to implement

2221	   OBS  OBSOLETE if implemented

2223	   MNI  MUST NOT implement

2225	   For the NFSv4.2 features that are OPTIONAL, the operations that
2226	   support those features are OPTIONAL, and the server would return
2227	   NFS4ERR_NOTSUPP in response to the client's use of those operations.
2228	   If an OPTIONAL feature is supported, it is possible that a set of
2229	   operations related to the feature become REQUIRED to implement.  The
2230	   third column of the table designates the feature(s) and if the
2231	   operation is REQUIRED or OPTIONAL in the presence of support for the
2232	   feature.

2234	   The OPTIONAL features identified and their abbreviations are as
2235	   follows:

2237	   pNFS  Parallel NFS

2239	   FDELG  File Delegations

2241	   DDELG  Directory Delegations

2243	   COPY  Server Side Copy

2245	   ADB  Application Data Blocks

2247	                                Operations

2249	   +----------------------+--------------------+-----------------------+
2250	   | Operation            | REQ, REC, OPT, or  | Feature (REQ, REC, or |
2251	   |                      | MNI                | OPT)                  |
2252	   +----------------------+--------------------+-----------------------+
2253	   | ACCESS               | REQ                |                       |
2254	   | BACKCHANNEL_CTL      | REQ                |                       |
2255	   | BIND_CONN_TO_SESSION | REQ                |                       |
2256	   | CLOSE                | REQ                |                       |
2257	   | COMMIT               | REQ                |                       |
2258	   | COPY                 | OPT                | COPY (REQ)            |
2259	   | COPY_ABORT           | OPT                | COPY (REQ)            |
2260	   | COPY_NOTIFY          | OPT                | COPY (REQ)            |
2261	   | COPY_REVOKE          | OPT                | COPY (REQ)            |
2262	   | COPY_STATUS          | OPT                | COPY (REQ)            |
2263	   | CREATE               | REQ                |                       |
2264	   | CREATE_SESSION       | REQ                |                       |
2265	   | DELEGPURGE           | OPT                | FDELG (REQ)           |
2266	   | DELEGRETURN          | OPT                | FDELG, DDELG, pNFS    |
2267	   |                      |                    | (REQ)                 |
2268	   | DESTROY_CLIENTID     | REQ                |                       |
2269	   | DESTROY_SESSION      | REQ                |                       |
2270	   | EXCHANGE_ID          | REQ                |                       |
2271	   | FREE_STATEID         | REQ                |                       |
2272	   | GETATTR              | REQ                |                       |
2273	   | GETDEVICEINFO        | OPT                | pNFS (REQ)            |
2274	   | GETDEVICELIST        | OPT                | pNFS (OPT)            |
2275	   | GETFH                | REQ                |                       |
2276	   | INITIALIZE           | OPT                | ADB (REQ)             |
2277	   | GET_DIR_DELEGATION   | OPT                | DDELG (REQ)           |
2278	   | LAYOUTCOMMIT         | OPT                | pNFS (REQ)            |
2279	   | LAYOUTGET            | OPT                | pNFS (REQ)            |
2280	   | LAYOUTRETURN         | OPT                | pNFS (REQ)            |
2281	   | LINK                 | OPT                |                       |
2282	   | LOCK                 | REQ                |                       |
2283	   | LOCKT                | REQ                |                       |
2284	   | LOCKU                | REQ                |                       |
2285	   | LOOKUP               | REQ                |                       |
2286	   | LOOKUPP              | REQ                |                       |
2287	   | NVERIFY              | REQ                |                       |
2288	   | OPEN                 | REQ                |                       |
2289	   | OPENATTR             | OPT                |                       |
2290	   | OPEN_CONFIRM         | MNI                |                       |
2291	   | OPEN_DOWNGRADE       | REQ                |                       |
2292	   | PUTFH                | REQ                |                       |
2293	   | PUTPUBFH             | REQ                |                       |
2294	   | PUTROOTFH            | REQ                |                       |
2295	   | READ                 | OBS                |                       |
2296	   | READDIR              | REQ                |                       |
2297	   | READLINK             | OPT                |                       |
2298	   | READ_PLUS            | OPT                | ADB (REQ)             |
2299	   | RECLAIM_COMPLETE     | REQ                |                       |
2300	   | RELEASE_LOCKOWNER    | MNI                |                       |
2301	   | REMOVE               | REQ                |                       |
2302	   | RENAME               | REQ                |                       |
2303	   | RENEW                | MNI                |                       |
2304	   | RESTOREFH            | REQ                |                       |
2305	   | SAVEFH               | REQ                |                       |
2306	   | SECINFO              | REQ                |                       |
2307	   | SECINFO_NO_NAME      | REC                | pNFS file layout      |
2308	   |                      |                    | (REQ)                 |
2309	   | SEQUENCE             | REQ                |                       |
2310	   | SETATTR              | REQ                |                       |
2311	   | SETCLIENTID          | MNI                |                       |
2312	   | SETCLIENTID_CONFIRM  | MNI                |                       |
2313	   | SET_SSV              | REQ                |                       |
2314	   | TEST_STATEID         | REQ                |                       |
2315	   | VERIFY               | REQ                |                       |
2316	   | WANT_DELEGATION      | OPT                | FDELG (OPT)           |
2317	   | WRITE                | REQ                |                       |
2318	   +----------------------+--------------------+-----------------------+

2320	                            Callback Operations

2322	   +-------------------------+-------------------+---------------------+
2323	   | Operation               | REQ, REC, OPT, or | Feature (REQ, REC,  |
2324	   |                         | MNI               | or OPT)             |
2325	   +-------------------------+-------------------+---------------------+
2326	   | CB_COPY                 | OPT               | COPY (REQ)          |
2327	   | CB_GETATTR              | OPT               | FDELG (REQ)         |
2328	   | CB_LAYOUTRECALL         | OPT               | pNFS (REQ)          |
2329	   | CB_NOTIFY               | OPT               | DDELG (REQ)         |
2330	   | CB_NOTIFY_DEVICEID      | OPT               | pNFS (OPT)          |
2331	   | CB_NOTIFY_LOCK          | OPT               |                     |
2332	   | CB_PUSH_DELEG           | OPT               | FDELG (OPT)         |
2333	   | CB_RECALL               | OPT               | FDELG, DDELG, pNFS  |
2334	   |                         |                   | (REQ)               |
2335	   | CB_RECALL_ANY           | OPT               | FDELG, DDELG, pNFS  |
2336	   |                         |                   | (REQ)               |
2337	   | CB_RECALL_SLOT          | REQ               |                     |
2338	   | CB_RECALLABLE_OBJ_AVAIL | OPT               | DDELG, pNFS (REQ)   |
2339	   | CB_SEQUENCE             | OPT               | FDELG, DDELG, pNFS  |
2340	   |                         |                   | (REQ)               |
2341	   | CB_WANTS_CANCELLED      | OPT               | FDELG, DDELG, pNFS  |
2342	   |                         |                   | (REQ)               |
2343	   +-------------------------+-------------------+---------------------+

2345	13.  NFSv4.2 Operations

2347	13.1.  Operation 59: COPY - Initiate a server-side copy

2348	13.1.1.  ARGUMENT

2350	   const COPY4_GUARDED     = 0x00000001;
2351	   const COPY4_METADATA    = 0x00000002;

2353	   struct COPY4args {
2354	           /* SAVED_FH: source file */
2355	           /* CURRENT_FH: destination file or */
2356	           /*             directory           */
2357	           offset4         ca_src_offset;
2358	           offset4         ca_dst_offset;
2359	           length4         ca_count;
2360	           uint32_t        ca_flags;
2361	           component4      ca_destination;
2362	           netloc4         ca_source_server<>;
2363	   };

2365	13.1.2.  RESULT

2367	   union COPY4res switch (nfsstat4 cr_status) {
2368	           case NFS4_OK:
2369	                   stateid4        cr_callback_id<1>;
2370	           default:
2371	                   length4         cr_bytes_copied;
2372	   };

2374	13.1.3.  DESCRIPTION

2376	   The COPY operation is used for both intra-server and inter-server
2377	   copies.  In both cases, the COPY is always sent from the client to
2378	   the destination server of the file copy.  The COPY operation requests
2379	   that a file be copied from the location specified by the SAVED_FH
2380	   value to the location specified by the combination of CURRENT_FH and
2381	   ca_destination.

2383	   The SAVED_FH must be a regular file.  If SAVED_FH is not a regular
2384	   file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.

2386	   In order to set SAVED_FH to the source file handle, the compound
2387	   procedure requesting the COPY will include a sub-sequence of
2388	   operations such as

2390	      PUTFH source-fh
2391	      SAVEFH

2393	   If the request is for a server-to-server copy, the source-fh is a
2394	   filehandle from the source server and the compound procedure is being
2395	   executed on the destination server.  In this case, the source-fh is a
2396	   foreign filehandle on the server receiving the COPY request.  If
2397	   either PUTFH or SAVEFH checked the validity of the filehandle, the
2398	   operation would likely fail and return NFS4ERR_STALE.

2400	   In order to avoid this problem, the minor version incorporating the
2401	   COPY operations will need to make a few small changes in the handling
2402	   of existing operations.  If a server supports the server-to-server
2403	   COPY feature, a PUTFH followed by a SAVEFH MUST NOT return
2404	   NFS4ERR_STALE for either operation.  These restrictions do not pose
2405	   substantial difficulties for servers.  The CURRENT_FH and SAVED_FH
2406	   may be validated in the context of the operation referencing them and
2407	   an NFS4ERR_STALE error returned for an invalid file handle at that
2408	   point.

2410	   The CURRENT_FH and ca_destination together specify the destination of
2411	   the copy operation.  If ca_destination is of 0 (zero) length, then
2412	   CURRENT_FH specifies the target file.  In this case, CURRENT_FH MUST
2413	   be a regular file and not a directory.  If ca_destination is not of 0
2414	   (zero) length, the ca_destination argument specifies the file name to
2415	   which the data will be copied within the directory identified by
2416	   CURRENT_FH.  In this case, CURRENT_FH MUST be a directory and not a
2417	   regular file.
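
   The destination rules above can be sketched as follows.  This is an
   illustration only: the function name, the "file" and "dir" type
   tags, and the choice of NFS4ERR_ISDIR and NFS4ERR_NOTDIR for the two
   mismatch cases are assumptions of the sketch, not requirements of
   this document.

```python
def resolve_copy_destination(current_fh_type, ca_destination):
    """Return (target, error) for a COPY destination.

    current_fh_type is "file" or "dir" (hypothetical tags for the
    object CURRENT_FH refers to); ca_destination is the component
    name, possibly empty.
    """
    if len(ca_destination) == 0:
        # Zero-length name: CURRENT_FH itself is the target file.
        if current_fh_type != "file":
            return (None, "NFS4ERR_ISDIR")   # assumed error choice
        return ("current_fh", None)
    # Non-empty name: CURRENT_FH is the directory holding the target.
    if current_fh_type != "dir":
        return (None, "NFS4ERR_NOTDIR")      # assumed error choice
    return ("named_entry", None)
```

   For example, resolve_copy_destination("dir", "copy.dat") selects
   the entry "copy.dat" within the directory named by CURRENT_FH.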

2419	   If the file named by ca_destination does not exist and the operation
2420	   completes successfully, the file will be visible in the file system
2421	   namespace.  If the file does not exist and the operation fails, the
2422	   file MAY be visible in the file system namespace depending on when
2423	   the failure occurs and on the implementation of the NFS server
2424	   receiving the COPY operation.  If the ca_destination name cannot be
2425	   created in the destination file system (due to file name
2426	   restrictions, such as case or length), the operation MUST fail.

2428	   The ca_src_offset is the offset within the source file from which the
2429	   data will be read, the ca_dst_offset is the offset within the
2430	   destination file to which the data will be written, and the ca_count
2431	   is the number of bytes that will be copied.  An offset of 0 (zero)
2432	   specifies the start of the file.  A count of 0 (zero) requests that
2433	   all bytes from ca_src_offset through EOF be copied to the
2434	   destination.  If concurrent modifications to the source file overlap
2435	   with the source file region being copied, the data copied may include
2436	   all, some, or none of the modifications.  The client can use standard
2437	   NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory
2438	   byte range locks) to protect against concurrent modifications if the
2439	   client is concerned about this.  If the source file's end of file is
2440	   being modified in parallel with a copy that specifies a count of 0
2441	   (zero) bytes, the amount of data copied is implementation dependent
2442	   (clients may guard against this case by specifying a non-zero count
2443	   value or preventing modification of the source file as mentioned
2444	   above).
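
   The meaning of a zero ca_count can be made concrete with a small
   sketch.  The function name is hypothetical, and src_size stands for
   the source file's size at the instant the server evaluates the
   request, which is exactly the value that is unstable if the file is
   being extended concurrently.

```python
def effective_copy_length(src_size, ca_src_offset, ca_count):
    """Number of bytes a COPY request asks the server to copy."""
    if ca_count == 0:
        # A count of 0 means "from ca_src_offset through EOF".
        return max(src_size - ca_src_offset, 0)
    return ca_count
```

   With a 1000-byte source file, an offset of 200 and a count of 0
   request 800 bytes; a non-zero count is taken as given.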

2446	   If the source offset or the source offset plus count is greater than
2447	   or equal to the size of the source file, the operation will fail with
2448	   NFS4ERR_INVAL.  The destination offset or destination offset plus
2449	   count may be greater than the size of the destination file.  This
2450	   allows for the client to issue parallel copies to implement
2451	   operations such as "cat file1 file2 file3 file4 > dest".

2453	   If the destination file is created as a result of this command, the
2454	   destination file's size will be equal to the number of bytes
2455	   successfully copied.  If the destination file already existed, the
2456	   destination file's size may increase as a result of this operation
2457	   (e.g., if ca_dst_offset plus ca_count is greater than the
2458	   destination's initial size).

2460	   If the ca_source_server list is specified, then this is an inter-
2461	   server copy operation and the source file is on a remote server.  The
2462	   client is expected to have previously issued a successful COPY_NOTIFY
2463	   request to the remote source server.  The ca_source_server list
2464	   SHOULD be the same as the COPY_NOTIFY response's cnr_source_server
2465	   list.  If the client includes the entries from the COPY_NOTIFY
2466	   response's cnr_source_server list in the ca_source_server list, the
2467	   source server can indicate a specific copy protocol for the
2468	   destination server to use by returning a URL, which specifies both a
2469	   protocol service and server name.  Server-to-server copy protocol
2470	   considerations are described in Section 2.2.3 and Section 2.4.1.

2472	   The ca_flags argument allows the copy operation to be customized in
2473	   the following ways using the guarded flag (COPY4_GUARDED) and the
2474	   metadata flag (COPY4_METADATA).

2476	   If the guarded flag is set and the destination exists on the server,
2477	   this operation will fail with NFS4ERR_EXIST.

2479	   If the guarded flag is not set and the destination exists on the
2480	   server, the behavior is implementation dependent.
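
   A minimal sketch of the guarded check follows.  The flag value is
   taken from the ARGUMENT section; the function name and the use of
   None to mean "no error from this check" are assumptions of the
   sketch.

```python
COPY4_GUARDED = 0x00000001

def check_guarded(ca_flags, destination_exists):
    """Return "NFS4ERR_EXIST" if the guarded flag forbids the copy."""
    if destination_exists and (ca_flags & COPY4_GUARDED):
        return "NFS4ERR_EXIST"
    # Unguarded copy onto an existing file: behavior is left to the
    # implementation, so this sketch reports no error here.
    return None
```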

2482	   If the metadata flag is set and the client is requesting a whole file
2483	   copy (i.e., ca_count is 0 (zero)), a subset of the destination file's
2484	   attributes MUST be the same as the source file's corresponding
2485	   attributes and a subset of the destination file's attributes SHOULD
2486	   be the same as the source file's corresponding attributes.  The
2487	   attributes in the MUST and SHOULD copy subsets will be defined for
2488	   each NFS version.

2490	   For NFSv4.2, Table 3 and Table 4 list the REQUIRED and RECOMMENDED
2491	   attributes respectively.  A "MUST" in the "Copy to destination file?"
2492	   column indicates that the attribute is part of the MUST copy set.  A
2493	   "SHOULD" in the "Copy to destination file?" column indicates that the
2494	   attribute is part of the SHOULD copy set.

2496	          +--------------------+----+---------------------------+
2497	          | Name               | Id | Copy to destination file? |
2498	          +--------------------+----+---------------------------+
2499	          | supported_attrs    | 0  | no                        |
2500	          | type               | 1  | MUST                      |
2501	          | fh_expire_type     | 2  | no                        |
2502	          | change             | 3  | SHOULD                    |
2503	          | size               | 4  | MUST                      |
2504	          | link_support       | 5  | no                        |
2505	          | symlink_support    | 6  | no                        |
2506	          | named_attr         | 7  | no                        |
2507	          | fsid               | 8  | no                        |
2508	          | unique_handles     | 9  | no                        |
2509	          | lease_time         | 10 | no                        |
2510	          | rdattr_error       | 11 | no                        |
2511	          | filehandle         | 19 | no                        |
2512	          | suppattr_exclcreat | 75 | no                        |
2513	          +--------------------+----+---------------------------+

2515	                                  Table 3

2517	          +--------------------+----+---------------------------+
2518	          | Name               | Id | Copy to destination file? |
2519	          +--------------------+----+---------------------------+
2520	          | acl                | 12 | MUST                      |
2521	          | aclsupport         | 13 | no                        |
2522	          | archive            | 14 | no                        |
2523	          | cansettime         | 15 | no                        |
2524	          | case_insensitive   | 16 | no                        |
2525	          | case_preserving    | 17 | no                        |
2526	          | change_attr_type   | 79 | no                        |
2527	          | change_policy      | 60 | no                        |
2528	          | chown_restricted   | 18 | MUST                      |
2529	          | dacl               | 58 | MUST                      |
2530	          | dir_notif_delay    | 56 | no                        |
2531	          | dirent_notif_delay | 57 | no                        |
2532	          | fileid             | 20 | no                        |
2533	          | files_avail        | 21 | no                        |
2534	          | files_free         | 22 | no                        |
2535	          | files_total        | 23 | no                        |
2536	          | fs_charset_cap     | 76 | no                        |
2537	          | fs_layout_type     | 62 | no                        |
2538	          | fs_locations       | 24 | no                        |
2539	          | fs_locations_info  | 67 | no                        |
2540	          | fs_status          | 61 | no                        |
2541	          | hidden             | 25 | MUST                      |
2542	          | homogeneous        | 26 | no                        |
2543	          | layout_alignment   | 66 | no                        |
2544	          | layout_blksize     | 65 | no                        |
2545	          | layout_hint        | 63 | no                        |
2546	          | layout_type        | 64 | no                        |
2547	          | maxfilesize        | 27 | no                        |
2548	          | maxlink            | 28 | no                        |
2549	          | maxname            | 29 | no                        |
2550	          | maxread            | 30 | no                        |
2551	          | maxwrite           | 31 | no                        |
2552	          | mdsthreshold       | 68 | no                        |
2553	          | mimetype           | 32 | MUST                      |
2554	          | mode               | 33 | MUST                      |
2555	          | mode_set_masked    | 74 | no                        |
2556	          | mounted_on_fileid  | 55 | no                        |
2557	          | no_trunc           | 34 | no                        |
2558	          | numlinks           | 35 | no                        |
2559	          | owner              | 36 | MUST                      |
2560	          | owner_group        | 37 | MUST                      |
2561	          | quota_avail_hard   | 38 | no                        |
2562	          | quota_avail_soft   | 39 | no                        |
2563	          | quota_used         | 40 | no                        |
2564	          | rawdev             | 41 | no                        |
2565	          | retentevt_get      | 71 | MUST                      |
2566	          | retentevt_set      | 72 | no                        |
2567	          | retention_get      | 69 | MUST                      |
2568	          | retention_hold     | 73 | MUST                      |
2569	          | retention_set      | 70 | no                        |
2570	          | sacl               | 59 | MUST                      |
2571	          | sec_label          | 80 | MUST                      |
2572	          | space_avail        | 42 | no                        |
2573	          | space_free         | 43 | no                        |
2574	          | space_freed        | 78 | no                        |
2575	          | space_reserved     | 77 | MUST                      |
2576	          | space_total        | 44 | no                        |
2577	          | space_used         | 45 | no                        |
2578	          | system             | 46 | MUST                      |
2579	          | time_access        | 47 | MUST                      |
2580	          | time_access_set    | 48 | no                        |
2581	          | time_backup        | 49 | no                        |
2582	          | time_create        | 50 | MUST                      |
2583	          | time_delta         | 51 | no                        |
2584	          | time_metadata      | 52 | SHOULD                    |
2585	          | time_modify        | 53 | MUST                      |
2586	          | time_modify_set    | 54 | no                        |
2587	          +--------------------+----+---------------------------+

2589	                                  Table 4

2591	   [NOTE: The source file's attribute values will take precedence over
2592	   any attribute values inherited by the destination file.]

2594	   In the case of an inter-server copy or an intra-server copy between
2595	   file systems, the attributes supported for the source file and
2596	   destination file could be different.  By definition, the REQUIRED
2597	   attributes will be supported in all cases.  If the metadata flag is
2598	   set and the source file has a RECOMMENDED attribute that is not
2599	   supported for the destination file, the copy MUST fail with
2600	   NFS4ERR_ATTRNOTSUPP.
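
   The attribute support rule above amounts to a set difference.  The
   following sketch assumes attributes are represented by their names
   as strings; the function and parameter names are hypothetical.

```python
def check_metadata_copy(metadata_flag_set, source_recommended_attrs,
                        dest_supported_attrs):
    """Return "NFS4ERR_ATTRNOTSUPP" if the metadata copy cannot succeed.

    source_recommended_attrs: RECOMMENDED attributes the source file has.
    dest_supported_attrs: attributes supported for the destination file.
    """
    if not metadata_flag_set:
        return None
    unsupported = set(source_recommended_attrs) - set(dest_supported_attrs)
    if unsupported:
        return "NFS4ERR_ATTRNOTSUPP"
    return None
```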

2602	   Any attribute supported by the destination server that is not set on
2603	   the source file SHOULD be left unset.

2605	   Metadata attributes not exposed via the NFS protocol SHOULD be copied
2606	   to the destination file where appropriate.

2608	   The destination file's named attributes are not duplicated from the
2609	   source file.  After the copy process completes, the client MAY
2610	   attempt to duplicate named attributes using standard NFSv4
2611	   operations.  However, the destination file's named attribute
2612	   capabilities MAY be different from the source file's named attribute
2613	   capabilities.

2615	   If the metadata flag is not set and the client is requesting a whole
2616	   file copy (i.e., ca_count is 0 (zero)), the destination file's
2617	   metadata is implementation dependent.

2619	   If the client is requesting a partial file copy (i.e., ca_count is
2620	   not 0 (zero)), the client SHOULD NOT set the metadata flag and the
2621	   server MUST ignore the metadata flag.
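
   A server might reduce the metadata rules to a single predicate, as
   in the sketch below.  The flag value comes from the ARGUMENT
   section; the function name is an assumption.

```python
COPY4_METADATA = 0x00000002

def metadata_copy_in_effect(ca_flags, ca_count):
    """True only for a whole-file copy with COPY4_METADATA set.

    For a partial copy (ca_count != 0) the flag is ignored.
    """
    return ca_count == 0 and bool(ca_flags & COPY4_METADATA)
```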

2623	   If the operation does not result in an immediate failure, the server
2624	   will return NFS4_OK, and the CURRENT_FH will remain the destination's
2625	   filehandle.

2627	   If an immediate failure does occur, cr_bytes_copied will be set to
2628	   the number of bytes copied to the destination file before the error
2629	   occurred.  The cr_bytes_copied value indicates the number of bytes
2630	   copied but not which specific bytes have been copied.

2632	   A return of NFS4_OK indicates that either the operation is complete
2633	   or the operation was initiated and a callback will be used to deliver
2634	   the final status of the operation.

2636	   If the cr_callback_id is returned, this indicates that the operation
2637	   was initiated and a CB_COPY callback will deliver the final results
2638	   of the operation.  The cr_callback_id stateid is termed a copy
2639	   stateid in this context.  The server is given the option of returning
2640	   the results in a callback because the data may require a relatively
2641	   long period of time to copy.

2643	   If no cr_callback_id is returned, the operation completed
2644	   synchronously and no callback will be issued by the server.  The
2645	   completion status of the operation is indicated by cr_status.
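
   A client's handling of the three possible outcomes can be sketched
   as follows.  The optional cr_callback_id<1> is modeled here as a
   list holding zero or one stateid; that representation, the function
   name, and the returned tags are assumptions of the sketch.

```python
NFS4_OK = 0

def interpret_copy_result(cr_status, cr_callback_id=(), cr_bytes_copied=None):
    """Classify a COPY reply as failed, asynchronous, or complete."""
    if cr_status != NFS4_OK:
        # Immediate failure: only a byte count is available.
        return ("failed", cr_bytes_copied)
    if len(cr_callback_id) == 1:
        # Copy stateid returned: wait for CB_COPY (or poll).
        return ("async", cr_callback_id[0])
    # NFS4_OK without a copy stateid: the copy finished synchronously.
    return ("complete", None)
```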

2647	   If the copy completes successfully, either synchronously or
2648	   asynchronously, the data copied from the source file to the
2649	   destination file MUST appear identical to the NFS client.  However,
2650	   the NFS server's on disk representation of the data in the source
2651	   file and destination file MAY differ.  For example, the NFS server
2652	   might encrypt, compress, deduplicate, or otherwise represent the on
2653	   disk data in the source and destination file differently.

2655	   In the event of a failure the state of the destination file is
2656	   implementation dependent.  The COPY operation may fail for the
2657	   following reasons (this is a partial list).

2659	   o  NFS4ERR_MOVED

2661	   o  NFS4ERR_NOTSUPP

2663	   o  NFS4ERR_PARTNER_NOTSUPP

2665	   o  NFS4ERR_OFFLOAD_DENIED

2667	   o  NFS4ERR_PARTNER_NO_AUTH

2669	   o  NFS4ERR_FBIG

2671	   o  NFS4ERR_NOTDIR

2673	   o  NFS4ERR_WRONG_TYPE

2675	   o  NFS4ERR_ISDIR

2677	   o  NFS4ERR_INVAL

2679	   o  NFS4ERR_DELAY

2680	   o  NFS4ERR_METADATA_NOTSUPP

2682	   o  NFS4ERR_WRONGSEC

2684	13.2.  Operation 60: COPY_ABORT - Cancel a server-side copy

2686	13.2.1.  ARGUMENT

2688	   struct COPY_ABORT4args {
2689	           /* CURRENT_FH: destination file */
2690	           stateid4        caa_stateid;
2691	   };

2693	13.2.2.  RESULT

2695	   struct COPY_ABORT4res {
2696	           nfsstat4        car_status;
2697	   };

2699	13.2.3.  DESCRIPTION

2701	   COPY_ABORT is used for both intra- and inter-server asynchronous
2702	   copies.  The COPY_ABORT operation allows the client to cancel a
2703	   server-side copy operation that it initiated.  This operation is sent
2704	   in a COMPOUND request from the client to the destination server.
2705	   This operation may be used to cancel a copy when the application that
2706	   requested the copy exits before the operation is completed or for
2707	   some other reason.

2709	   The request contains the filehandle and copy stateid cookies that act
2710	   as the context for the previously initiated copy operation.

2712	   The result's car_status field indicates whether the cancel was
2713	   successful or not.  A value of NFS4_OK indicates that the copy
2714	   operation was canceled and no callback will be issued by the server.
2715	   A copy operation that is successfully canceled may result in none,
2716	   some, or all of the data copied.

2718	   If the server supports asynchronous copies, the server is REQUIRED to
2719	   support the COPY_ABORT operation.

2721	   The COPY_ABORT operation may fail for the following reasons (this is
2722	   a partial list):

2724	   o  NFS4ERR_NOTSUPP

2725	   o  NFS4ERR_RETRY

2727	   o  NFS4ERR_COMPLETE_ALREADY

2729	   o  NFS4ERR_SERVERFAULT

2731	13.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future
2732	       copy

2734	13.3.1.  ARGUMENT

2736	   struct COPY_NOTIFY4args {
2737	           /* CURRENT_FH: source file */
2738	           netloc4         cna_destination_server;
2739	   };

2741	13.3.2.  RESULT

2743	   struct COPY_NOTIFY4resok {
2744	           nfstime4        cnr_lease_time;
2745	           netloc4         cnr_source_server<>;
2746	   };

2748	   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
2749	           case NFS4_OK:
2750	                   COPY_NOTIFY4resok       resok4;
2751	           default:
2752	                   void;
2753	   };

2755	13.3.3.  DESCRIPTION

2757	   This operation is used for an inter-server copy.  A client sends this
2758	   operation in a COMPOUND request to the source server to authorize a
2759	   destination server identified by cna_destination_server to read the
2760	   file specified by CURRENT_FH on behalf of the given user.

2762	   The cna_destination_server MUST be specified using the netloc4
2763	   network location format.  The server is not required to resolve the
2764	   cna_destination_server address before completing this operation.

2766	   If this operation succeeds, the source server will allow the
2767	   cna_destination_server to copy the specified file on behalf of the
2768	   given user.  If COPY_NOTIFY succeeds, the destination server is
2769	   granted permission to read the file as long as both of the following
2770	   conditions are met:

2772	   o  The destination server begins reading the source file before the
2773	      cnr_lease_time expires.  If the cnr_lease_time expires while the
2774	      destination server is still reading the source file, the
2775	      destination server is allowed to finish reading the file.

2777	   o  The client has not issued a COPY_REVOKE for the same combination
2778	      of user, filehandle, and destination server.
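
   The two conditions above can be expressed as a predicate evaluated
   by the source server.  This is a sketch under stated assumptions:
   times are plain numbers on the source server's clock, lease_expiry
   is the instant the copy lease runs out, and all names are
   hypothetical.

```python
def destination_may_read(now, lease_expiry, read_started_at, revoked):
    """Whether the destination server may read the source file.

    read_started_at is None if the destination has not begun reading.
    """
    if revoked:
        # A COPY_REVOKE was issued for this user, filehandle, and server.
        return False
    if read_started_at is not None:
        # A read begun before expiry is allowed to finish afterwards.
        return read_started_at < lease_expiry
    # Not yet started: the read may begin only while the lease is live.
    return now < lease_expiry
```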

2780	   The cnr_lease_time is chosen by the source server.  A cnr_lease_time
2781	   of 0 (zero) indicates an infinite lease.  To renew the copy lease
2782	   time the client should resend the same copy notification request to
2783	   the source server.

2785	   To avoid the need for synchronized clocks, copy lease times are
2786	   granted by the server as a time delta.  However, there is a
2787	   requirement that the client and server clocks do not drift
2788	   excessively over the duration of the lease.  There is also the issue
2789	   of propagation delay across the network which could easily be several
2790	   hundred milliseconds as well as the possibility that requests will be
2791	   lost and need to be retransmitted.

2793	   To take propagation delay into account, the client should subtract it
2794	   from copy lease times (e.g., if the client estimates the one-way
2795	   propagation delay as 200 milliseconds, then it can assume that the
2796	   lease is already 200 milliseconds old when it gets it).  In addition,
2797	   it will take another 200 milliseconds to get a response back to the
2798	   server.  So the client must send a lease renewal or send the copy
2799	   offload request to the cna_destination_server at least 400
2800	   milliseconds before the copy lease would expire.  If the propagation
2801	   delay varies over the life of the lease (e.g., the client is on a
2802	   mobile host), the client will need to continuously subtract the
2803	   increase in propagation delay from the copy lease times.
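
   The arithmetic in the preceding paragraph can be sketched as
   follows.  Times are integer milliseconds; the deadline is measured
   from the moment the client receives the COPY_NOTIFY reply.  The
   function name is hypothetical, and the one-way delay is the
   client's own estimate.

```python
def renewal_send_deadline_ms(lease_ms, one_way_delay_ms):
    """Latest time after receipt at which to send a renewal or COPY.

    Returns None for an infinite lease (a lease time of zero).
    """
    if lease_ms == 0:
        return None
    # One delay has already elapsed when the reply arrives, and one
    # more is needed for the next request to reach its server.
    return max(lease_ms - 2 * one_way_delay_ms, 0)
```

   With a 90-second lease and a 200-millisecond one-way delay, the
   client has 89600 milliseconds after receiving the reply, that is,
   400 milliseconds less than the nominal lease.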

2805	   The server's copy lease period configuration should take into account
2806	   the network distance of the clients that will be accessing the
2807	   server's resources.  It is expected that the lease period will take
2808	   into account the network propagation delays and other network delay
2809	   factors for the client population.  Since the protocol does not allow
2810	   for an automatic method to determine an appropriate copy lease
2811	   period, the server's administrator may have to tune the copy lease
2812	   period.

2814	   A successful response will also contain a list of names, addresses,
2815	   and URLs called cnr_source_server, on which the source is willing to
2816	   accept connections from the destination.  These might not be
2817	   reachable from the client and might be located on networks to which
2818	   the client has no connection.

2820	   If the client wishes to perform an inter-server copy, the client MUST
2821	   send a COPY_NOTIFY to the source server.  Therefore, the source
2822	   server MUST support COPY_NOTIFY.

2824	   For a copy only involving one server (the source and destination are
2825	   on the same server), this operation is unnecessary.

2827	   The COPY_NOTIFY operation may fail for the following reasons (this is
2828	   a partial list):

2830	   o  NFS4ERR_MOVED

2832	   o  NFS4ERR_NOTSUPP

2834	   o  NFS4ERR_WRONGSEC

2836	13.4.  Operation 62: COPY_REVOKE - Revoke a destination server's copy
2837	       privileges

2839	13.4.1.  ARGUMENT

2841	   struct COPY_REVOKE4args {
2842	           /* CURRENT_FH: source file */
2843	           netloc4         cra_destination_server;
2844	   };

2846	13.4.2.  RESULT

2848	   struct COPY_REVOKE4res {
2849	           nfsstat4        crr_status;
2850	   };

2852	13.4.3.  DESCRIPTION

2854	   This operation is used for an inter-server copy.  A client sends this
2855	   operation in a COMPOUND request to the source server to revoke the
2856	   authorization of a destination server identified by
2857	   cra_destination_server from reading the file specified by CURRENT_FH
2858	   on behalf of the given user.  If the cra_destination_server has
2859	   already begun copying the file, a successful return from this
2860	   operation indicates that further access will be prevented.

2862	   The cra_destination_server MUST be specified using the netloc4
2863	   network location format.  The server is not required to resolve the
2864	   cra_destination_server address before completing this operation.

2866	   The COPY_REVOKE operation is useful in situations in which the source
2867	   server granted a very long or infinite lease on the destination
2868	   server's ability to read the source file and all copy operations on
2869	   the source file have been completed.

2871	   For a copy only involving one server (the source and destination are
2872	   on the same server), this operation is unnecessary.

2874	   If the server supports COPY_NOTIFY, the server is REQUIRED to support
2875	   the COPY_REVOKE operation.

2877	   The COPY_REVOKE operation may fail for the following reasons (this is
2878	   a partial list):

2880	   o  NFS4ERR_MOVED

2882	   o  NFS4ERR_NOTSUPP

2884	13.5.  Operation 63: COPY_STATUS - Poll for status of a server-side copy

2886	13.5.1.  ARGUMENT

2888	   struct COPY_STATUS4args {
2889	           /* CURRENT_FH: destination file */
2890	           stateid4        csa_stateid;
2891	   };

2893	13.5.2.  RESULT

2895	   struct COPY_STATUS4resok {
2896	           length4         csr_bytes_copied;
2897	           nfsstat4        csr_complete<1>;
2898	   };

2900	   union COPY_STATUS4res switch (nfsstat4 csr_status) {
2901	           case NFS4_OK:
2902	                   COPY_STATUS4resok       resok4;
2903	           default:
2904	                   void;
2905	   };

2907	13.5.3.  DESCRIPTION

2909	   COPY_STATUS is used for both intra- and inter-server asynchronous
2910	   copies.  The COPY_STATUS operation allows the client to poll the
2911	   server to determine the status of an asynchronous copy operation.
2912	   This operation is sent by the client to the destination server.

2914	   If this operation is successful, the number of bytes copied is
2915	   returned to the client in the csr_bytes_copied field.  The
2916	   csr_bytes_copied value indicates the number of bytes copied but not
2917	   which specific bytes have been copied.

2919	   If the optional csr_complete field is present, the copy has
2920	   completed.  In this case the status value indicates the result of the
2921	   asynchronous copy operation.  In all cases, the server will also
2922	   deliver the final results of the asynchronous copy in a CB_COPY
2923	   operation.

2925	   The failure of this operation does not indicate the result of the
2926	   asynchronous copy in any way.
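
   A polling client might interpret a COPY_STATUS reply as sketched
   below.  The optional csr_complete<1> is modeled as a list holding
   zero or one status value; that representation, the function name,
   and the returned tags are assumptions of the sketch.

```python
NFS4_OK = 0

def interpret_copy_status(csr_status, csr_bytes_copied=None, csr_complete=()):
    """Return (state, bytes_copied, final_status) for a poll reply."""
    if csr_status != NFS4_OK:
        # A failed poll says nothing about the copy itself.
        return ("unknown", None, None)
    if len(csr_complete) == 1:
        # Copy finished; csr_complete carries its result.
        return ("done", csr_bytes_copied, csr_complete[0])
    return ("in_progress", csr_bytes_copied, None)
```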

2928	   If the server supports asynchronous copies, the server is REQUIRED to
2929	   support the COPY_STATUS operation.

2931	   The COPY_STATUS operation may fail for the following reasons (this is
2932	   a partial list):

2934	   o  NFS4ERR_NOTSUPP

2936	   o  NFS4ERR_BAD_STATEID

2938	   o  NFS4ERR_EXPIRED

2940	13.6.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID

2942	13.6.1.  ARGUMENT

2944	      /* new */
2945	      const EXCHGID4_FLAG_SUPP_FENCE_OPS      = 0x00000004;

2947	13.6.2.  RESULT

2949	      Unchanged

2951	13.6.3.  MOTIVATION

2953	   Enterprise applications require guarantees that an operation has
2954	   either aborted or completed.  NFSv4.1 provides this guarantee as long
2955	   as the session is alive: simply send a SEQUENCE operation on the same
2956	   slot with a new sequence number, and the successful return of
2957	   SEQUENCE indicates the previous operation has completed.  However, if
2958	   the session is lost, there is no way to know when any in-progress
2959	   operations have aborted or completed.  In hindsight, the NFSv4.1
2960	   specification should have mandated that DESTROY_SESSION abort/
2961	   complete all outstanding operations.

2963	13.6.4.  DESCRIPTION

2965	   A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
2966	   when it sends an EXCHANGE_ID operation.  The server SHOULD set this
2967	   capability in the EXCHANGE_ID reply whether the client requests it or
2968	   not.  If the client ID is created with this capability then the
2969	   following will occur:

2971	   o  The server will not reply to DESTROY_SESSION until all operations
2972	      in progress are completed or aborted.

2974	   o  The server will not reply to subsequent EXCHANGE_ID invoked on the
2975	      same Client Owner with a new verifier until all operations in
2976	      progress on the Client ID's session are completed or aborted.

2978	   o  When DESTROY_CLIENTID is invoked, any sessions (both idle and
2979	      non-idle), opens, locks, delegations, layouts, and/or wants
2980	      (Section 18.49 of [2]) associated with the client ID are removed.
2981	      Pending operations will be completed or aborted before the
2982	      sessions, opens, locks, delegations, layouts, and/or wants are
2983	      deleted.

2985	   o  The NFS server SHOULD support client ID trunking, and if it does
2986	      and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
2987	      session ID created on one node of the storage cluster MUST be
2988	      destroyable via DESTROY_SESSION.  In addition, DESTROY_CLIENTID
2989	      and an EXCHANGE_ID with a new verifier affect all sessions
2990	      regardless of what node the sessions were created on.

2992	13.7.  Operation 64: INITIALIZE

2994	   This operation can be used to initialize the structure imposed by an
2995	   application onto a file, i.e., ADBs, and to punch a hole into a file.

2997	13.7.1.  ARGUMENT

2999	   /*
3000	    * We use data_content4 in case we wish to
3001	    * extend new types later. Note that we
3002	    * are explicitly disallowing data.
3003	    */
3004	   union initialize_arg4 switch (data_content4 content) {
3005	   case NFS4_CONTENT_APP_BLOCK:
3006	           app_data_block4 ia_adb;
3007	   case NFS4_CONTENT_HOLE:
3008	           data_info4      ia_hole;
3009	   default:
3010	           void;
3011	   };

3013	   struct INITIALIZE4args {
3014	           /* CURRENT_FH: file */
3015	           stateid4        ia_stateid;
3016	           stable_how4     ia_stable;
3017	           initialize_arg4 ia_data<>;
3018	   };

3020	13.7.2.  RESULT

3022	   struct INITIALIZE4resok {
3023	           count4          ir_count;
3024	           stable_how4     ir_committed;
3025	           verifier4       ir_writeverf;
3026	           data_content4   ir_sparse;
3027	   };

3029	   union INITIALIZE4res switch (nfsstat4 status) {
3030	   case NFS4_OK:
3031	           INITIALIZE4resok        resok4;
3032	   default:
3033	           void;
3034	   };

3036	13.7.3.  DESCRIPTION

3038	   Using the data_content4 (Section 6.1.2), INITIALIZE can be used
3039	   either to punch holes or to impose ADB structure on a file.

3041	13.7.3.1.  Hole punching

3043	   Whenever a client wishes to zero the blocks backing a particular
3044	   region in the file, it calls the INITIALIZE operation with the
3045	   current filehandle set to the filehandle of the file in question, and
3046	   the equivalent of start offset and length in bytes of the region set
3047	   in ia_hole.di_offset and ia_hole.di_length respectively.  If the
3048	   ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed
3049	   and if it is set to FALSE, then they will be deallocated.  All
3050	   further reads to this region MUST return zeros until overwritten.
3051	   The filehandle specified must be that of a regular file.

3053	   Situations may arise where di_offset and/or di_offset + di_length
3054	   will not be aligned to a boundary that the server does allocations/
3055	   deallocations in.  For most filesystems, this is the block size of
3056	   the file system.  In such a case, the server can deallocate as many
3057	   bytes as it can in the region.  The blocks that cannot be deallocated
3058	   MUST be zeroed.  Except for the block deallocation and maximum hole
3059	   punching capability, an INITIALIZE operation is to be treated
3060	   similarly to a write of zeroes.
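
   The alignment rule above can be sketched in C as follows.  This is
   a minimal illustration, assuming a single fixed allocation unit
   (block_size); the names byte_range, hole_plan, and plan_hole_punch
   are hypothetical.  Ranges are half-open, [start, end).

```c
#include <assert.h>
#include <stdint.h>

struct byte_range {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

struct hole_plan {
    struct byte_range zero_head;  /* unaligned leading bytes: must be zeroed */
    struct byte_range dealloc;    /* whole blocks: may be deallocated */
    struct byte_range zero_tail;  /* unaligned trailing bytes: must be zeroed */
};

struct hole_plan
plan_hole_punch(uint64_t di_offset, uint64_t di_length, uint64_t block_size)
{
    assert(block_size > 0);
    assert(di_length <= UINT64_MAX - di_offset);

    uint64_t end = di_offset + di_length;

    /* Round the start up to the next block boundary, capped at end. */
    uint64_t first = di_offset;
    uint64_t rem = di_offset % block_size;
    if (rem != 0) {
        uint64_t pad = block_size - rem;
        first = (pad > end - di_offset) ? end : di_offset + pad;
    }

    /* Round the end down to a block boundary, never below first. */
    uint64_t last = end - end % block_size;
    if (last < first)
        last = first;

    struct hole_plan plan = {
        { di_offset, first },
        { first, last },
        { last, end }
    };
    return plan;
}
```

   For example, with a 4096-byte block size, a request with di_offset
   1000 and di_length 10000 yields a deallocatable range of 4096 to
   8191, with bytes 1000 to 4095 and 8192 to 10999 zeroed instead.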

3062	   The server is not required to complete deallocating the blocks
3063	   specified in the operation before returning.  It is acceptable to
3064	   have the deallocation be deferred.  In fact, INITIALIZE is merely a
3065	   hint; it is valid for a server to return success without ever doing
3066	   anything towards deallocating the blocks backing the region
3067	   specified.  However, any future reads to the region MUST return
3068	   zeroes.

3070	   If used to hole punch, INITIALIZE will result in the space_used
3071	   attribute being decreased by the number of bytes that were
3072	   deallocated.  The space_freed attribute may or may not decrease,
3073	   depending on the support and whether the blocks backing the specified
3074	   range were shared or not.  The size attribute will remain unchanged.

3076	   The INITIALIZE operation MUST NOT change the space reservation
3077	   guarantee of the file.  While the server can deallocate the blocks
3078	   specified by di_offset and di_length, future writes to this region
3079	   MUST NOT fail with NFS4ERR_NOSPC.

3081	   The INITIALIZE operation may fail for the following reasons (this is
3082	   a partial list):

3084	   NFS4ERR_NOTSUPP  Hole punch operations are not supported by the
3085	      NFS server receiving this request.

3087	   NFS4ERR_DIR  The current filehandle is of type NF4DIR.

3089	   NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

3091	   NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
3092	      ordinary file.

3094	13.7.3.2.  ADBs

3096	   If the server supports ADBs, then it MUST support the
3097	   NFS4_CONTENT_APP_BLOCK arm of the INITIALIZE operation.  The server
3098	   has no concept of the structure imposed by the application.  It is
3099	   only when the application writes to a section of the file that order
3100	   gets imposed.  In order to detect corruption even before the
3101	   application utilizes the file, the application will want to
3102	   initialize a range of ADBs using INITIALIZE.

3104	   For ADBs, when the client invokes the INITIALIZE operation, it has
3105	   two desired results:

3107	   1.  The structure described by the app_data_block4 be imposed on the
3108	       file.

3110	   2.  The contents described by the app_data_block4 be sparse.

3112	   If the server supports the INITIALIZE operation, it still might not
3113	   support sparse files.  So if such a server receives INITIALIZE,
3114	   then it MUST populate the contents of the file with the initialized
3115	   ADBs.

3117	   If the data was already initialized, there are two interesting
3118	   scenarios:

3120	   1.  The data blocks are allocated.

3122	   2.  Initializing in the middle of an existing ADB.

3124	   If the data blocks were already allocated, then the INITIALIZE is a
3125	   hole punch operation.  If INITIALIZE supports sparse files, then the
3126	   data blocks are to be deallocated.  If not, then the data blocks are
3127	   to be rewritten in the indicated ADB format.

3129	   Since the server has no knowledge of ADBs, it should not report
3130	   misaligned creation of ADBs.  Even though it can detect them, it
3131	   cannot disallow them, as the application might be in the process of
3132	   changing the size of the ADBs.  Thus the server must be prepared to
3133	   handle an INITIALIZE into an existing ADB.

3135	   This document does not mandate the manner in which the server stores
3136	   ADBs sparsely for a file.  It does assume that if ADBs are stored
3137	   sparsely, then the server can detect when an INITIALIZE arrives that
3138	   will force a new ADB to start inside an existing ADB.  For example,
3139	   assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
3140	   starts 1k inside ADBi.  The server should [[Comment.2: Need to flesh
3141	   this out. --TH]]

3143	13.8.  Operation 67: IO_ADVISE - Application I/O access pattern hints

3145	   This section introduces a new operation, named IO_ADVISE, which
3146	   allows NFS clients to communicate application I/O access pattern
3147	   hints to the NFS server.  This new operation will allow hints to be
3148	   sent to the server when applications use posix_fadvise, direct I/O,
3149	   or at any other point that the client finds useful.

3151	13.8.1.  ARGUMENT

3153	   enum IO_ADVISE_type4 {
3154	           IO_ADVISE4_NORMAL                       = 0,
3155	           IO_ADVISE4_SEQUENTIAL                   = 1,
3156	           IO_ADVISE4_SEQUENTIAL_BACKWARDS         = 2,
3157	           IO_ADVISE4_RANDOM                       = 3,
3158	           IO_ADVISE4_WILLNEED                     = 4,
3159	           IO_ADVISE4_WILLNEED_OPPORTUNISTIC       = 5,
3160	           IO_ADVISE4_DONTNEED                     = 6,
3161	           IO_ADVISE4_NOREUSE                      = 7,
3162	           IO_ADVISE4_READ                         = 8,
3163	           IO_ADVISE4_WRITE                        = 9,
3164	           IO_ADVISE4_INIT_PROXIMITY               = 10
3165	   };

3167	   struct IO_ADVISE4args {
3168	           /* CURRENT_FH: file */
3169	           stateid4        iar_stateid;
3170	           offset4         iar_offset;
3171	           length4         iar_count;
3172	           bitmap4         iar_hints;
3173	   };

3175	13.8.2.  RESULT

3177	   struct IO_ADVISE4resok {
3178	           bitmap4 ior_hints;
3179	   };

3181	   union IO_ADVISE4res switch (nfsstat4 _status) {
3182	   case NFS4_OK:
3183	           IO_ADVISE4resok resok4;
3184	   default:
3185	           void;
3186	   };

3188	13.8.3.  DESCRIPTION

3190	   The IO_ADVISE operation sends an I/O access pattern hint to the
3191	   server for the owner of the stateid for a given byte range specified
3192	   by iar_offset and iar_count.  The byte range specified by iar_offset
3193	   and iar_count need not currently exist in the file, but the
3194	   iar_hints will apply to the byte range when it does exist.  If
3195	   iar_count is 0, all data following iar_offset is specified.  The
3196	   server MAY ignore the advice.

3198	   The following are the possible hints:

3200	   IO_ADVISE4_NORMAL  Specifies that the application has no advice to
3201	      give on its behavior with respect to the specified data.  It is
3202	      the default characteristic if no advice is given.

3204	   IO_ADVISE4_SEQUENTIAL  Specifies that the stateid holder expects to
3205	      access the specified data sequentially from lower offsets to
3206	      higher offsets.

3208	   IO_ADVISE4_SEQUENTIAL_BACKWARDS  Specifies that the stateid holder
3209	      expects to access the specified data sequentially from higher
3210	      offsets to lower offsets.

3212	   IO_ADVISE4_RANDOM  Specifies that the stateid holder expects to
3213	      access the specified data in a random order.

3215	   IO_ADVISE4_WILLNEED  Specifies that the stateid holder expects to
3216	      access the specified data in the near future.

3218	   IO_ADVISE4_WILLNEED_OPPORTUNISTIC  Specifies that the stateid holder
3219	      expects to possibly access the data in the near future.  This is a
3220	      speculative hint, and therefore the server should prefetch data or
3221	      indirect blocks only if it can be done at a marginal cost.

3223	   IO_ADVISE4_DONTNEED  Specifies that the stateid holder expects that
3224	      it will not access the specified data in the near future.

3226	   IO_ADVISE4_NOREUSE  Specifies that the stateid holder expects to
3227	      access the specified data once and then not reuse it thereafter.

3229	   IO_ADVISE4_READ  Specifies that the stateid holder expects to read
3230	      the specified data in the near future.

3232	   IO_ADVISE4_WRITE  Specifies that the stateid holder expects to write
3233	      the specified data in the near future.

3235	   IO_ADVISE4_INIT_PROXIMITY  The client has recently accessed the byte
3236	      range in its own cache.  This informs the server that the data in
3237	      the byte range remains important to the client.  When the server
3238	      reaches resource exhaustion, knowing which data is more important
3239	      allows the server to make better choices about which data to, for
3240	      example, purge from a cache or move to secondary storage.  It also
3241	      informs the server which delegations are more important, since if
3242	      delegations are working correctly, once delegated to a client, a
3243	      server might never receive another I/O request for the file.

3245	   The server will return success if the operation is properly formed,
3246	   otherwise the server will return an error.  The server MUST NOT
3247	   return an error if it does not recognize or does not support the
3248	   requested advice.  This is also true even if the client sends
3249	   contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and
3250	   IO_ADVISE4_RANDOM in a single IO_ADVISE operation.  In this case, the
3251	   server MUST return success and an ior_hints value that indicates the
3252	   hint it intends to optimize.  For contradictory hints, this may mean
3253	   simply returning IO_ADVISE4_NORMAL for example.
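
   The handling of contradictory hints might be sketched in C as
   follows.  This is an illustration under stated assumptions: each
   IO_ADVISE_type4 value is taken to be a bit position in the bitmap4
   (bit n in word n / 32), a single 32-bit word stands in for the
   variable-length bitmap4, and the four access-pattern hints NORMAL,
   SEQUENTIAL, SEQUENTIAL_BACKWARDS, and RANDOM are treated as
   mutually exclusive.  The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum IO_ADVISE_type4 {
    IO_ADVISE4_NORMAL                 = 0,
    IO_ADVISE4_SEQUENTIAL             = 1,
    IO_ADVISE4_SEQUENTIAL_BACKWARDS   = 2,
    IO_ADVISE4_RANDOM                 = 3,
    IO_ADVISE4_WILLNEED               = 4,
    IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
    IO_ADVISE4_DONTNEED               = 6,
    IO_ADVISE4_NOREUSE                = 7,
    IO_ADVISE4_READ                   = 8,
    IO_ADVISE4_WRITE                  = 9,
    IO_ADVISE4_INIT_PROXIMITY         = 10
};

/* One word covers hints 0..10; a real bitmap4 is variable length. */
struct hint_bitmap {
    uint32_t word0;
};

void hint_set(struct hint_bitmap *bm, enum IO_ADVISE_type4 h)
{
    assert((unsigned)h < 32);
    bm->word0 |= UINT32_C(1) << (unsigned)h;
}

bool hint_test(const struct hint_bitmap *bm, enum IO_ADVISE_type4 h)
{
    assert((unsigned)h < 32);
    return (bm->word0 >> (unsigned)h) & 1u;
}

/* True if more than one access-pattern hint is set (assumed grouping). */
bool hints_contradict(const struct hint_bitmap *bm)
{
    int n = hint_test(bm, IO_ADVISE4_NORMAL)
          + hint_test(bm, IO_ADVISE4_SEQUENTIAL)
          + hint_test(bm, IO_ADVISE4_SEQUENTIAL_BACKWARDS)
          + hint_test(bm, IO_ADVISE4_RANDOM);
    return n > 1;
}

/* One possible server policy: fall back to NORMAL on contradiction. */
struct hint_bitmap choose_reply_hints(const struct hint_bitmap *req)
{
    struct hint_bitmap rep = *req;
    if (hints_contradict(req)) {
        rep.word0 = 0;
        hint_set(&rep, IO_ADVISE4_NORMAL);
    }
    return rep;
}
```

   A server is free to choose a different policy; the only firm
   requirement in the text above is that the operation succeeds and
   that ior_hints reflects what the server intends to optimize.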

3255	   The ior_hints returned by the server is primarily for debugging
3256	   purposes since the server is under no obligation to carry out the
3257	   hints that it describes in the ior_hints result.  In addition, while
3258	   the server may have intended to implement the hints returned in
3259	   ior_hints, as time progresses, the server may need to change its
3260	   handling of a given file due to several reasons including, but not
3261	   limited to, memory pressure, additional IO_ADVISE hints sent by other
3262	   clients, and heuristically detected file access patterns.

3264	   The server MAY return different advice than what the client
3265	   requested.  If it does, then this might be due to one of several
3266	   conditions, including, but not limited to: another client advising
3267	   of a different I/O access pattern; a different I/O access pattern
3268	   from another client that the server has heuristically detected; or
3269	   the server is not able to support the requested I/O access pattern,
3270	   perhaps due to a temporary resource limitation.

3272	   Each issuance of the IO_ADVISE operation overrides all previous
3273	   issuances of IO_ADVISE for a given byte range.  This effectively
3274	   follows a strategy of last hint wins for a given stateid and byte
3275	   range.

3277	   Clients should assume that hints included in an IO_ADVISE operation
3278	   will be forgotten once the file is closed.

3280	13.8.4.  IMPLEMENTATION

3282	   The NFS client may choose to issue an IO_ADVISE operation to the
3283	   server in several different instances.

3285	   The most obvious is in direct response to an application's execution
3286	   of posix_fadvise.  In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
3287	   may be set based upon the type of file access specified when the file
3288	   was opened.

3290	   Another useful point would be when an application indicates it is
3291	   using direct I/O. Direct I/O may be specified at file open, in which
3292	   case an IO_ADVISE may be included in the same compound as the OPEN
3293	   operation with the IO_ADVISE4_NOREUSE flag set.  Direct I/O may also
3294	   be specified separately, in which case an IO_ADVISE operation can be
3295	   sent to the server separately.  As above, IO_ADVISE4_WRITE and
3296	   IO_ADVISE4_READ may be set based upon the type of file access
3297	   specified when the file was opened.

3299	13.8.5.  pNFS File Layout Data Type Considerations

3301	   The IO_ADVISE considerations for pNFS are very similar to the COMMIT
3302	   considerations for pNFS.  That is, as with COMMIT, some NFS server
3303	   implementations prefer IO_ADVISE be done on the DS, and some prefer
3304	   it be done on the MDS.

3306	   So for the file's layout type, it is proposed that NFSv4.2 include an
3307	   additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS which is valid only on
3308	   NFSv4.2 or higher.  Any file's layout obtained with NFSv4.1 MUST NOT
3309	   have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  Any file's layout obtained
3310	   with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set.  If the
3311	   client does not implement IO_ADVISE, then it MUST ignore
3312	   NFL42_UFLG_IO_ADVISE_THRU_MDS.

3314	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is set and the client implements
3315	   IO_ADVISE, then a client that wants the DS to honor IO_ADVISE MUST
3316	   send the operation to the MDS, and the server will communicate the
3317	   advice back to each DS.  If the client sends IO_ADVISE to the DS,
3318	   then the server MAY return NFS4ERR_NOTSUPP.

3320	   If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then this indicates to
3321	   the client that if it wants to inform the server via IO_ADVISE of the
3322	   client's intended use of the file, then the client SHOULD send an
3323	   IO_ADVISE to each DS.  While the client MAY always send IO_ADVISE to
3324	   the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the
3325	   client should expect that such an IO_ADVISE is futile.  Note that a
3326	   client SHOULD use the same set of arguments on each IO_ADVISE sent to
3327	   a DS for the same open file reference.

3329	   The server is not required to support different advice for different
3330	   DS's with the same open file reference.

3332	13.8.5.1.  Dense and Sparse Packing Considerations

3334	   The IO_ADVISE operation MUST use the iar_offset and byte range as
3335	   dictated by the presence or absence of NFL4_UFLG_DENSE.

3337	   E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
3338	   for iar_offset 0 really means iar_offset 10000 in the logical file,
3339	   then an IO_ADVISE for iar_offset 0 means iar_offset 10000.

3341	   E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
3342	   for iar_offset 0 really means iar_offset 0 in the logical file, then
3343	   an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file.

3345	   E.g., if NFL4_UFLG_DENSE is present, the stripe unit is 1000 bytes,
3346	   the stripe count is 10, and the dense DS file is serving iar_offset
3347	   0, then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and
3348	   3000 really means iar_offsets 10000, 20000, 30000, and 40000
3349	   (implying a stripe count of 10 and a stripe unit of 1000).  An
3350	   IO_ADVISE sent to the same DS with an iar_offset of 500 and an
3351	   iar_count of 3000 then applies to these byte ranges of the dense DS
3352	   file:

3354	     - 500 to 999
3355	     - 1000 to 1999
3356	     - 2000 to 2999
3357	     - 3000 to 3499

3359	   I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.

3361	   It also applies to these byte ranges of the logical file:

3363	     - 10500 to 10999 (500 bytes)
3364	     - 20000 to 20999 (1000 bytes)
3365	     - 30000 to 30999 (1000 bytes)
3366	     - 40000 to 40499 (500 bytes)
3367	     (total            3000 bytes)
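
   The dense mapping in this example can be reproduced with the
   following C sketch.  It is an illustration only: the segment type,
   the function dense_to_logical, and in particular the first_logical
   parameter (the logical offset that DS offset 0 corresponds to,
   10000 in the example above) are hypothetical and are not layout
   fields defined by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
    uint64_t offset;   /* logical-file offset */
    uint64_t length;   /* bytes */
};

/*
 * Split the dense DS-file range [ds_offset, ds_offset + count) at
 * stripe-unit boundaries and map each piece to the logical file.
 * Returns the number of segments written to out (at most max_out).
 */
size_t
dense_to_logical(uint64_t ds_offset, uint64_t count,
                 uint64_t stripe_unit, uint64_t stripe_count,
                 uint64_t first_logical,
                 struct segment *out, size_t max_out)
{
    assert(stripe_unit > 0 && stripe_count > 0 && out != NULL);

    uint64_t stripe_width = stripe_unit * stripe_count;
    uint64_t pos = ds_offset;
    uint64_t end = ds_offset + count;
    size_t n = 0;

    while (pos < end && n < max_out) {
        uint64_t unit = pos / stripe_unit;     /* stripe unit index on the DS */
        uint64_t within = pos % stripe_unit;   /* offset inside that unit */
        uint64_t chunk = stripe_unit - within;
        if (chunk > end - pos)
            chunk = end - pos;

        /* Consecutive DS stripe units are one stripe width apart. */
        out[n].offset = first_logical + unit * stripe_width + within;
        out[n].length = chunk;
        n++;
        pos += chunk;
    }
    return n;
}
```

   Calling dense_to_logical(500, 3000, 1000, 10, 10000, ...) yields
   the four logical ranges listed above.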

3369	   E.g., if NFL4_UFLG_DENSE is absent, the stripe unit is 250 bytes, the
3370	   stripe count is 4, and the sparse DS file is serving iar_offset 0,
3371	   then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and
3372	   3000 really means iar_offsets 0, 1000, 2000, and 3000 in the logical
3373	   file, keeping in mind that on the DS file, byte ranges 250 to 999,
3374	   1250 to 1999, 2250 to 2999, and 3250 to 3999 are not accessible.
3375	   An IO_ADVISE sent to the same DS with an iar_offset of 500 and an
3376	   iar_count of 3000 then applies to these byte ranges of the logical
3377	   file and the sparse DS file:

3379	     - 500 to 999 (500 bytes)   - no effect
3380	     - 1000 to 1249 (250 bytes) - effective
3381	     - 1250 to 1999 (750 bytes) - no effect
3382	     - 2000 to 2249 (250 bytes) - effective
3383	     - 2250 to 2999 (750 bytes) - no effect
3384	     - 3000 to 3249 (250 bytes) - effective
3385	     - 3250 to 3499 (250 bytes) - no effect
3386	     (subtotal      2250 bytes) - no effect
3387	     (subtotal       750 bytes) - effective
3388	     (grand total   3000 bytes) - no effect + effective
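
   The subtotals in this example can be checked with the following C
   sketch, which counts the bytes of an IO_ADVISE range that fall in
   stripe units served by a given DS under sparse packing.  The
   function name and the ds_index parameter (0 for the DS serving
   iar_offset 0) are hypothetical, and the special meaning of an
   iar_count of 0 is not handled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many bytes of [iar_offset, iar_offset + iar_count) lie
 * in stripe units owned by the DS with index ds_index.
 */
uint64_t
sparse_effective_bytes(uint64_t iar_offset, uint64_t iar_count,
                       uint64_t stripe_unit, uint64_t stripe_count,
                       uint64_t ds_index)
{
    assert(stripe_unit > 0 && stripe_count > 0 && ds_index < stripe_count);

    uint64_t pos = iar_offset;
    uint64_t end = iar_offset + iar_count;
    uint64_t effective = 0;

    while (pos < end) {
        uint64_t unit = pos / stripe_unit;
        uint64_t chunk = stripe_unit - pos % stripe_unit;
        if (chunk > end - pos)
            chunk = end - pos;

        /* Stripe units are assigned to DSes round-robin. */
        if (unit % stripe_count == ds_index)
            effective += chunk;

        pos += chunk;
    }
    return effective;
}
```

   With a stripe unit of 250, a stripe count of 4, and ds_index 0, the
   range starting at 500 with a count of 3000 gives 750 effective
   bytes, leaving 2250 bytes with no effect.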

3390	   If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
3391	   NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
3392	   sent to the data server with a byte range that overlaps a stripe
3393	   unit that the data server does not serve MUST NOT result in the
3394	   status NFS4ERR_PNFS_IO_HOLE.  Instead, the response SHOULD be
3395	   successful and if the server applies IO_ADVISE hints on any stripe
3396	   units that overlap with the specified range, those hints SHOULD be
3397	   indicated in the response.

3399	13.8.6.  Number of Supported File Segments

3401	   In theory IO_ADVISE allows a client and server to support multiple
3402	   file segments, meaning that different, possibly overlapping, byte
3403	   ranges of the same open file reference will support different hints.
3404	   This is not practical, and in general the server will support just
3405	   one set of hints, and these will apply to the entire file.  However,
3406	   there are some hints that are very ephemeral and essentially amount
3407	   to one-time instructions to the NFS server, which will be forgotten
3408	   momentarily after IO_ADVISE is executed.

3410	   The following hints will always apply to the entire file, regardless
3411	   of the specified byte range:

3413	   o  IO_ADVISE4_NORMAL

3415	   o  IO_ADVISE4_SEQUENTIAL

3417	   o  IO_ADVISE4_SEQUENTIAL_BACKWARDS

3419	   o  IO_ADVISE4_RANDOM

3421	   The following hints will always apply to the specified byte range,
3422	   and will be treated as one-time instructions:

3424	   o  IO_ADVISE4_WILLNEED

3426	   o  IO_ADVISE4_WILLNEED_OPPORTUNISTIC

3428	   o  IO_ADVISE4_DONTNEED

3430	   o  IO_ADVISE4_NOREUSE

3432	   The following hints are modifiers to all other hints, and will apply
3433	   to the entire file and/or to a one time instruction on the specified
3434	   byte range:

3436	   o  IO_ADVISE4_READ

3438	   o  IO_ADVISE4_WRITE

3440	13.9.  Changes to Operation 51: LAYOUTRETURN

3442	13.9.1.  Introduction

3444	   In the pNFS description provided in [2], the client is not able to
3445	   relay an error code from the DS to the MDS.  In the specification of
3446	   the Objects-Based Layout protocol [9], use is made of the opaque
3447	   lrf_body field of the LAYOUTRETURN argument to do such a relaying of
3448	   error codes.  In this section, we define a new data structure to
3449	   enable the passing of error codes back to the MDS and provide some
3450	   guidelines on what both the client and MDS should expect in such
3451	   circumstances.

3453	   There are two broad classes of errors, transient and persistent.  The
3454	   client SHOULD strive to only use this new mechanism to report
3455	   persistent errors.  It MUST be able to deal with transient issues by
3456	   itself.  Also, while the client might consider an issue to be
3457	   persistent, it MUST be prepared for the MDS to consider such issues
3458	   to be transient.  A prime example of this is if the MDS fences off a
3459	   client from either a stateid or a filehandle.  The client will get an
3460	   error from the DS and might relay either NFS4ERR_ACCESS or
3461	   NFS4ERR_BAD_STATEID back to the MDS, with the belief that this is a
3462	   hard error.  If the MDS is informed by the client that there is an
3463	   error, it can safely ignore that.  For it, the mission is
3464	   accomplished in that the client has returned a layout that the MDS
3465	   had most likely recalled.

3467	   The client might also need to inform the MDS that it cannot reach one
3468	   or more of the DSes.  While the MDS can detect the connectivity of
3469	   both of these paths:

3471	   o  MDS to DS

3473	   o  MDS to client

3475	   it cannot determine if the client and DS path is working.  As with
3476	   the case of the DS passing errors to the client, the client must be
3477	   prepared for the MDS to consider such outages as being transitory.

3479	   The existing LAYOUTRETURN operation is extended by introducing a new
3480	   data structure to report errors, layoutreturn_device_error4.  Also,
3481	   layoutreturn_error_report4 is introduced to enable an array of errors
3482	   to be reported.

3484	13.9.2.  ARGUMENT

3486	   The ARGUMENT specification of the LAYOUTRETURN operation in section
3487	   18.44.1 of [2] is augmented by the following XDR code [24]:

3489	   struct layoutreturn_device_error4 {
3490	           deviceid4       lrde_deviceid;
3491	           nfsstat4        lrde_status;
3492	           nfs_opnum4      lrde_opnum;
3493	   };

3495	   struct layoutreturn_error_report4 {
3496	           layoutreturn_device_error4      lrer_errors<>;
3497	   };

3499	13.9.3.  RESULT

3501	   The RESULT of the LAYOUTRETURN operation is unchanged; see section
3502	   18.44.2 of [2].

3504	13.9.4.  DESCRIPTION

3506	   The following text is added to the end of the LAYOUTRETURN operation
3507	   DESCRIPTION in section 18.44.3 of [2].

3509	   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
3510	   then if the lrf_body field is NULL, it indicates to the MDS that the
3511	   client experienced no errors.  If lrf_body is non-NULL, then the
3512	   field references error information which is layout type specific.
3513	   I.e., the Objects-Based Layout protocol can continue to utilize
3514	   lrf_body as specified in [9].  For both Files-Based and Block-Based
3515	   Layouts, the field references a layoutreturn_error_report4, which
3516	   contains an array of layoutreturn_device_error4.

3518	   Each individual layoutreturn_device_error4 describes a single error
3519	   associated with a DS, which is identified via lrde_deviceid.  The
3520	   operation which returned the error is identified via lrde_opnum.
3521	   Finally the NFS error value (nfsstat4) encountered is provided via
3522	   lrde_status and may consist of the following error codes:

3524	   NFS4ERR_NXIO:  The client was unable to establish any communication
3525	      with the DS.

3527	   NFS4ERR_*:  The client was able to establish communication with the
3528	      DS and is returning one of the allowed error codes for the
3529	      operation denoted by lrde_opnum.
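
   As a hedged illustration, a client might fill in a
   layoutreturn_error_report4 along the following lines.  The
   fixed-size array, the MAX_REPORTED_ERRORS cap, and the function
   report_ds_error are hypothetical; the XDR variable-length array
   lrer_errors<> is rendered as a length plus a bounded C array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NFS4_DEVICEID4_SIZE 16
#define NFS4ERR_NXIO        6u
#define MAX_REPORTED_ERRORS 8    /* hypothetical cap for this sketch */

struct layoutreturn_device_error4 {
    uint8_t  lrde_deviceid[NFS4_DEVICEID4_SIZE];
    uint32_t lrde_status;    /* nfsstat4 */
    uint32_t lrde_opnum;     /* nfs_opnum4 */
};

struct layoutreturn_error_report4 {
    uint32_t lrer_errors_len;
    struct layoutreturn_device_error4 lrer_errors_val[MAX_REPORTED_ERRORS];
};

/*
 * Append one error to the report.  If the client never reached the
 * DS, NFS4ERR_NXIO is recorded; otherwise the status returned by the
 * DS for the operation opnum is recorded.  Returns false if full.
 */
bool
report_ds_error(struct layoutreturn_error_report4 *rep,
                const uint8_t dev[NFS4_DEVICEID4_SIZE],
                bool reached_ds, uint32_t ds_status, uint32_t opnum)
{
    assert(rep != NULL && dev != NULL);

    if (rep->lrer_errors_len >= MAX_REPORTED_ERRORS)
        return false;

    struct layoutreturn_device_error4 *e =
        &rep->lrer_errors_val[rep->lrer_errors_len++];

    memcpy(e->lrde_deviceid, dev, NFS4_DEVICEID4_SIZE);
    e->lrde_status = reached_ds ? ds_status : NFS4ERR_NXIO;
    e->lrde_opnum = opnum;
    return true;
}
```

   The resulting structure would then be XDR-encoded into the
   lrf_body field of the LAYOUTRETURN argument.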

3531	13.9.5.  IMPLEMENTATION

3533	   The following text is added to the end of the LAYOUTRETURN operation
3534	   IMPLEMENTATION in section 18.44.4 of [2].

3536	   Clients are expected to tolerate transient storage device errors, and
3537	   hence clients SHOULD NOT use the LAYOUTRETURN error handling for
3538	   device access problems that may be transient.  The methods by which a
3539	   client decides whether a device access problem is transient vs.
3540	   persistent are implementation-specific, but may include retrying I/Os
3541	   to a data server under appropriate conditions.

3543	   When an I/O fails to a storage device, the client SHOULD retry the
3544	   failed I/O via the MDS.  In this situation, before retrying the I/O,
3545	   the client SHOULD return the layout, or the affected portion thereof,
3546	   and SHOULD indicate which storage device or devices were problematic.
3547	   The client needs to do this when the DS is being unresponsive in
3548	   order to fence off any failed write attempts, and ensure that they do
3549	   not end up overwriting any later data being written through the MDS.
3550	   If the client does not do this, the MDS MAY issue a layout recall
3551	   callback in order to perform the retried I/O.

3553	   The client needs to be cognizant that since this error handling is
3554	   optional in the MDS, the MDS may silently ignore this functionality.
3555	   Also, as the MDS may consider some issues the client reports to be
3556	   expected (see Section 13.9.1), the client might find it difficult to
3557	   detect an MDS which has not implemented error handling via
3558	   LAYOUTRETURN.

3560	   If an MDS is aware that a storage device is proving problematic to a
3561	   client, the MDS SHOULD NOT include that storage device in any pNFS
3562	   layouts sent to that client.  If the MDS is aware that a storage
3563	   device is affecting many clients, then the MDS SHOULD NOT include
3564	   that storage device in any pNFS layouts sent out.  If a client asks
3565	   for a new layout for the file from the MDS, it MUST be prepared for
3566	   the MDS to return that storage device in the layout.  The MDS might
3567	   not have any choice in using the storage device, i.e., there might
3568	   only be one possible layout for the system.  Also, in the case of
3569	   existing files, the MDS might have no choice in which storage devices
3570	   to hand out to clients.

3572	   The MDS is not required to indefinitely retain per-client storage
3573	   device error information.  An MDS is also not required to
3574	   automatically reinstate use of a previously problematic storage
3575	   device; administrative intervention may be required instead.

3577	13.10.  Operation 65: READ_PLUS

3579	   READ_PLUS is a new variant of the NFSv4.1 READ operation [2].
3580	   Besides being able to support all of the data semantics of READ, it
3581	   can also be used by the server to return either holes or ADBs to the
3582	   client.  For holes, READ_PLUS extends the response to avoid returning
3583	   data for portions of the file which either are initialized and
3584	   contain no backing store or would appear to be so in the result.
3585	   I.e., if the result was a data block composed entirely of zeros, then
3586	   it is easier to return a hole.  Returning data blocks of
3587	   uninitialized data wastes computational and network resources, thus
3588	   reducing performance.  For ADBs, READ_PLUS is used to return the
3589	   metadata describing the portions of the file which are initialized
3590	   and contain no backing store.

3592	   If the client sends a READ operation, it is explicitly stating that
3593	   it is neither supporting sparse files nor ADBs.  So if a READ occurs
3594	   on a sparse ADB or file, then the server must expand such data to be
3595	   raw bytes.  If a READ occurs in the middle of a hole or ADB, the
3596	   server can only send back bytes starting from that offset.  In
3597	   contrast, if a READ_PLUS occurs in the middle of a hole or ADB, the
3598	   server can send back a range which starts before the offset and
3599	   extends past the range.

3601	   READ is inefficient for transfer of sparse sections of the file.  As
3602	   such, READ is marked as OBSOLETE in NFSv4.2.  Instead, a client
3603	   should issue READ_PLUS.  Note that as the client has no a priori
3604	   knowledge of whether either an ADB or a hole is present or not, it
3605	   should always use READ_PLUS.

3607	13.10.1.  ARGUMENT

3609	   struct READ_PLUS4args {
3610	           /* CURRENT_FH: file */
3611	           stateid4        rpa_stateid;
3612	           offset4         rpa_offset;
3613	           count4          rpa_count;
3614	   };

3616	13.10.2.  RESULT

3618	   union read_plus_content switch (data_content4 content) {
3619	   case NFS4_CONTENT_DATA:
3620	           opaque          rpc_data<>;
3621	   case NFS4_CONTENT_APP_BLOCK:
3622	           app_data_block4 rpc_block;
3623	   case NFS4_CONTENT_HOLE:
3624	           data_info4      rpc_hole;
3625	   default:
3626	           void;
3627	   };

3629	   /*
3630	    * Allow a return of an array of contents.
3631	    */
3632	   struct read_plus_res4 {
3633	           bool                    rpr_eof;
3634	           read_plus_content       rpr_contents<>;
3635	   };

3637	   union READ_PLUS4res switch (nfsstat4 status) {
3638	   case NFS4_OK:
3639	           read_plus_res4  resok4;
3640	   default:
3641	           void;
3642	   };

3644	13.10.3.  DESCRIPTION

3646	   The READ_PLUS operation is based upon the NFSv4.1 READ operation [2]
3647	   and similarly reads data from the regular file identified by the
3648	   current filehandle.

3650	   The client provides an rpa_offset of where the READ_PLUS is to start
3651	   and an rpa_count of how many bytes are to be read.  An rpa_offset of
3652	   zero means to read data starting at the beginning of the file.  If
3653	   rpa_offset is greater than or equal to the size of the file, the
3654	   status NFS4_OK is returned with di_length (the data length) set to
3655	   zero and eof set to TRUE.

3657	   The READ_PLUS result is comprised of an array of rpr_contents, each
3658	   of which describes a data_content4 type of data (Section 6.1.2).  For
3659	   NFSv4.2, the allowed values are data, ADB, and hole.  A server is
3660	   required to support the data type, but not ADB or hole.  Both an
3661	   ADB and a hole must be returned in their entirety - clients must be
3662	   prepared to get more information than they requested.

3664	   READ_PLUS has to support all of the errors which are returned by READ
3665	   plus NFS4ERR_UNION_NOTSUPP.  If the client asks for a hole and the
3666	   server does not support that arm of the discriminated union, but does
3667	   support one or more additional arms, it can signal to the client that
3668	   it supports the operation, but not the arm with
3669	   NFS4ERR_UNION_NOTSUPP.

3671	   If the data to be returned is comprised entirely of zeros, then the
3672	   server may elect to return that data as a hole.  The server
3673	   differentiates this to the client by setting di_allocated to TRUE in
3674	   this case.  Note that in such a scenario, the server is not required
3675	   to determine the full extent of the "hole" - it does not need to
3676	   determine where the zeros start and end.

3678	   The server may elect to return adjacent elements of the same type.
3679	   For example, the guard pattern or block size of an ADB might change,
3680	   which would require adjacent elements of type ADB.  Likewise if the
3681	   server has a range of data comprised entirely of zeros and then a
3682	   hole, it might want to return two adjacent holes to the client.

3684	   If the client specifies an rpa_count value of zero, the READ_PLUS
3685	   succeeds and returns zero bytes of data.  In all situations, the
3686	   server may choose to return fewer bytes than specified by the client.
3687	   The client needs to check for this condition and handle the condition
3688	   appropriately.
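
   The short-read handling described above can be sketched as a client
   loop.  This is only an illustration under stated assumptions: the
   read_plus callable, its (eof, contents) return shape, and the
   (kind, offset, length) tuples are hypothetical stand-ins for a real
   client's RPC layer, not part of the protocol definition.

```python
def read_fully(read_plus, offset, count):
    """Reissue READ_PLUS until the requested range is covered or eof is seen.

    read_plus(offset, count) is a hypothetical callable returning
    (eof, contents), where contents is a list of (kind, offset, length).
    """
    end = offset + count
    segments = []
    while offset < end:
        eof, contents = read_plus(offset, end - offset)
        segments.extend(contents)
        # Furthest byte described by this reply; a hole may extend past 'end'.
        covered = max((off + length for _, off, length in contents),
                      default=offset)
        # Stop at end-of-file, or if the server made no forward progress.
        if eof or covered <= offset:
            break
        offset = covered
    return segments
```

   The loop advances by what the server actually described rather than
   by what was requested, so a reply that covers more than the request
   (an entire hole, for instance) ends the loop early.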

3690	   If the client specifies an rpa_offset and rpa_count value that is
3691	   entirely contained within a hole of the file, then the di_offset and
3692	   di_length returned must be for the entire hole.  This result is
3693	   considered valid until the file is changed (detected via the change
3694	   attribute).  The server MUST provide the same semantics for the hole
3695	   as if the client read the region and received zeroes; the lifetime
3696	   of the implied hole contents MUST be exactly the same as any other
3697	   read data.

3699	   If the client specifies an rpa_offset and rpa_count value that begins
3700	   in a non-hole of the file but extends into a hole, the server should
3701	   return an array comprised of both data and a hole.  The client MUST
3702	   be prepared for the server to return a short read describing just the
3703	   data.  The client will then issue another READ_PLUS for the remaining
3704	   bytes, to which the server will respond with information about the
3705	   hole in the file.

3707	   Except when special stateids are used, the stateid value for a
3708	   READ_PLUS request represents a value returned from a previous byte-
3709	   range lock or share reservation request or the stateid associated
3710	   with a delegation.  The stateid identifies the associated owners if
3711	   any and is used by the server to verify that the associated locks are
3712	   still valid (e.g., have not been revoked).

3714	   If the read ended at the end-of-file (formally, in a correctly formed
3715	   READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
3716	   of the file), or the READ_PLUS operation extends beyond the size of
3717	   the file (if rpa_offset + rpa_count is greater than the size of the
3718	   file), eof is returned as TRUE; otherwise, it is FALSE.  A successful
3719	   READ_PLUS of an empty file will always return eof as TRUE.
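
   As a minimal sketch of the eof rule in the preceding paragraph (the
   function name and the idea of passing the file size as a plain
   integer are assumptions made for illustration):

```python
def read_plus_eof(rpa_offset: int, rpa_count: int, file_size: int) -> bool:
    """Return the eof flag for a successful READ_PLUS, per the rule above."""
    # TRUE when the request ends exactly at, or extends beyond, end-of-file.
    return rpa_offset + rpa_count >= file_size
```

   For an empty file (file_size of zero) the comparison holds for every
   non-negative offset and count, which matches the statement that a
   successful READ_PLUS of an empty file always returns eof as TRUE.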

3721	   If the current filehandle is not an ordinary file, an error will be
3722	   returned to the client.  In the case that the current filehandle
3723	   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
3724	   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
3725	   returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
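
   The file type check above amounts to a small mapping.  The sketch
   below uses the type and status names as plain strings; the function
   name is hypothetical and a real server would use the numeric
   constants from its XDR definitions.

```python
def read_plus_type_check(ftype: str) -> str:
    """Map the type of the current filehandle to a READ_PLUS status name."""
    if ftype == "NF4REG":
        return "NFS4_OK"             # regular file: the read may proceed
    if ftype == "NF4DIR":
        return "NFS4ERR_ISDIR"       # directories cannot be read this way
    if ftype == "NF4LNK":
        return "NFS4ERR_SYMLINK"     # symbolic links get their own error
    return "NFS4ERR_WRONG_TYPE"      # every other non-regular object
```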

3727	   For a READ_PLUS with a stateid value of all bits equal to zero, the
3728	   server MAY allow the READ_PLUS to be serviced subject to mandatory
3729	   byte-range locks or the current share deny modes for the file.  For a
3730	   READ_PLUS with a stateid value of all bits equal to one, the server
3731	   MAY allow READ_PLUS operations to bypass locking checks at the
3732	   server.

3734	   On success, the current filehandle retains its value.

3736	13.10.4.  IMPLEMENTATION

3738	   In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
3739	   [2] also apply to READ_PLUS.  One delta is that when the owner has a
3740	   locked byte range, the server MUST return an array of rpr_contents
3741	   with values inside that range.

3743	13.10.4.1.  Additional pNFS Implementation Information

3745	   With pNFS, the semantics of using READ_PLUS remains the same.  Any
3746	   data server MAY return a hole or ADB result for a READ_PLUS request
3747	   that it receives.  When a data server chooses to return such a
3748	   result, it has the option of returning information for the data
3749	   stored on that data server (as defined by the data layout), but it
3750	   MUST NOT return results for a byte range that includes data managed
3751	   by another data server.
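
   One conservative way for a data server to honor this restriction is
   to clip any hole it reports to the stripe unit that contains the
   requested offset.  The sketch below assumes a fixed stripe unit size
   and assumes the requested offset lies inside the hole; neither the
   function name nor this clipping policy comes from the protocol, and
   a data server that knows which other stripe units it owns could
   report more.

```python
def clip_hole_to_stripe_unit(hole_offset, hole_length, req_offset, stripe_unit):
    """Clip a hole to the stripe unit containing req_offset.

    Returns (offset, length) of the portion of the hole that falls inside
    that single stripe unit, so no other data server's range is described.
    """
    unit_start = (req_offset // stripe_unit) * stripe_unit
    unit_end = unit_start + stripe_unit
    start = max(hole_offset, unit_start)
    end = min(hole_offset + hole_length, unit_end)
    return start, end - start
```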

3753	   A data server should do its best to return as much information about
3754	   a hole or ADB as is feasible without having to contact the metadata
3755	   server.  If communication with the metadata server is required, then
3756	   every attempt should be made to minimize the number of requests.

3758	   If mandatory locking is enforced, then the data server must also
3759	   ensure that it returns only information that is within the owner's
3760	   locked byte range.

3762	13.10.5.  READ_PLUS with Sparse Files Example

3764	   The following table describes a sparse file.  For each byte range,
3765	   the file contains either non-zero data or a hole.  In addition, the
3766	   server in this example uses a Hole Threshold of 32K.

3768	                        +-------------+----------+
3769	                        | Byte-Range  | Contents |
3770	                        +-------------+----------+
3771	                        | 0-15999     | Hole     |
3772	                        | 16K-31999   | Non-Zero |
3773	                        | 32K-255999  | Hole     |
3774	                        | 256K-287999 | Non-Zero |
3775	                        | 288K-353999 | Hole     |
3776	                        | 354K-417999 | Non-Zero |
3777	                        +-------------+----------+

3779	                                  Table 5

3781	   Under the given circumstances, if a client was to read from the file
3782	   with a max read size of 64K, the following will be the results for
3783	   the given READ_PLUS calls.  This assumes the client has already
3784	   opened the file, acquired a valid stateid ('s' in the example), and
3785	   just needs to issue READ_PLUS requests.

3787	   1.  READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K],
3788	       hole[32K,224K]>.  Since the first hole is less than the server's
3789	       Hole Threshold, the first 32K of the file is returned as data
3790	       and the remaining 32K is returned as a hole which actually
3791	       extends to 256K.

3793	   2.  READ_PLUS(s, 32K, 64K) --> NFS4_OK, eof = false, <hole[32K,224K]>.
3794	       The requested range was all zeros, and the current hole begins at
3795	       offset 32K and is 224K in length.  Note that the client should
3796	       not have followed up the previous READ_PLUS request with this one
3797	       as the hole information from the previous call extended past what
3798	       the client was requesting.

3800	   3.  READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false, <data[256K,
3801	       288K], hole[288K, 354K]>.  Returns an array of the 32K data and
3802	       the hole which extends to 354K.

3804	   4.  READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true, <data[354K,
3805	       418K]>.  Returns the final 64K of data and informs the client
3806	       there is no more data in the file.
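
   The four calls above can be reproduced with a small simulation.
   This is a sketch under stated assumptions, not a server
   implementation: the Extent type, the read_plus function, and the
   rule that holes shorter than the Hole Threshold are returned as
   zero-filled data are modeling choices made to match the example.
   Results are expressed as (kind, offset, length).

```python
from typing import NamedTuple

K = 1000  # Table 5 pairs "16K" with "15999", which implies K = 1000 here


class Extent(NamedTuple):
    kind: str     # "data" or "hole"
    offset: int
    length: int


# Table 5 expressed as (kind, offset, length) extents.
TABLE_5 = [
    Extent("hole", 0, 16 * K),
    Extent("data", 16 * K, 16 * K),
    Extent("hole", 32 * K, 224 * K),
    Extent("data", 256 * K, 32 * K),
    Extent("hole", 288 * K, 66 * K),
    Extent("data", 354 * K, 64 * K),
]


def read_plus(extents, offset, count, hole_threshold=32 * K):
    """Return (eof, contents) for a simulated READ_PLUS over 'extents'."""
    size = extents[-1].offset + extents[-1].length if extents else 0
    end = min(offset + count, size)
    eof = offset + count >= size
    contents = []
    for ext in extents:
        ext_end = ext.offset + ext.length
        if ext_end <= offset or ext.offset >= end:
            continue  # extent does not overlap the requested range
        if ext.kind == "hole" and ext.length >= hole_threshold:
            piece = ext  # a reported hole is returned in its entirety
        else:
            # Data, or a hole below the threshold, is returned as data
            # clipped to the requested range.
            lo, hi = max(ext.offset, offset), min(ext_end, end)
            piece = Extent("data", lo, hi - lo)
        prev = contents[-1] if contents else None
        if (prev is not None and prev.kind == "data" and piece.kind == "data"
                and prev.offset + prev.length == piece.offset):
            # Merge adjacent data pieces into one element.
            contents[-1] = Extent("data", prev.offset, prev.length + piece.length)
        else:
            contents.append(piece)
    return eof, contents
```

   Calling read_plus(TABLE_5, 0, 64 * K) yields one data element
   covering the first 32K followed by the hole of length 224K at
   offset 32K, as in step 1.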

3808	13.11.  Operation 66: SEEK

3810	   SEEK is an operation that allows a client to determine the location
3811	   of the next data_content4 in a file.  It allows an implementation of
3812	   the emerging extension to lseek(2) to allow clients to determine
3813	   SEEK_HOLE and SEEK_DATA.

3815	13.11.1.  ARGUMENT

3817	   struct SEEK4args {
3818	           /* CURRENT_FH: file */
3819	           stateid4        sa_stateid;
3820	           offset4         sa_offset;
3821	           data_content4   sa_what;
3822	   };

3824	13.11.2.  RESULT

3826	   union seek_content switch (data_content4 content) {
3827	   case NFS4_CONTENT_DATA:
3828	           data_info4      sc_data;
3829	   case NFS4_CONTENT_APP_BLOCK:
3830	           app_data_block4 sc_block;
3831	   case NFS4_CONTENT_HOLE:
3832	           data_info4      sc_hole;
3833	   default:
3834	           void;
3835	   };

3837	   struct seek_res4 {
3838	           bool                    sr_eof;
3839	           seek_content            sr_contents;
3840	   };

3842	   union SEEK4res switch (nfsstat4 status) {
3843	   case NFS4_OK:
3844	           seek_res4       resok4;
3845	   default:
3846	           void;
3847	   };

3849	13.11.3.  DESCRIPTION

3851	   From the given sa_offset, find the next data_content4 of type sa_what
3852	   in the file.  For either a hole or ADB, this must return the
3853	   data_content4 in its entirety.  For data, it must not return the
3854	   actual data.

3856	   SEEK must follow the same rules for stateids as READ_PLUS
3857	   (Section 13.10.3).

3859	   If the server cannot find a corresponding sa_what, then the status
3860	   is still NFS4_OK, but sr_eof is TRUE.  In that case, sr_contents
3861	   contains zeroed-out content of the appropriate type.
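
   A minimal sketch of this lookup follows.  It models the file as a
   sorted list of (kind, offset, length) tuples and assumes that an
   extent containing sa_offset counts as the "next" one; both the
   representation and that interpretation are assumptions made for
   illustration.

```python
from typing import List, Tuple

Extent = Tuple[str, int, int]  # (kind, offset, length), sorted by offset


def seek(extents: List[Extent], sa_offset: int, sa_what: str) -> Tuple[bool, Extent]:
    """Return (sr_eof, content) for the next extent of kind sa_what."""
    for kind, offset, length in extents:
        # The extent qualifies if any part of it lies at or after sa_offset.
        if kind == sa_what and offset + length > sa_offset:
            # Only the description is returned, never the data bytes.
            return False, (kind, offset, length)
    # Nothing found: still a success, with sr_eof set and zeroed content.
    return True, (sa_what, 0, 0)
```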

3863	14.  NFSv4.2 Callback Operations

3865	14.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
3866	       Attributes Changed

3868	14.1.1.  ARGUMENTS

3870	   struct CB_ATTR_CHANGED4args {
3871	           nfs_fh4         acca_fh;
3872	           bitmap4         acca_critical;
3873	           bitmap4         acca_info;
3874	   };

3876	14.1.2.  RESULTS

3878	   struct CB_ATTR_CHANGED4res {
3879	           nfsstat4        accr_status;
3880	   };

3882	14.1.3.  DESCRIPTION

3884	   The CB_ATTR_CHANGED callback operation is used by the server to
3885	   indicate to the client that the file's attributes have been modified
3886	   on the server.  The server does not convey how the attributes have
3887	   changed, just that they have been modified.  The server can inform
3888	   the client about both critical and informational attribute changes in
3889	   the bitmask arguments.  The client SHOULD query the server about all
3890	   attributes set in acca_critical.  For all changes reflected in
3891	   acca_info, the client can decide whether or not it wants to poll the
3892	   server.
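
   A client receiving this callback has to turn the two bitmaps into
   attribute numbers before deciding what to fetch.  The sketch below
   relies on the bitmap4 convention from NFSv4.1 [2], in which
   attribute n is bit (n mod 32) of word (n / 32).  The function names
   and the wanted_info policy set are hypothetical.

```python
def bitmap4_attrs(bitmap):
    """Expand a bitmap4 (list of 32-bit words) into attribute numbers."""
    attrs = []
    for word_index, word in enumerate(bitmap):
        for bit in range(32):
            if word & (1 << bit):
                attrs.append(32 * word_index + bit)
    return attrs


def plan_attr_refresh(acca_critical, acca_info, wanted_info):
    """Return (must_query, may_query) attribute lists for the callback."""
    # Everything flagged as critical should be queried from the server.
    must = bitmap4_attrs(acca_critical)
    # Informational changes are polled only if local policy asks for them.
    may = [a for a in bitmap4_attrs(acca_info)
           if a in wanted_info and a not in must]
    return must, may
```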

3894	   The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
3895	   in acca_critical is the method used by the server to indicate that
3896	   the MAC label for the file referenced by acca_fh has changed.  In
3897	   many ways, the server does not care about the result returned by the
3898	   client.

3900	14.2.  Operation 15: CB_COPY - Report results of a server-side copy

3901	14.2.1.  ARGUMENT

3903	   union copy_info4 switch (nfsstat4 cca_status) {
3904	           case NFS4_OK:
3905	                   void;
3906	           default:
3907	                   length4         cca_bytes_copied;
3908	   };

3910	   struct CB_COPY4args {
3911	           nfs_fh4         cca_fh;
3912	           stateid4        cca_stateid;
3913	           copy_info4      cca_copy_info;
3914	   };

3916	14.2.2.  RESULT

3918	   struct CB_COPY4res {
3919	           nfsstat4        ccr_status;
3920	   };

3922	14.2.3.  DESCRIPTION

3924	   CB_COPY is used for both intra- and inter-server asynchronous copies.
3925	   The CB_COPY callback informs the client of the result of an
3926	   asynchronous server-side copy.  This operation is sent by the
3927	   destination server to the client in a CB_COMPOUND request.  The copy
3928	   is identified by the filehandle and stateid arguments.  The result is
3929	   indicated by the status field.  If the copy failed, cca_bytes_copied
3930	   contains the number of bytes copied before the failure occurred.  The
3931	   cca_bytes_copied value indicates the number of bytes copied but not
3932	   which specific bytes have been copied.
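
   How a client might interpret the copy_info4 union can be sketched as
   follows.  The status is passed as a name string and the function
   name is hypothetical; the point is only that the byte count is
   present on the failure arm and absent on success.

```python
def describe_copy_result(cca_status: str, cca_bytes_copied=None) -> str:
    """Summarize a CB_COPY argument as decoded from copy_info4."""
    if cca_status == "NFS4_OK":
        # The NFS4_OK arm of copy_info4 is void: no byte count is carried.
        return "copy completed"
    # On failure only a count is reported, not which bytes were copied.
    return f"copy failed with {cca_status} after {cca_bytes_copied} bytes"
```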

3934	   In the absence of an established backchannel, the server cannot
3935	   signal the completion of the COPY via a CB_COPY callback.  The loss
3936	   of a callback channel would be indicated by the server setting the
3937	   SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the
3938	   SEQUENCE operation.  The client must re-establish the callback
3939	   channel to receive the status of the COPY operation.  Prolonged loss
3940	   of the callback channel could result in the server dropping the COPY
3941	   operation state and invalidating the copy stateid.

3943	   If the client supports the COPY operation, the client is REQUIRED to
3944	   support the CB_COPY operation.

3946	   The CB_COPY operation may fail for the following reasons (this is a
3947	   partial list):

3949	   NFS4ERR_NOTSUPP:  The copy offload operation is not supported by the
3950	      NFS client receiving this request.

3952	15.  IANA Considerations

3954	   This section uses terms that are defined in [25].

3956	16.  References

3958	16.1.  Normative References

3960	   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
3961	         Levels", BCP 14, RFC 2119, March 1997.

3963	   [2]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
3964	         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
3965	         January 2010.

3967	   [3]   Haynes, T., "Network File System (NFS) Version 4 Minor Version
3968	         2 External Data Representation Standard (XDR) Description",
3969	         March 2011.

3971	   [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
3972	         Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
3973	         January 2005.

3975	   [5]   Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
3976	         Security Version 3", draft-williams-rpcsecgssv3 (work in
3977	         progress), 2011.

3979	   [6]   The Open Group, "Section 'posix_fadvise()' of System Interfaces
3980	         of The Open Group Base Specifications Issue 6, IEEE Std 1003.1,
3981	         2004 Edition", 2004.

3983	   [7]   Haynes, T., "Requirements for Labeled NFS",
3984	         draft-ietf-nfsv4-labreqs-00 (work in progress).

3986	   [8]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
3987	         Specification", RFC 2203, September 1997.

3989	   [9]   Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
3990	         NFS (pNFS) Operations", RFC 5664, January 2010.

3992	16.2.  Informative References

3994	   [10]  Haynes, T. and D. Noveck, "Network File System (NFS) version 4
3995	         Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
3996	         March 2011.

3998	   [11]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
3999	         "NSDB Protocol for Federated Filesystems",
4000	         draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
4001	         2010.

4003	   [12]  Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
4004	         "Administration Protocol for Federated Filesystems",
4005	         draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.

4007	   [13]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
4008	         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
4009	         HTTP/1.1", RFC 2616, June 1999.

4011	   [14]  Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
4012	         RFC 959, October 1985.

4014	   [15]  Simpson, W., "PPP Challenge Handshake Authentication Protocol
4015	         (CHAP)", RFC 1994, August 1996.

4017	   [16]  VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
4018	         Overhead with Application-Directed Prefetching", Proceedings of
4019	         USENIX Annual Technical Conference , June 2009.

4021	   [17]  Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
4022	         Oracle Database Concepts 11g Release 1 (11.1)", January 2011.

4024	   [18]  Ashdown, L., "Chapter 15, Validating Database Files and
4025	         Backups, of Oracle Database Backup and Recovery User's Guide
4026	         11g Release 1 (11.1)", August 2008.

4028	   [19]  McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
4029	         Corruption of Solaris Internals", 2007.

4031	   [20]  Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
4032	         Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
4033	         Corruption in the Storage Stack", Proceedings of the 6th USENIX
4034	         Symposium on File and Storage Technologies (FAST '08) , 2008.

4036	   [21]  "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
4037	         Deployment, configuration and administration of Red Hat
4038	         Enterprise Linux 5, Edition 6", 2011.

4040	   [22]  Quigley, D. and J. Lu, "Registry Specification for MAC Security
4041	         Label Formats", draft-quigley-label-format-registry (work in
4042	         progress), 2011.

4044	   [23]  IESG, "IESG Processing of RFC Errata for the IETF Stream",
4045	         2008.

4047	   [24]  Eisler, M., "XDR: External Data Representation Standard",
4048	         RFC 4506, May 2006.

4050	   [25]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
4051	         Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

4053	Appendix A.  Acknowledgments

4055	   For the pNFS Access Permissions Check, the original draft was by
4056	   Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow.  The work
4057	   was influenced by discussions with Benny Halevy and Bruce Fields.  A
4058	   review was done by Tom Haynes.

4060	   For the Sharing change attribute implementation details with NFSv4
4061	   clients, the original draft was by Trond Myklebust.

4063	   For the NFS Server-side Copy, the original draft was by James
4064	   Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
4065	   Iyer.  Tom Talpey co-authored an unpublished version of that
4066	   document.  It was also reviewed by a number of individuals:
4067	   Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
4068	   Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
4069	   and Nico Williams.

4071	   For the NFS space reservation operations, the original draft was by
4072	   Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.

4074	   For the sparse file support, the original draft was by Dean
4075	   Hildebrand and Marc Eshel.  Valuable input and advice was received
4076	   from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
4077	   Richard Scheffenegger.

4079	   For the Application IO Hints, the original draft was by Dean
4080	   Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner.  Some
4081	   early reviewers included Benny Halevy and Pranoop Erasani.

4083	   For Labeled NFS, the original draft was by David Quigley, James
4084	   Morris, Jarret Lu, and Tom Haynes.  Peter Staubach, Trond Myklebust,
4085	   Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
4086	   contributed in the final push to get this accepted.

4088	Appendix B.  RFC Editor Notes

4090	   [RFC Editor: please remove this section prior to publishing this
4091	   document as an RFC]

4093	   [RFC Editor: prior to publishing this document as an RFC, please
4094	   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
4095	   RFC number of this document]

4097	Author's Address

4099	   Thomas Haynes
4100	   NetApp
4101	   9110 E 66th St
4102	   Tulsa, OK  74133
4103	   USA

4105	   Phone: +1 918 307 1415
4106	   Email: thomas@netapp.com
4107	   URI:   http://www.tulsalabs.com