2 NFSv4 T. Haynes 3 Internet-Draft Editor 4 Intended status: Standards Track August 14, 2011 5 Expires: February 15, 2012 7 NFS Version 4 Minor Version 2 8 draft-ietf-nfsv4-minorversion2-03.txt 10 Abstract 12 This Internet-Draft describes NFS version 4 minor version two, 13 focusing mainly on the protocol extensions made from NFS version 4 14 minor version 0 and NFS version 4 minor version 1. Major extensions 15 introduced in NFS version 4 minor version two include: Server-side 16 Copy, Space Reservations, and Support for Sparse Files. 18 Requirements Language 20 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 21 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 22 document are to be interpreted as described in RFC 2119 [1]. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79.
29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on February 15, 2012. 41 Copyright Notice 43 Copyright (c) 2011 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 This document may contain material from IETF Documents or IETF 57 Contributions published or made publicly available before November 58 10, 2008. The person(s) controlling the copyright in some of this 59 material may not have granted the IETF Trust the right to allow 60 modifications of such material outside the IETF Standards Process. 61 Without obtaining an adequate license from the person(s) controlling 62 the copyright in such materials, this document may not be modified 63 outside the IETF Standards Process, and derivative works of it may 64 not be created outside the IETF Standards Process, except to format 65 it for publication as an RFC or to translate it into languages other 66 than English. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 71 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . . 6 72 1.2. Scope of This Document . . . . . . . . . . . . . . . . . . 6 73 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 74 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . . 6 75 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . . 6 76 2. pNFS LAYOUTRETURN Error Handling . . . . . . . . . . . . . . . 6 77 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 6 78 2.2. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 7 79 2.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 7 80 2.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 7 81 2.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 7 82 2.2.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 8 83 3. Sharing change attribute implementation details with NFSv4 84 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 85 3.1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 9 86 3.2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 10 87 3.3. Definition of the 'change_attr_type' per-file system 88 attribute . . . . . . . . . . . . . . . . . . . . . . . . 10 89 4. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 11 90 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12 91 4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 12 92 4.2.1. Intra-Server Copy . . . . . 
. . . . . . . . . . . . . 14 93 4.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 15 94 4.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 18 95 4.3. Operations . . . . . . . . . . . . . . . . . . . . . . . . 20 96 4.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 20 97 4.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 21 98 4.4. Security Considerations . . . . . . . . . . . . . . . . . 21 99 4.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 21 100 5. Application Data Block Support . . . . . . . . . . . . . . . . 29 101 5.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 30 102 5.1.1. Data Block Representation . . . . . . . . . . . . . . 31 103 5.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 31 104 5.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 31 105 5.3. An Example of Detecting Corruption . . . . . . . . . . . . 32 106 5.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . . 34 107 5.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 34 108 6. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 34 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 34 110 6.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 35 111 6.2.1. Space Reservation . . . . . . . . . . . . . . . . . . 36 112 6.2.2. Space freed on deletes . . . . . . . . . . . . . . . . 36 113 6.2.3. Operations and attributes . . . . . . . . . . . . . . 37 114 6.2.4. Attribute 77: space_reserved . . . . . . . . . . . . . 37 115 6.2.5. Attribute 78: space_freed . . . . . . . . . . . . . . 38 116 6.2.6. Attribute 79: max_hole_punch . . . . . . . . . . . . . 38 117 6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate 118 blocks backing the file in the specified range. . . . 38 119 7. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 39 120 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 39 121 7.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 40 122 7.3. Applications and Sparse Files . . . . . . . . . . . . . . 41 123 7.4. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 42 124 7.5. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 43 125 7.5.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 43 126 7.5.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 44 127 7.5.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 44 128 7.5.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 46 129 7.5.5. READ_PLUS with Sparse Files Example . . . . . . . . . 47 130 7.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 48 131 7.7. Other Proposed Designs . . . . . . . . . . . . . . . . . . 48 132 7.7.1. Multi-Data Server Hole Information . . . . . . . . . . 48 133 7.7.2. Data Result Array . . . . . . . . . . . . . . . . . . 49 134 7.7.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 49 135 7.7.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 49 136 7.7.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 50 137 8. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 50 138 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 50 139 8.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 52 140 8.3. MAC Security Attribute . . . . . . . . . . . . . . . . . . 52 141 8.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 53 142 8.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 54 143 8.3.3. Permission Checking . . . . . . . . . . . . . . . . . 
54 144 8.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 55 145 8.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 55 146 8.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 55 147 8.4. Procedure 16: CB_ATTR_CHANGED - Notify Client that the 148 File's Attributes Changed . . . . . . . . . . . . . . . . 56 149 8.5. pNFS Considerations . . . . . . . . . . . . . . . . . . . 57 150 8.6. Discovery of Server LNFS Support . . . . . . . . . . . . . 57 151 8.7. MAC Security NFS Modes of Operation . . . . . . . . . . . 58 152 8.7.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 58 153 8.7.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 59 154 8.7.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 60 155 8.8. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 61 156 8.8.1. Full MAC labeling support for remotely mounted 157 filesystems . . . . . . . . . . . . . . . . . . . . . 61 158 8.8.2. MAC labeling of virtual machine images stored on 159 the network . . . . . . . . . . . . . . . . . . . . . 61 160 8.8.3. International Traffic in Arms Regulations (ITAR) . . . 62 161 8.8.4. Legal Hold/eDiscovery . . . . . . . . . . . . . . . . 62 162 8.8.5. Simple security label storage . . . . . . . . . . . . 63 163 8.8.6. Diskless Linux . . . . . . . . . . . . . . . . . . . . 63 164 8.8.7. Multi-Level Security . . . . . . . . . . . . . . . . . 64 165 8.9. Security Considerations . . . . . . . . . . . . . . . . . 65 166 9. Security Considerations . . . . . . . . . . . . . . . . . . . 66 167 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 66 168 11. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 69 169 11.1. Operation 59: COPY - Initiate a server-side copy . . . . . 69 170 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . . 77 171 11.3. Operation 61: COPY_NOTIFY - Notify a source server of 172 a future copy . . . . . . . . . . . . . . . . . . . . . . 78 173 11.4. Operation 62: COPY_REVOKE - Revoke a destination 174 server's copy privileges . . . . . . . . . . . . . . . . . 80 175 11.5. Operation 63: COPY_STATUS - Poll for status of a 176 server-side copy . . . . . . . . . . . . . . . . . . . . . 81 177 11.6. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . . 83 178 11.7. Modification to Operation 42: EXCHANGE_ID - 179 Instantiate Client ID . . . . . . . . . . . . . . . . . . 85 180 11.8. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 86 181 12. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 88 182 12.1. Operation 15: CB_COPY - Report results of a 183 server-side copy . . . . . . . . . . . . . . . . . . . . . 88 184 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 89 185 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 89 186 14.1. Normative References . . . . . . . . . . . . . . . . . . . 89 187 14.2. Informative References . . . . . . . . . . . . . . . . . . 90 188 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 91 189 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 92 190 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 92 192 1. Introduction 194 1.1. The NFS Version 4 Minor Version 2 Protocol 196 The NFS version 4 minor version 2 (NFSv4.2) protocol is the third 197 minor version of the NFS version 4 (NFSv4) protocol. The first minor 198 version, NFSv4.0, is described in [10] and the second minor version, 199 NFSv4.1, is described in [2]. 
It follows the guidelines for minor 200 versioning that are listed in Section 11 of RFC 3530bis. 202 As a minor version, NFSv4.2 is consistent with the overall goals for 203 NFSv4, but extends the protocol so as to better meet those goals, 204 based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted 205 some additional goals, which motivate some of the major extensions in 206 NFSv4.2. 208 1.2. Scope of This Document 210 This document describes the NFSv4.2 protocol. With respect to 211 NFSv4.0 and NFSv4.1, this document does not: 213 o describe the NFSv4.0 or NFSv4.1 protocols, except where needed to 214 contrast with NFSv4.2. 216 o modify the specification of the NFSv4.0 or NFSv4.1 protocols. 218 o clarify the NFSv4.0 or NFSv4.1 protocols. 220 The full XDR for NFSv4.2 is presented in [3]. 222 1.3. NFSv4.2 Goals 224 1.4. Overview of NFSv4.2 Features 226 1.5. Differences from NFSv4.1 228 2. pNFS LAYOUTRETURN Error Handling 230 2.1. Introduction 232 In the pNFS description provided in [2], the client has no way to 233 relay an error code from the DS to the MDS. The specification of 234 the Objects-Based Layout protocol [4] uses the opaque 235 lrf_body field of the LAYOUTRETURN argument to relay such 236 error codes. In this section, we define a new data structure to 237 enable the passing of error codes back to the MDS and provide some 238 guidelines on what both the client and MDS should expect in such 239 circumstances. 241 There are two broad classes of errors, transient and persistent. The 242 client SHOULD strive to only use this new mechanism to report 243 persistent errors. It MUST be able to deal with transient issues by 244 itself. Also, while the client might consider an issue to be 245 persistent, it MUST be prepared for the MDS to consider such issues 246 to be expected. A prime example of this is if the MDS fences off a 247 client from either a stateid or a filehandle. The client will get an 248 error from the DS and might relay either NFS4ERR_ACCESS or 249 NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a 250 hard error. The MDS, on the other hand, is waiting for the client to 251 report such an error. For it, the mission is accomplished in that 252 the client has returned a layout that the MDS had most likely 253 recalled. 255 2.2. Changes to Operation 51: LAYOUTRETURN 257 The existing LAYOUTRETURN operation is extended by introducing a new 258 data structure to report errors, layoutreturn_device_error4. Also, 259 layoutreturn_error_report4 is introduced to enable an array of such errors 260 to be reported. 262 2.2.1. ARGUMENT 264 The ARGUMENT specification of the LAYOUTRETURN operation in section 265 18.44.1 of [2] is augmented by the following XDR code [11]: 267 struct layoutreturn_device_error4 { 268 deviceid4 lrde_deviceid; 269 nfsstat4 lrde_status; 270 nfs_opnum4 lrde_opnum; 271 }; 273 struct layoutreturn_error_report4 { 274 layoutreturn_device_error4 lrer_errors<>; 275 }; 277 2.2.2. RESULT 279 The RESULT of the LAYOUTRETURN operation is unchanged; see section 280 18.44.2 of [2]. 282 2.2.3. DESCRIPTION 284 The following text is added to the end of the LAYOUTRETURN operation 285 DESCRIPTION in section 18.44.3 of [2]. 287 When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, 288 then if the lrf_body field is NULL, it indicates to the MDS that the 289 client experienced no errors. If lrf_body is non-NULL, then the 290 field references error information which is layout type specific.
291 For example, the Objects-Based Layout protocol can continue to utilize 292 lrf_body as specified in [4]. For the Files-Based Layout, the 293 field references a layoutreturn_error_report4, which contains an 294 array of layoutreturn_device_error4. 296 Each individual layoutreturn_device_error4 describes a single error 297 associated with a DS, which is identified via lrde_deviceid. The 298 operation which returned the error is identified via lrde_opnum. 299 Finally, the NFS error value (nfsstat4) encountered is provided via 300 lrde_status and may consist of the following error codes: 302 NFS4_OK: No issues were found for this device. 304 NFS4ERR_NXIO: The client was unable to establish any communication 305 with the DS. 307 NFS4ERR_*: The client was able to establish communication with the 308 DS and is returning one of the allowed error codes for the 309 operation denoted by lrde_opnum. 311 2.2.4. IMPLEMENTATION 313 The following text is added to the end of the LAYOUTRETURN operation 314 IMPLEMENTATION in section 18.44.4 of [2]. 316 A client that expects to use pNFS for a mounted filesystem SHOULD 317 check for pNFS support at mount time. This check SHOULD be performed 318 by sending a GETDEVICELIST operation, followed by layout-type- 319 specific checks for accessibility of each storage device returned by 320 GETDEVICELIST. If the NFS server does not support pNFS, the 321 GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP 322 error; in this situation it is up to the client to determine whether 323 it is acceptable to proceed with NFS-only access. 325 Clients are expected to tolerate transient storage device errors, and 326 hence clients SHOULD NOT use the LAYOUTRETURN error handling for 327 device access problems that may be transient. The methods by which a 328 client decides whether an access problem is transient vs. persistent 329 are implementation-specific, but may include retrying I/Os to a data 330 server under appropriate conditions. 332 When an I/O fails to a storage device, the client SHOULD retry the 333 failed I/O via the MDS. In this situation, before retrying the I/O, 334 the client SHOULD return the layout, or the affected portion thereof, 335 and SHOULD indicate which storage device or devices were problematic. 336 If the client does not do this, the MDS may issue a layout recall 337 callback in order to perform the retried I/O. 339 The client needs to be cognizant that since this error handling is 340 optional in the MDS, the MDS may silently ignore this functionality. 341 Also, as the MDS may consider some issues the client reports to be 342 expected (see Section 2.1), the client might find it difficult to 343 detect an MDS which has not implemented error handling via 344 LAYOUTRETURN. 346 If an MDS is aware that a storage device is proving problematic to a 347 client, the MDS SHOULD NOT include that storage device in any pNFS 348 layouts sent to that client. If the MDS is aware that a storage 349 device is affecting many clients, then the MDS SHOULD NOT include 350 that storage device in any pNFS layouts sent out. Clients must still 351 be aware that the MDS might not have any choice in using the storage 352 device, i.e., there might only be one possible layout for the system. 354 Another interesting complication is that for existing files, the MDS 355 might have no choice in which storage devices to hand out to clients. 356 The MDS might try to restripe a file across a different storage 357 device, but clients need to be aware that not all implementations 358 have restriping support. 360 An MDS SHOULD react to a client return of layouts with errors by not 361 using the problematic storage devices in layouts for that client, but 362 the MDS is not required to indefinitely retain per-client storage 363 device error information. An MDS is also not required to 364 automatically reinstate use of a previously problematic storage 365 device; administrative intervention may be required instead. 367 A client MAY perform I/O via the MDS even when the client holds a 368 layout that covers the I/O; servers MUST support this client 369 behavior, and MAY recall layouts as needed to complete I/Os.
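The following is a minimal, purely illustrative sketch (in Python, which is not part of this specification) of the client-side division of labor described above: transient errors are retried locally, while persistent errors are accumulated as layoutreturn_device_error4-style entries for LAYOUTRETURN. The classification sets and helper names are assumptions of the example; the actual policy is implementation-specific.

   # Illustrative only: classify DS errors and build the entries that
   # would populate layoutreturn_error_report4's lrer_errors<> array.
   TRANSIENT = {"NFS4ERR_DELAY", "NFS4ERR_GRACE"}     # retry locally

   def record_ds_error(report, device_id, opnum, status):
       """Return True if the error was queued for LAYOUTRETURN; False
       if the caller should instead retry the I/O itself."""
       if status in TRANSIENT:
           return False                  # transient: do not report
       report.append({"lrde_deviceid": device_id,   # persistent:
                      "lrde_opnum": opnum,          # report to MDS
                      "lrde_status": status})
       return True

   errors = []
   record_ds_error(errors, device_id=7, opnum="OP_READ",
                   status="NFS4ERR_NXIO")
   print(errors)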
371 3. Sharing change attribute implementation details with NFSv4 clients 373 3.1. Abstract 375 This section describes an extension to the NFSv4 protocol that 376 allows the server to share information about the implementation of 377 its change attribute with the client. The aim is to improve the 378 client's ability to determine the order in which parallel updates to 379 the same file were processed. 381 3.2. Introduction 383 Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the 384 change attribute as being mandatory to implement, there is little in 385 the way of implementation guidance. The only feature that is mandated by the spec 386 is that the value must change whenever the file data or metadata 387 change. 389 While this allows for a wide range of implementations, it also leaves 390 the client with a conundrum: how does it determine which is the most 391 recent value for the change attribute in a case where several RPC 392 calls have been issued in parallel? In other words, if two COMPOUNDs, 393 both containing WRITE and GETATTR requests for the same file, have 394 been issued in parallel, how does the client determine which of the 395 two change attribute values returned in the replies to the GETATTR 396 requests corresponds to the most recent state of the file? In some 397 cases, the only recourse may be to send another COMPOUND containing a 398 third GETATTR that is fully serialized with the first two. 400 In order to avoid this kind of inefficiency, we propose a method to 401 allow the server to share details about how the change attribute is 402 expected to evolve, so that the client may immediately determine 403 which, out of the several change attribute values returned by the 404 server, is the most recent. 406 3.3. Definition of the 'change_attr_type' per-file system attribute 408 enum change_attr_typeinfo { 409 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, 410 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, 411 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, 412 NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, 413 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 414 }; 416 +------------------+----+---------------------------+-----+ 417 | Name | Id | Data Type | Acc | 418 +------------------+----+---------------------------+-----+ 419 | change_attr_type | XX | enum change_attr_typeinfo | R | 420 +------------------+----+---------------------------+-----+ 422 The proposed solution is to enable the NFS server to provide 423 additional information about how it expects the change attribute 424 value to evolve after the file data or metadata has changed. To do 425 so, we define a new recommended attribute, 'change_attr_type', which 426 may take values from enum change_attr_typeinfo as follows: 428 NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST 429 monotonically increase for every atomic change to the file 430 attributes, data or directory contents. 432 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST 433 be incremented by one unit for every atomic change to the file 434 attributes, data or directory contents. This property is 435 preserved when writing to pNFS data servers. 437 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute 438 value MUST be incremented by one unit for every atomic change to 439 the file attributes, data or directory contents. In the case 440 where the client is writing to pNFS data servers, the number of 441 increments is not guaranteed to exactly match the number of 442 writes. 444 NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is 445 implemented as suggested in the NFSv4 spec [10] in terms of the 446 time_metadata attribute. 448 NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take 449 values that fit into any of these categories. 451 If NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, 452 NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or 453 NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at 454 the very least that the change attribute is monotonically increasing, 455 which is sufficient to resolve the question of which value is the 456 most recent. 458 If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then 459 by inspecting the value of the 'time_delta' attribute it additionally 460 has the option of detecting rogue server implementations that use 461 time_metadata in violation of the spec. 463 Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it 464 has the ability to predict what the resulting change attribute value 465 should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. 466 This again allows it to detect changes made in parallel by another 467 client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits 468 the same, but only if the client is not doing pNFS WRITEs.
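As a brief illustration of why the attribute helps, here is a hedged Python sketch (not part of the protocol) of a client choosing the most recent of several change attribute values returned by parallel GETATTRs; the helper name and fallback behavior are assumptions of the example.

   # If the server advertises a monotonic change_attr_type, the newest
   # value is simply the largest; otherwise the client must fall back
   # to a fully serialized GETATTR.
   MONOTONIC = {"NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR",
                "NFS4_CHANGE_TYPE_IS_VERSION_COUNTER",
                "NFS4_CHANGE_TYPE_IS_TIME_METADATA"}

   def most_recent(change_attr_type, observed):
       if change_attr_type in MONOTONIC:
           return max(observed)
       return None     # no ordering guarantee; serialize instead

   print(most_recent("NFS4_CHANGE_TYPE_IS_VERSION_COUNTER",
                     [41, 43, 42]))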
470 4. NFS Server-side Copy 471 4.1. Introduction 473 This section describes a server-side copy feature for the NFS 474 protocol. 476 The server-side copy feature provides a mechanism for the NFS client 477 to perform a file copy on the server without the data being 478 transmitted back and forth over the network. 480 Without this feature, an NFS client copies data from one location to 481 another by reading the data from the server over the network, and 482 then writing the data back over the network to the server. Using 483 this server-side copy operation, the client is able to instruct the 484 server to copy the data locally without the data being sent back and 485 forth over the network unnecessarily. 487 In general, this feature is useful whenever data is copied from one 488 location to another on the server. It is particularly useful when 489 copying the contents of a file from a backup. Backup versions of a 490 file are copied for a number of reasons, including restoring and 491 cloning data. 493 If the source object and destination object are on different file 494 servers, the file servers will communicate with one another to 495 perform the copy operation. The server-to-server protocol by which 496 this is accomplished is not defined in this document. 498 4.2.
Protocol Overview 500 The server-side copy offload operations support both intra-server and 501 inter-server file copies. An intra-server copy is a copy in which 502 the source file and destination file reside on the same server. In 503 an inter-server copy, the source file and destination file are on 504 different servers. In both cases, the copy may be performed 505 synchronously or asynchronously. 507 Throughout the rest of this document, we refer to the NFS server 508 containing the source file as the "source server" and the NFS server 509 to which the file is transferred as the "destination server". In the 510 case of an intra-server copy, the source server and destination 511 server are the same server. Therefore in the context of an intra- 512 server copy, the terms source server and destination server refer to 513 the single server performing the copy. 515 The operations described below are designed to copy files. Other 516 file system objects can be copied by building on these operations or 517 using other techniques. For example if the user wishes to copy a 518 directory, the client can synthesize a directory copy by first 519 creating the destination directory and then copying the source 520 directory's files to the new destination directory. If the user 521 wishes to copy a namespace junction [12] [13], the client can use the 522 ONC RPC Federated Filesystem protocol [13] to perform the copy. 523 Specifically the client can determine the source junction's 524 attributes using the FEDFS_LOOKUP_FSN procedure and create a 525 duplicate junction using the FEDFS_CREATE_JUNCTION procedure. 527 For the inter-server copy protocol, the operations are defined to be 528 compatible with a server-to-server copy protocol in which the 529 destination server reads the file data from the source server. This 530 model in which the file data is pulled from the source by the 531 destination has a number of advantages over a model in which the 532 source pushes the file data to the destination. The advantages of 533 the pull model include: 535 o The pull model only requires a remote server (i.e., the 536 destination server) to be granted read access. A push model 537 requires a remote server (i.e., the source server) to be granted 538 write access, which is more privileged. 540 o The pull model allows the destination server to stop reading if it 541 has run out of space. In a push model, the destination server 542 must flow control the source server in this situation. 544 o The pull model allows the destination server to easily flow 545 control the data stream by adjusting the size of its read 546 operations. In a push model, the destination server does not have 547 this ability. The source server in a push model is capable of 548 writing chunks larger than the destination server has requested in 549 attributes and session parameters. In theory, the destination 550 server could perform a "short" write in this situation, but this 551 approach is known to behave poorly in practice. 553 The following operations are provided to support server-side copy: 555 COPY_NOTIFY: For inter-server copies, the client sends this 556 operation to the source server to notify it of a future file copy 557 from a given destination server for the given user. 559 COPY_REVOKE: Also for inter-server copies, the client sends this 560 operation to the source server to revoke permission to copy a file 561 for the given user. 563 COPY: Used by the client to request a file copy. 
565 COPY_ABORT: Used by the client to abort an asynchronous file copy. 567 COPY_STATUS: Used by the client to poll the status of an 568 asynchronous file copy. 570 CB_COPY: Used by the destination server to report the results of an 571 asynchronous file copy to the client. 573 These operations are described in detail in Section 4.3. This 574 section provides an overview of how these operations are used to 575 perform server-side copies. 577 4.2.1. Intra-Server Copy 579 To copy a file on a single server, the client uses a COPY operation. 580 The server may respond to the copy operation with the final results 581 of the copy or it may perform the copy asynchronously and deliver the 582 results using a CB_COPY operation callback. If the copy is performed 583 asynchronously, the client may poll the status of the copy using 584 COPY_STATUS or cancel the copy using COPY_ABORT. 586 A synchronous intra-server copy is shown in Figure 1. In this 587 example, the NFS server chooses to perform the copy synchronously. 588 The copy operation is completed, either successfully or 589 unsuccessfully, before the server replies to the client's request. 590 The server's reply contains the final result of the operation. 592 Client Server 593 + + 594 | | 595 |--- COPY ---------------------------->| Client requests 596 |<------------------------------------/| a file copy 597 | | 598 | | 600 Figure 1: A synchronous intra-server copy. 602 An asynchronous intra-server copy is shown in Figure 2. In this 603 example, the NFS server performs the copy asynchronously. The 604 server's reply to the copy request indicates that the copy operation 605 was initiated and the final result will be delivered at a later time. 606 The server's reply also contains a copy stateid. The client may use 607 this copy stateid to poll for status information (as shown) or to 608 cancel the copy using a COPY_ABORT. When the server completes the 609 copy, the server performs a callback to the client and reports the 610 results. 612 Client Server 613 + + 614 | | 615 |--- COPY ---------------------------->| Client requests 616 |<------------------------------------/| a file copy 617 | | 618 | | 619 |--- COPY_STATUS --------------------->| Client may poll 620 |<------------------------------------/| for status 621 | | 622 | . | Multiple COPY_STATUS 623 | . | operations may be sent. 624 | . | 625 | | 626 |<-- CB_COPY --------------------------| Server reports results 627 |\------------------------------------>| 628 | | 630 Figure 2: An asynchronous intra-server copy.
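The client-side control flow of Figures 1 and 2 can be summarized in the following illustrative Python sketch; the two callables stand in for issuing COPY and COPY_STATUS COMPOUNDs, and the reply shapes are assumptions of the example (a real client would also handle CB_COPY and COPY_ABORT).

   import itertools, time

   def intra_server_copy(send_copy, send_copy_status,
                         poll_interval=0.01):
       reply = send_copy()                     # COPY
       if reply["done"]:                       # server chose sync copy
           return reply["result"]
       stateid = reply["copy_stateid"]         # server chose async copy
       while True:
           status = send_copy_status(stateid)  # poll with COPY_STATUS
           if status["done"]:
               return status["result"]
           time.sleep(poll_interval)

   # Toy demo: the copy completes on the second COPY_STATUS poll.
   polls = itertools.count()
   print(intra_server_copy(
       lambda: {"done": False, "copy_stateid": 0xABC},
       lambda sid: {"done": next(polls) >= 1, "result": "NFS4_OK"}))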
632 4.2.2. Inter-Server Copy 634 A copy may also be performed between two servers. The copy protocol 635 is designed to accommodate a variety of network topologies. As shown 636 in Figure 3, the client and servers may be connected by multiple 637 networks. In particular, the servers may be connected by a 638 specialized, high speed network (network 192.0.2.0/24 in the 639 diagram) that does not include the client. The protocol allows the 640 client to set up the copy between the servers (over network 641 198.51.100.0/24 in the diagram) and for the servers to communicate on 642 the high speed network if they choose to do so. 644 192.0.2.0/24 645 +-------------------------------------+ 646 | | 647 | | 648 | 192.0.2.18 | 192.0.2.56 649 +-------+------+ +------+------+ 650 | Source | | Destination | 651 +-------+------+ +------+------+ 652 | 198.51.100.18 | 198.51.100.56 653 | | 654 | | 655 | 198.51.100.0/24 | 656 +------------------+------------------+ 657 | 658 | 659 | 198.51.100.243 660 +-----+-----+ 661 | Client | 662 +-----------+ 664 Figure 3: An example inter-server network topology. 666 For an inter-server copy, the client notifies the source server that 667 a file will be copied by the destination server using a COPY_NOTIFY 668 operation. The client then initiates the copy by sending the COPY 669 operation to the destination server. The destination server may 670 perform the copy synchronously or asynchronously. 672 A synchronous inter-server copy is shown in Figure 4. In this case, 673 the destination server chooses to perform the copy before responding 674 to the client's COPY request. 676 An asynchronous copy is shown in Figure 5. In this case, the 677 destination server chooses to respond to the client's COPY request 678 immediately and then perform the copy asynchronously. 680 Client Source Destination 681 + + + 682 | | | 683 |--- COPY_NOTIFY --->| | 684 |<------------------/| | 685 | | | 686 | | | 687 |--- COPY ---------------------------->| 688 | | | 689 | | | 690 | |<----- read -----| 691 | |\--------------->| 692 | | | 693 | | . | Multiple reads may 694 | | . | be necessary 695 | | . | 696 | | | 697 | | | 698 |<------------------------------------/| Destination replies 699 | | | to COPY 701 Figure 4: A synchronous inter-server copy. 703 Client Source Destination 704 + + + 705 | | | 706 |--- COPY_NOTIFY --->| | 707 |<------------------/| | 708 | | | 709 | | | 710 |--- COPY ---------------------------->| 711 |<------------------------------------/| 712 | | | 713 | | | 714 | |<----- read -----| 715 | |\--------------->| 716 | | | 717 | | . | Multiple reads may 718 | | . | be necessary 719 | | . | 720 | | | 721 | | | 722 |--- COPY_STATUS --------------------->| Client may poll 723 |<------------------------------------/| for status 724 | | | 725 | | . | Multiple COPY_STATUS 726 | | . | operations may be sent 727 | | . | 728 | | | 729 | | | 730 | | | 731 |<-- CB_COPY --------------------------| Destination reports 732 |\------------------------------------>| results 733 | | | 735 Figure 5: An asynchronous inter-server copy. 737 4.2.3. Server-to-Server Copy Protocol 739 During an inter-server copy, the destination server reads the file 740 data from the source server. The source server and destination 741 server are not required to use a specific protocol to transfer the 742 file data. The choice of what protocol to use is ultimately the 743 destination server's decision. 745 4.2.3.1. Using NFSv4.x as a Server-to-Server Copy Protocol 747 The destination server MAY use standard NFSv4.x (where x >= 1) to 748 read the data from the source server. If NFSv4.x is used for the 749 server-to-server copy protocol, the destination server can use the 750 filehandle contained in the COPY request with standard NFSv4.x 751 operations to read data from the source server. Specifically, the 752 destination server may use the NFSv4.x OPEN operation's CLAIM_FH 753 facility to open the file being copied and obtain an open stateid. 754 Using the stateid, the destination server may then use NFSv4.x READ 755 operations to read the file.
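A hedged sketch of the destination server's read loop described above; open_claim_fh() and nfs_read() are hypothetical stand-ins for OPEN with CLAIM_FH and READ issued over NFSv4.x, and the chunk size is an arbitrary example value.

   def pull_file(source_fh, open_claim_fh, nfs_read, chunk=1 << 20):
       """Yield the source file's bytes by pulling them from the
       source server with standard NFSv4.x operations."""
       stateid = open_claim_fh(source_fh)      # OPEN by filehandle
       offset = 0
       while True:
           data, eof = nfs_read(stateid, offset, chunk)
           if data:
               yield data
           offset += len(data)
           if eof:
               return

   # Toy demo over an in-memory "file".
   blob = b"x" * (2 * 1024 * 1024 + 5)
   print(sum(map(len, pull_file(
       b"\x12\x34",
       lambda fh: "stateid-1",
       lambda sid, off, cnt: (blob[off:off + cnt],
                              off + cnt >= len(blob))))))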
757 4.2.3.2. Using an alternative Server-to-Server Copy Protocol 759 In a homogeneous environment, the source and destination servers 760 might be able to perform the file copy extremely efficiently using 761 specialized protocols. For example the source and destination 762 servers might be two nodes sharing a common file system format for 763 the source and destination file systems. Thus the source and 764 destination are in an ideal position to efficiently render the image 765 of the source file to the destination file by replicating the file 766 system formats at the block level. Another possibility is that the 767 source and destination might be two nodes sharing a common storage 768 area network, and thus there is no need to copy any data at all, and 769 instead ownership of the file and its contents might simply be re- 770 assigned to the destination. To allow for these possibilities, the 771 destination server is allowed to use a server-to-server copy protocol 772 of its choice. 774 In a heterogeneous environment, using a protocol other than NFSv4.x 775 (e.g., HTTP [14] or FTP [15]) presents some challenges. In 776 particular, the destination server is presented with the challenge of 777 accessing the source file given only an NFSv4.x filehandle. 779 One option for protocols that identify source files with path names 780 is to use an ASCII hexadecimal representation of the source 781 filehandle as the file name. 783 Another option for the source server is to use URLs to direct the 784 destination server to a specialized service. For example, the 785 response to COPY_NOTIFY could include the URL 786 ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII 787 hexadecimal representation of the source filehandle. When the 788 destination server receives the source server's URL, it would use 789 "_FH/0x12345" as the file name to pass to the FTP server listening on 790 port 9999 of s1.example.com. On port 9999 there would be a special 791 instance of the FTP service that understands how to convert NFS 792 filehandles to an open file descriptor (in many operating systems, 793 this would require a new system call, one which is the inverse of the 794 makefh() function that the pre-NFSv4 MOUNT service needs). 796 Authenticating and identifying the destination server to the source 797 server is also a challenge. Recommendations for how to accomplish 798 this are given in Section 4.4.1.2.4 and Section 4.4.1.4.
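The filehandle-to-name convention sketched above is simple enough to show in a few lines of illustrative Python; the URL shape follows the ftp://s1.example.com:9999/_FH/0x12345 example, and the helper itself is hypothetical.

   def source_file_url(filehandle: bytes, host: str, port: int) -> str:
       # ASCII hexadecimal rendering of the source filehandle,
       # embedded in a URL for the specialized copy service.
       return "ftp://%s:%d/_FH/0x%s" % (host, port, filehandle.hex())

   print(source_file_url(b"\x12\x34\x56", "s1.example.com", 9999))
   # -> ftp://s1.example.com:9999/_FH/0x123456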
800 4.3. Operations 802 In the sections that follow, several operations are defined that 803 together provide the server-side copy feature. These operations are 804 intended to be OPTIONAL operations as defined in section 17 of [2]. 805 The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS 806 operations are designed to be sent within an NFSv4 COMPOUND 807 procedure. The CB_COPY operation is designed to be sent within an 808 NFSv4 CB_COMPOUND procedure. 810 Each operation is performed in the context of the user identified by 811 the ONC RPC credential of its containing COMPOUND or CB_COMPOUND 812 request. For example, a COPY_ABORT operation issued by a given user 813 indicates that a specified COPY operation initiated by the same user 814 be canceled. Therefore a COPY_ABORT MUST NOT interfere with a copy 815 of the same file initiated by another user. 817 An NFS server MAY allow an administrative user to monitor or cancel 818 copy operations using an implementation-specific interface. 820 4.3.1. netloc4 - Network Locations 822 The server-side copy operations specify network locations using the 823 netloc4 data type shown below: 825 enum netloc_type4 { 826 NL4_NAME = 0, 827 NL4_URL = 1, 828 NL4_NETADDR = 2 829 }; 830 union netloc4 switch (netloc_type4 nl_type) { 831 case NL4_NAME: utf8str_cis nl_name; 832 case NL4_URL: utf8str_cis nl_url; 833 case NL4_NETADDR: netaddr4 nl_addr; 834 }; 836 If the netloc4 is of type NL4_NAME, the nl_name field MUST be 837 specified as a UTF-8 string. The nl_name is expected to be resolved 838 to a network address via DNS, LDAP, NIS, /etc/hosts, or some other 839 means. If the netloc4 is of type NL4_URL, a server URL [5] 840 appropriate for the server-to-server copy operation is specified as a 841 UTF-8 string. If the netloc4 is of type NL4_NETADDR, the nl_addr 842 field MUST contain a valid netaddr4 as defined in Section 3.3.9 of 843 [2]. 845 When netloc4 values are used for an inter-server copy as shown in 846 Figure 3, their values may be evaluated on the source server, 847 destination server, and client. The network environment in which 848 these systems operate should be configured so that the netloc4 values 849 are interpreted as intended on each system. 851 4.3.2. Copy Offload Stateids 853 A server may perform a copy offload operation asynchronously. An 854 asynchronous copy is tracked using a copy offload stateid. Copy 855 offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, 856 and CB_COPY operations. 858 Section 8.2.4 of [2] specifies that stateids are valid until either 859 (A) the client or server restarts or (B) the client returns the 860 resource. 862 A copy offload stateid will be valid until either (A) the client or 863 server restarts or (B) the client returns the resource by issuing a 864 COPY_ABORT operation or the client replies to a CB_COPY operation. 866 A copy offload stateid's seqid MUST NOT be 0 (zero). In the context 867 of a copy offload operation, it is ambiguous to indicate the most 868 recent copy offload operation using a stateid with seqid of 0 (zero). 869 Therefore a copy offload stateid with seqid of 0 (zero) MUST be 870 considered invalid.
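As a minimal illustration of the netloc4 type defined in Section 4.3.1, the sketch below (Python, illustrative only) dispatches on nl_type to obtain something a copy participant can connect to; resolving NL4_NAME via socket.gethostbyname() is just one example of the "other means" mentioned above, and URL handling is deliberately left abstract.

   import socket

   def resolve_netloc(nl_type, value):
       if nl_type == "NL4_NAME":           # host name: resolve it
           return socket.gethostbyname(value)
       if nl_type == "NL4_URL":            # server URL: use as given
           return value
       if nl_type == "NL4_NETADDR":        # already a netaddr4
           return value
       raise ValueError("unknown netloc_type4")

   print(resolve_netloc("NL4_NAME", "localhost"))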
872 4.4. Security Considerations 874 The security considerations pertaining to NFSv4 [10] apply to this 875 document. 877 The standard security mechanisms provided by NFSv4 [10] may be used 878 to secure the protocol described in this document. 880 NFSv4 clients and servers supporting the inter-server copy 881 operations described in this document are REQUIRED to implement [6], 882 including the RPCSEC_GSSv3 privileges copy_from_auth and 883 copy_to_auth. If the server-to-server copy protocol is ONC RPC 884 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 885 privilege copy_confirm_auth. These requirements to implement are not 886 requirements to use. NFSv4 clients and servers are RECOMMENDED to 887 use [6] to secure server-side copy operations. 889 4.4.1. Inter-Server Copy Security 891 4.4.1.1. Requirements for Secure Inter-Server Copy 893 Inter-server copy is driven by several requirements: 895 o The specification MUST NOT mandate an inter-server copy protocol. 896 There are many ways to copy data. Some will be more optimal than 897 others depending on the identities of the source server and 898 destination server. For example the source and destination 899 servers might be two nodes sharing a common file system format for 900 the source and destination file systems. Thus the source and 901 destination are in an ideal position to efficiently render the 902 image of the source file to the destination file by replicating 903 the file system formats at the block level. In other cases, the 904 source and destination might be two nodes sharing a common storage 905 area network, and thus there is no need to copy any data at all, 906 and instead ownership of the file and its contents simply gets re- 907 assigned to the destination. 909 o The specification MUST provide guidance for using NFSv4.x as a 910 copy protocol. For those source and destination servers willing 911 to use NFSv4.x there are specific security considerations that 912 this specification can and does address. 914 o The specification MUST NOT mandate pre-configuration between the 915 source and destination server. Requiring that the source and 916 destination first have a "copying relationship" increases the 917 administrative burden. However the specification MUST NOT 918 preclude implementations that require pre-configuration. 920 o The specification MUST NOT mandate a trust relationship between 921 the source and destination server. The NFSv4 security model 922 requires mutual authentication between a principal on an NFS 923 client and a principal on an NFS server. This model MUST continue 924 with the introduction of COPY. 926 4.4.1.2. Inter-Server Copy with RPCSEC_GSSv3 928 When the client sends a COPY_NOTIFY to the source server to tell it 929 to expect the destination server to attempt to copy data from it, the 930 copy is expected to be done on behalf of the principal (called the 931 "user principal") that sent the RPC request that encloses the COMPOUND 932 procedure containing the COPY_NOTIFY operation. The user principal 933 is identified by the RPC credentials. A mechanism is therefore 934 necessary that allows the user principal to authorize the destination 935 server to perform the copy, that lets the source server properly 936 authenticate the destination's copy requests, and that does not allow 937 the destination to exceed its authorization. 939 An approach that sends delegated credentials of the client's user 940 principal to the destination server is not used for the following 941 reasons. If the client's user principal delegated its credentials, the 942 destination would authenticate as the user principal. If the 943 destination were using the NFSv4 protocol to perform the copy, then 944 the source server would authenticate the destination server as the 945 user principal, and the file copy would securely proceed. However, 946 this approach would allow the destination server to copy other files. 947 The user principal would have to trust the destination server to not 948 do so. This is counter to the requirements, and therefore is not 949 considered. Instead, an approach using RPCSEC_GSSv3 [6] privileges is 950 proposed. 952 One of the stated applications of the proposed RPCSEC_GSSv3 protocol 953 is compound client host and user authentication [+ privilege 954 assertion]. For inter-server file copy, we require compound NFS 955 server host and user authentication [+ privilege assertion]. The 956 distinction between the two is one without meaning. 958 RPCSEC_GSSv3 introduces the notion of privileges. We define three 959 privileges:
961 copy_from_auth: A user principal is authorizing a source principal 962 ("nfs@<source>") to allow a destination principal 963 ("nfs@<destination>") to copy a file from the source to the 964 destination. This privilege is established on the source server 965 before the user principal sends a COPY_NOTIFY operation to the 966 source server. 967 struct copy_from_auth_priv { 968 secret4 cfap_shared_secret; 969 netloc4 cfap_destination; 970 /* the NFSv4 user name that the user principal maps to */ 971 utf8str_mixed cfap_username; 972 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 973 unsigned int cfap_seq_num; 974 }; 976 cfap_shared_secret is a secret value the user principal generates. 978 copy_to_auth: A user principal is authorizing a destination 979 principal ("nfs@<destination>") to allow it to copy a file from 980 the source to the destination. This privilege is established on 981 the destination server before the user principal sends a COPY 982 operation to the destination server. 984 struct copy_to_auth_priv { 985 /* equal to cfap_shared_secret */ 986 secret4 ctap_shared_secret; 987 netloc4 ctap_source; 988 /* the NFSv4 user name that the user principal maps to */ 989 utf8str_mixed ctap_username; 990 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 991 unsigned int ctap_seq_num; 992 }; 994 ctap_shared_secret is a secret value the user principal generated 995 and that was used to establish the copy_from_auth privilege with the 996 source principal. 998 copy_confirm_auth: A destination principal is confirming with the 999 source principal that it is authorized to copy data from the 1000 source on behalf of the user principal. When the inter-server 1001 copy protocol is NFSv4, or for that matter, any protocol capable 1002 of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol), 1003 this privilege is established before the file is copied from the 1004 source to the destination. 1006 struct copy_confirm_auth_priv { 1007 /* equal to GSS_GetMIC() of cfap_shared_secret */ 1008 opaque ccap_shared_secret_mic<>; 1009 /* the NFSv4 user name that the user principal maps to */ 1010 utf8str_mixed ccap_username; 1011 /* equal to seq_num of rpc_gss_cred_vers_3_t */ 1012 unsigned int ccap_seq_num; 1013 };
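Before walking through the context establishment, the following illustrative Python sketch shows how a user principal might fill in the first two privilege payloads with a freshly generated shared secret; plain dictionaries stand in for XDR encoding, and GSS_Wrap()/GSS_GetMIC() protection is omitted. All names and values here are assumptions of the example.

   import os

   secret = os.urandom(32)             # cfap/ctap_shared_secret

   copy_from_auth_priv = {
       "cfap_shared_secret": secret,
       "cfap_destination": "dest.example.com",   # netloc4
       "cfap_username": "user@example.com",      # NFSv4 user name
       "cfap_seq_num": 1,   # seq_num of the RPCSEC_GSS3_CREATE cred
   }
   copy_to_auth_priv = {
       "ctap_shared_secret": secret,             # same secret
       "ctap_source": "source.example.com",
       "ctap_username": "user@example.com",
       "ctap_seq_num": 1,
   }
   assert (copy_from_auth_priv["cfap_shared_secret"] ==
           copy_to_auth_priv["ctap_shared_secret"])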
1015 4.4.1.2.1. Establishing a Security Context 1017 When the user principal wants to COPY a file between two servers, if 1018 it has not established copy_from_auth and copy_to_auth privileges on 1019 the servers, it establishes them: 1021 o The user principal generates a secret it will share with the two 1022 servers. This shared secret will be placed in the 1023 cfap_shared_secret and ctap_shared_secret fields of the 1024 appropriate privilege data types, copy_from_auth_priv and 1025 copy_to_auth_priv. 1027 o An instance of copy_from_auth_priv is filled in with the shared 1028 secret, the destination server, and the NFSv4 user id of the user 1029 principal. It will be sent with an RPCSEC_GSS3_CREATE procedure, 1030 and so cfap_seq_num is set to the seq_num of the credential of the 1031 RPCSEC_GSS3_CREATE procedure. Because cfap_shared_secret is a 1032 secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with 1033 privacy) is invoked on copy_from_auth_priv. The 1034 RPCSEC_GSS3_CREATE procedure's arguments are: 1036 struct { 1037 rpc_gss3_gss_binding *compound_binding; 1038 rpc_gss3_chan_binding *chan_binding_mic; 1039 rpc_gss3_assertion assertions<>; 1040 rpc_gss3_extension extensions<>; 1041 } rpc_gss3_create_args; 1043 The string "copy_from_auth" is placed in assertions[0].privs. The 1044 output of GSS_Wrap() is placed in extensions[0].data. The field 1045 extensions[0].critical is set to TRUE. The source server calls 1046 GSS_Unwrap() on the privilege, and verifies that the seq_num 1047 matches the credential. It then verifies that the NFSv4 user id 1048 being asserted matches the source server's mapping of the user 1049 principal. If it does, the privilege is established on the source 1050 server as: <"copy_from_auth", user id, destination>. The 1051 successful reply to RPCSEC_GSS3_CREATE has: 1053 struct { 1054 opaque handle<>; 1055 rpc_gss3_chan_binding *chan_binding_mic; 1056 rpc_gss3_assertion granted_assertions<>; 1057 rpc_gss3_assertion server_assertions<>; 1058 rpc_gss3_extension extensions<>; 1059 } rpc_gss3_create_res; 1061 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1062 use on COPY_NOTIFY requests involving the source and destination 1063 server. granted_assertions[0].privs will be equal to 1064 "copy_from_auth". The server will return a GSS_Wrap() of 1065 copy_from_auth_priv. 1067 o An instance of copy_to_auth_priv is filled in with the shared 1068 secret, the source server, and the NFSv4 user id. It will be sent 1069 with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set 1070 to the seq_num of the credential of the RPCSEC_GSS3_CREATE 1071 procedure. Because ctap_shared_secret is a secret, after XDR 1072 encoding copy_to_auth_priv, GSS_Wrap() is invoked on 1073 copy_to_auth_priv. The RPCSEC_GSS3_CREATE procedure's arguments 1074 are: 1076 struct { 1077 rpc_gss3_gss_binding *compound_binding; 1078 rpc_gss3_chan_binding *chan_binding_mic; 1079 rpc_gss3_assertion assertions<>; 1080 rpc_gss3_extension extensions<>; 1081 } rpc_gss3_create_args; 1083 The string "copy_to_auth" is placed in assertions[0].privs. The 1084 output of GSS_Wrap() is placed in extensions[0].data. The field 1085 extensions[0].critical is set to TRUE. After unwrapping, 1086 verifying the seq_num, and the user principal to NFSv4 user ID 1087 mapping, the destination establishes a privilege of 1088 <"copy_to_auth", user id, source>. The successful reply to 1089 RPCSEC_GSS3_CREATE has: 1091 struct { 1092 opaque handle<>; 1093 rpc_gss3_chan_binding *chan_binding_mic; 1094 rpc_gss3_assertion granted_assertions<>; 1095 rpc_gss3_assertion server_assertions<>; 1096 rpc_gss3_extension extensions<>; 1097 } rpc_gss3_create_res; 1099 The field "handle" is the RPCSEC_GSSv3 handle that the client will 1100 use on COPY requests involving the source and destination server. 1101 The field granted_assertions[0].privs will be equal to 1102 "copy_to_auth". The server will return a GSS_Wrap() of 1103 copy_to_auth_priv. 1105 4.4.1.2.2. Starting a Secure Inter-Server Copy 1107 When the client sends a COPY_NOTIFY request to the source server, it 1108 uses the privileged "copy_from_auth" RPCSEC_GSSv3 handle. 1109 cna_destination_server in COPY_NOTIFY MUST be the same as the name of 1110 the destination server specified in copy_from_auth_priv. Otherwise, 1111 COPY_NOTIFY will fail with NFS4ERR_ACCESS. The source server 1112 verifies that the privilege <"copy_from_auth", user id, destination> 1113 exists, and annotates it with the source filehandle, if the user 1114 principal has read access to the source file, and if administrative 1115 policies give the user principal and the NFS client read access to 1116 the source file (i.e., if the ACCESS operation would grant read 1117 access). Otherwise, COPY_NOTIFY will fail with NFS4ERR_ACCESS. 1119 When the client sends a COPY request to the destination server, it 1120 uses the privileged "copy_to_auth" RPCSEC_GSSv3 handle.
1121  ca_source_server in COPY MUST be the same as the name of the source
1122  server specified in copy_to_auth_priv.  Otherwise, COPY will fail
1123  with NFS4ERR_ACCESS.  The destination server verifies that the
1124  privilege <"copy_to_auth", user id, source> exists, and annotates it
1125  with the source and destination filehandles.  If the client has
1126  failed to establish the "copy_to_auth" privilege, the destination
1127  server will reject the request with NFS4ERR_PARTNER_NO_AUTH.

1129  If the client sends a COPY_REVOKE to the source server to rescind the
1130  destination server's copy privilege, it uses the privileged
1131  "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server
1132  in COPY_REVOKE MUST be the same as the name of the destination server
1133  specified in copy_from_auth_priv.  The source server will then delete
1134  the <"copy_from_auth", user id, destination> privilege and fail any
1135  subsequent copy requests sent under the auspices of this privilege
1136  from the destination server.

1138  4.4.1.2.3.  Securing ONC RPC Server-to-Server Copy Protocols

1140  After a destination server has a "copy_to_auth" privilege established
1141  on it, and it receives a COPY request, if it knows it will use an ONC
1142  RPC protocol to copy data, it will establish a "copy_confirm_auth"
1143  privilege on the source server, using nfs@<destination> as the
1144  initiator principal, and nfs@<source> as the target principal.

1146  The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of
1147  the shared secret passed in the copy_to_auth privilege.  The field
1148  ccap_username is the mapping of the user principal to an NFSv4 user
1149  name ("user"@"domain" form), and MUST be the same as ctap_username
1150  and cfap_username.  The field ccap_seq_num is the seq_num of the
1151  RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the
1152  destination will send to the source server to establish the
1153  privilege.

1155  The source server verifies the privilege, and establishes a
1156  <"copy_confirm_auth", user id, destination> privilege.  If the source
1157  server fails to verify the privilege, the COPY operation will be
1158  rejected with NFS4ERR_PARTNER_NO_AUTH.  All subsequent ONC RPC
1159  requests sent from the destination to copy data from the source to
1160  the destination will use the RPCSEC_GSSv3 handle returned by the
1161  source's RPCSEC_GSS3_CREATE response.

1163  Note that the use of the "copy_confirm_auth" privilege accomplishes
1164  the following:

1166  o  if a protocol like NFS is being used, with export policies, the
1167     export policies can be overridden in case the destination server,
1168     acting as an NFS client, is not authorized

1170  o  manual configuration to allow a copy relationship between the
1171     source and destination is not needed.

1173  If the attempt to establish a "copy_confirm_auth" privilege fails,
1174  then when the user principal sends a COPY request to the destination,
1175  the destination server will reject it with NFS4ERR_PARTNER_NO_AUTH.

1177  4.4.1.2.4.  Securing Non ONC RPC Server-to-Server Copy Protocols

1179  If the destination won't be using ONC RPC to copy the data, then the
1180  source and destination are using an unspecified copy protocol.  The
1181  destination could use the shared secret and the NFSv4 user id to
1182  prove to the source server that the user principal has authorized the
1183  copy.
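One possibility, expanded upon below, is to present the shared secret
as an ASCII hexadecimal token.  The following non-normative C sketch
shows such a rendering; hex_secret() and its calling convention are
illustrative assumptions, not part of the protocol:

   #include <stddef.h>
   #include <stdint.h>

   /* Render a shared secret as ASCII hexadecimal (illustrative).
    * The output buffer must hold at least 2 * len + 1 bytes. */
   static void hex_secret(const uint8_t *secret, size_t len, char *out)
   {
       static const char hex[] = "0123456789abcdef";
       for (size_t i = 0; i < len; i++) {
           out[2 * i]     = hex[secret[i] >> 4];
           out[2 * i + 1] = hex[secret[i] & 0x0f];
       }
       out[2 * len] = '\0';
   }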
1185  For protocols that authenticate user names with passwords (e.g., HTTP
1186  [14] and FTP [15]), the NFSv4 user id could be used as the user name,
1187  and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
1188  secret could be used as the user password or as input into non-
1189  password authentication methods like CHAP [16].

1191  4.4.1.3.  Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3

1193  ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
1194  server-side copy offload operations described in this document.  In
1195  particular, host-based ONC RPC security flavors such as AUTH_NONE and
1196  AUTH_SYS MAY be used.  If a host-based security flavor is used, a
1197  minimal level of protection for the server-to-server copy protocol is
1198  possible.

1200  In the absence of strong security mechanisms such as RPCSEC_GSSv3,
1201  the challenge is how the source server and destination server
1202  identify themselves to each other, especially in the presence of
1203  multi-homed source and destination servers.  In a multi-homed
1204  environment, the destination server might not contact the source
1205  server from the same network address specified by the client in the
1206  COPY_NOTIFY.  This can be overcome using the procedure described
1207  below.

1209  When the client sends the source server the COPY_NOTIFY operation,
1210  the source server may reply to the client with a list of target
1211  addresses, names, and/or URLs and assign them to the unique triple:
1212  <source fh, user ID, destination address>.  If the destination uses
1213  one of these target netlocs to contact the source server, the source
1214  server will be able to uniquely identify the destination server, even
1215  if the destination server does not connect from the address specified
1216  by the client in COPY_NOTIFY.

1218  For example, suppose the network topology is as shown in Figure 3.
1219  If the source filehandle is 0x12345, the source server may respond to
1220  a COPY_NOTIFY for destination 192.0.2.56 with the URLs:

1222  nfs://192.0.2.18//_COPY/192.0.2.56/_FH/0x12345

1224  nfs://198.51.100.18//_COPY/192.0.2.56/_FH/0x12345

1226  The client will then send these URLs to the destination server in the
1227  COPY operation.  Suppose that the 198.51.100.0/24 network is a high
1228  speed network and the destination server decides to transfer the file
1229  over this network.  If the destination contacts the source server
1230  from 198.51.100.56 over this network using NFSv4.1, it does the
1231  following:

1233  COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "192.0.2.56"; LOOKUP
1234  "_FH" ; OPEN "0x12345" ; GETFH }

1236  The source server will therefore know that these NFSv4.1 operations
1237  are being issued by the destination server identified in the
1238  COPY_NOTIFY.

1240  4.4.1.4.  Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

1242  The same techniques as in Section 4.4.1.3, using unique URLs for each
1243  destination server, can be used for other protocols (e.g., HTTP [14]
1244  and FTP [15]) as well.
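As a non-normative sketch of the unique-URL technique under the _COPY
naming scheme shown above, a source server might generate one URL per
local network address for a given destination address and source
filehandle; make_copy_urls() is an illustrative assumption:

   #include <stdio.h>

   /* Emit one copy URL per source-server address (illustrative). */
   static void make_copy_urls(const char *const src_addrs[], int n,
                              const char *dst_addr, const char *fh)
   {
       for (int i = 0; i < n; i++)
           printf("nfs://%s//_COPY/%s/_FH/%s\n",
                  src_addrs[i], dst_addr, fh);
   }

   /* Example reproducing the URLs above:
    *   const char *a[] = { "192.0.2.18", "198.51.100.18" };
    *   make_copy_urls(a, 2, "192.0.2.56", "0x12345");
    */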
1246  5.  Application Data Block Support

1248  At the OS level, files are stored in disk blocks.  Applications
1249  are also free to impose structure on the data contained in a file and
1250  we can define an Application Data Block (ADB) to be such a structure.
1251  From the application's viewpoint, it only wants to handle ADBs and
1252  not raw bytes (see [17]).  An ADB is typically composed of two
1253  sections: a header and data.  The header describes the
1254  characteristics of the block and can provide a means to detect
1255  corruption in the data payload.  The data section is typically
1256  initialized to all zeros.

1258  The format of the header is application specific, but there are two
1259  main components typically encountered:

1261  1.  An ADB Number (ADBN), which allows the application to determine
1262      which data block is being referenced.  The ADBN is a logical
1263      block number and is useful when the client is not storing the
1264      blocks in contiguous memory.

1266  2.  Fields to describe the state of the ADB and a means to detect
1267      block corruption.  For both pieces of data, a useful property is
1268      that the allowed values be unique in that, if passed across the
1269      network, corruption due to translation between big and little
1270      endian architectures is detectable.  For example, 0xF0DEDEF0 has
1271      the same bit pattern in both architectures, as the sketch
      following this list demonstrates.
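A non-normative C sketch of that property: byte-swapping 0xF0DEDEF0,
as a big endian to little endian translation would, yields the same
word, so a correctly stored guard value reads back identically on
either architecture.

   #include <stdint.h>
   #include <stdio.h>

   /* Swap the byte order of a 32-bit word, as a cross-endian
    * transfer without conversion would. */
   static uint32_t bswap32(uint32_t v)
   {
       return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
              ((v << 8) & 0x00ff0000u) | (v << 24);
   }

   int main(void)
   {
       uint32_t guard = 0xf0dedef0u;
       /* Prints 1: the byte-symmetric pattern survives the swap. */
       printf("%d\n", bswap32(guard) == guard);
       return 0;
   }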
1273  Applications already impose structures on files [17] and detect
1274  corruption in data blocks [18].  What they are not able to do is
1275  efficiently transfer and store ADBs.  To initialize a file with ADBs,
1276  the client must send the full ADB to the server and that must be
1277  stored on the server.  When the application is initializing a file to
1278  have the ADB structure, it could compress the ADBs to just the
1279  information necessary to later reconstruct the header portion of
1280  the ADB when the contents are read back.  Using sparse file
1281  techniques, the disk blocks described by the ADBs would not be
1282  allocated.  Unlike sparse file techniques, there would be a small
1283  cost to store the compressed header data.

1285  In this section, we are going to define a generic framework for an
1286  ADB, present one approach to detecting corruption in a given ADB
1287  implementation, and describe the model for how the client and server
1288  can support efficient initialization of ADBs, reading of ADB holes,
1289  punching holes in ADBs, and space reservation.  Further, we need to
1290  be able to extend this model to applications which do not support
1291  ADBs, but wish to be able to handle sparse files, hole punching, and
1292  space reservation.

1294  5.1.  Generic Framework

1296  We want the representation of the ADB to be flexible enough to
1297  support many different applications.  The most basic approach is no
1298  imposition of a block at all, which means we are working with the raw
1299  bytes.  Such an approach would be useful for storing holes, punching
1300  holes, etc.  In more complex deployments, a server might be
1301  supporting multiple applications, each with their own definition of
1302  the ADB.  One might store the ADBN at the start of the block and then
1303  have a guard pattern to detect corruption [19].  The next might store
1304  the ADBN at an offset of 100 bytes within the block and have no guard
1305  pattern at all.  The point is that existing applications might
1306  already have well defined formats for their data blocks.

1308  The guard pattern can be used to represent the state of the block, to
1309  protect against corruption, or both.  Again, it needs to be able to
1310  be placed anywhere within the ADB.

1312  We need to be able to represent the starting offset of the block and
1313  the size of the block.  Note that nothing prevents the application
1314  from defining different sized blocks in a file.

1316  5.1.1.  Data Block Representation

1318  struct app_data_block4 {
1319      offset4     adb_offset;
1320      length4     adb_block_size;
1321      length4     adb_block_count;
1322      length4     adb_reloff_blocknum;
1323      count4      adb_block_num;
1324      length4     adb_reloff_pattern;
1325      opaque      adb_pattern<>;
1326  };

1328  The app_data_block4 structure captures the abstraction presented for
1329  the ADB.  The additional fields present are to allow the transmission
1330  of adb_block_count ADBs at one time.  We also use adb_block_num to
1331  convey the ADBN of the first block in the sequence.  Each ADB will
1332  contain the same adb_pattern string.

1334  As both adb_block_num and adb_pattern are optional, if either
1335  adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
1336  then the corresponding field is not set in any of the ADBs.

1338  5.1.2.  Data Content

1340  /*
1341   * Use an enum such that we can extend it with new types.
1342   */
1343  enum data_content4 {
1344      NFS4_CONTENT_DATA = 0,
1345      NFS4_CONTENT_APP_BLOCK = 1,
1346      NFS4_CONTENT_HOLE = 2
1347  };

1349  New operations might need to differentiate between wanting to access
1350  data versus an ADB.  Also, future minor versions might want to
1351  introduce new data formats.  This enumeration allows that to occur.

1353  5.2.  pNFS Considerations

1355  While this document does not mandate how sparse ADBs are recorded on
1356  the server, it does make the assumption that such information is not
1357  in the file.  I.e., the information is metadata.  As such, the
1358  INITIALIZE operation is defined to be not supported by the DS - it
1359  must be issued to the MDS.  But since the client must not assume a
1360  priori whether a read is sparse or not, the READ_PLUS operation MUST
1361  be supported by both the DS and the MDS.  I.e., servicing the request
1362  might require the MDS to asynchronously read the data from the DS.

1364  Furthermore, each DS MUST NOT report to a client either a sparse ADB
1365  or data which belongs to another DS.  One implication of this
1366  requirement is that the app_data_block4's adb_block_size MUST either
1367  be the stripe width or the stripe width MUST be an even
1368  multiple of it.

1370  The second implication here is that the DS must be able to use the
1371  Control Protocol to determine from the MDS where the sparse ADBs
1372  occur.  [[Comment.1: Need to discuss what happens if the file
1373  is being written to and an INITIALIZE occurs.  --TH]]  Perhaps instead
1374  of the DS pulling from the MDS, the MDS pushes to the DS?  Thus an
1375  INITIALIZE causes a new push?  [[Comment.2: Still need to consider
1376  race cases of the DS getting a WRITE and the MDS getting an
1377  INITIALIZE.  --TH]]
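The stripe-width constraint above can be stated compactly.  The
following non-normative helper returns true exactly when the
constraint holds; adb_size_ok() is not a protocol element:

   #include <stdbool.h>
   #include <stdint.h>

   /* True when adb_block_size equals the stripe width or the stripe
    * width is an even multiple of adb_block_size. */
   static bool adb_size_ok(uint64_t adb_block_size,
                           uint64_t stripe_width)
   {
       return adb_block_size != 0 &&
              stripe_width % adb_block_size == 0;
   }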
1379  5.3.  An Example of Detecting Corruption

1381  In this section, we define an ADB format in which corruption can be
1382  detected.  Note that this is just one possible format and means to
1383  detect corruption.

1385  Consider a very basic implementation of an operating system's disk
1386  blocks.  A block is either data or it is an indirect block which
1387  allows for files to be larger than one block.  It is desired to be
1388  able to initialize a block.  Lastly, to quickly unlink a file, a
1389  block can be marked invalid.  The contents remain intact - which
1390  would enable this OS application to undelete a file.

1392  The application defines 4k sized data blocks, with an 8 byte block
1393  counter occurring at offset 0 in the block, and with the guard
1394  pattern occurring at offset 8 inside the block.  Furthermore, the
1395  guard pattern can take one of four states:

1397  0xfeedface -  This is the FREE state and indicates that the ADB
1398     format has been applied.

1400  0xcafedead -  This is the DATA state and indicates that real data
1401     has been written to this block.

1403  0xe4e5c001 -  This is the INDIRECT state and indicates that the
1404     block contains block counter numbers that are chained off of this
1405     block.

1407  0xba1ed4a3 -  This is the INVALID state and indicates that the block
1408     contains data whose contents are garbage.

1410  Finally, it also defines an 8 byte checksum [20] starting at byte 16
1411  which applies to the remaining contents of the block.  If the state
1412  is FREE, then that checksum is trivially zero.  As such, the
1413  application has no need to transfer the checksum explicitly - it
1414  need not make the transfer layer aware of the fact that
1415  there is a checksum (see [18] for an example of checksums used to
1416  detect corruption in application data blocks).

1418  Corruption in each ADB can be detected as follows (a non-normative
      check function follows the list):

1420  o  If the guard pattern is anything other than one of the allowed
1421     values, including all zeros.

1423  o  If the guard pattern is FREE and any other byte in the remainder
1424     of the ADB is anything other than zero.

1426  o  If the guard pattern is anything other than FREE, then if the
1427     stored checksum does not match the computed checksum.

1429  o  If the guard pattern is INDIRECT and one of the stored indirect
1430     block numbers has a value greater than the number of ADBs in the
1431     file.

1433  o  If the guard pattern is INDIRECT and one of the stored indirect
1434     block numbers is a duplicate of another stored indirect block
1435     number.
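A minimal C sketch of the first three checks, assuming the layout
above (ADBN at offset 0, guard at offset 8, checksum at offset 16):
checksum64() stands in for the application's checksum algorithm, and
the INDIRECT checks are omitted because they require file-wide
context.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   #define ADB_SIZE       4096
   #define GUARD_FREE     0xfeedfaceu
   #define GUARD_DATA     0xcafedeadu
   #define GUARD_INDIRECT 0xe4e5c001u
   #define GUARD_INVALID  0xba1ed4a3u

   extern uint64_t checksum64(const uint8_t *buf, size_t len);

   static bool adb_is_corrupt(const uint8_t blk[ADB_SIZE])
   {
       uint32_t guard;
       uint64_t stored;

       memcpy(&guard, blk + 8, sizeof guard);
       memcpy(&stored, blk + 16, sizeof stored);

       if (guard != GUARD_FREE && guard != GUARD_DATA &&
           guard != GUARD_INDIRECT && guard != GUARD_INVALID)
           return true;              /* unknown guard value */

       if (guard == GUARD_FREE) {
           for (size_t i = 12; i < ADB_SIZE; i++)
               if (blk[i] != 0)
                   return true;      /* FREE blocks must be zero */
           return false;
       }

       /* Non-FREE: stored checksum must match the computed one. */
       return stored != checksum64(blk + 24, ADB_SIZE - 24);
   }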
1437  As can be seen, the application can detect errors based on the
1438  combination of the guard pattern state and the checksum.  But also,
1439  the application can detect corruption based on the state and the
1440  contents of the ADB.  This last point is important in validating the
1441  minimum amount of data we incorporated into our generic framework.
1442  I.e., the guard pattern is sufficient to allow applications to
1443  design their own corruption detection.

1445  Finally, it is important to note that none of these corruption checks
1446  occur in the transport layer.  The server and client components are
1447  totally unaware of the file format and might report everything as
1448  being transferred correctly even in cases where the application
1449  detects corruption.

1451  5.4.  Example of READ_PLUS

1453  The hypothetical application presented in Section 5.3 can be used to
1454  illustrate how READ_PLUS would return an array of results.  A file is
1455  created and initialized with 100 4k ADBs in the FREE state:

1457  INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}

1459  Further, if the application then writes a single ADB at 16k, changing
1460  the guard pattern to 0xcafedead, we would have in memory:

1462  0 -> (16k - 1)   : 4k, 4, 0, 0, 8, 0xfeedface
1463  16k -> (20k - 1) : 00 00 00 00 00 00 00 04 ca fe de ad XX XX ... XX XX
1464  20k -> 400k      : 4k, 95, 0, 5, 8, 0xfeedface

1466  And when the client did a READ_PLUS of 64k at the start of the file,
1467  it would get back a result of an ADB, some data, and a final ADB:

1469  ADB {0, 4k, 4, 0, 0, 8, 0xfeedface}
1470  data 4k
1471  ADB {20k, 4k, 11, 0, 5, 8, 0xfeedface}

1473  5.5.  Zero Filled Holes

1475  As applications are free to define the structure of an ADB, it is
1476  trivial to define an ADB which supports zero filled holes.  Such a
1477  case would encompass the traditional definitions of a sparse file and
1478  hole punching.  For example, to punch a 64k hole, starting at 100M,
1479  into an existing file which has no ADB structure:

1481  INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
1482              0, NFS4_UINT64_MAX, 0x0}

1484  6.  Space Reservation

1486  6.1.  Introduction

1488  This section describes a set of operations that allow applications
1489  such as hypervisors to reserve space for a file, report the amount of
1490  actual disk space a file occupies, and free up the backing space of a
1491  file when it is not required.

1493  In virtualized environments, virtual disk files are often stored on
1494  NFS mounted volumes.  Since virtual disk files represent the hard
1495  disks of virtual machines, hypervisors often have to guarantee
1496  certain properties for the file.

1498  One such example is space reservation.  When a hypervisor creates a
1499  virtual disk file, it often tries to preallocate the space for the
1500  file so that there are no future allocation related errors during the
1501  operation of the virtual machine.  Such errors prevent a virtual
1502  machine from continuing execution and result in downtime.

1504  Another useful feature would be the ability to report the number of
1505  blocks that would be freed when a file is deleted.  Currently, NFS
1506  reports two size attributes:

1508  size  The logical file size of the file.

1510  space_used  The size in bytes that the file occupies on disk.

1512  While these attributes are sufficient for space accounting in
1513  traditional filesystems, they prove to be inadequate in modern
1514  filesystems that support block sharing.  Having a way to tell the
1515  number of blocks that would be freed if the file was deleted would be
1516  useful to applications that wish to migrate files when a volume is
1517  low on space.

1519  Since virtual disks represent a hard drive in a virtual machine, a
1520  virtual disk can be viewed as a filesystem within a file.  Since not
1521  all blocks within a filesystem are in use, there is an opportunity to
1522  reclaim blocks that are no longer in use.  A call to deallocate
1523  blocks could result in better space efficiency.  Less space may be
1524  consumed for backups after block deallocation.

1526  We propose the following operations and attributes for the
1527  aforementioned use cases:

1529  space_reserved  This attribute specifies whether the blocks backing
1530     the file have been preallocated.

1532  space_freed  This attribute specifies the space freed when a file is
1533     deleted, taking block sharing into consideration.

1535  max_hole_punch  This attribute specifies the maximum sized hole that
1536     can be punched on the filesystem.

1538  HOLE_PUNCH  This operation zeroes and/or deallocates the blocks
1539     backing a region of the file.

1541  6.2.  Use Cases

1542  6.2.1.  Space Reservation

1544  Some applications require that once a file of a certain size is
1545  created, writes to that file never fail with an out of space
1546  condition.  One such example is that of a hypervisor writing to a
1547  virtual disk.  An out of space condition while writing to virtual
1548  disks would mean that the virtual machine would need to be frozen.

1550  Currently, in order to achieve such a guarantee, applications zero
1551  the entire file.  The initial zeroing allocates the backing blocks
1552  and all subsequent writes are overwrites of already allocated blocks.
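A non-normative POSIX sketch of that workaround: every byte of the
file is written once, purely to force block allocation.  zero_fill()
is illustrative.

   #include <string.h>
   #include <sys/types.h>
   #include <unistd.h>

   /* Preallocate by writing zeroes over [0, size) - the costly
    * workaround that space_reserved is meant to replace. */
   static int zero_fill(int fd, off_t size)
   {
       char buf[65536];
       memset(buf, 0, sizeof buf);
       for (off_t off = 0; off < size; ) {
           size_t n = sizeof buf;
           if ((off_t)n > size - off)
               n = (size_t)(size - off);
           ssize_t w = pwrite(fd, buf, n, off);
           if (w < 0)
               return -1;
           off += w;
       }
       return 0;
   }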
1553  This approach is not only inefficient in terms of the amount of I/O
1554  done, it is also not guaranteed to work on filesystems that are log
1555  structured or deduplicated.  An efficient way of guaranteeing space
1556  reservation would be beneficial to such applications.

1558  If the space_reserved attribute is set on a file, it is guaranteed
1559  that writes that do not grow the file will not fail with
1560  NFS4ERR_NOSPC.

1562  6.2.2.  Space freed on deletes

1564  Currently, files in NFS have two size attributes:

1566  size  The logical file size of the file.

1568  space_used  The size in bytes that the file occupies on disk.

1570  While these attributes are sufficient for space accounting in
1571  traditional filesystems, they prove to be inadequate in modern
1572  filesystems that support block sharing.  In such filesystems,
1573  multiple inodes can point to a single block with a block reference
1574  count to guard against premature freeing.

1576  If space_used of a file is interpreted to mean the size in bytes of
1577  all disk blocks pointed to by the inode of the file, then shared
1578  blocks get double counted, over-reporting the space utilization.
1579  This also has the adverse effect that the deletion of a file with
1580  shared blocks frees up less than space_used bytes.

1582  On the other hand, if space_used is interpreted to mean the size in
1583  bytes of those disk blocks unique to the inode of the file, then
1584  shared blocks are not counted in any file, resulting in under-
1585  reporting of the space utilization.

1587  For example, two files A and B have 10 blocks each.  Let 6 of these
1588  blocks be shared between them.  Thus, the combined space utilized by
1589  the two files is 14 * BLOCK_SIZE bytes.  In the former case, the
1590  combined space utilization of the two files would be reported as 20 *
1591  BLOCK_SIZE.  However, deleting either file would only result in 4 *
1592  BLOCK_SIZE being freed.  Conversely, the latter interpretation would
1593  report that the space utilization is only 8 * BLOCK_SIZE.

1595  Adding another size attribute, space_freed, is helpful in solving
1596  this problem.  space_freed is the number of blocks that are allocated
1597  to the given file that would be freed on its deletion.  In the
1598  example, both A and B would report space_freed as 4 * BLOCK_SIZE and
1599  space_used as 10 * BLOCK_SIZE.  If A is deleted, B will report
1600  space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
1601  the deallocation of all 10 blocks.

1603  The addition of this attribute does not solve the problem of space
1604  being over-reported.  However, over-reporting is better than under-
1605  reporting.
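The example can be restated as a non-normative toy computation over
per-block reference counts; the helpers and constants are
illustrative:

   #include <stdio.h>

   #define BLOCK_SIZE 4096L

   /* Files A and B each have 10 blocks; 6 are shared (refcount 2). */
   int main(void)
   {
       int shared = 6, unique = 4;
       int shared_refcnt = 2;

       long space_used  = (long)(shared + unique) * BLOCK_SIZE;
       /* space_freed counts only blocks whose refcount would drop
        * to zero when this file is deleted. */
       long space_freed = (long)unique * BLOCK_SIZE +
           (shared_refcnt == 1 ? (long)shared * BLOCK_SIZE : 0);

       printf("space_used  = %ld\n", space_used);  /* 10 * BLOCK_SIZE */
       printf("space_freed = %ld\n", space_freed); /*  4 * BLOCK_SIZE */
       return 0;
   }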
1607  6.2.3.  Operations and attributes

1609  In the sections that follow, one operation and three attributes are
1610  defined that together provide the space management facilities
1611  outlined earlier in the document.  The operation is intended to be
1612  OPTIONAL and the attributes RECOMMENDED as defined in section 17 of
1613  [2].

1615  6.2.4.  Attribute 77: space_reserved

1617  The space_reserved attribute is a read/write attribute of type
1618  boolean.  It is a per file attribute.  When the space_reserved
1619  attribute is set via SETATTR, the server must ensure that there is
1620  disk space to accommodate every byte in the file before it can return
1621  success.  If the server cannot guarantee this, it must return
1622  NFS4ERR_NOSPC.

1624  If the client tries to grow a file which has the space_reserved
1625  attribute set, the server must guarantee that there is disk space to
1626  accommodate every byte in the file with the new size before it can
1627  return success.  If the server cannot guarantee this, it must return
1628  NFS4ERR_NOSPC.

1630  It is not required that the server allocate the space to the file
1631  before returning success.  The allocation can be deferred; however,
1632  it must be guaranteed that it will not fail for lack of space.

1634  The value of space_reserved can be obtained at any time through
1635  GETATTR.

1637  In order to avoid ambiguity, the space_reserved bit cannot be set
1638  along with the size bit in SETATTR.  Increasing the size of a file
1639  with space_reserved set will fail if space reservation cannot be
1640  guaranteed for the new size.  If the file size is decreased, space
1641  reservation is only guaranteed for the new size and the extra blocks
1642  backing the file can be released.

1644  6.2.5.  Attribute 78: space_freed

1646  space_freed gives the number of bytes freed if the file is deleted.
1647  This attribute is read only and is of type length4.  It is a per file
1648  attribute.

1650  6.2.6.  Attribute 79: max_hole_punch

1652  max_hole_punch specifies the maximum size of a hole that the
1653  HOLE_PUNCH operation can handle.  This attribute is read only and of
1654  type length4.  It is a per filesystem attribute.  This attribute MUST
1655  be implemented if HOLE_PUNCH is implemented.

1657  6.2.7.  Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing
1658          the file in the specified range.

1660  WARNING: Most of this section is now obsolete.  Parts of it need to
1661  be scavenged for the ADB discussion, but for the most part, it cannot
1662  be trusted.

1664  6.2.7.1.  DESCRIPTION

1666  Whenever a client wishes to deallocate the blocks backing a
1667  particular region in the file, it calls the HOLE_PUNCH operation with
1668  the current filehandle set to the filehandle of the file in question,
1669  and the start offset and length in bytes of the region set in
1670  hpa_offset and hpa_count, respectively.  All further reads to this
1671  region MUST return zeros until overwritten.  The filehandle
1672  specified must be that of a regular file.

1674  Situations may arise where hpa_offset and/or hpa_offset + hpa_count
1675  will not be aligned to a boundary that the server does allocations/
1676  deallocations in.  For most filesystems, this is the block size of
1677  the file system.  In such a case, the server can deallocate as many
1678  bytes as it can in the region.  The blocks that cannot be deallocated
1679  MUST be zeroed.  Except for the block deallocation and maximum hole
1680  punching capability, a HOLE_PUNCH operation is to be treated
1681  similarly to a write of zeroes.

1683  The server is not required to complete deallocating the blocks
1684  specified in the operation before returning.  It is acceptable to
1685  have the deallocation be deferred.  In fact, HOLE_PUNCH is merely a
1686  hint; it is valid for a server to return success without ever doing
1687  anything towards deallocating the blocks backing the region
1688  specified.  However, any future reads to the region MUST return
1689  zeroes.

1691  HOLE_PUNCH will result in the space_used attribute being decreased by
1692  the number of bytes that were deallocated.  The space_freed attribute
1693  may or may not decrease, depending on the support and whether the
1694  blocks backing the specified range were shared or not.  The size
1695  attribute will remain unchanged.

1697  The HOLE_PUNCH operation MUST NOT change the space reservation
1698  guarantee of the file.  While the server can deallocate the blocks
1699  specified by hpa_offset and hpa_count, future writes to this region
1700  MUST NOT fail with NFS4ERR_NOSPC.

1702  The HOLE_PUNCH operation may fail for the following reasons (this is
1703  a partial list):

1705  NFS4ERR_NOTSUPP  The HOLE_PUNCH operation is not supported by the
1706     NFS server receiving this request.

1708  NFS4ERR_ISDIR  The current filehandle is of type NF4DIR.

1710  NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

1712  NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
1713     ordinary file.
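To summarize, a minimal, hypothetical C rendering of the arguments
implied by the description (this is not normative XDR, particularly
given the warning above):

   #include <stdint.h>

   /* Hypothetical rendering of the arguments described above. */
   struct HOLE_PUNCH4args {
           /* CURRENT_FH: the regular file to punch */
           uint64_t hpa_offset;  /* first byte of the region */
           uint64_t hpa_count;   /* length of the region in bytes */
   };

   /* Post-condition: reads of [hpa_offset, hpa_offset + hpa_count)
    * return zeroes until overwritten; blocks that cannot be
    * deallocated are zeroed in place. */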
1715  7.  Sparse Files

1717  WARNING: Most of this section needs to be reworked because of the
1718  work going on in the ADB section.

1720  7.1.  Introduction

1722  A sparse file is a common way of representing a large file without
1723  having to utilize all of the disk space for it.  Consequently, a
1724  sparse file uses less physical space than its size indicates.  This
1725  means the file contains 'holes', byte ranges within the file that
1726  contain no data.  Most modern file systems support sparse files,
1727  including most UNIX file systems and NTFS, but notably not Apple's
1728  HFS+.  Common examples of sparse files include Virtual Machine (VM)
1729  OS/disk images, database files, log files, and even checkpoint
1730  recovery files most commonly used by the HPC community.

1732  If an application reads a hole in a sparse file, the file system must
1733  return all zeros to the application.  For local data access there is
1734  little penalty, but with NFS these zeroes must be transferred back to
1735  the client.  If an application uses the NFS client to read data into
1736  memory, this wastes time and bandwidth as the application waits for
1737  the zeroes to be transferred.

1739  A sparse file is typically created by initializing the file to be all
1740  zeros - nothing is written to the data in the file; instead the hole
1741  is recorded in the metadata for the file.  So an 8G disk image might
1742  be represented initially by a couple hundred bits in the inode and
1743  nothing on the disk.  If the VM then writes 100M to a file in the
1744  middle of the image, there would now be two holes represented in the
1745  metadata and 100M in the data.

1747  Other applications want to initialize a file to patterns other than
1748  zero.  The problem with initializing to zero is that it is often
1749  difficult to distinguish a byte-range initialized to all zeroes
1750  from data corruption, since a pattern of zeroes is a probable pattern
1751  for corruption.  Instead, some applications, such as database
1752  management systems, use patterns consisting of bytes or words of non-
1753  zero values.

1755  Besides reading sparse files and initializing them, applications
1756  might want to hole punch, which is the deallocation of the data
1757  blocks which back a region of the file.  At such time, the affected
1758  blocks are reinitialized to a pattern.

1760  This section introduces a new operation to read patterns from a file,
1761  READ_PLUS, and a new operation to both initialize patterns and to
1762  punch pattern holes into a file, WRITE_PLUS.  READ_PLUS supports all
1763  the features of READ but includes an extension to support sparse
1764  pattern files.  READ_PLUS is guaranteed to perform no worse than
1765  READ, and can dramatically improve performance with sparse files.
1766  READ_PLUS does not depend on pNFS protocol features, but can be used
1767  by pNFS to support sparse files.

1769  7.2.  Terminology

1771  Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

1773  Sparse file:  A Regular file that contains one or more Holes.

1775  Hole:  A byte range within a Sparse file that contains regions of all
1776     zeroes.  For block-based file systems, this could also be an
1777     unallocated region of the file.

1779  Hole Threshold:  The minimum length of a Hole as determined by the
1780     server.  If a server chooses to define a Hole Threshold, then it
1781     would not return hole information (nfs_readplusreshole) with a
1782     hole_offset and hole_length that specify a range shorter than the
1783     Hole Threshold.

1785  7.3.  Applications and Sparse Files

1787  Applications may cause an NFS client to read holes in a file for
1788  several reasons.  This section describes three different application
1789  workloads that cause the NFS client to transfer data unnecessarily.
1790  These workloads are simply examples, and there are probably many more
1791  workloads that are negatively impacted by sparse files.

1793  The first workload that can cause holes to be read is sequential
1794  reads within a sparse file.  When this happens, the NFS client may
1795  perform read requests ("readahead") into sections of the file not
1796  explicitly requested by the application.  Since the NFS client cannot
1797  differentiate between holes and non-holes, the NFS client may
1798  prefetch empty sections of the file.

1800  This workload is exemplified by Virtual Machines and their associated
1801  file system images, e.g., VMware .vmdk files, which are large sparse
1802  files encapsulating an entire operating system.  If a VM reads files
1803  within the file system image, this will translate to sequential NFS
1804  read requests into the much larger file system image file.  Since NFS
1805  does not understand the internals of the file system image, it ends
1806  up performing readahead of file holes.

1808  The second workload is generated by copying a file from a directory
1809  in NFS to the same NFS server, to another file system, e.g.,
1810  another NFS or Samba server, to a local ext3 file system, or even to
1811  a network socket.  In this case, bandwidth and server resources are
1812  wasted as the entire file is transferred from the NFS server to the
1813  NFS client.  Once a byte range of the file has been transferred to
1814  the client, it is up to the client application, e.g., rsync, cp, or
1815  scp, how it writes the data to the target location.  For example, cp
1816  supports sparse files and will not write all zero regions, whereas
1817  scp does not support sparse files and will transfer every byte of the
1818  file.

1820  The third workload is generated by applications that do not utilize
1821  the NFS client cache, but instead use direct I/O and manage cached
1822  data independently, e.g., databases.  These applications may perform
1823  whole file caching with sparse files, which would mean that even the
1824  holes will be transferred to the clients and cached.

1826  7.4.  Overview of Sparse Files and NFSv4

1828  This proposal seeks to provide sparse file support to the largest
1829  number of NFS client and server implementations, and as such proposes
1830  to add a new return code to the mandatory NFSv4.1 READ_PLUS operation
1831  instead of proposing additions or extensions of new or existing
1832  optional features (such as pNFS).
1834  As well, this document seeks to ensure that the proposed extensions
1835  are simple and do not transfer data between the client and server
1836  unnecessarily.  For example, one possible way to implement sparse
1837  file read support would be to have the client, on the first hole
1838  encountered or at OPEN time, request a Data Region Map from the
1839  server.  A Data Region Map would specify all zero and non-zero
1840  regions in a file.  While this option seems simple, it is less useful
1841  and can become inefficient and cumbersome for several reasons:

1843  o  Data Region Maps can be large, and transferring them can reduce
1844     overall read performance.  For example, VMware's .vmdk files can
1845     have a file size of over 100 GBs and have a map well over several
1846     MBs.

1848  o  Data Region Maps can change frequently, and become invalidated on
1849     every write to the file.  NFSv4 has a single change attribute,
1850     which means any change to any region of a file will invalidate all
1851     Data Region Maps.  This can result in the map being transferred
1852     multiple times with each update to the file.  For example, a VM
1853     that updates a config file in its file system image would
1854     invalidate the Data Region Map not only for itself, but for all
1855     other clients accessing the same file system image.

1857  o  Data Region Maps do not handle all zero-filled sections of the
1858     file, reducing the effectiveness of the solution.  While it may be
1859     possible to modify the maps to handle zero-filled sections (at
1860     possibly great effort to the server), it is almost impossible with
1861     pNFS.  With pNFS, the owner of the Data Region Map is the metadata
1862     server, which is not in the data path and has no knowledge of the
1863     contents of a data region.

1865  Another way to handle holes is compression, but this is not ideal
1866  since it requires all implementations to agree on a single
1867  compression algorithm and requires a fair amount of computational
      overhead.

1869  Note that supporting writing to a sparse file does not require
1870  changes to the protocol.  Applications and/or NFS implementations can
1871  choose to ignore WRITE requests of all zeroes to the NFS server
1872  without consequence.

1874  7.5.  Operation 65: READ_PLUS

1876  This section introduces a new read operation, named READ_PLUS, which
1877  allows NFS clients to avoid reading holes in a sparse file.
1878  READ_PLUS is guaranteed to perform no worse than READ, and can
1879  dramatically improve performance with sparse files.

1881  READ_PLUS supports all the features of the existing NFSv4.1 READ
1882  operation [2] and adds a simple yet significant extension to the
1883  format of its response.  The change allows the server to avoid
1884  returning all zeroes from a file hole, which wastes computational and
1885  network resources and reduces performance.  READ_PLUS uses a new
1886  result structure that tells the client that the result is all zeroes
1887  AND the byte-range of the hole in which the request was made.
1888  Returning the hole's byte-range, and only upon request, avoids
1889  transferring large Data Region Maps that may be soon invalidated and
1890  contain information about a file that may not even be read in its
1891  entirety.

1893  A new read operation is required due to NFSv4.1 minor versioning
1894  rules that do not allow modification of an existing operation's
1895  arguments or results.  READ_PLUS is designed in such a way as to
1896  allow future extensions to the result structure.
The same approach could
1897  be taken to extend the argument structure, but a good use case is
1898  first required to make such a change.

1900  7.5.1.  ARGUMENT

1902  struct READ_PLUS4args {
1903      /* CURRENT_FH: file */
1904      stateid4        rpa_stateid;
1905      offset4         rpa_offset;
1906      count4          rpa_count;
1907  };

1909  7.5.2.  RESULT

1911  union read_plus_content switch (data_content4 content) {
1912      case NFS4_CONTENT_DATA:
1913          opaque                  rpc_data<>;
1914      case NFS4_CONTENT_APP_BLOCK:
1915          app_data_block4         rpc_block;
1916      case NFS4_CONTENT_HOLE:
1917          hole_info4              rpc_hole;
1918      default:
1919          void;
1920  };

1922  /*
1923   * Allow a return of an array of contents.
1924   */
1925  struct read_plus_res4 {
1926      bool                    rpr_eof;
1927      read_plus_content       rpr_contents<>;
1928  };

1930  union READ_PLUS4res switch (nfsstat4 status) {
1931      case NFS4_OK:
1932          read_plus_res4      resok4;
1933      default:
1934          void;
1935  };

1937  7.5.3.  DESCRIPTION

1939  The READ_PLUS operation is based upon the NFSv4.1 READ operation [2],
1940  and similarly reads data from the regular file identified by the
1941  current filehandle.

1943  The client provides an offset of where the READ_PLUS is to start and
1944  a count of how many bytes are to be read.  An offset of zero means to
1945  read data starting at the beginning of the file.  If offset is
1946  greater than or equal to the size of the file, the status NFS4_OK is
1947  returned with nfs_readplusrestype4 set to READ_OK, data length set to
1948  zero, and eof set to TRUE.  The READ_PLUS is subject to access
1949  permissions checking.

1951  If the client specifies a count value of zero, the READ_PLUS succeeds
1952  and returns zero bytes of data, again subject to access permissions
1953  checking.  In all situations, the server may choose to return fewer
1954  bytes than specified by the client.  The client needs to check for
1955  this condition and handle the condition appropriately.

1957  If the client specifies an offset and count value that is entirely
1958  contained within a hole of the file, the status NFS4_OK is returned
1959  with nfs_readplusresok4 set to READ_HOLE, and if information is
1960  available regarding the hole, an nfs_readplusreshole structure
1961  containing the offset and range of the entire hole.  The
1962  nfs_readplusreshole structure is considered valid until the file is
1963  changed (detected via the change attribute).  The server MUST provide
1964  the same semantics for nfs_readplusreshole as if the client read the
1965  region and received zeroes; the implied hole's contents lifetime MUST
1966  be exactly the same as any other read data.

1968  If the client specifies an offset and count value that begins in a
1969  non-hole of the file but extends into a hole, the server should
1970  return a short read with status NFS4_OK, nfs_readplusresok4 set to
1971  READ_OK, and data length set to the number of bytes returned.  The
1972  client will then issue another READ_PLUS for the remaining bytes, to
1973  which the server will respond with information about the hole in the
      file.

1975  If the server knows that the requested byte range is into a hole of
1976  the file, but has no further information regarding the hole, it
1977  returns an nfs_readplusreshole structure with holeres4 set to
1978  HOLE_NOINFO.

1980  If hole information is available and can be returned to the client,
1981  the server returns an nfs_readplusreshole structure with the value of
1982  holeres4 set to HOLE_INFO.  The values of hole_offset and hole_length
1983  define the byte-range for the current hole in the file.
These values
1984  represent the information known to the server and may describe a
1985  byte-range smaller than the true size of the hole.

1987  Except when special stateids are used, the stateid value for a
1988  READ_PLUS request represents a value returned from a previous byte-
1989  range lock or share reservation request or the stateid associated
1990  with a delegation.  The stateid identifies the associated owners, if
1991  any, and is used by the server to verify that the associated locks
1992  are still valid (e.g., have not been revoked).

1994  If the read ended at the end-of-file (formally, in a correctly formed
1995  READ_PLUS operation, if offset + count is equal to the size of the
1996  file), or the READ_PLUS operation extends beyond the size of the file
1997  (if offset + count is greater than the size of the file), eof is
1998  returned as TRUE; otherwise, it is FALSE.  A successful READ_PLUS of
1999  an empty file will always return eof as TRUE.

2001  If the current filehandle is not an ordinary file, an error will be
2002  returned to the client.  In the case that the current filehandle
2003  represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
2004  the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
2005  returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

2007  For a READ_PLUS with a stateid value of all bits equal to zero, the
2008  server MAY allow the READ_PLUS to be serviced subject to mandatory
2009  byte-range locks or the current share deny modes for the file.  For a
2010  READ_PLUS with a stateid value of all bits equal to one, the server
2011  MAY allow READ_PLUS operations to bypass locking checks at the
2012  server.

2014  On success, the current filehandle retains its value.

2016  7.5.4.  IMPLEMENTATION

2018  If the server returns a "short read" (i.e., fewer bytes than
2019  requested and eof is set to FALSE), the client should send another
2020  READ_PLUS to get the remaining data.  A server may return less data
2021  than requested under several circumstances.  The file may have been
2022  truncated by another client or perhaps on the server itself, changing
2023  the file size from what the requesting client believes to be the
2024  case.  This would reduce the actual amount of data available to the
2025  client.  It is possible that the server may reduce the transfer size
2026  and so return a short read result.  Server resource exhaustion may
2027  also result in a short read.

2029  If mandatory byte-range locking is in effect for the file, and if the
2030  byte-range corresponding to the data to be read from the file is
2031  WRITE_LT locked by an owner not associated with the stateid, the
2032  server will return the NFS4ERR_LOCKED error.  The client should try
2033  to get the appropriate READ_LT via the LOCK operation before re-
2034  attempting the READ_PLUS.  When the READ_PLUS completes, the client
2035  should release the byte-range lock via LOCKU.  In addition, the
2036  server MUST return an nfs_readplusreshole structure with values of
2037  hole_offset and hole_length that are within the owner's locked byte
2038  range.

2040  If another client has an OPEN_DELEGATE_WRITE delegation for the file
2041  being read, the delegation must be recalled, and the operation cannot
2042  proceed until that delegation is returned or revoked.  Except where
2043  this happens very quickly, one or more NFS4ERR_DELAY errors will be
2044  returned to requests made while the delegation remains outstanding.
2045  Normally, delegations will not be recalled as a result of a READ_PLUS
2046  operation since the recall will occur as a result of an earlier OPEN.
2047  However, since it is possible for a READ_PLUS to be done with a
2048  special stateid, the server needs to check for this case even though
2049  the client should have done an OPEN previously.
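A non-normative client-side sketch of the short-read and hole
handling described above; nfs_read_plus() and struct rp_result are
illustrative stand-ins for an implementation's RPC and decoding
layers, and the HOLE_NOINFO case is omitted for brevity:

   #include <stdint.h>
   #include <string.h>

   enum rp_type { READ_OK, READ_HOLE };

   struct rp_result {
       enum rp_type type;
       int          eof;
       uint64_t     hole_offset, hole_length; /* for READ_HOLE */
       uint32_t     data_len;                 /* for READ_OK   */
       uint8_t      data[65536];
   };

   extern void nfs_read_plus(uint64_t off, uint32_t count,
                             struct rp_result *out);

   /* Read `count` bytes at `off`, zero-filling reported holes.
    * Assumes a reported hole covers the requested offset. */
   static uint64_t read_range(uint8_t *buf, uint64_t off,
                              uint64_t count)
   {
       uint64_t done = 0;
       while (done < count) {
           struct rp_result r;
           nfs_read_plus(off + done, (uint32_t)(count - done), &r);
           if (r.type == READ_HOLE) {
               /* Zero-fill the part of the hole we asked for. */
               uint64_t end = r.hole_offset + r.hole_length;
               uint64_t n = end - (off + done);
               if (n > count - done)
                   n = count - done;
               memset(buf + done, 0, (size_t)n);
               done += n;
           } else {
               memcpy(buf + done, r.data, r.data_len);
               done += r.data_len;  /* short reads simply loop */
           }
           if (r.eof)
               break;
       }
       return done;
   }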
2051  7.5.4.1.  Additional pNFS Implementation Information

2053  With pNFS, the semantics of using READ_PLUS remain the same.  Any
2054  data server MAY return a READ_HOLE result for a READ_PLUS request
2055  that it receives.

2057  When a data server chooses to return a READ_HOLE result, it has the
2058  option of returning hole information for the data stored on that data
2059  server (as defined by the data layout), but it MUST NOT return an
2060  nfs_readplusreshole structure with a byte range that includes data
2061  managed by another data server.

2063  1.  Data servers that cannot determine hole information SHOULD return
2064      HOLE_NOINFO.

2066  2.  Data servers that can obtain hole information for the parts of
2067      the file stored on that data server SHOULD return HOLE_INFO and
2068      the byte range of the hole stored on that data server.

2070  A data server should do its best to return as much information about
2071  a hole as is feasible without having to contact the metadata server.
2072  If communication with the metadata server is required, then every
2073  attempt should be made to minimize the number of requests.

2075  If mandatory locking is enforced, then the data server must also
2076  ensure that it returns only information for a Hole that is within the
2077  owner's locked byte range.

2080  7.5.5.  READ_PLUS with Sparse Files Example

2082  To see how the return value READ_HOLE will work, the following table
2083  describes a sparse file.  For each byte range, the file contains
2084  either non-zero data or a hole.  In addition, the server in this
2085  example uses a hole threshold of 32K.

2087               +-------------+----------+
2088               | Byte-Range  | Contents |
2089               +-------------+----------+
2090               | 0-15999     | Hole     |
2091               | 16K-31999   | Non-Zero |
2092               | 32K-255999  | Hole     |
2093               | 256K-287999 | Non-Zero |
2094               | 288K-353999 | Hole     |
2095               | 354K-417999 | Non-Zero |
2096               +-------------+----------+

2098               Table 1

2100  Under the given circumstances, if a client was to read the file from
2101  beginning to end with a max read size of 64K, the following will be
2102  the result.  This assumes the client has already opened the file and
2103  acquired a valid stateid and just needs to issue READ_PLUS requests.

2105  1.  READ_PLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof
2106      = false, data<>[32K].  Return a short read, as the last half of
2107      the request was all zeroes.  Note that the first hole is read
2108      back as all zeros as it is below the hole threshold.

2110  2.  READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2111      nfs_readplusreshole(HOLE_INFO)(32K, 224K).  The requested range
2112      was all zeros, and the current hole begins at offset 32K and is
2113      224K in length.

2115  3.  READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2116      eof = false, data<>[32K].  Return a short read, as the last half
2117      of the request was all zeroes.

2119  4.  READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
2120      nfs_readplusreshole(HOLE_INFO)(288K, 66K).

2122  5.  READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
2123      eof = true, data<>[64K].

2125  7.6.  Related Work
2127  Solaris and ZFS support an extension to lseek(2) that allows
2128  applications to discover holes in a file.  The values, SEEK_HOLE and
2129  SEEK_DATA, allow clients to seek to the next hole or beginning of
2130  data, respectively.

2132  XFS supports the XFS_IOC_GETBMAP ioctl, which returns
2133  the Data Region Map for a file.  Clients can then use this
2134  information to avoid reading holes in a file.

2136  NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows
2137  applications to control whether empty regions of the file are
2138  preallocated and filled in with zeros or simply left unallocated.

2140  7.7.  Other Proposed Designs

2142  7.7.1.  Multi-Data Server Hole Information

2144  The current design prohibits pNFS data servers from returning hole
2145  information for regions of a file that are not stored on that data
2146  server.  Having data servers return information regarding other data
2147  servers changes the fundamental principle that all metadata
2148  information comes from the metadata server.

2150  Here is a brief description of what would be required if we did
2151  choose to support multi-data-server hole information:

2153  For a data server that can obtain hole information for the entire
2154  file without severe performance impact, it MAY return HOLE_INFO and
2155  the byte range of the entire file hole.  When a pNFS client receives
2156  a READ_HOLE result and a non-empty nfs_readplusreshole structure, it
2157  MAY use this information in conjunction with a valid layout for the
2158  file to determine the next data server for the next region of data
2159  that is not in a hole.

2161  7.7.2.  Data Result Array

2163  If a single read request contains one or more Holes with a length
2164  greater than the Sparse Threshold, the current design would return
2165  results indicating a short read to the client.  A client would then
2166  send a series of read requests to the server to retrieve information
2167  for the Holes and the remaining data.  To avoid turning a single read
2168  request into several exchanges between the client and server, the
2169  server may need to choose a relatively large Sparse Threshold in
2170  order to decrease the number of short reads it creates.  A large
2171  Sparse Threshold may miss many smaller holes, which in turn may
2172  negate the benefits of sparse read support.

2174  To avoid this situation, one option is to have the READ_PLUS
2175  operation return information for multiple holes in a single return
2176  value.  This would allow several small holes to be described in a
2177  single read response without requiring multiple exchanges between
2178  the client and server.

2180  One important item to consider with returning an array of data chunks
2181  is its impact on RDMA, which may use different block sizes on the
2182  client and server (among other things).

2184  7.7.3.  User-Defined Sparse Mask

2186  Add mask (instead of just zeroes).  Specified by server or client?

2188  7.7.4.  Allocated flag

2190  A Hole on the server may be an allocated byte-range consisting of all
2191  zeroes or may not be allocated at all.  To ensure this information is
2192  properly communicated to the client, it may be beneficial to add an
2193  'alloc' flag to the HOLE_INFO section of nfs_readplusreshole.  This
2194  would allow an NFS client to copy a file from one file system to
2195  another and have it more closely resemble the original.

2197  7.7.5.  Dense and Sparse pNFS File Layouts
2199  The hole information returned from a data server must be understood
2200  by pNFS clients using either Dense or Sparse file layout types.  Does
2201  the current READ_PLUS return value work for both layout types?  Does
2202  the data server know if it is using dense or sparse so that it can
2203  return the correct hole_offset and hole_length values?

2205  8.  Labeled NFS

2207  WARNING: Need to pull out the requirements.

2209  8.1.  Introduction

2211  Mandatory Access Control (MAC) systems have been mainstreamed in
2212  modern operating systems such as Linux (R), FreeBSD (R), Solaris
2213  (TM), and Windows Vista (R).  MAC systems bind security attributes to
2214  subjects (processes) and objects within a system.  These attributes
2215  are used with other information in the system to make access control
2216  decisions.

2218  Access control models such as Unix permissions or Access Control
2219  Lists are commonly referred to as Discretionary Access Control (DAC)
2220  models.  These systems base their access decisions on user identity
2221  and resource ownership.  In contrast, MAC models base their access
2222  control decisions on the label on the subject (usually a process) and
2223  the object it wishes to access.  These labels may contain user
2224  identity information but usually contain additional information.  In
2225  DAC systems, users are free to specify the access rules for resources
2226  that they own.  MAC models base their security decisions on a system-
2227  wide policy, established by an administrator or organization, which
2228  the users do not have the ability to override.  DAC systems offer no
2229  real protection against malicious or flawed software due to each
2230  program running with the full permissions of the user executing it.
2231  Conversely, MAC models can confine malicious or flawed software and
2232  usually act at a finer granularity than their DAC counterparts.

2234  People desire to use NFSv4 with these systems.  A mechanism is
2235  required to provide security attribute information to NFSv4 clients
2236  and servers.  This mechanism has the following requirements:

2238  (1)  Clients must be able to convey to the server the security
2239       attribute of the subject making the access request.  The server
2240       may provide a mechanism to enforce MAC policy based on the
2241       requesting subject's security attribute.

2243  (2)  Servers must be able to store and retrieve the security
2244       attribute of exported files as requested by the client.

2246  (3)  Servers must provide a mechanism for notifying clients of
2247       attribute changes of files on the server.

2249  (4)  Clients and Servers must be able to negotiate Label Formats and
2250       Domains of Interpretation (DOI) and provide a mechanism to
2251       translate between them as needed.

2253  These four requirements are key to the system, with only requirements
2254  (2) and (3) requiring changes to NFSv4.  The ability to convey the
2255  security attribute of the subject as described in requirement (1)
2256  falls upon the RPC layer to implement (see [6]).  Requirement (4)
2257  allows communication between different MAC implementations.  The
2258  management of label formats, DOIs, and the translation between them
2259  does not require any support from NFSv4 on a protocol level and is
2260  out of the scope of this document.

2262  The first change necessary is to devise a method for transporting and
2263  storing security label data on NFSv4 file objects.
Security labels
2264  have several semantic requirements that are met by NFSv4 recommended
2265  attributes, such as the ability to set the label value upon object
2266  creation.  Access control on these attributes is done through a
      combination of
2267  two mechanisms.  As with other recommended attributes on file
2268  objects, the usual DAC checks (ACLs and permission bits) will be
2269  performed to ensure that proper file ownership is enforced.  In
2270  addition, a MAC system MAY be employed on the client, server, or both
2271  to enforce additional policy on what subjects may modify security
2272  label information.

2274  The second change is to provide a method for the server to notify the
2275  client that the attribute changed on an open file on the server.  If
2276  the file is closed, then during the open attempt, the client will
2277  gather the new attribute value.  The server MUST NOT communicate the
2278  new value of the attribute; the client MUST query it.  This
2279  requirement stems from the need for the client to provide sufficient
2280  access rights to the attribute.

2282  The final change necessary is a modification to the RPC layer used in
2283  NFSv4 in the form of a new version of the RPCSEC_GSS [7] framework.
2284  In order for an NFSv4 server to apply MAC checks it must obtain
2285  additional information from the client.  Several methods were
2286  explored for performing this and it was decided that the best
2287  approach was to incorporate the ability to make security attribute
2288  assertions through the RPC mechanism.  RPCSEC_GSSv3 [6] outlines a
2289  method to assert additional security information such as security
2290  labels on GSS context creation and have that data bound to all RPC
2291  requests that make use of that context.

2293  8.2.  Definitions

2295  Label Format Specifier (LFS):  is an identifier used by the client to
2296     establish the syntactic format of the security label and the
2297     semantic meaning of its components.  These specifiers exist in a
2298     registry associated with documents describing the format and
2299     semantics of the label.

2301  Label Format Registry:  is the IANA registry containing all
2302     registered LFS along with references to the documents that
2303     describe the syntactic format and semantics of the security label.

2305  Policy Identifier (PI):  is an optional part of the definition of a
2306     Label Format Specifier which allows for clients and server to
2307     identify specific security policies.

2309  Domain of Interpretation (DOI):  represents an administrative
2310     security boundary, where all systems within the DOI have
2311     semantically coherent labeling.  That is, a security attribute
2312     must always mean exactly the same thing anywhere within the DOI.

2314  Object:  is a passive resource within the system that we wish to be
2315     protected.  Objects can be entities such as files, directories,
2316     pipes, sockets, and many other system resources relevant to the
2317     protection of the system state.

2319  Subject:  A subject is an active entity, usually a process, which is
2320     requesting access to an object.

2322  Multi-Level Security (MLS):  is a traditional model where objects are
2323     given a sensitivity level (Unclassified, Secret, Top Secret, etc.)
2324     and a category set [21].

2326  8.3.  MAC Security Attribute

2328  MAC models base access decisions on security attributes bound to
2329  subjects and objects.  This information can range from a user
2330  identity for an identity-based MAC model, to sensitivity levels for
2331  Multi-Level Security, to a type for Type Enforcement.
These models 2332 base their decisions on different criteria, but the semantics of the 2333 security attribute remain the same. The semantics required by the 2334 security attributes are listed below: 2336 o Must provide flexibility with respect to the MAC model. 2338 o Must provide the ability to atomically set security information 2339 upon object creation. 2341 o Must provide the ability to enforce access control decisions both 2342 on the client and the server. 2344 o Must not expose an object to either the client or server name 2345 space before its security information has been bound to it. 2347 NFSv4 provides several options for implementing the security 2348 attribute. The first option is to implement the security attribute 2349 as a named attribute. Named attributes provide flexibility since 2350 they are treated as an opaque field but lack a way to atomically set 2351 the attribute on creation. In addition, named attributes themselves 2352 are file system objects which need to be assigned a security 2353 attribute. This raises the question of how to assign security 2354 attributes to the files and directories used to hold the security 2355 attribute for the file in question. The inability to atomically 2356 assign the security attribute on file creation and the necessity to 2357 assign security attributes to its sub-components make named 2358 attributes unacceptable as a method for storing security attributes. 2360 The second option is to implement the security attribute as a 2361 recommended attribute. These attributes have a fixed format and 2362 semantics, which conflicts with the flexible nature of the security 2363 attribute. To resolve this, the security attribute consists of two 2364 components. The first component is an LFS, as defined in [22], to allow 2365 for interoperability between MAC mechanisms. The second component is 2366 an opaque field which is the actual security attribute data. To 2367 allow for various MAC models, NFSv4 should be used solely as a 2368 transport mechanism for the security attribute. It is the 2369 responsibility of the endpoints to consume the security attribute and 2370 make access decisions based on their respective models. In addition, 2371 creation of objects through OPEN and CREATE allows for the security 2372 attribute to be specified upon creation. By providing an atomic 2373 create and set operation for the security attribute, it is possible to 2374 enforce the second and fourth requirements. The recommended 2375 attribute FATTR4_SEC_LABEL will be used to satisfy this requirement. 2377 8.3.1. Interpreting FATTR4_SEC_LABEL 2379 The XDR [11] necessary to implement Labeled NFSv4 is presented in 2380 Figure 6: 2382 const FATTR4_SEC_LABEL = 81; 2384 typedef uint32_t policy4; 2385 struct labelformat_spec4 { 2386 policy4 lfs_lfs; 2387 policy4 lfs_pi; 2388 }; 2390 struct sec_label_attr_info { 2391 labelformat_spec4 slai_lfs; 2392 opaque slai_data<>; 2393 }; 2395 Figure 6 2397 The FATTR4_SEC_LABEL attribute consists of two components, the 2398 first component being an LFS. It serves to provide the receiving end 2399 with the information necessary to translate the security attribute 2400 into a form that is usable by the endpoint. Label Formats assigned 2401 an LFS may optionally choose to include a Policy Identifier field to 2402 allow for complex policy deployments. The LFS and Label Format 2403 Registry are described in detail in [22].
The translation used to 2404 interpret the security attribute is not specified as part of the 2405 protocol, as it may depend on various factors. The second component 2406 is an opaque section which contains the data of the attribute. This 2407 component is dependent on the MAC model to interpret and enforce. 2409 In particular, it is the responsibility of the LFS specification to 2410 define a maximum size for the opaque section, slai_data<>. When 2411 creating or modifying a label for an object, the client needs to be 2412 guaranteed that the server will accept a label that is sized 2413 correctly. By both client and server being part of a specific MAC 2414 model, the client will be aware of the size. 2416 8.3.2. Delegations 2418 In the event that a security attribute is changed on the server while 2419 a client holds a delegation on the file, the client should follow the 2420 existing protocol with respect to attribute changes. It should flush 2421 all changes back to the server and relinquish the delegation. 2423 8.3.3. Permission Checking 2425 It is not feasible to enumerate all possible MAC models and even 2426 levels of protection within a subset of these models. This means 2427 that NFSv4 clients and servers cannot be expected to directly make 2428 access control decisions based on the security attribute. Instead, 2429 NFSv4 should defer permission checking on this attribute to the host 2430 system. These checks are performed in addition to the existing DAC and 2431 ACL checks outlined in the NFSv4 protocol. Section 8.7 gives a 2432 specific example of how the security attribute is handled under a 2433 particular MAC model. 2435 8.3.4. Object Creation 2437 When creating files in NFSv4, the OPEN and CREATE operations are used. 2438 One of the parameters to these operations is an fattr4 structure 2439 containing the attributes the file is to be created with. This 2440 allows NFSv4 to atomically set the security attribute of files upon 2441 creation. When a client is MAC aware, it must always provide the 2442 initial security attribute upon file creation. In the event that the 2443 server is the only MAC aware entity in the system, it should ignore 2444 the security attribute specified by the client and instead make the 2445 determination itself. A more in-depth explanation can be found in 2446 Section 8.7. 2448 8.3.5. Existing Objects 2450 Note that under the MAC model, all objects must have labels. 2451 Therefore, if an existing server is upgraded to include LNFS support, 2452 then it is the responsibility of the security system to define the 2453 behavior for existing objects. For example, if the security system 2454 is LFS 0, which means the server just stores and returns labels, then 2455 existing files should return labels which are set to an empty value. 2457 8.3.6. Label Changes 2459 As per the requirements, when a file's security label is modified, 2460 the server must notify all clients which have the file opened of the 2461 change in label. It does so with CB_ATTR_CHANGED. There are 2462 preconditions to making an attribute change imposed by NFSv4, and the 2463 security system might want to impose others. In the process of 2464 meeting these preconditions, the server may choose either to service the 2465 request in whole or to return NFS4ERR_DELAY to the SETATTR operation. 2467 If there are open delegations on the file belonging to clients other 2468 than the one making the label change, then the process described in 2469 Section 8.3.2 must be followed.
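To make that sequencing concrete, the following sketch (in Python, which this specification does not otherwise use) shows one way a server might process a SETATTR of FATTR4_SEC_LABEL. Every name here (preconditions_met, delegation_holders, open_holders, cb_attr_changed) is a hypothetical server-internal helper, not protocol:

   def handle_setattr_sec_label(srv, requester, fh, new_label):
       # NFSv4 imposes preconditions on attribute changes, and the
       # security system may impose its own; if they cannot be met
       # now, the SETATTR is answered with NFS4ERR_DELAY.
       if not srv.preconditions_met(fh, new_label):
           return "NFS4ERR_DELAY"
       # Open delegations held by other clients are dealt with first,
       # per Section 8.3.2 (changes flushed, delegation relinquished).
       for client in srv.delegation_holders(fh):
           if client != requester:
               srv.recall_delegation(client, fh)
       srv.labels[fh] = new_label
       # Notify every client holding the file open that a critical
       # attribute changed; each client then queries the new label
       # itself, since the server never pushes the value.
       for client in srv.open_holders(fh):
           srv.cb_attr_changed(client, fh,
                               critical=["FATTR4_SEC_LABEL"])
       return "NFS4_OK"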
2471 As the server is always presented with the subject label from the 2472 client, it does not necessarily need to communicate the fact that the 2473 label has changed to the client. In the cases where the change 2474 outright denies the client access, the client will be able to quickly 2475 determine that there is a new label in effect. It is in cases where 2476 the client shares the same object among multiple subjects, or where the 2477 security system is not strictly hierarchical, that the 2478 CB_ATTR_CHANGED callback is most useful. It allows the server to 2479 inform the clients that the cached security attribute is now stale. 2481 In the scenario presented in Section 8.8.5, the clients are smart and 2482 the server has a very simple security system which just stores the 2483 labels. In this system, the MAC label check always allows access, 2484 regardless of the subject label. 2486 MAC labels are enforced by the smart client. So 2487 if client A changes a security label on a file, then the server MUST 2488 inform all clients that have the file opened that the label has 2489 changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new 2490 label and MUST enforce access via the new attribute values. 2492 [[Comment.3: Describe a LFS of 0, which will be the means to indicate 2493 such a deployment. In the current LFR, 0 is marked as reserved. If 2494 we use it, then we define the default LFS to be used by a LNFS aware 2495 server. I.e., it lets smart clients work together in the face of a 2496 dumb server. Note that while supporting this system is optional, it 2497 will make for a very good debugging mode during development. I.e., 2498 even if a server does not deploy with another security system, this 2499 mode gets your foot in the door. --TH]] 2501 8.4. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's 2502 Attributes Changed 2504 8.4.1. ARGUMENTS 2506 struct CB_ATTR_CHANGED4args { 2507 nfs_fh4 acca_fh; 2508 bitmap4 acca_critical; 2509 bitmap4 acca_info; 2510 }; 2512 8.4.2. RESULTS 2514 struct CB_ATTR_CHANGED4res { 2515 nfsstat4 accr_status; 2516 }; 2518 8.4.3. DESCRIPTION 2520 The CB_ATTR_CHANGED callback operation is used by the server to 2521 indicate to the client that the file's attributes have been modified 2522 on the server. The server does not convey how the attributes have 2523 changed, just that they have been modified. The server can inform 2524 the client about both critical and informational attribute changes in 2525 the bitmask arguments. The client SHOULD query the server about all 2526 attributes set in acca_critical. For all changes reflected in 2527 acca_info, the client can decide whether or not it wants to poll the 2528 server. 2530 The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set 2531 in acca_critical is the method used by the server to indicate that 2532 the MAC label for the file referenced by acca_fh has changed. In 2533 many ways, the server does not care about the result returned by the 2534 client. 2536 8.5. pNFS Considerations 2538 This section examines the issues in deploying LNFS in a pNFS 2539 community of servers. 2541 8.5.1. MAC Label Checks 2543 The new FATTR4_SEC_LABEL attribute is metadata information and, as 2544 such, the DS is not aware of the value contained on the MDS. 2545 Fortunately, the NFSv4.1 protocol [2] already has provisions for 2546 doing access level checks from the DS to the MDS.
In order for the 2547 DS to validate the subject label presented by the client, it SHOULD 2548 utilize this mechanism. 2550 If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize 2551 CB_ATTR_CHANGED to inform the client of that fact. If the MDS is 2552 maintaining 2554 8.6. Discovery of Server LNFS Support 2556 The server can easily determine that a client supports LNFS when it 2557 queries for the FATTR4_SEC_LABEL attribute of an object. Note that it 2558 cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS 2559 support. The client might need to discover which LFS the server 2560 supports. 2562 A server which supports LNFS MUST allow a client with any subject 2563 label to retrieve the FATTR4_SEC_LABEL attribute for the root 2564 filehandle, ROOTFH. The following compound must always succeed as 2565 far as a MAC label check is concerned: 2567 PUTROOTFH, GETATTR {FATTR4_SEC_LABEL} 2569 Note that the server might have imposed a security flavor on the root 2570 that precludes such access. I.e., if the server requires Kerberized 2571 access and the client presents a compound with AUTH_SYS, then the 2572 server is allowed to return NFS4ERR_WRONGSEC in this case. But if 2573 the client presents a correct security flavor, then the server MUST 2574 return the FATTR4_SEC_LABEL attribute with the supported LFS filled 2575 in. 2577 8.7. MAC Security NFS Modes of Operation 2579 A system using Labeled NFS may operate in three modes. The first 2580 mode provides the most protection and is called "full mode". In this 2581 mode, both the client and server implement a MAC model, allowing each 2582 end to make an access control decision. The remaining two modes are 2583 variations on each other and are called "smart client" and "smart 2584 server" modes. In these modes, one end of the connection is not 2585 implementing a MAC model and, because of this, these operating modes 2586 offer less protection than full mode. 2588 8.7.1. Full Mode 2590 Full mode environments consist of MAC aware NFSv4 servers and clients 2591 and may be composed of mixed MAC models and policies. The system 2592 requires that both the client and server have an opportunity to 2593 perform an access control check based on all relevant information 2594 within the network. The file object security attribute is provided 2595 using the mechanism described in Section 8.3. The security attribute 2596 of the subject making the request is transported at the RPC layer 2597 using the mechanism described in RPCSEC_GSSv3 [6]. 2599 8.7.1.1. Initial Labeling and Translation 2601 The ability to create a file is an action that a MAC model may wish 2602 to mediate. The client is given the responsibility to determine the 2603 initial security attribute to be placed on a file. This allows the 2604 client to make a decision as to the acceptable security attributes to 2605 create a file with before sending the request to the server. Once 2606 the server receives the creation request from the client, it may 2607 choose to evaluate whether the security attribute is acceptable. 2609 Security attributes on the client and server may vary based on MAC 2610 model and policy. To handle this, the security attribute field has an 2611 LFS component. This component is a mechanism for the host to 2612 identify the format and meaning of the opaque portion of the security 2613 attribute. A full mode environment may contain hosts operating in 2614 several different LFSs and DOIs.
In this case, a mechanism for 2615 translating the opaque portion of the security attribute is needed. 2616 The actual translation function will vary based on MAC model and 2617 policy and is out of the scope of this document. If a translation is 2618 unavailable for a given LFS and DOI, then the request SHOULD be 2619 denied. Another recourse is to allow the host to provide a fallback 2620 mapping for unknown security attributes. 2622 8.7.1.2. Policy Enforcement 2624 In full mode, access control decisions are made by both the clients 2625 and servers. When a client makes a request, it takes the security 2626 attribute from the requesting process and makes an access control 2627 decision based on that attribute and the security attribute of the 2628 object it is trying to access. If the client denies that access, an 2629 RPC call to the server is never made. If, however, the access is 2630 allowed, the client will make a call to the NFS server. 2632 When the server receives the request from the client, it extracts the 2633 security attribute conveyed in the RPC request. The server then uses 2634 this security attribute and the attribute of the object the client is 2635 trying to access to make an access control decision. If the server's 2636 policy allows this access, it will fulfill the client's request; 2637 otherwise, it will return NFS4ERR_ACCESS. 2639 Implementations MAY validate security attributes supplied over the 2640 network to ensure that they are within a set of attributes permitted 2641 from a specific peer, and if not, reject them. Note that a system 2642 may permit a different set of attributes to be accepted from each 2643 peer. An example of this can be seen in Section 8.8.7.1. 2645 8.7.2. Smart Client Mode 2647 Smart client environments consist of NFSv4 servers that are not MAC 2648 aware but NFSv4 clients that are. Clients in this environment 2649 may consist of groups implementing different MAC models and policies. 2650 The system requires that all clients in the environment be 2651 responsible for access control checks. Due to the amount of trust 2652 placed in the clients, this mode is only to be used in a trusted 2653 environment. 2655 8.7.2.1. Initial Labeling and Translation 2657 Just as in full mode, the client is responsible for determining the 2658 initial label upon object creation. The server in smart client mode 2659 does not implement a MAC model; however, it may provide the ability 2660 to restrict the creation and labeling of objects with certain labels 2661 based on different criteria, as described in Section 8.7.1.2. 2663 In a smart client environment, a group of clients operates in a single 2664 DOI. This removes the need for the clients to maintain a set of DOI 2665 translations. Servers should provide a method to allow different 2666 groups of clients to access the server at the same time. However, a 2667 server should not allow two groups of clients operating in different DOIs to 2668 access the same files. 2670 8.7.2.2. Policy Enforcement 2672 In smart client mode, access control decisions are made by the 2673 clients. When a client accesses an object, it obtains the security 2674 attribute of the object from the server and combines it with the 2675 security attribute of the process making the request to make an 2676 access control decision. This check is in addition to the DAC checks 2677 provided by NFSv4, so access may fail based on the DAC criteria even if 2678 the MAC policy grants access.
As the policy check is located on the 2679 client, an access control denial should take the form that is native 2680 to the platform. 2682 8.7.3. Smart Server Mode 2684 Smart server environments consist of NFSv4 servers that are MAC aware 2685 and one or more MAC unaware clients. The server is the only entity 2686 enforcing policy, and may selectively provide standard NFS services 2687 to clients based on their authentication credentials and/or 2688 associated network attributes (e.g., IP address, network interface). 2689 The level of trust and access extended to a client in this mode is 2690 configuration-specific. 2692 8.7.3.1. Initial Labeling and Translation 2694 In smart server mode, all labeling and access control decisions are 2695 performed by the NFSv4 server. In this environment, the NFSv4 clients 2696 are not MAC aware, so they cannot provide input into the access 2697 control decision. This requires the server to determine the initial 2698 labeling of objects. Normally, the subject to use in this calculation 2699 would originate from the client. Instead, the NFSv4 server may choose 2700 to assign the subject security attribute based on the client's 2701 authentication credentials and/or associated network attributes 2702 (e.g., IP address, network interface). 2704 In smart server mode, security attributes are contained solely within 2705 the NFSv4 server. This means that all security attributes used in 2706 the system remain within a single LFS and DOI. Since security 2707 attributes will not cross DOIs or change format, there is no need to 2708 provide any translation functionality above that which is needed 2709 internally by the MAC model. 2711 8.7.3.2. Policy Enforcement 2713 All access control decisions in smart server mode are made by the 2714 server. The server will assign the subject a security attribute 2715 based on some criteria (e.g., IP address, network interface). Using 2716 the newly calculated security attribute and the security attribute of 2717 the object being requested, the MAC model makes the access control 2718 check and returns NFS4ERR_ACCESS on a denial and NFS4_OK on success. 2719 This check is done transparently to the client, so if the MAC 2720 permission check fails, the client may be unaware of the reason for 2721 the permission failure. When operating in this mode, administrators 2722 attempting to debug permission failures should remember to check the 2723 MAC policy running on the server in addition to the DAC settings. 2725 8.8. Use Cases 2727 MAC labeling is meant to allow NFSv4 to be deployed in site- 2728 configurable security schemes. The LFS and opaque data scheme allows 2729 for flexibility to meet these different implementations. In this 2730 section, we provide some examples of how NFSv4 could be deployed to 2731 meet existing needs. This is not an exhaustive listing. 2733 8.8.1. Full MAC labeling support for remotely mounted filesystems 2735 In this case, we assume a local networked environment where the 2736 servers and clients are under common administrative control. All 2737 systems in this network have the same MAC implementation and 2738 semantically identical MAC security labels for objects (i.e., labels 2739 mean the same thing on different systems, even if the policies on 2740 each system may differ to some extent). Clients will be able to 2741 apply fine-grained MAC policy to objects accessed via NFS mounts, and 2742 thus improve the overall consistency of MAC policy application within 2743 this environment.
2745 An example of this case would be where user home directories are 2746 remotely mounted, and fine-grained MAC policy is implemented to 2747 protect, for example, private user data from being read by malicious 2748 web scripts running in the user's browser. With Labeled NFS, fine- 2749 grained MAC labeling of the user's files will allow the local MAC 2750 policy to be implemented and provide the desired protection. 2752 8.8.2. MAC labeling of virtual machine images stored on the network 2754 Virtualization is now a commonly implemented feature of modern 2755 operating systems, and there is a need to ensure that MAC security 2756 policy is able to protect virtualized resources. A common 2757 implementation scheme involves storing virtualized guest filesystems 2758 on a networked server, which are then mounted remotely by guests upon 2759 instantiation. In this case, there is a need to ensure that the 2760 local guest kernel is able to access fine-grained MAC labels on the 2761 remotely mounted filesystem so that its MAC security policy can be 2762 applied. 2764 8.8.3. International Traffic in Arms Regulations (ITAR) 2766 The International Traffic in Arms Regulations (ITAR) is put forth by 2767 the United States Department of State, Directorate of Defense 2768 Trade Controls. ITAR places strict requirements on the export and 2769 thus access of defense articles and defense services. Organizations 2770 that manage projects with articles and services deemed within the 2771 scope of ITAR must ensure the regulations are met. The regulations 2772 require an assurance that ITAR information is accessed on a need-to- 2773 know basis, thus requiring strict, centrally managed access controls 2774 on items labeled as ITAR. Additionally, organizations must be able 2775 to prove that the controls were adequately maintained and that 2776 foreign nationals were not permitted access to these defense articles 2777 or services. ITAR control applicability may be dynamic; information 2778 may become subject to ITAR after creation (e.g., when the defense 2779 implications of technology are recognized). 2781 8.8.4. Legal Hold/eDiscovery 2783 Increased cases of legal holds on electronic sources of information 2784 (ESI) have resulted in organizations taking a proactive approach to 2785 reduce the scope and thus the costs associated with these activities. 2786 ESI Data Maps are increasing in use and require support in operating 2787 systems to strictly manage access controls in the case of a legal 2788 hold. The sizeable quantity of information involved in a legal 2789 discovery request may preclude making a copy of the information to a 2790 separate system that manages the legal hold on the copies; this 2791 results in a need to enforce the legal hold on the original 2792 information. 2794 Organizations are taking steps to map out the sources of information 2795 that are most likely to be placed under a legal hold; these efforts 2796 result in ESI Data Maps. ESI Data Maps specify the Electronic Source 2797 of Information and the requirements for sensitivity and criticality. 2798 In the case of a legal hold, the ESI data map and labels can be used 2799 to ensure the legal hold is properly enforced on the predetermined 2800 set of information. An ESI data map narrows the scope of a legal 2801 hold to the predetermined ESI. The information must then be 2802 protected at a level of security at which the weight and 2803 admissibility of that evidence can be proved in a court of law.
2804 Current systems use application-level controls and do not adequately 2805 meet the requirements. Labels may be applied in advance, when an ESI 2806 data map exercise is conducted, with controls being applied at the 2807 time of a hold; alternatively, labels may be applied to data sets during an 2808 eDiscovery exercise to ensure the data protections are adequate 2809 during the legal hold period. 2811 Note that this use case requires multi-attribute labels, as both 2812 information sensitivity (e.g., to disclosure) and information 2813 criticality (e.g., to continued business operations) need to be 2814 captured. 2816 8.8.5. Simple security label storage 2818 In this case, a mixed and loosely administered network is assumed, 2819 where nodes may be running a variety of operating systems with 2820 different security mechanisms and security policies. It is desired 2821 that network file servers be simply capable of storing and retrieving 2822 MAC security labels for clients which use such labels. The Labeled 2823 NFS protocol would be implemented here solely to enable transport of 2824 MAC security labels across the network. It should be noted that in 2825 such an environment, overall security cannot be as strongly enforced 2826 as in the case of Section 8.8.1, and that this scheme is aimed at allowing MAC-capable 2827 clients to function with local MAC security policy enabled rather 2828 than perhaps disabling it entirely. 2830 8.8.6. Diskless Linux 2832 A number of popular operating system distributions depend on a 2833 mandatory access control (MAC) model to implement a kernel-enforced 2834 security policy. Typically, such models assign particular roles to 2835 individual processes, which limit or permit performing certain 2836 operations on a set of files, directories, sockets, or other objects. 2837 While enforcement of the policy is typically a matter for the 2838 diskless NFS client itself, the filesystem objects in such models 2839 will typically carry MAC labels that are used to define policy on 2840 access. These policies may, for instance, describe privilege 2841 transitions that cannot be replicated using standard NFS ACL-based 2842 models. 2844 For instance, on a SYSV-compatible system, if the 'init' process 2845 spawns a process that attempts to start the 'NetworkManager' 2846 executable, there may be a policy that sets up a role transition if 2847 the 'init' process and 'NetworkManager' file labels match a 2848 particular rule. Without this role transition, the process may find 2849 itself having insufficient privileges to perform its primary job of 2850 configuring network interfaces. 2852 In setups of this type, many of the policy targets (such as sockets 2853 or privileged system calls) are entirely local to the client. The 2854 use of RPCSEC_GSSv3 for enforcing compliance at the server level is 2855 therefore of limited value. The ability to permanently label files 2856 and have those labels read back by the client is, however, crucial to 2857 the ability to enforce that policy. 2859 8.8.7. Multi-Level Security 2861 In an MLS system, objects are generally assigned a sensitivity level 2862 and a set of compartments. The sensitivity levels within the system 2863 are given an order ranging from lowest to highest classification 2864 level. Read access to an object is allowed when the sensitivity 2865 level of the subject "dominates" the object it wants to access.
This 2866 means that the sensitivity level of the subject is greater than or equal to that 2867 of the object it wishes to access and that its set of compartments is 2868 a superset of the compartments on the object. 2870 The rest of this section will use only sensitivity levels. In general, 2871 the example is a client that wishes to list the contents of a 2872 directory. The system defines the sensitivity levels as Unclassified 2873 (U), Secret (S), and Top Secret (TS). The directory to be searched 2874 is labeled Top Secret, which means access to read the directory will 2875 only be granted if the subject making the request is also labeled Top 2876 Secret. 2878 8.8.7.1. Full Mode 2880 In the first part of this example, a process on the client is running 2881 at the Secret level. The process issues a readdir system call, which 2882 enters the kernel. Before translating the readdir system call into a 2883 request to the NFSv4 server, the host operating system will consult 2884 the MAC module to see if the operation is allowed. Since the process 2885 is operating at Secret and the directory to be accessed is labeled 2886 Top Secret, the MAC module will deny the request, and an error code is 2887 returned to user space. 2889 Consider a second case where, instead of running at Secret, the process 2890 is running at Top Secret. In this case, the sensitivity of the 2891 process is equal to or greater than that of the directory, so the MAC 2892 module will allow the request. Now the readdir is translated into 2893 the necessary NFSv4 call to the server. For the RPC request, the 2894 client uses the proper credential to assert to the server that 2895 the process is running at Top Secret. 2897 When the server receives the request, it extracts the security label 2898 from the RPC session and retrieves the label on the directory. The 2899 server then checks with its MAC module whether a Top Secret process is 2900 allowed to read the contents of the Top Secret directory. Since this 2901 is allowed by the policy, the server will return the appropriate 2902 information to the client. 2904 In this example, the policies on the client and server were the 2905 same. In the event that they were running different policies, a 2906 translation of the labels might be needed. In this case, it could be 2907 possible for a check to pass on the client and fail on the server. 2908 The server may consider additional information when making its policy 2909 decisions. For example, the server could determine that a certain 2910 subnet is only cleared for data up to Secret classification. If that 2911 constraint were in place for the example above, the client check would still 2912 succeed, but the server would fail the request, since the client is asserting a 2913 label that it is not able to use (Top Secret on a Secret network). 2915 8.8.7.2. Smart Client Mode 2917 In smart client mode, the example is identical to the first part of a 2918 full mode operation. A process on the client labeled Secret wishes 2919 to access a Top Secret directory. As in the full mode example, this 2920 is denied, since Secret does not dominate Top Secret. If the process 2921 were operating at Top Secret, it would pass the local access control 2922 check, and the NFSv4 operation would proceed as in a normal NFSv4 2923 environment. 2925 8.8.7.3. Smart Server Mode 2927 In smart server mode, the client behaves as if it were in a normal 2928 NFSv4 environment.
Since the process on the client does not provide 2929 a security attribute, the server must define a mechanism for labeling 2930 all requests from a client. Assume that the server is using the same 2931 criteria used in the full mode example. The server sees the request 2932 as coming from a subnet that is a Secret network. The server 2933 determines that all clients on that subnet will have their requests 2934 labeled with Secret. Since the directory on the server is labeled 2935 Top Secret, and Secret does not dominate Top Secret, the server would 2936 fail the request with NFS4ERR_ACCESS. 2938 8.9. Security Considerations 2940 This entire document deals with security issues. 2942 Depending on the level of protection the MAC system offers, there may 2943 be a requirement to tightly bind the security attribute to the data. 2945 When either the client is in Smart Client Mode or the server is in Smart 2946 Server Mode, it is important to realize that the other side is not 2947 enforcing MAC protections. Alternate methods might be in use to 2948 handle the lack of MAC support, and care should be taken to identify 2949 and mitigate threats from possible tampering outside of these 2950 methods. 2952 An example of this is that a smart server that modifies READDIR or 2953 LOOKUP results based on the client's subject label might want to 2954 always construct the same subject label for a client which does not 2955 present one. This will prevent a non-LNFS client from mixing entries 2956 in the directory cache. 2958 9. Security Considerations 2960 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 2962 The following tables summarize the operations of the NFSv4.2 protocol 2963 and the corresponding designation of REQUIRED, RECOMMENDED, and 2964 OPTIONAL to implement or MUST NOT implement. The designation of MUST 2965 NOT implement is reserved for those operations that were defined in 2966 either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. 2968 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 2969 for operations sent by the client is for the server implementation. 2970 The client is generally required to implement the operations needed 2971 for the operating environment it serves. For example, a 2972 read-only NFSv4.2 client would have no need to implement the WRITE 2973 operation and is not required to do so. 2975 The REQUIRED or OPTIONAL designation for callback operations sent by 2976 the server is for both the client and server. Generally, the client 2977 has the option of creating the backchannel and sending the operations 2978 on the fore channel that will be a catalyst for the server sending 2979 callback operations. A partial exception is CB_RECALL_SLOT; the only 2980 way the client can avoid supporting this operation is by not creating 2981 a backchannel. 2983 Since this is a summary of the operations and their designation, 2984 there are subtleties that are not presented here. Therefore, if 2985 there is a question about the requirements of implementation, the 2986 operation descriptions themselves must be consulted, along with other 2987 relevant explanatory text within either this specification or that of 2988 NFSv4.1 [2]. 2990 The abbreviations used in the second and third columns of the table 2991 are defined as follows.
2993 REQ REQUIRED to implement 2995 REC RECOMMENDED to implement 2997 OPT OPTIONAL to implement 2998 MNI MUST NOT implement 3000 For the NFSv4.2 features that are OPTIONAL, the operations that 3001 support those features are OPTIONAL, and the server would return 3002 NFS4ERR_NOTSUPP in response to the client's use of those operations. 3003 If an OPTIONAL feature is supported, it is possible that a set of 3004 operations related to the feature become REQUIRED to implement. The 3005 third column of the table designates the feature(s) and whether the 3006 operation is REQUIRED or OPTIONAL in the presence of support for the 3007 feature. 3009 The OPTIONAL features identified and their abbreviations are as 3010 follows: 3012 pNFS Parallel NFS 3014 FDELG File Delegations 3016 DDELG Directory Delegations 3018 COPY Server Side Copy 3020 ADB Application Data Blocks 3022 Operations 3024 +----------------------+--------------------+-----------------------+ 3025 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, or | 3026 | | MNI | OPT) | 3027 +----------------------+--------------------+-----------------------+ 3028 | ACCESS | REQ | | 3029 | BACKCHANNEL_CTL | REQ | | 3030 | BIND_CONN_TO_SESSION | REQ | | 3031 | CLOSE | REQ | | 3032 | COMMIT | REQ | | 3033 | COPY | OPT | COPY (REQ) | 3034 | COPY_ABORT | OPT | COPY (REQ) | 3035 | COPY_NOTIFY | OPT | COPY (REQ) | 3036 | COPY_REVOKE | OPT | COPY (REQ) | 3037 | COPY_STATUS | OPT | COPY (REQ) | 3038 | CREATE | REQ | | 3039 | CREATE_SESSION | REQ | | 3040 | DELEGPURGE | OPT | FDELG (REQ) | 3041 | DELEGRETURN | OPT | FDELG, DDELG, pNFS | 3042 | | | (REQ) | 3043 | DESTROY_CLIENTID | REQ | | 3044 | DESTROY_SESSION | REQ | | 3045 | EXCHANGE_ID | REQ | | 3046 | FREE_STATEID | REQ | | 3047 | GETATTR | REQ | | 3048 | GETDEVICEINFO | OPT | pNFS (REQ) | 3049 | GETDEVICELIST | OPT | pNFS (OPT) | 3050 | GETFH | REQ | | 3051 | INITIALIZE | OPT | ADB (REQ) | 3052 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | 3053 | LAYOUTCOMMIT | OPT | pNFS (REQ) | 3054 | LAYOUTGET | OPT | pNFS (REQ) | 3055 | LAYOUTRETURN | OPT | pNFS (REQ) | 3056 | LINK | OPT | | 3057 | LOCK | REQ | | 3058 | LOCKT | REQ | | 3059 | LOCKU | REQ | | 3060 | LOOKUP | REQ | | 3061 | LOOKUPP | REQ | | 3062 | NVERIFY | REQ | | 3063 | OPEN | REQ | | 3064 | OPENATTR | OPT | | 3065 | OPEN_CONFIRM | MNI | | 3066 | OPEN_DOWNGRADE | REQ | | 3067 | PUTFH | REQ | | 3068 | PUTPUBFH | REQ | | 3069 | PUTROOTFH | REQ | | 3070 | READ | OPT | | 3071 | READDIR | REQ | | 3072 | READLINK | OPT | | 3073 | READ_PLUS | OPT | ADB (REQ) | 3074 | RECLAIM_COMPLETE | REQ | | 3075 | RELEASE_LOCKOWNER | MNI | | 3076 | REMOVE | REQ | | 3077 | RENAME | REQ | | 3078 | RENEW | MNI | | 3079 | RESTOREFH | REQ | | 3080 | SAVEFH | REQ | | 3081 | SECINFO | REQ | | 3082 | SECINFO_NO_NAME | REC | pNFS file layout | 3083 | | | (REQ) | 3084 | SEQUENCE | REQ | | 3085 | SETATTR | REQ | | 3086 | SETCLIENTID | MNI | | 3087 | SETCLIENTID_CONFIRM | MNI | | 3088 | SET_SSV | REQ | | 3089 | TEST_STATEID | REQ | | 3090 | VERIFY | REQ | | 3091 | WANT_DELEGATION | OPT | FDELG (OPT) | 3092 | WRITE | REQ | | 3093 +----------------------+--------------------+-----------------------+ 3094 Callback Operations 3096 +-------------------------+-------------------+---------------------+ 3097 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, | 3098 | | MNI | or OPT) | 3099 +-------------------------+-------------------+---------------------+ 3100 | CB_COPY | OPT | COPY (REQ) | 3101 | CB_GETATTR | OPT | FDELG (REQ) | 3102 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | 3103 | CB_NOTIFY
| OPT | DDELG (REQ) | 3104 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | 3105 | CB_NOTIFY_LOCK | OPT | | 3106 | CB_PUSH_DELEG | OPT | FDELG (OPT) | 3107 | CB_RECALL | OPT | FDELG, DDELG, pNFS | 3108 | | | (REQ) | 3109 | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | 3110 | | | (REQ) | 3111 | CB_RECALL_SLOT | REQ | | 3112 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | 3113 | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | 3114 | | | (REQ) | 3115 | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | 3116 | | | (REQ) | 3117 +-------------------------+-------------------+---------------------+ 3119 11. NFSv4.2 Operations 3121 11.1. Operation 59: COPY - Initiate a server-side copy 3123 11.1.1. ARGUMENT 3125 const COPY4_GUARDED = 0x00000001; 3126 const COPY4_METADATA = 0x00000002; 3128 struct COPY4args { 3129 /* SAVED_FH: source file */ 3130 /* CURRENT_FH: destination file or */ 3131 /* directory */ 3132 offset4 ca_src_offset; 3133 offset4 ca_dst_offset; 3134 length4 ca_count; 3135 uint32_t ca_flags; 3136 component4 ca_destination; 3137 netloc4 ca_source_server<>; 3138 }; 3140 11.1.2. RESULT 3142 union COPY4res switch (nfsstat4 cr_status) { 3143 case NFS4_OK: 3144 stateid4 cr_callback_id<1>; 3145 default: 3146 length4 cr_bytes_copied; 3147 }; 3149 11.1.3. DESCRIPTION 3151 The COPY operation is used for both intra- and inter-server copies. 3152 In both cases, the COPY is always sent from the client to the 3153 destination server of the file copy. The COPY operation requests 3154 that a file be copied from the location specified by the SAVED_FH 3155 value to the location specified by the combination of CURRENT_FH and 3156 ca_destination. 3158 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 3159 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 3161 In order to set SAVED_FH to the source file handle, the compound 3162 procedure requesting the COPY will include a sub-sequence of 3163 operations such as 3165 PUTFH source-fh 3166 SAVEFH 3168 If the request is for a server-to-server copy, the source-fh is a 3169 filehandle from the source server and the compound procedure is being 3170 executed on the destination server. In this case, the source-fh is a 3171 foreign filehandle on the server receiving the COPY request. If 3172 either PUTFH or SAVEFH checked the validity of the filehandle, the 3173 operation would likely fail and return NFS4ERR_STALE. 3175 In order to avoid this problem, the minor version incorporating the 3176 COPY operations will need to make a few small changes in the handling 3177 of existing operations. If a server supports the server-to-server 3178 COPY feature, a PUTFH followed by a SAVEFH MUST NOT return 3179 NFS4ERR_STALE for either operation. These restrictions do not pose 3180 substantial difficulties for servers. The CURRENT_FH and SAVED_FH 3181 may be validated in the context of the operation referencing them and 3182 an NFS4ERR_STALE error returned for an invalid file handle at that 3183 point. 3185 The CURRENT_FH and ca_destination together specify the destination of 3186 the copy operation. If ca_destination is of 0 (zero) length, then 3187 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 3188 be a regular file and not a directory. If ca_destination is not of 0 3189 (zero) length, the ca_destination argument specifies the file name to 3190 which the data will be copied within the directory identified by 3191 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 3192 regular file. 
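To make the filehandle plumbing concrete, the sketch below (in Python, purely illustrative) assembles the COMPOUND for an intra-server copy into an existing destination file, using a zero-length ca_destination so that CURRENT_FH names the file itself. The copy_compound helper and its tuple encoding are hypothetical stand-ins for an RPC library, not part of this specification:

   # Hypothetical encoding of the COMPOUND described above.
   def copy_compound(src_fh, dst_fh, src_off, dst_off, count, flags=0):
       return [
           ("PUTFH", src_fh),    # source file becomes CURRENT_FH ...
           ("SAVEFH",),          # ... and is then saved into SAVED_FH
           ("PUTFH", dst_fh),    # destination file becomes CURRENT_FH
           ("COPY", {
               "ca_src_offset": src_off,
               "ca_dst_offset": dst_off,
               "ca_count": count,        # 0 (zero) means "through EOF"
               "ca_flags": flags,
               "ca_destination": "",     # zero length: CURRENT_FH is the file
               "ca_source_server": [],   # empty: intra-server copy
           }),
       ]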
3194 If the file named by ca_destination does not exist and the operation 3195 completes successfully, the file will be visible in the file system 3196 namespace. If the file does not exist and the operation fails, the 3197 file MAY be visible in the file system namespace depending on when 3198 the failure occurs and on the implementation of the NFS server 3199 receiving the COPY operation. If the ca_destination name cannot be 3200 created in the destination file system (due to file name 3201 restrictions, such as case or length), the operation MUST fail. 3203 The ca_src_offset is the offset within the source file from which the 3204 data will be read, the ca_dst_offset is the offset within the 3205 destination file to which the data will be written, and the ca_count 3206 is the number of bytes that will be copied. An offset of 0 (zero) 3207 specifies the start of the file. A count of 0 (zero) requests that 3208 all bytes from ca_src_offset through EOF be copied to the 3209 destination. If concurrent modifications to the source file overlap 3210 with the source file region being copied, the data copied may include 3211 all, some, or none of the modifications. The client can use standard 3212 NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 3213 byte range locks) to protect against concurrent modifications if the 3214 client is concerned about this. If the source file's end of file is 3215 being modified in parallel with a copy that specifies a count of 0 3216 (zero) bytes, the amount of data copied is implementation dependent 3217 (clients may guard against this case by specifying a non-zero count 3218 value or preventing modification of the source file as mentioned 3219 above). 3221 If the source offset or the source offset plus count is greater than 3222 or equal to the size of the source file, the operation will fail with 3223 NFS4ERR_INVAL. The destination offset or destination offset plus 3224 count may be greater than the size of the destination file. This 3225 allows the client to issue parallel copies to implement 3226 operations such as "cat file1 file2 file3 file4 > dest". 3228 If the destination file is created as a result of this command, the 3229 destination file's size will be equal to the number of bytes 3230 successfully copied. If the destination file already existed, the 3231 destination file's size may increase as a result of this operation 3232 (e.g., if ca_dst_offset plus ca_count is greater than the 3233 destination's initial size). 3235 If the ca_source_server list is specified, then this is an inter- 3236 server copy operation and the source file is on a remote server. The 3237 client is expected to have previously issued a successful COPY_NOTIFY 3238 request to the remote source server. The ca_source_server list 3239 SHOULD be the same as the COPY_NOTIFY response's cnr_source_server 3240 list. If the client includes the entries from the COPY_NOTIFY 3241 response's cnr_source_server list in the ca_source_server list, the 3242 source server can indicate a specific copy protocol for the 3243 destination server to use by returning a URL, which specifies both a 3244 protocol service and server name. Server-to-server copy protocol 3245 considerations are described in Section 4.2.3 and Section 4.4.1. 3247 The ca_flags argument allows the copy operation to be customized in 3248 the following ways using the guarded flag (COPY4_GUARDED) and the 3249 metadata flag (COPY4_METADATA), as illustrated in the sketch below.
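Continuing the earlier copy_compound sketch (again hypothetical, not normative), the "cat" example above can be implemented by issuing one COPY per source file, with each ca_dst_offset set to the running total of the source sizes; the two ca_flags bits are set as shown:

   COPY4_GUARDED  = 0x00000001
   COPY4_METADATA = 0x00000002

   def concatenate(sources, dst_fh, size_of, issue):
       # 'size_of' returns a source file's current size and 'issue'
       # sends one COMPOUND; both are hypothetical helpers.
       dst_off = 0
       for src_fh in sources:
           # ca_count of 0 copies from offset 0 through EOF; writing
           # past the destination's current EOF is permitted, so these
           # COPY operations may be issued in parallel.
           issue(copy_compound(src_fh, dst_fh, 0, dst_off, 0))
           dst_off += size_of(src_fh)

   # A guarded, whole-file copy that also carries metadata would pass
   # flags=COPY4_GUARDED | COPY4_METADATA with count=0 instead.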
3251 If the guarded flag is set and the destination exists on the server, 3252 this operation will fail with NFS4ERR_EXIST. 3254 If the guarded flag is not set and the destination exists on the 3255 server, the behavior is implementation dependent. 3257 If the metadata flag is set and the client is requesting a whole file 3258 copy (i.e., ca_count is 0 (zero)), a subset of the destination file's 3259 attributes MUST be the same as the source file's corresponding 3260 attributes and a subset of the destination file's attributes SHOULD 3261 be the same as the source file's corresponding attributes. The 3262 attributes in the MUST and SHOULD copy subsets will be defined for 3263 each NFS version. 3265 For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED 3266 attributes respectively. A "MUST" in the "Copy to destination file?" 3267 column indicates that the attribute is part of the MUST copy set. A 3268 "SHOULD" in the "Copy to destination file?" column indicates that the 3269 attribute is part of the SHOULD copy set. 3271 +--------------------+----+---------------------------+ 3272 | Name | Id | Copy to destination file? | 3273 +--------------------+----+---------------------------+ 3274 | supported_attrs | 0 | no | 3275 | type | 1 | MUST | 3276 | fh_expire_type | 2 | no | 3277 | change | 3 | SHOULD | 3278 | size | 4 | MUST | 3279 | link_support | 5 | no | 3280 | symlink_support | 6 | no | 3281 | named_attr | 7 | no | 3282 | fsid | 8 | no | 3283 | unique_handles | 9 | no | 3284 | lease_time | 10 | no | 3285 | rdattr_error | 11 | no | 3286 | filehandle | 19 | no | 3287 | suppattr_exclcreat | 75 | no | 3288 +--------------------+----+---------------------------+ 3290 Table 2 3292 +--------------------+----+---------------------------+ 3293 | Name | Id | Copy to destination file? 
| 3294 +--------------------+----+---------------------------+ 3295 | acl | 12 | MUST | 3296 | aclsupport | 13 | no | 3297 | archive | 14 | no | 3298 | cansettime | 15 | no | 3299 | case_insensitive | 16 | no | 3300 | case_preserving | 17 | no | 3301 | change_policy | 60 | no | 3302 | chown_restricted | 18 | MUST | 3303 | dacl | 58 | MUST | 3304 | dir_notif_delay | 56 | no | 3305 | dirent_notif_delay | 57 | no | 3306 | fileid | 20 | no | 3307 | files_avail | 21 | no | 3308 | files_free | 22 | no | 3309 | files_total | 23 | no | 3310 | fs_charset_cap | 76 | no | 3311 | fs_layout_type | 62 | no | 3312 | fs_locations | 24 | no | 3313 | fs_locations_info | 67 | no | 3314 | fs_status | 61 | no | 3315 | hidden | 25 | MUST | 3316 | homogeneous | 26 | no | 3317 | layout_alignment | 66 | no | 3318 | layout_blksize | 65 | no | 3319 | layout_hint | 63 | no | 3320 | layout_type | 64 | no | 3321 | maxfilesize | 27 | no | 3322 | maxlink | 28 | no | 3323 | maxname | 29 | no | 3324 | maxread | 30 | no | 3325 | maxwrite | 31 | no | 3326 | max_hole_punch | 31 | no | 3327 | mdsthreshold | 68 | no | 3328 | mimetype | 32 | MUST | 3329 | mode | 33 | MUST | 3330 | mode_set_masked | 74 | no | 3331 | mounted_on_fileid | 55 | no | 3332 | no_trunc | 34 | no | 3333 | numlinks | 35 | no | 3334 | owner | 36 | MUST | 3335 | owner_group | 37 | MUST | 3336 | quota_avail_hard | 38 | no | 3337 | quota_avail_soft | 39 | no | 3338 | quota_used | 40 | no | 3339 | rawdev | 41 | no | 3340 | retentevt_get | 71 | MUST | 3341 | retentevt_set | 72 | no | 3342 | retention_get | 69 | MUST | 3343 | retention_hold | 73 | MUST | 3344 | retention_set | 70 | no | 3345 | sacl | 59 | MUST | 3346 | space_avail | 42 | no | 3347 | space_free | 43 | no | 3348 | space_freed | 78 | no | 3349 | space_reserved | 77 | MUST | 3350 | space_total | 44 | no | 3351 | space_used | 45 | no | 3352 | system | 46 | MUST | 3353 | time_access | 47 | MUST | 3354 | time_access_set | 48 | no | 3355 | time_backup | 49 | no | 3356 | time_create | 50 | MUST | 3357 | time_delta | 51 | no | 3358 | time_metadata | 52 | SHOULD | 3359 | time_modify | 53 | MUST | 3360 | time_modify_set | 54 | no | 3361 +--------------------+----+---------------------------+ 3363 Table 3 3365 [NOTE: The source file's attribute values will take precedence over 3366 any attribute values inherited by the destination file.] 3367 In the case of an inter-server copy or an intra-server copy between 3368 file systems, the attributes supported for the source file and 3369 destination file could be different. By definition, the REQUIRED 3370 attributes will be supported in all cases. If the metadata flag is 3371 set and the source file has a RECOMMENDED attribute that is not 3372 supported for the destination file, the copy MUST fail with 3373 NFS4ERR_ATTRNOTSUPP. 3375 Any attribute supported by the destination server that is not set on 3376 the source file SHOULD be left unset. 3378 Metadata attributes not exposed via the NFS protocol SHOULD be copied 3379 to the destination file where appropriate. 3381 The destination file's named attributes are not duplicated from the 3382 source file. After the copy process completes, the client MAY 3383 attempt to duplicate named attributes using standard NFSv4 3384 operations. However, the destination file's named attribute 3385 capabilities MAY be different from the source file's named attribute 3386 capabilities.
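The MUST/SHOULD subsets in Tables 2 and 3 lend themselves to a table-driven implementation. The sketch below is illustrative only; the copy sets are abbreviated, and the failure rule is simplified to the attributes in those sets:

   # Abbreviated copy subsets drawn from Tables 2 and 3.
   MUST_COPY = {"type", "size", "acl", "dacl", "sacl", "hidden", "mode",
                "owner", "owner_group", "system", "time_access",
                "time_create", "time_modify", "space_reserved"}
   SHOULD_COPY = {"change", "time_metadata"}

   def copy_metadata(src_attrs, dst_supported):
       """Return the attributes to set on the destination file."""
       dst_attrs = {}
       for name in (MUST_COPY | SHOULD_COPY) & set(src_attrs):
           if name not in dst_supported:
               # A RECOMMENDED attribute set on the source but not
               # supported on the destination fails the whole copy.
               raise ValueError("NFS4ERR_ATTRNOTSUPP: " + name)
           dst_attrs[name] = src_attrs[name]
       # Attributes not set on the source are left unset here.
       return dst_attrs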
3388 If the metadata flag is not set and the client is requesting a whole 3389 file copy (i.e., ca_count is 0 (zero)), the destination file's 3390 metadata is implementation dependent. 3392 If the client is requesting a partial file copy (i.e., ca_count is 3393 not 0 (zero)), the client SHOULD NOT set the metadata flag, and the 3394 server MUST ignore the metadata flag. 3396 If the operation does not result in an immediate failure, the server 3397 will return NFS4_OK, and the CURRENT_FH will remain the destination's 3398 filehandle. 3400 If an immediate failure does occur, cr_bytes_copied will be set to 3401 the number of bytes copied to the destination file before the error 3402 occurred. The cr_bytes_copied value indicates the number of bytes 3403 copied but not which specific bytes have been copied. 3405 A return of NFS4_OK indicates that either the operation is complete 3406 or the operation was initiated and a callback will be used to deliver 3407 the final status of the operation. 3409 If the cr_callback_id is returned, this indicates that the operation 3410 was initiated and a CB_COPY callback will deliver the final results 3411 of the operation. The cr_callback_id stateid is termed a copy 3412 stateid in this context. The server is given the option of returning 3413 the results in a callback because the data may require a relatively 3414 long period of time to copy. 3416 If no cr_callback_id is returned, the operation completed 3417 synchronously, and no callback will be issued by the server. The 3418 completion status of the operation is indicated by cr_status. 3420 If the copy completes successfully, either synchronously or 3421 asynchronously, the data copied from the source file to the 3422 destination file MUST appear identical to the NFS client. However, 3423 the NFS server's on-disk representation of the data in the source 3424 file and destination file MAY differ. For example, the NFS server 3425 might encrypt, compress, deduplicate, or otherwise represent the on- 3426 disk data in the source and destination file differently. 3428 In the event of a failure, the state of the destination file is 3429 implementation dependent. The COPY operation may fail for the 3430 following reasons (this is a partial list). 3432 NFS4ERR_MOVED: The file system that contains the source file, or 3433 the destination file or directory, is not present. The client can 3434 determine the correct location and reissue the operation with the 3435 correct location. 3437 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3438 NFS server receiving this request. 3440 NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the 3441 server-to-server copy offload protocol. 3443 NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a 3444 server-to-server copy offload operation. This may be due to the 3445 client's failure to send the COPY_NOTIFY operation to the remote 3446 server, the remote server receiving a server-to-server copy 3447 offload request after the copy lease time expired, or some 3448 other permission problem. 3450 NFS4ERR_FBIG: The copy operation would have caused the file to grow 3451 beyond the server's limit. 3453 NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- 3454 zero length. 3456 NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. 3458 NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has 3459 zero length.
3461 NFS4ERR_INVAL: The source offset or the source offset plus count is 3462 greater than or equal to the size of the source file. 3464 NFS4ERR_DELAY: The server does not have the resources to perform the 3465 copy operation at the current time. The client should retry the 3466 operation sometime in the future. 3468 NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the 3469 same metadata as the source file. 3471 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3472 does not match the server's security policy. 3474 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy 3476 11.2.1. ARGUMENT 3478 struct COPY_ABORT4args { 3479 /* CURRENT_FH: destination file */ 3480 stateid4 caa_stateid; 3481 }; 3483 11.2.2. RESULT 3485 struct COPY_ABORT4res { 3486 nfsstat4 car_status; 3487 }; 3489 11.2.3. DESCRIPTION 3491 COPY_ABORT is used for both intra- and inter-server asynchronous 3492 copies. The COPY_ABORT operation allows the client to cancel a 3493 server-side copy operation that it initiated. This operation is sent 3494 in a COMPOUND request from the client to the destination server. 3495 This operation may be used to cancel a copy when the application that 3496 requested the copy exits before the operation is completed or for 3497 some other reason. 3499 The request contains the filehandle and copy stateid cookies that act 3500 as the context for the previously initiated copy operation. 3502 The result's car_status field indicates whether the cancel was 3503 successful or not. A value of NFS4_OK indicates that the copy 3504 operation was canceled and no callback will be issued by the server. 3505 A copy operation that is successfully canceled may result in none, 3506 some, or all of the data having been copied. 3508 If the server supports asynchronous copies, the server is REQUIRED to 3509 support the COPY_ABORT operation. 3511 The COPY_ABORT operation may fail for the following reasons (this is 3512 a partial list): 3514 NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS 3515 server receiving this request. 3517 NFS4ERR_RETRY: The abort failed, but a retry at some time in the 3518 future MAY succeed. 3520 NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will 3521 deliver the results of the copy operation. 3523 NFS4ERR_SERVERFAULT: An error occurred on the server that does not 3524 map to a specific error code. 3526 11.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 3527 copy 3529 11.3.1. ARGUMENT 3531 struct COPY_NOTIFY4args { 3532 /* CURRENT_FH: source file */ 3533 netloc4 cna_destination_server; 3534 }; 3536 11.3.2. RESULT 3538 struct COPY_NOTIFY4resok { 3539 nfstime4 cnr_lease_time; 3540 netloc4 cnr_source_server<>; 3541 }; 3543 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 3544 case NFS4_OK: 3545 COPY_NOTIFY4resok resok4; 3546 default: 3547 void; 3548 }; 3550 11.3.3. DESCRIPTION 3552 This operation is used for an inter-server copy. A client sends this 3553 operation in a COMPOUND request to the source server to authorize a 3554 destination server identified by cna_destination_server to read the 3555 file specified by CURRENT_FH on behalf of the given user. 3557 The cna_destination_server MUST be specified using the netloc4 3558 network location format. The server is not required to resolve the 3559 cna_destination_server address before completing this operation. 3561 If this operation succeeds, the source server will allow the 3562 cna_destination_server to copy the specified file on behalf of the 3563 given user.
If COPY_NOTIFY succeeds, the destination server is 3564 granted permission to read the file as long as both of the following 3565 conditions are met: 3567 o The destination server begins reading the source file before the 3568 cnr_lease_time expires. If the cnr_lease_time expires while the 3569 destination server is still reading the source file, the 3570 destination server is allowed to finish reading the file. 3572 o The client has not issued a COPY_REVOKE for the same combination 3573 of user, filehandle, and destination server. 3575 The cnr_lease_time is chosen by the source server. A cnr_lease_time 3576 of 0 (zero) indicates an infinite lease. To renew the copy lease 3577 time, the client should resend the same copy notification request to 3578 the source server. 3580 To avoid the need for synchronized clocks, copy lease times are 3581 granted by the server as a time delta. However, there is a 3582 requirement that the client and server clocks do not drift 3583 excessively over the duration of the lease. There is also the issue 3584 of propagation delay across the network, which could easily be 3585 several hundred milliseconds, as well as the possibility that 3586 requests will be lost and need to be retransmitted. 3588 To take propagation delay into account, the client should subtract it 3589 from copy lease times (e.g., if the client estimates the one-way 3590 propagation delay as 200 milliseconds, then it can assume that the 3591 lease is already 200 milliseconds old when it gets it). In addition, 3592 it will take another 200 milliseconds for a lease renewal request to 3593 reach the server. So the client must send a lease renewal or send 3594 the copy offload request to the cna_destination_server at least 400 3595 milliseconds before the copy lease would expire. If the propagation 3596 delay varies over the life of the lease (e.g., the client is on a 3597 mobile host), the client will need to continuously subtract the 3598 increase in propagation delay from the copy lease times. A non- normative sketch of this arithmetic follows the error list below. 3600 The server's copy lease period configuration should take into account 3601 the network distance of the clients that will be accessing the 3602 server's resources. It is expected that the lease period will take 3603 into account the network propagation delays and other network delay 3604 factors for the client population. Since the protocol does not allow 3605 for an automatic method to determine an appropriate copy lease 3606 period, the server's administrator may have to tune the copy lease 3607 period. 3609 A successful response will also contain a list of names, addresses, 3610 and URLs, called cnr_source_server, on which the source server is 3611 willing to accept connections from the destination server. These 3612 might not be reachable from the client and might be located on 3613 networks to which the client has no connection. 3615 If the client wishes to perform an inter-server copy, the client MUST 3616 send a COPY_NOTIFY to the source server. Therefore, the source 3617 server MUST support COPY_NOTIFY. 3619 For a copy only involving one server (the source and destination are 3620 on the same server), this operation is unnecessary. 3622 The COPY_NOTIFY operation may fail for the following reasons (this is 3623 a partial list): 3625 NFS4ERR_MOVED: The file system which contains the source file is not 3626 present on the source server. The client can determine the 3627 correct location and reissue the operation with the correct 3628 location. 3630 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3631 NFS server receiving this request. 3633 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3634 does not match the server's security policy.
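
   The following Python fragment is the non-normative sketch of the
   renewal arithmetic promised above; the function and variable names
   are illustrative only and are not part of the protocol.

      def renewal_send_deadline(granted_at_ms, lease_ms, delay_ms):
          """Latest local time (in milliseconds) at which the client
          should send a lease renewal.  The lease is assumed to be
          delay_ms old on arrival, and a renewal takes another
          delay_ms to reach the source server."""
          if lease_ms == 0:
              return None  # a cnr_lease_time of 0 is an infinite lease
          expires_at_ms = granted_at_ms + lease_ms - delay_ms
          return expires_at_ms - delay_ms

      # With the 200 millisecond example above, a 10 second lease
      # received at local time 0 should be renewed by 9.6 seconds.
      assert renewal_send_deadline(0, 10000, 200) == 9600
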
3636 11.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 3637 privileges 3639 11.4.1. ARGUMENT 3641 struct COPY_REVOKE4args { 3642 /* CURRENT_FH: source file */ 3643 netloc4 cra_destination_server; 3644 }; 3646 11.4.2. RESULT 3648 struct COPY_REVOKE4res { 3649 nfsstat4 crr_status; 3650 }; 3652 11.4.3. DESCRIPTION 3654 This operation is used for an inter-server copy. A client sends this 3655 operation in a COMPOUND request to the source server to revoke the 3656 authorization of a destination server identified by 3657 cra_destination_server from reading the file specified by CURRENT_FH 3658 on behalf of the given user. If the cra_destination_server has 3659 already begun copying the file, a successful return from this 3660 operation indicates that further access will be prevented. 3662 The cra_destination_server MUST be specified using the netloc4 3663 network location format. The server is not required to resolve the 3664 cra_destination_server address before completing this operation. 3666 The COPY_REVOKE operation is useful in situations in which the source 3667 server granted a very long or infinite lease on the destination 3668 server's ability to read the source file, and all copy operations on 3669 the source file have been completed. 3671 For a copy only involving one server (the source and destination are 3672 on the same server), this operation is unnecessary. 3674 If the server supports COPY_NOTIFY, the server is REQUIRED to support 3675 the COPY_REVOKE operation. 3677 The COPY_REVOKE operation may fail for the following reasons (this is 3678 a partial list): 3680 NFS4ERR_MOVED: The file system which contains the source file is not 3681 present on the source server. The client can determine the 3682 correct location and reissue the operation with the correct 3683 location. 3685 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3686 NFS server receiving this request. 3688 11.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 3689 11.5.1. ARGUMENT 3691 struct COPY_STATUS4args { 3692 /* CURRENT_FH: destination file */ 3693 stateid4 csa_stateid; 3694 }; 3696 11.5.2. RESULT 3698 struct COPY_STATUS4resok { 3699 length4 csr_bytes_copied; 3700 nfsstat4 csr_complete<1>; 3701 }; 3703 union COPY_STATUS4res switch (nfsstat4 csr_status) { 3704 case NFS4_OK: 3705 COPY_STATUS4resok resok4; 3706 default: 3707 void; 3708 }; 3710 11.5.3. DESCRIPTION 3712 COPY_STATUS is used for both intra- and inter-server asynchronous 3713 copies. The COPY_STATUS operation allows the client to poll the 3714 server to determine the status of an asynchronous copy operation. 3715 This operation is sent by the client to the destination server. 3717 If this operation is successful, the number of bytes copied is 3718 returned to the client in the csr_bytes_copied field. The 3719 csr_bytes_copied value indicates the number of bytes copied but not 3720 which specific bytes have been copied. 3722 If the optional csr_complete field is present, the copy has 3723 completed. In this case, the status value indicates the result of 3724 the asynchronous copy operation. In all cases, the server will also 3725 deliver the final results of the asynchronous copy in a CB_COPY 3726 operation.
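
   A client might poll an asynchronous copy to completion as sketched
   below.  This is a non-normative illustration; the client object and
   its copy_status() method are hypothetical stand-ins for a real
   COMPOUND implementation.

      import time

      def wait_for_copy(client, dst_fh, copy_stateid, interval=1.0):
          # Poll COPY_STATUS until the optional csr_complete field is
          # present.  csr_complete<1> holds at most one nfsstat4
          # value, so it is modeled here as a list.
          while True:
              res = client.copy_status(dst_fh, copy_stateid)
              if res.csr_complete:            # copy has completed
                  return res.csr_complete[0]  # final copy status
              # res.csr_bytes_copied reports progress so far, but not
              # which bytes have been copied.
              time.sleep(interval)
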
3728 The failure of this operation does not indicate the result of the 3729 asynchronous copy in any way. 3731 If the server supports asynchronous copies, the server is REQUIRED to 3732 support the COPY_STATUS operation. 3734 The COPY_STATUS operation may fail for the following reasons (this is 3735 a partial list): 3737 NFS4ERR_NOTSUPP: The copy status operation is not supported by the 3738 NFS server receiving this request. 3740 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 4.3.2 3741 below). 3743 NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid 3744 section below). 3746 11.6. Operation 64: INITIALIZE 3748 The server has no concept of the structure imposed by the 3749 application. It is only when the application writes to a section of 3750 the file that order gets imposed. In order to detect corruption even 3751 before the application utilizes the file, the application will want 3752 to initialize a range of ADBs. It uses the INITIALIZE operation to 3753 do so. 3755 11.6.1. ARGUMENT 3757 /* 3758 * We use data_content4 in case we wish to 3759 * add new types later. Note that we 3760 * are explicitly disallowing data. 3761 */ 3762 union initialize_arg4 switch (data_content4 content) { 3763 case NFS4_CONTENT_APP_BLOCK: 3764 app_data_block4 ia_adb; 3765 case NFS4_CONTENT_HOLE: 3766 hole_info4 ia_hole; 3767 default: 3768 void; 3769 }; 3771 struct INITIALIZE4args { 3772 /* CURRENT_FH: file */ 3773 stateid4 ia_stateid; 3774 stable_how4 ia_stable; 3775 initialize_arg4 ia_data<>; 3776 }; 3778 11.6.2. RESULT 3780 struct INITIALIZE4resok { 3781 count4 ir_count; 3782 stable_how4 ir_committed; 3783 verifier4 ir_writeverf; 3784 data_content4 ir_sparse; 3785 }; 3787 union INITIALIZE4res switch (nfsstat4 status) { 3788 case NFS4_OK: 3789 INITIALIZE4resok resok4; 3790 default: 3791 void; 3792 }; 3794 11.6.3. DESCRIPTION 3796 When the client invokes the INITIALIZE operation, it has two desired 3797 results: 3799 1. That the structure described by the app_data_block4 be imposed on 3800 the file. 3802 2. That the contents described by the app_data_block4 be sparse. 3804 If the server supports the INITIALIZE operation, it still might not 3805 support sparse files. So if it receives the INITIALIZE operation, 3806 then it MUST populate the contents of the file with the initialized 3807 ADBs. In other words, if the server supports INITIALIZE, then it 3808 supports the concept of ADBs. [[Comment.4: Do we want to support an 3809 asynchronous INITIALIZE? Do we have to? --TH]] 3811 If the data was already initialized, there are two interesting 3812 scenarios: 3814 1. The data blocks are allocated. 3816 2. Initializing in the middle of an existing ADB. 3818 If the data blocks were already allocated, then the INITIALIZE is a 3819 hole punch operation. If the server supports sparse files, then the 3820 data blocks are to be deallocated. If not, then the data blocks are 3821 to be rewritten in the indicated ADB format. [[Comment.5: Need to 3822 document interaction between space reservation and hole punching? 3823 --TH]] 3824 Since the server has no knowledge of ADBs, it should not report 3825 misaligned creation of ADBs. Even though it can detect them, it 3826 cannot disallow them, as the application might be in the process of 3827 changing the size of the ADBs. Thus the server must be prepared to 3828 handle an INITIALIZE into an existing ADB.
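
   The hole punch branch just described can be sketched, non-
   normatively, as follows.  The fs helpers and the adb_offset and
   adb_block_count field names are assumptions made for illustration
   and are not normative protocol elements.

      def initialize_allocated_range(fs, fh, adb, supports_sparse):
          # Length of the range covered by the ADB description
          # (assumed fields).
          length = adb.adb_block_size * adb.adb_block_count
          if supports_sparse:
              # INITIALIZE acts as a hole punch: deallocate blocks.
              fs.punch_hole(fh, adb.adb_offset, length)
          else:
              # No sparse file support: rewrite the range in the
              # indicated ADB format.
              fs.write_adb_format(fh, adb.adb_offset, length, adb)
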
3830 This document does not mandate the manner in which the server stores 3831 ADBs sparsely for a file. It does assume that if ADBs are stored 3832 sparsely, then the server can detect when an INITIALIZE arrives that 3833 will force a new ADB to start inside an existing ADB. For example, 3834 assume that ADBi has an adb_block_size of 4k and that an INITIALIZE 3835 starts 1k inside ADBi. The server should [[Comment.6: Need to flesh 3836 this out. --TH]] 3838 11.7. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 3840 11.7.1. ARGUMENT 3842 /* new */ 3843 const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; 3845 11.7.2. RESULT 3847 Unchanged 3849 11.7.3. MOTIVATION 3851 Enterprise applications require guarantees that an operation has 3852 either aborted or completed. NFSv4.1 provides this guarantee as long 3853 as the session is alive: simply send a SEQUENCE operation on the same 3854 slot with a new sequence number, and the successful return of 3855 SEQUENCE indicates the previous operation has completed. However, if 3856 the session is lost, there is no way to know when any in-progress 3857 operations have aborted or completed. In hindsight, the NFSv4.1 3858 specification should have mandated that DESTROY_SESSION abort/ 3859 complete all outstanding operations. 3861 11.7.4. DESCRIPTION 3863 A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability 3864 when it sends an EXCHANGE_ID operation. The server SHOULD set this 3865 capability in the EXCHANGE_ID reply whether the client requests it or 3866 not. If the client ID is created with this capability, then the 3867 following will occur: 3869 o The server will not reply to DESTROY_SESSION until all operations 3870 in progress are completed or aborted. 3872 o The server will not reply to subsequent EXCHANGE_ID invoked on the 3873 same Client Owner with a new verifier until all operations in 3874 progress on the Client ID's session are completed or aborted. 3876 o When DESTROY_CLIENTID is invoked, any sessions (both idle 3877 and non-idle), opens, locks, delegations, layouts, and/or wants 3878 (Section 18.49) associated with the client ID are removed. 3879 Pending operations will be completed or aborted before the 3880 sessions, opens, locks, delegations, layouts, and/or wants are 3881 deleted. 3883 o The NFS server SHOULD support client ID trunking, and if it does 3884 and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a 3885 session ID created on one node of the storage cluster MUST be 3886 destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID 3887 and an EXCHANGE_ID with a new verifier affect all sessions 3888 regardless of which node the sessions were created on. 3890 11.8. Operation 65: READ_PLUS 3892 If the client sends a READ operation, it is explicitly stating that 3893 it does not support sparse files. So if a READ occurs on a sparse 3894 ADB, then the server must expand such ADBs to be raw bytes. If a 3895 READ occurs in the middle of an ADB, the server can only send back 3896 bytes starting from that offset. 3898 Such an operation is inefficient for transfer of sparse sections of 3899 the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, 3900 a client should issue READ_PLUS. Note that as the client has no a 3901 priori knowledge of whether an ADB is present or not, it should 3902 always use READ_PLUS. 3904 11.8.1. ARGUMENT 3906 struct READ_PLUS4args { 3907 /* CURRENT_FH: file */ 3908 stateid4 rpa_stateid; 3909 offset4 rpa_offset; 3910 count4 rpa_count; 3911 }; 3913 11.8.2. RESULT
3915 union read_plus_content switch (data_content4 content) { 3916 case NFS4_CONTENT_DATA: 3917 opaque rpc_data<>; 3918 case NFS4_CONTENT_APP_BLOCK: 3919 app_data_block4 rpc_block; 3920 case NFS4_CONTENT_HOLE: 3921 hole_info4 rpc_hole; 3922 default: 3923 void; 3924 }; 3926 /* 3927 * Allow a return of an array of contents. 3928 */ 3929 struct read_plus_res4 { 3930 bool rpr_eof; 3931 read_plus_content rpr_contents<>; 3932 }; 3934 union READ_PLUS4res switch (nfsstat4 status) { 3935 case NFS4_OK: 3936 read_plus_res4 resok4; 3937 default: 3938 void; 3939 }; 3941 11.8.3. DESCRIPTION 3943 Over the given range, READ_PLUS will return all data and ADBs found 3944 as an array of read_plus_content. It is possible to have consecutive 3945 ADBs in the array, either because different definitions of ADBs are 3946 present or because the guard pattern changes. 3948 Edge cases exist for ADBs which either begin before the rpa_offset 3949 requested by the READ_PLUS or end after the rpa_count requested - 3950 both of which may occur because not all applications which access the 3951 file are aware of the main application imposing a format on the file 3952 contents, e.g., tar, dd, cp, etc. READ_PLUS MUST retrieve whole 3953 ADBs, but it need not retrieve an entire sequence of ADBs. 3955 The server MUST return a whole ADB because, if it does not, it must 3956 expand that partial ADB into raw data before it sends it to the 3957 client. For example, if an ADB had a block size of 64k and the 3958 READ_PLUS was for 128k starting at an offset of 32k inside the ADB, 3959 then the first 32k would be converted to data. 3961 12. NFSv4.2 Callback Operations 3963 12.1. Operation 15: CB_COPY - Report results of a server-side copy 3965 12.1.1. ARGUMENT 3967 union copy_info4 switch (nfsstat4 cca_status) { 3968 case NFS4_OK: 3969 void; 3970 default: 3971 length4 cca_bytes_copied; 3972 }; 3974 struct CB_COPY4args { 3975 nfs_fh4 cca_fh; 3976 stateid4 cca_stateid; 3977 copy_info4 cca_copy_info; 3978 }; 3980 12.1.2. RESULT 3982 struct CB_COPY4res { 3983 nfsstat4 ccr_status; 3984 }; 3986 12.1.3. DESCRIPTION 3988 CB_COPY is used for both intra- and inter-server asynchronous copies. 3989 The CB_COPY callback informs the client of the result of an 3990 asynchronous server-side copy. This operation is sent by the 3991 destination server to the client in a CB_COMPOUND request. The copy 3992 is identified by the filehandle and stateid arguments. The result is 3993 indicated by the status field. If the copy failed, cca_bytes_copied 3994 contains the number of bytes copied before the failure occurred. The 3995 cca_bytes_copied value indicates the number of bytes copied but not 3996 which specific bytes have been copied. 3998 In the absence of an established backchannel, the server cannot 3999 signal the completion of the COPY via a CB_COPY callback. The loss 4000 of a callback channel would be indicated by the server setting the 4001 SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the 4002 SEQUENCE operation. The client must re-establish the callback 4003 channel to receive the status of the COPY operation. Prolonged loss 4004 of the callback channel could result in the server dropping the COPY 4005 operation state and invalidating the copy stateid. 4007 If the client supports the COPY operation, the client is REQUIRED to 4008 support the CB_COPY operation.
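
   A non-normative sketch of client-side handling of this callback
   follows.  The registry of pending copies and its complete() method
   are hypothetical; only the cca_* names come from the ARGUMENT
   definition above.

      NFS4_OK = 0  # nfsstat4 success value

      def handle_cb_copy(pending_copies, cca_fh, cca_stateid,
                         cca_status, cca_bytes_copied=None):
          # Match the callback to the outstanding copy by filehandle
          # and copy stateid.
          copy = pending_copies.pop((cca_fh, cca_stateid))
          if cca_status == NFS4_OK:
              copy.complete(success=True)
          else:
              # On failure, cca_bytes_copied reports how many bytes
              # were copied before the error, but not which bytes.
              copy.complete(success=False,
                            bytes_copied=cca_bytes_copied)
          return NFS4_OK  # becomes ccr_status in the CB_COPY reply
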
4010 The CB_COPY operation may fail for the following reasons (this is a 4011 partial list): 4013 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 4014 NFS client receiving this request. 4016 13. IANA Considerations 4018 This section uses terms that are defined in [23]. 4020 14. References 4022 14.1. Normative References 4024 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 4025 Levels", BCP 14, RFC 2119, March 1997. 4027 [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System 4028 (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, 4029 January 2010. 4031 [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version 4032 2 External Data Representation Standard (XDR) Description", 4033 March 2011. 4035 [4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel 4036 NFS (pNFS) Operations", RFC 5664, January 2010. 4038 [5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 4039 Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, 4040 January 2005. 4042 [6] Haynes, T. and N. Williams, "Remote Procedure Call (RPC) 4043 Security Version 3", draft-williams-rpcsecgssv3 (Work In 4044 Progress), 2011. 4046 [7] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 4047 Specification", RFC 2203, September 1997. 4049 [8] Shepler, S., Eisler, M., and D. Noveck, "Network File System 4050 (NFS) Version 4 Minor Version 1 External Data Representation 4051 Standard (XDR) Description", RFC 5662, January 2010. 4053 [9] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) 4054 Block/Volume Layout", RFC 5663, January 2010. 4056 14.2. Informative References 4058 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 4059 Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), 4060 March 2011. 4062 [11] Eisler, M., "XDR: External Data Representation Standard", 4063 RFC 4506, May 2006. 4065 [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 4066 "NSDB Protocol for Federated Filesystems", 4067 draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), 4068 2010. 4070 [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 4071 "Administration Protocol for Federated Filesystems", 4072 draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010. 4074 [14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., 4075 Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- 4076 HTTP/1.1", RFC 2616, June 1999. 4078 [15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, 4079 RFC 959, October 1985. 4081 [16] Simpson, W., "PPP Challenge Handshake Authentication Protocol 4082 (CHAP)", RFC 1994, August 1996. 4084 [17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of 4085 Oracle Database Concepts 11g Release 1 (11.1)", January 2011. 4087 [18] Ashdown, L., "Chapter 15, Validating Database Files and 4088 Backups, of Oracle Database Backup and Recovery User's Guide 4089 11g Release 1 (11.1)", August 2008. 4091 [19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory 4092 Corruption of Solaris Internals", 2007. 4094 [20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- 4095 Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data 4096 Corruption in the Storage Stack", Proceedings of the 6th USENIX 4097 Symposium on File and Storage Technologies (FAST '08), 2008. 4099 [21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: 4100 Deployment, configuration and administration of Red Hat 4101 Enterprise Linux 5, Edition 6", 2011.
4103 [22] Quigley, D. and J. Lu, "Registry Specification for MAC Security 4104 Label Formats", draft-quigley-label-format-registry (Work In 4105 Progress), 2011. 4107 [23] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA 4108 Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. 4110 [24] Nowicki, B., "NFS: Network File System Protocol specification", 4111 RFC 1094, March 1989. 4113 [25] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 4114 Protocol Specification", RFC 1813, June 1995. 4116 [26] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 4117 RFC 1833, August 1995. 4119 [27] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 4120 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 4121 RFC 2623, June 1999. 4123 [28] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 4125 [29] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, 4126 June 1999. 4128 [30] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 4129 line Database", RFC 3232, January 2002. 4131 [31] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, 4132 June 1996. 4134 [32] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 4135 C., Eisler, M., and D. Noveck, "Network File System (NFS) 4136 version 4 Protocol", RFC 3530, April 2003. 4138 Appendix A. Acknowledgments 4140 For the pNFS Access Permissions Check, the original draft was by 4141 Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work 4142 was influenced by discussions with Benny Halevy and Bruce Fields. A 4143 review was done by Tom Haynes. 4145 For the Sharing change attribute implementation details with NFSv4 4146 clients, the original draft was by Trond Myklebust. 4148 For the NFS Server-side Copy, the original draft was by James 4149 Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul 4150 Iyer. Talpey co-authored an unpublished version of that document. 4152 It was also reviewed by a number of individuals: Pranoop Erasani, 4153 Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck, Theresa 4154 Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and Nico 4155 Williams. 4157 For the NFS space reservation operations, the original draft was by 4158 Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer. 4160 For the sparse file support, the original draft was by Dean 4161 Hildebrand and Marc Eshel. Valuable input and advice were received 4162 from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and 4163 Richard Scheffenegger. 4165 For Labeled NFS, the original draft was by David Quigley, James 4166 Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust, 4167 Sorin Faibish, Nico Williams, and David Black also contributed in 4168 the final push to get this accepted. 4170 Appendix B. RFC Editor Notes 4172 [RFC Editor: please remove this section prior to publishing this 4173 document as an RFC] 4175 [RFC Editor: prior to publishing this document as an RFC, please 4176 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 4177 RFC number of this document] 4179 Author's Address 4181 Thomas Haynes 4182 NetApp 4183 9110 E 66th St 4184 Tulsa, OK 74133 4185 USA 4187 Phone: +1 918 307 1415 4188 Email: thomas@netapp.com 4189 URI: http://www.tulsalabs.com