NFSv4                                                          T. Haynes
Internet-Draft                                                    Editor
Intended status: Standards Track                         August 24, 2011
Expires: February 25, 2012

                      NFS Version 4 Minor Version 2
                  draft-ietf-nfsv4-minorversion2-04.txt

Abstract

   This Internet-Draft describes NFS version 4 minor version two, focusing mainly on the protocol extensions made from NFS version 4 minor version 0 and NFS version 4 minor version 1.  Major extensions introduced in NFS version 4 minor version two include: Server-side Copy, Space Reservations, and Support for Sparse Files.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 25, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008.  The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process.  Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

Table of Contents

   1.  Introduction
     1.1.  The NFS Version 4 Minor Version 2 Protocol
     1.2.  Scope of This Document
     1.3.  NFSv4.2 Goals
     1.4.  Overview of NFSv4.2 Features
     1.5.  Differences from NFSv4.1
   2.  pNFS LAYOUTRETURN Error Handling
     2.1.  Introduction
     2.2.  Changes to Operation 51: LAYOUTRETURN
       2.2.1.  ARGUMENT
       2.2.2.  RESULT
       2.2.3.  DESCRIPTION
       2.2.4.  IMPLEMENTATION
   3.  Sharing change attribute implementation details with NFSv4 clients
     3.1.  Introduction
     3.2.  Definition of the 'change_attr_type' per-file system attribute
   4.  NFS Server-side Copy
     4.1.  Introduction
     4.2.  Protocol Overview
       4.2.1.  Intra-Server Copy
       4.2.2.  Inter-Server Copy
       4.2.3.  Server-to-Server Copy Protocol
     4.3.  Operations
       4.3.1.  netloc4 - Network Locations
       4.3.2.  Copy Offload Stateids
     4.4.  Security Considerations
       4.4.1.  Inter-Server Copy Security
   5.  Application Data Block Support
     5.1.  Generic Framework
       5.1.1.  Data Block Representation
       5.1.2.  Data Content
     5.2.  pNFS Considerations
     5.3.  An Example of Detecting Corruption
     5.4.  Example of READ_PLUS
     5.5.  Zero Filled Holes
   6.  Space Reservation
     6.1.  Introduction
     6.2.  Use Cases
       6.2.1.  Space Reservation
       6.2.2.  Space freed on deletes
       6.2.3.  Operations and attributes
       6.2.4.  Attribute 77: space_reserved
       6.2.5.  Attribute 78: space_freed
       6.2.6.  Attribute 79: max_hole_punch
       6.2.7.  Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing the file in the specified range.
   7.  Sparse Files
     7.1.  Introduction
     7.2.  Terminology
     7.3.  Applications and Sparse Files
     7.4.  Overview of Sparse Files and NFSv4
     7.5.  Operation 65: READ_PLUS
       7.5.1.  ARGUMENT
       7.5.2.  RESULT
       7.5.3.  DESCRIPTION
       7.5.4.  IMPLEMENTATION
       7.5.5.  READ_PLUS with Sparse Files Example
     7.6.  Related Work
     7.7.  Other Proposed Designs
       7.7.1.  Multi-Data Server Hole Information
       7.7.2.  Data Result Array
       7.7.3.  User-Defined Sparse Mask
       7.7.4.  Allocated flag
       7.7.5.  Dense and Sparse pNFS File Layouts
   8.  Labeled NFS
     8.1.  Introduction
     8.2.  Definitions
     8.3.  MAC Security Attribute
       8.3.1.  Interpreting FATTR4_SEC_LABEL
       8.3.2.  Delegations
       8.3.3.  Permission Checking
       8.3.4.  Object Creation
       8.3.5.  Existing Objects
       8.3.6.  Label Changes
     8.4.  pNFS Considerations
     8.5.  Discovery of Server LNFS Support
     8.6.  MAC Security NFS Modes of Operation
       8.6.1.  Full Mode
       8.6.2.  Smart Client Mode
       8.6.3.  Smart Server Mode
     8.7.  Security Considerations
   9.  Security Considerations
   10.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL
   11.  NFSv4.2 Operations
     11.1.  Operation 59: COPY - Initiate a server-side copy
     11.2.  Operation 60: COPY_ABORT - Cancel a server-side copy
     11.3.  Operation 61: COPY_NOTIFY - Notify a source server of a future copy
     11.4.  Operation 62: COPY_REVOKE - Revoke a destination server's copy privileges
     11.5.  Operation 63: COPY_STATUS - Poll for status of a server-side copy
     11.6.  Operation 64: INITIALIZE
     11.7.  Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID
     11.8.  Operation 65: READ_PLUS
   12.  NFSv4.2 Callback Operations
     12.1.  Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's Attributes Changed
     12.2.  Operation 15: CB_COPY - Report results of a server-side copy
   13.  IANA Considerations
   14.  References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Author's Address

1.  Introduction

1.1.  The NFS Version 4 Minor Version 2 Protocol

   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third minor version of the NFS version 4 (NFSv4) protocol.  The first minor version, NFSv4.0, is described in [10] and the second minor version, NFSv4.1, is described in [2].  NFSv4.2 follows the guidelines for minor versioning that are listed in Section 11 of [10].

   As a minor version, NFSv4.2 is consistent with the overall goals for NFSv4, but extends the protocol so as to better meet those goals, based on experiences with NFSv4.1.  NFSv4.2 has also adopted additional goals, which motivate some of the major extensions described in this document.

1.2.  Scope of This Document

   This document describes the NFSv4.2 protocol.  With respect to NFSv4.0 and NFSv4.1, this document does not:

   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to contrast with NFSv4.2.

   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

   o  clarify the NFSv4.0 or NFSv4.1 protocols.
      That is, any clarifications made here apply only to NFSv4.2 and to neither of the prior protocols.

   The full XDR for NFSv4.2 is presented in [3].

1.3.  NFSv4.2 Goals

   [[Comment.1: This needs fleshing out! --TH]]

1.4.  Overview of NFSv4.2 Features

   [[Comment.2: This needs fleshing out! --TH]]

1.5.  Differences from NFSv4.1

   [[Comment.3: This needs fleshing out! --TH]]

2.  pNFS LAYOUTRETURN Error Handling

2.1.  Introduction

   In the pNFS description provided in [2], the client has no way to relay an error code from the DS to the MDS.  The specification of the Objects-Based Layout protocol [4] makes use of the opaque lrf_body field of the LAYOUTRETURN argument to relay such error codes.  In this section, we define a new data structure to enable the passing of error codes back to the MDS and provide some guidelines on what both the client and MDS should expect in such circumstances.

   There are two broad classes of errors: transient and persistent.  The client SHOULD strive to only use this new mechanism to report persistent errors.  It MUST be able to deal with transient issues by itself.  Also, while the client might consider an issue to be persistent, it MUST be prepared for the MDS to not consider such issues to be persistent.  A prime example of this is if the MDS fences off a client from either a stateid or a filehandle.  The client will get an error from the DS and might relay either NFS4ERR_ACCESS or NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a hard error.  The MDS, on the other hand, is waiting for the client to report such an error.  For it, the mission is accomplished in that the client has returned a layout that the MDS had most likely recalled.

2.2.  Changes to Operation 51: LAYOUTRETURN

   The existing LAYOUTRETURN operation is extended by introducing a new data structure to report errors, layoutreturn_device_error4.  Also, layoutreturn_error_report4 is introduced to enable an array of such errors to be reported.

2.2.1.  ARGUMENT

   The ARGUMENT specification of the LAYOUTRETURN operation in section 18.44.1 of [2] is augmented by the following XDR code [11]:

   struct layoutreturn_device_error4 {
           deviceid4       lrde_deviceid;
           nfsstat4        lrde_status;
           nfs_opnum4      lrde_opnum;
   };

   struct layoutreturn_error_report4 {
           layoutreturn_device_error4 lrer_errors<>;
   };

2.2.2.  RESULT

   The RESULT of the LAYOUTRETURN operation is unchanged; see section 18.44.2 of [2].

2.2.3.  DESCRIPTION

   The following text is added to the end of the LAYOUTRETURN operation DESCRIPTION in section 18.44.3 of [2].

   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, then if the lrf_body field is NULL, it indicates to the MDS that the client experienced no errors.  If lrf_body is non-NULL, then the field references error information which is layout type specific.  That is, the Objects-Based Layout protocol can continue to utilize lrf_body as specified in [4].  For the Files-Based Layout, the field references a layoutreturn_error_report4, which contains an array of layoutreturn_device_error4 structures.

   Each individual layoutreturn_device_error4 describes a single error associated with a DS, which is identified via lrde_deviceid.  The operation which returned the error is identified via lrde_opnum.  Finally, the NFS error value (nfsstat4) encountered is provided via lrde_status and may consist of the following error codes:

   NFS4_OK:  No issues were found for this device.

   NFS4ERR_NXIO:  The client was unable to establish any communication with the DS.

   NFS4ERR_*:  The client was able to establish communication with the DS and is returning one of the allowed error codes for the operation denoted by lrde_opnum.
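   For illustration only, a minimal C sketch of a client populating a one-element error report after failing to reach a DS might look as follows.  It assumes the C type mappings that rpcgen would emit for the XDR above (the lrer_errors_len/lrer_errors_val names follow the usual rpcgen convention for variable-length arrays); OP_READ and NFS4ERR_NXIO are the standard NFSv4.1 constants, taken here from generated headers.

   /*
    * Sketch (not normative): fill in a layoutreturn_error_report4
    * describing a DS that could not be reached.  The C types are
    * assumed to be rpcgen mappings of the XDR in Section 2.2.1.
    */
   #include <string.h>

   static void
   report_unreachable_ds(layoutreturn_error_report4 *rep,
                         layoutreturn_device_error4 *slot,
                         const deviceid4 *dev)
   {
           /* deviceid4 is a fixed-length opaque array in NFSv4.1 */
           memcpy(slot->lrde_deviceid, dev, sizeof(deviceid4));
           slot->lrde_status = NFS4ERR_NXIO;  /* no communication   */
           slot->lrde_opnum  = OP_READ;       /* operation that failed */

           rep->lrer_errors.lrer_errors_len = 1;
           rep->lrer_errors.lrer_errors_val = slot;
   }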
2.2.4.  IMPLEMENTATION

   The following text is added to the end of the LAYOUTRETURN operation IMPLEMENTATION in section 18.44.4 of [2].

   A client that expects to use pNFS for a mounted filesystem SHOULD check for pNFS support at mount time.  This check SHOULD be performed by sending a GETDEVICELIST operation, followed by layout-type-specific checks for accessibility of each storage device returned by GETDEVICELIST.  If the NFS server does not support pNFS, the GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP error; in this situation it is up to the client to determine whether it is acceptable to proceed with NFS-only access.

   Clients are expected to tolerate transient storage device errors, and hence clients SHOULD NOT use the LAYOUTRETURN error handling for device access problems that may be transient.  The methods by which a client decides whether an access problem is transient vs. persistent are implementation-specific, but may include retrying I/Os to a data server under appropriate conditions.

   When an I/O fails to a storage device, the client SHOULD retry the failed I/O via the MDS.  In this situation, before retrying the I/O, the client SHOULD return the layout, or the affected portion thereof, and SHOULD indicate which storage device or devices were problematic.  If the client does not do this, the MDS may issue a layout recall callback in order to perform the retried I/O.

   The client needs to be cognizant that since this error handling is optional in the MDS, the MDS may silently ignore this functionality.  Also, as the MDS may consider some issues the client reports to be expected (see Section 2.1), the client might find it difficult to detect an MDS which has not implemented error handling via LAYOUTRETURN.

   If an MDS is aware that a storage device is proving problematic to a client, the MDS SHOULD NOT include that storage device in any pNFS layouts sent to that client.  If the MDS is aware that a storage device is affecting many clients, then the MDS SHOULD NOT include that storage device in any pNFS layouts sent out.  Clients must still be aware that the MDS might not have any choice in using the storage device, i.e., there might only be one possible layout for the system.

   Another interesting complication is that for existing files, the MDS might have no choice in which storage devices to hand out to clients.  The MDS might try to restripe a file across a different storage device, but clients need to be aware that not all implementations have restriping support.

   An MDS SHOULD react to a client return of layouts with errors by not using the problematic storage devices in layouts for that client, but the MDS is not required to indefinitely retain per-client storage device error information.  An MDS is also not required to automatically reinstate use of a previously problematic storage device; administrative intervention may be required instead.
   A client MAY perform I/O via the MDS even when the client holds a layout that covers the I/O; servers MUST support this client behavior, and MAY recall layouts as needed to complete I/Os.

3.  Sharing change attribute implementation details with NFSv4 clients

3.1.  Introduction

   Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the change attribute as being mandatory to implement, there is little in the way of guidance on its implementation.  The only feature that is mandated by them is that the value must change whenever the file data or metadata change.

   While this allows for a wide range of implementations, it also leaves the client with a conundrum: how does it determine which is the most recent value for the change attribute in a case where several RPC calls have been issued in parallel?  In other words, if two COMPOUNDs, both containing WRITE and GETATTR requests for the same file, have been issued in parallel, how does the client determine which of the two change attribute values returned in the replies to the GETATTR requests corresponds to the most recent state of the file?  In some cases, the only recourse may be to send another COMPOUND containing a third GETATTR that is fully serialized with the first two.

   NFSv4.2 avoids this kind of inefficiency by allowing the server to share details about how the change attribute is expected to evolve, so that the client may immediately determine which, out of the several change attribute values returned by the server, is the most recent.

3.2.  Definition of the 'change_attr_type' per-file system attribute

   enum change_attr_typeinfo {
           NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
           NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
           NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
   };

   +------------------+----+---------------------------+-----+
   | Name             | Id | Data Type                 | Acc |
   +------------------+----+---------------------------+-----+
   | change_attr_type | XX | enum change_attr_typeinfo | R   |
   +------------------+----+---------------------------+-----+

   The solution enables the NFS server to provide additional information about how it expects the change attribute value to evolve after the file data or metadata has changed.  'change_attr_type' is defined as a new recommended attribute, and takes values from enum change_attr_typeinfo as follows:

   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value MUST monotonically increase for every atomic change to the file attributes, data, or directory contents.

   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:  The change attribute value MUST be incremented by one unit for every atomic change to the file attributes, data, or directory contents.  This property is preserved when writing to pNFS data servers.

   NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:  The change attribute value MUST be incremented by one unit for every atomic change to the file attributes, data, or directory contents.  In the case where the client is writing to pNFS data servers, the number of increments is not guaranteed to exactly match the number of writes.

   NFS4_CHANGE_TYPE_IS_TIME_METADATA:  The change attribute is implemented as suggested in the NFSv4 spec [10] in terms of the time_metadata attribute.

   NFS4_CHANGE_TYPE_IS_UNDEFINED:  The change attribute does not take values that fit into any of these categories.

   If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at the very least that the change attribute is monotonically increasing, which is sufficient to resolve the question of which value is the most recent.

   If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then by inspecting the value of the 'time_delta' attribute it additionally has the option of detecting rogue server implementations that use time_metadata in violation of the spec.

   Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it has the ability to predict what the resulting change attribute value should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.  This again allows it to detect changes made in parallel by another client.  The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits the same, but only if the client is not doing pNFS WRITEs.
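   To make the client-side logic this enables concrete, here is a short, purely illustrative C sketch (the helper names are not part of the protocol): with any of the three monotonic types the client can order two observed values, and with a version counter it can also predict the post-COMPOUND value.

   /* Sketch of client logic enabled by change_attr_type. */
   #include <stdbool.h>
   #include <stdint.h>

   enum change_attr_typeinfo {
           NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
           NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
           NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
   };

   /* Is change attribute value 'b' more recent than 'a'? */
   static bool
   change_is_newer(enum change_attr_typeinfo t, uint64_t a, uint64_t b)
   {
           switch (t) {
           case NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:
           case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:
           case NFS4_CHANGE_TYPE_IS_TIME_METADATA:
                   return b > a;   /* monotonically increasing */
           default:
                   return false;   /* cannot tell; must serialize */
           }
   }

   /* Expected value after 'nops' atomic changes (version counter). */
   static uint64_t
   change_predict(uint64_t current, unsigned int nops)
   {
           return current + nops;
   }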
4.  NFS Server-side Copy

4.1.  Introduction

   This section describes a server-side copy feature for the NFS protocol.

   The server-side copy feature provides a mechanism for the NFS client to perform a file copy on the server without the data being transmitted back and forth over the network.

   Without this feature, an NFS client copies data from one location to another by reading the data from the server over the network, and then writing the data back over the network to the server.  Using this server-side copy operation, the client is able to instruct the server to copy the data locally without the data being sent back and forth over the network unnecessarily.

   In general, this feature is useful whenever data is copied from one location to another on the server.  It is particularly useful when copying the contents of a file from a backup.  Backup versions of a file are copied for a number of reasons, including restoring and cloning data.

   If the source object and destination object are on different file servers, the file servers will communicate with one another to perform the copy operation.  The server-to-server protocol by which this is accomplished is not defined in this document.

4.2.  Protocol Overview

   The server-side copy offload operations support both intra-server and inter-server file copies.  An intra-server copy is a copy in which the source file and destination file reside on the same server.  In an inter-server copy, the source file and destination file are on different servers.  In both cases, the copy may be performed synchronously or asynchronously.

   Throughout the rest of this document, we refer to the NFS server containing the source file as the "source server" and the NFS server to which the file is transferred as the "destination server".  In the case of an intra-server copy, the source server and destination server are the same server.  Therefore, in the context of an intra-server copy, the terms source server and destination server refer to the single server performing the copy.

   The operations described below are designed to copy files.  Other file system objects can be copied by building on these operations or using other techniques.
   For example, if the user wishes to copy a directory, the client can synthesize a directory copy by first creating the destination directory and then copying the source directory's files to the new destination directory.  If the user wishes to copy a namespace junction [12] [13], the client can use the ONC RPC Federated Filesystem protocol [13] to perform the copy.  Specifically, the client can determine the source junction's attributes using the FEDFS_LOOKUP_FSN procedure and create a duplicate junction using the FEDFS_CREATE_JUNCTION procedure.

   For the inter-server copy protocol, the operations are defined to be compatible with a server-to-server copy protocol in which the destination server reads the file data from the source server.  This model in which the file data is pulled from the source by the destination has a number of advantages over a model in which the source pushes the file data to the destination.  The advantages of the pull model include:

   o  The pull model only requires a remote server (i.e., the destination server) to be granted read access.  A push model requires a remote server (i.e., the source server) to be granted write access, which is more privileged.

   o  The pull model allows the destination server to stop reading if it has run out of space.  In a push model, the destination server must flow control the source server in this situation.

   o  The pull model allows the destination server to easily flow control the data stream by adjusting the size of its read operations.  In a push model, the destination server does not have this ability.  The source server in a push model is capable of writing chunks larger than the destination server has requested in attributes and session parameters.  In theory, the destination server could perform a "short" write in this situation, but this approach is known to behave poorly in practice.

   The following operations are provided to support server-side copy:

   COPY_NOTIFY:  For inter-server copies, the client sends this operation to the source server to notify it of a future file copy from a given destination server for the given user.

   COPY_REVOKE:  Also for inter-server copies, the client sends this operation to the source server to revoke permission to copy a file for the given user.

   COPY:  Used by the client to request a file copy.

   COPY_ABORT:  Used by the client to abort an asynchronous file copy.

   COPY_STATUS:  Used by the client to poll the status of an asynchronous file copy.

   CB_COPY:  Used by the destination server to report the results of an asynchronous file copy to the client.

   These operations are described in detail in Section 4.3.  This section provides an overview of how these operations are used to perform server-side copies.

4.2.1.  Intra-Server Copy

   To copy a file on a single server, the client uses a COPY operation.  The server may respond to the copy operation with the final results of the copy or it may perform the copy asynchronously and deliver the results using a CB_COPY operation callback.  If the copy is performed asynchronously, the client may poll the status of the copy using COPY_STATUS or cancel the copy using COPY_ABORT.

   A synchronous intra-server copy is shown in Figure 1.  In this example, the NFS server chooses to perform the copy synchronously.  The copy operation is completed, either successfully or unsuccessfully, before the server replies to the client's request.  The server's reply contains the final result of the operation.

     Client                                  Server
        +                                      +
        |                                      |
        |--- COPY ---------------------------->| Client requests
        |<------------------------------------/| a file copy
        |                                      |
        |                                      |

      Figure 1: A synchronous intra-server copy.

   An asynchronous intra-server copy is shown in Figure 2.  In this example, the NFS server performs the copy asynchronously.  The server's reply to the copy request indicates that the copy operation was initiated and the final result will be delivered at a later time.  The server's reply also contains a copy stateid.  The client may use this copy stateid to poll for status information (as shown) or to cancel the copy using a COPY_ABORT.  When the server completes the copy, the server performs a callback to the client and reports the results.

     Client                                  Server
        +                                      +
        |                                      |
        |--- COPY ---------------------------->| Client requests
        |<------------------------------------/| a file copy
        |                                      |
        |                                      |
        |--- COPY_STATUS --------------------->| Client may poll
        |<------------------------------------/| for status
        |                                      |
        |                  .                   | Multiple COPY_STATUS
        |                  .                   | operations may be sent.
        |                  .                   |
        |                                      |
        |<-- CB_COPY --------------------------| Server reports results
        |\------------------------------------>|
        |                                      |

      Figure 2: An asynchronous intra-server copy.
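   For illustration only, the client's side of this asynchronous flow might be structured as follows in C; nfs_copy(), nfs_copy_status(), and the enum values are hypothetical wrappers around the COPY and COPY_STATUS operations, not protocol elements, and a client may equally well wait for the CB_COPY callback instead of polling.

   /*
    * Sketch (hypothetical wrappers, not protocol elements): drive an
    * asynchronous intra-server copy by polling with COPY_STATUS.
    */
   #include <stdint.h>
   #include <unistd.h>

   typedef struct {
           uint32_t      seqid;
           unsigned char other[12];
   } stateid4;

   enum copy_state { COPY_DONE, COPY_PENDING, COPY_FAILED };

   extern enum copy_state nfs_copy(const char *src, const char *dst,
                                   stateid4 *copy_stateid);
   extern enum copy_state nfs_copy_status(const stateid4 *copy_stateid);

   int
   do_copy(const char *src, const char *dst)
   {
           stateid4 csid;
           enum copy_state st = nfs_copy(src, dst, &csid);

           while (st == COPY_PENDING) {
                   sleep(1);      /* or wait for the CB_COPY callback */
                   st = nfs_copy_status(&csid);
           }
           return st == COPY_DONE ? 0 : -1;
   }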
4.2.2.  Inter-Server Copy

   A copy may also be performed between two servers.  The copy protocol is designed to accommodate a variety of network topologies.  As shown in Figure 3, the client and servers may be connected by multiple networks.  In particular, the servers may be connected by a specialized, high speed network (network 203.0.113.0/24 in the diagram) that does not include the client.  The protocol allows the client to set up the copy between the servers (over network 192.0.2.0/24 in the diagram) and for the servers to communicate on the high speed network if they choose to do so.

                      203.0.113.0/24
       +-------------------------------------+
       |                                     |
       |                                     |
       | 203.0.113.18                        | 203.0.113.56
   +-------+------+                   +------+------+
   |    Source    |                   | Destination |
   +-------+------+                   +------+------+
       | 192.0.2.18                          | 192.0.2.56
       |                                     |
       |                                     |
       |             192.0.2.0/24            |
       +------------------+------------------+
                          |
                          |
                          | 192.0.2.243
                    +-----+-----+
                    |  Client   |
                    +-----------+

      Figure 3: An example inter-server network topology.

   For an inter-server copy, the client notifies the source server that a file will be copied by the destination server using a COPY_NOTIFY operation.  The client then initiates the copy by sending the COPY operation to the destination server.  The destination server may perform the copy synchronously or asynchronously.

   A synchronous inter-server copy is shown in Figure 4.  In this case, the destination server chooses to perform the copy before responding to the client's COPY request.

   An asynchronous copy is shown in Figure 5.  In this case, the destination server chooses to respond to the client's COPY request immediately and then perform the copy asynchronously.

     Client                Source          Destination
        +                    +                  +
        |                    |                  |
        |--- COPY_NOTIFY --->|                  |
        |<------------------/|                  |
        |                    |                  |
        |                    |                  |
        |--- COPY ------------------------------>|
        |                    |                  |
        |                    |                  |
        |                    |<----- read ------|
        |                    |\---------------->|
        |                    |                  |
        |                    |        .         | Multiple reads may
        |                    |        .         | be necessary
        |                    |        .         |
        |                    |                  |
        |                    |                  |
        |<--------------------------------------/| Destination replies
        |                    |                  |  to COPY

      Figure 4: A synchronous inter-server copy.

     Client                Source          Destination
        +                    +                  +
        |                    |                  |
        |--- COPY_NOTIFY --->|                  |
        |<------------------/|                  |
        |                    |                  |
        |                    |                  |
        |--- COPY ------------------------------>|
        |<--------------------------------------/|
        |                    |                  |
        |                    |                  |
        |                    |<----- read ------|
        |                    |\---------------->|
        |                    |                  |
        |                    |        .         | Multiple reads may
        |                    |        .         | be necessary
        |                    |        .         |
        |                    |                  |
        |                    |                  |
        |--- COPY_STATUS ------------------------>| Client may poll
        |<--------------------------------------/| for status
        |                    |                  |
        |                    |        .         | Multiple COPY_STATUS
        |                    |        .         | operations may be sent
        |                    |        .         |
        |                    |                  |
        |                    |                  |
        |<-- CB_COPY ----------------------------| Destination reports
        |\-------------------------------------->| results
        |                    |                  |

      Figure 5: An asynchronous inter-server copy.

4.2.3.  Server-to-Server Copy Protocol

   During an inter-server copy, the destination server reads the file data from the source server.  The source server and destination server are not required to use a specific protocol to transfer the file data.  The choice of what protocol to use is ultimately the destination server's decision.

4.2.3.1.  Using NFSv4.x as a Server-to-Server Copy Protocol

   The destination server MAY use standard NFSv4.x (where x >= 1) to read the data from the source server.  If NFSv4.x is used for the server-to-server copy protocol, the destination server can use the filehandle contained in the COPY request with standard NFSv4.x operations to read data from the source server.  Specifically, the destination server may use the NFSv4.x OPEN operation's CLAIM_FH facility to open the file being copied and obtain an open stateid.  Using the stateid, the destination server may then use NFSv4.x READ operations to read the file.
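   As a sketch, in the COMPOUND notation used later in this document (see Section 4.4.1.3), the destination server's reads might look like the following; session establishment and SEQUENCE operations are omitted for brevity, and the filehandle is the one received in the COPY request:

   COMPOUND { PUTFH (filehandle from COPY) ; OPEN (CLAIM_FH) ; GETFH }

   COMPOUND { PUTFH (filehandle from COPY) ;
              READ (open stateid, offset, count) }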
4.2.3.2.  Using an alternative Server-to-Server Copy Protocol

   In a homogeneous environment, the source and destination servers might be able to perform the file copy extremely efficiently using specialized protocols.  For example, the source and destination servers might be two nodes sharing a common file system format for the source and destination file systems.  Thus the source and destination are in an ideal position to efficiently render the image of the source file to the destination file by replicating the file system formats at the block level.  Another possibility is that the source and destination might be two nodes sharing a common storage area network, and thus there is no need to copy any data at all, and instead ownership of the file and its contents might simply be re-assigned to the destination.  To allow for these possibilities, the destination server is allowed to use a server-to-server copy protocol of its choice.

   In a heterogeneous environment, using a protocol other than NFSv4.x (e.g., HTTP [14] or FTP [15]) presents some challenges.  In particular, the destination server is presented with the challenge of accessing the source file given only an NFSv4.x filehandle.

   One option for protocols that identify source files with path names is to use an ASCII hexadecimal representation of the source filehandle as the file name.
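   A small illustrative C routine for this encoding might look as follows; the function name and the "0x" prefix (which matches the examples below) are merely conventions assumed here.

   /*
    * Sketch: render an NFSv4 filehandle as an ASCII hexadecimal
    * file name, e.g. "0x0a31..".  Illustrative, not normative.
    */
   #include <stdio.h>
   #include <stddef.h>

   static int
   fh_to_hex_name(const unsigned char *fh, size_t fh_len,
                  char *name, size_t name_len)
   {
           size_t i, off;

           if (name_len < 2 + 2 * fh_len + 1)
                   return -1;              /* buffer too small */
           off = (size_t)sprintf(name, "0x");
           for (i = 0; i < fh_len; i++)
                   off += (size_t)sprintf(name + off, "%02x", fh[i]);
           return 0;
   }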
   Another option for the source server is to use URLs to direct the destination server to a specialized service.  For example, the response to COPY_NOTIFY could include the URL ftp://s1.example.com:9999/_FH/0x12345, where 0x12345 is the ASCII hexadecimal representation of the source filehandle.  When the destination server receives the source server's URL, it would use "_FH/0x12345" as the file name to pass to the FTP server listening on port 9999 of s1.example.com.  On port 9999 there would be a special instance of the FTP service that understands how to convert NFS filehandles to an open file descriptor (in many operating systems, this would require a new system call, one which is the inverse of the makefh() function that the pre-NFSv4 MOUNT service needs).

   Authenticating and identifying the destination server to the source server is also a challenge.  Recommendations for how to accomplish this are given in Section 4.4.1.2.4 and Section 4.4.1.4.

4.3.  Operations

   In the sections that follow, several operations are defined that together provide the server-side copy feature.  These operations are intended to be OPTIONAL operations as defined in section 17 of [2].  The COPY_NOTIFY, COPY_REVOKE, COPY, COPY_ABORT, and COPY_STATUS operations are designed to be sent within an NFSv4 COMPOUND procedure.  The CB_COPY operation is designed to be sent within an NFSv4 CB_COMPOUND procedure.

   Each operation is performed in the context of the user identified by the ONC RPC credential of its containing COMPOUND or CB_COMPOUND request.  For example, a COPY_ABORT operation issued by a given user indicates that a specified COPY operation initiated by the same user be canceled.  Therefore, a COPY_ABORT MUST NOT interfere with a copy of the same file initiated by another user.

   An NFS server MAY allow an administrative user to monitor or cancel copy operations using an implementation-specific interface.

4.3.1.  netloc4 - Network Locations

   The server-side copy operations specify network locations using the netloc4 data type shown below:

   enum netloc_type4 {
           NL4_NAME    = 0,
           NL4_URL     = 1,
           NL4_NETADDR = 2
   };

   union netloc4 switch (netloc_type4 nl_type) {
           case NL4_NAME:    utf8str_cis nl_name;
           case NL4_URL:     utf8str_cis nl_url;
           case NL4_NETADDR: netaddr4    nl_addr;
   };

   If the netloc4 is of type NL4_NAME, the nl_name field MUST be specified as a UTF-8 string.  The nl_name is expected to be resolved to a network address via DNS, LDAP, NIS, /etc/hosts, or some other means.  If the netloc4 is of type NL4_URL, a server URL [5] appropriate for the server-to-server copy operation is specified as a UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr field MUST contain a valid netaddr4 as defined in Section 3.3.9 of [2].

   When netloc4 values are used for an inter-server copy as shown in Figure 3, their values may be evaluated on the source server, destination server, and client.  The network environment in which these systems operate should be configured so that the netloc4 values are interpreted as intended on each system.

4.3.2.  Copy Offload Stateids

   A server may perform a copy offload operation asynchronously.  An asynchronous copy is tracked using a copy offload stateid.  Copy offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, and CB_COPY operations.

   Section 8.2.4 of [2] specifies that stateids are valid until either (A) the client or server restarts or (B) the client returns the resource.

   A copy offload stateid will be valid until either (A) the client or server restarts or (B) the client returns the resource by issuing a COPY_ABORT operation or the client replies to a CB_COPY operation.

   A copy offload stateid's seqid MUST NOT be 0 (zero).  In the context of a copy offload operation, it is ambiguous to indicate the most recent copy offload operation using a stateid with a seqid of 0 (zero).  Therefore, a copy offload stateid with a seqid of 0 (zero) MUST be considered invalid.
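   Expressed as a trivial C check (illustrative only; the stateid4 layout follows the NFSv4.1 definition):

   /* Sketch: a copy offload stateid with a seqid of 0 is invalid. */
   #include <stdbool.h>
   #include <stdint.h>

   typedef struct {
           uint32_t      seqid;
           unsigned char other[12];
   } stateid4;

   static bool
   copy_stateid_is_valid(const stateid4 *sid)
   {
           return sid->seqid != 0;
   }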
4.4.  Security Considerations

   The security considerations pertaining to NFSv4 [10] apply to this document.

   The standard security mechanisms provided by NFSv4 [10] may be used to secure the protocol described in this document.

   NFSv4 clients and servers supporting the inter-server copy operations described in this document are REQUIRED to implement [6], including the RPCSEC_GSSv3 privileges copy_from_auth and copy_to_auth.  If the server-to-server copy protocol is ONC RPC based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 privilege copy_confirm_auth.  These requirements to implement are not requirements to use.  NFSv4 clients and servers are RECOMMENDED to use [6] to secure server-side copy operations.

4.4.1.  Inter-Server Copy Security

4.4.1.1.  Requirements for Secure Inter-Server Copy

   Inter-server copy is driven by several requirements:

   o  The specification MUST NOT mandate an inter-server copy protocol.  There are many ways to copy data.  Some will be more optimal than others depending on the identities of the source server and destination server.  For example, the source and destination servers might be two nodes sharing a common file system format for the source and destination file systems.  Thus the source and destination are in an ideal position to efficiently render the image of the source file to the destination file by replicating the file system formats at the block level.  In other cases, the source and destination might be two nodes sharing a common storage area network, and thus there is no need to copy any data at all, and instead ownership of the file and its contents simply gets re-assigned to the destination.

   o  The specification MUST provide guidance for using NFSv4.x as a copy protocol.  For those source and destination servers willing to use NFSv4.x, there are specific security considerations that this specification can and does address.

   o  The specification MUST NOT mandate pre-configuration between the source and destination server.  Requiring that the source and destination first have a "copying relationship" increases the administrative burden.  However, the specification MUST NOT preclude implementations that require pre-configuration.
   o  The specification MUST NOT mandate a trust relationship between the source and destination server.  The NFSv4 security model requires mutual authentication between a principal on an NFS client and a principal on an NFS server.  This model MUST continue with the introduction of COPY.

4.4.1.2.  Inter-Server Copy with RPCSEC_GSSv3

   When the client sends a COPY_NOTIFY to the source server to expect the destination to attempt to copy data from the source server, it is expected that this copy is being done on behalf of the principal (called the "user principal") that sent the RPC request that encloses the COMPOUND procedure that contains the COPY_NOTIFY operation.  The user principal is identified by the RPC credentials.  A mechanism is necessary that allows the user principal to authorize the destination server to perform the copy, that lets the source server properly authenticate the destination's copy, and that does not allow the destination to exceed its authorization.

   An approach that sends delegated credentials of the client's user principal to the destination server is not used for the following reasons.  If the client's user delegated its credentials, the destination would authenticate as the user principal.  If the destination were using the NFSv4 protocol to perform the copy, then the source server would authenticate the destination server as the user principal, and the file copy would securely proceed.  However, this approach would allow the destination server to copy other files.  The user principal would have to trust the destination server to not do so.  This is counter to the requirements, and therefore is not considered.  Instead, an approach using RPCSEC_GSSv3 [6] privileges is proposed.

   One of the stated applications of the proposed RPCSEC_GSSv3 protocol is compound client host and user authentication [+ privilege assertion].  For inter-server file copy, we require compound NFS server host and user authentication [+ privilege assertion].  The distinction between the two is one without meaning.

   RPCSEC_GSSv3 introduces the notion of privileges.  We define three privileges:

   copy_from_auth:  A user principal is authorizing a source principal ("nfs@<source>") to allow a destination principal ("nfs@<destination>") to copy a file from the source to the destination.  This privilege is established on the source server before the user principal sends a COPY_NOTIFY operation to the source server.

   struct copy_from_auth_priv {
           secret4       cfap_shared_secret;
           netloc4       cfap_destination;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed cfap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int  cfap_seq_num;
   };

   cfap_shared_secret is a secret value the user principal generates.

   copy_to_auth:  A user principal is authorizing a destination principal ("nfs@<destination>") to allow it to copy a file from the source to the destination.  This privilege is established on the destination server before the user principal sends a COPY operation to the destination server.

   struct copy_to_auth_priv {
           /* equal to cfap_shared_secret */
           secret4       ctap_shared_secret;
           netloc4       ctap_source;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed ctap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int  ctap_seq_num;
   };

   ctap_shared_secret is a secret value the user principal generated and that was used to establish the copy_from_auth privilege with the source principal.

   copy_confirm_auth:  A destination principal is confirming with the source principal that it is authorized to copy data from the source on behalf of the user principal.  When the inter-server copy protocol is NFSv4, or for that matter, any protocol capable of being secured via RPCSEC_GSSv3 (i.e., any ONC RPC protocol), this privilege is established before the file is copied from the source to the destination.

   struct copy_confirm_auth_priv {
           /* equal to GSS_GetMIC() of cfap_shared_secret */
           opaque        ccap_shared_secret_mic<>;
           /* the NFSv4 user name that the user principal maps to */
           utf8str_mixed ccap_username;
           /* equal to seq_num of rpc_gss_cred_vers_3_t */
           unsigned int  ccap_seq_num;
   };
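   For illustration, the MIC carried in ccap_shared_secret_mic could be computed with the GSS-API C bindings of RFC 2744, where gss_get_mic() is the C equivalent of the abstract GSS_GetMIC(); the sketch assumes an already-established security context.

   /*
    * Sketch: compute a MIC of the shared secret using the GSS-API C
    * bindings; the security context 'ctx' is assumed established.
    */
   #include <gssapi/gssapi.h>

   static int
   mic_of_secret(gss_ctx_id_t ctx,
                 const void *secret, size_t secret_len,
                 gss_buffer_desc *mic /* out: ccap_shared_secret_mic */)
   {
           OM_uint32 maj, min;
           gss_buffer_desc in;

           in.value  = (void *)secret;
           in.length = secret_len;

           maj = gss_get_mic(&min, ctx, GSS_C_QOP_DEFAULT, &in, mic);
           return maj == GSS_S_COMPLETE ? 0 : -1;
   }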
4.4.1.2.1.  Establishing a Security Context

   When the user principal wants to COPY a file between two servers, if it has not established copy_from_auth and copy_to_auth privileges on the servers, it establishes them:

   o  The user principal generates a secret it will share with the two servers.  This shared secret will be placed in the cfap_shared_secret and ctap_shared_secret fields of the appropriate privilege data types, copy_from_auth_priv and copy_to_auth_priv.

   o  An instance of copy_from_auth_priv is filled in with the shared secret, the destination server, and the NFSv4 user id of the user principal.  It will be sent with an RPCSEC_GSS3_CREATE procedure, and so cfap_seq_num is set to the seq_num of the credential of the RPCSEC_GSS3_CREATE procedure.  Because cfap_shared_secret is a secret, after XDR encoding copy_from_auth_priv, GSS_Wrap() (with privacy) is invoked on copy_from_auth_priv, as sketched after this list.  The RPCSEC_GSS3_CREATE procedure's arguments are:

   struct {
           rpc_gss3_gss_binding  *compound_binding;
           rpc_gss3_chan_binding *chan_binding_mic;
           rpc_gss3_assertion    assertions<>;
           rpc_gss3_extension    extensions<>;
   } rpc_gss3_create_args;

      The string "copy_from_auth" is placed in assertions[0].privs.  The output of GSS_Wrap() is placed in extensions[0].data.  The field extensions[0].critical is set to TRUE.  The source server calls GSS_Unwrap() on the privilege, and verifies that the seq_num matches the credential.  It then verifies that the NFSv4 user id being asserted matches the source server's mapping of the user principal.  If it does, the privilege is established on the source server as: <"copy_from_auth", user id, destination>.  The successful reply to RPCSEC_GSS3_CREATE has:

   struct {
           opaque                handle<>;
           rpc_gss3_chan_binding *chan_binding_mic;
           rpc_gss3_assertion    granted_assertions<>;
           rpc_gss3_assertion    server_assertions<>;
           rpc_gss3_extension    extensions<>;
   } rpc_gss3_create_res;

      The field "handle" is the RPCSEC_GSSv3 handle that the client will use on COPY_NOTIFY requests involving the source and destination server.  granted_assertions[0].privs will be equal to "copy_from_auth".  The server will return a GSS_Wrap() of copy_from_auth_priv.
   o  An instance of copy_to_auth_priv is filled in with the shared secret, the source server, and the NFSv4 user id.  It will be sent with an RPCSEC_GSS3_CREATE procedure, and so ctap_seq_num is set to the seq_num of the credential of the RPCSEC_GSS3_CREATE procedure.  Because ctap_shared_secret is a secret, after XDR encoding copy_to_auth_priv, GSS_Wrap() is invoked on copy_to_auth_priv.  The RPCSEC_GSS3_CREATE procedure's arguments are:

   struct {
           rpc_gss3_gss_binding  *compound_binding;
           rpc_gss3_chan_binding *chan_binding_mic;
           rpc_gss3_assertion    assertions<>;
           rpc_gss3_extension    extensions<>;
   } rpc_gss3_create_args;

      The string "copy_to_auth" is placed in assertions[0].privs.  The output of GSS_Wrap() is placed in extensions[0].data.  The field extensions[0].critical is set to TRUE.  After unwrapping, verifying the seq_num, and verifying the user principal to NFSv4 user ID mapping, the destination establishes a privilege of <"copy_to_auth", user id, source>.  The successful reply to RPCSEC_GSS3_CREATE has:

   struct {
           opaque                handle<>;
           rpc_gss3_chan_binding *chan_binding_mic;
           rpc_gss3_assertion    granted_assertions<>;
           rpc_gss3_assertion    server_assertions<>;
           rpc_gss3_extension    extensions<>;
   } rpc_gss3_create_res;

      The field "handle" is the RPCSEC_GSSv3 handle that the client will use on COPY requests involving the source and destination server.  The field granted_assertions[0].privs will be equal to "copy_to_auth".  The server will return a GSS_Wrap() of copy_to_auth_priv.
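   The encode-and-wrap step used in both cases above might be implemented as sketched below, assuming xdr_copy_from_auth_priv() is the XDR routine generated from the structure in Section 4.4.1.2 (e.g., by rpcgen) and using the GSS-API C bindings of RFC 2744; the non-zero conf_req_flag requests privacy, as the shared secret requires.

   /*
    * Sketch: XDR-encode copy_from_auth_priv, then GSS_Wrap() it with
    * privacy for use in extensions[0].data of RPCSEC_GSS3_CREATE.
    * xdr_copy_from_auth_priv() is assumed to be rpcgen-generated.
    */
   #include <gssapi/gssapi.h>
   #include <rpc/rpc.h>

   static int
   wrap_priv(gss_ctx_id_t ctx, copy_from_auth_priv *priv,
             gss_buffer_desc *wrapped /* out: extensions[0].data */)
   {
           char buf[1024];
           XDR xdrs;
           OM_uint32 maj, min;
           gss_buffer_desc plain;
           int conf_state;

           xdrmem_create(&xdrs, buf, sizeof(buf), XDR_ENCODE);
           if (!xdr_copy_from_auth_priv(&xdrs, priv))
                   return -1;
           plain.value  = buf;
           plain.length = xdr_getpos(&xdrs);

           maj = gss_wrap(&min, ctx, 1 /* privacy */, GSS_C_QOP_DEFAULT,
                          &plain, &conf_state, wrapped);
           return (maj == GSS_S_COMPLETE && conf_state) ? 0 : -1;
   }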
1117 If the client sends a COPY_REVOKE to the source server to rescind the 1118 destination server's copy privilege, it uses the privileged 1119 "copy_from_auth" RPCSEC_GSSv3 handle and the cra_destination_server 1120 in COPY_REVOKE MUST be the same as the name of the destination server 1121 specified in copy_from_auth_priv. The source server will then delete 1122 the <"copy_from_auth", user id, destination> privilege and fail any 1123 subsequent copy requests sent under the auspices of this privilege 1124 from the destination server. 1126 4.4.1.2.3. Securing ONC RPC Server-to-Server Copy Protocols 1128 After a destination server has a "copy_to_auth" privilege established 1129 on it, and it receives a COPY request, if it knows it will use an ONC 1130 RPC protocol to copy data, it will establish a "copy_confirm_auth" 1131 privilege on the source server, using nfs@<destination> as the 1132 initiator principal, and nfs@<source> as the target principal. 1134 The value of the field ccap_shared_secret_mic is a GSS_GetMIC() of 1135 the shared secret passed in the copy_to_auth privilege. The field 1136 ccap_username is the mapping of the user principal to an NFSv4 user 1137 name ("user"@"domain" form), and MUST be the same as ctap_username 1138 and cfap_username. The field ccap_seq_num is the seq_num of the 1139 RPCSEC_GSSv3 credential used for the RPCSEC_GSS3_CREATE procedure the 1140 destination will send to the source server to establish the 1141 privilege. 1143 The source server verifies the privilege, and establishes a 1144 <"copy_confirm_auth", user id, destination> privilege. If the source 1145 server fails to verify the privilege, the COPY operation will be 1146 rejected with NFS4ERR_PARTNER_NO_AUTH. All subsequent ONC RPC 1147 requests sent from the destination to copy data from the source to 1148 the destination will use the RPCSEC_GSSv3 handle returned by the 1149 source's RPCSEC_GSS3_CREATE response. 1151 Note that the use of the "copy_confirm_auth" privilege accomplishes 1152 the following: 1154 o if a protocol like NFS is being used with export policies, the 1155 export policies can be overridden in case the destination server 1156 as-an-NFS-client is not authorized 1158 o manual configuration to allow a copy relationship between the 1159 source and destination is not needed. 1161 If the attempt to establish a "copy_confirm_auth" privilege fails, 1162 then when the user principal sends a COPY request to the destination, 1163 the destination server will reject it with NFS4ERR_PARTNER_NO_AUTH. 1165 4.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols 1167 If the destination will not be using ONC RPC to copy the data, then 1168 the source and destination are using an unspecified copy protocol. 1169 The destination could use the shared secret and the NFSv4 user id to 1170 prove to the source server that the user principal has authorized the 1171 copy. 1173 For protocols that authenticate user names with passwords (e.g., HTTP 1174 [14] and FTP [15]), the NFSv4 user id could be used as the user name, 1175 and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared 1176 secret could be used as the user password or as input into non- 1177 password authentication methods like CHAP [16].
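For illustration only, the hexadecimal rendering suggested above might be produced as follows; the function name is hypothetical:

   #include <stdio.h>

   /* Render the RPCSEC_GSSv3 shared secret as an ASCII hexadecimal
    * string for use as, e.g., an FTP password.  'out' must hold at
    * least 2 * len + 1 bytes. */
   void secret_to_hex(const unsigned char *secret, size_t len, char *out)
   {
       for (size_t i = 0; i < len; i++)
           sprintf(out + 2 * i, "%02x", secret[i]);
   }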
1179 4.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3 1181 ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the 1182 server-side copy offload operations described in this document. In 1183 particular, host-based ONC RPC security flavors such as AUTH_NONE and 1184 AUTH_SYS MAY be used. If a host-based security flavor is used, a 1185 minimal level of protection for the server-to-server copy protocol is 1186 possible. 1188 In the absence of strong security mechanisms such as RPCSEC_GSSv3, 1189 the challenge is how the source server and destination server 1190 identify themselves to each other, especially in the presence of 1191 multi-homed source and destination servers. In a multi-homed 1192 environment, the destination server might not contact the source 1193 server from the same network address specified by the client in the 1194 COPY_NOTIFY. This can be overcome using the procedure described 1195 below. 1197 When the client sends the source server the COPY_NOTIFY operation, 1198 the source server may reply to the client with a list of target 1199 addresses, names, and/or URLs and assign them to the unique triple: 1200 <source filehandle, user ID, destination address>. If the destination uses 1201 one of these target netlocs to contact the source server, the source 1202 server will be able to uniquely identify the destination server, even 1203 if the destination server does not connect from the address specified 1204 by the client in COPY_NOTIFY. 1206 For example, suppose the network topology is as shown in Figure 3. 1207 If the source filehandle is 0x12345, the source server may respond to 1208 a COPY_NOTIFY for destination 192.0.2.56 with the URLs: 1210 nfs://192.0.2.18//_COPY/192.0.2.56/_FH/0x12345 1212 nfs://203.0.113.18//_COPY/192.0.2.56/_FH/0x12345 1214 The client will then send these URLs to the destination server in the 1215 COPY operation. Suppose that the 203.0.113.0/24 network is a high 1216 speed network and the destination server decides to transfer the file 1217 over this network. If the destination contacts the source server 1218 from 203.0.113.56 over this network using NFSv4.1, it does the 1219 following: 1221 COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "192.0.2.56"; LOOKUP 1222 "_FH" ; OPEN "0x12345" ; GETFH } 1224 The source server will therefore know that these NFSv4.1 operations 1225 are being issued by the destination server identified in the 1226 COPY_NOTIFY. 1228 4.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 1230 The same techniques as Section 4.4.1.3, using unique URLs for each 1231 destination server, can be used for other protocols (e.g., HTTP [14] 1232 and FTP [15]) as well. 1234 5. Application Data Block Support 1236 At the OS level, files are stored in disk blocks. Applications 1237 are also free to impose structure on the data contained in a file and 1238 we can define an Application Data Block (ADB) to be such a structure. 1239 From the application's viewpoint, it only wants to handle ADBs and 1240 not raw bytes (see [17]). An ADB is typically composed of two 1241 sections: a header and data. The header describes the 1242 characteristics of the block and can provide a means to detect 1243 corruption in the data payload. The data section is typically 1244 initialized to all zeros. 1246 The format of the header is application specific, but there are two 1247 main components typically encountered: 1249 1. An ADB Number (ADBN), which allows the application to determine 1250 which data block is being referenced. The ADBN is a logical 1251 block number and is useful when the client is not storing the 1252 blocks in contiguous memory. 1254 2. Fields to describe the state of the ADB and a means to detect 1255 block corruption. For both pieces of data, a useful property is 1256 that allowed values be unique in that, if passed across the 1257 network, corruption due to translation between big and little 1258 endian architectures is detectable.
For example, 0xF0DEDEF0 has 1259 the same bit pattern in both architectures. 1261 Applications already impose structures on files [17] and detect 1262 corruption in data blocks [18]. What they are not able to do is 1263 efficiently transfer and store ADBs. To initialize a file with ADBs, 1264 the client must send the full ADB to the server and that must be 1265 stored on the server. When the application is initializing a file to 1266 have the ADB structure, it could compress the ADBs to just the 1267 information necessary to later reconstruct the header portion of 1268 the ADB when the contents are read back. Using sparse file 1269 techniques, the disk blocks described by the ADBs would not be allocated. 1270 Unlike sparse file techniques, there would be a small cost to store 1271 the compressed header data. 1273 In this section, we are going to define a generic framework for an 1274 ADB, present one approach to detecting corruption in a given ADB 1275 implementation, and describe the model for how the client and server 1276 can support efficient initialization of ADBs, reading of ADB holes, 1277 punching holes in ADBs, and space reservation. Further, we need to 1278 be able to extend this model to applications which do not support 1279 ADBs, but wish to be able to handle sparse files, hole punching, and 1280 space reservation. 1282 5.1. Generic Framework 1284 We want the representation of the ADB to be flexible enough to 1285 support many different applications. The most basic approach is no 1286 imposition of a block at all, which means we are working with the raw 1287 bytes. Such an approach would be useful for storing holes, punching 1288 holes, etc. In more complex deployments, a server might be 1289 supporting multiple applications, each with its own definition of 1290 the ADB. One might store the ADBN at the start of the block and then 1291 have a guard pattern to detect corruption [19]. The next might store 1292 the ADBN at an offset of 100 bytes within the block and have no guard 1293 pattern at all. The point is that existing applications might 1294 already have well defined formats for their data blocks. 1296 The guard pattern can be used to represent the state of the block, to 1297 protect against corruption, or both. Again, it needs to be possible 1298 to place it anywhere within the ADB. 1300 We need to be able to represent the starting offset of the block and 1301 the size of the block. Note that nothing prevents the application 1302 from defining different sized blocks in a file. 1304 5.1.1. Data Block Representation 1306 struct app_data_block4 { 1307 offset4 adb_offset; 1308 length4 adb_block_size; 1309 length4 adb_block_count; 1310 length4 adb_reloff_blocknum; 1311 count4 adb_block_num; 1312 length4 adb_reloff_pattern; 1313 opaque adb_pattern<>; 1314 }; 1316 The app_data_block4 structure captures the abstraction presented for 1317 the ADB. The additional fields present are to allow the transmission 1318 of adb_block_count ADBs at one time. We also use adb_block_num to 1319 convey the ADBN of the first block in the sequence. Each ADB will 1320 contain the same adb_pattern string. 1322 As both adb_block_num and adb_pattern are optional, if either 1323 adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX, 1324 then the corresponding field is not set in any of the ADBs.
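As a non-normative illustration, a client initializing the example application format of Section 5.3 might fill in app_data_block4 as follows (C designated initializers, assuming rpcgen-generated definitions; the opaque adb_pattern handling is elided):

   /* Describe 100 4k ADBs starting at offset 0, with the ADBN at
    * byte 0 of each block and the guard pattern at byte 8. */
   struct app_data_block4 adb = {
       .adb_offset          = 0,      /* start of the file */
       .adb_block_size      = 4096,   /* 4k ADBs */
       .adb_block_count     = 100,    /* send 100 ADBs at once */
       .adb_reloff_blocknum = 0,      /* ADBN stored at byte 0 */
       .adb_block_num       = 0,      /* ADBN of the first block */
       .adb_reloff_pattern  = 8,      /* guard pattern at byte 8 */
       /* adb_pattern would carry the guard bytes, e.g. 0xfeedface */
   };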
1326 5.1.2. Data Content 1328 /* 1329 * Use an enum such that we can extend with new types. 1330 */ 1331 enum data_content4 { 1332 NFS4_CONTENT_DATA = 0, 1333 NFS4_CONTENT_APP_BLOCK = 1, 1334 NFS4_CONTENT_HOLE = 2 1335 }; 1337 New operations might need to differentiate between wanting to access 1338 data versus an ADB. Also, future minor versions might want to 1339 introduce new data formats. This enumeration allows that to occur. 1341 5.2. pNFS Considerations 1343 While this document does not mandate how sparse ADBs are recorded on 1344 the server, it does make the assumption that such information is not 1345 in the file. I.e., the information is metadata. As such, the 1346 INITIALIZE operation is defined to be not supported by the DS - it 1347 must be issued to the MDS. But since the client cannot know a 1348 priori whether a read is sparse or not, the READ_PLUS operation MUST 1349 be supported by both the DS and the MDS. I.e., the client might 1350 require the MDS to asynchronously read the data from the DS. 1352 Furthermore, each DS MUST NOT report to a client either a sparse ADB 1353 or data which belongs to another DS. One implication of this 1354 requirement is that the app_data_block4's adb_block_size MUST either 1355 be the stripe width or the stripe width MUST be an even 1356 multiple of it. 1358 The second implication here is that the DS must be able to use the 1359 Control Protocol to determine from the MDS where the sparse ADBs 1360 occur. [[Comment.4: Need to discuss what happens if an INITIALIZE 1361 occurs after the file has been written to. --TH]] Perhaps instead 1362 of the DS pulling from the MDS, the MDS pushes to the DS? Thus an 1363 INITIALIZE causes a new push? [[Comment.5: Still need to consider 1364 race cases of the DS getting a WRITE and the MDS getting an 1365 INITIALIZE. --TH]] 1367 5.3. An Example of Detecting Corruption 1369 In this section, we define an ADB format in which corruption can be 1370 detected. Note that this is just one possible format and means to 1371 detect corruption. 1373 Consider a very basic implementation of an operating system's disk 1374 blocks. A block is either data or it is an indirect block which 1375 allows for files to be larger than one block. It is desired to be 1376 able to initialize a block. Lastly, to quickly unlink a file, a 1377 block can be marked invalid. The contents remain intact - which 1378 would enable this OS application to undelete a file. 1380 The application defines 4k sized data blocks, with an 8 byte block 1381 counter occurring at offset 0 in the block, and with the guard 1382 pattern occurring at offset 8 inside the block. Furthermore, the 1383 guard pattern can take one of four states: 1385 0xfeedface - This is the FREE state and indicates that the ADB 1386 format has been applied. 1388 0xcafedead - This is the DATA state and indicates that real data 1389 has been written to this block. 1391 0xe4e5c001 - This is the INDIRECT state and indicates that the 1392 block contains block counter numbers that are chained off of this 1393 block. 1395 0xba1ed4a3 - This is the INVALID state and indicates that the block 1396 contains data whose contents are garbage. 1398 Finally, it also defines an 8 byte checksum [20] starting at byte 16 1399 which applies to the remaining contents of the block. If the state 1400 is FREE, then that checksum is trivially zero. As such, the 1401 application has no need to expose the checksum to the transfer 1402 layer - it need not make the transfer layer aware of the fact that 1403 there is a checksum (see [18] for an example of checksums used to 1404 detect corruption in application data blocks).
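A minimal C sketch of how such an application might apply the checks enumerated in the list that follows; the layout constants match the example format above, checksum() stands in for the application's checksum [20], byte order handling is elided, and the indirect-block checks are omitted:

   #include <stdint.h>
   #include <string.h>

   #define GUARD_FREE     0xfeedfaceUL
   #define GUARD_DATA     0xcafedeadUL
   #define GUARD_INDIRECT 0xe4e5c001UL
   #define GUARD_INVALID  0xba1ed4a3UL

   extern uint64_t checksum(const uint8_t *buf, size_t len);

   /* Returns non-zero if a 4k block fails the example format's
    * corruption checks (indirect-block number checks elided). */
   int adb_is_corrupt(const uint8_t blk[4096])
   {
       uint32_t guard;
       uint64_t stored;
       memcpy(&guard, blk + 8, sizeof(guard));    /* guard at byte 8 */
       memcpy(&stored, blk + 16, sizeof(stored)); /* checksum at 16 */

       switch (guard) {
       case GUARD_FREE:
           /* FREE: the checksum is trivially zero and the remainder
            * of the block must be all zeroes */
           if (stored != 0)
               return 1;
           for (size_t i = 24; i < 4096; i++)
               if (blk[i] != 0)
                   return 1;
           return 0;
       case GUARD_DATA:
       case GUARD_INDIRECT:
       case GUARD_INVALID:
           /* stored checksum must match the computed checksum */
           return stored != checksum(blk + 24, 4096 - 24);
       default:
           return 1;  /* unknown guard value, including all zeros */
       }
   }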
1406 Corruption in each ADB can be detected as follows: 1408 o If the guard pattern is anything other than one of the allowed 1409 values, including all zeros. 1411 o If the guard pattern is FREE and any other byte in the remainder 1412 of the ADB is anything other than zero. 1414 o If the guard pattern is anything other than FREE, then if the 1415 stored checksum does not match the computed checksum. 1417 o If the guard pattern is INDIRECT and one of the stored indirect 1418 block numbers has a value greater than the number of ADBs in the 1419 file. 1421 o If the guard pattern is INDIRECT and one of the stored indirect 1422 block numbers is a duplicate of another stored indirect block 1423 number. 1425 As can be seen, the application can detect errors based on the 1426 combination of the guard pattern state and the checksum. But also, 1427 the application can detect corruption based on the state and the 1428 contents of the ADB. This last point is important in validating the 1429 minimum amount of data we incorporated into our generic framework. 1430 I.e., the guard pattern is sufficient to allow applications to 1431 design their own corruption detection. 1433 Finally, it is important to note that none of these corruption checks 1434 occur in the transport layer. The server and client components are 1435 totally unaware of the file format and might report everything as 1436 being transferred correctly even in the case where the application 1437 detects corruption. 1439 5.4. Example of READ_PLUS 1441 The hypothetical application presented in Section 5.3 can be used to 1442 illustrate how READ_PLUS would return an array of results. A file is 1443 created and initialized with 100 4k ADBs in the FREE state: 1445 INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface} 1447 Further, assume the application writes a single ADB at 16k, changing 1448 the guard pattern to 0xcafedead; we would then have in memory: 1450 0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface 1451 16k -> (20k - 1) : 00 00 00 00 00 00 00 04 ca fe de ad XX XX ... XX XX 1452 20k -> (400k - 1) : 4k, 95, 0, 5, 8, 0xfeedface 1454 And when the client did a READ_PLUS of 64k at the start of the file, 1455 it would get back a result of an ADB, some data, and a final ADB: 1457 ADB {0, 4k, 4, 0, 0, 8, 0xfeedface} 1458 data 4k 1459 ADB {20k, 4k, 11, 0, 5, 8, 0xfeedface} 1461 5.5. Zero Filled Holes 1463 As applications are free to define the structure of an ADB, it is 1464 trivial to define an ADB which supports zero filled holes. Such a 1465 case would encompass the traditional definitions of a sparse file and 1466 hole punching. For example, to punch a 64k hole, starting at 100M, 1467 into an existing file which has no ADB structure: 1469 INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX, 1470 0, NFS4_UINT64_MAX, 0x0} 1472 6. Space Reservation 1474 6.1. Introduction 1476 This section describes a set of operations that allow applications 1477 such as hypervisors to reserve space for a file, report the amount of 1478 actual disk space a file occupies, and free up the backing space of a 1479 file when it is not required. 1481 In virtualized environments, virtual disk files are often stored on 1482 NFS mounted volumes.
Since virtual disk files represent the hard 1483 disks of virtual machines, hypervisors often have to guarantee 1484 certain properties for the file. 1486 One such example is space reservation. When a hypervisor creates a 1487 virtual disk file, it often tries to preallocate the space for the 1488 file so that there are no future allocation related errors during the 1489 operation of the virtual machine. Such errors prevent a virtual 1490 machine from continuing execution and result in downtime. 1492 Another useful feature would be the ability to report the number of 1493 blocks that would be freed when a file is deleted. Currently, NFS 1494 reports two size attributes: 1496 size The logical file size of the file. 1498 space_used The size in bytes that the file occupies on disk. 1500 While these attributes are sufficient for space accounting in 1501 traditional filesystems, they prove to be inadequate in modern 1502 filesystems that support block sharing. Having a way to determine the 1503 number of blocks that would be freed if the file were deleted would be 1504 useful to applications that wish to migrate files when a volume is 1505 low on space. 1507 Since virtual disks represent a hard drive in a virtual machine, a 1508 virtual disk can be viewed as a filesystem within a file. Since not 1509 all blocks within a filesystem are in use, there is an opportunity to 1510 reclaim blocks that are no longer in use. A call to deallocate 1511 blocks could result in better space efficiency. Less space might be 1512 consumed for backups after block deallocation. 1514 We propose the following operations and attributes for the 1515 aforementioned use cases: 1517 space_reserved This attribute specifies whether the blocks backing 1518 the file have been preallocated. 1520 space_freed This attribute specifies the space freed when a file is 1521 deleted, taking block sharing into consideration. 1523 max_hole_punch This attribute specifies the maximum sized hole that 1524 can be punched on the filesystem. 1526 HOLE_PUNCH This operation zeroes and/or deallocates the blocks 1527 backing a region of the file. 1529 6.2. Use Cases 1530 6.2.1. Space Reservation 1532 Some applications require that once a file of a certain size is 1533 created, writes to that file never fail with an out of space 1534 condition. One such example is that of a hypervisor writing to a 1535 virtual disk. An out of space condition while writing to virtual 1536 disks would mean that the virtual machine would need to be frozen. 1538 Currently, in order to achieve such a guarantee, applications zero 1539 the entire file. The initial zeroing allocates the backing blocks 1540 and all subsequent writes are overwrites of already allocated blocks. 1541 This approach is not only inefficient in terms of the amount of I/O 1542 done, it is also not guaranteed to work on filesystems that are log 1543 structured or deduplicated. An efficient way of guaranteeing space 1544 reservation would be beneficial to such applications. 1546 If the space_reserved attribute is set on a file, it is guaranteed 1547 that writes that do not grow the file will not fail with 1548 NFS4ERR_NOSPC. 1550 6.2.2. Space freed on deletes 1552 Currently, files in NFS have two size attributes: 1554 size The logical file size of the file. 1556 space_used The size in bytes that the file occupies on disk. 1558 While these attributes are sufficient for space accounting in 1559 traditional filesystems, they prove to be inadequate in modern 1560 filesystems that support block sharing.
In such filesystems, 1561 multiple inodes can point to a single block with a block reference 1562 count to guard against premature freeing. 1564 If space_used of a file is interpreted to mean the size in bytes of 1565 all disk blocks pointed to by the inode of the file, then shared 1566 blocks get double counted, over-reporting the space utilization. 1567 This also has the adverse effect that the deletion of a file with 1568 shared blocks frees up less than space_used bytes. 1570 On the other hand, if space_used is interpreted to mean the size in 1571 bytes of those disk blocks unique to the inode of the file, then 1572 shared blocks are not counted in any file, resulting in under- 1573 reporting of the space utilization. 1575 For example, two files A and B have 10 blocks each. Let 6 of these 1576 blocks be shared between them. Thus, the combined space utilized by 1577 the two files is 14 * BLOCK_SIZE bytes. In the former case, the 1578 combined space utilization of the two files would be reported as 20 * 1579 BLOCK_SIZE. However, deleting either would only result in 4 * 1580 BLOCK_SIZE being freed. Conversely, the latter interpretation would 1581 report that the space utilization is only 8 * BLOCK_SIZE. 1583 Adding another size attribute, space_freed, is helpful in solving 1584 this problem. space_freed is the number of blocks that are allocated 1585 to the given file that would be freed on its deletion. In the 1586 example, both A and B would report space_freed as 4 * BLOCK_SIZE and 1587 space_used as 10 * BLOCK_SIZE. If A is deleted, B will report 1588 space_freed as 10 * BLOCK_SIZE as the deletion of B would result in 1589 the deallocation of all 10 blocks. 1591 The addition of this attribute does not solve the problem of space 1592 being over-reported. However, over-reporting is better than under- 1593 reporting. 1595 6.2.3. Operations and attributes 1597 In the sections that follow, one operation and three attributes are 1598 defined that together provide the space management facilities 1599 outlined earlier in the document. The operation is intended to be 1600 OPTIONAL and the attributes RECOMMENDED as defined in section 17 of 1601 [2]. 1603 6.2.4. Attribute 77: space_reserved 1605 The space_reserved attribute is a read/write attribute of type 1606 boolean. It is a per file attribute. When the space_reserved 1607 attribute is set via SETATTR, the server must ensure that there is 1608 disk space to accommodate every byte in the file before it can return 1609 success. If the server cannot guarantee this, it must return 1610 NFS4ERR_NOSPC. 1612 If the client tries to grow a file which has the space_reserved 1613 attribute set, the server must guarantee that there is disk space to 1614 accommodate every byte in the file with the new size before it can 1615 return success. If the server cannot guarantee this, it must return 1616 NFS4ERR_NOSPC. 1618 It is not required that the server allocate the space to the file 1619 before returning success. The allocation can be deferred, however, 1620 it must be guaranteed that it will not fail for lack of space. 1622 The value of space_reserved can be obtained at any time through 1623 GETATTR.
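The guarantee can be modeled with the following non-normative C sketch; reserve_blocks() is an assumed filesystem hook, and the status constants are assumed to come from the protocol's generated headers:

   #include <stdint.h>

   extern int reserve_blocks(uint64_t nblocks);  /* 0 on success */

   /* Toy model of honoring SETATTR of space_reserved: the server
    * must be able to guarantee backing space for every byte of the
    * file, though the actual allocation may be deferred. */
   int set_space_reserved(uint64_t file_size, uint64_t block_size)
   {
       uint64_t need = (file_size + block_size - 1) / block_size;
       if (reserve_blocks(need) != 0)
           return NFS4ERR_NOSPC;  /* cannot make the guarantee */
       return NFS4_OK;
   }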
1625 In order to avoid ambiguity, the space_reserved bit cannot be set 1626 along with the size bit in SETATTR. Increasing the size of a file 1627 with space_reserved set will fail if space reservation cannot be 1628 guaranteed for the new size. If the file size is decreased, space 1629 reservation is only guaranteed for the new size and the extra blocks 1630 backing the file can be released. 1632 6.2.5. Attribute 78: space_freed 1634 space_freed gives the number of bytes freed if the file is deleted. 1635 This attribute is read only and is of type length4. It is a per file 1636 attribute. 1638 6.2.6. Attribute 79: max_hole_punch 1640 max_hole_punch specifies the maximum size of a hole that the 1641 HOLE_PUNCH operation can handle. This attribute is read only and of 1642 type length4. It is a per filesystem attribute. This attribute MUST 1643 be implemented if HOLE_PUNCH is implemented. 1645 6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing 1646 the file in the specified range. 1648 WARNING: Most of this section is now obsolete. Parts of it need to 1649 be scavenged for the ADB discussion, but for the most part, it cannot 1650 be trusted. 1652 6.2.7.1. DESCRIPTION 1654 Whenever a client wishes to deallocate the blocks backing a 1655 particular region in the file, it calls the HOLE_PUNCH operation with 1656 the current filehandle set to the filehandle of the file in question, 1657 and the start offset and length in bytes of the region set in hpa_offset 1658 and hpa_count, respectively. All further reads to this region MUST return 1659 zeros until overwritten. The filehandle specified must be that of a 1660 regular file. 1662 Situations may arise where hpa_offset and/or hpa_offset + hpa_count 1663 will not be aligned to a boundary at which the server performs 1664 allocations/deallocations. For most filesystems, this is the block size of 1665 the file system. In such a case, the server can deallocate as many 1666 bytes as it can in the region. The blocks that cannot be deallocated 1667 MUST be zeroed. Except for the block deallocation and maximum hole 1668 punching capability, a HOLE_PUNCH operation is to be treated similarly 1669 to a write of zeroes. 1671 The server is not required to complete deallocating the blocks 1672 specified in the operation before returning. It is acceptable to 1673 have the deallocation be deferred. In fact, HOLE_PUNCH is merely a 1674 hint; it is valid for a server to return success without ever doing 1675 anything towards deallocating the blocks backing the region 1676 specified. However, any future reads to the region MUST return 1677 zeroes. 1679 HOLE_PUNCH will result in the space_used attribute being decreased by 1680 the number of bytes that were deallocated. The space_freed attribute 1681 may or may not decrease, depending on the support and whether the 1682 blocks backing the specified range were shared or not. The size 1683 attribute will remain unchanged. 1685 The HOLE_PUNCH operation MUST NOT change the space reservation 1686 guarantee of the file. While the server can deallocate the blocks 1687 specified by hpa_offset and hpa_count, future writes to this region 1688 MUST NOT fail with NFS4ERR_NOSPC. 1690 The HOLE_PUNCH operation may fail for the following reasons (this is 1691 a partial list): 1693 NFS4ERR_NOTSUPP The hole punch operation is not supported by the 1694 NFS server receiving this request. 1696 NFS4ERR_ISDIR The current filehandle is of type NF4DIR. 1698 NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. 1700 NFS4ERR_WRONG_TYPE The current filehandle does not designate an 1701 ordinary file.
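The argument structure for HOLE_PUNCH is not reproduced in this revision; the following sketch is an illustration only, consistent with the hpa_offset and hpa_count fields referenced above and with the document's XDR conventions:

   struct HOLE_PUNCH4args {
           /* CURRENT_FH: file */
           offset4  hpa_offset;  /* first byte of the region */
           length4  hpa_count;   /* length of the region in bytes */
   };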
1703 7. Sparse Files 1705 WARNING: Some of this section needs to be reworked because of the 1706 work going on in the ADB section. 1708 7.1. Introduction 1710 A sparse file is a common way of representing a large file without 1711 having to utilize all of the disk space for it. Consequently, a 1712 sparse file uses less physical space than its size indicates. This 1713 means the file contains 'holes', byte ranges within the file that 1714 contain no data. Most modern file systems support sparse files, 1715 including most UNIX file systems and NTFS, but notably not Apple's 1716 HFS+. Common examples of sparse files include Virtual Machine (VM) 1717 OS/disk images, database files, log files, and even checkpoint 1718 recovery files most commonly used by the HPC community. 1720 If an application reads a hole in a sparse file, the file system must 1721 return all zeros to the application. For local data access there is 1722 little penalty, but with NFS these zeroes must be transferred back to 1723 the client. If an application uses the NFS client to read data into 1724 memory, this wastes time and bandwidth as the application waits for 1725 the zeroes to be transferred. 1727 A sparse file is typically created by initializing the file to be all 1728 zeros - nothing is written to the data in the file, instead the hole 1729 is recorded in the metadata for the file. So an 8G disk image might 1730 be represented initially by a couple hundred bits in the inode and 1731 nothing on the disk. If the VM then writes 100M to a file in the 1732 middle of the image, there would now be two holes represented in the 1733 metadata and 100M in the data. 1735 Other applications want to initialize a file to patterns other than 1736 zero. The problem with initializing to zero is that it is often 1737 difficult to distinguish a byte-range initialized to all zeroes 1738 from data corruption, since a pattern of zeroes is a probable pattern 1739 for corruption. Instead, some applications, such as database 1740 management systems, use a pattern consisting of bytes or words of non- 1741 zero values. 1743 Besides reading sparse files and initializing them, applications 1744 might want to hole punch, which is the deallocation of the data 1745 blocks which back a region of the file. At such time, the affected 1746 blocks are reinitialized to a pattern. 1748 This section introduces a new operation to read patterns from a file, 1749 READ_PLUS, and a new operation to both initialize patterns and to 1750 punch pattern holes into a file, WRITE_PLUS. READ_PLUS supports all 1751 the features of READ but includes an extension to support sparse 1752 pattern files. READ_PLUS is guaranteed to perform no worse than 1753 READ, and can dramatically improve performance with sparse files. 1754 READ_PLUS does not depend on pNFS protocol features, but can be used 1755 by pNFS to support sparse files. 1757 7.2. Terminology 1759 Regular file: An object of file type NF4REG or NF4NAMEDATTR. 1761 Sparse file: A Regular file that contains one or more Holes. 1763 Hole: A byte range within a Sparse file that contains regions of all 1764 zeroes. For block-based file systems, this could also be an 1765 unallocated region of the file. 1767 Hole Threshold: The minimum length of a Hole as determined by the 1768 server. If a server chooses to define a Hole Threshold, then it 1769 would not return hole information (nfs_readplusreshole) with a 1770 hole_offset and hole_length that specify a range shorter than the 1771 Hole Threshold. 1773 7.3. Applications and Sparse Files 1775 Applications may cause an NFS client to read holes in a file for 1776 several reasons.
This section describes three different application 1777 workloads that cause the NFS client to transfer data unnecessarily. 1778 These workloads are simply examples, and there are probably many more 1779 workloads that are negatively impacted by sparse files. 1781 The first workload that can cause holes to be read is sequential 1782 reads within a sparse file. When this happens, the NFS client may 1783 perform read requests ("readahead") into sections of the file not 1784 explicitly requested by the application. Since the NFS client cannot 1785 differentiate between holes and non-holes, the NFS client may 1786 prefetch empty sections of the file. 1788 This workload is exemplified by Virtual Machines and their associated 1789 file system images, e.g., VMware .vmdk files, which are large sparse 1790 files encapsulating an entire operating system. If a VM reads files 1791 within the file system image, this will translate to sequential NFS 1792 read requests into the much larger file system image file. Since NFS 1793 does not understand the internals of the file system image, it ends 1794 up performing readahead of file holes. 1796 The second workload is generated by copying a file from a directory 1797 in NFS to the same NFS server, to another file system, e.g., 1798 another NFS or Samba server, to a local ext3 file system, or even to a 1799 network socket. In this case, bandwidth and server resources are 1800 wasted as the entire file is transferred from the NFS server to the 1801 NFS client. Once a byte range of the file has been transferred to 1802 the client, it is up to the client application, e.g., rsync, cp, scp, 1803 how it writes the data to the target location. For example, cp 1804 supports sparse files and will not write all zero regions, whereas 1805 scp does not support sparse files and will transfer every byte of the 1806 file. 1808 The third workload is generated by applications that do not utilize 1809 the NFS client cache, but instead use direct I/O and manage cached 1810 data independently, e.g., databases. These applications may perform 1811 whole file caching with sparse files, which would mean that even the 1812 holes will be transferred to the clients and cached. 1814 7.4. Overview of Sparse Files and NFSv4 1816 This proposal seeks to provide sparse file support to the largest 1817 number of NFS client and server implementations, and as such proposes 1818 to add a new READ_PLUS operation with an extended result 1819 instead of proposing additions or extensions of new or existing 1820 optional features (such as pNFS). 1822 As well, this document seeks to ensure that the proposed extensions 1823 are simple and do not transfer data between the client and server 1824 unnecessarily. For example, one possible way to implement sparse 1825 file read support would be to have the client, on the first hole 1826 encountered or at OPEN time, request a Data Region Map from the 1827 server. A Data Region Map would specify all zero and non-zero 1828 regions in a file. While this option seems simple, it is less useful 1829 and can become inefficient and cumbersome for several reasons: 1831 o Data Region Maps can be large, and transferring them can reduce 1832 overall read performance. For example, VMware's .vmdk files can 1833 have a file size of over 100 GB and have a map well over several 1834 MB. 1836 o Data Region Maps can change frequently, and become invalidated on 1837 every write to the file.
NFSv4 has a single change attribute, 1838 which means any change to any region of a file will invalidate all 1839 Data Region Maps. This can result in the map being transferred 1840 multiple times with each update to the file. For example, a VM 1841 that updates a config file in its file system image would 1842 invalidate the Data Region Map not only for itself, but for all 1843 other clients accessing the same file system image. 1845 o Data Region Maps do not handle all zero-filled sections of the 1846 file, reducing the effectiveness of the solution. While it may be 1847 possible to modify the maps to handle zero-filled sections (at 1848 possibly great effort to the server), it is almost impossible with 1849 pNFS. With pNFS, the owner of the Data Region Map is the metadata 1850 server, which is not in the data path and has no knowledge of the 1851 contents of a data region. 1853 Another way to handle holes is compression, but this is not ideal since 1854 it requires all implementations to agree on a single compression 1855 algorithm and requires a fair amount of computational overhead. 1857 Note that supporting writing to a sparse file does not require 1858 changes to the protocol. Applications and/or NFS implementations can 1859 choose to ignore WRITE requests of all zeroes to the NFS server 1860 without consequence. 1862 7.5. Operation 65: READ_PLUS 1864 This section introduces a new read operation, named READ_PLUS, which 1865 allows NFS clients to avoid reading holes in a sparse file. 1866 READ_PLUS is guaranteed to perform no worse than READ, and can 1867 dramatically improve performance with sparse files. 1869 READ_PLUS supports all the features of the existing NFSv4.1 READ 1870 operation [2] and adds a simple yet significant extension to the 1871 format of its response. The change allows the server to avoid 1872 returning all zeroes from a file hole, which wastes computational and 1873 network resources and reduces performance. READ_PLUS uses a new 1874 result structure that tells the client that the result is all zeroes 1875 AND the byte-range of the hole in which the request was made. 1876 Returning the hole's byte-range, and only upon request, avoids 1877 transferring large Data Region Maps that may be soon invalidated and 1878 contain information about a file that may not even be read in its 1879 entirety. 1881 A new read operation is required due to NFSv4.1 minor versioning 1882 rules that do not allow modification of an existing operation's 1883 arguments or results. READ_PLUS is designed in such a way as to allow 1884 future extensions to the result structure. The same approach could 1885 be taken to extend the argument structure, but a good use case is 1886 first required to make such a change. 1888 7.5.1. ARGUMENT 1890 struct READ_PLUS4args { 1891 /* CURRENT_FH: file */ 1892 stateid4 rpa_stateid; 1893 offset4 rpa_offset; 1894 count4 rpa_count; 1895 }; 1897 7.5.2. RESULT 1899 union read_plus_content switch (data_content4 content) { 1900 case NFS4_CONTENT_DATA: 1901 opaque rpc_data<>; 1902 case NFS4_CONTENT_APP_BLOCK: 1903 app_data_block4 rpc_block; 1904 case NFS4_CONTENT_HOLE: 1905 hole_info4 rpc_hole; 1906 default: 1907 void; 1908 }; 1910 /* 1911 * Allow a return of an array of contents. 1912 */ 1913 struct read_plus_res4 { 1914 bool rpr_eof; 1915 read_plus_content rpr_contents<>; 1916 }; 1918 union READ_PLUS4res switch (nfsstat4 status) { 1919 case NFS4_OK: 1920 read_plus_res4 resok4; 1921 default: 1922 void; 1923 };
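To illustrate how a client might consume the result array, here is a non-normative C sketch using rpcgen-style names; deliver_data(), fill_zeroes(), and expand_adb() are assumed client-internal helpers:

   extern void deliver_data(const void *data);
   extern void fill_zeroes(const void *hole);
   extern void expand_adb(const void *adb);

   /* Walk a decoded read_plus_res4, dispatching on each content type. */
   void consume_read_plus(const read_plus_res4 *res)
   {
       for (u_int i = 0; i < res->rpr_contents.rpr_contents_len; i++) {
           const read_plus_content *c =
               &res->rpr_contents.rpr_contents_val[i];
           switch (c->content) {
           case NFS4_CONTENT_DATA:
               deliver_data(&c->read_plus_content_u.rpc_data);
               break;
           case NFS4_CONTENT_HOLE:
               fill_zeroes(&c->read_plus_content_u.rpc_hole);
               break;
           case NFS4_CONTENT_APP_BLOCK:
               expand_adb(&c->read_plus_content_u.rpc_block);
               break;
           default:
               break;  /* ignore unknown future content types */
           }
       }
       /* res->rpr_eof indicates end-of-file, as with READ */
   }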
1925 7.5.3. DESCRIPTION 1927 The READ_PLUS operation is based upon the NFSv4.1 READ operation [2], 1928 and similarly reads data from the regular file identified by the 1929 current filehandle. 1931 The client provides an offset of where the READ_PLUS is to start and 1932 a count of how many bytes are to be read. An offset of zero means to 1933 read data starting at the beginning of the file. If offset is 1934 greater than or equal to the size of the file, the status NFS4_OK is 1935 returned with nfs_readplusrestype4 set to READ_OK, data length set to 1936 zero, and eof set to TRUE. The READ_PLUS is subject to access 1937 permissions checking. 1939 If the client specifies a count value of zero, the READ_PLUS succeeds 1940 and returns zero bytes of data, again subject to access permissions 1941 checking. In all situations, the server may choose to return fewer 1942 bytes than specified by the client. The client needs to check for 1943 this condition and handle the condition appropriately. 1945 If the client specifies an offset and count value that is entirely 1946 contained within a hole of the file, the status NFS4_OK is returned 1947 with nfs_readplusresok4 set to READ_HOLE, and if information is 1948 available regarding the hole, a nfs_readplusreshole structure 1949 containing the offset and range of the entire hole. The 1950 nfs_readplusreshole structure is considered valid until the file is 1951 changed (detected via the change attribute). The server MUST provide 1952 the same semantics for nfs_readplusreshole as if the client read the 1953 region and received zeroes; the implied hole's contents' lifetime MUST 1954 be exactly the same as any other read data. 1956 If the client specifies an offset and count value that begins in a 1957 non-hole of the file but extends into a hole, the server should return a 1958 short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK, 1959 and data length set to the number of bytes returned. The client will 1960 then issue another READ_PLUS for the remaining bytes, to which the 1961 server will respond with information about the hole in the file. 1963 If the server knows that the requested byte range is into a hole of 1964 the file, but has no further information regarding the hole, it 1965 returns a nfs_readplusreshole structure with holeres4 set to 1966 HOLE_NOINFO. 1968 If hole information is available and can be returned to the client, 1969 the server returns a nfs_readplusreshole structure with the value of 1970 holeres4 set to HOLE_INFO. The values of hole_offset and hole_length 1971 define the byte-range for the current hole in the file. These values 1972 represent the information known to the server and may describe a 1973 byte-range smaller than the true size of the hole. 1975 Except when special stateids are used, the stateid value for a 1976 READ_PLUS request represents a value returned from a previous byte- 1977 range lock or share reservation request or the stateid associated 1978 with a delegation. The stateid identifies the associated owners, if 1979 any, and is used by the server to verify that the associated locks are 1980 still valid (e.g., have not been revoked). 1982 If the read ended at the end-of-file (formally, in a correctly formed 1983 READ_PLUS operation, if offset + count is equal to the size of the 1984 file), or the READ_PLUS operation extends beyond the size of the file 1985 (if offset + count is greater than the size of the file), eof is 1986 returned as TRUE; otherwise, it is FALSE.
A successful READ_PLUS of 1987 an empty file will always return eof as TRUE. 1989 If the current filehandle is not an ordinary file, an error will be 1990 returned to the client. In the case that the current filehandle 1991 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 1992 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 1993 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 1995 For a READ_PLUS with a stateid value of all bits equal to zero, the 1996 server MAY allow the READ_PLUS to be serviced subject to mandatory 1997 byte-range locks or the current share deny modes for the file. For a 1998 READ_PLUS with a stateid value of all bits equal to one, the server 1999 MAY allow READ_PLUS operations to bypass locking checks at the 2000 server. 2002 On success, the current filehandle retains its value. 2004 7.5.4. IMPLEMENTATION 2006 If the server returns a "short read" (i.e., less data than requested 2007 and eof set to FALSE), the client should send another READ_PLUS to 2008 get the remaining data. A server may return less data than requested 2009 under several circumstances. The file may have been truncated by 2010 another client or perhaps on the server itself, changing the file 2011 size from what the requesting client believes to be the case. This 2012 would reduce the actual amount of data available to the client. It 2013 is possible that the server may reduce the transfer size and so return a 2014 short read result. Server resource exhaustion may also result in a 2015 short read. 2017 If mandatory byte-range locking is in effect for the file, and if the 2018 byte-range corresponding to the data to be read from the file is 2019 WRITE_LT locked by an owner not associated with the stateid, the 2020 server will return the NFS4ERR_LOCKED error. The client should try 2021 to get the appropriate READ_LT via the LOCK operation before re- 2022 attempting the READ_PLUS. When the READ_PLUS completes, the client 2023 should release the byte-range lock via LOCKU. In addition, the 2024 server MUST return a nfs_readplusreshole structure with values of 2025 hole_offset and hole_length that are within the owner's locked byte 2026 range. 2028 If another client has an OPEN_DELEGATE_WRITE delegation for the file 2029 being read, the delegation must be recalled, and the operation cannot 2030 proceed until that delegation is returned or revoked. Except where 2031 this happens very quickly, one or more NFS4ERR_DELAY errors will be 2032 returned to requests made while the delegation remains outstanding. 2033 Normally, delegations will not be recalled as a result of a READ_PLUS 2034 operation since the recall will occur as a result of an earlier OPEN. 2035 However, since it is possible for a READ_PLUS to be done with a 2036 special stateid, the server needs to check for this case even though 2037 the client should have done an OPEN previously. 2039 7.5.4.1. Additional pNFS Implementation Information 2041 With pNFS, the semantics of using READ_PLUS remains the same. Any 2042 data server MAY return a READ_HOLE result for a READ_PLUS request 2043 that it receives. 2045 When a data server chooses to return a READ_HOLE result, it has the 2046 option of returning hole information for the data stored on that data 2047 server (as defined by the data layout), but it MUST NOT return a 2048 nfs_readplusreshole structure with a byte range that includes data 2049 managed by another data server.
2051 1. Data servers that cannot determine hole information SHOULD return 2052 HOLE_NOINFO. 2054 2. For data servers that can obtain hole information for the parts of 2055 the file stored on that data server, the data server SHOULD 2056 return HOLE_INFO and the byte range of the hole stored on that 2057 data server. 2059 A data server should do its best to return as much information about 2060 a hole as is feasible without having to contact the metadata server. 2061 If communication with the metadata server is required, then every 2062 attempt should be made to minimize the number of requests. 2064 If mandatory locking is enforced, then the data server must also 2065 ensure that it returns only information for a Hole that is within the 2066 owner's locked byte range. 2068 7.5.5. READ_PLUS with Sparse Files Example 2070 To see how the return value READ_HOLE will work, the following table 2071 describes a sparse file. For each byte range, the file contains 2072 either non-zero data or a hole. In addition, the server in this 2073 example uses a hole threshold of 32K. 2075 +-------------+----------+ 2076 | Byte-Range | Contents | 2077 +-------------+----------+ 2078 | 0-15999 | Hole | 2079 | 16K-31999 | Non-Zero | 2080 | 32K-255999 | Hole | 2081 | 256K-287999 | Non-Zero | 2082 | 288K-353999 | Hole | 2083 | 354K-417999 | Non-Zero | 2084 +-------------+----------+ 2086 Table 1 2088 Under the given circumstances, if a client were to read the file from 2089 beginning to end with a max read size of 64K, the following will be 2090 the result. This assumes the client has already opened the file and 2091 acquired a valid stateid and just needs to issue READ_PLUS requests. 2093 1. READ_PLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof 2094 = false, data<>[32K]. Return a short read, as the last half of 2095 the request was all zeroes. Note that the first hole is read 2096 back as all zeros as it is below the hole threshold. 2098 2. READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 2099 nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range 2100 was all zeros, and the current hole begins at offset 32K and is 2101 224K in length. 2103 3. READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 2104 eof = false, data<>[32K]. Return a short read, as the last half 2105 of the request was all zeroes. 2107 4. READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 2108 nfs_readplusreshole(HOLE_INFO)(288K, 66K). 2110 5. READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 2111 eof = true, data<>[64K]. 2113 7.6. Related Work 2115 Solaris and ZFS support an extension to lseek(2) that allows 2116 applications to discover holes in a file. The values, SEEK_HOLE and 2117 SEEK_DATA, allow clients to seek to the next hole or beginning of 2118 data, respectively. 2120 XFS supports the XFS_IOC_GETBMAP ioctl, which returns 2121 the Data Region Map for a file. Clients can then use this 2122 information to avoid reading holes in a file. 2124 NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows 2125 applications to control whether empty regions of the file are 2126 preallocated and filled in with zeros or simply left unallocated. 2128 7.7. Other Proposed Designs 2130 7.7.1. Multi-Data Server Hole Information 2132 The current design prohibits pNFS data servers from returning hole 2133 information for regions of a file that are not stored on that data 2134 server.
Having data servers return information regarding other data 2135 servers changes the fundamental principle that all metadata 2136 information comes from the metadata server. 2138 Here is a brief description of what would change if we chose to support 2139 multi-data server hole information: 2141 For a data server that can obtain hole information for the entire 2142 file without severe performance impact, it MAY return HOLE_INFO and 2143 the byte range of the entire file hole. When a pNFS client receives 2144 a READ_HOLE result and a non-empty nfs_readplusreshole structure, it 2145 MAY use this information in conjunction with a valid layout for the 2146 file to determine the next data server for the next region of data 2147 that is not in a hole. 2149 7.7.2. Data Result Array 2151 If a single read request contains one or more Holes with a length 2152 greater than the Hole Threshold, the current design would return 2153 results indicating a short read to the client. A client would then 2154 send a series of read requests to the server to retrieve information 2155 for the Holes and the remaining data. To avoid turning a single read 2156 request into several exchanges between the client and server, the 2157 server may need to choose a relatively large Hole Threshold in 2158 order to decrease the number of short reads it creates. A large 2159 Hole Threshold may miss many smaller holes, which in turn may 2160 negate the benefits of sparse read support. 2162 To avoid this situation, one option is to have the READ_PLUS 2163 operation return information for multiple holes in a single return 2164 value. This would allow several small holes to be described in a 2165 single read response without requiring multiple exchanges between 2166 the client and server. 2168 One important item to consider with returning an array of data chunks 2169 is its impact on RDMA, which may use different block sizes on the 2170 client and server (among other things). 2172 7.7.3. User-Defined Sparse Mask 2174 Add mask (instead of just zeroes). Specified by server or client? 2176 7.7.4. Allocated flag 2178 A Hole on the server may be an allocated byte-range consisting of all 2179 zeroes or may not be allocated at all. To ensure this information is 2180 properly communicated to the client, it may be beneficial to add an 2181 'alloc' flag to the HOLE_INFO section of nfs_readplusreshole. This 2182 would allow an NFS client to copy a file from one file system to 2183 another and have it more closely resemble the original. 2185 7.7.5. Dense and Sparse pNFS File Layouts 2187 The hole information returned from a data server must be understood 2188 by pNFS clients using either Dense or Sparse file layout types. Does 2189 the current READ_PLUS return value work for both layout types? Does 2190 the data server know if it is using dense or sparse so that it can 2191 return the correct hole_offset and hole_length values? 2193 8. Labeled NFS 2195 8.1. Introduction 2197 Access control models such as Unix permissions or Access Control 2198 Lists are commonly referred to as Discretionary Access Control (DAC) 2199 models. These systems base their access decisions on user identity 2200 and resource ownership. In contrast, Mandatory Access Control (MAC) 2201 models base their access control decisions on the label on the 2202 subject (usually a process) and the object it wishes to access. 2203 These labels may contain user identity information but usually 2204 contain additional information.
In DAC systems, users are free to 2205 specify the access rules for resources that they own. MAC models 2206 base their security decisions on a system wide policy established by 2207 an administrator or organization which the users do not have the 2208 ability to override. In this section, we add a MAC model to NFSv4. 2210 The first change necessary is to devise a method for transporting and 2211 storing security label data on NFSv4 file objects. Security labels 2212 have several semantics that are met by NFSv4 recommended attributes 2213 such as the ability to set the label value upon object creation. 2214 Access control on these attributes is done through a combination of 2215 two mechanisms. As with other recommended attributes on file objects, 2216 the usual DAC checks (ACLs and permission bits) will be performed to 2217 ensure that proper file ownership is enforced. In addition, a MAC 2218 system MAY be employed on the client, server, or both to enforce 2219 additional policy on what subjects may modify security label 2220 information. 2222 The second change is to provide a method for the server to notify the 2223 client that the attribute changed on an open file on the server. If 2224 the file is closed, then during the open attempt, the client will 2225 gather the new attribute value. The server MUST NOT communicate the 2226 new value of the attribute; the client MUST query it. This 2227 requirement stems from the need for the client to present sufficient 2228 access rights to query the attribute. 2230 The final change necessary is a modification to the RPC layer used in 2231 NFSv4 in the form of a new version of the RPCSEC_GSS [7] framework. 2233 In order for an NFSv4 server to apply MAC checks it must obtain 2234 additional information from the client. Several methods were 2235 explored for performing this, and it was decided that the best 2236 approach was to incorporate the ability to make security attribute 2237 assertions through the RPC mechanism. RPCSEC_GSSv3 [6] outlines a 2238 method to assert additional security information such as security 2239 labels on GSS context creation and have that data bound to all RPC 2240 requests that make use of that context. 2242 8.2. Definitions 2244 Label Format Specifier (LFS): is an identifier used by the client to 2245 establish the syntactic format of the security label and the 2246 semantic meaning of its components. These specifiers exist in a 2247 registry associated with documents describing the format and 2248 semantics of the label. 2250 Label Format Registry: is the IANA registry containing all 2251 registered LFS along with references to the documents that 2252 describe the syntactic format and semantics of the security label. 2254 Policy Identifier (PI): is an optional part of the definition of a 2255 Label Format Specifier which allows for clients and servers to 2256 identify specific security policies. 2258 Domain of Interpretation (DOI): represents an administrative 2259 security boundary, where all systems within the DOI have 2260 semantically coherent labeling. That is, a security attribute 2261 must always mean exactly the same thing anywhere within the DOI. 2263 Object: is a passive resource within the system that we wish to be 2264 protected. Objects can be entities such as files, directories, 2265 pipes, sockets, and many other system resources relevant to the 2266 protection of the system state. 2268 Subject: A subject is an active entity, usually a process, which is 2269 requesting access to an object.
2271 Multi-Level Security (MLS): is a traditional model where objects are 2272 given a sensitivity level (Unclassified, Secret, Top Secret, etc.) 2273 and a category set [21]. 2275 8.3. MAC Security Attribute 2277 MAC models base access decisions on security attributes bound to 2278 subjects and objects. This information can be a user 2279 identity for an identity-based MAC model, a sensitivity level for 2280 Multi-Level Security, or a type for Type Enforcement. These models 2281 base their decisions on different criteria but the semantics of the 2282 security attribute remain the same. The semantics required by the 2283 security attributes are listed below: 2285 o Must provide flexibility with respect to the MAC model. 2287 o Must provide the ability to atomically set security information 2288 upon object creation. 2290 o Must provide the ability to enforce access control decisions both 2291 on the client and the server. 2293 o Must not expose an object to either the client or server name 2294 space before its security information has been bound to it. 2296 NFSv4 implements the security attribute as a recommended attribute. 2297 These attributes have a fixed format and semantics, which conflicts 2298 with the flexible nature of the security attribute. To resolve this, 2299 the security attribute consists of two components. The first 2300 component is a LFS as defined in [22] to allow for interoperability 2301 between MAC mechanisms. The second component is an opaque field 2302 which is the actual security attribute data. To allow for various 2303 MAC models, NFSv4 should be used solely as a transport mechanism for 2304 the security attribute. It is the responsibility of the endpoints to 2305 consume the security attribute and make access decisions based on 2306 their respective models. In addition, creation of objects through 2307 OPEN and CREATE allows for the security attribute to be specified 2308 upon creation. By providing an atomic create and set operation for 2309 the security attribute, it is possible to enforce the second and 2310 fourth requirements. The recommended attribute FATTR4_SEC_LABEL will 2311 be used to satisfy this requirement. 2313 8.3.1. Interpreting FATTR4_SEC_LABEL 2315 The XDR [11] necessary to implement Labeled NFSv4 is presented below: 2317 const FATTR4_SEC_LABEL = 81; 2319 typedef uint32_t policy4; 2321 Figure 6 2323 struct labelformat_spec4 { 2324 policy4 lfs_lfs; 2325 policy4 lfs_pi; 2326 }; 2328 struct sec_label_attr_info { 2329 labelformat_spec4 slai_lfs; 2330 opaque slai_data<>; 2331 }; 2333 The FATTR4_SEC_LABEL attribute consists of two components, with the 2334 first component being an LFS. It serves to provide the receiving end 2335 with the information necessary to translate the security attribute 2336 into a form that is usable by the endpoint. Label Formats assigned 2337 an LFS may optionally choose to include a Policy Identifier field to 2338 allow for complex policy deployments. The LFS and Label Format 2339 Registry are described in detail in [22]. The translation used to 2340 interpret the security attribute is not specified as part of the 2341 protocol as it may depend on various factors. The second component 2342 is an opaque section which contains the data of the attribute. This 2343 component is dependent on the MAC model to interpret and enforce.
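For illustration, a client might populate the attribute as follows (C designated initializers, assuming rpcgen-generated definitions; the LFS value and label bytes are placeholders, with LFS 0 discussed in Section 8.3.5):

   /* Illustrative FATTR4_SEC_LABEL payload: LFS 0 (server stores and
    * returns labels), no Policy Identifier, and an opaque label whose
    * contents NFSv4 itself does not interpret. */
   struct sec_label_attr_info label = {
       .slai_lfs = { .lfs_lfs = 0, .lfs_pi = 0 },
       /* slai_data<> carries the MAC model's opaque label bytes,
        * e.g. an SELinux context string */
   };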
When 2347 creating or modifying a label for an object, the client needs to be 2348 guaranteed that the server will accept a label that is sized 2349 correctly. Because both the client and the server are part of a 2350 specific MAC model, the client will be aware of the size. 2352 8.3.2. Delegations 2354 In the event that a security attribute is changed on the server while 2355 a client holds a delegation on the file, the client should follow the 2356 existing protocol with respect to attribute changes. It should flush 2357 all changes back to the server and relinquish the delegation. 2359 8.3.3. Permission Checking 2361 It is not feasible to enumerate all possible MAC models or even all 2362 levels of protection within a subset of these models. This means 2363 that NFSv4 clients and servers cannot be expected to directly make 2364 access control decisions based on the security attribute. Instead, 2365 NFSv4 should defer permission checking on this attribute to the host 2366 system. These checks are performed in addition to existing DAC and 2367 ACL checks outlined in the NFSv4 protocol. Section 8.6 gives a 2368 specific example of how the security attribute is handled under a 2369 particular MAC model. 2371 8.3.4. Object Creation 2373 When creating files in NFSv4, the OPEN and CREATE operations are used. 2374 One of the parameters to these operations is an fattr4 structure 2375 containing the attributes the file is to be created with. This 2376 allows NFSv4 to atomically set the security attribute of files upon 2377 creation. When a client is MAC aware, it must always provide the 2378 initial security attribute upon file creation. In the event that the 2379 server is the only MAC aware entity in the system, it should ignore 2380 the security attribute specified by the client and instead make the 2381 determination itself. A more in-depth explanation can be found in 2382 Section 8.6. 2384 8.3.5. Existing Objects 2386 Note that under the MAC model, all objects must have labels. 2387 Therefore, if an existing server is upgraded to include LNFS support, 2388 then it is the responsibility of the security system to define the 2389 behavior for existing objects. For example, if the security system 2390 is LFS 0, which means the server just stores and returns labels, then 2391 existing files should return labels which are set to an empty value. 2393 8.3.6. Label Changes 2395 As per the requirements, when a file's security label is modified, 2396 the server must notify all clients which have the file opened of the 2397 change in label. It does so with CB_ATTR_CHANGED. There are 2398 preconditions to making an attribute change imposed by NFSv4, and the 2399 security system might want to impose others. In the process of 2400 meeting these preconditions, the server may choose either to serve the 2401 request in whole or to return NFS4ERR_DELAY to the SETATTR operation. 2403 If there are open delegations on the file belonging to clients other 2404 than the one making the label change, then the process described in 2405 Section 8.3.2 must be followed. 2407 As the server is always presented with the subject label from the 2408 client, it does not necessarily need to communicate the fact that the 2409 label has changed to the client. In the cases where the change 2410 outright denies the client access, the client will be able to quickly 2411 determine that there is a new label in effect.
It is in cases where 2412 the client may share the same object between multiple subjects, or 2413 where the security system is not strictly hierarchical, that the 2414 CB_ATTR_CHANGED callback is very useful. It allows the server to 2415 inform the clients that the cached security attribute is now stale. 2417 Consider a system in which the clients enforce MAC checks and the 2418 server has a very simple security system which just stores the 2419 labels. In this system, the MAC label check always allows access, 2420 regardless of the subject label. 2422 In such a system, MAC labels are enforced by the smart client. So 2423 if client A changes a security label on a file, then the server MUST 2424 inform all clients that have the file opened that the label has 2425 changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new 2426 label and MUST enforce access via the new attribute values. 2428 [[Comment.6: Describe a LFS of 0, which will be the means to indicate 2429 such a deployment. In the current LFR, 0 is marked as reserved. If 2430 we use it, then we define the default LFS to be used by a LNFS aware 2431 server. I.e., it lets smart clients work together in the face of a 2432 dumb server. Note that while supporting this system is optional, it 2433 will make for a very good debugging mode during development. I.e., 2434 even if a server does not deploy with another security system, this 2435 mode gets your foot in the door. --TH]] 2437 8.4. pNFS Considerations 2439 This section examines the issues in deploying LNFS in a pNFS 2440 community of servers. 2442 8.4.1. MAC Label Checks 2444 The new FATTR4_SEC_LABEL attribute is metadata information, and as 2445 such the DS is not aware of the value contained on the MDS. 2446 Fortunately, the NFSv4.1 protocol [2] already has provisions for 2447 doing access level checks from the DS to the MDS. In order for the 2448 DS to validate the subject label presented by the client, it SHOULD 2449 utilize this mechanism. 2451 If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize 2452 CB_ATTR_CHANGED to inform the client of that fact. If the MDS is 2453 maintaining 2455 8.5. Discovery of Server LNFS Support 2457 The server can easily determine that a client supports LNFS when the 2458 client queries for the FATTR4_SEC_LABEL attribute of an object. Note 2459 that it cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS 2460 support. The client might need to discover which LFS the server 2461 supports. 2463 A server which supports LNFS MUST allow a client with any subject 2464 label to retrieve the FATTR4_SEC_LABEL attribute for the root 2465 filehandle, ROOTFH. The following compound must always succeed as 2466 far as a MAC label check is concerned: 2468 PUTROOTFH, GETATTR {FATTR4_SEC_LABEL} 2470 Note that the server might have imposed a security flavor on the root 2471 that precludes such access. I.e., if the server requires Kerberized 2472 access and the client presents a compound with AUTH_SYS, then the 2473 server is allowed to return NFS4ERR_WRONGSEC in this case. But if 2474 the client presents a correct security flavor, then the server MUST 2475 return the FATTR4_SEC_LABEL attribute with the supported LFS filled 2476 in. 2478 8.6. MAC Security NFS Modes of Operation 2480 A system using Labeled NFS may operate in three modes. The first 2481 mode provides the most protection and is called "full mode". In this 2482 mode, both the client and server implement a MAC model, allowing each 2483 end to make an access control decision.
The remaining two modes are 2484 variations on each other and are called "smart client" and "smart 2485 server" modes. In these modes, one end of the connection does not 2486 implement a MAC model, and because of this these operating modes 2487 offer less protection than full mode. 2489 8.6.1. Full Mode 2491 Full mode environments consist of MAC aware NFSv4 servers and clients 2492 and may be composed of mixed MAC models and policies. The system 2493 requires that both the client and server have an opportunity to 2494 perform an access control check based on all relevant information 2495 within the network. The file object security attribute is provided 2496 using the mechanism described in Section 8.3. The security attribute 2497 of the subject making the request is transported at the RPC layer 2498 using the mechanism described in RPCSECGSSv3 [6]. 2500 8.6.1.1. Initial Labeling and Translation 2502 The ability to create a file is an action that a MAC model may wish 2503 to mediate. The client is given the responsibility to determine the 2504 initial security attribute to be placed on a file. This allows the 2505 client to make a decision as to the acceptable security attributes to 2506 create a file with before sending the request to the server. Once 2507 the server receives the creation request from the client, it may 2508 choose to evaluate whether the security attribute is acceptable. 2510 Security attributes on the client and server may vary based on MAC 2511 model and policy. To handle this, the security attribute field has an 2512 LFS component. This component is a mechanism for the host to 2513 identify the format and meaning of the opaque portion of the security 2514 attribute. A full mode environment may contain hosts operating in 2515 several different LFSs and DOIs. In this case a mechanism for 2516 translating the opaque portion of the security attribute is needed. 2517 The actual translation function will vary based on MAC model and 2518 policy and is out of the scope of this document. If a translation is 2519 unavailable for a given LFS and DOI, then the request SHOULD be 2520 denied. Another recourse is to allow the host to provide a fallback 2521 mapping for unknown security attributes. 2523 8.6.1.2. Policy Enforcement 2525 In full mode, access control decisions are made by both the clients 2526 and servers. When a client makes a request, it takes the security 2527 attribute from the requesting process and makes an access control 2528 decision based on that attribute and the security attribute of the 2529 object it is trying to access. If the client denies that access, an 2530 RPC call to the server is never made. If, however, the access is 2531 allowed, the client will make a call to the NFS server. 2533 When the server receives the request from the client, it extracts the 2534 security attribute conveyed in the RPC request. The server then uses 2535 this security attribute and the attribute of the object the client is 2536 trying to access to make an access control decision. If the server's 2537 policy allows this access, it will fulfill the client's request; 2538 otherwise, it will return NFS4ERR_ACCESS. 2540 Implementations MAY validate security attributes supplied over the 2541 network to ensure that they are within a set of attributes permitted 2542 from a specific peer, and if not, reject them. Note that a system 2543 may permit a different set of attributes to be accepted from each 2544 peer. 2546 8.6.2.
Smart Client Mode 2548 Smart client environments consist of NFSv4 servers that are not MAC 2549 aware but NFSv4 clients that are. Clients in this environment may 2550 consist of groups implementing different MAC models and policies. 2551 The system requires that all clients in the environment be 2552 responsible for access control checks. Due to the amount of trust 2553 placed in the clients, this mode is only to be used in a trusted 2554 environment. 2556 8.6.2.1. Initial Labeling and Translation 2558 Just as in full mode, the client is responsible for determining the 2559 initial label upon object creation. The server in smart client mode 2560 does not implement a MAC model; however, it may provide the ability 2561 to restrict the creation and labeling of objects with certain labels 2562 based on different criteria, as described in Section 8.6.1.2. 2564 In a smart client environment, a group of clients operate in a single 2565 DOI. This removes the need for the clients to maintain a set of DOI 2566 translations. Servers should provide a method to allow different 2567 groups of clients to access the server at the same time. However, it 2568 should not let two groups of clients operating in different DOIs 2569 access the same files. 2571 8.6.2.2. Policy Enforcement 2573 In smart client mode, access control decisions are made by the 2574 clients. When a client accesses an object, it obtains the security 2575 attribute of the object from the server and combines it with the 2576 security attribute of the process making the request to make an 2577 access control decision. This check is in addition to the DAC checks 2578 provided by NFSv4, so access may still be denied based on the DAC 2579 criteria even if the MAC policy grants access. As the policy check is 2580 located on the client, an access control denial should take the form 2581 that is native to the platform. 2583 8.6.3. Smart Server Mode 2585 Smart server environments consist of NFSv4 servers that are MAC aware 2586 and one or more MAC unaware clients. The server is the only entity 2587 enforcing policy, and may selectively provide standard NFS services 2588 to clients based on their authentication credentials and/or 2589 associated network attributes (e.g., IP address, network interface). 2590 The level of trust and access extended to a client in this mode is 2591 configuration-specific. 2593 8.6.3.1. Initial Labeling and Translation 2595 In smart server mode, all labeling and access control decisions are 2596 performed by the NFSv4 server. In this environment, the NFSv4 clients 2597 are not MAC aware, so they cannot provide input into the access 2598 control decision. This requires the server to determine the initial 2599 labeling of objects. Normally the subject to use in this calculation 2600 would originate from the client. Instead, the NFSv4 server may choose 2601 to assign the subject security attribute based on the client's 2602 authentication credentials and/or associated network attributes 2603 (e.g., IP address, network interface). 2605 In smart server mode, security attributes are contained solely within 2606 the NFSv4 server. This means that all security attributes used in 2607 the system remain within a single LFS and DOI. Since security 2608 attributes will not cross DOIs or change format, there is no need to 2609 provide any translation functionality above that which is needed 2610 internally by the MAC model. 2612 8.6.3.2. Policy Enforcement 2614 All access control decisions in smart server mode are made by the 2615 server.
The server will assign the subject a security attribute 2616 based on some criteria (e.g., IP address, network interface). Using 2617 the newly calculated security attribute and the security attribute of 2618 the object being requested, the MAC model makes the access control 2619 check and returns NFS4ERR_ACCESS on denial and NFS4_OK on success. 2620 This check is done transparently to the client, so if the MAC 2621 permission check fails, the client may be unaware of the reason for 2622 the permission failure. When operating in this mode, administrators 2623 attempting to debug permission failures should remember to check the 2624 MAC policy running on the server in addition to the DAC settings. 2626 8.7. Security Considerations 2628 This entire document deals with security issues. 2630 Depending on the level of protection the MAC system offers, there may 2631 be a requirement to tightly bind the security attribute to the data. 2633 When only one of the client or server enforces labels, it is 2634 important to realize that the other side is not enforcing MAC 2635 protections. Alternate methods might be in use to handle the lack of 2636 MAC support, and care should be taken to identify and mitigate threats 2637 from possible tampering outside of these methods. 2639 An example of this is that a server that modifies READDIR or LOOKUP 2640 results based on the client's subject label might want to always 2641 construct the same subject label for a client which does not present 2642 one. This will prevent a non-LNFS client from mixing entries in the 2643 directory cache. 2645 9. Security Considerations 2647 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 2649 The following tables summarize the operations of the NFSv4.2 protocol 2650 and the corresponding designation of REQUIRED, RECOMMENDED, and 2651 OPTIONAL to implement or MUST NOT implement. The designation of MUST 2652 NOT implement is reserved for those operations that were defined in 2653 either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. 2655 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 2656 for operations sent by the client is for the server implementation. 2657 The client is generally required to implement the operations needed 2658 for the operating environment for which it serves. For example, a 2659 read-only NFSv4.2 client would have no need to implement the WRITE 2660 operation and is not required to do so. 2662 The REQUIRED or OPTIONAL designation for callback operations sent by 2663 the server is for both the client and server. Generally, the client 2664 has the option of creating the backchannel and sending the operations 2665 on the fore channel that will be a catalyst for the server sending 2666 callback operations. A partial exception is CB_RECALL_SLOT; the only 2667 way the client can avoid supporting this operation is by not creating 2668 a backchannel. 2670 Since this is a summary of the operations and their designation, 2671 there are subtleties that are not presented here. Therefore, if 2672 there is a question of the requirements of implementation, the 2673 operation descriptions themselves must be consulted along with other 2674 relevant explanatory text within either this specification or that of 2675 NFSv4.1 [2]. 2677 The abbreviations used in the second and third columns of the table 2678 are defined as follows.
2680 REQ REQUIRED to implement 2682 REC RECOMMENDED to implement 2684 OPT OPTIONAL to implement 2686 MNI MUST NOT implement 2688 For the NFSv4.2 features that are OPTIONAL, the operations that 2689 support those features are OPTIONAL, and the server would return 2690 NFS4ERR_NOTSUPP in response to the client's use of those operations. 2691 If an OPTIONAL feature is supported, it is possible that a set of 2692 operations related to the feature become REQUIRED to implement. The 2693 third column of the table designates the feature(s) and whether the 2694 operation is REQUIRED or OPTIONAL in the presence of support for the 2695 feature. 2697 The OPTIONAL features identified and their abbreviations are as 2698 follows: 2700 pNFS Parallel NFS 2702 FDELG File Delegations 2703 DDELG Directory Delegations 2705 COPY Server Side Copy 2707 ADB Application Data Blocks 2709 Operations 2711 +----------------------+--------------------+-----------------------+ 2712 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, or | 2713 | | MNI | OPT) | 2714 +----------------------+--------------------+-----------------------+ 2715 | ACCESS | REQ | | 2716 | BACKCHANNEL_CTL | REQ | | 2717 | BIND_CONN_TO_SESSION | REQ | | 2718 | CLOSE | REQ | | 2719 | COMMIT | REQ | | 2720 | COPY | OPT | COPY (REQ) | 2721 | COPY_ABORT | OPT | COPY (REQ) | 2722 | COPY_NOTIFY | OPT | COPY (REQ) | 2723 | COPY_REVOKE | OPT | COPY (REQ) | 2724 | COPY_STATUS | OPT | COPY (REQ) | 2725 | CREATE | REQ | | 2726 | CREATE_SESSION | REQ | | 2727 | DELEGPURGE | OPT | FDELG (REQ) | 2728 | DELEGRETURN | OPT | FDELG, DDELG, pNFS | 2729 | | | (REQ) | 2730 | DESTROY_CLIENTID | REQ | | 2731 | DESTROY_SESSION | REQ | | 2732 | EXCHANGE_ID | REQ | | 2733 | FREE_STATEID | REQ | | 2734 | GETATTR | REQ | | 2735 | GETDEVICEINFO | OPT | pNFS (REQ) | 2736 | GETDEVICELIST | OPT | pNFS (OPT) | 2737 | GETFH | REQ | | 2738 | INITIALIZE | OPT | ADB (REQ) | 2739 | GET_DIR_DELEGATION | OPT | DDELG (REQ) | 2740 | LAYOUTCOMMIT | OPT | pNFS (REQ) | 2741 | LAYOUTGET | OPT | pNFS (REQ) | 2742 | LAYOUTRETURN | OPT | pNFS (REQ) | 2743 | LINK | OPT | | 2744 | LOCK | REQ | | 2745 | LOCKT | REQ | | 2746 | LOCKU | REQ | | 2747 | LOOKUP | REQ | | 2748 | LOOKUPP | REQ | | 2749 | NVERIFY | REQ | | 2750 | OPEN | REQ | | 2751 | OPENATTR | OPT | | 2752 | OPEN_CONFIRM | MNI | | 2753 | OPEN_DOWNGRADE | REQ | | 2754 | PUTFH | REQ | | 2755 | PUTPUBFH | REQ | | 2756 | PUTROOTFH | REQ | | 2757 | READ | OPT | | 2758 | READDIR | REQ | | 2759 | READLINK | OPT | | 2760 | READ_PLUS | OPT | ADB (REQ) | 2761 | RECLAIM_COMPLETE | REQ | | 2762 | RELEASE_LOCKOWNER | MNI | | 2763 | REMOVE | REQ | | 2764 | RENAME | REQ | | 2765 | RENEW | MNI | | 2766 | RESTOREFH | REQ | | 2767 | SAVEFH | REQ | | 2768 | SECINFO | REQ | | 2769 | SECINFO_NO_NAME | REC | pNFS file layout | 2770 | | | (REQ) | 2771 | SEQUENCE | REQ | | 2772 | SETATTR | REQ | | 2773 | SETCLIENTID | MNI | | 2774 | SETCLIENTID_CONFIRM | MNI | | 2775 | SET_SSV | REQ | | 2776 | TEST_STATEID | REQ | | 2777 | VERIFY | REQ | | 2778 | WANT_DELEGATION | OPT | FDELG (OPT) | 2779 | WRITE | REQ | | 2780 +----------------------+--------------------+-----------------------+ 2781 Callback Operations 2783 +-------------------------+-------------------+---------------------+ 2784 | Operation | REQ, REC, OPT, or | Feature (REQ, REC, | 2785 | | MNI | or OPT) | 2786 +-------------------------+-------------------+---------------------+ 2787 | CB_COPY | OPT | COPY (REQ) | 2788 | CB_GETATTR | OPT | FDELG (REQ) | 2789 | CB_LAYOUTRECALL | OPT | pNFS (REQ) | 2790 | CB_NOTIFY
| OPT | DDELG (REQ) | 2791 | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | 2792 | CB_NOTIFY_LOCK | OPT | | 2793 | CB_PUSH_DELEG | OPT | FDELG (OPT) | 2794 | CB_RECALL | OPT | FDELG, DDELG, pNFS | 2795 | | | (REQ) | 2796 | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | 2797 | | | (REQ) | 2798 | CB_RECALL_SLOT | REQ | | 2799 | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | 2800 | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | 2801 | | | (REQ) | 2802 | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | 2803 | | | (REQ) | 2804 +-------------------------+-------------------+---------------------+ 2806 11. NFSv4.2 Operations 2808 11.1. Operation 59: COPY - Initiate a server-side copy 2810 11.1.1. ARGUMENT 2812 const COPY4_GUARDED = 0x00000001; 2813 const COPY4_METADATA = 0x00000002; 2815 struct COPY4args { 2816 /* SAVED_FH: source file */ 2817 /* CURRENT_FH: destination file or */ 2818 /* directory */ 2819 offset4 ca_src_offset; 2820 offset4 ca_dst_offset; 2821 length4 ca_count; 2822 uint32_t ca_flags; 2823 component4 ca_destination; 2824 netloc4 ca_source_server<>; 2825 }; 2827 11.1.2. RESULT 2829 union COPY4res switch (nfsstat4 cr_status) { 2830 case NFS4_OK: 2831 stateid4 cr_callback_id<1>; 2832 default: 2833 length4 cr_bytes_copied; 2834 }; 2836 11.1.3. DESCRIPTION 2838 The COPY operation is used for both intra-server and inter-server 2839 copies. In both cases, the COPY is always sent from the client to 2840 the destination server of the file copy. The COPY operation requests 2841 that a file be copied from the location specified by the SAVED_FH 2842 value to the location specified by the combination of CURRENT_FH and 2843 ca_destination. 2845 The SAVED_FH must be a regular file. If SAVED_FH is not a regular 2846 file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. 2848 In order to set SAVED_FH to the source file handle, the compound 2849 procedure requesting the COPY will include a sub-sequence of 2850 operations such as 2852 PUTFH source-fh 2853 SAVEFH 2855 If the request is for a server-to-server copy, the source-fh is a 2856 filehandle from the source server and the compound procedure is being 2857 executed on the destination server. In this case, the source-fh is a 2858 foreign filehandle on the server receiving the COPY request. If 2859 either PUTFH or SAVEFH checked the validity of the filehandle, the 2860 operation would likely fail and return NFS4ERR_STALE. 2862 In order to avoid this problem, the minor version incorporating the 2863 COPY operations will need to make a few small changes in the handling 2864 of existing operations. If a server supports the server-to-server 2865 COPY feature, a PUTFH followed by a SAVEFH MUST NOT return 2866 NFS4ERR_STALE for either operation. These restrictions do not pose 2867 substantial difficulties for servers. The CURRENT_FH and SAVED_FH 2868 may be validated in the context of the operation referencing them and 2869 an NFS4ERR_STALE error returned for an invalid file handle at that 2870 point. 2872 The CURRENT_FH and ca_destination together specify the destination of 2873 the copy operation. If ca_destination is of 0 (zero) length, then 2874 CURRENT_FH specifies the target file. In this case, CURRENT_FH MUST 2875 be a regular file and not a directory. If ca_destination is not of 0 2876 (zero) length, the ca_destination argument specifies the file name to 2877 which the data will be copied within the directory identified by 2878 CURRENT_FH. In this case, CURRENT_FH MUST be a directory and not a 2879 regular file. 
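To make the destination semantics above concrete, the following is a
minimal C sketch, not part of the protocol, that populates COPY4args
for an intra-server, whole-file copy into a new file under the
directory set as CURRENT_FH. It assumes C types generated from the
XDR in this document (e.g., by rpcgen); the helper name
setup_whole_file_copy() is hypothetical.

   /*
    * Sketch only: SAVED_FH is assumed to already name the source
    * file and CURRENT_FH the destination directory.
    */
   static void
   setup_whole_file_copy(COPY4args *args, component4 *new_name)
   {
           args->ca_src_offset = 0;   /* start of the source file */
           args->ca_dst_offset = 0;   /* start of the destination */
           args->ca_count      = 0;   /* 0 (zero) == copy through EOF */
           args->ca_flags      = COPY4_GUARDED; /* fail if dest exists */
           args->ca_destination = *new_name;    /* name within CURRENT_FH */
           /* An empty ca_source_server list marks an intra-server copy. */
           args->ca_source_server.ca_source_server_len = 0;
           args->ca_source_server.ca_source_server_val = NULL;
   }

A count of 0 (zero) requesting a copy through EOF and the meaning of
an empty ca_source_server list are described in the text that
follows.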
2881 If the file named by ca_destination does not exist and the operation 2882 completes successfully, the file will be visible in the file system 2883 namespace. If the file does not exist and the operation fails, the 2884 file MAY be visible in the file system namespace depending on when 2885 the failure occurs and on the implementation of the NFS server 2886 receiving the COPY operation. If the ca_destination name cannot be 2887 created in the destination file system (due to file name 2888 restrictions, such as case or length), the operation MUST fail. 2890 The ca_src_offset is the offset within the source file from which the 2891 data will be read, the ca_dst_offset is the offset within the 2892 destination file to which the data will be written, and the ca_count 2893 is the number of bytes that will be copied. An offset of 0 (zero) 2894 specifies the start of the file. A count of 0 (zero) requests that 2895 all bytes from ca_src_offset through EOF be copied to the 2896 destination. If concurrent modifications to the source file overlap 2897 with the source file region being copied, the data copied may include 2898 all, some, or none of the modifications. The client can use standard 2899 NFS operations (e.g., OPEN with OPEN4_SHARE_DENY_WRITE or mandatory 2900 byte range locks) to protect against concurrent modifications if the 2901 client is concerned about this. If the source file's end of file is 2902 being modified in parallel with a copy that specifies a count of 0 2903 (zero) bytes, the amount of data copied is implementation dependent 2904 (clients may guard against this case by specifying a non-zero count 2905 value or preventing modification of the source file as mentioned 2906 above). 2908 If the source offset or the source offset plus count is greater than 2909 or equal to the size of the source file, the operation will fail with 2910 NFS4ERR_INVAL. The destination offset or destination offset plus 2911 count may be greater than the size of the destination file. This 2912 allows the client to issue parallel copies to implement 2913 operations such as "cat file1 file2 file3 file4 > dest". 2915 If the destination file is created as a result of this command, the 2916 destination file's size will be equal to the number of bytes 2917 successfully copied. If the destination file already existed, the 2918 destination file's size may increase as a result of this operation 2919 (e.g., if ca_dst_offset plus ca_count is greater than the 2920 destination's initial size). 2922 If the ca_source_server list is specified, then this is an inter- 2923 server copy operation and the source file is on a remote server. The 2924 client is expected to have previously issued a successful COPY_NOTIFY 2925 request to the remote source server. The ca_source_server list 2926 SHOULD be the same as the COPY_NOTIFY response's cnr_source_server 2927 list. If the client includes the entries from the COPY_NOTIFY 2928 response's cnr_source_server list in the ca_source_server list, the 2929 source server can indicate a specific copy protocol for the 2930 destination server to use by returning a URL, which specifies both a 2931 protocol service and server name. Server-to-server copy protocol 2932 considerations are described in Section 4.2.3 and Section 4.4.1. 2934 The ca_flags argument allows the copy operation to be customized in 2935 the following ways using the guarded flag (COPY4_GUARDED) and the 2936 metadata flag (COPY4_METADATA).
2938 If the guarded flag is set and the destination exists on the server, 2939 this operation will fail with NFS4ERR_EXIST. 2941 If the guarded flag is not set and the destination exists on the 2942 server, the behavior is implementation dependent. 2944 If the metadata flag is set and the client is requesting a whole file 2945 copy (i.e., ca_count is 0 (zero)), a subset of the destination file's 2946 attributes MUST be the same as the source file's corresponding 2947 attributes and a subset of the destination file's attributes SHOULD 2948 be the same as the source file's corresponding attributes. The 2949 attributes in the MUST and SHOULD copy subsets will be defined for 2950 each NFS version. 2952 For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED 2953 attributes respectively. A "MUST" in the "Copy to destination file?" 2954 column indicates that the attribute is part of the MUST copy set. A 2955 "SHOULD" in the "Copy to destination file?" column indicates that the 2956 attribute is part of the SHOULD copy set. 2958 +--------------------+----+---------------------------+ 2959 | Name | Id | Copy to destination file? | 2960 +--------------------+----+---------------------------+ 2961 | supported_attrs | 0 | no | 2962 | type | 1 | MUST | 2963 | fh_expire_type | 2 | no | 2964 | change | 3 | SHOULD | 2965 | size | 4 | MUST | 2966 | link_support | 5 | no | 2967 | symlink_support | 6 | no | 2968 | named_attr | 7 | no | 2969 | fsid | 8 | no | 2970 | unique_handles | 9 | no | 2971 | lease_time | 10 | no | 2972 | rdattr_error | 11 | no | 2973 | filehandle | 19 | no | 2974 | suppattr_exclcreat | 75 | no | 2975 +--------------------+----+---------------------------+ 2977 Table 2 2979 +--------------------+----+---------------------------+ 2980 | Name | Id | Copy to destination file? 
| 2981 +--------------------+----+---------------------------+ 2982 | acl | 12 | MUST | 2983 | aclsupport | 13 | no | 2984 | archive | 14 | no | 2985 | cansettime | 15 | no | 2986 | case_insensitive | 16 | no | 2987 | case_preserving | 17 | no | 2988 | change_policy | 60 | no | 2989 | chown_restricted | 18 | MUST | 2990 | dacl | 58 | MUST | 2991 | dir_notif_delay | 56 | no | 2992 | dirent_notif_delay | 57 | no | 2993 | fileid | 20 | no | 2994 | files_avail | 21 | no | 2995 | files_free | 22 | no | 2996 | files_total | 23 | no | 2997 | fs_charset_cap | 76 | no | 2998 | fs_layout_type | 62 | no | 2999 | fs_locations | 24 | no | 3000 | fs_locations_info | 67 | no | 3001 | fs_status | 61 | no | 3002 | hidden | 25 | MUST | 3003 | homogeneous | 26 | no | 3004 | layout_alignment | 66 | no | 3005 | layout_blksize | 65 | no | 3006 | layout_hint | 63 | no | 3007 | layout_type | 64 | no | 3008 | maxfilesize | 27 | no | 3009 | maxlink | 28 | no | 3010 | maxname | 29 | no | 3011 | maxread | 30 | no | 3012 | maxwrite | 31 | no | 3013 | max_hole_punch | 31 | no | 3014 | mdsthreshold | 68 | no | 3015 | mimetype | 32 | MUST | 3016 | mode | 33 | MUST | 3017 | mode_set_masked | 74 | no | 3018 | mounted_on_fileid | 55 | no | 3019 | no_trunc | 34 | no | 3020 | numlinks | 35 | no | 3021 | owner | 36 | MUST | 3022 | owner_group | 37 | MUST | 3023 | quota_avail_hard | 38 | no | 3024 | quota_avail_soft | 39 | no | 3025 | quota_used | 40 | no | 3026 | rawdev | 41 | no | 3027 | retentevt_get | 71 | MUST | 3028 | retentevt_set | 72 | no | 3029 | retention_get | 69 | MUST | 3030 | retention_hold | 73 | MUST | 3031 | retention_set | 70 | no | 3032 | sacl | 59 | MUST | 3033 | space_avail | 42 | no | 3034 | space_free | 43 | no | 3035 | space_freed | 78 | no | 3036 | space_reserved | 77 | MUST | 3037 | space_total | 44 | no | 3038 | space_used | 45 | no | 3039 | system | 46 | MUST | 3040 | time_access | 47 | MUST | 3041 | time_access_set | 48 | no | 3042 | time_backup | 49 | no | 3043 | time_create | 50 | MUST | 3044 | time_delta | 51 | no | 3045 | time_metadata | 52 | SHOULD | 3046 | time_modify | 53 | MUST | 3047 | time_modify_set | 54 | no | 3048 +--------------------+----+---------------------------+ 3050 Table 3 3052 [NOTE: The source file's attribute values will take precedence over 3053 any attribute values inherited by the destination file.] 3054 In the case of an inter-server copy or an intra-server copy between 3055 file systems, the attributes supported for the source file and 3056 destination file could be different. By definition, the REQUIRED 3057 attributes will be supported in all cases. If the metadata flag is 3058 set and the source file has a RECOMMENDED attribute that is not 3059 supported for the destination file, the copy MUST fail with 3060 NFS4ERR_ATTRNOTSUPP. 3062 Any attribute supported by the destination server that is not set on 3063 the source file SHOULD be left unset. 3065 Metadata attributes not exposed via the NFS protocol SHOULD be copied 3066 to the destination file where appropriate. 3068 The destination file's named attributes are not duplicated from the 3069 source file. After the copy process completes, the client MAY 3070 attempt to duplicate named attributes using standard NFSv4 3071 operations. However, the destination file's named attribute 3072 capabilities MAY be different from the source file's named attribute 3073 capabilities.
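As an illustrative, non-normative sketch of the guarded and metadata
flag checks described in this section, a server might gate a COPY
request as follows; check_copy_flags() and the dst_exists parameter
are hypothetical.

   /*
    * Sketch only: ca_flags handling as described above.  Assumes
    * the XDR-generated header for the types and flag constants.
    */
   nfsstat4
   check_copy_flags(uint32_t ca_flags, length4 ca_count, bool dst_exists)
   {
           if ((ca_flags & COPY4_GUARDED) && dst_exists)
                   return NFS4ERR_EXIST;  /* guarded copy, target exists */
           if (ca_count != 0) {
                   /* Partial copy: the metadata flag is ignored. */
           } else if (ca_flags & COPY4_METADATA) {
                   /* Whole-file copy: copy the MUST set and attempt
                    * the SHOULD set of attributes (Tables 2 and 3). */
           }
           return NFS4_OK;
   }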
3075 If the metadata flag is not set and the client is requesting a whole 3076 file copy (i.e., ca_count is 0 (zero)), the destination file's 3077 metadata is implementation dependent. 3079 If the client is requesting a partial file copy (i.e., ca_count is 3080 not 0 (zero)), the client SHOULD NOT set the metadata flag and the 3081 server MUST ignore the metadata flag. 3083 If the operation does not result in an immediate failure, the server 3084 will return NFS4_OK, and the CURRENT_FH will remain the destination's 3085 filehandle. 3087 If an immediate failure does occur, cr_bytes_copied will be set to 3088 the number of bytes copied to the destination file before the error 3089 occurred. The cr_bytes_copied value indicates the number of bytes 3090 copied but not which specific bytes have been copied. 3092 A return of NFS4_OK indicates that either the operation is complete 3093 or the operation was initiated and a callback will be used to deliver 3094 the final status of the operation. 3096 If the cr_callback_id is returned, this indicates that the operation 3097 was initiated and a CB_COPY callback will deliver the final results 3098 of the operation. The cr_callback_id stateid is termed a copy 3099 stateid in this context. The server is given the option of returning 3100 the results in a callback because the data may require a relatively 3101 long period of time to copy. 3103 If no cr_callback_id is returned, the operation completed 3104 synchronously and no callback will be issued by the server. The 3105 completion status of the operation is indicated by cr_status. 3107 If the copy completes successfully, either synchronously or 3108 asynchronously, the data copied from the source file to the 3109 destination file MUST appear identical to the NFS client. However, 3110 the NFS server's on disk representation of the data in the source 3111 file and destination file MAY differ. For example, the NFS server 3112 might encrypt, compress, deduplicate, or otherwise represent the on 3113 disk data in the source and destination file differently. 3115 In the event of a failure the state of the destination file is 3116 implementation dependent. The COPY operation may fail for the 3117 following reasons (this is a partial list). 3119 NFS4ERR_MOVED: The file system which contains the source file, or 3120 the destination file or directory is not present. The client can 3121 determine the correct location and reissue the operation with the 3122 correct location. 3124 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3125 NFS server receiving this request. 3127 NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the 3128 server-to-server copy offload protocol. 3130 NFS4ERR_OFFLOAD_DENIED: The copy offload operation is supported by 3131 both the source and the destination, but the destination is not 3132 allowing it for this file. If the client sees this error, it 3133 should fall back to the normal copy semantics. 3135 NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a 3136 server-to-server copy offload operation. This may be due to the 3137 client's failure to send the COPY_NOTIFY operation to the remote 3138 server, the remote server receiving a server-to-server copy 3139 offload request after the copy lease time expired, or for some 3140 other permission problem. 3142 NFS4ERR_FBIG: The copy operation would have caused the file to grow 3143 beyond the server's limit. 3145 NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- 3146 zero length. 
3148 NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. 3150 NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has 3151 zero length. 3153 NFS4ERR_INVAL: The source offset or offset plus count is greater 3154 than or equal to the size of the source file. 3156 NFS4ERR_DELAY: The server does not have the resources to perform the 3157 copy operation at the current time. The client should retry the 3158 operation sometime in the future. 3160 NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the 3161 same metadata as the source file. 3163 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3164 does not match the server's security policy. 3166 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy 3168 11.2.1. ARGUMENT 3170 struct COPY_ABORT4args { 3171 /* CURRENT_FH: destination file */ 3172 stateid4 caa_stateid; 3173 }; 3175 11.2.2. RESULT 3177 struct COPY_ABORT4res { 3178 nfsstat4 car_status; 3179 }; 3181 11.2.3. DESCRIPTION 3183 COPY_ABORT is used for both intra- and inter-server asynchronous 3184 copies. The COPY_ABORT operation allows the client to cancel a 3185 server-side copy operation that it initiated. This operation is sent 3186 in a COMPOUND request from the client to the destination server. 3187 This operation may be used to cancel a copy when the application that 3188 requested the copy exits before the operation is completed or for 3189 some other reason. 3191 The request contains the filehandle and copy stateid cookies that act 3192 as the context for the previously initiated copy operation. 3194 The result's car_status field indicates whether the cancel was 3195 successful or not. A value of NFS4_OK indicates that the copy 3196 operation was canceled and no callback will be issued by the server. 3197 A copy operation that is successfully canceled may result in none, 3198 some, or all of the data having been copied. 3200 If the server supports asynchronous copies, the server is REQUIRED to 3201 support the COPY_ABORT operation. 3203 The COPY_ABORT operation may fail for the following reasons (this is 3204 a partial list): 3206 NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS 3207 server receiving this request. 3209 NFS4ERR_RETRY: The abort failed, but a retry at some time in the 3210 future MAY succeed. 3212 NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will 3213 deliver the results of the copy operation. 3215 NFS4ERR_SERVERFAULT: An error occurred on the server that does not 3216 map to a specific error code. 3218 11.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 3219 copy 3221 11.3.1. ARGUMENT 3223 struct COPY_NOTIFY4args { 3224 /* CURRENT_FH: source file */ 3225 netloc4 cna_destination_server; 3226 }; 3228 11.3.2. RESULT 3230 struct COPY_NOTIFY4resok { 3231 nfstime4 cnr_lease_time; 3232 netloc4 cnr_source_server<>; 3233 }; 3235 union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { 3236 case NFS4_OK: 3237 COPY_NOTIFY4resok resok4; 3238 default: 3239 void; 3240 }; 3242 11.3.3. DESCRIPTION 3244 This operation is used for an inter-server copy. A client sends this 3245 operation in a COMPOUND request to the source server to authorize a 3246 destination server identified by cna_destination_server to read the 3247 file specified by CURRENT_FH on behalf of the given user. 3249 The cna_destination_server MUST be specified using the netloc4 3250 network location format. The server is not required to resolve the 3251 cna_destination_server address before completing this operation.
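The following C sketch illustrates how a client might chain a
COPY_NOTIFY at the source server with a subsequent COPY at the
destination server, as described in Section 11.1. The types are
assumed to be rpcgen output for the XDR in this document; the
struct nfs_client type and the send_copy_notify() and send_copy()
helpers are hypothetical.

   /*
    * Sketch only: client-side sequencing of an inter-server copy.
    * The hypothetical helpers issue the named operation inside a
    * COMPOUND request to the given server.
    */
   nfsstat4
   start_inter_server_copy(struct nfs_client *src, struct nfs_client *dst,
                           netloc4 dst_netloc, COPY4args *copy)
   {
           COPY_NOTIFY4args cna;
           COPY_NOTIFY4res cnr;
           nfsstat4 status;

           /* 1. Ask the source server to authorize the destination. */
           cna.cna_destination_server = dst_netloc;
           status = send_copy_notify(src, &cna, &cnr);
           if (status != NFS4_OK)
                   return status;

           /* 2. Hand the source's address list to the destination;
            *    the COPY itself is sent to the destination server. */
           copy->ca_source_server.ca_source_server_len =
               cnr.COPY_NOTIFY4res_u.resok4.cnr_source_server.cnr_source_server_len;
           copy->ca_source_server.ca_source_server_val =
               cnr.COPY_NOTIFY4res_u.resok4.cnr_source_server.cnr_source_server_val;
           return send_copy(dst, copy);
   }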
3253 If this operation succeeds, the source server will allow the 3254 cna_destination_server to copy the specified file on behalf of the 3255 given user. If COPY_NOTIFY succeeds, the destination server is 3256 granted permission to read the file as long as both of the following 3257 conditions are met: 3259 o The destination server begins reading the source file before the 3260 cnr_lease_time expires. If the cnr_lease_time expires while the 3261 destination server is still reading the source file, the 3262 destination server is allowed to finish reading the file. 3264 o The client has not issued a COPY_REVOKE for the same combination 3265 of user, filehandle, and destination server. 3267 The cnr_lease_time is chosen by the source server. A cnr_lease_time 3268 of 0 (zero) indicates an infinite lease. To renew the copy lease 3269 time the client should resend the same copy notification request to 3270 the source server. 3272 To avoid the need for synchronized clocks, copy lease times are 3273 granted by the server as a time delta. However, there is a 3274 requirement that the client and server clocks do not drift 3275 excessively over the duration of the lease. There is also the issue 3276 of propagation delay across the network which could easily be several 3277 hundred milliseconds as well as the possibility that requests will be 3278 lost and need to be retransmitted. 3280 To take propagation delay into account, the client should subtract it 3281 from copy lease times (e.g., if the client estimates the one-way 3282 propagation delay as 200 milliseconds, then it can assume that the 3283 lease is already 200 milliseconds old when it gets it). In addition, 3284 it will take another 200 milliseconds to get a response back to the 3285 server. So the client must send a lease renewal or send the copy 3286 offload request to the cna_destination_server at least 400 3287 milliseconds before the copy lease would expire. If the propagation 3288 delay varies over the life of the lease (e.g., the client is on a 3289 mobile host), the client will need to continuously subtract the 3290 increase in propagation delay from the copy lease times. 3292 The server's copy lease period configuration should take into account 3293 the network distance of the clients that will be accessing the 3294 server's resources. It is expected that the lease period will take 3295 into account the network propagation delays and other network delay 3296 factors for the client population. Since the protocol does not allow 3297 for an automatic method to determine an appropriate copy lease 3298 period, the server's administrator may have to tune the copy lease 3299 period. 3301 A successful response will also contain a list of names, addresses, 3302 and URLs called cnr_source_server, on which the source is willing to 3303 accept connections from the destination. These might not be 3304 reachable from the client and might be located on networks to which 3305 the client has no connection. 3307 If the client wishes to perform an inter-server copy, the client MUST 3308 send a COPY_NOTIFY to the source server. Therefore, the source 3309 server MUST support COPY_NOTIFY. 3311 For a copy only involving one server (the source and destination are 3312 on the same server), this operation is unnecessary. 3314 The COPY_NOTIFY operation may fail for the following reasons (this is 3315 a partial list): 3317 NFS4ERR_MOVED: The file system which contains the source file is not 3318 present on the source server. 
The client can determine the 3319 correct location and reissue the operation with the correct 3320 location. 3322 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3323 NFS server receiving this request. 3325 NFS4ERR_WRONGSEC: The security mechanism being used by the client 3326 does not match the server's security policy. 3328 11.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 3329 privileges 3331 11.4.1. ARGUMENT 3333 struct COPY_REVOKE4args { 3334 /* CURRENT_FH: source file */ 3335 netloc4 cra_destination_server; 3336 }; 3338 11.4.2. RESULT 3340 struct COPY_REVOKE4res { 3341 nfsstat4 crr_status; 3342 }; 3344 11.4.3. DESCRIPTION 3346 This operation is used for an inter-server copy. A client sends this 3347 operation in a COMPOUND request to the source server to revoke the 3348 authorization of a destination server identified by 3349 cra_destination_server from reading the file specified by CURRENT_FH 3350 on behalf of the given user. If the cra_destination_server has already 3351 begun copying the file, a successful return from this operation 3352 indicates that further access will be prevented. 3354 The cra_destination_server MUST be specified using the netloc4 3355 network location format. The server is not required to resolve the 3356 cra_destination_server address before completing this operation. 3358 The COPY_REVOKE operation is useful in situations in which the source 3359 server granted a very long or infinite lease on the destination 3360 server's ability to read the source file and all copy operations on 3361 the source file have been completed. 3363 For a copy only involving one server (the source and destination are 3364 on the same server), this operation is unnecessary. 3366 If the server supports COPY_NOTIFY, the server is REQUIRED to support 3367 the COPY_REVOKE operation. 3369 The COPY_REVOKE operation may fail for the following reasons (this is 3370 a partial list): 3372 NFS4ERR_MOVED: The file system which contains the source file is not 3373 present on the source server. The client can determine the 3374 correct location and reissue the operation with the correct 3375 location. 3377 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3378 NFS server receiving this request. 3380 11.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 3382 11.5.1. ARGUMENT 3384 struct COPY_STATUS4args { 3385 /* CURRENT_FH: destination file */ 3386 stateid4 csa_stateid; 3387 }; 3389 11.5.2. RESULT 3391 struct COPY_STATUS4resok { 3392 length4 csr_bytes_copied; 3393 nfsstat4 csr_complete<1>; 3394 }; 3396 union COPY_STATUS4res switch (nfsstat4 csr_status) { 3397 case NFS4_OK: 3398 COPY_STATUS4resok resok4; 3399 default: 3400 void; 3401 }; 3403 11.5.3. DESCRIPTION 3405 COPY_STATUS is used for both intra- and inter-server asynchronous 3406 copies. The COPY_STATUS operation allows the client to poll the 3407 server to determine the status of an asynchronous copy operation. 3408 This operation is sent by the client to the destination server. 3410 If this operation is successful, the number of bytes copied is 3411 returned to the client in the csr_bytes_copied field. The 3412 csr_bytes_copied value indicates the number of bytes copied but not 3413 which specific bytes have been copied. 3415 If the optional csr_complete field is present, the copy has 3416 completed. In this case, the status value indicates the result of the 3417 asynchronous copy operation.
In all cases, the server will also 3418 deliver the final results of the asynchronous copy in a CB_COPY 3419 operation. 3421 The failure of this operation does not indicate the result of the 3422 asynchronous copy in any way. 3424 If the server supports asynchronous copies, the server is REQUIRED to 3425 support the COPY_STATUS operation. 3427 The COPY_STATUS operation may fail for the following reasons (this is 3428 a partial list): 3430 NFS4ERR_NOTSUPP: The copy status operation is not supported by the 3431 NFS server receiving this request. 3433 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 4.3.2 3434 below). 3436 NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid 3437 section below). 3439 11.6. Operation 64: INITIALIZE 3441 The server has no concept of the structure imposed by the 3442 application. It is only when the application writes to a section of 3443 the file that order gets imposed. In order to detect corruption even 3444 before the application utilizes the file, the application will want 3445 to initialize a range of ADBs. It uses the INITIALIZE operation to 3446 do so. 3448 11.6.1. ARGUMENT 3450 /* 3451 * We use data_content4 in case we wish to 3452 * extend new types later. Note that we 3453 * are explicitly disallowing data. 3454 */ 3455 union initialize_arg4 switch (data_content4 content) { 3456 case NFS4_CONTENT_APP_BLOCK: 3457 app_data_block4 ia_adb; 3458 case NFS4_CONTENT_HOLE: 3459 hole_info4 ia_hole; 3460 default: 3461 void; 3462 }; 3464 struct INITIALIZE4args { 3465 /* CURRENT_FH: file */ 3466 stateid4 ia_stateid; 3467 stable_how4 ia_stable; 3468 initialize_arg4 ia_data<>; 3469 }; 3471 11.6.2. RESULT 3473 struct INITIALIZE4resok { 3474 count4 ir_count; 3475 stable_how4 ir_committed; 3476 verifier4 ir_writeverf; 3477 data_content4 ir_sparse; 3478 }; 3480 union INITIALIZE4res switch (nfsstat4 status) { 3481 case NFS4_OK: 3482 INITIALIZE4resok resok4; 3483 default: 3484 void; 3485 }; 3487 11.6.3. DESCRIPTION 3489 When the client invokes the INITIALIZE operation, it has two desired 3490 results: 3492 1. The structure described by the app_data_block4 be imposed on the 3493 file. 3495 2. The contents described by the app_data_block4 be sparse. 3497 If the server supports the INITIALIZE operation, it still might not 3498 support sparse files. So if it receives the INITIALIZE operation, 3499 then it MUST populate the contents of the file with the initialized 3500 ADBs. In other words, if the server supports INITIALIZE, then it 3501 supports the concept of ADBs. [[Comment.7: Do we want to support an 3502 asynchronous INITIALIZE? Do we have to? --TH]] 3504 If the data was already initialized, there are two interesting 3505 scenarios: 3507 1. The data blocks are allocated. 3509 2. Initializing in the middle of an existing ADB. 3511 If the data blocks were already allocated, then the INITIALIZE is a 3512 hole punch operation. If the server supports sparse files, then the 3513 data blocks are to be deallocated. If not, then the data blocks are 3514 to be rewritten in the indicated ADB format. [[Comment.8: Need to 3515 document interaction between space reservation and hole punching? 3516 --TH]] 3518 Since the server has no knowledge of ADBs, it should not report 3519 misaligned creation of ADBs. Even though it can detect them, it 3520 cannot disallow them, as the application might be in the process of 3521 changing the size of the ADBs. Thus the server must be prepared to 3522 handle an INITIALIZE into an existing ADB.
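As a sketch of the detection implied above, a server storing ADBs
sparsely could flag an INITIALIZE that starts inside an existing ADB
with a simple modulus check; the function name and the region
bookkeeping parameters are hypothetical server-internal state, not
protocol elements.

   /*
    * Sketch only: detect an INITIALIZE that begins inside an
    * existing ADB rather than on an ADB boundary.
    */
   #include <stdbool.h>
   #include <stdint.h>

   bool
   initialize_splits_adb(uint64_t adb_region_start,
                         uint64_t adb_block_size,
                         uint64_t init_offset)
   {
           uint64_t rel = init_offset - adb_region_start;

           /* E.g., with a 4k adb_block_size, an INITIALIZE starting
            * 1k into an ADB yields rel % 4096 == 1024: misaligned. */
           return (rel % adb_block_size) != 0;
   }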
3524 This document does not mandate the manner in which the server stores 3525 ADBs sparsely for a file. It does assume that if ADBs are stored 3526 sparsely, then the server can detect when an INITIALIZE arrives that 3527 will force a new ADB to start inside an existing ADB. For example, 3528 assume that ADBi has an adb_block_size of 4k and that an INITIALIZE 3529 starts 1k inside ADBi. The server should [[Comment.9: Need to flesh 3530 this out. --TH]] 3532 11.7. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 3534 11.7.1. ARGUMENT 3536 /* new */ 3537 const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; 3539 11.7.2. RESULT 3541 Unchanged 3543 11.7.3. MOTIVATION 3545 Enterprise applications require guarantees that an operation has 3546 either aborted or completed. NFSv4.1 provides this guarantee as long 3547 as the session is alive: simply send a SEQUENCE operation on the same 3548 slot with a new sequence number, and the successful return of 3549 SEQUENCE indicates the previous operation has completed. However, if 3550 the session is lost, there is no way to know when any in-progress 3551 operations have aborted or completed. In hindsight, the NFSv4.1 3552 specification should have mandated that DESTROY_SESSION abort/ 3553 complete all outstanding operations. 3555 11.7.4. DESCRIPTION 3557 A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability 3558 when it sends an EXCHANGE_ID operation. The server SHOULD set this 3559 capability in the EXCHANGE_ID reply whether the client requests it or 3560 not. If the client ID is created with this capability then the 3561 following will occur: 3563 o The server will not reply to DESTROY_SESSION until all operations 3564 in progress are completed or aborted. 3566 o The server will not reply to subsequent EXCHANGE_ID invoked on the 3567 same Client Owner with a new verifier until all operations in 3568 progress on the Client ID's session are completed or aborted. 3570 o When DESTROY_CLIENTID is invoked, any sessions (both idle 3571 and non-idle), opens, locks, delegations, layouts, and/or wants 3572 (Section 18.49) associated with the client ID are removed. 3573 Pending operations will be completed or aborted before the 3574 sessions, opens, locks, delegations, layouts, and/or wants are 3575 deleted. 3577 o The NFS server SHOULD support client ID trunking, and if it does 3578 and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a 3579 session ID created on one node of the storage cluster MUST be 3580 destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID 3581 and an EXCHANGE_ID with a new verifier affect all sessions 3582 regardless of what node the sessions were created on. 3584 11.8. Operation 65: READ_PLUS 3586 If the client sends a READ operation, it is explicitly stating that 3587 it is not supporting sparse files. So if a READ occurs on a sparse 3588 ADB, then the server must expand such ADBs to be raw bytes. If a 3589 READ occurs in the middle of an ADB, the server can only send back 3590 bytes starting from that offset. 3592 Such an operation is inefficient for transfer of sparse sections of 3593 the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, 3594 a client should issue READ_PLUS. Note that as the client has no a 3595 priori knowledge of whether an ADB is present or not, it should 3596 always use READ_PLUS. 3598 11.8.1.
ARGUMENT 3600 struct READ_PLUS4args { 3601 /* CURRENT_FH: file */ 3602 stateid4 rpa_stateid; 3603 offset4 rpa_offset; 3604 count4 rpa_count; 3605 }; 3607 11.8.2. RESULT 3609 union read_plus_content switch (data_content4 content) { 3610 case NFS4_CONTENT_DATA: 3611 opaque rpc_data<>; 3612 case NFS4_CONTENT_APP_BLOCK: 3613 app_data_block4 rpc_block; 3614 case NFS4_CONTENT_HOLE: 3615 hole_info4 rpc_hole; 3616 default: 3617 void; 3618 }; 3620 /* 3621 * Allow a return of an array of contents. 3622 */ 3623 struct read_plus_res4 { 3624 bool rpr_eof; 3625 read_plus_content rpr_contents<>; 3626 }; 3628 union READ_PLUS4res switch (nfsstat4 status) { 3629 case NFS4_OK: 3630 read_plus_res4 resok4; 3631 default: 3632 void; 3633 }; 3635 11.8.3. DESCRIPTION 3637 Over the given range, READ_PLUS will return all data and ADBs found 3638 as an array of read_plus_content. It is possible to have consecutive 3639 ADBs in the array, either because different definitions of ADBs are 3640 present or because the guard pattern changes. 3642 Edge cases exist for ADBs which either begin before the rpa_offset 3643 requested by the READ_PLUS or end after the rpa_count requested - 3644 both of which may occur as not all applications which access the file 3645 are aware of the main application imposing a format on the file 3646 contents, e.g., tar, dd, cp, etc. READ_PLUS MUST retrieve whole 3647 ADBs, but it need not retrieve an entire sequence of ADBs. 3649 The server MUST return a whole ADB; if it cannot, it must expand the 3650 partial ADB into data before it sends it to the client. E.g., if 3651 an ADB had a block size of 64k and the READ_PLUS was for 128k 3652 starting at an offset of 32k inside the ADB, then the first 32k would 3653 be converted to data. 3655 12. NFSv4.2 Callback Operations 3657 12.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's 3658 Attributes Changed 3660 12.1.1. ARGUMENTS 3662 struct CB_ATTR_CHANGED4args { 3663 nfs_fh4 acca_fh; 3664 bitmap4 acca_critical; 3665 bitmap4 acca_info; 3666 }; 3668 12.1.2. RESULTS 3670 struct CB_ATTR_CHANGED4res { 3671 nfsstat4 accr_status; 3672 }; 3674 12.1.3. DESCRIPTION 3676 The CB_ATTR_CHANGED callback operation is used by the server to 3677 indicate to the client that the file's attributes have been modified 3678 on the server. The server does not convey how the attributes have 3679 changed, just that they have been modified. The server can inform 3680 the client about both critical and informational attribute changes in 3681 the bitmask arguments. The client SHOULD query the server about all 3682 attributes set in acca_critical. For all changes reflected in 3683 acca_info, the client can decide whether or not it wants to poll the 3684 server. 3686 The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set 3687 in acca_critical is the method used by the server to indicate that 3688 the MAC label for the file referenced by acca_fh has changed. In 3689 many ways, the server does not care about the result returned by the 3690 client. 3692 12.2. Operation 15: CB_COPY - Report results of a server-side copy 3693 12.2.1. ARGUMENT 3695 union copy_info4 switch (nfsstat4 cca_status) { 3696 case NFS4_OK: 3697 void; 3698 default: 3699 length4 cca_bytes_copied; 3700 }; 3702 struct CB_COPY4args { 3703 nfs_fh4 cca_fh; 3704 stateid4 cca_stateid; 3705 copy_info4 cca_copy_info; 3706 }; 3708 12.2.2. RESULT 3710 struct CB_COPY4res { 3711 nfsstat4 ccr_status; 3712 }; 3714 12.2.3. DESCRIPTION 3716 CB_COPY is used for both intra- and inter-server asynchronous copies.
3717 The CB_COPY callback informs the client of the result of an 3718 asynchronous server-side copy. This operation is sent by the 3719 destination server to the client in a CB_COMPOUND request. The copy 3720 is identified by the filehandle and stateid arguments. The result is 3721 indicated by the status field. If the copy failed, cca_bytes_copied 3722 contains the number of bytes copied before the failure occurred. The 3723 cca_bytes_copied value indicates the number of bytes copied but not 3724 which specific bytes have been copied. 3726 In the absence of an established backchannel, the server cannot 3727 signal the completion of the COPY via a CB_COPY callback. The loss 3728 of a callback channel would be indicated by the server setting the 3729 SEQ4_STATUS_CB_PATH_DOWN flag in the sr_status_flags field of the 3730 SEQUENCE operation. The client must re-establish the callback 3731 channel to receive the status of the COPY operation. Prolonged loss 3732 of the callback channel could result in the server dropping the COPY 3733 operation state and invalidating the copy stateid. 3735 If the client supports the COPY operation, the client is REQUIRED to 3736 support the CB_COPY operation. 3738 The CB_COPY operation may fail for the following reasons (this is a 3739 partial list): 3741 NFS4ERR_NOTSUPP: The copy offload operation is not supported by the 3742 NFS client receiving this request. 3744 13. IANA Considerations 3746 This section uses terms that are defined in [23]. 3748 14. References 3750 14.1. Normative References 3752 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 3753 Levels", BCP 14, RFC 2119, March 1997. 3755 [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System 3756 (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, 3757 January 2010. 3759 [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version 3760 2 External Data Representation Standard (XDR) Description", 3761 March 2011. 3763 [4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel 3764 NFS (pNFS) Operations", RFC 5664, January 2010. 3766 [5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 3767 Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, 3768 January 2005. 3770 [6] Haynes, T. and N. Williams, "Remote Procedure Call (RPC) 3771 Security Version 3", draft-williams-rpcsecgssv3 (Work In 3772 Progress), 2011. 3774 [7] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 3775 Specification", RFC 2203, September 1997. 3777 [8] Shepler, S., Eisler, M., and D. Noveck, "Network File System 3778 (NFS) Version 4 Minor Version 1 External Data Representation 3779 Standard (XDR) Description", RFC 5662, January 2010. 3781 [9] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) 3782 Block/Volume Layout", RFC 5663, January 2010. 3784 14.2. Informative References 3786 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 3787 Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), 3788 March 2011. 3790 [11] Eisler, M., "XDR: External Data Representation Standard", 3791 RFC 4506, May 2006. 3793 [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 3794 "NSDB Protocol for Federated Filesystems", 3795 draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), 3796 2010. 3798 [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, 3799 "Administration Protocol for Federated Filesystems", 3800 draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.
3802 [14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., 3803 Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- 3804 HTTP/1.1", RFC 2616, June 1999. 3806 [15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, 3807 RFC 959, October 1985. 3809 [16] Simpson, W., "PPP Challenge Handshake Authentication Protocol 3810 (CHAP)", RFC 1994, August 1996. 3812 [17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of 3813 Oracle Database Concepts 11g Release 1 (11.1)", January 2011. 3815 [18] Ashdown, L., "Chapter 15, Validating Database Files and 3816 Backups, of Oracle Database Backup and Recovery User's Guide 3817 11g Release 1 (11.1)", August 2008. 3819 [19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory 3820 Corruption of Solaris Internals", 2007. 3822 [20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- 3823 Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data 3824 Corruption in the Storage Stack", Proceedings of the 6th USENIX 3825 Symposium on File and Storage Technologies (FAST '08), 2008. 3827 [21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: 3828 Deployment, configuration and administration of Red Hat 3829 Enterprise Linux 5, Edition 6", 2011. 3831 [22] Quigley, D. and J. Lu, "Registry Specification for MAC Security 3832 Label Formats", draft-quigley-label-format-registry (Work In 3833 Progress), 2011. 3835 [23] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA 3836 Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. 3838 [24] Nowicki, B., "NFS: Network File System Protocol specification", 3839 RFC 1094, March 1989. 3841 [25] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 3842 Protocol Specification", RFC 1813, June 1995. 3844 [26] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 3845 RFC 1833, August 1995. 3847 [27] Eisler, M., "NFS Version 2 and Version 3 Security Issues and 3848 the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 3849 RFC 2623, June 1999. 3851 [28] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 3853 [29] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, 3854 June 1999. 3856 [30] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- 3857 line Database", RFC 3232, January 2002. 3859 [31] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, 3860 June 1996. 3862 [32] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 3863 C., Eisler, M., and D. Noveck, "Network File System (NFS) 3864 version 4 Protocol", RFC 3530, April 2003. 3866 Appendix A. Acknowledgments 3868 For the pNFS Access Permissions Check, the original draft was by 3869 Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work 3870 was influenced by discussions with Benny Halevy and Bruce Fields. A 3871 review was done by Tom Haynes. 3873 For the Sharing change attribute implementation details with NFSv4 3874 clients, the original draft was by Trond Myklebust. 3876 For the NFS Server-side Copy, the original draft was by James 3877 Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul 3878 Iyer. Tom Talpey co-authored an unpublished version of that document. 3880 It was also reviewed by a number of individuals: Pranoop Erasani, 3881 Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck, Theresa 3882 Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and Nico 3883 Williams.
3885 For the NFS space reservation operations, the original draft was by 3886 Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer. 3888 For the sparse file support, the original draft was by Dean 3889 Hildebrand and Marc Eshel. Valuable input and advice were received 3890 from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and 3891 Richard Scheffenegger. 3893 For Labeled NFS, the original draft was by David Quigley, James 3894 Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust, 3895 Sorin Faibish, Nico Williams, and David Black also contributed in 3896 the final push to get this accepted. 3898 Appendix B. RFC Editor Notes 3900 [RFC Editor: please remove this section prior to publishing this 3901 document as an RFC] 3903 [RFC Editor: prior to publishing this document as an RFC, please 3904 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 3905 RFC number of this document] 3907 Author's Address 3909 Thomas Haynes 3910 NetApp 3911 9110 E 66th St 3912 Tulsa, OK 74133 3913 USA 3915 Phone: +1 918 307 1415 3916 Email: thomas@netapp.com 3917 URI: http://www.tulsalabs.com