2 NFSv4 B. Halevy 3 Internet-Draft PrimaryData 4 Intended status: Standards Track B. Harrosh 5 Expires: November 2, 2014 B. Welch 6 B. Mueller 7 Panasas 8 May 01, 2014 10 Object-Based Parallel NFS (pNFS) Operations 11 draft-ietf-nfsv4-rfc5664bis-03 13 Abstract 15 Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to 16 allow clients to directly access file data on the storage used by the 17 NFSv4 server. This ability to bypass the server for data access can 18 increase both performance and parallelism, but requires additional 19 client functionality for data access, some of which is dependent on 20 the class of storage used, a.k.a. the Layout Type. The main pNFS 21 operations and data types in NFSv4 Minor version 1 specify a layout- 22 type-independent layer; layout-type-specific information is conveyed 23 using opaque data structures whose internal structure is further 24 defined by the particular layout type specification. This document 25 specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to 26 the main NFSv4 Minor version 1 specification. This document has been 27 updated since the initial version to clarify and fix some of the 28 RAID-related computations so they match current implementations. 30 Status of This Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at http://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time.
It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on November 2, 2014. 47 Copyright Notice 49 Copyright (c) 2014 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 65 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 66 1.2. Overview of Changes . . . . . . . . . . . . . . . . . . . 4 67 2. XDR Description of the Objects-Based Layout Protocol . . . . 4 68 2.1. Code Components Licensing Notice . . . . . . . . . . . . 5 69 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6 70 3.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 6 71 3.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . . . 6 72 3.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . . . 7 73 3.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . . 8 74 4. Object Storage Device Addressing and Discovery . . . . . . . 9 75 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 10 76 4.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . 10 77 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . 11 78 4.2.2. Device Network Address . . . . . . . . . . . . . . . 12 79 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 12 80 5.1. 
pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . 13 81 5.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . 14 82 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . 15 83 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 15 84 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 16 85 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 18 86 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 19 87 5.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 19 88 5.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 20 89 5.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 20 90 5.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . 21 91 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 22 92 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . 22 93 6.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . 23 94 6.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . 23 96 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 97 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . 24 98 8.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 25 99 8.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 26 100 8.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . 27 101 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 27 102 9.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . 27 103 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 29 104 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . 29 105 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 30 106 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 30 107 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 30 108 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 31 109 13. Security Considerations . . . . . . . . . . . . . . . . . . . 
31 110 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . 32 111 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . 33 112 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . 34 113 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . 34 114 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35 115 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 116 15.1. Normative References . . . . . . . . . . . . . . . . . . 35 117 15.2. Informative References . . . . . . . . . . . . . . . . . 36 118 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 37 119 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37 121 1. Introduction 123 In pNFS, the file server returns typed layout structures that 124 describe where file data is located. There are different layouts for 125 different storage systems and methods of arranging data on storage 126 devices. This document describes the layouts used with object-based 127 storage devices (OSDs) that are accessed according to the OSD storage 128 protocol standard (ANSI INCITS 400-2004 [1]). 130 An "object" is a container for data and attributes, and files are 131 stored in one or more objects. The OSD protocol specifies several 132 operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, 133 SET ATTRIBUTES, CREATE, and DELETE. However, using the object-based 134 layout the client only uses the READ, WRITE, GET ATTRIBUTES, and 135 FLUSH commands. The other commands are only used by the pNFS server. 137 An object-based layout for pNFS includes object identifiers, 138 capabilities that allow clients to READ or WRITE those objects, and 139 various parameters that control how file data is striped across their 140 component objects. The OSD protocol has a capability-based security 141 scheme that allows the pNFS server to control what operations and 142 what objects can be used by clients. 
This scheme is described in 143 more detail in the "Security Considerations" section (Section 13). 145 1.1. Requirements Language 147 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 148 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 149 document are to be interpreted as described in RFC 2119 [2]. 151 1.2. Overview of Changes 153 This document is an update to the initial RFC. The primary changes 154 are the clarification and correction of the RAID-related 155 equations and algorithms in Section 5.3. The equations were restated 156 for clarity, and in a few places minor corrections were made to 157 ensure that this specification accurately matches current implementations. 158 In addition, minor corrections have been made to other sections. 160 2. XDR Description of the Objects-Based Layout Protocol 162 This document contains the external data representation (XDR [6]) 163 description of the NFSv4.1 objects layout protocol. The XDR 164 description is embedded in this document in a way that makes it 165 simple for the reader to extract into a ready-to-compile form. The 166 reader can feed this document into the following shell script to 167 produce the machine-readable XDR description of the NFSv4.1 objects 168 layout protocol: 170 #!/bin/sh 171 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' 173 That is, if the above script is stored in a file called "extract.sh", 174 and this document is in a file called "spec.txt", then the reader can 175 do: 177 sh extract.sh < spec.txt > pnfs_osd_prot.x 179 The effect of the script is to remove leading white space from each 180 line, plus a sentinel sequence of "///". 182 The embedded XDR file header follows. Subsequent XDR descriptions, 183 with the sentinel sequence, are embedded throughout the document. 185 Note that the XDR code contained in this document depends on types 186 from the NFSv4.1 nfs4_prot.x file ([5]).
This includes both nfs 187 types that end with a 4, such as offset4, length4, etc., as well as 188 more generic types such as uint32_t and uint64_t. 190 2.1. Code Components Licensing Notice 192 The XDR description, marked with lines beginning with the sequence "/ 193 //", as well as scripts for extracting the XDR description are Code 194 Components as described in Section 4 of "Legal Provisions Relating to 195 IETF Documents" [3]. These Code Components are licensed according to 196 the terms of Section 4 of "Legal Provisions Relating to IETF 197 Documents". 199 /// /* 200 /// * Copyright (c) 2010 IETF Trust and the persons identified 201 /// * as authors of the code. All rights reserved. 202 /// * 203 /// * Redistribution and use in source and binary forms, with 204 /// * or without modification, are permitted provided that the 205 /// * following conditions are met: 206 /// * 207 /// * o Redistributions of source code must retain the above 208 /// * copyright notice, this list of conditions and the 209 /// * following disclaimer. 210 /// * 211 /// * o Redistributions in binary form must reproduce the above 212 /// * copyright notice, this list of conditions and the 213 /// * following disclaimer in the documentation and/or other 214 /// * materials provided with the distribution. 215 /// * 216 /// * o Neither the name of Internet Society, IETF or IETF 217 /// * Trust, nor the names of specific contributors, may be 218 /// * used to endorse or promote products derived from this 219 /// * software without specific prior written permission. 220 /// * 221 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 222 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 223 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 224 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 225 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO 226 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 227 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 228 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 229 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 230 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 231 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 232 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 233 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 234 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 235 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 236 /// * 237 /// * This code was derived from draft-ietf-nfsv4-rfc5664bis-03. 239 [[RFC Editor: please insert RFC number if needed]] 240 /// * Please reproduce this note if possible. 241 /// */ 242 /// 243 /// /* 244 /// * pnfs_osd_prot.x 245 /// */ 246 /// 247 /// %#include 248 /// 250 3. Basic Data Type Definitions 252 The following sections define basic data types and constants used by 253 the Object-Based Layout protocol. 255 3.1. pnfs_osd_objid4 257 An object is identified by a number, somewhat like an inode number. 258 The object storage model has a two-level scheme, where the objects 259 within an object storage device are grouped into partitions. 261 /// struct pnfs_osd_objid4 { 262 /// deviceid4 oid_device_id; 263 /// uint64_t oid_partition_id; 264 /// uint64_t oid_object_id; 265 /// }; 266 /// 268 The pnfs_osd_objid4 type is used to identify an object within a 269 partition on a specified object storage device. "oid_device_id" 270 selects the object storage device from the set of available storage 271 devices. The device is identified with the deviceid4 type, which is 272 an index into addressing information about that device returned by 273 the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data 274 type is defined in NFSv4.1 [4]. Within an OSD, a partition is 275 identified with a 64-bit number, "oid_partition_id". 
Within a 276 partition, an object is identified with a 64-bit number, 277 "oid_object_id". Creation and management of partitions is outside 278 the scope of this document, and is a facility provided by the object- 279 based storage file system. 281 3.2. pnfs_osd_version4 282 /// enum pnfs_osd_version4 { 283 /// PNFS_OSD_MISSING = 0, 284 /// PNFS_OSD_VERSION_1 = 1, 285 /// PNFS_OSD_VERSION_2 = 2 286 /// }; 287 /// 289 pnfs_osd_version4 is used to indicate the OSD protocol version used 290 to access an object, or whether an object is missing (i.e., 291 unavailable). Some of the RAID algorithms supported by object-based 292 layouts encode redundant information and can compensate for missing 293 components, but the data placement algorithms need to be aware of the 294 logical positions of the missing components. 296 The 1.0 version of the OSD standard has been ratified. The 2.0 297 version of the OSD standard has reached final draft status, but has 298 not been fully ratified. However, current object-based pNFS 299 implementations adhere to the OSD 2.0 protocol (SNIA T10/1729-D 300 [14]). The second generation OSD protocol has additional features to 301 support more robust error recovery, snapshots, and byte-range 302 capabilities. For completeness, and to allow for future revisions in 303 the OSD protocol, the OSD version is explicitly called out in the 304 information returned in the layout. (This information can also be 305 deduced by looking inside the capability type at the format field, 306 which is the first byte. The format value is 0x1 for an OSD v1 307 capability.) 309 3.3. 
pnfs_osd_object_cred4 311 /// enum pnfs_osd_cap_key_sec4 { 312 /// PNFS_OSD_CAP_KEY_SEC_NONE = 0, 313 /// PNFS_OSD_CAP_KEY_SEC_SSV = 1 314 /// }; 315 /// 316 /// struct pnfs_osd_object_cred4 { 317 /// pnfs_osd_objid4 oc_object_id; 318 /// pnfs_osd_version4 oc_osd_version; 319 /// pnfs_osd_cap_key_sec4 oc_cap_key_sec; 320 /// opaque oc_capability_key<>; 321 /// opaque oc_capability<>; 322 /// }; 323 /// 325 The pnfs_osd_object_cred4 structure is used to identify each 326 component comprising the file. The "oc_object_id" identifies the 327 component object, the "oc_osd_version" represents the OSD protocol 328 version, or whether that component is unavailable, and the 329 "oc_capability" and "oc_capability_key", along with the 330 "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD 331 security credentials needed to access that object. The 332 "oc_cap_key_sec" value denotes the method used to secure the 333 oc_capability_key (see Section 13.1 for more details). 335 To comply with the OSD security requirements, the capability key 336 SHOULD be transferred securely to prevent eavesdropping (see 337 Section 13). Therefore, a client SHOULD either issue the LAYOUTGET 338 or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service 339 or previously establish a secret state verifier (SSV) for the 340 sessions via the NFSv4.1 SET_SSV operation. The 341 pnfs_osd_cap_key_sec4 type is used to identify the method used by the 342 server to secure the capability key. 344 o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is 345 not encrypted, in which case the client SHOULD issue the LAYOUTGET 346 or GETDEVICEINFO operations with RPCSEC_GSS with the privacy 347 service, or the NFSv4.1 transport should be secured by using 348 methods that are external to NFSv4.1, such as the use of IPsec [15] 349 for transporting the NFSv4.1 protocol.
351 o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key 352 contents are encrypted using the SSV GSS context and the 353 capability key as inputs to the GSS_Wrap() function (see GSS-API 354 [7]) with the conf_req_flag set to TRUE. The client MUST use the 355 secret SSV key as part of the client's GSS context to decrypt the 356 capability key using the value of the oc_capability_key field as 357 the input_message to the GSS_unwrap() function. Note that to 358 prevent eavesdropping of the SSV key, the client SHOULD issue 359 SET_SSV via RPCSEC_GSS with the privacy service. 361 The actual method chosen depends on whether the client established an 362 SSV key with the server and whether it issued the operation with the 363 RPCSEC_GSS privacy method. Naturally, if the client did not 364 establish an SSV key via SET_SSV, the server MUST use the 365 PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the operation was 366 not issued with the RPCSEC_GSS privacy method, the server SHOULD 367 secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV 368 method. The server MAY also use the PNFS_OSD_CAP_KEY_SEC_SSV method 369 when the operation was issued with the RPCSEC_GSS privacy method. 371 3.4. pnfs_osd_raid_algorithm4 372 /// enum pnfs_osd_raid_algorithm4 { 373 /// PNFS_OSD_RAID_0 = 1, 374 /// PNFS_OSD_RAID_4 = 2, 375 /// PNFS_OSD_RAID_5 = 3, 376 /// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */ 377 /// }; 378 /// 380 pnfs_osd_raid_algorithm4 represents the data redundancy algorithm 381 used to protect the file's contents. See Section 5.4 for more 382 details. 384 4. Object Storage Device Addressing and Discovery 386 Data operations to an OSD require the client to know the "address" of 387 each OSD's root object. The root object is synonymous with the Small 388 Computer System Interface (SCSI) logical unit. The client specifies 389 SCSI logical units to its SCSI protocol stack using a representation 390 local to the client.
Because these representations are local, 391 GETDEVICEINFO must return information that can be used by the client 392 to select the correct local representation. 394 In the block world, a set offset (logical block number or track/ 395 sector) contains a disk label. This label identifies the disk 396 uniquely. In contrast, an OSD has a standard set of attributes on 397 its root object. For device identification purposes, the OSD System 398 ID (root information attribute number 3) and the OSD Name (root 399 information attribute number 9) are used as the label. These appear 400 in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and 401 "oda_osdname" fields. 403 In some situations, SCSI target discovery may need to be driven based 404 on information contained in the GETDEVICEINFO response. One example 405 of this is Internet SCSI (iSCSI) targets that are not known to the 406 client until a layout has been requested. The information provided 407 as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the 408 pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows 409 the client to probe a specific device given its network address and 410 optionally its iSCSI Name (see iSCSI [8]), or when the device network 411 address is omitted, allows it to discover the object storage device 412 using the provided device name or SCSI Device Identifier (see SPC-3 413 [10].) 415 The oda_systemid is implicitly used by the client, by using the 416 object credential signing key to sign each request with the request 417 integrity check value. This method protects the client from 418 unintentionally accessing a device if the device address mapping was 419 changed (or revoked). The server computes the capability key using 420 its own view of the systemid associated with the respective deviceid 421 present in the credential. 
If the client's view of the deviceid 422 mapping is stale, the client will use the wrong systemid (which must 423 be system-wide unique) and the I/O request to the OSD will fail to 424 pass the integrity check verification. 426 To recover from this condition, the client should report the error and 427 return the layout using LAYOUTRETURN, and invalidate all the device 428 address mappings associated with this layout. The client can then, 429 if it wishes, ask for a new layout using LAYOUTGET and resolve the 430 referenced deviceids using GETDEVICEINFO or GETDEVICELIST. 432 The server MUST provide the oda_systemid and SHOULD also provide the 433 oda_osdname. When the OSD name is present, the client SHOULD get the 434 root information attributes whenever it establishes communication 435 with the OSD and verify that the OSD name it got from the OSD matches 436 the one sent by the metadata server. To do so, the client uses the 437 root_obj_cred credentials. 439 4.1. pnfs_osd_targetid_type4 441 The following enum specifies the manner in which a SCSI target can be 442 specified. The target can be specified as a SCSI Name or as a SCSI 443 Device Identifier. 445 /// enum pnfs_osd_targetid_type4 { 446 /// OBJ_TARGET_ANON = 1, 447 /// OBJ_TARGET_SCSI_NAME = 2, 448 /// OBJ_TARGET_SCSI_DEVICE_ID = 3 449 /// }; 450 /// 452 4.2. pnfs_osd_deviceaddr4 454 The "pnfs_osd_deviceaddr4" data structure is returned by the server 455 as the storage-protocol-specific opaque field da_addr_body in the 456 "device_addr4" structure of a successful GETDEVICEINFO operation 457 (see NFSv4.1 [4]).
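The device-discovery rules of Sections 4 and 4.2.2 can be sketched as client-side logic. The following Python sketch is purely illustrative: the DeviceAddr mirror type and the strategy names are hypothetical and are not part of the protocol; the field names follow the pnfs_osd_deviceaddr4 structure specified in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical client-side mirror of pnfs_osd_deviceaddr4 (not a wire-
# format decoder); field names follow the XDR in this section.
@dataclass
class DeviceAddr:
    targetid_type: str         # "ANON", "SCSI_NAME", or "SCSI_DEVICE_ID"
    targetid: bytes            # oti_scsi_name / oti_scsi_device_id contents
    targetaddr: Optional[str]  # oda_targetaddr hint, or None when absent
    lun: bytes                 # oda_lun, 8-byte Logical Unit Number
    systemid: bytes            # oda_systemid
    osdname: bytes             # oda_osdname

def discovery_strategy(dev: DeviceAddr) -> str:
    """Pick a discovery path: probe the given network address when the
    hint is present; otherwise fall back to a name service (e.g., iSNS)
    keyed by the target id; for OBJ_TARGET_ANON, only the OSD System ID
    (and optionally the OSD name) locate the device."""
    if dev.targetaddr is not None:
        return "probe-network-address"
    if dev.targetid_type in ("SCSI_NAME", "SCSI_DEVICE_ID"):
        return "name-service-lookup"
    return "lookup-by-systemid"
```

A client might still consult an external name service even when a network address hint is present, e.g., for multipathed devices, as noted in Section 4.2.2.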
459 The specification for an object device address is as follows: 461 /// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { 462 /// case OBJ_TARGET_SCSI_NAME: 463 /// string oti_scsi_name<>; 464 /// 465 /// case OBJ_TARGET_SCSI_DEVICE_ID: 466 /// opaque oti_scsi_device_id<>; 467 /// 468 /// default: 469 /// void; 470 /// }; 471 /// 472 /// union pnfs_osd_targetaddr4 switch (bool ota_available) { 473 /// case TRUE: 474 /// netaddr4 ota_netaddr; 475 /// case FALSE: 476 /// void; 477 /// }; 478 /// 479 /// struct pnfs_osd_deviceaddr4 { 480 /// pnfs_osd_targetid4 oda_targetid; 481 /// pnfs_osd_targetaddr4 oda_targetaddr; 482 /// opaque oda_lun[8]; 483 /// opaque oda_systemid<>; 484 /// pnfs_osd_object_cred4 oda_root_obj_cred; 485 /// opaque oda_osdname<>; 486 /// }; 487 /// 489 4.2.1. SCSI Target Identifier 491 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the 492 "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as 493 specified in iSCSI [8] and [9]. Note that the specification of the 494 oti_scsi_name string format is outside the scope of this document. 495 Parsing the string is based on the string prefix, e.g., "iqn.", 496 "eui.", or "naa.", and more formats MAY be specified in the future in 497 accordance with iSCSI Names properties. 499 Currently, the iSCSI Name provides for naming the target device using 500 a string formatted as an iSCSI Qualified Name (IQN) or as an Extended 501 Unique Identifier (EUI) [13] string. Those are typically used to 502 identify iSCSI or SCSI RDMA Protocol (SRP) [20] devices. The 503 Network Address Authority (NAA) string format (see [9]) provides for 504 naming the device using globally unique identifiers, as defined in 505 Fibre Channel Framing and Signaling (FC-FS) [21]. These are 506 typically used to identify Fibre Channel or SAS [22] (Serial Attached 507 SCSI) devices, in particular, devices that are dual-attached 508 over both Fibre Channel or SAS and iSCSI.
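The prefix-based parsing described above can be illustrated with a small helper. This is a sketch only; the function name and return strings are hypothetical, and unknown prefixes are reported rather than rejected since more formats MAY be specified in the future.

```python
def scsi_name_format(oti_scsi_name: str) -> str:
    """Classify an oti_scsi_name string by its prefix, per the
    prefix-based parsing described in Section 4.2.1."""
    prefixes = (
        ("iqn.", "iSCSI Qualified Name"),
        ("eui.", "Extended Unique Identifier"),
        ("naa.", "Network Address Authority"),
    )
    for prefix, fmt in prefixes:
        if oti_scsi_name.startswith(prefix):
            return fmt
    # Future formats may be added; do not treat unknown prefixes as fatal.
    return "unknown"
```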
510 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the 511 "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device 512 Identifier as defined in SPC-3 [10] VPD Page 83h (Section 7.6.3. 513 "Device Identification VPD Page"). If the Device Identifier is 514 identical to the OSD System ID, as given by oda_systemid, the server 515 SHOULD provide a zero-length oti_scsi_device_id opaque value. Note 516 that similarly to the "oti_scsi_name", the specification of the 517 oti_scsi_device_id opaque contents is outside the scope of this 518 document and more formats MAY be specified in the future in 519 accordance with SPC-3. 521 The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing 522 no target identification. In this case, only the OSD System ID, and 523 optionally the provided network address, are used to locate the 524 device. 526 4.2.2. Device Network Address 528 The optional "oda_targetaddr" field MAY be provided by the server as 529 a hint to accelerate device discovery over, e.g., the iSCSI transport 530 protocol. The network address is given with the netaddr4 type, which 531 specifies a TCP/IP based endpoint (as specified in NFSv4.1 [4]). 532 When given, the client SHOULD use it to probe for the SCSI device at 533 the given network address. The client MAY still use other discovery 534 mechanisms such as Internet Storage Name Service (iSNS) [12] to 535 locate the device using the oda_targetid. In particular, such an 536 external name service SHOULD be used when the devices may be attached 537 to the network using multiple connections, and/or multiple storage 538 fabrics (e.g., Fibre-Channel and iSCSI). 540 The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, 541 formatted in accordance with SAM-3 [11]. The client uses the Logical 542 Unit Number to communicate with the specific OSD Logical Unit. Its 543 use is defined in detail by the SCSI transport protocol, e.g., iSCSI 544 [8]. 546 5. 
Object-Based Layout 548 The layout4 type is defined in NFSv4.1 [4] as follows: 550 enum layouttype4 { 551 LAYOUT4_NFSV4_1_FILES = 1, 552 LAYOUT4_OSD2_OBJECTS = 2, 553 LAYOUT4_BLOCK_VOLUME = 3 554 }; 556 struct layout_content4 { 557 layouttype4 loc_type; 558 opaque loc_body<>; 559 }; 561 struct layout4 { 562 offset4 lo_offset; 563 length4 lo_length; 564 layoutiomode4 lo_iomode; 565 layout_content4 lo_content; 566 }; 568 This document defines the structure associated with the layouttype4 569 value LAYOUT4_OSD2_OBJECTS. NFSv4.1 [4] specifies the loc_body 570 structure as an XDR type "opaque". The opaque layout is 571 uninterpreted by the generic pNFS client layers, but obviously must 572 be interpreted by the object storage layout driver. This section 573 defines the structure of this opaque value, pnfs_osd_layout4. 575 5.1. pnfs_osd_data_map4 577 /// struct pnfs_osd_data_map4 { 578 /// uint32_t odm_num_comps; 579 /// length4 odm_stripe_unit; 580 /// uint32_t odm_group_width; 581 /// uint32_t odm_group_depth; 582 /// uint32_t odm_mirror_cnt; 583 /// pnfs_osd_raid_algorithm4 odm_raid_algorithm; 584 /// }; 585 /// 587 The pnfs_osd_data_map4 structure parameterizes the algorithm that 588 maps a file's contents over the component objects. Instead of 589 limiting the system to a simple striping scheme where loss of a single 590 component object results in data loss, the map parameters support 591 mirroring and more complicated schemes that protect against loss of a 592 component object. 594 "odm_num_comps" is the number of component objects the file is 595 striped over. The server MAY grow the file by adding more components 596 to the stripe while clients hold valid layouts until the file has 597 reached its final stripe width. The file length in this case MUST be 598 limited to the number of bytes in a full stripe. 600 The "odm_stripe_unit" is the number of bytes placed on one component 601 before advancing to the next one in the list of components.
The 602 number of bytes in a full stripe is odm_stripe_unit times the number 603 of components. In some RAID schemes, a stripe includes redundant 604 information (i.e., parity) that lets the system recover from loss or 605 damage to a component object. 607 The "odm_group_width" and "odm_group_depth" parameters allow a nested 608 striping pattern (see Section 5.3.2 for details). If there is no 609 nesting, then odm_group_width and odm_group_depth MUST be zero. The 610 size of the components array MUST be a multiple of odm_group_width. 612 The "odm_mirror_cnt" is used to replicate a file by replicating its 613 component objects. If there is no mirroring, then odm_mirror_cnt 614 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 615 the component array MUST be a multiple of (odm_mirror_cnt+1). 617 See Section 5.3 for more details. 619 5.2. pnfs_osd_layout4 621 /// struct pnfs_osd_layout4 { 622 /// pnfs_osd_data_map4 olo_map; 623 /// uint32_t olo_comps_index; 624 /// pnfs_osd_object_cred4 olo_components<>; 625 /// }; 626 /// 628 The pnfs_osd_layout4 structure specifies a layout over a set of 629 component objects. The "olo_components" field is an array of object 630 identifiers and security credentials that grant access to each 631 object. The organization of the data is defined by the 632 pnfs_osd_data_map4 type that specifies how the file's data is mapped 633 onto the component objects (i.e., the striping pattern). The data 634 placement algorithm that maps file data onto component objects 635 assumes that each component object occurs exactly once in the array 636 of components. Therefore, component objects MUST appear in the 637 olo_components array only once. The components array may represent 638 all objects comprising the file, in which case "olo_comps_index" is 639 set to zero and the number of entries in the olo_components array is 640 equal to olo_map.odm_num_comps. 
The server MAY return fewer 641 components than odm_num_comps, provided that the returned components 642 are sufficient to access any byte in the layout's data range (e.g., a 643 sub-stripe of "odm_group_width" components). In this case, 644 olo_comps_index represents the position of the returned components 645 array within the full array of components that comprise the file. 647 Note that the layout depends on the file size, which the client 648 learns from the generic return parameters of LAYOUTGET, or by doing 649 GETATTR commands to the metadata server. The client uses the file 650 size to decide if it should fill holes with zeros or return a short 651 read. Striping patterns can cause cases where component objects are 652 shorter than other components because a hole happens to correspond to 653 the last part of the component object. 655 5.3. Data Mapping Schemes 657 This section describes the different data mapping schemes in detail. 658 The object layout always uses a "dense" layout as described in 659 NFSv4.1 [4]. This means that the second stripe unit of the file 660 starts at offset 0 of the second component, rather than at offset 661 stripe_unit bytes. After a full stripe has been written, the next 662 stripe unit is appended to the first component object in the list 663 without any holes in the component objects. 665 5.3.1.
Simple Striping 667 The mapping from the logical offset within a file (L) to the 668 component object C and object-specific offset O is defined by the 669 following equations: 671 L: logical offset into the file 673 W: stripe width 674 W = size of olo_components array 676 S: number of bytes in a stripe 677 S = W * stripe_unit 679 N: stripe number 680 N = L / S 682 C: component index corresponding to L 683 C = (L % S) / stripe_unit 685 O: The component offset corresponding to L 686 O = (N * stripe_unit) + (L % stripe_unit) 688 Note that this computation does not accommodate the same object 689 appearing in the olo_components array multiple times. Therefore, the 690 server may not return layouts with the same object appearing multiple 691 times. If needed, the server can return multiple layout segments, each 692 covering a single instance of the object. 694 For example, consider an object striped over four devices, <D0, D1, 695 D2, D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * 696 4096 = 16384. 698 Offset 0: 699 N = 0 / 16384 = 0 700 C = (0 % 16384) / 4096 = 0 (D0) 701 O = 0*4096 + (0%4096) = 0 703 Offset 4096: 704 N = 4096 / 16384 = 0 705 C = (4096 % 16384) / 4096 = 1 (D1) 706 O = (0*4096)+(4096%4096) = 0 708 Offset 9000: 709 N = 9000 / 16384 = 0 710 C = (9000 % 16384) / 4096 = 2 (D2) 711 O = (0*4096)+(9000%4096) = 808 713 Offset 132000: 714 N = 132000 / 16384 = 8 715 C = (132000 % 16384) / 4096 = 0 (D0) 716 O = (8*4096) + (132000%4096) = 33696 718 5.3.2. Nested Striping 720 The odm_group_width and odm_group_depth parameters allow a nested 721 striping pattern. odm_group_width defines the width of a data stripe 722 and odm_group_depth defines how many stripes are written before 723 advancing to the next group of components in the list of component 724 objects for the file. The math used to map from a file offset to a 725 component object and offset within that object is shown below.
The 726 computations map from the logical offset L to the component index C 727 and relative offset O within that component object. 729 L: logical offset into the file 731 FW: total number of components 732 FW = size of olo_components array 734 W: stripe width 735 W = group_width, if not zero, else FW 737 group_count: number of groups 738 group_count = FW / group_width, if group_width is not zero, else 1 740 D: number of data devices in a stripe 741 D = W 743 U: number of data bytes in a stripe within a group 744 U = D * stripe_unit 746 T: number of bytes striped within a group of component objects 747 (before advancing to the next group) 748 T = U * group_depth 750 S: number of bytes striped across all component objects 751 (before the pattern repeats) 752 S = T * group_count 754 M: The "major" (i.e., across all components) cycle number 755 M = L / S 757 G: group number from the beginning of the major cycle 758 G = (L % S) / T 760 H: byte offset within the last group 761 H = (L % S) % T 763 N: The "minor" (i.e., across the group) stripe number 764 N = H / U 766 C: component index corresponding to L 767 C = (G * D) + ((H % U) / stripe_unit) 769 O: The component offset corresponding to L 770 O = (M * group_depth * stripe_unit) + (N * stripe_unit) + 771 (L % stripe_unit) 773 For example, consider an object striped over 100 devices with a 774 group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB. 775 In this scheme, 500 MB are written to the first 10 components, and 776 5000 MB are written before the pattern wraps back around to the first 777 component in the array.
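The nested mapping equations above can also be expressed in executable form. The following Python function is a non-normative sketch (the function name and defaults are illustrative, not part of the protocol); all divisions are integer divisions, as in the equations. With group_width = 0 and group_depth = 1 it reduces to the simple striping scheme of Section 5.3.1.

```python
def map_nested(L, FW, stripe_unit, group_width=0, group_depth=1):
    """Map logical file offset L to (component index C, component offset O).

    Implements the nested striping equations (W, group_count, U, T, S,
    M, G, H, N) from the layout specification.  All quantities are in
    bytes; integer division throughout.
    """
    W = group_width if group_width else FW           # stripe width
    group_count = FW // group_width if group_width else 1
    D = W                                            # data devices in a stripe
    U = D * stripe_unit                              # data bytes in a stripe
    T = U * group_depth                              # bytes striped in a group
    S = T * group_count                              # bytes before pattern repeats
    M = L // S                                       # major cycle number
    G = (L % S) // T                                 # group number in the cycle
    H = (L % S) % T                                  # byte offset within group
    N = H // U                                       # minor stripe number
    C = G * D + (H % U) // stripe_unit               # component index
    O = M * group_depth * stripe_unit + N * stripe_unit + L % stripe_unit
    return C, O
```

Running this with FW = 100, stripe_unit = 1 MB, group_width = 10, and group_depth = 50 reproduces the worked examples that follow (e.g., offset 27 MB maps to C = 7, O = 2 MB), and with group_width = 0 it reproduces the simple striping examples of Section 5.3.1.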
779 Offset 0: 780 W = 100 781 group_count = 100 / 10 = 10 782 D = 10 783 U = 1 MB * 10 = 10 MB 784 T = 10 MB * 50 = 500 MB 785 S = 500 MB * 10 = 5000 MB 786 M = 0 / 5000 MB = 0 787 G = (0 % 5000 MB) / 500 MB = 0 788 H = (0 % 5000 MB) % 500 MB = 0 789 N = 0 / 10 MB = 0 790 C = (0 * 10) + ((0 % 10 MB) / 1 MB) = 0 791 O = (0 * 50 * 1 MB) + (0 * 1 MB) + (0 % 1 MB) = 0 793 Offset 27 MB: 794 M = 27 MB / 5000 MB = 0 795 G = (27 MB % 5000 MB) / 500 MB = 0 796 H = (27 MB % 5000 MB) % 500 MB = 27 MB 797 N = 27 MB / 10 MB = 2 798 C = (0 * 10) + ((27 MB % 10 MB) / 1 MB) = 7 799 O = (0 * 50 * 1 MB) + (2 * 1 MB) + (27 MB % 1 MB) = 2 MB 801 Offset 7232 MB: 802 M = 7232 MB / 5000 MB = 1 803 G = (7232 MB % 5000 MB) / 500 MB = 4 804 H = (7232 MB % 5000 MB) % 500 MB = 232 MB 805 N = 232 MB / 10 MB = 23 806 C = (4 * 10) + ((232 MB % 10 MB) / 1 MB) = 42 807 O = (1 * 50 * 1 MB) + (23 * 1 MB) + (7232 MB % 1 MB) = 73 MB 809 5.3.3. Mirroring 811 The odm_mirror_cnt is used to replicate a file by replicating its 812 component objects. If there is no mirroring, then odm_mirror_cnt 813 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 814 the olo_components array MUST be a multiple of (odm_mirror_cnt+1). 815 Thus, for a classic mirror on two objects, odm_mirror_cnt is one. 816 Note that mirroring can be defined over any RAID algorithm and 817 striping pattern (either simple or nested). If odm_group_width is 818 also non-zero, then the size of the olo_components array MUST be a 819 multiple of odm_group_width * (odm_mirror_cnt+1). Note that 820 odm_group_width does not account for mirrors. Replicas are adjacent 821 in the olo_components array, and the value C produced by the above 822 equations is not a direct index into the olo_components array. 824 Instead, the following equations determine the replica component 825 index RCi, where i ranges from 0 to odm_mirror_cnt. 
827 FW = size of olo_components array / (odm_mirror_cnt+1) 829 C = component index for striping or two-level striping 830 as calculated using above equations 832 i ranges from 0 to odm_mirror_cnt, inclusive 833 RCi = C * (odm_mirror_cnt+1) + i 835 5.4. RAID Algorithms 837 pnfs_osd_raid_algorithm4 determines the algorithm and placement of 838 redundant data. This section defines the different redundancy 839 algorithms. Note: The term "RAID" (Redundant Array of Independent 840 Disks) is used in this document to represent an array of component 841 objects that store data for an individual file. The objects are 842 stored on independent object-based storage devices. File data is 843 encoded and striped across the array of component objects using 844 algorithms developed for block-based RAID systems. 846 The use of per-file RAID encoding in the object-layout for pNFS 847 imposes an additional responsibility on the file system client. The 848 pNFS client SHOULD generate the redundant data and write it to 849 storage along with the file data according to the RAID parameters 850 returned in the layout. However, various error conditions may 851 prevent the client from meeting its obligations, and this is 852 supported by the error information in the pnfs_osd_ioerr4 structure 853 (see Section 8.1). An explicit error return from the client, or an 854 implicit error caused by a client's failure to return a layout, MUST 855 trigger recovery action by the server to prevent access to invalid 856 data (see Section 7). It is the server's responsibility to only 857 grant layout information to files that can be safely accessed, and to 858 deny access to files that are in an inconsistent state. 860 5.4.1. PNFS_OSD_RAID_0 862 PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the 863 component objects are data bytes located by the above equations for C 864 and O.
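For PNFS_OSD_RAID_0 with simple striping, the striping equations of Section 5.3.1 and the replica-index equations of Section 5.3.3 combine directly. The following Python function is a non-normative sketch (the function name is illustrative) of locating every replica of a logical offset:

```python
def raid0_locate(L, num_comps, stripe_unit, mirror_cnt=0):
    """Locate logical offset L under PNFS_OSD_RAID_0 with simple striping.

    Returns (component offset O, list of replica indices into
    olo_components).  FW counts the distinct striping positions;
    mirrored replicas are adjacent in the olo_components array, so the
    striping index C is scaled by (mirror_cnt + 1).
    """
    FW = num_comps // (mirror_cnt + 1)    # distinct striping positions
    S = FW * stripe_unit                  # bytes in a full stripe
    N = L // S                            # stripe number
    C = (L % S) // stripe_unit            # striping component index
    O = N * stripe_unit + L % stripe_unit # offset within each replica
    replicas = [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]
    return O, replicas
```

For the four-device example of Section 5.3.1, offset 9000 yields O = 808 on component 2; mirroring the same layout with odm_mirror_cnt = 1 over eight components yields the adjacent replica pair at indices 4 and 5.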
If a component object is marked as PNFS_OSD_MISSING, an I/O 865 error MUST be returned if this component is accessed. In this case, 866 the generic NFS client layer MAY elect to retry this operation 867 against the pNFS server. 869 5.4.2. PNFS_OSD_RAID_4 871 PNFS_OSD_RAID_4 means that the last component object, or the last in 872 each group (if odm_group_width is greater than zero), contains parity 873 information computed over the rest of the stripe with an XOR 874 operation. If a component object is unavailable, the client can read 875 the rest of the stripe units in the damaged stripe and recompute the 876 missing stripe unit by XORing the other stripe units in the stripe. 877 Alternatively, the client can replay the READ against the pNFS server, which will 878 presumably perform the reconstructed read on the client's behalf. 880 When parity is present in the file, the number of parity devices 881 is taken into account in the above equations when calculating (D), 882 the number of data devices in a stripe, as follows: 884 P: number of parity devices in each stripe 885 P = 1 887 D: number of data devices in a stripe 888 D = W - P 890 I: parity device index 891 I = D 893 5.4.3. PNFS_OSD_RAID_5 895 PNFS_OSD_RAID_5 means that the position of the parity data is rotated 896 on each stripe or each group (if odm_group_width is greater than 897 zero). In the first stripe, the last component holds the parity. In 898 the second stripe, the next-to-last component holds the parity, and 899 so on. In this scheme, all stripe units are rotated so that I/O is 900 evenly spread across objects as the file is read sequentially. The 901 rotated parity layout is illustrated here, with hexadecimal numbers 902 indicating the stripe unit. 904 0 1 2 P 905 4 5 P 3 906 8 P 6 7 907 P 9 a b 909 Note that the math for RAID_5 is similar to that for RAID_4, except that the 910 device indices for each stripe are rotated backwards.
So start with 911 the equations above for RAID_4, then compute the rotation as 912 described below. Also note that the parity rotation cycle always 913 starts on group boundaries, so the first stripe in a group has its 914 parity at device D. 916 P: number of parity devices in each stripe 917 P = 1 919 PC: Parity Cycle 920 PC = W 922 R: The parity rotation index 923 (N is as computed in above equations for RAID-4) 924 R = N % PC 926 I: parity device index 927 I = (W + W - (R + 1) * P) % W 929 Cr: The rotated device index 930 (C is as computed in the above equations for RAID-4) 931 Cr = (W + C - (R * P)) % W 933 Note: W is added above to avoid negative numbers in the modulo arithmetic. 935 5.4.4. PNFS_OSD_RAID_PQ 937 PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon 938 P+Q encoding scheme [16]. In this layout, the last two component 939 objects hold the P and Q data, respectively. P is parity computed 940 with XOR. The Q computation is described in detail by Anvin [17]. 941 The same polynomial "x^8+x^4+x^3+x^2+1" and Galois field size of 2^8 942 are used here. Clients may simply choose to read data through the 943 metadata server if two or more components are missing or damaged. 945 The equations given above for embedded parity can be used to map a 946 file offset to the correct component object by setting the number of 947 parity components (P) to 2 instead of 1 for RAID-5 and computing the 948 Parity Cycle length as the Lowest Common Multiple [18] of 949 odm_group_width and P, divided by P, as described below. Note: This 950 algorithm can also be used for RAID-5, where P=1. 952 P: number of parity devices 953 P = 2 955 PC: Parity cycle: 956 PC = LCM(W, P) / P 958 Q: The device index holding the Q component 959 (I is as computed in the above equations for RAID-5) 960 Qdev = (I + 1) % W 962 5.4.5.
RAID Usage and Implementation Notes 964 RAID layouts with redundant data in their stripes require additional 965 serialization of updates to ensure correct operation. Otherwise, if 966 two clients simultaneously write to the same logical range of an 967 object, the result could include different data in the same ranges of 968 mirrored tuples, or corrupt parity information. It is the 969 responsibility of the metadata server to enforce serialization 970 requirements. Serialization MUST occur at the RAID stripe boundary 971 for write operations to avoid corrupting parity by concurrent updates 972 to the same stripe. Mirrors do not have explicit stripe boundaries, 973 so it is sufficient to serialize writes to the same byte ranges. 975 Many alternative encoding schemes exist for P>=2 [19]. These involve 976 P or Q equations different than the Reed-Solomon encoding used in 977 PNFS_OSD_RAID_PQ. Thus, if one of these schemes is to be used in the 978 future, a distinct value must be added to pnfs_osd_raid_algorithm4 979 for it. 981 6. Object-Based Layout Update 983 layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates 984 to the layout and additional information to the metadata server. It 985 is defined in the NFSv4.1 [4] as follows: 987 struct layoutupdate4 { 988 layouttype4 lou_type; 989 opaque lou_body<>; 990 }; 992 The layoutupdate4 type is an opaque value at the generic pNFS client 993 level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the 994 lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. 996 Object-Based pNFS clients are not allowed to modify the layout. 997 Therefore, the information passed in pnfs_osd_layoutupdate4 is used 998 only to update the file's attributes. 
In addition to the generic 999 information the client can pass to the metadata server in 1000 LAYOUTCOMMIT, such as the highest offset the client wrote to and the 1001 last time it modified the file, the client MAY use 1002 pnfs_osd_layoutupdate4 to convey the capacity consumed (or released) 1003 by writes using the layout, and to indicate that I/O errors were 1004 encountered by such writes. 1006 6.1. pnfs_osd_deltaspaceused4 1008 /// union pnfs_osd_deltaspaceused4 switch (bool dsu_valid) { 1009 /// case TRUE: 1010 /// int64_t dsu_delta; 1011 /// case FALSE: 1012 /// void; 1013 /// }; 1014 /// 1016 pnfs_osd_deltaspaceused4 is used to convey space utilization 1017 information at the time of LAYOUTCOMMIT. For the file system to 1018 properly maintain capacity-used information, it needs to track how 1019 much capacity was consumed by WRITE operations performed by the 1020 client. In this protocol, the OSD returns the capacity consumed by a 1021 write, which can be different from the number of bytes written 1022 because of internal overhead like block-level allocation and indirect 1023 blocks, and the client reflects this back to the pNFS server so it 1024 can accurately track quota. The pNFS server can choose to trust this 1025 information coming from the clients and therefore avoid querying the 1026 OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain 1027 this information from the OSD, it simply returns an invalid 1028 olu_delta_space_used. 1030 6.2. pnfs_osd_layoutupdate4 1032 /// struct pnfs_osd_layoutupdate4 { 1033 /// pnfs_osd_deltaspaceused4 olu_delta_space_used; 1034 /// bool olu_ioerr_flag; 1035 /// }; 1036 /// 1038 "olu_delta_space_used" is used to convey capacity usage information 1039 back to the metadata server. 1041 The "olu_ioerr_flag" is used when I/O errors were encountered while 1042 writing the file. The client MUST report the errors using the 1043 pnfs_osd_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time.
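As a non-normative illustration of how the pnfs_osd_layoutupdate4 structure above is serialized for the LAYOUTCOMMIT lou_body, the following Python sketch applies the standard XDR encoding rules (a bool is a 4-byte big-endian 0 or 1, and the int64_t dsu_delta is an 8-byte big-endian signed hyper); the function name is illustrative:

```python
import struct

def encode_layoutupdate(delta_valid, delta, ioerr_flag):
    """XDR-encode a pnfs_osd_layoutupdate4 for the LAYOUTCOMMIT lou_body.

    Encodes the olu_delta_space_used union (discriminated by dsu_valid),
    followed by the olu_ioerr_flag bool, per standard XDR rules.
    """
    body = struct.pack(">i", 1 if delta_valid else 0)   # dsu_valid
    if delta_valid:
        body += struct.pack(">q", delta)                # dsu_delta (hyper)
    body += struct.pack(">i", 1 if ioerr_flag else 0)   # olu_ioerr_flag
    return body
```

A client that consumed 4096 additional bytes and saw no errors would thus produce a 16-byte body; one that could not obtain space information from the OSD omits dsu_delta, producing an 8-byte body.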
1045 If the client updated the file successfully before hitting the I/O 1046 errors, it MAY use LAYOUTCOMMIT to update the metadata server as 1047 described above. Typically, in the error-free case, the server MAY 1048 turn around and update the file's attributes on the storage devices. 1049 However, if I/O errors were encountered, the server should not 1050 attempt to write the new attributes on the storage devices until it 1051 receives the I/O error report; therefore, the client MUST set the 1052 olu_ioerr_flag to true. Note that in this case, the client SHOULD 1053 send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same 1054 COMPOUND RPC. 1056 7. Recovering from Client I/O Errors 1058 The pNFS client may encounter errors when directly accessing the 1059 object storage devices. A well-behaved client will report any such 1060 errors promptly by executing a LAYOUTRETURN. When the 1061 LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the 1062 I/O errors to the server at LAYOUTRETURN time using the 1063 pnfs_osd_ioerr4 structure (see Section 8.1). 1065 It is the responsibility of the metadata server to handle the I/O 1066 errors. The server MUST analyze the error and perform the required 1067 recovery operations such as repairing any parity inconsistencies, 1068 recovering media failures, or reconstructing missing objects. 1070 The metadata server SHOULD recall any outstanding layouts to allow it 1071 exclusive write access to the stripes being recovered and to prevent 1072 other clients from hitting the same error condition. In these cases, 1073 the server MUST complete recovery before handing out any new layouts 1074 to the affected byte ranges. 1076 The client SHOULD attempt to compensate for the error before giving 1077 up and reflecting an error to the application. The first step in 1078 error recovery is to return the layout with LAYOUTRETURN and the 1079 associated error information.
The second step is to request a new 1080 layout using LAYOUTGET and then retry the I/O operation with the new 1081 layout. Finally, if the error persists, the client may choose to 1082 retry the I/O operation using regular NFS READ or WRITE operations 1083 via the metadata server. 1085 8. Object-Based Layout Return 1087 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1088 layout-type-specific information to the server. It is defined in the 1089 NFSv4.1 [4] as follows: 1091 struct layoutreturn_file4 { 1092 offset4 lrf_offset; 1093 length4 lrf_length; 1094 stateid4 lrf_stateid; 1095 /* layouttype4 specific data */ 1096 opaque lrf_body<>; 1097 }; 1099 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 1100 case LAYOUTRETURN4_FILE: 1101 layoutreturn_file4 lr_layout; 1102 default: 1103 void; 1104 }; 1106 struct LAYOUTRETURN4args { 1107 /* CURRENT_FH: file */ 1108 bool lora_reclaim; 1110 layouttype4 lora_layout_type; 1111 layoutiomode4 lora_iomode; 1112 layoutreturn4 lora_layoutreturn; 1113 }; 1115 If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the 1116 lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type. 1118 The pnfs_osd_layoutreturn4 type allows the client to report I/O error 1119 information back to the metadata server as defined below. 1121 8.1. pnfs_osd_errno4 1123 /// enum pnfs_osd_errno4 { 1124 /// PNFS_OSD_ERR_EIO = 1, 1125 /// PNFS_OSD_ERR_NOT_FOUND = 2, 1126 /// PNFS_OSD_ERR_NO_SPACE = 3, 1127 /// PNFS_OSD_ERR_BAD_CRED = 4, 1128 /// PNFS_OSD_ERR_NO_ACCESS = 5, 1129 /// PNFS_OSD_ERR_UNREACHABLE = 6, 1130 /// PNFS_OSD_ERR_RESOURCE = 7 1131 /// }; 1132 /// 1134 pnfs_osd_errno4 is used to represent error types when read/write 1135 errors are reported to the metadata server. The error codes serve as 1136 hints to the metadata server that may help it in diagnosing the exact 1137 reason for the error and in repairing it.
1139 o PNFS_OSD_ERR_EIO indicates the operation failed because the object 1140 storage device experienced a failure trying to access the object. 1141 The most common source of these errors is media errors, but other 1142 internal errors might cause this as well. In this case, the 1143 metadata server should go examine the broken object more closely; 1144 hence, it should be used as the default error code. 1146 o PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object 1147 that does not exist on the object storage device. 1149 o PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the 1150 object storage device ran out of free capacity during the 1151 operation. 1153 o PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not 1154 valid. The primary cause of this is that the capability has 1155 expired, or the access policy tag (a.k.a., capability version 1156 number) has been changed to revoke capabilities. The client will 1157 need to return the layout and get a new one with fresh 1158 capabilities. 1160 o PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the 1161 requested operation. This should not occur in normal operation 1162 because the metadata server should give out correct capabilities, 1163 or none at all. 1165 o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the 1166 I/O operation at the object storage device due to a communication 1167 failure. Whether or not the I/O operation was executed by the OSD 1168 is undetermined. 1170 o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O 1171 operation due to a local problem on the initiator (i.e., client) 1172 side, e.g., when running out of memory. The client MUST guarantee 1173 that the OSD command was never dispatched to the OSD. 1175 8.2. 
pnfs_osd_ioerr4 1177 /// struct pnfs_osd_ioerr4 { 1178 /// pnfs_osd_objid4 oer_component; 1179 /// length4 oer_comp_offset; 1180 /// length4 oer_comp_length; 1181 /// bool oer_iswrite; 1182 /// pnfs_osd_errno4 oer_errno; 1183 /// }; 1184 /// 1185 The pnfs_osd_ioerr4 structure is used to return error indications for 1186 objects that generated errors during data transfers. These are hints 1187 to the metadata server that there are problems with that object. For 1188 each error, "oer_component", "oer_comp_offset", and "oer_comp_length" 1189 represent the object and byte range within the component object in 1190 which the error occurred; "oer_iswrite" is set to "true" if the 1191 failed OSD operation was data modifying, and "oer_errno" represents 1192 the type of error. 1194 Component byte ranges in the optional pnfs_osd_ioerr4 structure are 1195 used for recovering the object and MUST be set by the client to cover 1196 all failed I/O operations to the component. 1198 8.3. pnfs_osd_layoutreturn4 1200 /// struct pnfs_osd_layoutreturn4 { 1201 /// pnfs_osd_ioerr4 olr_ioerr_report<>; 1202 /// }; 1203 /// 1205 When OSD I/O operations failed, "olr_ioerr_report<>" is used to 1206 report these errors to the metadata server as an array of elements of 1207 type pnfs_osd_ioerr4. Each element in the array represents an error 1208 that occurred on the object specified by oer_component. If no errors 1209 are to be reported, the size of the olr_ioerr_report<> array is set 1210 to zero. 1212 9. Object-Based Creation Layout Hint 1214 The layouthint4 type is defined in the NFSv4.1 [4] as follows: 1216 struct layouthint4 { 1217 layouttype4 loh_type; 1218 opaque loh_body<>; 1219 }; 1221 The layouthint4 structure is used by the client to pass a hint about 1222 the type of layout it would like created for a particular file. If 1223 the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body 1224 opaque value is defined by the pnfs_osd_layouthint4 type. 1226 9.1. 
pnfs_osd_layouthint4 1228 /// union pnfs_osd_max_comps_hint4 switch (bool omx_valid) { 1229 /// case TRUE: 1230 /// uint32_t omx_max_comps; 1231 /// case FALSE: 1232 /// void; 1233 /// }; 1234 /// 1235 /// union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) { 1236 /// case TRUE: 1237 /// length4 osu_stripe_unit; 1238 /// case FALSE: 1239 /// void; 1240 /// }; 1241 /// 1242 /// union pnfs_osd_group_width_hint4 switch (bool ogw_valid) { 1243 /// case TRUE: 1244 /// uint32_t ogw_group_width; 1245 /// case FALSE: 1246 /// void; 1247 /// }; 1248 /// 1249 /// union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) { 1250 /// case TRUE: 1251 /// uint32_t ogd_group_depth; 1252 /// case FALSE: 1253 /// void; 1254 /// }; 1255 /// 1256 /// union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) { 1257 /// case TRUE: 1258 /// uint32_t omc_mirror_cnt; 1259 /// case FALSE: 1260 /// void; 1261 /// }; 1262 /// 1263 /// union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) { 1264 /// case TRUE: 1265 /// pnfs_osd_raid_algorithm4 ora_raid_algorithm; 1266 /// case FALSE: 1267 /// void; 1268 /// }; 1269 /// 1270 /// struct pnfs_osd_layouthint4 { 1271 /// pnfs_osd_max_comps_hint4 olh_max_comps_hint; 1272 /// pnfs_osd_stripe_unit_hint4 olh_stripe_unit_hint; 1273 /// pnfs_osd_group_width_hint4 olh_group_width_hint; 1274 /// pnfs_osd_group_depth_hint4 olh_group_depth_hint; 1275 /// pnfs_osd_mirror_cnt_hint4 olh_mirror_cnt_hint; 1276 /// pnfs_osd_raid_algorithm_hint4 olh_raid_algorithm_hint; 1277 /// }; 1278 /// 1279 This type conveys hints for the desired data map. All parameters are 1280 optional so the client can give values for only the parameters it 1281 cares about, e.g. it can provide a hint for the desired number of 1282 mirrored components, regardless of the RAID algorithm selected for 1283 the file. 
The server should make an attempt to honor the hints, but 1284 it can ignore any or all of them at its own discretion and without 1285 failing the respective CREATE operation. 1287 The "olh_max_comps_hint" can be used to limit the total number of 1288 component objects comprising the file. All other hints correspond 1289 directly to the different fields of pnfs_osd_data_map4. 1291 10. Layout Segments 1293 The pNFS layout operations operate on logical byte ranges. There is 1294 no requirement in the protocol for any relationship between byte 1295 ranges used in LAYOUTGET to acquire layouts and byte ranges used in 1296 CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD 1297 byte-range capabilities poses limitations on these operations since 1298 the capabilities associated with layout segments cannot be merged or 1299 split. The following guidelines should be followed for proper 1300 operation of object-based layouts. 1302 10.1. CB_LAYOUTRECALL and LAYOUTRETURN 1304 In general, the object-based layout driver should keep track of each 1305 layout segment it acquires, keeping a record of the segment's iomode, 1306 offset, and length. The server should allow the client to get 1307 multiple overlapping layout segments but is free to recall the layout 1308 to prevent overlap. 1310 In response to CB_LAYOUTRECALL, the client should return all layout 1311 segments matching the given iomode and overlapping with the recalled 1312 range. When returning the layouts for this byte range with 1313 LAYOUTRETURN, the client MUST NOT return a sub-range of a layout 1314 segment it has; each LAYOUTRETURN sent MUST completely cover at least 1315 one outstanding layout segment. 1317 The server, in turn, should release any segment that exactly matches 1318 the clientid, iomode, and byte range given in LAYOUTRETURN.
If no 1319 exact match is found, then the server should release all layout 1320 segments matching the clientid and iomode and that are fully 1321 contained in the returned byte range. If none are found and the byte 1322 range is a subset of an outstanding layout segment for the same 1323 clientid and iomode, then the client can be considered malfunctioning 1324 and the server SHOULD recall all layouts from this client to reset 1325 its state. If this behavior repeats, the server SHOULD deny all 1326 LAYOUTGETs from this client. 1328 10.2. LAYOUTCOMMIT 1330 LAYOUTCOMMIT is only used by object-based pNFS to convey modified 1331 attribute hints and/or to report the presence of I/O errors to the 1332 metadata server (MDS). Therefore, the offset and length in 1333 LAYOUTCOMMIT4args are reserved for future use and should be set to 0. 1335 11. Recalling Layouts 1337 The object-based metadata server should recall outstanding layouts in 1338 the following cases: 1340 o When the file's security policy changes, i.e., Access Control 1341 Lists (ACLs) or permission mode bits are set. 1343 o When the file's aggregation map changes, rendering outstanding 1344 layouts invalid. 1346 o When there are sharing conflicts. For example, the server will 1347 issue stripe-aligned layout segments for RAID-5 objects. To 1348 prevent corruption of the file's parity, multiple clients must not 1349 hold valid write layouts for the same stripes. An outstanding 1350 READ/WRITE (RW) layout should be recalled when a conflicting 1351 LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW 1352 and for a byte range overlapping with the outstanding layout 1353 segment. 1355 11.1. CB_RECALL_ANY 1357 The metadata server can use the CB_RECALL_ANY callback operation to 1358 notify the client to return some or all of its layouts.
The NFSv4.1 1359 [4] defines the following types: 1361 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 1362 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 1364 struct CB_RECALL_ANY4args { 1365 uint32_t craa_objects_to_keep; 1366 bitmap4 craa_type_mask; 1367 }; 1369 Typically, CB_RECALL_ANY will be used to recall client state when the 1370 server needs to reclaim resources. The craa_type_mask bitmap 1371 specifies the type of resources that are recalled and the 1372 craa_objects_to_keep value specifies how many of the recalled objects 1373 the client is allowed to keep. The object-based layout type mask 1374 flags are defined as follows. They represent the iomode of the 1375 recalled layouts. In response, the client SHOULD return layouts of 1376 the recalled iomode that it needs the least, keeping at most 1377 craa_objects_to_keep object-based layouts. 1379 /// enum pnfs_osd_cb_recall_any_mask { 1380 /// PNFS_OSD_RCA4_TYPE_MASK_READ = 8, 1381 /// PNFS_OSD_RCA4_TYPE_MASK_RW = 9 1382 /// }; 1383 /// 1385 The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return 1386 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1387 PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1388 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1389 is notified to return layouts of either iomode. 1391 12. Client Fencing 1393 In cases where clients are uncommunicative and their lease has 1394 expired or when clients fail to return recalled layouts within at least a 1395 lease period (see "Recalling a Layout" [4]), the server 1396 MAY revoke client layouts and/or device address mappings and reassign 1397 these resources to other clients. To avoid data corruption, the 1398 metadata server MUST fence off the revoked clients from the 1399 respective objects as described in Section 13.4. 1401 13.
Security Considerations 1403 The pNFS extension partitions the NFSv4 file system protocol into two 1404 parts, the control path and the data path (storage protocol). The 1405 control path contains all the new operations described by this 1406 extension; all existing NFSv4 security mechanisms and features apply 1407 to the control path. The combination of components in a pNFS system 1408 is required to preserve the security properties of NFSv4 with respect 1409 to an entity accessing data via a client, including security 1410 countermeasures to defend against threats that NFSv4 provides 1411 defenses for in environments where these threats are considered 1412 significant. 1414 The metadata server enforces the file access-control policy at 1415 LAYOUTGET time. The client should use suitable authorization 1416 credentials for getting the layout for the requested iomode (READ or 1417 RW) and the server verifies the permissions and ACL for these 1418 credentials, possibly returning NFS4ERR_ACCESS if the client is not 1419 allowed the requested iomode. If the LAYOUTGET operation succeeds, 1420 the client receives, as part of the layout, a set of object 1421 capabilities allowing it I/O access to the specified objects 1422 corresponding to the requested iomode. When the client acts on I/O 1423 operations on behalf of its local users, it MUST authenticate and 1424 authorize the user by issuing respective OPEN and ACCESS calls to the 1425 metadata server, similar to having NFSv4 data delegations. If access 1426 is allowed, the client uses the corresponding (READ or RW) 1427 capabilities to perform the I/O operations at the object storage 1428 devices.
When the metadata server receives a request to change a 1429 file's permissions or ACL, it SHOULD recall all layouts for that file 1430 and it MUST change the capability version attribute on all objects 1431 comprising the file to implicitly invalidate any outstanding 1432 capabilities before committing to the new permissions and ACL. Doing 1433 this will ensure that clients re-authorize their layouts according to 1434 the modified permissions and ACL by requesting new layouts. 1435 Recalling the layouts in this case is a courtesy of the server, 1436 intended to prevent clients from getting an error on I/Os done after 1437 the capability version changed. 1439 The object storage protocol MUST implement the security aspects 1440 described in version 1 of the T10 OSD protocol definition [1]. The 1441 standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and 1442 ALLDATA. To provide a minimum level of security, allowing verification 1443 and enforcement of the server's access control policy using the layout 1444 security credentials, the NOSEC security method MUST NOT be used for 1445 any I/O operation. The remainder of this section gives an overview 1446 of the security mechanism described in that standard. The goal is to 1447 give the reader a basic understanding of the object security model. 1448 Any discrepancies between this text and the actual standard are 1449 to be resolved in favor of the OSD standard. 1451 13.1. OSD Security Data Types 1453 There are three main data types associated with object security: a 1454 capability, a credential, and security parameters. The capability is 1455 a set of fields that specifies an object and what operations can be 1456 performed on it. A credential is a signed capability. Only a 1457 security manager that knows the secret device keys can correctly sign 1458 a capability to form a valid credential.
In pNFS, the file server 1459 acts as the security manager and returns signed capabilities (i.e., 1460 credentials) to the pNFS client. The security parameters are values 1461 computed by the issuer of OSD commands (i.e., the client) that prove 1462 they hold valid credentials. The client uses the credential as a 1463 signing key to sign the requests it makes to OSD, and puts the 1464 resulting signatures into the security_parameters field of the OSD 1465 command. The object storage device uses the secret keys it shares 1466 with the security manager to validate the signature values in the 1467 security parameters. 1469 The security types are opaque to the generic layers of the pNFS 1470 client. The credential contents are defined as opaque within the 1471 pnfs_osd_object_cred4 type. Instead of repeating the definitions 1472 here, the reader is referred to Section 4.9.2.2 of the OSD standard. 1474 13.2. The OSD Security Protocol 1476 The object storage protocol relies on a cryptographically secure 1477 capability to control accesses at the object storage devices. 1478 Capabilities are generated by the metadata server, returned to the 1479 client, and used by the client as described below to authenticate 1480 their requests to the object-based storage device. Capabilities 1481 therefore achieve the required access and open mode checking. They 1482 allow the file server to define and check a policy (e.g., open mode) 1483 and the OSD to enforce that policy without knowing the details (e.g., 1484 user IDs and ACLs). 1486 Since capabilities are tied to layouts, and since they are used to 1487 enforce access control, when the file ACL or mode changes the 1488 outstanding capabilities MUST be revoked to enforce the new access 1489 permissions. The server SHOULD recall layouts to allow clients to 1490 gracefully return their capabilities before the access permissions 1491 change. 
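The signing and validation flow described above can be sketched in runnable form. This is an illustrative model only: HMAC-SHA256 stands in for the integrity-check algorithm, and the function names and field encodings are hypothetical; the actual algorithm, key hierarchy, and encodings are defined by the T10 OSD standard.

```python
# Sketch of the three roles above: the metadata server (security
# manager) mints a capability key, the client signs each request with
# it, and the OSD validates using the SecretKey it shares with the
# security manager.  HMAC-SHA256 is a stand-in for the real algorithm.

import hashlib
import hmac

def mac(key: bytes, *fields: bytes) -> bytes:
    """Keyed integrity check over the concatenated fields."""
    return hmac.new(key, b"".join(fields), hashlib.sha256).digest()

def issue_credential(secret_key: bytes, cap: bytes, system_id: bytes) -> bytes:
    """Security manager: compute the CapKey returned via LAYOUTGET."""
    return mac(secret_key, cap, system_id)

def sign_request(cap_key: bytes, req: bytes, nonce: bytes) -> bytes:
    """Client: compute the ReqMAC placed in the security parameters."""
    return mac(cap_key, req, nonce)

def osd_validate(secret_key: bytes, cap: bytes, system_id: bytes,
                 req: bytes, nonce: bytes, req_mac: bytes) -> bool:
    """OSD: recompute the CapKey and ReqMAC and compare."""
    local_cap_key = mac(secret_key, cap, system_id)
    local_req_mac = mac(local_cap_key, req, nonce)
    return hmac.compare_digest(req_mac, local_req_mac)
```

Because the client holds only the CapKey and never the SecretKey, it can sign requests covered by its capability but cannot mint credentials for other objects or operations.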
1493 Each capability is specific to a particular object, an operation on 1494 that object, a byte range within the object (in OSDv2), and has an 1495 explicit expiration time. The capabilities are signed with a secret 1496 key that is shared by the object storage devices and the metadata 1497 managers. Clients do not have device keys so they are unable to 1498 forge the signatures in the security parameters. The combination of 1499 a capability, the OSD System ID, and a signature is called a 1500 "credential" in the OSD specification. 1502 The details of the security and privacy model for object storage are 1503 defined in the T10 OSD standard. The following sketch of the 1504 algorithm should help the reader understand the basic model. 1506 LAYOUTGET returns a CapKey and a Cap, which, together with the OSD 1507 SystemID, are also called a credential. It is a capability and a 1508 signature over that capability and the SystemID. The OSD Standard 1509 refers to the CapKey as the "Credential integrity check value" and to 1510 the ReqMAC as the "Request integrity check value". 1512 CapKey = MAC<SecretKey>(Cap, SystemID) 1513 Credential = {Cap, SystemID, CapKey} 1515 The client uses CapKey to sign all the requests it issues for that 1516 object using the respective Cap. In other words, the Cap appears in 1517 the request to the storage device, and that request is signed with 1518 the CapKey as follows: 1520 ReqMAC = MAC<CapKey>(Req, ReqNonce) 1521 Request = {Cap, Req, ReqNonce, ReqMAC} 1523 The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The 1524 OSD uses the SecretKey it shares with the metadata server to compare 1525 the ReqMAC the client sent with a locally computed value: 1527 LocalCapKey = MAC<SecretKey>(Cap, SystemID) 1528 LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce) 1530 and if they match, the OSD assumes that the capabilities came from an 1531 authentic metadata server and allows access to the object, as allowed 1532 by the Cap. 1534 13.3.
Protocol Privacy Requirements 1536 Note that if the server's LAYOUTGET reply, holding the CapKey and Cap, is 1537 snooped by another client, it can be used to generate valid OSD 1538 requests (within the Cap's access restrictions). 1540 To meet the privacy requirements for the capability key 1541 returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., 1542 by using the RPCSEC_GSS privacy method to send the LAYOUTGET 1543 operation or by using the SSV key to encrypt the oc_capability_key 1544 using the GSS_Wrap() function. Two general ways to provide privacy 1545 in the absence of GSS-API that are independent of NFSv4 are either an 1546 isolated network such as a VLAN or a secure channel provided by IPsec 1547 [15]. 1549 13.4. Revoking Capabilities 1551 At any time, the metadata server may invalidate all outstanding 1552 capabilities on an object by changing its POLICY ACCESS TAG 1553 attribute. The value of the POLICY ACCESS TAG is part of a 1554 capability, and it must match the state of the object attribute. If 1555 they do not match, the OSD rejects accesses to the object with the 1556 sense key set to ILLEGAL REQUEST and an additional sense code set to 1557 INVALID FIELD IN CDB. When a client attempts to use a capability and 1558 is rejected this way, it should issue a LAYOUTCOMMIT for the object 1559 and specify PNFS_OSD_BAD_CRED in the olr_ioerr_report parameter. The 1560 client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or 1561 LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed 1562 set of capabilities. 1564 The metadata server may elect to change the access policy tag on an 1565 object at any time, for any reason (with the understanding that there 1566 is likely an associated performance penalty, especially if there are 1567 outstanding layouts for this object).
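The revocation mechanism above can be modeled as follows. This is an illustrative sketch, not protocol text: the class and function names are hypothetical, and the integer tag merely stands in for the OSD attribute.

```python
# Hypothetical model of POLICY ACCESS TAG revocation as described
# above: the OSD honors a capability only while the tag embedded in
# the capability matches the object's current POLICY ACCESS TAG.

class OsdObject:
    def __init__(self, policy_access_tag: int):
        self.policy_access_tag = policy_access_tag

def change_policy_tag(obj: OsdObject) -> None:
    # Any change to the attribute implicitly revokes every outstanding
    # capability minted against the old tag value.
    obj.policy_access_tag += 1

def osd_check(obj: OsdObject, cap_tag: int) -> bool:
    # A mismatch corresponds to the OSD rejecting the access with sense
    # key ILLEGAL REQUEST and additional sense code INVALID FIELD IN CDB.
    return cap_tag == obj.policy_access_tag
```

A single attribute change thus invalidates all capabilities at once; clients holding stale capabilities must go back to the metadata server for fresh ones.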
The metadata server MUST 1568 revoke outstanding capabilities when any one of the following occurs: 1570 o the permissions on the object change, 1572 o a conflicting mandatory byte-range lock is granted, or 1574 o a layout is revoked and reassigned to another client. 1576 A pNFS client will typically hold one layout for each byte range for 1577 either READ or READ/WRITE. The client's credentials are checked by 1578 the metadata server at LAYOUTGET time and it is the client's 1579 responsibility to enforce access control among multiple users 1580 accessing the same file. It is neither required nor expected that 1581 the pNFS client will obtain a separate layout for each user accessing 1582 a shared object. The client SHOULD use OPEN and ACCESS calls to 1583 check user permissions when performing I/O so that the server's 1584 access control policies are correctly enforced. The result of the 1585 ACCESS operation may be cached while the client holds a valid layout 1586 as the server is expected to recall layouts when the file's access 1587 permissions or ACL change. 1589 14. IANA Considerations 1591 As described in NFSv4.1 [4], new layout type numbers have been 1592 assigned by IANA. This document defines the protocol associated with 1593 the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it 1594 requires no further actions for IANA. 1596 15. References 1598 15.1. Normative References 1600 [1] Weber, R., "Information Technology - SCSI Object-Based 1601 Storage Device Commands (OSD)", ANSI INCITS 400-2004, 1602 December 2004. 1604 [2] Bradner, S., "Key words for use in RFCs to Indicate 1605 Requirement Levels", BCP 14, RFC 2119, March 1997. 1607 [3] IETF Trust, "Legal Provisions Relating to IETF Documents", 1608 November 2008, . 1611 [4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1612 "Network File System (NFS) Version 4 Minor Version 1 1613 Protocol", RFC 5661, January 2010. 1615 [5] Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed., 1616 "Network File System (NFS) Version 4 Minor Version 1 1617 External Data Representation Standard (XDR) Description", 1618 RFC 5662, January 2010. 1620 [6] Eisler, M., "XDR: External Data Representation Standard", 1621 STD 67, RFC 4506, May 2006. 1623 [7] Linn, J., "Generic Security Service Application Program 1624 Interface Version 2, Update 1", RFC 2743, January 2000. 1626 [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., 1627 and E. Zeidner, "Internet Small Computer Systems Interface 1628 (iSCSI)", RFC 3720, April 2004. 1630 [9] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network 1631 Address Authority (NAA) Naming Format for iSCSI Node 1632 Names", RFC 3980, February 2005. 1634 [10] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI 1635 INCITS 408-2005, October 2005. 1637 [11] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI 1638 INCITS 402-2005, February 2005. 1640 [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and 1641 J. Souza, "Internet Storage Name Service (iSNS)", RFC 1642 4171, September 2005. 1644 [13] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) 1645 Registration Authority", . 1648 15.2. Informative References 1650 [14] Weber, R., "SCSI Object-Based Storage Device Commands -2 1651 (OSD-2)", January 2009, 1652 . 1654 [15] Kent, S. and K. Seo, "Security Architecture for the 1655 Internet Protocol", RFC 4301, December 2005. 1657 [16] MacWilliams, F. and N. Sloane, "The Theory of Error- 1658 Correcting Codes, Part I", 1977. 1660 [17] Anvin, H., "The Mathematics of RAID-6", May 2009, 1661 . 1663 [18] Wikipedia, "Least common 1664 multiple", April 2011, 1665 . 1667 [19] Plank, J., Luo, J., Schuman, C., 1668 Xu, L., and Z. Wilcox-O'Hearn, "A 1669 Performance Evaluation and Examination of Open-source 1670 Erasure Coding Libraries for Storage", 2007.
1672 [20] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 1673 365-2002, December 2002. 1675 [21] T11 1619-D, "Fibre Channel Framing and Signaling - 2 (FC- 1676 FS-2)", ANSI INCITS 424-2007, February 2007. 1678 [22] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI 1679 INCITS 417-2006, June 2006. 1681 Appendix A. Acknowledgments 1683 Todd Pisek was a co-editor of the initial versions of this document. 1684 Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian 1685 E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and 1686 commented on this document. 1688 Authors' Addresses 1690 Benny Halevy 1691 Primary Data 1693 Email: bhalevy@primarydata.com 1694 URI: http://www.primarydata.com/ 1696 Boaz Harrosh 1697 Panasas, Inc. 1698 1501 Reedsdale St. Suite 400 1699 Pittsburgh, PA 15233 1700 USA 1702 Phone: +1-412-323-3500 1703 Email: bharrosh@panasas.com 1704 URI: http://www.panasas.com/ 1705 Brent Welch 1706 Panasas, Inc. 1707 969 W. Maude Ave 1708 Sunnyvale, CA 94095 1709 USA 1711 Phone: +1-408-215-6715 1712 Email: welch@acm.org 1713 URI: http://www.panasas.com/ 1715 Brian Mueller 1716 Panasas, Inc. 1718 Email: bmueller@panasas.com 1719 URI: http://www.panasas.com/