2 NFSv4 B. Halevy 3 Internet-Draft Tonian 4 Intended status: Standards Track August 31, 2012 5 Expires: March 4, 2013 7 Object-Based Parallel NFS (pNFS) Operations - Version 2 8 draft-bhalevy-nfs-obj-00 10 Abstract 12 Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to 13 allow clients to directly access file data on the storage used by the 14 NFSv4 server. This ability to bypass the server for data access can 15 increase both performance and parallelism, but requires additional 16 client functionality for data access, some of which is dependent on 17 the class of storage used, a.k.a. the Layout Type. The main pNFS 18 operations and data types in NFSv4 Minor version 1 specify a 19 layout-type-independent layer; layout-type-specific information is conveyed 20 using opaque data structures whose internal structure is further 21 defined by the particular layout type specification. This document 22 specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to 23 the main NFSv4 Minor version 1 specification. Version 2 of this 24 layout type introduces the use of the Network File System protocol as 25 an object-storage protocol in addition to the T-10 OSD protocol. 27 Status of this Memo 29 This Internet-Draft is submitted to IETF in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current 35 Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on March 4, 2013. 44 Copyright Notice 46 Copyright (c) 2012 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 63 2. XDR Description of the Objects-Based Layout Protocol . . . . . 4 64 2.1. Code Components Licensing Notice . . . . . . . . . . . . . 5 65 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6 66 3.1. pnfs_obj_osd_objid4 . . . . . . . . . . . . . . . . . . . 7 67 3.2. pnfs_obj_nfs_objid4 . . . . . . . . . . . . . . . . . . . 7 68 3.3. pnfs_obj_type4 . . . . . . . . . . . . . . . . . . . . . . 8 69 3.4. pnfs_obj_comp4 . . . . . . . . . . . . . . . . . . . . . . 9 70 3.5. pnfs_obj_raid_algorithm4 . . . . . . . . . . . . . . . . . 10 71 4. Object Storage Device Addressing and Discovery . . . . . . . . 11 72 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 12 73 4.2. pnfs_obj_deviceaddr4 . . . . . . . . . . . . . . . . . . . 12 74 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . 
. . . 13 75 4.2.2. Device Network Address . . . . . . . . . . . . . . . . 14 76 4.2.3. NFS Device Identifier . . . . . . . . . . . . . . . . 15 77 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 15 78 5.1. pnfs_obj_data_map4 . . . . . . . . . . . . . . . . . . . . 16 79 5.2. pnfs_obj_layout4 . . . . . . . . . . . . . . . . . . . . . 17 80 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 17 81 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 18 82 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 19 83 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 21 84 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 22 85 5.4.1. PNFS_OBJ_RAID_0 . . . . . . . . . . . . . . . . . . . 22 86 5.4.2. PNFS_OBJ_RAID_4 . . . . . . . . . . . . . . . . . . . 22 87 5.4.3. PNFS_OBJ_RAID_5 . . . . . . . . . . . . . . . . . . . 23 88 5.4.4. PNFS_OBJ_RAID_PQ . . . . . . . . . . . . . . . . . . . 24 89 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 25 90 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 25 91 6.1. pnfs_obj_deltaspaceused4 . . . . . . . . . . . . . . . . . 26 92 6.2. pnfs_obj_layoutupdate4 . . . . . . . . . . . . . . . . . . 26 93 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 27 94 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 27 95 8.1. pnfs_obj_errno4 . . . . . . . . . . . . . . . . . . . . . 28 96 8.2. pnfs_obj_ioerr4 . . . . . . . . . . . . . . . . . . . . . 30 97 8.3. pnfs_obj_iostats4 . . . . . . . . . . . . . . . . . . . . 30 98 8.4. pnfs_obj_layoutreturn4 . . . . . . . . . . . . . . . . . . 31 99 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 31 100 9.1. pnfs_obj_layouthint4 . . . . . . . . . . . . . . . . . . . 32 101 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 33 102 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 33 103 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . 
. . . . . . . . . . 34 104 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 34 105 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 34 106 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 35 107 13. Security Considerations . . . . . . . . . . . . . . . . . . . 35 108 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 36 109 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 37 110 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 38 111 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 39 112 13.5. Security Considerations over NFS . . . . . . . . . . . . . 39 113 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40 114 15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 115 15.1. Normative References . . . . . . . . . . . . . . . . . . . 40 116 15.2. Informative References . . . . . . . . . . . . . . . . . . 41 117 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 42 118 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 42 120 1. Introduction 122 In pNFS, the file server returns typed layout structures that 123 describe where file data is located. There are different layouts for 124 different storage systems and methods of arranging data on storage 125 devices. This document describes the layouts used with object-based 126 storage devices (OSDs) that are accessed according to the OSD storage 127 protocol standard (ANSI INCITS 400-2004 [1]) or the NFS protocol 128 (RFC1813 [14], RFC3530 [15], RFC5661 [2]) 130 An "object" is a container for data and attributes, and files are 131 stored in one or more objects. The Object Storage protocol specifies 132 several operations on objects, including READ, WRITE, FLUSH, GET 133 ATTRIBUTES, SET ATTRIBUTES, CREATE, and DELETE. 
However, using the 134 object-based layout, the client only uses the READ, WRITE, GET 135 ATTRIBUTES, and FLUSH commands, or in the NFS case, the READ, WRITE, 136 GETATTR, and COMMIT operations. The other commands are only used by 137 the pNFS server. 139 An object-based layout for pNFS includes object identifiers, 140 capabilities that allow clients to READ or WRITE those objects, and 141 various parameters that control how file data is striped across their 142 component objects. The OSD protocol has a capability-based security 143 scheme that allows the pNFS server to control what operations and 144 what objects can be used by clients. 146 With NFS filers used for object storage devices, the object's owner, 147 group owner, and mode are used to implement a security mechanism 148 equivalent to the OSD capability model for the purpose of client 149 fencing. 151 This scheme is described in more detail in the "Security 152 Considerations" section (Section 13). 154 1.1. Requirements Language 156 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 157 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 158 document are to be interpreted as described in RFC 2119 [3]. 160 2. XDR Description of the Objects-Based Layout Protocol 162 This document contains the external data representation (XDR [4]) 163 description of the NFSv4.1 objects layout protocol. The XDR 164 description is embedded in this document in a way that makes it 165 simple for the reader to extract into a ready-to-compile form. The 166 reader can feed this document into the following shell script to 167 produce the machine-readable XDR description of the NFSv4.1 objects 168 layout protocol: 170 #!/bin/sh 171 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
173 That is, if the above script is stored in a file called "extract.sh", 174 and this document is in a file called "spec.txt", then the reader can 175 do: 177 sh extract.sh < spec.txt > pnfs_obj_prot.x 179 The effect of the script is to remove leading white space from each 180 line, plus a sentinel sequence of "///". 182 The embedded XDR file header follows. Subsequent XDR descriptions, 183 with the sentinel sequence, are embedded throughout the document. 185 Note that the XDR code contained in this document depends on types 186 from the NFSv4.1 nfs4_prot.x file ([5]). This includes both NFS 187 types that end with a 4, such as offset4, length4, etc., as well as 188 more generic types such as uint32_t and uint64_t. 190 2.1. Code Components Licensing Notice 192 The XDR description, marked with lines beginning with the sequence 193 "///", as well as scripts for extracting the XDR description, are Code 194 Components as described in Section 4 of "Legal Provisions Relating to 195 IETF Documents" [6]. These Code Components are licensed according to 196 the terms of Section 4 of "Legal Provisions Relating to IETF 197 Documents". 199 /// /* 200 /// * Copyright (c) 2012 IETF Trust and the persons identified 201 /// * as authors of the code. All rights reserved. 202 /// * 203 /// * Redistribution and use in source and binary forms, with 204 /// * or without modification, are permitted provided that the 205 /// * following conditions are met: 206 /// * 207 /// * o Redistributions of source code must retain the above 208 /// * copyright notice, this list of conditions and the 209 /// * following disclaimer. 210 /// * 211 /// * o Redistributions in binary form must reproduce the above 212 /// * copyright notice, this list of conditions and the 213 /// * following disclaimer in the documentation and/or other 214 /// * materials provided with the distribution.
216 /// * 217 /// * o Neither the name of Internet Society, IETF or IETF 218 /// * Trust, nor the names of specific contributors, may be 219 /// * used to endorse or promote products derived from this 220 /// * software without specific prior written permission. 221 /// * 222 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 223 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 224 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 225 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 226 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 227 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 228 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 229 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 230 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 231 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 232 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 233 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 234 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 235 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 236 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 237 /// * 238 /// * This code was derived from draft-bhalevy-nfs-obj-00. 239 [[RFC Editor: please insert RFC number if needed]] 240 /// * Please reproduce this note if possible. 241 /// */ 242 /// 243 /// /* 244 /// * pnfs_obj_prot_v2.x 245 /// */ 246 /// 247 /// /* 248 /// * The following include statements are for example only. 249 /// * The actual XDR definition files are generated separately 250 /// * and independently and are likely to have a different name. 251 /// */ 252 /// %#include 253 /// %#include 254 /// 256 3. Basic Data Type Definitions 258 The following sections define basic data types and constants used by 259 the Object-Based Layout protocol. 261 3.1. pnfs_obj_osd_objid4 263 An object is identified by a number, somewhat like an inode number. 
264 The object storage model has a two-level scheme, where the objects 265 within an object storage device are grouped into partitions. 267 /// struct pnfs_obj_osd_objid4 { 268 /// deviceid4 oid_device_id; 269 /// uint64_t oid_partition_id; 270 /// uint64_t oid_object_id; 271 /// }; 272 /// 274 The pnfs_obj_osd_objid4 type is used to identify an object within a 275 partition on a specified object storage device. "oid_device_id" 276 selects the object storage device from the set of available storage 277 devices. The device is identified with the deviceid4 type, which is 278 an index into addressing information about that device returned by 279 the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data 280 type is defined in NFSv4.1 [2]. The device MUST be identified as an 281 OSD device, represented as oda_obj_type equal to PNFS_OBJ_OSD_V1 or 282 PNFS_OBJ_OSD_V2. Within an OSD, a partition is identified with a 283 64-bit number, "oid_partition_id". Within a partition, an object is 284 identified with a 64-bit number, "oid_object_id". Creation and 285 management of partitions is outside the scope of this document, and 286 is a facility provided by the object-based storage file system. 288 3.2. pnfs_obj_nfs_objid4 290 The NFS equivalent of pnfs_obj_osd_objid4 identifies the object using 291 an NFS filehandle (See RFC1813 [14], RFC3530 [15], or RFC5661 [2]). 293 /// struct pnfs_obj_nfs_objid4 { 294 /// deviceid4 nid_device_id; 295 /// opaque nid_fhandle<>; 296 /// }; 297 /// 299 Similar to pnfs_obj_osd_objid4, "nid_device_id" selects the storage 300 device from the set of available storage devices. However, it MUST 301 refer to a device identified as an NFS device, represented as 302 oda_obj_type equal to PNFS_OBJ_NFS. 304 3.3.
pnfs_obj_type4 306 /// enum pnfs_obj_type4 { 307 /// PNFS_OBJ_MISSING = 0, 308 /// PNFS_OBJ_OSD_V1 = 1, 309 /// PNFS_OBJ_OSD_V2 = 2, 310 /// PNFS_OBJ_NFS = 3 311 /// }; 312 /// 314 pnfs_obj_type4 is used to indicate the object storage protocol type 315 and version or whether an object is missing (i.e., unavailable). 316 Some of the object-based layout-supported RAID algorithms encode 317 redundant information and can compensate for missing components, but 318 the data placement algorithm needs to know what parts are missing. 320 The second generation OSD protocol (SNIA T10/1729-D [16]) has 321 additional proposed features to support more robust error recovery, 322 snapshots, and byte-range capabilities. Therefore, the OSD version 323 is explicitly called out in the information returned in the layout. 324 (This information can also be deduced by looking inside the 325 capability type at the format field, which is the first byte. The 326 format value is 0x1 for an OSD v1 capability. However, it seems most 327 robust to call out the version explicitly.) 329 Support for the NFS protocol as the object storage protocol has been 330 added in version 2 of this layout type. In this model, the NFS 331 filehandle is used to identify the object and the NFS RPC 332 authentication credentials are used to emulate the OSD security 333 model. 335 3.4.
pnfs_obj_comp4 337 /// enum pnfs_obj_osd_cap_key_sec4 { 338 /// PNFS_OBJ_CAP_KEY_SEC_NONE = 0, 339 /// PNFS_OBJ_CAP_KEY_SEC_SSV = 1 340 /// }; 341 /// 342 /// struct pnfs_obj_osd_cred4 { 343 /// pnfs_obj_osd_objid4 ooc_object_id; 344 /// pnfs_obj_osd_cap_key_sec4 ooc_cap_key_sec; 345 /// opaque ooc_capability_key<>; 346 /// opaque ooc_capability<>; 347 /// }; 348 /// 349 /// struct pnfs_obj_nfs_cred4 { 350 /// pnfs_obj_nfs_objid4 onc_object_id; 351 /// opaque_auth onc_auth; 352 /// }; 353 /// 354 /// union pnfs_obj_comp4 switch (pnfs_obj_type4 oc_obj_type) { 355 /// case PNFS_OBJ_MISSING: 356 /// pnfs_obj_osd_objid4 oc_missing_obj_id; 357 /// 358 /// case PNFS_OBJ_OSD_V1: 359 /// case PNFS_OBJ_OSD_V2: 360 /// pnfs_obj_osd_cred4 oc_osd_cred; 361 /// 362 /// case PNFS_OBJ_NFS: 363 /// pnfs_obj_nfs_cred4 oc_nfs_cred; 364 /// }; 365 /// 367 The pnfs_obj_comp4 union is used to identify each component 368 comprising the file. "oc_obj_type" represents the object storage 369 device protocol type and version, or whether that component is 370 unavailable. When oc_obj_type indicates PNFS_OBJ_OSD_V1 or 371 PNFS_OBJ_OSD_V2, the "ooc_object_id" field identifies the component 372 object, the "ooc_capability" and "ooc_capability_key" fields, along 373 with the "ooa_systemid" from the pnfs_obj_deviceaddr4, provide the 374 OSD security credentials needed to access that object. The 375 "ooc_cap_key_sec" value denotes the method used to secure the 376 ooc_capability_key (see Section 13.1 for more details). 378 To comply with the OSD security requirements, the capability key 379 SHOULD be transferred securely to prevent eavesdropping (see 380 Section 13). Therefore, a client SHOULD either issue the LAYOUTGET 381 or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service 382 or previously establish a secret state verifier (SSV) for the 383 sessions via the NFSv4.1 SET_SSV operation. 
The 384 pnfs_obj_osd_cap_key_sec4 type is used to identify the method used by 385 the server to secure the capability key. 387 o PNFS_OBJ_CAP_KEY_SEC_NONE denotes that the ooc_capability_key is 388 not encrypted, in which case the client SHOULD issue the LAYOUTGET 389 or GETDEVICEINFO operations with RPCSEC_GSS with the privacy 390 service, or the NFSv4.1 transport should be secured by using 391 methods that are external to NFSv4.1, such as the use of IPsec [17] 392 for transporting the NFSv4.1 protocol. 394 o PNFS_OBJ_CAP_KEY_SEC_SSV denotes that the ooc_capability_key 395 contents are encrypted using the SSV GSS context and the 396 capability key as inputs to the GSS_Wrap() function (see GSS-API 397 [7]) with the conf_req_flag set to TRUE. The client MUST use the 398 secret SSV key as part of the client's GSS context to decrypt the 399 capability key using the value of the ooc_capability_key field as 400 the input_message to the GSS_unwrap() function. Note that to 401 prevent eavesdropping of the SSV key, the client SHOULD issue 402 SET_SSV via RPCSEC_GSS with the privacy service. 404 The actual method chosen depends on whether the client established an 405 SSV key with the server and whether it issued the operation with the 406 RPCSEC_GSS privacy method. Naturally, if the client did not 407 establish an SSV key via SET_SSV, the server MUST use the 408 PNFS_OBJ_CAP_KEY_SEC_NONE method. Otherwise, if the operation was 409 not issued with the RPCSEC_GSS privacy method, the server SHOULD 410 secure the ooc_capability_key with the PNFS_OBJ_CAP_KEY_SEC_SSV 411 method. The server MAY use the PNFS_OBJ_CAP_KEY_SEC_SSV method also 412 when the operation was issued with the RPCSEC_GSS privacy method.
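The MUST/SHOULD rules above reduce to a small decision table. The following Python sketch models the server's choice; the function and argument names are illustrative assumptions, not part of the protocol:

```python
# Illustrative model of the server-side choice between the two
# pnfs_obj_osd_cap_key_sec4 methods, following the rules in the text:
#   - no SSV established       -> MUST use PNFS_OBJ_CAP_KEY_SEC_NONE
#   - SSV, no RPCSEC_GSS privacy -> SHOULD use PNFS_OBJ_CAP_KEY_SEC_SSV
#   - SSV and privacy          -> MAY use either; this sketch picks SSV

PNFS_OBJ_CAP_KEY_SEC_NONE = 0
PNFS_OBJ_CAP_KEY_SEC_SSV = 1

def choose_cap_key_sec(ssv_established: bool, rpcsec_gss_privacy: bool) -> int:
    """Pick how ooc_capability_key is secured in a LAYOUTGET/GETDEVICEINFO reply."""
    if not ssv_established:
        # No SET_SSV was issued: the key goes out unencrypted, relying on
        # RPCSEC_GSS privacy or an external mechanism such as IPsec.
        return PNFS_OBJ_CAP_KEY_SEC_NONE
    if not rpcsec_gss_privacy:
        # SSV exists but the request is not privacy-protected: the server
        # SHOULD GSS_Wrap() the capability key under the SSV GSS context.
        return PNFS_OBJ_CAP_KEY_SEC_SSV
    # Both protections are available; the SSV method is still permitted.
    return PNFS_OBJ_CAP_KEY_SEC_SSV
```

Note that the only case forcing PNFS_OBJ_CAP_KEY_SEC_NONE is the absence of an established SSV key.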
414 When oc_obj_type represents PNFS_OBJ_NFS as the storage protocol, the 415 "nid_device_id" field of "onc_object_id" identifies the NFS server holding the object, 416 "nid_fhandle" provides the opaque NFS filehandle identifying the 417 object, and "onc_auth" provides the RPC credentials to be used for 418 accessing the object, encoded as struct opaque_auth (See RFC5531 419 [18]). 421 3.5. pnfs_obj_raid_algorithm4 423 /// enum pnfs_obj_raid_algorithm4 { 424 /// PNFS_OBJ_RAID_0 = 1, 425 /// PNFS_OBJ_RAID_4 = 2, 426 /// PNFS_OBJ_RAID_5 = 3, 427 /// PNFS_OBJ_RAID_PQ = 4 /* Reed-Solomon P+Q */ 428 /// }; 429 /// 430 pnfs_obj_raid_algorithm4 represents the data redundancy algorithm 431 used to protect the file's contents. See Section 5.4 for more 432 details. 434 4. Object Storage Device Addressing and Discovery 436 Data operations to an OSD require the client to know the "address" of 437 each OSD's root object. The root object is synonymous with the Small 438 Computer System Interface (SCSI) logical unit. The client specifies 439 SCSI logical units to its SCSI protocol stack using a representation 440 local to the client. Because these representations are local, 441 GETDEVICEINFO must return information that can be used by the client 442 to select the correct local representation. 444 In the block world, a set offset (logical block number or 445 track/sector) contains a disk label. This label identifies the disk 446 uniquely. In contrast, an OSD has a standard set of attributes on 447 its root object. For device identification purposes, the OSD System 448 ID (root information attribute number 3) and the OSD Name (root 449 information attribute number 9) are used as the label. These appear 450 in the pnfs_obj_deviceaddr4 type below under the "ooa_systemid" and 451 "ooa_osdname" fields. 453 In some situations, SCSI target discovery may need to be driven based 454 on information contained in the GETDEVICEINFO response.
One example 455 of this is Internet SCSI (iSCSI) targets that are not known to the 456 client until a layout has been requested. The information provided 457 as the "ooa_targetid", "ooa_netaddrs", and "ooa_lun" fields in the 458 pnfs_obj_osd_addr4 type described below (see Section 4.2) allows the 459 client to probe a specific device given its network address and 460 optionally its iSCSI Name (see iSCSI [8]), or when the device network 461 address is omitted, allows it to discover the object storage device 462 using the provided device name or SCSI Device Identifier (see SPC-3 463 [9]). 465 The ooa_systemid is implicitly used by the client, by using the 466 object credential signing key to sign each request with the request 467 integrity check value. This method protects the client from 468 unintentionally accessing a device if the device address mapping was 469 changed (or revoked). The server computes the capability key using 470 its own view of the systemid associated with the respective deviceid 471 present in the credential. If the client's view of the deviceid 472 mapping is stale, the client will use the wrong systemid (which must 473 be system-wide unique) and the I/O request to the OSD will fail to 474 pass the integrity check verification. 476 To recover from this condition, the client should report the error and 477 return the layout using LAYOUTRETURN, and invalidate all the device 478 address mappings associated with this layout. The client can then 479 ask for a new layout, if it wishes, using LAYOUTGET and resolve the 480 referenced deviceids using GETDEVICEINFO or GETDEVICELIST. 482 The server MUST provide the ooa_systemid and SHOULD also provide the 483 ooa_osdname. When the OSD name is present, the client SHOULD get the 484 root information attributes whenever it establishes communication 485 with the OSD and verify that the OSD name it got from the OSD matches 486 the one sent by the metadata server.
To do so, the client uses the 487 ooa_root_obj_cred credentials. 489 For Network Attached Devices, the server MUST provide either the 490 ona_netaddrs network address(es) or ona_fqdn to identify the device. 492 4.1. pnfs_osd_targetid_type4 494 The following enum specifies the manner in which a SCSI target can be 495 specified. The target can be specified as a SCSI Name or as a SCSI 496 Device Identifier. 498 /// enum pnfs_osd_targetid_type4 { 499 /// OBJ_TARGET_ANON = 1, 500 /// OBJ_TARGET_SCSI_NAME = 2, 501 /// OBJ_TARGET_SCSI_DEVICE_ID = 3 502 /// }; 503 /// 505 4.2. pnfs_obj_deviceaddr4 507 The "pnfs_obj_deviceaddr4" data structure is returned by the server 508 as the storage-protocol-specific opaque field da_addr_body in the 509 "device_addr4" structure by a successful GETDEVICEINFO operation 510 (see NFSv4.1 [2]). 512 The specification for an object device address is as follows: 514 /// union pnfs_obj_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { 515 /// case OBJ_TARGET_SCSI_NAME: 516 /// string oti_scsi_name<>; 517 /// 518 /// case OBJ_TARGET_SCSI_DEVICE_ID: 519 /// opaque oti_scsi_device_id<>; 520 /// 521 /// default: 522 /// void; 523 /// }; 524 /// 525 /// struct pnfs_obj_osd_addr4 { 526 /// pnfs_obj_targetid4 ooa_targetid; 527 /// netaddr4 ooa_netaddrs<>; 528 /// opaque ooa_lun[8]; 529 /// opaque ooa_systemid<>; 530 /// pnfs_obj_osd_cred4 ooa_root_obj_cred; 531 /// opaque ooa_osdname<>; 532 /// }; 533 /// 534 /// struct pnfs_obj_nfs_addr4 { 535 /// uint32_t ona_version; 536 /// uint32_t ona_minorversion; 537 /// netaddr4 ona_netaddrs<>; 538 /// opaque ona_fqdn<>; 539 /// opaque ona_path<>; 540 /// }; 541 /// 542 /// union pnfs_obj_deviceaddr4 switch (pnfs_obj_type4 oda_obj_type) { 543 /// case PNFS_OBJ_OSD_V1: 544 /// case PNFS_OBJ_OSD_V2: 545 /// pnfs_obj_osd_addr4 oda_osd_addr; 546 /// 547 /// case PNFS_OBJ_NFS: 548 /// pnfs_obj_nfs_addr4 oda_nfs_addr; 549 /// }; 550 /// 552 4.2.1.
SCSI Target Identifier 554 When "ooa_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the 555 "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as 556 specified in iSCSI [8] and [10]. Note that the specification of the 557 oti_scsi_name string format is outside the scope of this document. 558 Parsing the string is based on the string prefix, e.g., "iqn.", 559 "eui.", or "naa.", and more formats MAY be specified in the future in 560 accordance with iSCSI Names properties. 562 Currently, the iSCSI Name provides for naming the target device using 563 a string formatted as an iSCSI Qualified Name (IQN) or as an Extended 564 Unique Identifier (EUI) [11] string. Those are typically used to 565 identify iSCSI or SCSI RDMA Protocol (SRP) [19] devices. The 566 Network Address Authority (NAA) string format (see [10]) provides for 567 naming the device using globally unique identifiers, as defined in 568 Fibre Channel Framing and Signaling (FC-FS) [20]. These are 569 typically used to identify Fibre Channel or SAS [21] (Serial Attached 570 SCSI) devices; in particular, devices that are dual-attached 571 both over Fibre Channel or SAS and over iSCSI. 573 When "ooa_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the 574 "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device 575 Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3. 576 "Device Identification VPD Page"). If the Device Identifier is 577 identical to the OSD System ID, as given by ooa_systemid, the server 578 SHOULD provide a zero-length oti_scsi_device_id opaque value. Note 579 that similarly to the "oti_scsi_name", the specification of the 580 oti_scsi_device_id opaque contents is outside the scope of this 581 document and more formats MAY be specified in the future in 582 accordance with SPC-3. 584 The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing 585 no target identification.
In this case, only the OSD System ID, and 586 optionally the provided network address, are used to locate the 587 device. 589 4.2.2. Device Network Address 591 The optional "ooa_netaddrs" field MAY be provided by the server as a 592 hint to accelerate device discovery over, e.g., the iSCSI transport 593 protocol. The network address is given with the netaddr4 type, which 594 specifies a list of TCP/IP based endpoints (as specified in NFSv4.1 595 [2]). When given, the client SHOULD use it to probe for the SCSI 596 device at the given network address(es). The client MAY still use 597 other discovery mechanisms such as Internet Storage Name Service 598 (iSNS) [12] to locate the device using the ooa_targetid. In 599 particular, such an external name service SHOULD be used when the 600 devices may be attached to the network using multiple connections, 601 and/or multiple storage fabrics (e.g., Fibre-Channel and iSCSI). 603 The "ooa_lun" field identifies the OSD 64-bit Logical Unit Number, 604 formatted in accordance with SAM-3 [13]. The client uses the Logical 605 Unit Number to communicate with the specific OSD Logical Unit. Its 606 use is defined in detail by the SCSI transport protocol, e.g., iSCSI 607 [8]. 609 4.2.3. NFS Device Identifier 611 The PNFS_OBJ_NFS pnfs_obj_type4 is used to identify NFS filer 612 devices. In this case, ona_version and ona_minorversion represent 613 the NFS protocol version to be used to access the NFS filer. Either 614 the "ona_netaddrs" or "ona_fqdn" fields are used to locate the 615 device. "ona_netaddrs" MAY be set to a list holding one or more of 616 the device network addresses. Alternatively, "ona_fqdn" MAY be set 617 to the NFS device fully qualified domain name that can be resolved by 618 the client to locate the NFS device network address (See Domain Names 619 [22]). The server MUST provide either one of "ona_netaddrs" or 620 "ona_fqdn", but it MUST NOT provide both. 
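The exactly-one-of rule for "ona_netaddrs" and "ona_fqdn", and the FQDN resolution it implies on the client side, can be sketched in Python. The type and function names here (NfsAddr, locate_nfs_device) are illustrative assumptions; the draft defines only the XDR pnfs_obj_nfs_addr4:

```python
# Sketch of client-side handling of a pnfs_obj_nfs_addr4-like structure:
# validate that the server supplied exactly one of ona_netaddrs/ona_fqdn,
# and resolve the FQDN via DNS when no explicit addresses were given.
import socket
from dataclasses import dataclass, field
from typing import List

@dataclass
class NfsAddr:                                         # stand-in for pnfs_obj_nfs_addr4
    netaddrs: List[str] = field(default_factory=list)  # ona_netaddrs
    fqdn: str = ""                                     # ona_fqdn

def locate_nfs_device(addr: NfsAddr) -> List[str]:
    """Return the device's network address(es), resolving ona_fqdn if needed."""
    # The server MUST provide either ona_netaddrs or ona_fqdn, never both.
    if bool(addr.netaddrs) == bool(addr.fqdn):
        raise ValueError("exactly one of ona_netaddrs/ona_fqdn is required")
    if addr.netaddrs:
        return addr.netaddrs
    # Resolve the FQDN to locate the device (2049 is the standard NFS port).
    infos = socket.getaddrinfo(addr.fqdn, 2049, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

A real client would additionally honor ona_version/ona_minorversion when establishing the NFS session, and check ona_path as required by the following paragraph.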
"ona_path" MUST be set by the server to an exported path on the 623 device. The path provided MUST exist and be accessible to the 624 client. If the path does not exist, the client MUST ignore this 625 device information and any layouts referring to the respective 626 deviceid until valid device information is acquired. 628 5. Object-Based Layout 630 The layout4 type is defined in NFSv4.1 [2] as follows: 632 enum layouttype4 { 633 LAYOUT4_NFSV4_1_FILES = 1, 634 LAYOUT4_OSD2_OBJECTS = 2, 635 LAYOUT4_BLOCK_VOLUME = 3, 636 LAYOUT4_OBJECTS_V2 = 0x08010004 /* Tentatively */ 637 }; 639 struct layout_content4 { 640 layouttype4 loc_type; 641 opaque loc_body<>; 642 }; 644 struct layout4 { 645 offset4 lo_offset; 646 length4 lo_length; 647 layoutiomode4 lo_iomode; 648 layout_content4 lo_content; 649 }; 651 This document defines the structure associated with the layouttype4 652 value LAYOUT4_OBJECTS_V2. NFSv4.1 [2] specifies the loc_body 653 structure as an XDR type "opaque". The opaque layout is 654 uninterpreted by the generic pNFS client layers, but obviously must 655 be interpreted by the object storage layout driver. This section 656 defines the structure of this opaque value, pnfs_obj_layout4. 658 5.1. pnfs_obj_data_map4 660 /// struct pnfs_obj_data_map4 { 661 /// uint32_t odm_num_comps; 662 /// length4 odm_stripe_unit; 663 /// uint32_t odm_group_width; 664 /// uint32_t odm_group_depth; 665 /// uint32_t odm_mirror_cnt; 666 /// pnfs_obj_raid_algorithm4 odm_raid_algorithm; 667 /// }; 668 /// 670 The pnfs_obj_data_map4 structure parameterizes the algorithm that 671 maps a file's contents over the component objects. Instead of 672 limiting the system to a simple striping scheme, where loss of a single 673 component object results in data loss, the map parameters support 674 mirroring and more complicated schemes that protect against loss of a 675 component object. 677 "odm_num_comps" is the number of component objects the file is 678 striped over.
The server MAY grow the file by adding more components
to the stripe while clients hold valid layouts, until the file has
reached its final stripe width.  The file length in this case MUST be
limited to the number of bytes in a full stripe.

The "odm_stripe_unit" is the number of bytes placed on one component
before advancing to the next one in the list of components.  The
number of bytes in a full stripe is odm_stripe_unit times the number
of components.  In some RAID schemes, a stripe includes redundant
information (i.e., parity) that lets the system recover from loss of
or damage to a component object.

The "odm_group_width" and "odm_group_depth" parameters allow a nested
striping pattern (see Section 5.3.2 for details).  If there is no
nesting, then odm_group_width and odm_group_depth MUST be zero.  The
size of the components array MUST be a multiple of odm_group_width.

The "odm_mirror_cnt" is used to replicate a file by replicating its
component objects.  If there is no mirroring, then odm_mirror_cnt
MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
the component array MUST be a multiple of (odm_mirror_cnt+1).

See Section 5.3 for more details.

5.2.  pnfs_obj_layout4

   /// struct pnfs_obj_layout4 {
   ///     pnfs_obj_data_map4 olo_map;
   ///     uint32_t           olo_comps_index;
   ///     pnfs_obj_comp4     olo_components<>;
   /// };
   ///

The pnfs_obj_layout4 structure specifies a layout over a set of
component objects.  The "olo_components" field is an array of object
identifiers and security credentials that grant access to each
object.  The organization of the data is defined by the
pnfs_obj_data_map4 type, which specifies how the file's data is
mapped onto the component objects (i.e., the striping pattern).
The data 717 placement algorithm that maps file data onto component objects 718 assumes that each component object occurs exactly once in the array 719 of components. Therefore, component objects MUST appear in the 720 olo_components array only once. The components array may represent 721 all objects comprising the file, in which case "olo_comps_index" is 722 set to zero and the number of entries in the olo_components array is 723 equal to olo_map.odm_num_comps. The server MAY return fewer 724 components than odm_num_comps, provided that the returned components 725 are sufficient to access any byte in the layout's data range (e.g., a 726 sub-stripe of "odm_group_width" components). In this case, 727 olo_comps_index represents the position of the returned components 728 array within the full array of components that comprise the file. 730 Note that the layout depends on the file size, which the client 731 learns from the generic return parameters of LAYOUTGET, by doing 732 GETATTR commands to the metadata server. The client uses the file 733 size to decide if it should fill holes with zeros or return a short 734 read. Striping patterns can cause cases where component objects are 735 shorter than other components because a hole happens to correspond to 736 the last part of the component object. 738 5.3. Data Mapping Schemes 740 This section describes the different data mapping schemes in detail. 741 The object layout always uses a "dense" layout as described in 742 NFSv4.1 [2]. This means that the second stripe unit of the file 743 starts at offset 0 of the second component, rather than at offset 744 stripe_unit bytes. After a full stripe has been written, the next 745 stripe unit is appended to the first component object in the list 746 without any holes in the component objects. 748 5.3.1. 
Simple Striping

The mapping from the logical offset within a file (L) to the
component object C and object-specific offset O is defined by the
following equations:

   L: logical offset into the file

   W: stripe width
      W = size of olo_components array

   S: number of bytes in a stripe
      S = W * stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / stripe_unit

   O: the component offset corresponding to L
      O = (N * stripe_unit) + (L % stripe_unit)

Note that this computation does not accommodate the same object
appearing in the olo_components array multiple times.  Therefore, the
server must not return layouts with the same object appearing
multiple times; if needed, the server can return multiple layout
segments, each covering a single instance of the object.

For example, consider an object striped over four devices,
<D0 D1 D2 D3>.  The stripe_unit is 4096 bytes.  The stripe width S is
thus 4 * 4096 = 16384.

   Offset 0:
      N = 0 / 16384 = 0
      C = (0 % 16384) / 4096 = 0 (D0)
      O = (0*4096) + (0%4096) = 0

   Offset 4096:
      N = 4096 / 16384 = 0
      C = (4096 % 16384) / 4096 = 1 (D1)
      O = (0*4096) + (4096%4096) = 0

   Offset 9000:
      N = 9000 / 16384 = 0
      C = (9000 % 16384) / 4096 = 2 (D2)
      O = (0*4096) + (9000%4096) = 808

   Offset 132000:
      N = 132000 / 16384 = 8
      C = (132000 % 16384) / 4096 = 0 (D0)
      O = (8*4096) + (132000%4096) = 33696

5.3.2.  Nested Striping

The odm_group_width and odm_group_depth parameters allow a nested
striping pattern.  odm_group_width defines the width of a data
stripe, and odm_group_depth defines how many stripes are written
before advancing to the next group of components in the list of
component objects for the file.  The math used to map from a file
offset to a component object and offset within that object is shown
below.
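Before turning to the nested math, the simple-striping equations of Section 5.3.1 can be transcribed into a short Python sketch. This is illustrative only (the function name is invented, and integer division stands in for the "/" of the equations); it reproduces the four worked examples of Section 5.3.1.

```python
def map_simple(L, stripe_unit, W):
    """Section 5.3.1 mapping: logical offset L to (component index C,
    component offset O) for a W-wide simple stripe."""
    S = W * stripe_unit           # bytes in a full stripe
    N = L // S                    # stripe number
    C = (L % S) // stripe_unit    # component index
    O = N * stripe_unit + L % stripe_unit
    return C, O

# The worked examples: four components, stripe_unit of 4096 bytes.
for L, expected in [(0, (0, 0)), (4096, (1, 0)),
                    (9000, (2, 808)), (132000, (0, 33696))]:
    assert map_simple(L, 4096, 4) == expected
```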
The computations map from the logical offset L to the component
index C and relative offset O within that component object.

   L: logical offset into the file

   FW: total number of components
       FW = size of olo_components array

   W: stripe width
      W = group_width, if not zero, else FW

   group_count: number of groups
      group_count = FW / group_width, if group_width is not zero,
      else 1

   D: number of data devices in a stripe
      D = W

   U: number of data bytes in a stripe within a group
      U = D * stripe_unit

   T: number of bytes striped within a group of component objects
      (before advancing to the next group)
      T = U * group_depth

   S: number of bytes striped across all component objects
      (before the pattern repeats)
      S = T * group_count

   M: the "major" (i.e., across all components) cycle number
      M = L / S

   G: group number from the beginning of the major cycle
      G = (L % S) / T

   H: byte offset within the group
      H = (L % S) % T

   N: the "minor" (i.e., across the group) stripe number
      N = H / U

   C: component index corresponding to L
      C = (G * D) + ((H % U) / stripe_unit)

   O: the component offset corresponding to L
      O = (M * group_depth * stripe_unit) + (N * stripe_unit) +
          (L % stripe_unit)

For example, consider an object striped over 100 devices with a
group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB.
In this scheme, 500 MB are written to the first 10 components, and
5000 MB are written before the pattern wraps back around to the first
component in the array.
   Offset 0:
      W = 100
      group_count = 100 / 10 = 10
      D = 10
      U = 1 MB * 10 = 10 MB
      T = 10 MB * 50 = 500 MB
      S = 500 MB * 10 = 5000 MB
      M = 0 / 5000 MB = 0
      G = (0 % 5000 MB) / 500 MB = 0
      H = (0 % 5000 MB) % 500 MB = 0
      N = 0 / 10 MB = 0
      C = (0 * 10) + ((0 % 10 MB) / 1 MB) = 0
      O = (0 * 50 * 1 MB) + (0 * 1 MB) + (0 % 1 MB) = 0

   Offset 27 MB:
      M = 27 MB / 5000 MB = 0
      G = (27 MB % 5000 MB) / 500 MB = 0
      H = (27 MB % 5000 MB) % 500 MB = 27 MB
      N = 27 MB / 10 MB = 2
      C = (0 * 10) + ((27 MB % 10 MB) / 1 MB) = 7
      O = (0 * 50 * 1 MB) + (2 * 1 MB) + (27 MB % 1 MB) = 2 MB

   Offset 7232 MB:
      M = 7232 MB / 5000 MB = 1
      G = (7232 MB % 5000 MB) / 500 MB = 4
      H = (7232 MB % 5000 MB) % 500 MB = 232 MB
      N = 232 MB / 10 MB = 23
      C = (4 * 10) + ((232 MB % 10 MB) / 1 MB) = 42
      O = (1 * 50 * 1 MB) + (23 * 1 MB) + (7232 MB % 1 MB) = 73 MB

5.3.3.  Mirroring

The odm_mirror_cnt is used to replicate a file by replicating its
component objects.  If there is no mirroring, then odm_mirror_cnt
MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
the olo_components array MUST be a multiple of (odm_mirror_cnt+1).
Thus, for a classic mirror on two objects, odm_mirror_cnt is one.
Note that mirroring can be defined over any RAID algorithm and
striping pattern (either simple or nested).  If odm_group_width is
also non-zero, then the size of the olo_components array MUST be a
multiple of odm_group_width * (odm_mirror_cnt+1).  Note that
odm_group_width does not account for mirrors.  Replicas are adjacent
in the olo_components array, and the value C produced by the above
equations is not a direct index into the olo_components array.
Instead, the following equations determine the replica component
index RCi, where i ranges from 0 to odm_mirror_cnt.
   FW = size of olo_components array / (odm_mirror_cnt+1)

   C = component index for striping or two-level striping,
       as calculated using the above equations

   i ranges from 0 to odm_mirror_cnt, inclusive:
   RCi = C * (odm_mirror_cnt+1) + i

5.4.  RAID Algorithms

pnfs_obj_raid_algorithm4 determines the algorithm and placement of
redundant data.  This section defines the different redundancy
algorithms.  Note: the term "RAID" (Redundant Array of Independent
Disks) is used in this document to represent an array of component
objects that store data for an individual file.  The objects are
stored on independent object-based storage devices.  File data is
encoded and striped across the array of component objects using
algorithms developed for block-based RAID systems.

5.4.1.  PNFS_OBJ_RAID_0

PNFS_OBJ_RAID_0 means there is no parity data, so all bytes in the
component objects are data bytes located by the above equations for C
and O.  If a component object is marked as PNFS_OBJ_MISSING, the pNFS
client MUST either return an I/O error when an attempt is made to
read that component or retry the READ against the pNFS server.

5.4.2.  PNFS_OBJ_RAID_4

PNFS_OBJ_RAID_4 means that the last component object, or the last in
each group (if odm_group_width is greater than zero), contains parity
information computed over the rest of the stripe with an XOR
operation.  If a component object is unavailable, the client can read
the rest of the stripe units in the damaged stripe and recompute the
missing stripe unit by XORing the other stripe units in the stripe.
Alternatively, the client can replay the READ against the pNFS
server, which will presumably perform the reconstructed read on the
client's behalf.
When parity is present in the file, the number of parity devices is
taken into account in the above equations when calculating (D), the
number of data devices in a stripe, as follows:

   P: number of parity devices in each stripe
      P = 1

   D: number of data devices in a stripe
      D = W - P

   I: parity device index
      I = D

5.4.3.  PNFS_OBJ_RAID_5

PNFS_OBJ_RAID_5 means that the position of the parity data is rotated
on each stripe or each group (if odm_group_width is greater than
zero).  In the first stripe, the last component holds the parity.  In
the second stripe, the next-to-last component holds the parity, and
so on.  In this scheme, all stripe units are rotated so that I/O is
evenly spread across objects as the file is read sequentially.  The
rotated parity layout is illustrated here, with hexadecimal numbers
indicating the stripe unit:

   0 1 2 P
   4 5 P 3
   8 P 6 7
   P 9 a b

Note that the math for RAID_5 is similar to that for RAID_4, except
that the device indices for each stripe are rotated backwards.  So
start with the equations above for RAID_4, then compute the rotation
as described below.  Also note that the parity rotation cycle always
starts on group boundaries, so the first stripe in a group has its
parity at device D.

   P: number of parity devices in each stripe
      P = 1

   PC: parity cycle
       PC = W

   R: the parity rotation index
      (N is as computed in the above equations for RAID-4)
      R = N % PC

   I: parity device index
      I = (W + W - (R + 1) * P) % W

   Cr: the rotated device index
       (C is as computed in the above equations for RAID-4)
       Cr = (W + C - (R * P)) % W

Note: W is added in the above equations to avoid negative numbers in
the modulo arithmetic.

5.4.4.  PNFS_OBJ_RAID_PQ

PNFS_OBJ_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
P+Q encoding scheme [23].  In this layout, the last two component
objects hold the P and Q data, respectively.
P is parity computed with XOR.  The Q computation is described in
detail by Anvin [24].  The same polynomial "x^8+x^4+x^3+x^2+1" and
Galois field size of 2^8 are used here.  Clients may simply choose to
read data through the metadata server if two or more components are
missing or damaged.

The equations given above for embedded parity can be used to map a
file offset to the correct component object by setting the number of
parity components (P) to 2 instead of 1 for RAID-5, and by computing
the parity cycle length as the Lowest Common Multiple [25] of
odm_group_width and P, divided by P, as described below.  Note: this
algorithm can also be used for RAID-5, where P=1.

   P: number of parity devices
      P = 2

   PC: parity cycle
       PC = LCM(W, P) / P

   Qdev: the device index holding the Q component
         (I is as computed in the above equations for RAID-5)
         Qdev = (I + 1) % W

5.4.5.  RAID Usage and Implementation Notes

RAID layouts with redundant data in their stripes require additional
serialization of updates to ensure correct operation.  Otherwise, if
two clients simultaneously write to the same logical range of an
object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information.  It is the
responsibility of the metadata server to enforce serialization
requirements such as this.  For example, the metadata server may do
so by not granting overlapping write layouts within mirrored objects.

Many alternative encoding schemes exist for P >= 2 [26].  These
involve P or Q equations different from those used in
PNFS_OBJ_RAID_PQ.  Thus, if one of these schemes is to be used in the
future, a distinct value must be added to pnfs_obj_raid_algorithm4
for it.
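As a cross-check on the arithmetic, the nested-striping map of Section 5.3.2 and the parity-rotation equations above can be transcribed into a short Python sketch. This is illustrative only: the function names are invented, the data map ignores parity devices (D = W, as in Section 5.3.2), and the rotation helper takes the RAID-4 component index and minor stripe number as inputs.

```python
from math import gcd

def map_nested(L, stripe_unit, num_comps, group_width=0, group_depth=0):
    """Sections 5.3.1/5.3.2 data mapping (no parity, D == W):
    logical offset L to (component index C, component offset O)."""
    W = group_width or num_comps        # stripe width
    gd = group_depth or 1               # stripes per group
    group_count = num_comps // W
    U = W * stripe_unit                 # data bytes in a stripe
    T = U * gd                          # bytes striped within a group
    S = T * group_count                 # bytes before the pattern repeats
    M, G = L // S, (L % S) // T         # major cycle and group number
    H = (L % S) % T                     # byte offset within the group
    N = H // U                          # minor stripe number
    C = G * W + (H % U) // stripe_unit
    O = (M * gd + N) * stripe_unit + L % stripe_unit
    return C, O

def rotate_parity(C, N, W, P=1):
    """Section 5.4.3 rotation (P=1), generalized as in 5.4.4 (P=2):
    returns (rotated data device Cr, first parity device I)."""
    PC = W // gcd(W, P)                 # parity cycle: LCM(W, P) / P
    R = N % PC                          # parity rotation index
    I = (W + W - (R + 1) * P) % W       # parity device index
    Cr = (W + C - R * P) % W            # rotated data device index
    return Cr, I

MB = 1 << 20
# Worked examples from Section 5.3.2:
assert map_nested(27 * MB, MB, 100, 10, 50) == (7, 2 * MB)
assert map_nested(7232 * MB, MB, 100, 10, 50) == (42, 73 * MB)
# RAID-5 rotation diagram (W=4): stripe 1 has parity on device 2, and
# its first data unit (RAID-4 index 0) lands on device 3.
assert rotate_parity(0, 1, 4) == (3, 2)
```

With group_width of zero, map_nested degenerates to the simple-striping equations of Section 5.3.1; with P=1, rotate_parity reduces to the RAID-5 cycle PC = W.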
While Reed- 1045 Solomon codes are well understood, recently discovered schemes such 1046 as Liberation codes are more computationally efficient for small 1047 group_widths, and Cauchy Reed-Solomon codes are more computationally 1048 efficient for higher values of P. 1050 6. Object-Based Layout Update 1052 layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates 1053 to the layout and additional information to the metadata server. It 1054 is defined in the NFSv4.1 [2] as follows: 1056 struct layoutupdate4 { 1057 layouttype4 lou_type; 1058 opaque lou_body<>; 1059 }; 1061 The layoutupdate4 type is an opaque value at the generic pNFS client 1062 level. If the lou_type layout type is LAYOUT4_OBJECTS_V2, then the 1063 lou_body opaque value is defined by the pnfs_obj_layoutupdate4 type. 1065 Object-Based pNFS clients are not allowed to modify the layout. 1066 Therefore, the information passed in pnfs_obj_layoutupdate4 is used 1067 only to update the file's attributes. In addition to the generic 1068 information the client can pass to the metadata server in 1069 LAYOUTCOMMIT such as the highest offset the client wrote to and the 1070 last time it modified the file, the client MAY use 1071 pnfs_obj_layoutupdate4 to convey the capacity consumed (or released) 1072 by writes using the layout, and to indicate that I/O errors were 1073 encountered by such writes. 1075 6.1. pnfs_obj_deltaspaceused4 1077 /// union pnfs_obj_deltaspaceused4 switch (bool dsu_valid) { 1078 /// case TRUE: 1079 /// int64_t dsu_delta; 1080 /// case FALSE: 1081 /// void; 1082 /// }; 1083 /// 1085 pnfs_obj_deltaspaceused4 is used to convey space utilization 1086 information at the time of LAYOUTCOMMIT. For the file system to 1087 properly maintain capacity-used information, it needs to track how 1088 much capacity was consumed by WRITE operations performed by the 1089 client. 
In this protocol, the OSD returns the capacity consumed by a write,
which can differ from the number of bytes written because of internal
overhead such as block-level allocation and indirect blocks, and the
client reflects this back to the pNFS server so that it can
accurately track quota.  The pNFS server can choose to trust this
information coming from the clients and therefore avoid querying the
OSDs at the time of LAYOUTCOMMIT.  If the client is unable to obtain
this information from the OSD, it simply returns olu_delta_space_used
with dsu_valid set to FALSE.

6.2.  pnfs_obj_layoutupdate4

   /// struct pnfs_obj_layoutupdate4 {
   ///     pnfs_obj_deltaspaceused4 olu_delta_space_used;
   ///     bool                     olu_ioerr_flag;
   /// };
   ///

"olu_delta_space_used" is used to convey capacity usage information
back to the metadata server.

The "olu_ioerr_flag" is used when I/O errors were encountered while
writing the file.  The client MUST report the errors using the
pnfs_obj_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time.

If the client updated the file successfully before hitting the I/O
errors, it MAY use LAYOUTCOMMIT to update the metadata server as
described above.  Typically, in the error-free case, the server MAY
turn around and update the file's attributes on the storage devices.
However, if I/O errors were encountered, the server should not
attempt to write the new attributes on the storage devices until it
receives the I/O error report; therefore, the client MUST set the
olu_ioerr_flag to true.  Note that in this case, the client SHOULD
send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same
COMPOUND RPC.

7.  Recovering from Client I/O Errors

The pNFS client may encounter errors when directly accessing the
object storage devices.  However, it is the responsibility of the
metadata server to handle the I/O errors.
When the 1130 LAYOUT4_OBJECTS_V2 layout type is used, the client MUST report the 1131 I/O errors to the server at LAYOUTRETURN time using the 1132 pnfs_obj_ioerr4 structure (see Section 8.1). 1134 The metadata server analyzes the error and determines the required 1135 recovery operations such as repairing any parity inconsistencies, 1136 recovering media failures, or reconstructing missing objects. 1138 The metadata server SHOULD recall any outstanding layouts to allow it 1139 exclusive write access to the stripes being recovered and to prevent 1140 other clients from hitting the same error condition. In these cases, 1141 the server MUST complete recovery before handing out any new layouts 1142 to the affected byte ranges. 1144 Although it MAY be acceptable for the client to propagate a 1145 corresponding error to the application that initiated the I/O 1146 operation and drop any unwritten data, the client SHOULD attempt to 1147 retry the original I/O operation by requesting a new layout using 1148 LAYOUTGET and retry the I/O operation(s) using the new layout, or the 1149 client MAY just retry the I/O operation(s) using regular NFS READ or 1150 WRITE operations via the metadata server. The client SHOULD attempt 1151 to retrieve a new layout and retry the I/O operation using OSD 1152 commands first and only if the error persists, retry the I/O 1153 operation via the metadata server. 1155 8. Object-Based Layout Return 1157 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1158 layout-type specific information to the server. 
It is defined in NFSv4.1 [2] as follows:

   struct layoutreturn_file4 {
       offset4  lrf_offset;
       length4  lrf_length;
       stateid4 lrf_stateid;
       /* layouttype4 specific data */
       opaque   lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
   case LAYOUTRETURN4_FILE:
       layoutreturn_file4 lr_layout;
   default:
       void;
   };

   struct LAYOUTRETURN4args {
       /* CURRENT_FH: file */
       bool                 lora_reclaim;
       layoutreturn_stateid lora_recallstateid;
       layouttype4          lora_layout_type;
       layoutiomode4        lora_iomode;
       layoutreturn4        lora_layoutreturn;
   };

If the lora_layout_type layout type is LAYOUT4_OBJECTS_V2, then the
lrf_body opaque value is defined by the pnfs_obj_layoutreturn4 type.

The pnfs_obj_layoutreturn4 type allows the client to report I/O error
information or layout usage statistics back to the metadata server,
as defined below.

8.1.  pnfs_obj_errno4

   /// enum pnfs_obj_errno4 {
   ///     PNFS_OBJ_ERR_EIO         = 1,
   ///     PNFS_OBJ_ERR_NOT_FOUND   = 2,
   ///     PNFS_OBJ_ERR_NO_SPACE    = 3,
   ///     PNFS_OBJ_ERR_BAD_CRED    = 4,
   ///     PNFS_OBJ_ERR_NO_ACCESS   = 5,
   ///     PNFS_OBJ_ERR_UNREACHABLE = 6,
   ///     PNFS_OBJ_ERR_RESOURCE    = 7
   /// };
   ///

pnfs_obj_errno4 is used to represent error types when read/write
errors are reported to the metadata server.  The error codes serve as
hints to the metadata server that may help it in diagnosing the exact
reason for the error and in repairing it.

o  PNFS_OBJ_ERR_EIO indicates that the operation failed because the
   object storage device experienced a failure while trying to access
   the object.  The most common source of these errors is media
   errors, but other internal errors might cause this as well.  In
   this case, the metadata server should examine the broken object
   more closely; hence, it should be used as the default error code.
1217 o PNFS_OBJ_ERR_NOT_FOUND indicates the object ID specifies an object 1218 that does not exist on the object storage device. 1220 o PNFS_OBJ_ERR_NO_SPACE indicates the operation failed because the 1221 object storage device ran out of free capacity during the 1222 operation. 1224 o PNFS_OBJ_ERR_BAD_CRED indicates the security parameters are not 1225 valid. The primary cause of this is that the capability has 1226 expired, or the access policy tag (a.k.a., capability version 1227 number) has been changed to revoke capabilities. The client will 1228 need to return the layout and get a new one with fresh 1229 capabilities. 1231 o PNFS_OBJ_ERR_NO_ACCESS indicates the capability does not allow the 1232 requested operation. This should not occur in normal operation 1233 because the metadata server should give out correct capabilities, 1234 or none at all. 1236 o PNFS_OBJ_ERR_UNREACHABLE indicates the client did not complete the 1237 I/O operation at the object storage device due to a communication 1238 failure. Whether or not the I/O operation was executed by the OSD 1239 is undetermined. 1241 o PNFS_OBJ_ERR_RESOURCE indicates the client did not issue the I/O 1242 operation due to a local problem on the initiator (i.e., client) 1243 side, e.g., when running out of memory. The client MUST guarantee 1244 that the OSD command was never dispatched to the OSD. 1246 8.2. 
pnfs_obj_ioerr4 1248 /// union pnfs_obj_objid4 switch (pnfs_obj_type4 oc_obj_type) { 1249 /// case PNFS_OBJ_OSD_V1: 1250 /// case PNFS_OBJ_OSD_V2: 1251 /// pnfs_obj_osd_objid4 oi_osd_objid; 1252 /// 1253 /// case PNFS_OBJ_NFS: 1254 /// pnfs_obj_nfs_objid4 oi_nfs_objid; 1255 /// }; 1256 /// 1257 /// struct pnfs_obj_ioerr4 { 1258 /// pnfs_obj_objid4 oer_component; 1259 /// offset4 oer_comp_offset; 1260 /// length4 oer_comp_length; 1261 /// bool oer_iswrite; 1262 /// pnfs_obj_errno4 oer_errno; 1263 /// }; 1264 /// 1266 The pnfs_obj_ioerr4 structure is used to return error indications for 1267 objects that generated errors during data transfers. These are hints 1268 to the metadata server that there are problems with that object. For 1269 each error, "oer_component", "oer_comp_offset", and "oer_comp_length" 1270 represent the object and byte range within the component object in 1271 which the error occurred; "oer_iswrite" is set to "true" if the 1272 failed OSD operation was data modifying, and "oer_errno" represents 1273 the type of error. 1275 Component byte ranges in the optional pnfs_obj_ioerr4 structure are 1276 used for recovering the object and MUST be set by the client to cover 1277 all failed I/O operations to the component. 1279 8.3. pnfs_obj_iostats4 1281 /// struct pnfs_obj_iostats4 { 1282 /// offset4 osr_offset; 1283 /// length4 osr_length; 1284 /// uint32_t osr_duration; 1285 /// uint32_t osr_rd_count; 1286 /// uint64_t osr_rd_bytes; 1287 /// uint32_t osr_wr_count; 1288 /// uint64_t osr_wr_bytes; 1289 /// }; 1290 /// 1292 With pNFS, the data transfers are performed directly between the pNFS 1293 client and the data servers. Therefore, the metadata server has no 1294 visibility to the I/O stream and cannot use any statistical 1295 information about client I/O to optimize data storage location. 1296 pnfs_obj_iostats4 MAY be used by the client to report I/O statistics 1297 back to the metadata server upon returning the layout. 
Since it is 1298 infeasible for the client to report every I/O that used the layout, 1299 the client MAY identify "hot" byte ranges for which to report I/O 1300 statistics. The definition and/or configuration mechanism of what is 1301 considered "hot" and the size of the reported byte range is out of 1302 the scope of this document. It is suggested for client 1303 implementation to provide reasonable default values and an optional 1304 run-time management interface to control these parameters. For 1305 example, a client can define the default byte range resolution to be 1306 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 1307 I/O operations per second. For each byte range, osr_offset and 1308 osr_length represent the starting offset of the range and the range 1309 length in bytes. osr_duration represents the number of seconds the 1310 reported burst of I/O lasted. osr_rd_count, osr_rd_bytes, 1311 osr_wr_count, and osr_wr_bytes represent, respectively, the number of 1312 contiguous read and write I/Os and the respective aggregate number of 1313 bytes transferred within the reported byte range. 1315 8.4. pnfs_obj_layoutreturn4 1317 /// struct pnfs_obj_layoutreturn4 { 1318 /// pnfs_obj_ioerr4 olr_ioerr_report<>; 1319 /// pnfs_obj_iostats4 olr_iostats_report<>; 1320 /// }; 1321 /// 1323 When object I/O operations failed, "olr_ioerr_report<>" is used to 1324 report these errors to the metadata server as an array of elements of 1325 type pnfs_obj_ioerr4. Each element in the array represents an error 1326 that occurred on the object specified by oer_component. If no errors 1327 are to be reported, the size of the olr_ioerr_report<> array is set 1328 to zero. The client MAY also use "olr_iostats_report<>" to report a 1329 list of I/O statistics as an array of elements of type 1330 pnfs_obj_iostats4. Each element in the array represents statistics 1331 for a particular byte range. 
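For illustration only, the example defaults above (1 MB byte-range resolution, reporting thresholds of 1 MB/second or 10 I/O operations per second) might be applied on the client as follows; the helper name and signature are hypothetical, not part of the protocol.

```python
MB = 1 << 20

def is_hot(duration_s, rd_count, rd_bytes, wr_count, wr_bytes,
           bytes_per_s=1 * MB, iops=10):
    """Decide whether a byte range's I/O burst is worth reporting in
    pnfs_obj_iostats4, using the example thresholds from the text."""
    if duration_s <= 0:
        return False
    total_bytes = rd_bytes + wr_bytes
    total_ios = rd_count + wr_count
    # A range qualifies if either the throughput threshold or the
    # I/O-rate threshold is reached.
    return (total_bytes / duration_s >= bytes_per_s
            or total_ios / duration_s >= iops)

assert is_hot(10, 5, 4 * MB, 120, 8 * MB)     # 1.2 MB/s: report
assert not is_hot(10, 3, 1 * MB, 2, 1 * MB)   # 0.2 MB/s, 0.5 IOPS: skip
```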
Byte ranges are not guaranteed to be disjoint and MAY repeat or
intersect.

9.  Object-Based Creation Layout Hint

The layouthint4 type is defined in NFSv4.1 [2] as follows:

   struct layouthint4 {
       layouttype4 loh_type;
       opaque      loh_body<>;
   };

The layouthint4 structure is used by the client to pass a hint about
the type of layout it would like created for a particular file.  If
the loh_type layout type is LAYOUT4_OBJECTS_V2, then the loh_body
opaque value is defined by the pnfs_obj_layouthint4 type.

9.1.  pnfs_obj_layouthint4

   /// union pnfs_obj_max_comps_hint4 switch (bool omx_valid) {
   /// case TRUE:
   ///     uint32_t omx_max_comps;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_stripe_unit_hint4 switch (bool osu_valid) {
   /// case TRUE:
   ///     length4 osu_stripe_unit;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_group_width_hint4 switch (bool ogw_valid) {
   /// case TRUE:
   ///     uint32_t ogw_group_width;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_group_depth_hint4 switch (bool ogd_valid) {
   /// case TRUE:
   ///     uint32_t ogd_group_depth;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_mirror_cnt_hint4 switch (bool omc_valid) {
   /// case TRUE:
   ///     uint32_t omc_mirror_cnt;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_raid_algorithm_hint4 switch (bool ora_valid) {
   /// case TRUE:
   ///     pnfs_obj_raid_algorithm4 ora_raid_algorithm;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// struct pnfs_obj_layouthint4 {
   ///     pnfs_obj_max_comps_hint4      olh_max_comps_hint;
   ///     pnfs_obj_stripe_unit_hint4    olh_stripe_unit_hint;
   ///     pnfs_obj_group_width_hint4    olh_group_width_hint;
   ///     pnfs_obj_group_depth_hint4    olh_group_depth_hint;
   ///     pnfs_obj_mirror_cnt_hint4     olh_mirror_cnt_hint;
   ///     pnfs_obj_raid_algorithm_hint4 olh_raid_algorithm_hint;
   /// };
   ///

This type conveys hints for the desired data map.  All parameters are
optional, so the client can give values for only the parameters it
cares about; e.g., it can provide a hint for the desired number of
mirrored components, regardless of the RAID algorithm selected for
the file.  The server should make an attempt to honor the hints, but
it can ignore any or all of them at its own discretion and without
failing the respective CREATE operation.

The "olh_max_comps_hint" can be used to limit the total number of
component objects comprising the file.  All other hints correspond
directly to the different fields of pnfs_obj_data_map4.

10.  Layout Segments

The pNFS layout operations operate on logical byte ranges.  There is
no requirement in the protocol for any relationship between byte
ranges used in LAYOUTGET to acquire layouts and byte ranges used in
CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN.  However, using OSD
byte-range capabilities poses limitations on these operations since
the capabilities associated with layout segments cannot be merged or
split.  The following guidelines should be followed for proper
operation of object-based layouts.

10.1.  CB_LAYOUTRECALL and LAYOUTRETURN

In general, the object-based layout driver should keep track of each
layout segment it got, keeping a record of the segment's iomode,
offset, and length.  The server should allow the client to get
multiple overlapping layout segments but is free to recall the layout
to prevent overlap.

In response to CB_LAYOUTRECALL, the client should return all layout
segments matching the given iomode and overlapping with the recalled
range.
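A client-side sketch of this selection (the bookkeeping structure and helper names are hypothetical, not mandated by the protocol): segments are tracked whole, matched by iomode, and returned whenever they overlap the recalled range.

```python
from dataclasses import dataclass

# NFSv4.1's NFS4_UINT64_MAX length means "to the end of the file".
LAYOUT4_LEN_MAX = 0xFFFFFFFFFFFFFFFF

@dataclass(frozen=True)
class Segment:
    iomode: str      # "READ" or "RW"
    offset: int
    length: int

def overlaps(a_off, a_len, b_off, b_len):
    """True if byte ranges [a_off, a_off+a_len) and [b_off, b_off+b_len)
    intersect, treating LAYOUT4_LEN_MAX as unbounded."""
    a_end = None if a_len == LAYOUT4_LEN_MAX else a_off + a_len
    b_end = None if b_len == LAYOUT4_LEN_MAX else b_off + b_len
    return ((a_end is None or a_end > b_off)
            and (b_end is None or b_end > a_off))

def segments_to_return(held, recall_iomode, r_off, r_len):
    """Pick the held segments a CB_LAYOUTRECALL obliges the client to
    return: iomode matches ("ANY" matches both) and the ranges overlap.
    Each segment is returned whole, never as a sub-range."""
    return [s for s in held
            if (recall_iomode == "ANY" or s.iomode == recall_iomode)
            and overlaps(s.offset, s.length, r_off, r_len)]

held = [Segment("RW", 0, 100), Segment("READ", 0, 100),
        Segment("RW", 200, 100)]
# Recalling RW [50, 150) hits only the first RW segment, returned whole.
assert segments_to_return(held, "RW", 50, 100) == [Segment("RW", 0, 100)]
```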
When returning the layouts for this byte range with LAYOUTRETURN, the
client MUST NOT return a sub-range of a layout segment it holds; each
LAYOUTRETURN sent MUST completely cover at least one outstanding
layout segment.

The server, in turn, should release any segment that exactly matches
the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
exact match is found, then the server should release all layout
segments matching the clientid and iomode that are fully contained in
the returned byte range.  If none are found and the byte range is a
subset of an outstanding layout segment for the same clientid and
iomode, then the client can be considered malfunctioning and the
server SHOULD recall all layouts from this client to reset its state.
If this behavior repeats, the server SHOULD deny all LAYOUTGETs from
this client.

10.2.  LAYOUTCOMMIT

LAYOUTCOMMIT is used by object-based pNFS only to convey modified-
attribute hints and/or to report the presence of I/O errors to the
metadata server (MDS).  Therefore, the offset and length in
LAYOUTCOMMIT4args are reserved for future use and should be set to 0.

11.  Recalling Layouts

The object-based metadata server should recall outstanding layouts in
the following cases:

o  When the file's security policy changes, i.e., Access Control
   Lists (ACLs) or permission mode bits are set.

o  When the file's aggregation map changes, rendering outstanding
   layouts invalid.

o  When there are sharing conflicts.  For example, the server will
   issue stripe-aligned layout segments for RAID-5 objects.  To
   prevent corruption of the file's parity, multiple clients must not
   hold valid write layouts for the same stripes.
An outstanding 1475 READ/WRITE (RW) layout should be recalled when a conflicting 1476 LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW 1477 and for a byte range overlapping with the outstanding layout 1478 segment. 1480 11.1. CB_RECALL_ANY 1482 The metadata server can use the CB_RECALL_ANY callback operation to 1483 notify the client to return some or all of its layouts. The NFSv4.1 specification 1484 [2] defines the following types: 1486 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 1487 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 1489 struct CB_RECALL_ANY4args { 1490 uint32_t craa_objects_to_keep; 1491 bitmap4 craa_type_mask; 1492 }; 1494 Typically, CB_RECALL_ANY will be used to recall client state when the 1495 server needs to reclaim resources. The craa_type_mask bitmap 1496 specifies the type of resources that are recalled, and the 1497 craa_objects_to_keep value specifies how many of the recalled objects 1498 the client is allowed to keep. The object-based layout type mask 1499 flags are defined as follows. They represent the iomode of the 1500 recalled layouts. In response, the client SHOULD return layouts of 1501 the recalled iomode that it needs the least, keeping at most 1502 craa_objects_to_keep object-based layouts. 1504 /// enum pnfs_obj_cb_recall_any_mask { 1505 /// PNFS_OSD_RCA4_TYPE_MASK_READ = 8, 1506 /// PNFS_OSD_RCA4_TYPE_MASK_RW = 9 1507 /// }; 1508 /// 1510 The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return 1511 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1512 PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1513 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1514 is notified to return layouts of either iomode. 1516 12.
Client Fencing 1518 In cases where clients are uncommunicative and their lease has 1519 expired or when clients fail to return recalled layouts within at 1520 least a lease period (see "Recalling a Layout" [2]), the server 1521 MAY revoke client layouts and/or device address mappings and reassign 1522 these resources to other clients. To avoid data corruption, the 1523 metadata server MUST fence off the revoked clients from the 1524 respective objects as described in Section 13.4. 1526 13. Security Considerations 1528 The pNFS extension partitions the NFSv4 file system protocol into two 1529 parts, the control path and the data path (storage protocol). The 1530 control path contains all the new operations described by this 1531 extension; all existing NFSv4 security mechanisms and features apply 1532 to the control path. The combination of components in a pNFS system 1533 is required to preserve the security properties of NFSv4 with respect 1534 to an entity accessing data via a client, including security 1535 countermeasures to defend against threats that NFSv4 provides 1536 defenses for in environments where these threats are considered 1537 significant. 1539 The metadata server enforces the file access-control policy at 1540 LAYOUTGET time. The client should use suitable authorization 1541 credentials for getting the layout for the requested iomode (READ or 1542 RW), and the server verifies the permissions and ACL for these 1543 credentials, possibly returning NFS4ERR_ACCESS if the client is not 1544 allowed the requested iomode. If the LAYOUTGET operation succeeds, 1545 the client receives, as part of the layout, a set of object 1546 capabilities allowing it I/O access to the specified objects 1547 corresponding to the requested iomode.
When the client acts on I/O 1548 operations on behalf of its local users, it MUST authenticate and 1549 authorize the user by issuing respective OPEN and ACCESS calls to the 1550 metadata server, similar to having NFSv4 data delegations. If access 1551 is allowed, the client uses the corresponding (READ or RW) 1552 capabilities to perform the I/O operations at the object storage 1553 devices. When the metadata server receives a request to change a 1554 file's permissions or ACL, it SHOULD recall all layouts for that file, 1555 and it MUST change the capability version attribute on all objects 1556 comprising the file to implicitly invalidate any outstanding 1557 capabilities before committing to the new permissions and ACL. Doing 1558 this will ensure that clients re-authorize their layouts according to 1559 the modified permissions and ACL by requesting new layouts. 1560 Recalling the layouts in this case is a courtesy of the server, intended 1561 to prevent clients from getting an error on I/Os done after the 1562 capability version changed. 1564 The object storage protocol MUST implement the security aspects 1565 described in version 1 of the T10 OSD protocol definition [1]. The 1566 standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and 1567 ALLDATA. To provide a minimum level of security, allowing verification 1568 and enforcement of the server access control policy using the layout 1569 security credentials, the NOSEC security method MUST NOT be used for 1570 any I/O operation. The remainder of this section gives an overview 1571 of the security mechanism described in that standard. The goal is to 1572 give the reader a basic understanding of the object security model. 1573 Any discrepancies between this text and the actual standard are 1574 to be resolved in favor of the OSD standard. 1576 13.1.
OSD Security Data Types 1578 There are three main data types associated with object security: a 1579 capability, a credential, and security parameters. The capability is 1580 a set of fields that specifies an object and what operations can be 1581 performed on it. A credential is a signed capability. Only a 1582 security manager that knows the secret device keys can correctly sign 1583 a capability to form a valid credential. In pNFS, the file server 1584 acts as the security manager and returns signed capabilities (i.e., 1585 credentials) to the pNFS client. The security parameters are values 1586 computed by the issuer of OSD commands (i.e., the client) that prove 1587 they hold valid credentials. The client uses the credential as a 1588 signing key to sign the requests it makes to OSD, and puts the 1589 resulting signatures into the security_parameters field of the OSD 1590 command. The object storage device uses the secret keys it shares 1591 with the security manager to validate the signature values in the 1592 security parameters. 1594 The security types are opaque to the generic layers of the pNFS 1595 client. The credential contents are defined as opaque within the 1596 pnfs_obj_cred4 type. Instead of repeating the definitions here, the 1597 reader is referred to Section 4.9.2.2 of the OSD standard. 1599 13.2. The OSD Security Protocol 1601 The object storage protocol relies on a cryptographically secure 1602 capability to control accesses at the object storage devices. 1603 Capabilities are generated by the metadata server, returned to the 1604 client, and used by the client as described below to authenticate 1605 their requests to the object-based storage device. Capabilities 1606 therefore achieve the required access and open mode checking. They 1607 allow the file server to define and check a policy (e.g., open mode) 1608 and the OSD to enforce that policy without knowing the details (e.g., 1609 user IDs and ACLs). 
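The division of labor described above, in which the file server defines the policy and the OSD enforces it without knowing users or ACLs, can be illustrated with a minimal sketch. This is a hypothetical Python model, not anything defined by the protocol; the `Capability` fields and `osd_allows` function are illustrative stand-ins for the actual T10 capability format.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """Illustrative stand-in for a T10 capability: what object,
    which operations, and until when."""
    object_id: int
    allowed_ops: frozenset   # e.g. frozenset({"READ"}) for iomode READ
    expires: float           # explicit expiration time (seconds since epoch)

def osd_allows(cap: Capability, object_id: int, op: str, now: float) -> bool:
    """OSD-side check: the device consults only the capability;
    it never needs user IDs or ACLs."""
    return (cap.object_id == object_id
            and op in cap.allowed_ops
            and now < cap.expires)

# A READ-iomode layout yields a read-only capability for object 17.
read_cap = Capability(object_id=17,
                      allowed_ops=frozenset({"READ"}),
                      expires=time.time() + 3600)
assert osd_allows(read_cap, 17, "READ", time.time())
assert not osd_allows(read_cap, 17, "WRITE", time.time())  # open-mode check
```

In the real protocol, the capability is additionally signed so the OSD can verify it was minted by the security manager, as the following subsections describe.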
1611 Since capabilities are tied to layouts, and since they are used to 1612 enforce access control, when the file ACL or mode changes the 1613 outstanding capabilities MUST be revoked to enforce the new access 1614 permissions. The server SHOULD recall layouts to allow clients to 1615 gracefully return their capabilities before the access permissions 1616 change. 1618 Each capability is specific to a particular object, an operation on 1619 that object, a byte range within the object (in OSDv2), and has an 1620 explicit expiration time. The capabilities are signed with a secret 1621 key that is shared by the object storage devices and the metadata 1622 managers. Clients do not have device keys so they are unable to 1623 forge the signatures in the security parameters. The combination of 1624 a capability, the OSD System ID, and a signature is called a 1625 "credential" in the OSD specification. 1627 The details of the security and privacy model for object storage are 1628 defined in the T10 OSD standard. The following sketch of the 1629 algorithm should help the reader understand the basic model. 1631 LAYOUTGET returns a CapKey and a Cap, which, together with the OSD 1632 SystemID, are also called a credential. It is a capability and a 1633 signature over that capability and the SystemID. The OSD Standard 1634 refers to the CapKey as the "Credential integrity check value" and to 1635 the ReqMAC as the "Request integrity check value". 1637 CapKey = MAC(Cap, SystemID) 1638 Credential = {Cap, SystemID, CapKey} 1640 The client uses CapKey to sign all the requests it issues for that 1641 object using the respective Cap. In other words, the Cap appears in 1642 the request to the storage device, and that request is signed with 1643 the CapKey as follows: 1645 ReqMAC = MAC(Req, ReqNonce) 1646 Request = {Cap, Req, ReqNonce, ReqMAC} 1648 The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. 
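As an illustration only, the signing flow sketched by the formulas above can be written out in Python. HMAC-SHA256 stands in for the MAC algorithm the T10 OSD standard actually specifies, the keying (the device secret key for CapKey, the CapKey for ReqMAC) is made explicit where the formulas leave it implicit, and all values (`secret_key`, `system_id`, the encoded capability and request) are hypothetical.

```python
import hmac, hashlib, os

def MAC(key: bytes, *fields: bytes) -> bytes:
    """Keyed MAC over the concatenated fields (HMAC-SHA256 here;
    the T10 standard specifies its own algorithm and encoding)."""
    return hmac.new(key, b"".join(fields), hashlib.sha256).digest()

secret_key = os.urandom(32)   # shared by MDS and OSD; clients never see it
system_id = b"osd-system-1"   # hypothetical OSD SystemID
cap = b"obj=17;ops=RW"        # encoded capability (illustrative)

# Metadata server, at LAYOUTGET time: CapKey = MAC(Cap, SystemID)
cap_key = MAC(secret_key, cap, system_id)
credential = (cap, system_id, cap_key)   # returned to the client

# Client, signing a request with CapKey: ReqMAC = MAC(Req, ReqNonce)
req, req_nonce = b"READ obj=17 off=0 len=4096", os.urandom(12)
req_mac = MAC(cap_key, req, req_nonce)
request = (cap, req, req_nonce, req_mac)  # what is sent to the OSD

# OSD, using its copy of the secret key, recomputes and compares:
local_cap_key = MAC(secret_key, cap, system_id)
local_req_mac = MAC(local_cap_key, req, req_nonce)
assert hmac.compare_digest(req_mac, local_req_mac)  # request accepted
```

Because the client lacks `secret_key`, it cannot mint a CapKey for any capability the metadata server did not sign, which is what makes the scheme enforceable at the device.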
The 1649 OSD uses the SecretKey it shares with the metadata server to compare 1650 the ReqMAC the client sent with a locally computed value: 1652 LocalCapKey = MAC(Cap, SystemID) 1653 LocalReqMAC = MAC(Req, ReqNonce) 1655 and if they match, the OSD assumes that the capabilities came from an 1656 authentic metadata server and allows access to the object, as allowed 1657 by the Cap. 1659 13.3. Protocol Privacy Requirements 1661 Note that if the server LAYOUTGET reply, holding CapKey and Cap, is 1662 snooped by another client, it can be used to generate valid OSD 1663 requests (within the Cap access restrictions). 1665 To meet the privacy requirements for the capability key 1666 returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., 1667 by using the RPCSEC_GSS privacy method to send the LAYOUTGET 1668 operation or by using the SSV key to encrypt the oc_capability_key 1669 using the GSS_Wrap() function. Two general ways to provide privacy 1670 in the absence of GSS-API that are independent of NFSv4 are either an 1671 isolated network such as a VLAN or a secure channel provided by IPsec 1672 [17]. 1674 13.4. Revoking Capabilities 1676 At any time, the metadata server may invalidate all outstanding 1677 capabilities on an object by changing its POLICY ACCESS TAG 1678 attribute. The value of the POLICY ACCESS TAG is part of a 1679 capability, and it must match the state of the object attribute. If 1680 they do not match, the OSD rejects accesses to the object with the 1681 sense key set to ILLEGAL REQUEST and an additional sense code set to 1682 INVALID FIELD IN CDB. When a client attempts to use a capability and 1683 is rejected this way, it should issue a LAYOUTCOMMIT for the object 1684 and specify PNFS_OBJ_BAD_CRED in the olr_ioerr_report parameter. The 1685 client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or 1686 LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed 1687 set of capabilities.
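The tag-mismatch check described above can be modeled with a small sketch. This is illustrative Python only; the class and method names are hypothetical and not part of the T10 or pNFS specifications. A capability embeds the POLICY ACCESS TAG it was minted against, and the OSD rejects it once the object's tag has moved on.

```python
class ObjectStore:
    """Hypothetical in-memory model of the OSD-side tag check."""

    def __init__(self):
        self.policy_access_tag = 1

    def bump_tag(self):
        """MDS-triggered revocation: changing the tag implicitly
        invalidates every outstanding capability on the object."""
        self.policy_access_tag += 1

    def check_capability(self, cap_tag: int) -> bool:
        # A mismatch maps to ILLEGAL REQUEST / INVALID FIELD IN CDB
        # in the real OSD protocol.
        return cap_tag == self.policy_access_tag

osd = ObjectStore()
cap_tag = osd.policy_access_tag   # captured when the layout was granted
assert osd.check_capability(cap_tag)
osd.bump_tag()                    # e.g., permissions changed on the object
assert not osd.check_capability(cap_tag)  # client must fetch a new layout
```

After the rejection, the client reports PNFS_OBJ_BAD_CRED and reacquires the layout, obtaining capabilities minted against the new tag.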
1689 The metadata server may elect to change the access policy tag on an 1690 object at any time, for any reason (with the understanding that there 1691 is likely an associated performance penalty, especially if there are 1692 outstanding layouts for this object). The metadata server MUST 1693 revoke outstanding capabilities when any one of the following occurs: 1695 o the permissions on the object change, 1697 o a conflicting mandatory byte-range lock is granted, or 1699 o a layout is revoked and reassigned to another client. 1701 A pNFS client will typically hold one layout for each byte range for 1702 either READ or READ/WRITE. The client's credentials are checked by 1703 the metadata server at LAYOUTGET time, and it is the client's 1704 responsibility to enforce access control among multiple users 1705 accessing the same file. It is neither required nor expected that 1706 the pNFS client will obtain a separate layout for each user accessing 1707 a shared object. The client SHOULD use OPEN and ACCESS calls to 1708 check user permissions when performing I/O so that the server's 1709 access control policies are correctly enforced. The result of the 1710 ACCESS operation may be cached while the client holds a valid layout, 1711 as the server is expected to recall layouts when the file's access 1712 permissions or ACL change. 1714 13.5. Security Considerations over NFS 1716 When NFS is used as the storage protocol, the T10 security 1717 mechanism cannot be implemented. Instead, the server uses the NFS 1718 owner and group identifiers, in combination with the RPC credentials 1719 provided with the layout (as described in Paragraph 4), to simulate 1720 the OSD CAPKEY security model. The file's mode or ACL is set to 1721 grant the file's owner READ and WRITE permissions, the file's 1722 group READ-only permissions, and no permissions to others.
1723 Respectively, the client is provided with credentials 1724 providing READ/WRITE or READ-only access matching the layout's 1725 lo_iomode. 1727 Fencing off a client over NFS is achieved by modifying the respective 1728 files' ownership attributes. This will implicitly revoke the 1729 outstanding credentials and will require the client to ask the server 1730 for new layouts. 1732 14. IANA Considerations 1734 As described in NFSv4.1 [2], new layout type numbers have been 1735 assigned by IANA. This document defines the protocol associated with 1736 the existing layout type number, LAYOUT4_OBJECTS_V2, and it requires 1737 no further actions for IANA. 1739 15. References 1741 15.1. Normative References 1743 [1] Weber, R., "Information Technology - SCSI Object-Based Storage 1744 Device Commands (OSD)", ANSI INCITS 400-2004, December 2004. 1746 [2] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network 1747 File System (NFS) Version 4 Minor Version 1 Protocol", 1748 RFC 5661, January 2010. 1750 [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement 1751 Levels", BCP 14, RFC 2119, March 1997. 1753 [4] Eisler, M., "XDR: External Data Representation Standard", 1754 STD 67, RFC 4506, May 2006. 1756 [5] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network 1757 File System (NFS) Version 4 Minor Version 1 External Data 1758 Representation Standard (XDR) Description", RFC 5662, 1759 January 2010. 1761 [6] IETF Trust, "Legal Provisions Relating to IETF Documents", 1762 November 2008, 1763 . 1765 [7] Linn, J., "Generic Security Service Application Program 1766 Interface Version 2, Update 1", RFC 2743, January 2000. 1768 [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. 1769 Zeidner, "Internet Small Computer Systems Interface (iSCSI)", 1770 RFC 3720, April 2004. 1772 [9] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI 1773 INCITS 408-2005, October 2005. 1775 [10] Krueger, M., Chadalapaka, M., and R.
Elliott, "T11 Network 1776 Address Authority (NAA) Naming Format for iSCSI Node Names", 1777 RFC 3980, February 2005. 1779 [11] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) 1780 Registration Authority", 1781 . 1783 [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. 1784 Souza, "Internet Storage Name Service (iSNS)", RFC 4171, 1785 September 2005. 1787 [13] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI 1788 INCITS 402-2005, February 2005. 1790 15.2. Informative References 1792 [14] IETF, "NFS Version 3 Protocol Specification", RFC 1813, 1793 June 1995. 1795 [15] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 1796 C., Eisler, M., and D. Noveck, "Network File System (NFS) 1797 version 4 Protocol", RFC 3530, April 2003. 1799 [16] Weber, R., "SCSI Object-Based Storage Device Commands -2 1800 (OSD-2)", January 2009, 1801 . 1803 [17] Kent, S. and K. Seo, "Security Architecture for the Internet 1804 Protocol", RFC 4301, December 2005. 1806 [18] IETF, "RPC: Remote Procedure Call Protocol Specification 1807 Version 2", RFC 5531, May 2009. 1809 [19] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002, 1810 December 2002. 1812 [20] T11 1619-D, "Fibre Channel Framing and Signaling - 2 1813 (FC-FS-2)", ANSI INCITS 424-2007, February 2007. 1815 [21] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI 1816 INCITS 417-2006, June 2006. 1818 [22] IETF, "DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION", 1819 RFC 1035, November 1987. 1821 [23] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting 1822 Codes, Part I", 1977. 1824 [24] Anvin, H., "The Mathematics of RAID-6", May 2009, 1825 . 1827 [25] The free encyclopedia, Wikipedia., "Least common multiple", 1828 April 2011, 1829 . 1831 [26] Plank, James S., and Luo, Jianqiang and Schuman, Catherine D. 1832 and Xu, Lihao and Wilcox-O'Hearn, Zooko, "A Performance 1833 Evaluation and Examination of Open-source Erasure Coding 1834 Libraries for Storage", 2007. 
1836 Appendix A. Acknowledgments 1838 Todd Pisek was a co-editor of the initial versions of this document. 1839 Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian 1840 E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and 1841 commented on this document. 1843 Author's Address 1845 Benny Halevy 1846 Tonian, Inc. 1848 Email: bhalevy@tonian.com 1849 URI: http://www.tonian.com/