NFSv4                                                          B. Halevy
Internet-Draft                                                 T. Haynes
Intended status: Informational                              Primary Data
Expires: October 19, 2014                                 April 17, 2014

              Parallel NFS (pNFS) Flexible Files Layout
                 draft-bhalevy-nfsv4-flex-files-02.txt

Abstract

   Parallel NFS (pNFS) extends Network File System version 4 (NFSv4)
   to allow clients to directly access file data on the storage used
   by the NFSv4 server.  This ability to bypass the server for data
   access can increase both performance and parallelism, but it
   requires additional client functionality for data access, some of
   which is dependent on the class of storage used, i.e., the Layout
   Type.  The main pNFS operations and data types in NFSv4 minor
   version 1 specify a layout-type-independent layer; layout-type-
   specific information is conveyed using opaque data structures whose
   internal structure is further defined by the particular layout type
   specification.  This document specifies the NFSv4.1 Flexible Files
   pNFS Layout as a companion to the main NFSv4 minor version 1
   specification for use of pNFS with Data Servers over NFSv4 or
   higher minor versions, using a flexible, per-file striping
   topology.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on October 19, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Method of Operation
     2.1.  Security Models
     2.2.  State and Locking Models
   3.  XDR Description of the Flexible Files Layout Protocol
     3.1.  Code Components Licensing Notice
   4.  Device Addressing and Discovery
     4.1.  pnfs_ff_device_addr
     4.2.  Data Server Multipathing
   5.  Flexible Files Layout
     5.1.  pnfs_ff_layout
     5.2.  Striping Topologies
       5.2.1.  PFSP_SPARSE_STRIPING
       5.2.2.  PFSP_DENSE_STRIPING
       5.2.3.  PFSP_RAID_4
       5.2.4.  PFSP_RAID_5
       5.2.5.  PFSP_RAID_PQ
       5.2.6.  RAID Usage and Implementation Notes
     5.3.  Mirroring
   6.  Recovering from Client I/O Errors
   7.  Flexible Files Layout Return
     7.1.  pflr_errno
     7.2.  pnfs_ff_ioerr
     7.3.  pnfs_ff_iostats
     7.4.  pnfs_ff_layoutreturn
   8.  Flexible Files Creation Layout Hint
     8.1.  pnfs_ff_layouthint
   9.  Recalling Layouts
     9.1.  CB_RECALL_ANY
   10. Client Fencing
   11. Security Considerations
   12. Striping Topologies Extensibility
   13. IANA Considerations
   14. Normative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Authors' Addresses
1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts
   for different storage systems and methods of arranging data on
   storage devices.  This document defines the layout used with
   file-based data servers that are accessed using the Network File
   System (NFS) Protocol: NFSv3 [RFC1813], NFSv4 [RFC3530], and
   NFSv4.1 [RFC5661].

   In contrast to the LAYOUT4_NFSV4_1_FILES layout type [RFC5661],
   which also uses NFSv4.1 to access the data server, the Flexible
   Files layout defines a model of device metadata and striping
   patterns that is inspired by the object layout [RFC5664].  This
   model provides flexible, per-file striping patterns and simple
   device information suitable for aggregating standalone NFS servers
   into a centrally managed pNFS cluster.

   To provide a global state model equivalent to that of the files
   layout, a back-end control protocol may be implemented between the
   metadata server (MDS) and NFSv4.1 data servers (DSs).  It is out of
   scope for this document to specify the wire protocol of such a
   protocol; however, the requirements for such a protocol are
   specified in [RFC5661] and clarified in [pNFSLayouts].

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

2.  Method of Operation

   This section describes the semantics and format of flexible
   file-based layouts for pNFS.  Flexible file-based layouts use the
   LAYOUT4_FLEX_FILES layout type.  The LAYOUT4_FLEX_FILES type
   defines striping of data across multiple NFS Data Servers.

   For the purpose of this discussion, we distinguish between user
   files served by the metadata server, referred to as User Files, and
   user files served by Data Servers, referred to as Component
   Objects.

   Component Objects are addressable by their NFS filehandle.  Each
   Component Object may store a whole User File or parts of it, in
   case the User File is striped across multiple Component Objects.
   The striping pattern is provided by pfl_striping_pattern as defined
   below.

   Data Servers may be accessed using different versions of the NFS
   protocol.  The server MUST use Data Servers of the same NFS version
   and minor version for striping data within each layout.  The NFS
   version and minor version define the respective security, state,
   and locking models to be used, as described below.

2.1.  Security Models

   With NFSv3 Data Servers, the Metadata Server uses synthetic uids
   and gids for the Component Objects, where the uid owner of the
   Component Objects is allowed read/write access and the gid owner is
   allowed read-only access.  As part of the layout, the client is
   provided with the RPC credentials to be used (see pfcf_auth in
   Section 5.1) to access the Component Objects.  The server fences
   off clients by using SETATTR to change the uid and/or gid owners of
   the Component Objects, implicitly revoking the outstanding RPC
   credentials.  Note: it is recommended to implement common access
   control methods at the Data Server filesystem export level to allow
   only the Metadata Server root (super user) access to the Data
   Server, and to set the owner of all directories holding Component
   Objects to the root user.  This security method, when using weak
   auth flavors such as AUTH_SYS, provides a practical model to
   enforce access control and fence off cooperative clients, but it
   cannot protect against malicious clients; hence it provides a level
   of security equivalent to NFSv3.

   With NFSv4.x Data Servers, the Metadata Server sets the user and
   group owners, mode bits, and ACL of the Component Objects to be the
   same as those of the User File, and the client must authenticate
   with the Data Server and go through the same authorization process
   it would go through via the Metadata Server.
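   As an illustration of the NFSv3 fencing scheme described above, the
   following is a minimal, hypothetical sketch in Python of a metadata
   server fencing off a Component Object by rotating its synthetic
   owners.  The function and parameter names are illustrative only,
   and the sketch assumes the MDS has root access to the Data Server
   export via a local mount:

   import os

   def fence_component(ds_mount, component_path, new_uid, new_gid):
       # Changing the synthetic uid/gid owners of the Component
       # Object implicitly revokes the RPC credentials that were
       # handed to clients in previously granted layouts.
       os.chown(os.path.join(ds_mount, component_path),
                new_uid, new_gid)

   Layouts granted after this point would carry credentials matching
   the new synthetic owners.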
2.2.  State and Locking Models

   User File OPEN, LOCK, and DELEGATION operations are always executed
   only against the Metadata Server.

   With NFSv4 Data Servers, the Metadata Server, in response to
   state-changing operations, executes them against the respective
   Component Objects on the Data Server(s).  It then sends the Data
   Server open stateid as part of the layout (see pfcf_stateid in
   Section 5.1), which is then used by the client for executing READ
   and WRITE operations against the Data Server.

   Standalone NFSv4.1 Data Servers that do not return the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way
   as NFSv4 Data Servers.

   NFSv4.1 Clustered Data Servers that do identify themselves with the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end
   control protocol as described in [RFC5661] to implement a global
   stateid model as defined there.

3.  XDR Description of the Flexible Files Layout Protocol

   This document contains the External Data Representation (XDR)
   [RFC4506] description of the NFSv4.1 flexible files layout
   protocol.  The XDR description is embedded in this document in a
   way that makes it simple for the reader to extract into a
   ready-to-compile form.  The reader can feed this document into the
   following shell script to produce the machine-readable XDR
   description of the NFSv4.1 flexible files layout protocol:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called
   "extract.sh", and this document is in a file called "spec.txt",
   then the reader can do:

   sh extract.sh < spec.txt > pnfs_flex_files_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both
   NFS types that end with a 4, such as offset4 and length4, as well
   as more generic types such as uint32_t and uint64_t.

3.1.  Code Components Licensing Notice

   Both the XDR description and the scripts used for extracting the
   XDR description are Code Components as described in Section 4 of
   "Legal Provisions Relating to IETF Documents" [LEGAL].  These Code
   Components are licensed according to the terms of that document.

   /// /*
   ///  * Copyright (c) 2012 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * o Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * o Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * This code was derived from draft-bhalevy-nfsv4-flex-files-01.
   [[RFC Editor: please insert RFC number if needed]]
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * pnfs_flex_files_prot.x
   ///  */
   ///
   /// /*
   ///  * The following include statements are for example only.
   ///  * The actual XDR definition files are generated separately
   ///  * and independently and are likely to have a different name.
   ///  */
   /// %#include <nfs4_prot.x>
   /// %#include <rpc_prot.x>
   ///

4.  Device Addressing and Discovery

   Data operations to a data server require the client to know the
   network address of the data server.  The GETDEVICEINFO NFSv4.1
   operation is used by the client to retrieve that information.

4.1.  pnfs_ff_device_addr

   The pnfs_ff_device_addr data structure is returned by the server as
   the storage-protocol-specific opaque field da_addr_body in the
   device_addr4 structure by a successful GETDEVICEINFO operation
   [RFC5661].

   /// struct pnfs_ff_device_addr {
   ///     multipath_list4     pfda_netaddrs;
   ///     uint32_t            pfda_version;
   ///     uint32_t            pfda_minorversion;
   ///     pathname4           pfda_path;
   /// };
   ///

   The pfda_netaddrs field is used to locate the data server.  It MUST
   be set by the server to a list holding one or more of the device
   network addresses.

   The pfda_version and pfda_minorversion fields represent the NFS
   protocol to be used to access the data server.  This layout
   specification defines the semantics for pfda_version values 3 and
   4.  If pfda_version equals 3, then the server MUST set
   pfda_minorversion to 0 and the client MUST access the data server
   using the NFSv3 protocol [RFC1813].  If pfda_version equals 4, then
   the server MUST set pfda_minorversion to either 0 or 1 and the
   client MUST access the data server using NFSv4 [RFC3530] or NFSv4.1
   [RFC5661], respectively.
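   To make the version rules above concrete, here is a minimal,
   hypothetical Python sketch of the check a client might apply before
   mounting a data server; the function name and error handling are
   illustrative, not part of the protocol:

   def ds_protocol(pfda_version, pfda_minorversion):
       # Map the device's advertised version pair to the NFS
       # protocol the client must use, rejecting combinations
       # this layout specification does not define.
       if pfda_version == 3 and pfda_minorversion == 0:
           return "NFSv3"
       if pfda_version == 4 and pfda_minorversion in (0, 1):
           return "NFSv4" if pfda_minorversion == 0 else "NFSv4.1"
       raise ValueError("undefined pfda_version/pfda_minorversion")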
If 323 pfda_version equals 4 then the server MUST set pfda_minorversion to 324 either 0 or 1 and the client MUST access the data server using NFSv4 325 [RFC3530] or NFSv4.1 [RFC5661], respectively. 327 The pfda_path MAY be set by the server to an exported path on the 328 data server for device identification. If provided, the path MUST 329 exist and be accessible to the client. If the path does not exist, 330 the client MUST ignore this device information and any layouts 331 referring to the respective deviceid until valid device information 332 is acquired. 334 4.2. Data Server Multipathing 336 The flexible file layout supports multipathing to multiple data 337 server addresses. Data-server-level multipathing is used for 338 bandwidth scaling via trunking and for higher availability of use in 339 the case of a data-server failure. Multipathing allows the client to 340 switch to another data server address which may be that of another 341 data server that is exporting the same data stripe unit, without 342 having to contact the metadata server for a new layout. 344 To support data server multipathing, pfda_netaddrs contains an array 345 of one more data server network addresses. This array (data type 346 multipath_list4) represents a list of data servers (each identified 347 by a network address), with the possibility that some data servers 348 will appear in the list multiple times. 350 The client is free to use any of the network addresses as a 351 destination to send data server requests. If some network addresses 352 are less optimal paths to the data than others, then the MDS SHOULD 353 NOT include those network addresses in pfda_netaddrs. If less 354 optimal network addresses exist to provide failover, the RECOMMENDED 355 method to offer the addresses is to provide them in a replacement 356 device-ID-to-device-address mapping, or a replacement device ID. 357 When a client finds no response from the data server using all 358 addresses available in pfda_netaddrs, it SHOULD send a GETDEVICEINFO 359 to attempt to replace the existing device-ID-to-device-address 360 mappings. If the MDS detects that all network paths represented by 361 pfda_netaddrs are unavailable, the MDS SHOULD send a 362 CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID 363 notifications for changed device IDs) to change the device-ID-to- 364 device-address mappings to the available addresses. If the device ID 365 itself will be replaced, the MDS SHOULD recall all layouts with the 366 device ID, and thus force the client to get new layouts and device ID 367 mappings via LAYOUTGET and GETDEVICEINFO. 369 Generally, if two network addresses appear in pfda_netaddrs, they 370 will designate the same data server. When the data server is 371 accessed over NFSv4.1 or higher minor version the two data server 372 addresses will support the implementation of client ID or session 373 trunking (the latter is RECOMMENDED) as defined in [RFC5661]. The 374 two data server addresses will share the same server owner or major 375 ID of the server owner. It is not always necessary for the two data 376 server addresses to designate the same server with trunking being 377 used. For example, the data could be read-only, and the data consist 378 of exact replicas. 380 5. 
5.  Flexible Files Layout

   The layout4 type is defined in [RFC5662] as follows:

   /// enum layouttype4 {
   ///     LAYOUT4_NFSV4_1_FILES   = 1,
   ///     LAYOUT4_OSD2_OBJECTS    = 2,
   ///     LAYOUT4_BLOCK_VOLUME    = 3,
   ///     LAYOUT4_FLEX_FILES      = 4
   [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
   to be the layouttype assigned by IANA]]
   /// };
   ///
   /// struct layout_content4 {
   ///     layouttype4             loc_type;
   ///     opaque                  loc_body<>;
   /// };
   ///
   /// struct layout4 {
   ///     offset4                 lo_offset;
   ///     length4                 lo_length;
   ///     layoutiomode4           lo_iomode;
   ///     layout_content4         lo_content;
   /// };

   This document defines the structure associated with the layouttype4
   value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body
   structure as an XDR type "opaque".  The opaque layout is
   uninterpreted by the generic pNFS client layers, but it obviously
   must be interpreted by the flexible files layout driver.  This
   section defines the structure of this opaque value,
   pnfs_ff_layout.

5.1.  pnfs_ff_layout

   /// enum pnfs_ff_striping_pattern {
   ///     PFSP_SPARSE_STRIPING    = 1,
   ///     PFSP_DENSE_STRIPING     = 2,
   ///     PFSP_RAID_4             = 4,
   ///     PFSP_RAID_5             = 5,
   ///     PFSP_RAID_PQ            = 6
   /// };
   ///
   /// enum pnfs_ff_comp_type {
   ///     PNFS_FF_COMP_MISSING    = 0,
   ///     PNFS_FF_COMP_PACKED     = 1,
   ///     PNFS_FF_COMP_FULL       = 2
   /// };
   ///
   /// struct pnfs_ff_comp_full {
   ///     deviceid4               pfcf_deviceid;
   ///     nfs_fh4                 pfcf_fhandle;
   ///     stateid4                pfcf_stateid;
   ///     opaque_auth             pfcf_auth;
   ///     uint32_t                pfcf_metric;
   /// };
   ///
   /// union pnfs_ff_comp switch (pnfs_ff_comp_type pfc_type) {
   /// case PNFS_FF_COMP_MISSING:
   ///     void;
   ///
   /// case PNFS_FF_COMP_PACKED:
   ///     deviceid4               pfcp_deviceid;
   ///
   /// case PNFS_FF_COMP_FULL:
   ///     pnfs_ff_comp_full       pfcp_full;
   /// };
   ///
   /// struct pnfs_ff_layout {
   ///     pnfs_ff_striping_pattern pfl_striping_pattern;
   ///     uint32_t                pfl_num_comps;
   ///     uint32_t                pfl_mirror_cnt;
   ///     length4                 pfl_stripe_unit;
   ///     nfs_fh4                 pfl_global_fh;
   ///     uint32_t                pfl_comps_index;
   ///     pnfs_ff_comp            pfl_comps<>;
   /// };
   ///

   The pnfs_ff_layout structure specifies a layout over a set of
   Component Objects.  The layout parameterizes the algorithm that
   maps the file's contents within the returned byte range, as
   represented by lo_offset and lo_length, over the Component Objects.

   It is possible that the file is concatenated from more than one
   layout segment.  Each layout segment MAY specify different striping
   parameters, which apply only to that layout segment's byte range.

   This section provides a brief introduction to the layout
   parameters.  See Section 5.2 for a more detailed description of the
   different striping schemes and the respective interpretation of the
   layout parameters for each striping scheme.

   In addition to mapping data using simple striping schemes, where
   loss of a single component object results in data loss, the layout
   parameters support mirroring and more advanced redundancy schemes
   that protect against loss of component objects.
   pfl_striping_pattern represents the algorithm to be used for
   mapping byte offsets in the file address space to corresponding
   component objects in the returned layout and byte offsets in the
   component's address space.
   pfl_striping_pattern also represents methods for storing and
   retrieving redundant data that can be used to recover from failure
   or loss of component objects.

   pfl_num_comps is the total number of component objects the file is
   striped over within the returned byte range, not counting mirrored
   components (see pfl_mirror_cnt below).  Note that the server MAY
   grow the file by adding more components to the stripe while clients
   hold valid layouts, until the file has reached its final stripe
   width.

   pfl_mirror_cnt represents the number of mirrors each component in
   the stripe has.  If there is no mirroring, then pfl_mirror_cnt MUST
   be 0.  Otherwise, the number of entries listed in pfl_comps MUST be
   a multiple of (pfl_mirror_cnt + 1).

   pfl_stripe_unit is the number of bytes placed on one component
   before advancing to the next one in the list of components.  When
   the file is striped over a single component object (pfl_num_comps
   equals 1), the stripe unit has no use and the server SHOULD set it
   to the server default value or to zero; otherwise, pfl_stripe_unit
   MUST NOT be set to zero.

   The pfl_comps field represents an array of component objects.  The
   data placement algorithm that maps file data onto component objects
   assumes that each component object occurs exactly once in the array
   of components.  Therefore, component objects MUST appear in the
   pfl_comps array only once.  The components array may represent all
   objects comprising the file, in which case pfl_comps_index is set
   to zero and the number of entries in the pfl_comps array is equal
   to pfl_num_comps * (pfl_mirror_cnt + 1).  The server MAY return
   fewer components than pfl_num_comps, provided that the returned
   byte range represented by lo_offset and lo_length maps in whole
   into the set of returned component objects.  In this case,
   pfl_comps_index represents the logical position of the returned
   components array, pfl_comps, within the full array of components
   that comprise the file.  pfl_comps_index MUST be a multiple of
   (pfl_mirror_cnt + 1).

   Each component object in the pfl_comps array is described by the
   pnfs_ff_comp type.

   When a component object is unavailable, pfc_type is set to
   PNFS_FF_COMP_MISSING and no other information for this component is
   returned.  When a data redundancy scheme is being used, as
   represented by pfl_striping_pattern, the client MAY use a
   respective data recovery algorithm to reconstruct data that is
   logically stored on the missing component using user data and
   redundant data stored on the available components in the containing
   stripe.

   The server MUST set the same pfc_type for all available components,
   either PNFS_FF_COMP_PACKED or PNFS_FF_COMP_FULL.

   When NFSv4.1 Clustered Data Servers are used, the metadata server
   implements the global state model where all data servers share the
   same stateid and filehandle for the file.  In such a case, the
   client MUST use the open, delegation, or lock stateid returned by
   the metadata server for the file for accessing the Data Servers for
   READ and WRITE; the global filehandle to be used by the client is
   provided by pfl_global_fh.  If the metadata server filehandle for
   the file is being used by all data servers, then pfl_global_fh MAY
   be set to an empty filehandle.

   pfcp_deviceid or pfcf_deviceid provides the deviceid of the data
   server holding the Component Object.
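   The component-array geometry described above can be checked
   mechanically.  Below is a minimal, hypothetical Python sketch of
   the consistency checks a client layout driver might apply to a
   decoded pnfs_ff_layout; the attribute names mirror the XDR fields,
   but the function itself is illustrative only:

   def check_layout_geometry(l):
       group = l.pfl_mirror_cnt + 1       # components per mirror group
       full = l.pfl_num_comps * group     # size of the full array
       if len(l.pfl_comps) % group != 0:
           raise ValueError("pfl_comps not a multiple of mirror group")
       if l.pfl_comps_index % group != 0:
           raise ValueError("pfl_comps_index not group-aligned")
       if l.pfl_comps_index + len(l.pfl_comps) > full:
           raise ValueError("components exceed the file's stripe width")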
   When standalone data servers are used, either over NFSv4 or
   NFSv4.1, pfl_global_fh SHOULD be set to an empty filehandle and
   MUST be ignored by the client.  In this case, pfcf_fhandle provides
   the filehandle of the Data Server file holding the Component
   Object, and pfcf_stateid provides the stateid to be used by the
   client to access the file.

   For NFSv3 Data Servers, pfcf_auth provides the RPC credentials to
   be used by the client to access the Component Objects.  For NFSv4.x
   Data Servers, the server SHOULD use the AUTH_NONE flavor and a
   zero-length opaque body to minimize the returned structure length.
   The client MUST ignore pfcf_auth in this case.

   When pfl_mirror_cnt is not zero, pfcf_metric indicates the distance
   of the respective component object from the client; otherwise, the
   server MUST set pfcf_metric to zero.  When reading data, the client
   is advised to read from components with the lowest pfcf_metric.
   When there are several components with the same pfcf_metric, client
   implementations may implement a load distribution algorithm to
   evenly distribute the read load across several devices and thereby
   provide greater bandwidth.

5.2.  Striping Topologies

   This section describes the different data mapping schemes in
   detail.

   pnfs_ff_striping_pattern determines the algorithm and placement of
   redundant data.  This section defines the different redundancy
   algorithms.  Note: the term "RAID" (Redundant Array of Independent
   Disks) is used in this document to represent an array of Component
   Objects that store data for an individual User File.  The objects
   are stored on independent Data Servers.  User File data is encoded
   and striped across the array of Component Objects using algorithms
   developed for block-based RAID systems.

5.2.1.  PFSP_SPARSE_STRIPING

   The mapping from the logical offset within a file (L) to the
   Component Object C and object-specific offset O is direct and
   straightforward, as defined by the following equations:

   L: logical offset into the file

   W: stripe width
      W = pfl_num_comps

   S: number of bytes in a stripe
      S = W * pfl_stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / pfl_stripe_unit

   O: the component offset corresponding to L
      O = L

   Note that this computation does not accommodate the same object
   appearing in the pfl_comps array multiple times.  Therefore, the
   server may not return layouts with the same object appearing
   multiple times.  If needed, the server can return multiple layout
   segments, each covering a single instance of the object.

   PFSP_SPARSE_STRIPING means there is no parity data, so all bytes in
   the component objects are data bytes located by the above equations
   for C and O.  If a component object is marked as
   PNFS_FF_COMP_MISSING, the pNFS client MUST either return an I/O
   error when an attempt is made to read that component or,
   alternatively, retry the READ against the pNFS server.
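   As an illustration, a minimal Python sketch of the sparse mapping
   follows; the function name and arguments are hypothetical, with
   stripe_unit and num_comps standing in for pfl_stripe_unit and
   pfl_num_comps:

   def map_sparse(L, stripe_unit, num_comps):
       W = num_comps                      # stripe width
       S = W * stripe_unit                # bytes in a full stripe
       C = (L % S) // stripe_unit         # component index
       O = L                              # sparse: offset is preserved
       return C, O

   For example, with stripe_unit = 4096 and num_comps = 3, file offset
   8192 maps to component 2 at component offset 8192.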
5.2.2.  PFSP_DENSE_STRIPING

   The mapping from the logical offset within a file (L) to the
   component object C and object-specific offset O is defined by the
   following equations:

   L: logical offset into the file

   W: stripe width
      W = pfl_num_comps

   S: number of bytes in a stripe
      S = W * pfl_stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / pfl_stripe_unit

   O: the component offset corresponding to L
      O = (N * pfl_stripe_unit) + (L % pfl_stripe_unit)

   Note that this computation does not accommodate the same object
   appearing in the pfl_comps array multiple times.  Therefore, the
   server may not return layouts with the same object appearing
   multiple times.  If needed, the server can return multiple layout
   segments, each covering a single instance of the object.

   PFSP_DENSE_STRIPING means there is no parity data, so all bytes in
   the component objects are data bytes located by the above equations
   for C and O.  If a component object is marked as
   PNFS_FF_COMP_MISSING, the pNFS client MUST either return an I/O
   error when an attempt is made to read that component or,
   alternatively, retry the READ against the pNFS server.

   Note that the layout depends on the file size, which the client
   learns from the generic return parameters of LAYOUTGET, by doing
   GETATTR commands to the Metadata Server.  The client uses the file
   size to decide if it should fill holes with zeros or return a short
   read.  Striping patterns can cause cases where Component Objects
   are shorter than other components because a hole happens to
   correspond to the last part of the Component Object.
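   The dense mapping packs stripe units back to back within each
   component.  A minimal Python sketch follows, again with
   hypothetical names (stripe_unit and num_comps correspond to
   pfl_stripe_unit and pfl_num_comps):

   def map_dense(L, stripe_unit, num_comps):
       W = num_comps                      # stripe width
       S = W * stripe_unit                # bytes in a full stripe
       N = L // S                         # stripe number
       C = (L % S) // stripe_unit         # component index
       O = N * stripe_unit + (L % stripe_unit)   # packed offset
       return C, O

   With stripe_unit = 4096 and num_comps = 3, file offset 8192 again
   maps to component 2, but now to component offset 0, since each
   component stores only its own stripe units.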
5.2.3.  PFSP_RAID_4

   PFSP_RAID_4 means that the last component object in the stripe
   contains parity information computed over the rest of the stripe
   with an XOR operation.  If a Component Object is unavailable, the
   client can read the rest of the stripe units in the damaged stripe
   and recompute the missing stripe unit by XORing the other stripe
   units in the stripe.  Or the client can replay the READ against the
   pNFS server, which will presumably perform the reconstructed read
   on the client's behalf.

   When parity is present in the file, the number of parity devices is
   taken into account in the above equations when calculating (D), the
   number of data devices in a stripe, as follows:

   P: number of parity devices in each stripe
      P = 1

   D: number of data devices in a stripe
      D = W - P

   I: parity device index
      I = D

5.2.4.  PFSP_RAID_5

   PFSP_RAID_5 means that the position of the parity data is rotated
   on each stripe.  In the first stripe, the last component holds the
   parity.  In the second stripe, the next-to-last component holds the
   parity, and so on.  In this scheme, all stripe units are rotated so
   that I/O is evenly spread across objects as the file is read
   sequentially.  The rotated parity layout is illustrated here, with
   hexadecimal numbers indicating the stripe unit.

      0 1 2 P
      4 5 P 3
      8 P 6 7
      P 9 a b

   Note that the math for RAID-5 is similar to that for RAID-4, except
   that the device indices for each stripe are rotated backwards.  So
   start with the equations above for RAID-4, then compute the
   rotation as described below.

   P: number of parity devices in each stripe
      P = 1

   PC: parity cycle
      PC = W

   R: the parity rotation index
      (N is as computed in the above equations for RAID-4)
      R = N % PC

   I: parity device index
      I = (W + W - (R + 1) * P) % W

   Cr: the rotated device index
      (C is as computed in the above equations for RAID-4)
      Cr = (W + C - (R * P)) % W

   Note: W is added above to avoid negative numbers in modulo math.

5.2.5.  PFSP_RAID_PQ

   PFSP_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
   P+Q encoding scheme [ErrorCorrectingCodes].  In this layout, the
   last two component objects hold the P and Q data, respectively.  P
   is parity computed with XOR.  The Q computation is described in
   detail in [MathOfRAID-6].  The same polynomial "x^8+x^4+x^3+x^2+1"
   and Galois field size of 2^8 are used here.  Clients may simply
   choose to read data through the metadata server if two or more
   components are missing or damaged.

   The equations given above for embedded parity can be used to map a
   file offset to the correct component object by setting the number
   of parity components (P) to 2 instead of 1 as for RAID-5 and
   computing the Parity Cycle length as the Lowest Common Multiple of
   pfl_num_comps and P, divided by P, as described below.  Note: this
   algorithm can also be used for RAID-5, where P = 1.

   P: number of parity devices
      P = 2

   PC: parity cycle
      PC = LCM(W, P) / P

   Qdev: the device index holding the Q component
      (I is as computed in the above equations for RAID-5)
      Qdev = (I + 1) % W
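   To tie the RAID-4/5/PQ placement equations together, here is a
   minimal Python sketch computing the parity positions and the
   rotated device index.  It is an illustration of the equations
   above, not normative; math.lcm requires Python 3.9 or later:

   import math

   def raid_positions(W, N, C, pattern):
       # W: stripe width, N: stripe number, C: unrotated component
       # index.  Returns (rotated data index, P index, Q index|None).
       P = 2 if pattern == "PFSP_RAID_PQ" else 1
       PC = math.lcm(W, P) // P            # parity cycle length
       R = 0 if pattern == "PFSP_RAID_4" else N % PC
       I = (W + W - (R + 1) * P) % W       # parity (P) device index
       Cr = (W + C - R * P) % W            # rotated device index for C
       Qdev = (I + 1) % W if P == 2 else None
       return Cr, I, Qdev

   For PFSP_RAID_4 the rotation index R stays 0, so the parity always
   lands on the last device (I = W - 1 = D), matching Section 5.2.3.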
5.2.6.  RAID Usage and Implementation Notes

   RAID layouts with redundant data in their stripes require
   additional serialization of updates to ensure correct operation.
   Otherwise, if two clients simultaneously write to the same logical
   range of an object, the result could include different data in the
   same ranges of mirrored tuples, or corrupt parity information.  It
   is the responsibility of the metadata server to enforce
   serialization requirements such as this.  For example, the metadata
   server may do so by not granting overlapping write layouts within
   mirrored objects.

   Many alternative encoding schemes exist for P >= 2
   [ErasureCodingLibraries].  These involve P or Q equations different
   from those used in PFSP_RAID_PQ.  Thus, if one of these schemes is
   to be used in the future, a distinct value must be added to
   pnfs_ff_striping_pattern for it.  While Reed-Solomon codes are well
   understood, recently discovered schemes such as Liberation codes
   are more computationally efficient for small group widths, and
   Cauchy Reed-Solomon codes are more computationally efficient for
   higher values of P.

5.3.  Mirroring

   The pfl_mirror_cnt is used to replicate a file by replicating its
   Component Objects.  If there is no mirroring, then pfl_mirror_cnt
   MUST be 0.  If pfl_mirror_cnt is greater than zero, then the size
   of the pfl_comps array MUST be a multiple of (pfl_mirror_cnt + 1).
   Thus, for a classic mirror on two objects, pfl_mirror_cnt is one.
   Note that mirroring can be defined over any striping pattern.

   Replicas are adjacent in the pfl_comps array, and the value C
   produced by the above equations is not a direct index into the
   pfl_comps array.  Instead, the following equations determine the
   replica component index RCi, where i ranges from 0 to
   pfl_mirror_cnt.

   FW = size of pfl_comps array / (pfl_mirror_cnt + 1)

   C = component index for striping or two-level striping
       as calculated using the above equations

   i ranges from 0 to pfl_mirror_cnt, inclusive
   RCi = C * (pfl_mirror_cnt + 1) + i
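   A minimal Python sketch of the replica index computation follows
   (hypothetical names; the returned values index into pfl_comps):

   def replica_indices(C, mirror_cnt):
       # All replicas of logical component C are adjacent in
       # pfl_comps; there are mirror_cnt + 1 of them.
       group = mirror_cnt + 1
       return [C * group + i for i in range(group)]

   For a two-way mirror (pfl_mirror_cnt of 1), logical component 2
   maps to entries 4 and 5 of the pfl_comps array.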
6.  Recovering from Client I/O Errors

   The pNFS client may encounter errors when directly accessing the
   Data Servers.  However, it is the responsibility of the Metadata
   Server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
   layout type is used, the client MUST report the I/O errors to the
   server at LAYOUTRETURN time using the pnfs_ff_ioerr structure (see
   Section 7.2).

   The metadata server analyzes the error and determines the required
   recovery operations, such as repairing any parity inconsistencies,
   recovering media failures, or reconstructing missing objects.

   The metadata server SHOULD recall any outstanding layouts to allow
   it exclusive write access to the stripes being recovered and to
   prevent other clients from hitting the same error condition.  In
   these cases, the server MUST complete recovery before handing out
   any new layouts to the affected byte ranges.

   Although it MAY be acceptable for the client to propagate a
   corresponding error to the application that initiated the I/O
   operation and drop any unwritten data, the client SHOULD attempt to
   retry the original I/O operation by requesting a new layout using
   LAYOUTGET and retry the I/O operation(s) using the new layout, or
   the client MAY just retry the I/O operation(s) using regular NFS
   READ or WRITE operations via the metadata server.  The client
   SHOULD attempt to retrieve a new layout and retry the I/O operation
   using the Data Server first, and only if the error persists, retry
   the I/O operation via the metadata server.

7.  Flexible Files Layout Return

   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
   layout-type specific information to the server.  It is defined in
   [RFC5661] as follows:

   struct layoutreturn_file4 {
           offset4         lrf_offset;
           length4         lrf_length;
           stateid4        lrf_stateid;
           /* layouttype4 specific data */
           opaque          lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
           case LAYOUTRETURN4_FILE:
                   layoutreturn_file4      lr_layout;
           default:
                   void;
   };

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool                    lora_reclaim;
           layoutreturn_stateid    lora_recallstateid;
           layouttype4             lora_layout_type;
           layoutiomode4           lora_iomode;
           layoutreturn4           lora_layoutreturn;
   };

   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES, then the
   lrf_body opaque value is defined by the pnfs_ff_layoutreturn type.

   The pnfs_ff_layoutreturn type allows the client to report I/O error
   information or layout usage statistics back to the metadata server
   as defined below.

7.1.  pflr_errno

   /// enum pflr_errno {
   ///     PNFS_FF_ERR_EIO          = 1,
   ///     PNFS_FF_ERR_NOT_FOUND    = 2,
   ///     PNFS_FF_ERR_NO_SPACE     = 3,
   ///     PNFS_FF_ERR_BAD_STATEID  = 4,
   ///     PNFS_FF_ERR_NO_ACCESS    = 5,
   ///     PNFS_FF_ERR_UNREACHABLE  = 6,
   ///     PNFS_FF_ERR_RESOURCE     = 7
   /// };
   ///

   pflr_errno is used to represent error types when read/write errors
   are reported to the metadata server.  The error codes serve as
   hints to the metadata server that may help it in diagnosing the
   exact reason for the error and in repairing it.

   PNFS_FF_ERR_EIO indicates the operation failed because the Data
      Server experienced a failure trying to access the object.  The
      most common source of these errors is media errors, but other
      internal errors might cause this as well.  In this case, the
      metadata server should examine the broken object more closely;
      hence, it should be used as the default error code.

   PNFS_FF_ERR_NOT_FOUND indicates the object ID specifies a Component
      Object that does not exist on the Data Server.

   PNFS_FF_ERR_NO_SPACE indicates the operation failed because the
      Data Server ran out of free capacity during the operation.

   PNFS_FF_ERR_BAD_STATEID indicates the stateid is not valid.

   PNFS_FF_ERR_NO_ACCESS indicates the RPC credentials do not allow
      the requested operation.  This may happen when the client is
      fenced off.  The client will need to return the layout and get a
      new one with fresh credentials.

   PNFS_FF_ERR_UNREACHABLE indicates the client did not complete the
      I/O operation at the Data Server due to a communication failure.
      Whether or not the I/O operation was executed by the Data Server
      is undetermined.

   PNFS_FF_ERR_RESOURCE indicates the client did not issue the I/O
      operation due to a local problem on the initiator (i.e., client)
      side, e.g., when running out of memory.  The client MUST
      guarantee that the Data Server WRITE operation was never sent.

7.2.  pnfs_ff_ioerr

   /// struct pnfs_ff_ioerr {
   ///     deviceid4       ioe_deviceid;
   ///     nfs_fh4         ioe_fhandle;
   ///     offset4         ioe_comp_offset;
   ///     length4         ioe_comp_length;
   ///     bool            ioe_iswrite;
   ///     pflr_errno      ioe_errno;
   /// };
   ///

   The pnfs_ff_ioerr structure is used to return error indications for
   Component Objects that generated errors during data transfers.
   These are hints to the metadata server that there are problems with
   that object.  For each error, "ioe_deviceid", "ioe_fhandle",
   "ioe_comp_offset", and "ioe_comp_length" represent the Component
   Object and the byte range within the object in which the error
   occurred; "ioe_iswrite" is set to "true" if the failed Data Server
   operation was data modifying, and "ioe_errno" represents the type
   of error.

   Component byte ranges in the optional pnfs_ff_ioerr structure are
   used for recovering the object and MUST be set by the client to
   cover all failed I/O operations to the component.
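   As a sketch of the coverage requirement above, a client might
   coalesce the failed ranges per component before encoding the
   report; the following Python fragment is illustrative only:

   def coalesce_ranges(ranges):
       # Merge overlapping or adjacent (offset, length) pairs so the
       # reported ioe_comp_offset/ioe_comp_length entries cover every
       # failed I/O to the component.
       merged = []
       for off, length in sorted(ranges):
           if merged and off <= merged[-1][0] + merged[-1][1]:
               end = max(merged[-1][0] + merged[-1][1], off + length)
               merged[-1] = (merged[-1][0], end - merged[-1][0])
           else:
               merged.append((off, length))
       return merged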
7.3.  pnfs_ff_iostats

   /// struct pnfs_ff_iostats {
   ///     offset4         ios_offset;
   ///     length4         ios_length;
   ///     uint32_t        ios_duration;
   ///     uint32_t        ios_rd_count;
   ///     uint64_t        ios_rd_bytes;
   ///     uint32_t        ios_wr_count;
   ///     uint64_t        ios_wr_bytes;
   /// };
   ///

   With pNFS, the data transfers are performed directly between the
   pNFS client and the data servers.  Therefore, the metadata server
   has no visibility into the I/O stream and cannot use any
   statistical information about client I/O to optimize data storage
   location.  pnfs_ff_iostats MAY be used by the client to report I/O
   statistics back to the metadata server upon returning the layout.
   Since it is infeasible for the client to report every I/O that used
   the layout, the client MAY identify "hot" byte ranges for which to
   report I/O statistics.  The definition and/or configuration
   mechanism of what is considered "hot" and the size of the reported
   byte range are out of the scope of this document.  It is suggested
   for client implementations to provide reasonable default values and
   an optional run-time management interface to control these
   parameters.  For example, a client can define the default byte
   range resolution to be 1 MB in size and the thresholds for
   reporting to be 1 MB/second or 10 I/O operations per second.

   For each byte range, ios_offset and ios_length represent the
   starting offset of the range and the range length in bytes.
   ios_duration represents the number of seconds the reported burst of
   I/O lasted.  ios_rd_count, ios_rd_bytes, ios_wr_count, and
   ios_wr_bytes represent, respectively, the number of contiguous read
   and write I/Os and the respective aggregate number of bytes
   transferred within the reported byte range.

7.4.  pnfs_ff_layoutreturn

   /// struct pnfs_ff_layoutreturn {
   ///     pnfs_ff_ioerr   pflr_ioerr_report<>;
   ///     pnfs_ff_iostats pflr_iostats_report<>;
   /// };
   ///

   When object I/O operations fail, "pflr_ioerr_report<>" is used to
   report these errors to the metadata server as an array of elements
   of type pnfs_ff_ioerr.  Each element in the array represents an
   error that occurred on the Component Object identified by
   ioe_deviceid and ioe_fhandle.  If no errors are to be reported, the
   size of the pflr_ioerr_report<> array is set to zero.  The client
   MAY also use "pflr_iostats_report<>" to report a list of I/O
   statistics as an array of elements of type pnfs_ff_iostats.  Each
   element in the array represents statistics for a particular byte
   range.  Byte ranges are not guaranteed to be disjoint and MAY
   repeat or intersect.

8.  Flexible Files Creation Layout Hint

   The layouthint4 type is defined in [RFC5661] as follows:

   struct layouthint4 {
           layouttype4             loh_type;
           opaque                  loh_body<>;
   };

   The layouthint4 structure is used by the client to pass a hint
   about the type of layout it would like created for a particular
   file.  If the loh_type layout type is LAYOUT4_FLEX_FILES, then the
   loh_body opaque value is defined by the pnfs_ff_layouthint type.

8.1.  pnfs_ff_layouthint

   /// union pnfs_ff_max_comps_hint switch (bool pfmx_valid) {
   /// case TRUE:
   ///     uint32_t        pfmx_max_comps;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_stripe_unit_hint switch (bool pfsu_valid) {
   /// case TRUE:
   ///     length4         pfsu_stripe_unit;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_mirror_cnt_hint switch (bool pfmc_valid) {
   /// case TRUE:
   ///     uint32_t        pfmc_mirror_cnt;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_striping_pattern_hint switch (bool pfsp_valid) {
   /// case TRUE:
   ///     pnfs_ff_striping_pattern pfsp_striping_pattern;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// struct pnfs_ff_layouthint {
   ///     pnfs_ff_max_comps_hint        pflh_max_comps_hint;
   ///     pnfs_ff_stripe_unit_hint      pflh_stripe_unit_hint;
   ///     pnfs_ff_mirror_cnt_hint       pflh_mirror_cnt_hint;
   ///     pnfs_ff_striping_pattern_hint pflh_striping_pattern_hint;
   /// };
   ///

   This type conveys hints for the desired data map.  All parameters
   are optional so that the client can give values for only the
   parameters it cares about; e.g., it can provide a hint for the
   desired number of mirrored components, regardless of the striping
   pattern selected for the file.  The server should make an attempt
   to honor the hints, but it can ignore any or all of them at its own
   discretion and without failing the respective CREATE operation.
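   For illustration, a client wishing only to request a mirror count
   might fill in the hint as in the following hypothetical Python
   sketch, leaving the other arms invalid.  The dictionary models the
   pre-encoding form of the structure; the XDR encoder itself is
   assumed:

   def mirror_hint(mirror_cnt):
       # Request only a mirror count, leaving every other arm of
       # pnfs_ff_layouthint marked invalid so the server picks its
       # own defaults for those parameters.
       return {
           "pflh_max_comps_hint":        {"pfmx_valid": False},
           "pflh_stripe_unit_hint":      {"pfsu_valid": False},
           "pflh_mirror_cnt_hint":       {"pfmc_valid": True,
                                          "pfmc_mirror_cnt": mirror_cnt},
           "pflh_striping_pattern_hint": {"pfsp_valid": False},
       }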
9.  Recalling Layouts

   The Flexible Files metadata server should recall outstanding
   layouts in the following cases:

   o  When the file's security policy changes, i.e., Access Control
      Lists (ACLs) or permission mode bits are set.

   o  When the file's layout changes, rendering outstanding layouts
      invalid.

   o  When there are sharing conflicts.  For example, the server will
      issue stripe-aligned layout segments for RAID-5 objects.  To
      prevent corruption of the file's parity, multiple clients must
      not hold valid write layouts for the same stripes.  An
      outstanding READ/WRITE (RW) layout should be recalled when a
      conflicting LAYOUTGET is received from a different client for
      LAYOUTIOMODE4_RW and for a byte range overlapping with the
      outstanding layout segment.

9.1.  CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts.  [RFC5661]
   defines the following types:

   const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = -2;
   const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = -1;
   [[RFC Editor: please insert assigned constants]]

   struct CB_RECALL_ANY4args {
           uint32_t        craa_objects_to_keep;
           bitmap4         craa_type_mask;
   };

   Typically, CB_RECALL_ANY will be used to recall client state when
   the server needs to reclaim resources.  The craa_type_mask bitmap
   specifies the type of resources that are recalled, and the
   craa_objects_to_keep value specifies how many of the recalled
   objects the client is allowed to keep.  The Flexible Files layout
   type mask flags are defined as follows.  They represent the iomode
   of the recalled layouts.  In response, the client SHOULD return
   layouts of the recalled iomode that it needs the least, keeping at
   most craa_objects_to_keep Flexible Files layouts.

   /// enum pnfs_ff_cb_recall_any_mask {
   ///     PNFS_FF_RCA4_TYPE_MASK_READ = -2,
   ///     PNFS_FF_RCA4_TYPE_MASK_RW   = -1
   [[RFC Editor: please insert assigned constants]]
   /// };
   ///

   The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
   PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_RW.  When both mask flags are set,
   the client is notified to return layouts of either iomode.
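   A minimal, hypothetical Python sketch of the client-side selection
   follows; layouts is assumed to be a list of objects with an iomode
   attribute and a last_used timestamp, and the mask flags are modeled
   as booleans:

   def layouts_to_return(layouts, recall_read, recall_rw, keep):
       # Return the least recently used layouts of the recalled
       # iomode(s), keeping at most craa_objects_to_keep of them.
       recalled = [l for l in layouts
                   if (recall_read and l.iomode == "READ")
                   or (recall_rw and l.iomode == "RW")]
       recalled.sort(key=lambda l: l.last_used)
       surplus = max(0, len(recalled) - keep)
       return recalled[:surplus]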
10.  Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired, or when clients fail to return recalled layouts within a
   lease period, the server MAY revoke client layouts and/or device
   address mappings and reassign these resources to other clients (see
   "Recalling a Layout" in [RFC5661]).  To avoid data corruption, the
   metadata server MUST fence off the revoked clients from the
   respective objects as described in Section 2.1.

11.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features
   apply to the control path.  The combination of components in a pNFS
   system is required to preserve the security properties of NFSv4
   with respect to an entity accessing data via a client, including
   security countermeasures to defend against threats that NFSv4
   provides defenses for in environments where these threats are
   considered significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ
   or RW), and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified objects corresponding to
   the requested iomode.  When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations.  If access is
   allowed, the client uses the corresponding (READ or RW) credentials
   to perform the I/O operations at the Data Servers.  When the
   metadata server receives a request to change a file's permissions
   or ACL, it SHOULD recall all layouts for that file, and it MUST
   fence off the clients holding outstanding layouts for the
   respective file by implicitly invalidating the outstanding
   credentials on all Component Objects comprising the file before
   committing to the new permissions and ACL.  Doing this will ensure
   that clients re-authorize their layouts according to the modified
   permissions and ACL by requesting new layouts.  Recalling the
   layouts in this case is a courtesy of the server, intended to
   prevent clients from getting an error on I/Os done after the client
   was fenced off.

12.  Striping Topologies Extensibility

   New striping topologies that are not specified in this document may
   be added to the pnfs_ff_striping_pattern enumerated type.  These
   must be documented in the IETF by submitting an RFC augmenting this
   protocol, provided that:

   o  New striping topologies MUST be wire-protocol compatible with
      the Flexible Files Layout protocol as specified in this
      document.

   o  Some members of the data structures specified here may be
      declared as optional or mandatory-not-to-be-used.

   o  Upon acceptance by the IETF as an RFC, new striping topology
      constants MUST be registered as described in Section 13.

13.  IANA Considerations

   As described in [RFC5661], new layout type numbers have been
   assigned by IANA.  This document defines the protocol associated
   with the existing layout type number, LAYOUT4_FLEX_FILES.

   A new IANA registry should be assigned to register new data map
   striping topologies described by the pnfs_ff_striping_pattern
   enumerated type.

14.  Normative References

   [ErasureCodingLibraries]
              Plank, J., Luo, J., Schuman, C., Xu, L., and Z.
              Wilcox-O'Hearn, "A Performance Evaluation and
              Examination of Open-source Erasure Coding Libraries for
              Storage", 2007.

   [ErrorCorrectingCodes]
              MacWilliams, F. and N. Sloane, "The Theory of
              Error-Correcting Codes, Part I", 1977.

   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF
              Documents", November 2008,
              <http://trustee.ietf.org/license-info>.
   [MathOfRAID-6]
              Anvin, H., "The Mathematics of RAID-6", May 2009,
              <https://www.kernel.org/pub/linux/kernel/people/hpa/
              raid6.pdf>.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File
              System (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC4506]  Eisler, M., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, May 2006.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR)
              Description", RFC 5662, January 2010.

   [RFC5664]  Halevy, B., Ed., Welch, B., Ed., and J. Zelenka, Ed.,
              "Object-Based Parallel NFS (pNFS) Operations", RFC 5664,
              January 2010.

   [pNFSLayouts]
              Haynes, T., "Considerations for a New pNFS Layout Type",
              draft-haynes-nfsv4-layout-types-02 (Work In Progress),
              April 2014.

Appendix A.  Acknowledgments

   The pNFS Objects Layout was authored and revised by Brent Welch,
   Jim Zelenka, Benny Halevy, and Boaz Harrosh.

   Those who provided miscellaneous comments on early drafts of this
   document include: Matt W. Benjamin, Adam Emerson, Tom Haynes, J.
   Bruce Fields, and Lev Solomonov.

Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx, where xxxx is the
   RFC number of this document]

Authors' Addresses

   Benny Halevy
   Primary Data, Inc.

   Email: bhalevy@primarydata.com
   URI:   http://www.primarydata.com

   Thomas Haynes
   Primary Data, Inc.
   4300 El Camino Real Ste 100
   Los Altos, CA  94022
   USA

   Phone: +1 408 215 1519
   Email: thomas.haynes@primarydata.com