idnits 2.17.1 

draft-ietf-nfsv4-flex-files-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 333 has weird spacing: '... loghyr  staff...'

  == Line 970 has weird spacing: '...stateid    lor...'

  -- The document date (February 09, 2015) is 3364 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'LEGAL'

  == Outdated reference: A later version (-41) exists of
     draft-ietf-nfsv4-minorversion2-28

  ** Downref: Normative reference to an Informational RFC: RFC 1813

  ** Obsolete normative reference: RFC 5661 (Obsoleted by RFC 8881)


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NFSv4                                                          B. Halevy
3	Internet-Draft
4	Intended status: Standards Track                               T. Haynes
5	Expires: August 13, 2015                                    Primary Data
6	                                                       February 09, 2015

8	                Parallel NFS (pNFS) Flexible File Layout
9	                   draft-ietf-nfsv4-flex-files-05.txt

11	Abstract

13	   The Parallel Network File System (pNFS) allows a separation between
14	   the metadata (onto a metadata server) and data (onto a storage
15	   device) for a file.  The Flexible File Layout Type is defined in this
16	   document as an extension to pNFS to allow the use of storage devices
17	   in a fashion such that they require only a quite limited degree of
18	   interaction with the metadata server, using already existing
19	   protocols.  Client-side mirroring is also added to provide
20	   replication of files.

22	Status of This Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on August 13, 2015.

39	Copyright Notice

41	   Copyright (c) 2015 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	     1.1.  Definitions
58	     1.2.  Difference Between a Data Server and a Storage Device
59	     1.3.  Requirements Language
60	   2.  Coupling of Storage Devices
61	     2.1.  LAYOUTCOMMIT
62	     2.2.  Security Models
63	       2.2.1.  Implementation Notes for Synthetic uids/gids
64	       2.2.2.  Example of using Synthetic uids/gids
65	     2.3.  State and Locking Models
66	   3.  XDR Description of the Flexible File Layout Type
67	     3.1.  Code Components Licensing Notice
68	   4.  Device Addressing and Discovery
69	     4.1.  ff_device_addr4
70	     4.2.  Storage Device Multipathing
71	   5.  Flexible File Layout Type
72	     5.1.  ff_layout4
73	     5.2.  Interactions Between Devices and Layouts
74	     5.3.  Handling Version Errors
75	   6.  Striping via Sparse Mapping
76	   7.  Recovering from Client I/O Errors
77	   8.  Mirroring
78	     8.1.  Selecting a Mirror
79	     8.2.  Writing to Mirrors
80	     8.3.  Metadata Server Resilvering of the File
81	   9.  Flexible Files Layout Type Return
82	     9.1.  I/O Error Reporting
83	       9.1.1.  ff_ioerr4
84	     9.2.  Layout Usage Statistics
85	       9.2.1.  ff_io_latency4
86	       9.2.2.  ff_layoutupdate4
87	       9.2.3.  ff_iostats4
88	     9.3.  ff_layoutreturn4
89	   10. Flexible Files Layout Type LAYOUTERROR
90	   11. Flexible Files Layout Type LAYOUTSTATS
91	   12. Flexible File Layout Type Creation Hint
92	     12.1.  ff_layouthint4
93	   13. Recalling Layouts
94	     13.1.  CB_RECALL_ANY
95	   14. Client Fencing
96	   15. Security Considerations
97	     15.1.  Kerberized File Access
98	       15.1.1.  Loosely Coupled
99	       15.1.2.  Tightly Coupled
100	   16. IANA Considerations
101	   17. References
102	     17.1.  Normative References
103	     17.2.  Informative References
104	   Appendix A.  Acknowledgments
105	   Appendix B.  RFC Editor Notes
106	   Authors' Addresses

108	1.  Introduction

110	   In the parallel Network File System (pNFS), the metadata server
111	   returns Layout Type structures that describe where file data is
112	   located.  There are different Layout Types for different storage
113	   systems and methods of arranging data on storage devices.  This
114	   document defines the Flexible File Layout Type used with file-based
115	   data servers that are accessed using the Network File System (NFS)
116	   protocols: NFSv3 [RFC1813], NFSv4.0 [RFCNFSv4], NFSv4.1 [RFC5661],
117	   and NFSv4.2 [NFSv42].

119	   To provide a global state model equivalent to that of the Files
120	   Layout Type, a back-end control protocol MAY be implemented between
121	   the metadata server and NFSv4.1+ storage devices.  Specifying the
122	   wire format of such a control protocol is out of scope for this
123	   document; its requirements are specified in [RFC5661] and
124	   clarified in [pNFSLayouts].

126	1.1.  Definitions

128	   control protocol:  is a set of requirements for the communication of
129	      information on layouts, stateids, file metadata, and file data
130	      between the metadata server and the storage devices (see
131	      [pNFSLayouts]).

133	   client-side mirroring:  is when the client and not the server is
134	      responsible for updating all of the mirrored copies of a layout
135	      segment.

137	   data file:  is that part of the file system object which describes
138	      the payload and not the object.  E.g., it is the file contents.

140	   data server (DS):  is one of the pNFS servers which provides the
141	      contents of a file system object which is a regular file.
142	      Depending on the layout, there might be one or more data servers
143	      over which the data is striped.  Note that while the metadata
144	      server is strictly accessed over the NFSv4.1+ protocol, depending
145	      on the Layout Type, the data server could be accessed via any
146	      protocol that meets the pNFS requirements.

148	   fencing:  is when the metadata server prevents the storage devices
149	      from processing I/O from a specific client to a specific file.

151	   File Layout Type:  is a Layout Type in which the storage devices are
152	      accessed via the NFS protocol.

154	   layout:  informs a client of which storage devices it needs to
155	      communicate with (and over which protocol) to perform I/O on a
156	      file.  The layout might also provide some hints about how the
157	      storage is physically organized.

159	   layout iomode:  describes whether the layout granted to the client is
160	      for read or read/write I/O.

162	   layout segment:  describes a sub-division of a layout.  That sub-
163	      division might be by the iomode (see Sections 3.3.20 and 12.2.9 of
164	      [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or
165	      requested byte range.

167	   layout stateid:  is a 128-bit quantity returned by a server that
168	      uniquely defines the layout state provided by the server for a
169	      specific layout that describes a Layout Type and file (see
170	      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 describes
171	      the difference between a layout stateid and a normal stateid.

173	   layout type:  describes both the storage protocol used to access the
174	      data and the aggregation scheme used to lay out the file data on
175	      the underlying storage devices.

177	   loose coupling:  is when the metadata server and the storage devices
178	      do not have a control protocol present.

180	   metadata file:  is that part of the file system object which
181	      describes the object and not the payload.  E.g., it could be the
182	      time since last modification, access, etc.

184	   metadata server (MDS):  is the pNFS server which provides metadata
185	      information for a file system object.  It also is responsible for
186	      generating layouts for file system objects.  Note that the MDS is
187	      responsible for directory-based operations.

189	   mirror:  is a copy of a layout segment.  While mirroring can be used
190	      for backing up a layout segment, the copies can be distributed
191	      such that each remote site has a locally available copy.  Note
192	      that if one copy of the mirror is updated, then all copies must be
193	      updated.

195	   recalling a layout:  is when the metadata server uses a back channel
196	      to inform the client that the layout is to be returned in a
197	      graceful manner.  Note that the client has the opportunity to
198	      flush any writes, etc., before replying to the metadata server.

200	   revoking a layout:  is when the metadata server invalidates the
201	      layout such that neither the metadata server nor any storage
202	      device will accept any access from the client with that layout.

204	   resilvering:  is the act of rebuilding a mirrored copy of a layout
205	      segment from a known good copy of the layout segment.  Note that
206	      this can also be done to create a new mirrored copy of the layout
207	      segment.

209	   rsize:  is the data transfer buffer size used for reads.

211	   stateid:  is a 128-bit quantity returned by a server that uniquely
212	      defines the open and locking states provided by the server for a
213	      specific open-owner or lock-owner/open-owner pair for a specific
214	      file and type of lock.

216	   storage device:  is another term used almost interchangeably with
217	      data server.  See Section 1.2 for the nuances between the two.

219	   tight coupling:  is when the metadata server and the storage devices
220	      do have a control protocol present.

222	   wsize:  is the data transfer buffer size used for writes.

224	1.2.  Difference Between a Data Server and a Storage Device

226	   We defined a data server as a pNFS server, which implies that it can
227	   utilize the NFSv4.1+ protocol to communicate with the client.  As
228	   such, only the File Layout Type would currently meet this
229	   requirement.  The more generic concept is a storage device, which can
230	   use any protocol to communicate with the client.  The requirements
231	   for a storage device to act together with the metadata server to
232	   provide data to a client are that there is a Layout Type
233	   specification for the given protocol and that the metadata server has
234	   granted a layout to the client.  Note that nothing precludes there
235	   being multiple supported Layout Types (i.e., protocols) between a
236	   metadata server, storage devices, and client.

238	   As storage device is the more encompassing terminology, this document
239	   utilizes it over data server.

241	1.3.  Requirements Language

243	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
244	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
245	   document are to be interpreted as described in [RFC2119].

247	2.  Coupling of Storage Devices

249	   The coupling of the metadata server with the storage devices can be
250	   either tight or loose.  In a tight coupling, there is a control
251	   protocol present to manage security, LAYOUTCOMMITs, etc.  With a
252	   loose coupling, the only control protocol might be a version of NFS.
253	   As such, semantics for managing security, state, and locking models
254	   MUST be defined.

256	2.1.  LAYOUTCOMMIT

258	   With a tightly coupled system, when the metadata server receives a
259	   LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the
260	   File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]).  With
261	   a loosely coupled system, a LAYOUTCOMMIT to the metadata server MUST
262	   be preceded by a COMMIT to the storage device.  It is the
263	   responsibility of the client to make sure the data file is stable
264	   before the metadata server begins to query the storage devices about
265	   the changes to the file.  Note that if the client has not done a
266	   COMMIT to the storage device, then the LAYOUTCOMMIT might not be
267	   synchronized to the last WRITE operation to the storage device.

269	2.2.  Security Models

271	   With loosely coupled storage devices, the metadata server uses
272	   synthetic uids and gids for the data file, where the uid owner of the
273	   data file is allowed read/write access and the gid owner is allowed
274	   read only access.  As part of the layout (see ffds_user and
275	   ffds_group in Section 5.1), the client is provided with the user and
276	   group to be used in the Remote Procedure Call (RPC) [RFC5531]
277	   credentials needed to access the data file.  Fencing off of clients
278	   is achieved by the metadata server changing the synthetic uid and/or
279	   gid owners of the data file on the storage device to implicitly
280	   revoke the outstanding RPC credentials.

282	   With this loosely coupled model, the metadata server is not able to
283	   fence off a single client; it is forced to fence off all clients.
284	   However, as the other clients react to the fencing, returning their
285	   layouts and trying to get new ones, the metadata server can hand out
286	   a new uid and gid to allow access.

288	   Note: it is recommended to implement common access control methods at
289	   the storage device filesystem to allow only the metadata server root
290	   (super user) access to the storage device, and to set the owner of
291	   all directories holding data files to the root user.  This approach
292	   provides a practical model to enforce access control and fence off
293	   cooperative clients, but it cannot protect against malicious
294	   clients; hence it provides a level of security equivalent to
295	   AUTH_SYS.

297	   With tightly coupled storage devices, the metadata server sets the
298	   user and group owners, mode bits, and ACL of the data file to be the
299	   same as the metadata file.  And the client must authenticate with the
300	   storage device and go through the same authorization process it would
301	   go through via the metadata server.

303	2.2.1.  Implementation Notes for Synthetic uids/gids

305	   The selection method for the synthetic uids and gids to be used for
306	   fencing in loosely coupled storage devices is strictly an
307	   implementation issue.  An implementation might allow an administrator
308	   to restrict a range of such ids in the name servers.  She might also
309	   be able to choose an id that would never be used to grant access.
310	   Then when the metadata server had a request to access a file, a
311	   SETATTR would be sent to the storage device to set the owner and
312	   group of the data file.  The user and group might be selected in a
313	   round robin fashion from the range of available ids.

315	   Those ids would be sent back as ffds_user and ffds_group to the
316	   client, which would present them as the RPC credentials to the
317	   storage device.  When the client was done accessing the file and the
318	   metadata server knew that no other client was accessing the file, it
319	   could reset the owner and group to restrict access to the data file.

321	   When the metadata server wanted to fence off a client, it would
322	   change the synthetic uid and/or gid to the restricted ids.  Note that
323	   using a restricted id ensures that there is a change of owner and at
324	   least one id available that never gets allowed access.
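
The selection and fencing steps described above can be sketched in Python as follows.  This is only an illustration under stated assumptions: the class names SyntheticIdPool and DataFile, the id range, and the restricted id value are all hypothetical, and the attribute updates merely stand in for the SETATTR that a metadata server would send to the storage device.

```python
import itertools


class SyntheticIdPool:
    """Hypothetical round-robin allocator for synthetic uids/gids."""

    def __init__(self, first_id, last_id, restricted_id):
        # The restricted id is never handed out, so switching to it
        # always changes the owner and never grants access.
        if first_id <= restricted_id <= last_id:
            raise ValueError("restricted id must lie outside the pool")
        self._cycle = itertools.cycle(range(first_id, last_id + 1))
        self.restricted_id = restricted_id

    def next_pair(self):
        """Return the next (uid, gid) pair in round-robin order."""
        return next(self._cycle), next(self._cycle)


class DataFile:
    """Stand-in for the data file's ownership on the storage device."""

    def __init__(self, pool):
        self.pool = pool
        # Until a layout is granted, nobody can access the data file.
        self.uid = pool.restricted_id
        self.gid = pool.restricted_id

    def grant(self):
        """Models the SETATTR of owner and group before a layout is granted.

        The returned values correspond to ffds_user and ffds_group.
        """
        self.uid, self.gid = self.pool.next_pair()
        return self.uid, self.gid

    def fence(self):
        """Models fencing: outstanding RPC credentials stop matching."""
        self.uid = self.gid = self.pool.restricted_id
```

After fence() is called, a later grant() hands out a fresh pair, which mirrors how fenced clients that return their layouts can obtain new ids.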

326	2.2.2.  Example of using Synthetic uids/gids

328	   The user loghyr creates a file "ompha.c" on the metadata server and
329	   it creates a corresponding data file on the storage device.

331	   The metadata server entry may look like:

333	   -rw-r--r--    1 loghyr  staff    1697 Dec  4 11:31 ompha.c
334	   On the storage device, it may be assigned some random synthetic uid/
335	   gid to deny access:

337	   -rw-r-----    1 19452   28418    1697 Dec  4 11:31 data_ompha.c

339	   When the file is opened on a client, since the layout knows nothing
340	   about the user (and does not care), whether loghyr or garbo opens the
341	   file does not matter.  The owner and group are modified and those
342	   values are returned.

344	   -rw-r-----    1 1066    1067     1697 Dec  4 11:31 data_ompha.c

346	   The set of synthetic gids on the storage device should be selected
347	   such that there is no mapping in any of the name services used by the
348	   storage device.  I.e., each group should have no members.

350	   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the
351	   metadata server should return a synthetic uid that is not set on the
352	   storage device.  Only the synthetic gid would be valid.

354	   The client is thus solely responsible for enforcing file permissions
355	   in a loosely coupled model.  To allow loghyr write access, it will
356	   send an RPC to the storage device with a credential of 1066:1067.  To
357	   allow garbo read access, it will send an RPC to the storage device
358	   with a credential of 1067:1067.  The value of the uid does not matter
359	   as long as it is not the synthetic uid granted to the client when it
360	   obtained the layout.
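
The effect of the two credentials in this example can be checked with a small Python sketch.  It assumes the data file keeps the mode shown above (rw-r-----) and ordinary owner-then-group permission evaluation on the storage device; the function name allowed_ops is hypothetical.

```python
def allowed_ops(file_uid, file_gid, cred_uid, cred_gid):
    """Return the operations a credential gets on a rw-r----- data file.

    The owner class is checked first, then the group class; everyone
    else gets nothing.
    """
    if cred_uid == file_uid:
        return {"read", "write"}
    if cred_gid == file_gid:
        return {"read"}
    return set()


# Data file owned by the synthetic ids 1066:1067, as in the example.
writer = allowed_ops(1066, 1067, cred_uid=1066, cred_gid=1067)
reader = allowed_ops(1066, 1067, cred_uid=1067, cred_gid=1067)
stale = allowed_ops(1066, 1067, cred_uid=19452, cred_gid=28418)
```

Here writer holds read and write, reader holds read only, and stale (a credential using the earlier deny-access ids) holds nothing.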

362	   While pushing the enforcement of permission checking onto the client
363	   may seem to weaken security, the client may already be responsible
364	   for enforcing permissions before modifications are sent to a server.
365	   With cached writes, the client is always responsible for tracking who
366	   is modifying a file and making sure to not coalesce requests from
367	   multiple users into one request.

369	2.3.  State and Locking Models

371	   Metadata file OPEN, LOCK, and DELEGATION operations are always
372	   executed only against the metadata server.

374	   The metadata server responds to state changing operations by
375	   executing them against the respective data files on the storage
376	   devices.  It then sends the storage device open stateid as part of
377	   the layout (see the ffm_stateid in Section 5.1) and it is then used
378	   by the client for executing READ/WRITE operations against the storage
379	   device.

381	   Standalone NFSv4.1+ storage devices that do not return the
382	   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way
383	   as NFSv4 storage devices.

385	   NFSv4.1+ clustered storage devices that do identify themselves with
386	   the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end
387	   control protocol as described in [RFC5661] to implement a global
388	   stateid model as defined there.

390	3.  XDR Description of the Flexible File Layout Type

392	   This document contains the external data representation (XDR)
393	   [RFC4506] description of the Flexible File Layout Type.  The XDR
394	   description is embedded in this document in a way that makes it
395	   simple for the reader to extract into a ready-to-compile form.  The
396	   reader can feed this document into the following shell script to
397	   produce the machine readable XDR description of the Flexible File
398	   Layout Type:

400	   <CODE BEGINS>

402	   #!/bin/sh
403	   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

405	   <CODE ENDS>

407	   That is, if the above script is stored in a file called "extract.sh",
408	   and this document is in a file called "spec.txt", then the reader can
409	   do:

411	   sh extract.sh < spec.txt > flex_files_prot.x

413	   The effect of the script is to remove leading white space from each
414	   line, plus a sentinel sequence of "///".

416	   The embedded XDR file header follows.  Subsequent XDR descriptions,
417	   with the sentinel sequence, are embedded throughout the document.

419	   Note that the XDR code contained in this document depends on types
420	   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both NFS
421	   types that end with a 4, such as offset4, length4, etc., as well as
422	   more generic types such as uint32_t and uint64_t.

424	3.1.  Code Components Licensing Notice

426	   Both the XDR description and the scripts used for extracting the XDR
427	   description are Code Components as described in Section 4 of "Legal
428	   Provisions Relating to IETF Documents" [LEGAL].  These Code
429	   Components are licensed according to the terms of that document.

431	   <CODE BEGINS>

433	   /// /*
434	   ///  * Copyright (c) 2012 IETF Trust and the persons identified
435	   ///  * as authors of the code. All rights reserved.
436	   ///  *
437	   ///  * Redistribution and use in source and binary forms, with
438	   ///  * or without modification, are permitted provided that the
439	   ///  * following conditions are met:
440	   ///  *
441	   ///  * o Redistributions of source code must retain the above
442	   ///  *   copyright notice, this list of conditions and the
443	   ///  *   following disclaimer.
444	   ///  *
445	   ///  * o Redistributions in binary form must reproduce the above
446	   ///  *   copyright notice, this list of conditions and the
447	   ///  *   following disclaimer in the documentation and/or other
448	   ///  *   materials provided with the distribution.
449	   ///  *
450	   ///  * o Neither the name of Internet Society, IETF or IETF
451	   ///  *   Trust, nor the names of specific contributors, may be
452	   ///  *   used to endorse or promote products derived from this
453	   ///  *   software without specific prior written permission.
454	   ///  *
455	   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
456	   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
457	   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
458	   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
459	   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
460	   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
461	   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
462	   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
463	   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
464	   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
465	   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
466	   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
467	   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
468	   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
469	   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
470	   ///  *
471	   ///  * This code was derived from RFCTBD10.
472	   ///  * Please reproduce this note if possible.
473	   ///  */
474	   ///
475	   /// /*
476	   ///  * flex_files_prot.x
477	   ///  */
478	   ///
479	   /// /*
480	   ///  * The following include statements are for example only.
481	   ///  * The actual XDR definition files are generated separately
482	   ///  * and independently and are likely to have a different name.
483	   ///  * %#include <nfsv42.x>
484	   ///  * %#include <rpc_prot.x>
485	   ///  */
486	   ///

488	   <CODE ENDS>

490	4.  Device Addressing and Discovery

492	   Data operations to a storage device require the client to know the
493	   network address of the storage device.  The NFSv4.1+ GETDEVICEINFO
494	   operation (Section 18.40 of [RFC5661]) is used by the client to
495	   retrieve that information.

497	4.1.  ff_device_addr4

499	   The ff_device_addr4 data structure is returned by the server as the
500	   storage protocol specific opaque field da_addr_body in the
501	   device_addr4 structure by a successful GETDEVICEINFO operation.

503	   <CODE BEGINS>

505	   /// struct ff_device_versions4 {
506	   ///         uint32_t        ffdv_version;
507	   ///         uint32_t        ffdv_minorversion;
508	   ///         uint32_t        ffdv_rsize;
509	   ///         uint32_t        ffdv_wsize;
510	   ///         bool            ffdv_tightly_coupled;
511	   /// };
512	   ///

514	   /// struct ff_device_addr4 {
515	   ///         multipath_list4     ffda_netaddrs;
516	   ///         ff_device_versions4 ffda_versions<>;
517	   /// };
518	   ///

520	   <CODE ENDS>
521	   The ffda_netaddrs field is used to locate the storage device.  It
522	   MUST be set by the server to a list holding one or more of the device
523	   network addresses.

525	   The ffda_versions array allows the metadata server to present
526	   multiple NFS versions and/or minor versions to the client.  The
527	   ffdv_version and ffdv_minorversion represent the NFS protocol to be
528	   used to access the storage device.  This layout specification defines
529	   the semantics for ffdv_version values 3 and 4.  If ffdv_version equals
530	   3, then the server MUST set ffdv_minorversion to 0 and the client MUST
531	   access the storage device using the NFSv3 protocol [RFC1813].  If
532	   ffdv_version equals 4, then the server MUST set ffdv_minorversion to
533	   one of the NFSv4 minor version numbers and the client MUST access the
534	   storage device using NFSv4.
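
A client-side reading of these version rules might look like the following Python sketch.  The class FfDeviceVersion mirrors the fields of ff_device_versions4, but the class itself, the helper names protocol_for and pick_version, and the "first usable entry" selection policy are assumptions made for illustration; this document does not mandate a selection order.

```python
from dataclasses import dataclass


@dataclass
class FfDeviceVersion:
    """Python stand-in for one ff_device_versions4 entry."""
    version: int            # ffdv_version
    minorversion: int       # ffdv_minorversion
    rsize: int              # ffdv_rsize
    wsize: int              # ffdv_wsize
    tightly_coupled: bool   # ffdv_tightly_coupled


def protocol_for(entry):
    """Map an entry to the protocol the client must use."""
    if entry.version == 3:
        if entry.minorversion != 0:
            raise ValueError("ffdv_minorversion MUST be 0 for NFSv3")
        return "NFSv3"
    if entry.version == 4:
        return f"NFSv4.{entry.minorversion}"
    raise ValueError(f"no semantics defined for ffdv_version {entry.version}")


def pick_version(entries, supported):
    """Return the first entry whose protocol the client supports, or None."""
    for entry in entries:
        if protocol_for(entry) in supported:
            return entry
    return None
```

When pick_version() returns None, the client has found no usable combination; how that is reported to the metadata server is covered in Section 5.3.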

536	   Note that while the client might determine that it cannot use any of
537	   the configured ffdv_version or ffdv_minorversion values when it gets
538	   the device list from the metadata server, there is no way to indicate
539	   to the metadata server which device it is version incompatible with.
540	   If, however, the client waits until it retrieves the layout from the
541	   metadata server, it can at that time clearly identify the storage
542	   device in question (see Section 5.3).

544	   The ffdv_rsize and ffdv_wsize are used to communicate the maximum
545	   rsize and wsize supported by the storage device.  As the storage
546	   device can have a different rsize or wsize than the metadata server,
547	   the ffdv_rsize and ffdv_wsize allow the metadata server to
548	   communicate that information on behalf of the storage device.

550	   ffdv_tightly_coupled informs the client as to whether the metadata
551	   server is tightly coupled with the storage devices or not.  Note that
552	   even if the data protocol is at least NFSv4.1, it may still be the
553	   case that there is no control protocol present.  If
554	   ffdv_tightly_coupled is not set, then the client MUST commit writes
555	   to the storage devices for the file before sending a LAYOUTCOMMIT to
556	   the metadata server.  I.e., the writes MUST be committed by the
557	   client to stable storage via issuing WRITEs with stable_how ==
558	   FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how !=
559	   FILE_SYNC (see Section 3.3.7 of [RFC1813]).
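
The rule for loosely coupled storage devices can be expressed as a small decision function.  The sketch below is illustrative only: the function name is hypothetical, and it covers just the rule stated above.  The stable_how values are those defined in [RFC1813].

```python
# stable_how values from RFC 1813.
UNSTABLE = 0
DATA_SYNC = 1
FILE_SYNC = 2


def commit_needed_before_layoutcommit(tightly_coupled, write_stable_hows):
    """Decide whether the rule above requires a COMMIT before LAYOUTCOMMIT.

    `write_stable_hows` lists the stable_how value used for each WRITE
    sent to the storage device since the last COMMIT.
    """
    if tightly_coupled:
        # The rule above does not apply.  Other requirements (for
        # example, File Layout Type semantics) may still call for a
        # COMMIT; that is outside this sketch.
        return False
    # Any WRITE that was not FILE_SYNC leaves data that must be committed.
    return any(how != FILE_SYNC for how in write_stable_hows)
```

For a loosely coupled device, a client that wrote only with FILE_SYNC can proceed directly to LAYOUTCOMMIT, while a single UNSTABLE or DATA_SYNC write means a COMMIT has to come first.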

561	4.2.  Storage Device Multipathing

563	   The Flexible File Layout Type supports multipathing to multiple
564	   storage device addresses.  Storage device level multipathing is used
565	   for bandwidth scaling via trunking and for higher availability in
566	   the case of a storage device failure.  Multipathing allows the
567	   client to switch to another storage device address which may be that
568	   of another storage device that is exporting the same data stripe
569	   unit, without having to contact the metadata server for a new layout.

571	   To support storage device multipathing, ffda_netaddrs contains an
572	   array of one or more storage device network addresses.  This array
573	   (data type multipath_list4) represents a list of storage devices (each
574	   identified by a network address), with the possibility that some
575	   storage device will appear in the list multiple times.

577	   The client is free to use any of the network addresses as a
578	   destination to send storage device requests.  If some network
579	   addresses are less optimal paths to the data than others, then the
580	   MDS SHOULD NOT include those network addresses in ffda_netaddrs.  If
581	   less optimal network addresses exist to provide failover, the
582	   RECOMMENDED method to offer the addresses is to provide them in a
583	   replacement device-ID-to-device-address mapping, or a replacement
584	   device ID.  When a client finds no response from the storage device
585	   using all addresses available in ffda_netaddrs, it SHOULD send a
586	   GETDEVICEINFO to attempt to replace the existing device-ID-to-device-
587	   address mappings.  If the MDS detects that all network paths
588	   represented by ffda_netaddrs are unavailable, the MDS SHOULD send a
589	   CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID
590	   notifications for changed device IDs) to change the device-ID-to-
591	   device-address mappings to the available addresses.  If the device ID
592	   itself will be replaced, the MDS SHOULD recall all layouts with the
593	   device ID, and thus force the client to get new layouts and device ID
594	   mappings via LAYOUTGET and GETDEVICEINFO.
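
A minimal sketch of the client-side failover described above, assuming hypothetical callables: send(addr) issues the request to one address and raises ConnectionError when there is no response, and getdeviceinfo() returns a refreshed address list from the metadata server. Neither name comes from this document.

```python
def send_with_failover(netaddrs, send, getdeviceinfo):
    """Try every address in ffda_netaddrs, then refresh the mapping once."""

    def try_all(addrs):
        for addr in addrs:
            try:
                return True, send(addr)
            except ConnectionError:
                continue  # this path is down, try the next address
        return False, None

    ok, reply = try_all(netaddrs)
    if ok:
        return reply
    # No address responded: the client SHOULD send a GETDEVICEINFO to
    # replace the existing device-ID-to-device-address mapping.
    ok, reply = try_all(getdeviceinfo())
    if ok:
        return reply
    raise ConnectionError("no storage device address responded")
```

If the refreshed mapping also fails, the sketch simply raises; a real client would more likely fall back to I/O through the metadata server.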

596	   Generally, if two network addresses appear in ffda_netaddrs, they
597	   will designate the same storage device.  When the storage device is
598	   accessed over NFSv4.1 or a higher minor version, the two storage
599	   device addresses will support the implementation of client ID or
600	   session trunking (the latter is RECOMMENDED) as defined in
601	   [RFC5661].  The two storage device addresses will share the same
602	   server owner or major ID of the server owner.  It is not always
603	   necessary for the two storage device addresses to designate the
604	   same storage device with trunking being used.  For example, the
605	   data could be read-only and consist of exact replicas.

607	5.  Flexible File Layout Type

609	   The layout4 type is defined in [RFC5662] as follows:

611	   <CODE BEGINS>
612	       enum layouttype4 {
613	           LAYOUT4_NFSV4_1_FILES   = 1,
614	           LAYOUT4_OSD2_OBJECTS    = 2,
615	           LAYOUT4_BLOCK_VOLUME    = 3,
616	           LAYOUT4_FLEX_FILES      = 4
617	   [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
618	     to be the layouttype assigned by IANA]]
619	       };

621	       struct layout_content4 {
622	           layouttype4             loc_type;
623	           opaque                  loc_body<>;
624	       };

626	       struct layout4 {
627	           offset4                 lo_offset;
628	           length4                 lo_length;
629	           layoutiomode4           lo_iomode;
630	           layout_content4         lo_content;
631	       };

633	   <CODE ENDS>

635	   This document defines the structure associated with the layouttype4
636	   value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body structure
637	   as an XDR type "opaque".  The opaque layout is uninterpreted by the
638	   generic pNFS client layers, but must be interpreted by the Flexible
639	   File Layout Type implementation.  This section defines the structure
640	   of this opaque value, ff_layout4.

642	5.1.  ff_layout4

644	   <CODE BEGINS>

646	   /// struct ff_data_server4 {
647	   ///     deviceid4               ffds_deviceid;
648	   ///     uint32_t                ffds_efficiency;
649	   ///     stateid4                ffds_stateid;
650	   ///     nfs_fh4                 ffds_fh_vers<>;
651	   ///     fattr4_owner            ffds_user;
652	   ///     fattr4_owner_group      ffds_group;
653	   /// };
654	   ///

656	   /// struct ff_mirror4 {
657	   ///     ff_data_server4         ffm_data_servers<>;
658	   /// };
659	   ///
660	   /// struct ff_layout4 {
661	   ///     length4                 ffl_stripe_unit;
662	   ///     ff_mirror4              ffl_mirrors<>;
663	   /// };
664	   ///

666	   <CODE ENDS>

668	   The ff_layout4 structure specifies a layout over a set of mirrored
669	   copies of that portion of the data file described in the current
670	   layout segment.  This mirroring protects against loss of data in
671	   layout segments.  Note that while not explicitly shown in the above
672	   XDR, each layout4 element returned in the logr_layout array of
673	   LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout
674	   segment.  Hence each ff_layout4 also describes a layout segment.

676	   It is possible that the file is concatenated from more than one
677	   layout segment.  Each layout segment MAY represent different striping
678	   parameters, applying respectively only to the layout segment byte
679	   range.

681	   The ffl_stripe_unit field is the stripe unit size in use for the
682	   current layout segment.  The number of stripes is given inside each
683	   mirror by the number of elements in ffm_data_servers.  If the number
684	   of stripes is one, then the value for ffl_stripe_unit MUST default to
685	   zero.  The only supported mapping scheme is sparse and is detailed in
686	   Section 6.  Note that there is an assumption here that both the
687	   stripe unit size and the number of stripes are the same across all
688	   mirrors.

690	   The ffl_mirrors field is the array of mirrored storage devices which
691	   provide the storage for the current stripe, see Figure 1.

693	                      +-----------+
694	                      |           |
695	                      |           |
696	                      |   File    |
697	                      |           |
698	                      |           |
699	                      +-----+-----+
700	                            |
701	               +------------+------------+
702	               |                         |
703	          +----+-----+             +-----+----+
704	          | Mirror 1 |             | Mirror 2 |
705	          +----+-----+             +-----+----+
706	               |                         |
707	          +-----------+            +-----------+
708	          |+-----------+           |+-----------+
709	          ||+-----------+          ||+-----------+
710	          +||  Storage  |          +||  Storage  |
711	           +|  Devices  |           +|  Devices  |
712	            +-----------+            +-----------+

714	                                 Figure 1

716	   The ffl_mirrors field represents an array of state information for
717	   each mirrored copy of the current layout segment.  Each element is
718	   described by a ff_mirror4 type.

720	   ffds_deviceid provides the deviceid of the storage device holding the
721	   data file.

723	   ffds_fh_vers is an array of filehandles of the data file matching to
724	   the available NFS versions on the given storage device.  There MUST
725	   be exactly as many elements in ffds_fh_vers as there are in
726	   ffda_versions.  Each element of the array corresponds to each
727	   ffdv_version and ffdv_minorversion provided for the device.  The
728	   array allows for server implementations which have different
729	   filehandles for different version and minor version combinations.
730	   See Section 5.3 for how to handle versioning issues between the
731	   client and storage devices.
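
The positional pairing of ffds_fh_vers with ffda_versions can be illustrated as follows. This is a sketch under stated assumptions: versions are modeled as (ffdv_version, ffdv_minorversion) tuples, filehandles as opaque bytes, and pick_filehandle is a hypothetical helper name.

```python
def pick_filehandle(ffda_versions, ffds_fh_vers, client_supported):
    """Return ((version, minorversion), filehandle) for the first usable entry.

    ffda_versions:    list of (ffdv_version, ffdv_minorversion) tuples
    ffds_fh_vers:     list of filehandles, one per ffda_versions element
    client_supported: set of (version, minorversion) tuples the client speaks
    """
    if len(ffda_versions) != len(ffds_fh_vers):
        # The text requires exactly as many filehandles as versions.
        raise ValueError("ffds_fh_vers and ffda_versions differ in length")
    for version, fh in zip(ffda_versions, ffds_fh_vers):
        if version in client_supported:
            return version, fh
    return None  # no usable combination, see Section 5.3
```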

733	   For tight coupling, ffds_stateid provides the stateid to be used by
734	   the client to access the file.  For loose coupling and an NFSv4
735	   storage device, the client may use an anonymous stateid to perform I/
736	   O on the storage device as there is no use for the metadata server
737	   stateid (no control protocol).  In such a scenario, the server MUST
738	   set the ffds_stateid to be zero.

740	   For loosely coupled storage devices, ffds_user and ffds_group provide
741	   the synthetic user and group to be used in the RPC credentials that
742	   the client presents to the storage device to access the data files.
743	   For tightly coupled storage devices, the user and group on the
744	   storage device will be the same as on the metadata server.  I.e., if
745	   ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
746	   ignore both ffds_user and ffds_group.

748	   The allowed values for both ffds_user and ffds_group are specified in
749	   Section 5.9 of [RFC5661].  For NFSv3 compatibility, user and group
750	   strings that consist of decimal numeric values with no leading zeros
751	   can be given a special interpretation by clients and servers that
752	   choose to provide such support.  The receiver may treat such a user
753	   or group string as representing the same user as would be represented
754	   by an NFSv3 uid or gid having the corresponding numeric value.  Note
755	   that if using Kerberos for security, the expectation is that these
756	   values will be a name@domain string.

758	   ffds_efficiency describes the metadata server's evaluation as to the
759	   effectiveness of each mirror.  Note that this is per layout and not
760	   per device as the metric may change due to perceived load,
761	   availability to the metadata server, etc.  Higher values denote
762	   higher perceived utility.  The way the client can select the best
763	   mirror to access is discussed in Section 8.1.

765	5.2.  Interactions Between Devices and Layouts

767	   In [RFC5661], the File Layout Type is defined such that the
768	   relationship between multipathing and filehandles can result in
769	   either 0, 1, or N filehandles (see Section 13.3).  Some rationales
770	   for this are clustered servers which share the same filehandle or
771	   allowing for multiple read-only copies of the file on the same
772	   storage device.  In the Flexible File Layout Type, while there is an
773	   array of filehandles, they are independent of the multipathing being
774	   used.  If the metadata server wants to provide multiple read-only
775	   copies of the same file on the same storage device, then it should
776	   provide multiple ff_device_addr4, each as a mirror.  The client can
777	   then determine that since the ffds_fh_vers are different, then there
778	   are multiple copies of the file for the current layout segment
779	   available.

781	5.3.  Handling Version Errors

783	   When the metadata server provides the ffda_versions array in the
784	   ff_device_addr4 (see Section 4.1), the client is able to determine if
785	   it cannot access a storage device with any of the supplied
786	   ffdv_version and ffdv_minorversion combinations.  However, due to the
787	   limitations of reporting errors in GETDEVICEINFO (see Section 18.40
788	   of [RFC5661]), the client is not able to specify which device it
789	   cannot communicate with over one of the provided ffdv_version and
790	   ffdv_minorversion combinations.  Using ff_ioerr4 (see Section 9.1.1)
791	   inside either the LAYOUTRETURN (see Section 18.44 of [RFC5661]) or
792	   the LAYOUTERROR (see Section 15.6 of [NFSv42] and Section 10 of this
793	   document), the client can isolate the problematic storage device.

795	   The error code to return for LAYOUTRETURN and/or LAYOUTERROR is
796	   NFS4ERR_MINOR_VERS_MISMATCH.  It does not matter whether the mismatch
797	   is a major version (e.g., client can use NFSv3 but not NFSv4) or
798	   minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the
799	   error indicates that for all the supplied combinations for
800	   ffdv_version and ffdv_minorversion, the client cannot communicate
801	   with the storage device.  The client can retry the GETDEVICEINFO to
802	   see if the metadata server can provide a different combination or it
803	   can fall back to doing the I/O through the metadata server.
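
As a rough illustration of the reporting step, the sketch below builds a device_error4-like record when none of the offered version combinations is usable. The dictionary layout and the helper name are assumptions made for illustration. The value 10021 is the nfsstat4 code of NFS4ERR_MINOR_VERS_MISMATCH in [RFC5661]. Which operation number to report is not specified in the text, so it is passed in by the caller.

```python
NFS4ERR_MINOR_VERS_MISMATCH = 10021  # nfsstat4 value from RFC 5661


def version_mismatch_error(deviceid, ffda_versions, client_supported, opnum):
    """Return a device_error4-like dict, or None if some version is usable.

    opnum is the NFSv4 operation the client intended to send; choosing it
    is left to the caller because the text does not specify it.
    """
    if any(v in client_supported for v in ffda_versions):
        return None
    return {
        "de_deviceid": deviceid,
        "de_status": NFS4ERR_MINOR_VERS_MISMATCH,
        "de_opnum": opnum,
    }
```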

805	6.  Striping via Sparse Mapping

807	   While other Layout Types support both dense and sparse mapping of
808	   logical offsets to physical offsets within a file (see for example
809	   Section 13.4 of [RFC5661]), the Flexible File Layout Type only
810	   supports a sparse mapping.

812	   With sparse mappings, the logical offset within a file (L) is also
813	   the physical offset on the storage device.  As detailed in
814	   Section 13.4.4 of [RFC5661], this results in holes across each
815	   storage device which does not contain the current stripe index.

817	   L: logical offset into the file

819	   W: stripe width
820	       W = number of elements in ffm_data_servers

822	   S: number of bytes in a stripe
823	       S = W * ffl_stripe_unit

825	   N: stripe number
826	       N = L / S
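
The definitions above can be worked through in a short sketch. The stripe number N follows the formulas in the text. The data server index, computed here as (L / ffl_stripe_unit) mod W, is not given in this section and is an assumption labeled as such in the code. The function name is hypothetical.

```python
def sparse_stripe_geometry(logical_offset, ffl_stripe_unit, num_data_servers):
    """Compute W, S, N for a logical offset L under sparse mapping."""
    W = num_data_servers
    if W == 1:
        # With one stripe, ffl_stripe_unit is zero and all data is on
        # the single data server; there is nothing to compute.
        return {"W": 1, "S": 0, "N": 0, "server_index": 0,
                "physical_offset": logical_offset}
    S = W * ffl_stripe_unit           # number of bytes in a stripe
    N = logical_offset // S           # stripe number
    # ASSUMPTION: round-robin placement of stripe units across servers.
    server_index = (logical_offset // ffl_stripe_unit) % W
    return {"W": W, "S": S, "N": N, "server_index": server_index,
            # Sparse mapping: the physical offset equals the logical offset.
            "physical_offset": logical_offset}
```

For example, with a 64 KiB stripe unit and four data servers, logical offset 5 * 65536 + 10 falls in stripe number 1.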

828	7.  Recovering from Client I/O Errors

830	   The pNFS client may encounter errors when directly accessing the
831	   storage devices.  However, it is the responsibility of the metadata
832	   server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
833	   layout type is used, the client MUST report the I/O errors to the
834	   server at LAYOUTRETURN time using the ff_ioerr4 structure (see
835	   Section 9.1.1).

837	   The metadata server analyzes the error and determines the required
838	   recovery operations such as recovering media failures or
839	   reconstructing missing data files.

841	   The metadata server SHOULD recall any outstanding layouts to allow it
842	   exclusive write access to the stripes being recovered and to prevent
843	   other clients from hitting the same error condition.  In these cases,
844	   the server MUST complete recovery before handing out any new layouts
845	   to the affected byte ranges.

847	   Although it MAY be acceptable for the client to propagate a
848	   corresponding error to the application that initiated the I/O
849	   operation and drop any unwritten data, the client SHOULD attempt to
850	   retry the original I/O operation by requesting a new layout using
851	   LAYOUTGET and retry the I/O operation(s) using the new layout, or the
852	   client MAY just retry the I/O operation(s) using regular NFS READ or
853	   WRITE operations via the metadata server.  The client SHOULD attempt
854	   to retrieve a new layout and retry the I/O operation using the
855	   storage device first and only if the error persists, retry the I/O
856	   operation via the metadata server.

858	8.  Mirroring

860	   The Flexible File Layout Type has a simple model in place for the
861	   mirroring of the file data constrained by a layout segment.  There is
862	   no assumption that each copy of the mirror is stored identically on
863	   the storage devices, i.e., one device might employ compression or
864	   deduplication on the data.  However, the over-the-wire transfer of
865	   the file contents MUST appear identical.  Note, this is a construct
866	   of the selected XDR representation that each mirrored copy of the
867	   layout segment has the same striping pattern (see Figure 1).

869	   The metadata server is responsible for determining the number of
870	   mirrored copies and the location of each mirror.  While the client
871	   may provide a hint to how many copies it wants (see Section 12), the
872	   metadata server can ignore that hint and in any event, the client has
873	   no means to dictate either the storage device (which also means the
874	   coupling and/or protocol levels to access the layout segments) or
875	   the location of said storage device.

877	   The updating of mirrored layout segments is done via client-side
878	   mirroring.  With this approach, the client is responsible for making
879	   sure modifications get to all copies of the layout segments it is
880	   informed of via the layout.  If a layout segment is being resilvered
881	   to a storage device, that mirrored copy will not be in the layout.
882	   Thus the metadata server MUST update that copy until the client is
883	   presented with it in a layout.  Also, if the client is writing to the
884	   layout segments via the metadata server, e.g., using an earlier
885	   version of the protocol, then the metadata server MUST update all
886	   copies of the mirror.  As seen in Section 8.3, during the
887	   resilvering, the layout is recalled, and the client has to make
888	   modifications via the metadata server.

890	8.1.  Selecting a Mirror

892	   When the metadata server grants a layout to a client, it can let the
893	   client know how fast it expects each mirror to be once the request
894	   arrives at the storage devices via the ffds_efficiency member.  While
895	   the algorithms to calculate that value are left to the metadata
896	   server implementations, factors that could contribute to that
897	   calculation include speed of the storage device, physical memory
898	   available to the device, operating system version, current load, etc.

900	   However, what should not be involved in that calculation is a
901	   perceived network distance between the client and the storage device.
902	   The client is better situated for making that determination based on
903	   past interaction with the storage device over the different available
904	   network interfaces between the two.  I.e., the metadata server might
905	   not know about a transient outage between the client and storage
906	   device because it has no presence on the given subnet.

908	   As such, it is the client which decides which mirror to access for
909	   reading the file.  The requirements for writing to mirrored layout
910	   segments are presented below.
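
One possible client policy for read mirror selection is sketched below. It is only an illustration: MirrorView and its reachable flag are hypothetical, standing in for whatever the client has learned about network paths from past interaction, while efficiency carries the ffds_efficiency hint from the metadata server.

```python
from dataclasses import dataclass


@dataclass
class MirrorView:
    """Hypothetical client-side view of one mirror in the layout."""
    index: int              # position in ffl_mirrors
    efficiency: int         # ffds_efficiency hint from the metadata server
    reachable: bool = True  # the client's own knowledge of the network path


def choose_read_mirror(mirrors):
    """Pick the reachable mirror with the highest efficiency hint."""
    candidates = [m for m in mirrors if m.reachable]
    if not candidates:
        return None
    # Higher values denote higher perceived utility.
    return max(candidates, key=lambda m: m.efficiency)
```

The server hint ranks the devices, and the client's own reachability knowledge filters them, which mirrors the split of responsibilities described above.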

912	8.2.  Writing to Mirrors

914	   The client is responsible for updating all mirrored copies of the
915	   layout segments that it is given in the layout.  If all but one copy
916	   is updated successfully and the last one provides an error, then the
917	   client needs to return the layout to the metadata server with an
918	   error indicating that the update failed to that storage device.

920	   The metadata server is then responsible for determining if it wants
921	   to remove the errant mirror from the layout, if the mirror has
922	   recovered from some transient error, etc.  When the client tries to
923	   get a new layout, the metadata server informs it of the decision by
924	   the contents of the layout.  The client MUST NOT make any assumptions
925	   that the contents of the previous layout will match those of the new
926	   one.  If it has updates that were not committed, it MUST resend those
927	   updates to all mirrors.
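
The client-side mirroring rule can be sketched as follows, assuming a hypothetical write(mirror) callable that raises OSError when the update to that mirror fails. A non-empty result means the client needs to return the layout with an error for each failed storage device.

```python
def write_all_mirrors(mirrors, write):
    """Send the same update to every mirror and collect the failures.

    Returns a list of (mirror, exception) pairs. An empty list means
    every mirrored copy was updated successfully.
    """
    failed = []
    for mirror in mirrors:
        try:
            write(mirror)
        except OSError as exc:
            # Keep going so the remaining mirrors are still updated.
            failed.append((mirror, exc))
    return failed
```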

929	8.3.  Metadata Server Resilvering of the File

931	   The metadata server may elect to create a new mirror of the layout
932	   segments at any time.  This might be to resilver a copy on a storage
933	   device which was down for servicing, to provide a copy of the layout
934	   segments on storage with different storage performance
935	   characteristics, etc.  As the client will not be aware of the new
936	   mirror and the metadata server will not be aware of updates that the
937	   client is making to the layout segments, the metadata server MUST
938	   recall the writable layout segment(s) that it is resilvering.  If the
939	   client issues a LAYOUTGET for a writable layout segment which is in
940	   the process of being resilvered, then the metadata server MUST deny
941	   that request with an NFS4ERR_LAYOUTTRYLATER.  The client can then
942	   perform the I/O through the metadata server.

944	9.  Flexible Files Layout Type Return

946	   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
947	   layout-type specific information to the server.  It is defined in
948	   [RFC5661] as follows:

950	   <CODE BEGINS>

952	   struct layoutreturn_file4 {
953	           offset4         lrf_offset;
954	           length4         lrf_length;
955	           stateid4        lrf_stateid;
956	           /* layouttype4 specific data */
957	           opaque          lrf_body<>;
958	   };

960	   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
961	           case LAYOUTRETURN4_FILE:
962	                   layoutreturn_file4      lr_layout;
963	           default:
964	                   void;
965	   };

967	   struct LAYOUTRETURN4args {
968	           /* CURRENT_FH: file */
969	           bool                    lora_reclaim;
970	           layoutreturn_stateid    lora_recallstateid;
971	           layouttype4             lora_layout_type;
972	           layoutiomode4           lora_iomode;
973	           layoutreturn4           lora_layoutreturn;
974	   };

976	   <CODE ENDS>
977	   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES, then the
978	   lrf_body opaque value is defined by ff_layoutreturn4 (See
979	   Section 9.3).  It allows the client to report I/O error information
980	   or layout usage statistics back to the metadata server as defined
981	   below.

983	9.1.  I/O Error Reporting

985	9.1.1.  ff_ioerr4

987	   <CODE BEGINS>

989	   /// struct ff_ioerr4 {
990	   ///         offset4        ffie_offset;
991	   ///         length4        ffie_length;
992	   ///         stateid4       ffie_stateid;
993	   ///         device_error4  ffie_errors<>;
994	   /// };
995	   ///

997	   <CODE ENDS>

999	   Recall that [NFSv42] defines device_error4 as:

1001	   <CODE BEGINS>

1003	   struct device_error4 {
1004	           deviceid4       de_deviceid;
1005	           nfsstat4        de_status;
1006	           nfs_opnum4      de_opnum;
1007	   };

1009	   <CODE ENDS>

1011	   The ff_ioerr4 structure is used to return error indications for data
1012	   files that generated errors during data transfers.  These are hints
1013	   to the metadata server that there are problems with that file.  For
1014	   each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length
1015	   represent the storage device and byte range within the file in which
1016	   the error occurred; ffie_errors represents the operation and type of
1017	   error.  The use of device_error4 is described in Section 15.6 of
1018	   [NFSv42].

1020	   Even though the storage device might be accessed via NFSv3 and
1021	   reports back NFSv3 errors to the client, the client is responsible
1022	   for mapping these to appropriate NFSv4 status codes as de_status.
1023	   Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4
1024	   operations.
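
A partial sketch of such a mapping is shown below. The numeric values are those defined in [RFC1813] and [RFC5661]. The table is deliberately incomplete, and the fallback to NFS4ERR_IO for unlisted status codes is an assumption of this sketch, not a rule from this document.

```python
NFS4ERR_IO = 5
NFS4ERR_DELAY = 10008

# nfsstat3 -> nfsstat4. Most codes keep the same numeric value.
V3_TO_V4_STATUS = {
    0: 0,              # NFS3_OK             -> NFS4_OK
    5: NFS4ERR_IO,     # NFS3ERR_IO          -> NFS4ERR_IO
    13: 13,            # NFS3ERR_ACCES       -> NFS4ERR_ACCESS
    28: 28,            # NFS3ERR_NOSPC       -> NFS4ERR_NOSPC
    30: 30,            # NFS3ERR_ROFS        -> NFS4ERR_ROFS
    69: 69,            # NFS3ERR_DQUOT       -> NFS4ERR_DQUOT
    70: 70,            # NFS3ERR_STALE       -> NFS4ERR_STALE
    10001: 10001,      # NFS3ERR_BADHANDLE   -> NFS4ERR_BADHANDLE
    10006: 10006,      # NFS3ERR_SERVERFAULT -> NFS4ERR_SERVERFAULT
    10008: NFS4ERR_DELAY,  # NFS3ERR_JUKEBOX -> NFS4ERR_DELAY
}

# NFSv3 procedure number -> NFSv4 operation number.
V3_PROC_TO_V4_OP = {
    6: 25,   # NFSPROC3_READ   -> OP_READ
    7: 38,   # NFSPROC3_WRITE  -> OP_WRITE
    21: 5,   # NFSPROC3_COMMIT -> OP_COMMIT
}


def map_v3_error(v3_status, v3_proc):
    """Translate an NFSv3 result into de_status and de_opnum values."""
    return {
        # ASSUMPTION: unlisted NFSv3 errors are reported as NFS4ERR_IO.
        "de_status": V3_TO_V4_STATUS.get(v3_status, NFS4ERR_IO),
        "de_opnum": V3_PROC_TO_V4_OP[v3_proc],
    }
```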

1026	9.2.  Layout Usage Statistics

1028	9.2.1.  ff_io_latency4

1030	   <CODE BEGINS>

1032	   /// struct ff_io_latency4 {
1033	   ///         nfstime4       ffil_min;
1034	   ///         nfstime4       ffil_max;
1035	   ///         nfstime4       ffil_avg;
1036	   ///         uint32_t       ffil_count;
1037	   /// };
1038	   ///

1040	   <CODE ENDS>

1042	   When determining latencies, the client can collect the minimum via
1043	   ffil_min, the maximum via ffil_max, and the average via ffil_avg.
1044	   Further, ffil_count relates how many data points were collected in
1045	   the reported period.
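
A minimal sketch of how a client might fill in ff_io_latency4 from its own measurements. It assumes the client records each latency sample in nanoseconds, and it models nfstime4 as a (seconds, nseconds) tuple. Both choices and the function names are illustrative.

```python
NSEC_PER_SEC = 1_000_000_000


def to_nfstime4(nanoseconds):
    """Split a nanosecond count into (seconds, nseconds) like nfstime4."""
    return divmod(nanoseconds, NSEC_PER_SEC)


def summarize_latencies(samples_ns):
    """Build an ff_io_latency4-like dict from latency samples."""
    if not samples_ns:
        return None  # nothing was measured in the reported period
    count = len(samples_ns)
    return {
        "ffil_min": to_nfstime4(min(samples_ns)),
        "ffil_max": to_nfstime4(max(samples_ns)),
        "ffil_avg": to_nfstime4(sum(samples_ns) // count),
        "ffil_count": count,
    }
```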

1047	9.2.2.  ff_layoutupdate4

1049	   <CODE BEGINS>

1051	   /// struct ff_layoutupdate4 {
1052	   ///         netaddr4       ffl_addr;
1053	   ///         nfs_fh4        ffl_fhandle;
1054	   ///         ff_io_latency4 ffl_read;
1055	   ///         ff_io_latency4 ffl_write;
1056	   ///         nfstime4       ffl_duration;
1057	   ///         bool           ffl_local;
1058	   /// };
1059	   ///

1061	   <CODE ENDS>

1063	   ffl_addr differentiates which network address the client connected to
1064	   on the storage device.  In the case of multipathing, ffl_fhandle
1065	   indicates which read-only copy was selected.  ffl_read and ffl_write
1066	   convey the latencies respectively for both read and write operations.
1067	   ffl_duration is used to indicate the time period over which the
1068	   statistics were collected.  ffl_local if true indicates that the I/O
1069	   was serviced by the client's cache.  This flag allows the client to
1070	   inform the metadata server about "hot" access to a file it would not
1071	   normally be allowed to report on.

1073	9.2.3.  ff_iostats4

1075	   <CODE BEGINS>

1077	   /// struct ff_iostats4 {
1078	   ///         offset4           ffis_offset;
1079	   ///         length4           ffis_length;
1080	   ///         stateid4          ffis_stateid;
1081	   ///         io_info4          ffis_read;
1082	   ///         io_info4          ffis_write;
1083	   ///         deviceid4         ffis_deviceid;
1084	   ///         ff_layoutupdate4  ffis_layoutupdate;
1085	   /// };
1086	   ///

1088	   <CODE ENDS>

1090	   Recall that [NFSv42] defines io_info4 as:

1092	   <CODE BEGINS>

1094	   struct io_info4 {
1095	           uint32_t        ii_count;
1096	           uint64_t        ii_bytes;
1097	   };

1099	   <CODE ENDS>

1101	   With pNFS, the data transfers are performed directly between the pNFS
1102	   client and the storage devices.  Therefore, the metadata server has
1103	   no visibility to the I/O stream and cannot use any statistical
1104	   information about client I/O to optimize data storage location.
1105	   ff_iostats4 MAY be used by the client to report I/O statistics back
1106	   to the metadata server upon returning the layout.  Since it is
1107	   infeasible for the client to report every I/O that used the layout,
1108	   the client MAY identify "hot" byte ranges for which to report I/O
1109	   statistics.  The definition and/or configuration mechanism of what is
1110	   considered "hot" and the size of the reported byte range is out of
1111	   the scope of this document.  It is suggested that client
1112	   implementations provide reasonable default values and an optional
1113	   run-time management interface to control these parameters.  For
1114	   example, a client can define the default byte range resolution to be
1115	   1 MB in size and the thresholds for reporting to be 1 MB/second or 10
1116	   I/O operations per second.  For each byte range, ffis_offset and
1117	   ffis_length represent the starting offset of the range and the range
1118	   length in bytes.  ffis_read.ii_count, ffis_read.ii_bytes,
1119	   ffis_write.ii_count, and ffis_write.ii_bytes represent, respectively,
1120	   the number of contiguous read and write I/Os and the respective
1121	   aggregate number of bytes transferred within the reported byte range.

1123	   The combination of ffis_deviceid and ffl_addr uniquely identifies
1124	   both the storage path and the network route to it.  Finally, the
1125	   ffl_fhandle allows the metadata server to differentiate between
1126	   multiple read-only copies of the file on the same storage device.

1128	9.3.  ff_layoutreturn4

1130	   <CODE BEGINS>

1132	   /// struct ff_layoutreturn4 {
1133	   ///         ff_ioerr4     fflr_ioerr_report<>;
1134	   ///         ff_iostats4   fflr_iostats_report<>;
1135	   /// };
1136	   ///

1138	   <CODE ENDS>

1140	   When data file I/O operations fail, fflr_ioerr_report<> is used to
1141	   report these errors to the metadata server as an array of elements of
1142	   type ff_ioerr4.  Each element in the array represents an error that
1143	   occurred on the data file identified by ffie_errors.de_deviceid.  If
1144	   no errors are to be reported, the size of the fflr_ioerr_report<>
1145	   array is set to zero.  The client MAY also use fflr_iostats_report<>
1146	   to report a list of I/O statistics as an array of elements of type
1147	   ff_iostats4.  Each element in the array represents statistics for a
1148	   particular byte range.  Byte ranges are not guaranteed to be disjoint
1149	   and MAY repeat or intersect.

1151	10.  Flexible Files Layout Type LAYOUTERROR

1153	   If the client is using NFSv4.2 to communicate with the metadata
1154	   server, then instead of waiting for a LAYOUTRETURN to send error
1155	   information to the metadata server (see Section 9.1), it can use
1156	   LAYOUTERROR (see Section 15.6 of [NFSv42]) to communicate that
1157	   information.  For the Flexible Files Layout Type, this means that
1158	   LAYOUTERROR4args is treated the same as ff_ioerr4.

1160	11.  Flexible Files Layout Type LAYOUTSTATS

1162	   If the client is using NFSv4.2 to communicate with the metadata
1163	   server, then instead of waiting for a LAYOUTRETURN to send I/O
1164	   statistics to the metadata server (see Section 9.2), it can use
1165	   LAYOUTSTATS (see Section 15.7 of [NFSv42]) to communicate that
1166	   information.  For the Flexible Files Layout Type, this means that
1167	   LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
1168	   contents as in ffis_layoutupdate.

1170	12.  Flexible File Layout Type Creation Hint

1172	   The layouthint4 type is defined in [RFC5661] as follows:

1174	   <CODE BEGINS>

1176	   struct layouthint4 {
1177	       layouttype4           loh_type;
1178	       opaque                loh_body<>;
1179	   };

1181	   <CODE ENDS>

1183	   The layouthint4 structure is used by the client to pass a hint about
1184	   the type of layout it would like created for a particular file.  If
1185	   the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
1186	   opaque value is defined by the ff_layouthint4 type.

1188	12.1.  ff_layouthint4

1190	   <CODE BEGINS>

1192	   /// union ff_mirrors_hint switch (bool ffmc_valid) {
1193	   ///     case TRUE:
1194	   ///         uint32_t    ffmc_mirrors;
1195	   ///     case FALSE:
1196	   ///         void;
1197	   /// };
1198	   ///

1200	   /// struct ff_layouthint4 {
1201	   ///     ff_mirrors_hint fflh_mirrors_hint;
1202	   /// };
1203	   ///

1205	   <CODE ENDS>

1207	   This type conveys hints for the desired data map.  All parameters are
1208	   optional so the client can give values for only the parameters it
1209	   cares about.

1211	13.  Recalling Layouts

1213	   The Flexible File Layout Type metadata server should recall
1214	   outstanding layouts in the following cases:

1216	   o  When the file's security policy changes, i.e., Access Control
1217	      Lists (ACLs) or permission mode bits are set.

1219	   o  When the file's layout changes, rendering outstanding layouts
1220	      invalid.

1222	   o  When there are sharing conflicts.

1224	13.1.  CB_RECALL_ANY

1226	   The metadata server can use the CB_RECALL_ANY callback operation to
1227	   notify the client to return some or all of its layouts.  The
1228	   [RFC5661] defines the following types:

1230	   <CODE BEGINS>

1232	   const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = -2;
1233	   const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = -1;
1234	   [[RFC Editor: please insert assigned constants]]

1236	   struct  CB_RECALL_ANY4args      {
1237	       uint32_t        craa_layouts_to_keep;
1238	       bitmap4         craa_type_mask;
1239	   };

1241	   <CODE ENDS>

1243	   [[AI13: No, 5661 does not define these above values.  The ask here is
1244	   to create these and _add_ them to 5661.  --TH]]

1246	   Typically, CB_RECALL_ANY will be used to recall client state when the
1247	   server needs to reclaim resources.  The craa_type_mask bitmap
1248	   specifies the type of resources that are recalled and the
1249	   craa_layouts_to_keep value specifies how many of the recalled
1250	   Flexible File Layouts the client is allowed to keep.  The Flexible
1251	   File Layout Type mask flags are defined as follows:

1253	   <CODE BEGINS>
1254	   /// enum ff_cb_recall_any_mask {
1255	   ///     FF_RCA4_TYPE_MASK_READ = -2,
1256	   ///     FF_RCA4_TYPE_MASK_RW   = -1
1257	   [[RFC Editor: please insert assigned constants]]
1258	   /// };
1259	   ///

1261	   <CODE ENDS>

1263	   They represent the iomode of the recalled layouts.  In response, the
1264	   client SHOULD return layouts of the recalled iomode that it needs the
1265	   least, keeping at most craa_layouts_to_keep Flexible File Layouts.

1267	   The FF_RCA4_TYPE_MASK_READ flag notifies the client to return
1268	   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
1269	   FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
1270	   of iomode LAYOUTIOMODE4_RW.  When both mask flags are set, the client
1271	   is notified to return layouts of either iomode.
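
One way a client could act on such a recall is sketched below. Several things here are assumptions: the recalled iomodes are passed in as a set of strings rather than decoded from craa_type_mask (the bit assignments are not yet made), "needs the least" is approximated by least recently used, and craa_layouts_to_keep is applied to the layouts of the recalled iomodes only.

```python
from dataclasses import dataclass


@dataclass
class HeldLayout:
    """Hypothetical client-side record of one Flexible File Layout."""
    name: str
    iomode: str       # "READ" or "RW"
    last_used: float  # timestamp of the most recent I/O using it


def layouts_to_return(held, recalled_iomodes, layouts_to_keep):
    """Choose which layouts of the recalled iomodes to hand back."""
    recalled = [l for l in held if l.iomode in recalled_iomodes]
    # Most recently used first, so the kept layouts are the freshest.
    recalled.sort(key=lambda l: l.last_used, reverse=True)
    return recalled[layouts_to_keep:]
```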

1273	14.  Client Fencing

1275	   In cases where clients are uncommunicative and their lease has
1276	   expired or when clients fail to return recalled layouts within a
1277	   lease period, at the least the server MAY revoke client layouts and/
1278	   or device address mappings and reassign these resources to other
1279	   clients (see "Recalling a Layout" in [RFC5661]).  To avoid data
1280	   corruption, the metadata server MUST fence off the revoked clients
1281	   from the respective data files as described in Section 2.2.
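
   The revocation conditions and the fencing requirement can be
   sketched as follows.  This is an illustration under stated
   assumptions only: ClientState, should_revoke, and the fence callback
   are hypothetical names, and the actual fencing mechanism is the one
   described in Section 2.2, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClientState:
    client_id: str
    lease_expiry: float                     # time at which the lease lapses
    recall_sent_at: Optional[float] = None  # unanswered layout recall, if any


def should_revoke(client: ClientState, now: float,
                  lease_period: float) -> bool:
    """True when either revocation condition from the text holds."""
    lease_expired = now > client.lease_expiry
    recall_overdue = (client.recall_sent_at is not None
                      and now - client.recall_sent_at > lease_period)
    return lease_expired or recall_overdue


def revoke_and_fence(clients: List[ClientState], now: float,
                     lease_period: float,
                     fence: Callable[[str], None]) -> List[str]:
    """Revoke eligible clients, fencing each one before it is reported
    as revoked so its resources are not reassigned while it can still
    reach the data files."""
    revoked: List[str] = []
    for client in clients:
        if should_revoke(client, now, lease_period):
            fence(client.client_id)
            revoked.append(client.client_id)
    return revoked
```

   Revocation itself is optional for the server (MAY); the point of the
   sketch is only that once a client is revoked, fencing is not.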

1283	15.  Security Considerations

1285	   The pNFS extension partitions the NFSv4.1+ file system protocol into
1286	   two parts, the control path and the data path (storage protocol).
1287	   The control path contains all the new operations described by this
1288	   extension; all existing NFSv4 security mechanisms and features apply
1289	   to the control path.  The combination of components in a pNFS system
1290	   is required to preserve the security properties of NFSv4.1+ with
1291	   respect to an entity accessing data via a client, including security
1292	   countermeasures to defend against threats that NFSv4.1+ provides
1293	   defenses for in environments where these threats are considered
1294	   significant.

1296	   The metadata server enforces the file access-control policy at
1297	   LAYOUTGET time.  The client should use suitable authorization
1298	   credentials for getting the layout for the requested iomode (READ or
1299	   RW), and the server verifies the permissions and ACL for these
1300	   credentials, possibly returning NFS4ERR_ACCESS if the client is not
1301	   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
1302	   the client receives, as part of the layout, a set of credentials
1303	   allowing it I/O access to the specified data files corresponding to
1304	   the requested iomode.  When the client acts on I/O operations on
1305	   behalf of its local users, it MUST authenticate and authorize the
1306	   user by issuing respective OPEN and ACCESS calls to the metadata
1307	   server, similar to having NFSv4 data delegations.  If access is
1308	   allowed, the client uses the corresponding (READ or RW) credentials
1309	   to perform the I/O operations at the data file's storage devices.
1310	   When the metadata server receives a request to change a file's
1311	   permissions or ACL, it SHOULD recall all layouts for that file, and
1312	   it MUST fence off the clients holding outstanding layouts for that
1313	   file by implicitly invalidating the outstanding credentials on all
1314	   data files comprising the file before committing to the new
1315	   permissions and ACL.  Doing this ensures that clients re-authorize
1316	   their layouts according to the modified permissions and ACL by
1317	   requesting new layouts.  Recalling the layouts in this case is a
1318	   courtesy of the server, intended to prevent clients from getting an
1319	   error on I/Os done after they were fenced off.
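
   The ordering required when permissions or an ACL change can be
   sketched as follows.  This is a minimal illustration, not a
   definitive implementation: apply_permission_change and its three
   callbacks (recall, invalidate, commit) are hypothetical names
   standing in for whatever a metadata server uses internally.

```python
from typing import Callable, Dict, List


def apply_permission_change(
    data_files: List[str],
    layout_holders: List[str],
    new_acl: Dict[str, str],
    recall: Callable[[str], None],
    invalidate: Callable[[str], None],
    commit: Callable[[Dict[str, str]], None],
) -> None:
    """Recall, then fence, then commit, in that order."""
    # SHOULD: recall layouts as a courtesy so that well-behaved
    # clients stop issuing I/O before it starts failing.
    for client in layout_holders:
        recall(client)
    # MUST: invalidate the outstanding credentials on every data file
    # before the new permissions take effect.
    for data_file in data_files:
        invalidate(data_file)
    # Only now commit the new permissions and ACL.
    commit(new_acl)
```

   Because the commit happens last, a client can never use credentials
   that were granted under the old permissions once the new ones are
   visible; it has to request a new layout and be re-authorized.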

1321	15.1.  Kerberized File Access

1323	15.1.1.  Loosely Coupled

1325	   Under this coupling model, the principal used to authenticate the
1326	   metadata file is different from that used to authenticate the data
1327	   file.  As a result, the synthetic principals generated to control
1328	   access to the data file could prove difficult to manage.

1330	   While RPCSEC_GSS version 3 (RPCSEC_GSSv3) [rpcsec_gssv3] could be
1331	   used to authorize the client to the storage device on behalf of the
1332	   metadata server, such a requirement exceeds the loose coupling model.
1333	   I.e., each of the metadata server, storage device, and client would
1334	   have to implement RPCSEC_GSSv3.

1336	   In all, either an elaborate scheme could be used to automatically
1337	   authenticate principals, or RPCSEC_GSSv3-aware clients, metadata
1338	   servers, and storage devices could be deployed.  If more secure
1339	   authentication is desired, however, tight coupling should be
1340	   considered, as described in the next section.

1342	15.1.2.  Tightly Coupled

1344	   With tight coupling, the principal used to access the metadata file
1345	   is exactly the same as used to access the data file.  Thus there are
1346	   no security issues related to using Kerberos with a tightly coupled
1347	   system.

1349	16.  IANA Considerations

1351	   As described in [RFC5661], new layout type numbers have been assigned
1352	   by IANA.  This document defines the protocol associated with the
1353	   existing layout type number, LAYOUT4_FLEX_FILES.

1355	17.  References

1357	17.1.  Normative References

1359	   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF Documents",
1360	              November 2008, <http://trustee.ietf.org/docs/
1361	              IETF-Trust-License-Policy.pdf>.

1363	   [NFSv42]   Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf-
1364	              nfsv4-minorversion2-28 (Work In Progress), November 2014.

1366	   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
1367	              Version 3 Protocol Specification", RFC 1813, June 1995.

1369	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1370	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1372	   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
1373	              STD 67, RFC 4506, May 2006.

1375	   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
1376	              Specification Version 2", RFC 5531, May 2009.

1378	   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
1379	              "Network File System (NFS) Version 4 Minor Version 1
1380	              Protocol", RFC 5661, January 2010.

1382	   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
1383	              "Network File System (NFS) Version 4 Minor Version 1
1384	              External Data Representation Standard (XDR) Description",
1385	              RFC 5662, January 2010.

1387	   [RFCNFSv4]
1388	              Haynes, T. and D. Noveck, "NFS Version 4 Protocol", draft-
1389	              ietf-nfsv4-rfc3530bis-35 (work in progress), Dec 2014.

1391	   [pNFSLayouts]
1392	              Haynes, T., "Considerations for a New pNFS Layout Type",
1393	              draft-ietf-nfsv4-layout-types-02 (Work In Progress),
1394	              October 2014.

1396	17.2.  Informative References

1398	   [rpcsec_gssv3]
1399	              Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
1400	              Security Version 3", November 2014.

1402	Appendix A.  Acknowledgments

1404	   Those who provided miscellaneous comments on early drafts of this
1405	   document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
1406	   and Lev Solomonov.

1408	   Those who provided miscellaneous comments on the final drafts of this
1409	   document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
1410	   Sundharraj, and Trond Myklebust.

1412	   Idan Kedar caught a nasty bug in the interaction of client side
1413	   mirroring and the minor versioning of devices.

1415	   Dave Noveck provided a comprehensive review of the document during
1416	   the working group last call.

1418	   Olga Kornievskaia led the charge against the use of a credential
1419	   versus a principal in the fencing approach.  Andy Adamson and
1420	   Benjamin Kaduk helped to sharpen the focus.

1422	Appendix B.  RFC Editor Notes

1424	   [RFC Editor: please remove this section prior to publishing this
1425	   document as an RFC]

1427	   [RFC Editor: prior to publishing this document as an RFC, please
1428	   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
1429	   RFC number of this document]

1431	Authors' Addresses

1433	   Benny Halevy

1435	   Email: bhalevy@gmail.com

1436	   Thomas Haynes
1437	   Primary Data, Inc.
1438	   4300 El Camino Real Ste 100
1439	   Los Altos, CA  94022
1440	   USA

1442	   Phone: +1 408 215 1519
1443	   Email: thomas.haynes@primarydata.com