NFSv4                                                         S. Shepler
Internet-Draft                                                    Editor
Expires: June 15, 2006                                 December 12, 2005

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 15, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).
Abstract

   This Internet-Draft describes the NFSv4 minor version 1 protocol
   extensions.  The most significant of these extensions are commonly
   known as Sessions, Directory Delegations, and parallel NFS (pNFS).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Security Negotiation
   2.  Clarification of Security Negotiation in NFSv4.1
     2.1  PUTFH + LOOKUP
     2.2  PUTFH + LOOKUPP
     2.3  PUTFH + SECINFO
     2.4  PUTFH + Anything Else
   3.  NFSv4.1 Sessions
     3.1  Sessions Background
       3.1.1  Introduction to Sessions
       3.1.2  Motivation
       3.1.3  Problem Statement
       3.1.4  NFSv4 Session Extension Characteristics
     3.2  Transport Issues
       3.2.1  Session Model
       3.2.2  Connection State
       3.2.3  NFSv4 Channels, Sessions and Connections
       3.2.4  Reconnection, Trunking and Failover
       3.2.5  Server Duplicate Request Cache
     3.3  Session Initialization and Transfer Models
       3.3.1  Session Negotiation
       3.3.2  RDMA Requirements
       3.3.3  RDMA Connection Resources
       3.3.4  TCP and RDMA Inline Transfer Model
       3.3.5  RDMA Direct Transfer Model
     3.4  Connection Models
       3.4.1  TCP Connection Model
       3.4.2  Negotiated RDMA Connection Model
       3.4.3  Automatic RDMA Connection Model
     3.5  Buffer Management, Transfer, Flow Control
     3.6  Retry and Replay
     3.7  The Back Channel
     3.8  COMPOUND Sizing Issues
     3.9  Data Alignment
     3.10  NFSv4 Integration
       3.10.1  Minor Versioning
       3.10.2  Slot Identifiers and Server Duplicate Request Cache
       3.10.3  COMPOUND and CB_COMPOUND
       3.10.4  eXternal Data Representation Efficiency
       3.10.5  Effect of Sessions on Existing Operations
       3.10.6  Authentication Efficiencies
     3.11  Sessions Security Considerations
       3.11.1  Authentication
   4.  Directory Delegations
     4.1  Introduction to Directory Delegations
     4.2  Directory Delegation Design (in brief)
     4.3  Recommended Attributes in support of Directory Delegations
     4.4  Delegation Recall
     4.5  Delegation Recovery
   5.  Introduction
   6.  General Definitions
     6.1  Metadata Server
     6.2  Client
     6.3  Storage Device
     6.4  Storage Protocol
     6.5  Control Protocol
     6.6  Metadata
     6.7  Layout
   7.  pNFS protocol semantics
     7.1  Definitions
       7.1.1  Layout Types
       7.1.2  Layout Iomode
       7.1.3  Layout Segments
       7.1.4  Device IDs
       7.1.5  Aggregation Schemes
     7.2  Guarantees Provided by Layouts
     7.3  Getting a Layout
     7.4  Committing a Layout
       7.4.1  LAYOUTCOMMIT and mtime/atime/change
       7.4.2  LAYOUTCOMMIT and size
       7.4.3  LAYOUTCOMMIT and layoutupdate
     7.5  Recalling a Layout
       7.5.1  Basic Operation
       7.5.2  Recall Callback Robustness
       7.5.3  Recall/Return Sequencing
     7.6  Metadata Server Write Propagation
     7.7  Crash Recovery
       7.7.1  Leases
       7.7.2  Client Recovery
       7.7.3  Metadata Server Recovery
       7.7.4  Storage Device Recovery
   8.  Security Considerations
     8.1  File Layout Security
     8.2  Object Layout Security
     8.3  Block/Volume Layout Security
   9.  The NFSv4 File Layout Type
     9.1  File Striping and Data Access
       9.1.1  Sparse and Dense Storage Device Data Layouts
       9.1.2  Metadata and Storage Device Roles
       9.1.3  Device Multipathing
       9.1.4  Operations Issued to Storage Devices
     9.2  Global Stateid Requirements
     9.3  The Layout Iomode
     9.4  Storage Device State Propagation
       9.4.1  Lock State Propagation
       9.4.2  Open-mode Validation
       9.4.3  File Attributes
     9.5  Storage Device Component File Size
     9.6  Crash Recovery Considerations
     9.7  Security Considerations
     9.8  Alternate Approaches
   10.  pNFS Typed Data Structures
     10.1  pnfs_layouttype4
     10.2  pnfs_deviceid4
     10.3  pnfs_deviceaddr4
     10.4  pnfs_devlist_item4
     10.5  pnfs_layout4
     10.6  pnfs_layoutupdate4
     10.7  pnfs_layouthint4
     10.8  pnfs_layoutiomode4
   11.  pNFS File Attributes
     11.1  pnfs_layouttype4<> FS_LAYOUT_TYPES
     11.2  pnfs_layouttype4<> FILE_LAYOUT_TYPES
     11.3  pnfs_layouthint4 FILE_LAYOUT_HINT
     11.4  uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE
     11.5  uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT
   12.  pNFS Error Definitions
   13.  Layouts and Aggregation
     13.1  Simple Map
     13.2  Block Extent Map
     13.3  Striped Map (RAID 0)
     13.4  Replicated Map
     13.5  Concatenated Map
     13.6  Nested Map
   14.  NFSv4.1 Operations
     14.1  LOOKUPP - Lookup Parent Directory
     14.2  SECINFO - Obtain Available Security
     14.3  SECINFO_NO_NAME - Get Security on Unnamed Object
     14.4  CREATECLIENTID - Instantiate Clientid
     14.5  CREATESESSION - Create New Session and Confirm Clientid
     14.6  BIND_BACKCHANNEL - Create a callback channel binding
     14.7  DESTROYSESSION - Destroy existing session
     14.8  SEQUENCE - Supply per-procedure sequencing and control
     14.9  CB_RECALLCREDIT - change flow control limits
     14.10  CB_SEQUENCE - Supply callback channel sequencing and control
     14.11  GET_DIR_DELEGATION - Get a directory delegation
     14.12  CB_NOTIFY - Notify directory changes
     14.13  CB_RECALL_ANY - Keep any N delegations
     14.14  LAYOUTGET - Get Layout Information
     14.15  LAYOUTCOMMIT - Commit writes made using a layout
     14.16  LAYOUTRETURN - Release Layout Information
     14.17  GETDEVICEINFO - Get Device Information
     14.18  GETDEVICELIST - Get List of Devices
     14.19  CB_LAYOUTRECALL
     14.20  CB_SIZECHANGED
   15.  References
     15.1  Normative References
     15.2  Informative References
   Author's Address
   A.  Acknowledgments
   Intellectual Property and Copyright Statements

1.  Security Negotiation

   The NFSv4.0 specification contains three oversights and ambiguities
   with respect to the SECINFO operation.

   First, it is impossible for the client to use the SECINFO operation
   to determine the correct security triple for accessing a parent
   directory.  This is because SECINFO takes as arguments the current
   file handle and a component name.  However, NFSv4.0 uses the
   LOOKUPP operation to get the parent directory of the current file
   handle.  If the client uses the wrong security when issuing the
   LOOKUPP, and gets back an NFS4ERR_WRONGSEC error, SECINFO is
   useless to the client.  The client is left to guess which security
   the server will accept.  This defeats the purpose of SECINFO, which
   was to provide an efficient method of negotiating security.

   Second, there is ambiguity as to what the server should do when it
   is passed a LOOKUP operation such that the server restricts access
   to the current file handle with one security triple, and access to
   the component with a different triple, and the remote procedure
   call uses one of the two security triples.  Should the server allow
   the LOOKUP?

   Third, there is a problem as to what the client must do (or can
   do), whenever the server returns NFS4ERR_WRONGSEC in response to a
   PUTFH operation.  The NFSv4.0 specification says that the client
   should issue a SECINFO using the parent filehandle and the
   component name of the filehandle that PUTFH was issued with.  This
   may not be convenient for the client.

   This document resolves the above three issues in the context of
   NFSv4.1.

2.  Clarification of Security Negotiation in NFSv4.1

   This section attempts to clarify NFSv4.1 security negotiation
   issues.  Unless noted otherwise, for any mention of PUTFH in this
   section, the reader should interpret it as applying to PUTROOTFH
   and PUTPUBFH in addition to PUTFH.

2.1  PUTFH + LOOKUP

   The server implementation may decide whether to impose any
   restrictions on export security administration.  There are at least
   three approaches (Sc is the flavor set of the child export, Sp that
   of the parent):

   a) Sc <= Sp (<= for subset)

   b) Sc ^ Sp != {} (^ for intersection, {} for the empty set)

   c) free form

   To support b (when the client chooses a flavor that is not a member
   of Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of a
   security mismatch.  Instead, the error should be returned from the
   LOOKUP that follows.

   Since the above guideline does not contradict a, it should be
   followed in general.
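   To make the consequence of this rule concrete, the following C
   fragment sketches the server-side check under stated assumptions:
   the flavor sets are plain arrays, the helper and structure names
   are hypothetical, and only the WRONGSEC decision is modeled.  The
   point is that PUTFH accepts the filehandle unconditionally, and the
   flavor check happens in the LOOKUP that follows.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      #define NFS4_OK           0
      #define NFS4ERR_WRONGSEC  10016   /* value from RFC 3530 */

      struct export_info {
          const uint32_t *flavors;   /* flavors admitted for the export */
          size_t          nflavors;
      };

      static bool flavor_allowed(const struct export_info *exp,
                                 uint32_t flavor)
      {
          for (size_t i = 0; i < exp->nflavors; i++)
              if (exp->flavors[i] == flavor)
                  return true;
          return false;
      }

      /* PUTFH: no flavor check here, so cases b and c stay possible. */
      static int op_putfh(uint32_t rpc_flavor)
      {
          (void)rpc_flavor;          /* deliberately not checked */
          return NFS4_OK;
      }

      /* LOOKUP: the mismatch, if any, is reported here instead. */
      static int op_lookup(const struct export_info *child,
                           uint32_t rpc_flavor)
      {
          return flavor_allowed(child, rpc_flavor) ? NFS4_OK
                                                   : NFS4ERR_WRONGSEC;
      }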
2.2  PUTFH + LOOKUPP

   Since SECINFO only works its way down, there is no way LOOKUPP can
   return NFS4ERR_WRONGSEC without the server implementing
   SECINFO_NO_NAME.  SECINFO_NO_NAME solves this issue because, via
   style "parent", it works in the opposite direction from SECINFO
   (the component name is implicit in this case).

2.3  PUTFH + SECINFO

   This case should be treated specially.

   A security-sensitive client should be allowed to choose a strong
   flavor when querying a server to determine a file object's
   permitted security flavors.  The security flavor chosen by the
   client does not have to be included in the flavor list of the
   export.  Of course, the server has to be configured for whatever
   flavor the client selects; otherwise the request will fail at RPC
   authentication.

   In theory, there is no connection between the security flavor used
   by SECINFO and those supported by the export.  In practice,
   however, the client may start looking for strong flavors among
   those supported by the export, followed by those in the mandatory
   set.
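   A minimal sketch of that client-side heuristic follows; the
   preference table and function name are illustrative assumptions (a
   real client would rank RPCSEC_GSS services such as krb5p and krb5i
   rather than bare flavor numbers):

      #include <stddef.h>
      #include <stdint.h>

      #define AUTH_SYS    1   /* RFC 1831 */
      #define RPCSEC_GSS  6   /* RFC 2203 */

      /* Ordered from strongest to weakest. */
      static const uint32_t preference[] = { RPCSEC_GSS, AUTH_SYS };

      /* Pick the flavor to protect the SECINFO call itself. */
      static uint32_t pick_secinfo_flavor(const uint32_t *exported,
                                          size_t n)
      {
          size_t nprefs = sizeof(preference) / sizeof(preference[0]);

          for (size_t p = 0; p < nprefs; p++)
              for (size_t i = 0; i < n; i++)
                  if (exported[i] == preference[p])
                      return preference[p];
          return RPCSEC_GSS;   /* fall back to the mandatory set */
      }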
2.4  PUTFH + Anything Else

   PUTFH must return NFS4ERR_WRONGSEC in case of a security mismatch.
   This is the most straightforward approach, avoiding the need to add
   NFS4ERR_WRONGSEC to every other operation.

   PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the
   client to recover from NFS4ERR_WRONGSEC.

3.  NFSv4.1 Sessions

3.1  Sessions Background

3.1.1  Introduction to Sessions

   This draft proposes extensions to NFS version 4 [RFC3530] enabling
   it to support sessions and endpoint management, and to support
   operation atop RDMA-capable RPC over transports such as iWARP
   [RDMAP, DDP].  These extensions enable support for exactly-once
   semantics by NFSv4 servers, multipathing and trunking of transport
   connections, and enhanced security.  The ability to operate over
   RDMA enables greatly enhanced performance.  Operation over existing
   TCP is enhanced as well.

   While discussed here with respect to IETF-chartered transports, the
   proposed protocol is intended to function over other standards,
   such as Infiniband [IB].

   The following are the major aspects of this proposal:

   Changes are proposed within the framework of NFSv4 minor
   versioning.  RPC, XDR, and the NFSv4 procedures and operations are
   preserved.  The proposed extension functions equally well over
   existing transports and RDMA, and interoperates transparently with
   existing implementations, both at the local programmatic interface
   and over the wire.

   An explicit session is introduced to NFSv4, and new operations are
   added to support it.  The session allows for enhanced trunking,
   failover and recovery, and authentication efficiency, along with
   necessary support for RDMA.  The session is implemented as
   operations within NFSv4 COMPOUND and does not impact layering or
   interoperability with existing NFSv4 implementations.  The NFSv4
   callback channel is dynamically associated and is connected by the
   client and not the server, enhancing security and operation through
   firewalls.  In fact, the callback channel will be enabled to share
   the same connection as the operations channel.

   An enhanced RPC layer enables NFSv4 operation atop RDMA.  The
   session assists RDMA-mode connection, and additional facilities are
   provided for managing RDMA resources at both NFSv4 server and
   client.  Existing NFSv4 operations continue to function as before,
   though certain size limits are negotiated.  A companion draft to
   this document, "RDMA Transport for ONC RPC" [RPCRDMA], is to be
   referenced for details of RPC RDMA support.

   Support for exactly-once semantics ("EOS") is enabled by the new
   session facilities, by providing to the server a way to bound the
   size of the duplicate request cache for a single client, and to
   manage its persistent storage.

   Block Diagram

      +-----------------+-------------------------------------+
      | NFSv4           | NFSv4 + session extensions          |
      +-----------------+------+----------------+-------------+
      | Operations             | Session        |             |
      +------------------------+----------------+             |
      | RPC/XDR                                 |             |
      +-------------------------------+---------+             |
      | Stream Transport              | RDMA Transport        |
      +-------------------------------+-----------------------+

3.1.2  Motivation

   NFS version 4 [RFC3530] has been granted "Proposed Standard"
   status.  The NFSv4 protocol was developed along several design
   points, important among them: effective operation over wide-area
   networks, including the Internet itself; strong security integrated
   into the protocol; extensive cross-platform interoperability,
   including integrated locking semantics compatible with multiple
   operating systems; and protocol extensibility.

   The NFS version 4 protocol, however, does not provide support for
   certain important transport aspects.  For example, the protocol
   does not address response caching, which is required to provide
   correctness for retried client requests across a network partition,
   nor does it provide an interoperable way to support trunking and
   multipathing of connections.  This leads to inefficiencies,
   especially where trunking and multipathing are concerned, and
   presents additional difficulties in supporting RDMA fabrics, in
   which endpoints may require dedicated or specialized resources.
   Sessions can be employed to unify NFS-level constructs such as the
   clientid with transport-level constructs such as transport
   endpoints.  Each transport endpoint draws on resources via its
   membership in a session.  Resource management can be more strictly
   maintained, leading to greater server efficiency in implementing
   the protocol.  The enhanced operation over a session affords an
   opportunity to the server to implement a highly reliable duplicate
   request cache, and thereby export exactly-once semantics.

   NFSv4 advances the state of high-performance local sharing, by
   virtue of its integrated security, locking, and delegation, and its
   excellent coverage of the sharing semantics of multiple operating
   systems.  It is precisely this environment where exactly-once
   semantics become a fundamental requirement.

   Additionally, efforts to standardize a set of protocols for Remote
   Direct Memory Access (RDMA) over the Internet Protocol Suite have
   made significant progress.  RDMA is a general solution to the
   problem of CPU overhead incurred due to data copies, primarily at
   the receiver.  Substantial research has addressed this and has
   borne out the efficacy of the approach.  An overview of this is in
   the RDDP Problem Statement document [RDDPPS].

   Numerous upper layer protocols achieve extremely high bandwidth and
   low overhead through the use of RDMA.  Products from a wide variety
   of vendors employ RDMA to advantage, and prototypes have
   demonstrated the effectiveness of many more.  Here, we are
   concerned specifically with NFS and NFS-style upper layer
   protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu
   Prime Software Technologies [FJNFS, FJDAFS] and Harvard University
   [KM02] are all relevant.
   By layering a session binding for NFS version 4 directly atop a
   standard RDMA transport, a greatly enhanced level of performance
   and transparency can be supported on a wide variety of operating
   system platforms.  These combined capabilities alter the landscape
   between local filesystems and network attached storage, enable a
   new level of performance, and lead new classes of application to
   take advantage of NFS.

3.1.3  Problem Statement

   Two issues drive the current proposal: correctness and performance.
   Both are instances of "raising the bar" for NFS, whereby the desire
   to use NFS in new classes of applications can be accommodated by
   providing the basic features to make such use feasible.  Such
   applications include tightly coupled sharing environments such as
   cluster computing, high performance computing (HPC), and
   information processing such as databases.  These trends are
   explored in depth in [NFSPS].

   The first issue, correctness, is support for exactly-once
   semantics, a property taken for granted in local filesystems.  Such
   semantics have not been reliably available with NFS.  Server-based
   duplicate request caches [CJ89] help, but do not reliably provide
   strict correctness.  For the type of application which is expected
   to make extensive use of the high-performance RDMA-enabled
   environment, the reliable provision of such semantics is a
   fundamental requirement.

   Introduction of a session to NFSv4 will address these issues.  With
   higher performance and enhanced semantics comes the problem of
   enabling advanced endpoint management, for example high-speed
   trunking, multipathing and failover.  These characteristics enable
   availability and performance.  RFC3530 presents some issues in
   permitting a single clientid to access a server over multiple
   connections.

   A second issue encountered in common by NFS implementations is the
   CPU overhead required to implement the protocol.  Primary among the
   sources of this overhead is the movement of data from NFS protocol
   messages to its eventual destination in user buffers or aligned
   kernel buffers.  The data copies consume system bus bandwidth and
   CPU time, reducing the available system capacity for applications
   [RDDPPS].  Achieving zero-copy with NFS has to date required
   sophisticated "header cracking" hardware and/or extensive platform-
   specific virtual memory mapping tricks.

   Combined in this way, NFSv4, RDMA and the emerging high-speed
   network fabrics will enable delivery of performance which matches
   that of the fastest local filesystems, preserving the key existing
   local filesystem semantics, while enhancing them by providing
   network filesystem sharing semantics.

   RDMA implementations generally have other interesting properties,
   such as hardware-assisted protocol access, and support for user
   space access to I/O.  RDMA is compelling here for another reason:
   hardware-offloaded networking support in itself does not avoid data
   copies without resorting to implementing part of the NFS protocol
   in the NIC.  Support of RDMA by NFS enables the highest performance
   at the architecture level rather than by implementation; this
   enables ubiquitous and interoperable solutions.
   By providing file access performance equivalent to that of local
   file systems, NFSv4 over RDMA will enable applications running on a
   set of client machines to interact through an NFSv4 file system,
   just as applications running on a single machine might interact
   through a local file system.

   This raises the issue of whether additional protocol enhancements
   to enable such interaction would be desirable, and what such
   enhancements would be.  This is a complicated issue which the
   working group needs to address and will not be further discussed in
   this document.

3.1.4  NFSv4 Session Extension Characteristics

   This draft will present a solution based upon minor versioning of
   NFSv4.  It will introduce a session to collect transport endpoints
   and resources such as reply caching, which in turn enables
   enhancements such as trunking, failover and recovery.  It will
   describe use of RDMA by employing support within an underlying RPC
   layer [RPCRDMA].  Most importantly, it will focus on making the
   best possible use of an RDMA transport.

   These extensions are proposed as elements of a new minor revision
   of NFS version 4.  In this draft, NFS version 4 will be referred to
   generically as "NFSv4" when describing properties common to all
   minor versions.  When referring specifically to properties of the
   original, minor version 0 protocol, "NFSv4.0" will be used, and
   changes proposed here for minor version 1 will be referred to as
   "NFSv4.1".

   This draft proposes only changes which are strictly upward-
   compatible with existing RPC and NFS Application Programming
   Interfaces (APIs).

3.2  Transport Issues

   The Transport Issues section of the document explores the details
   of utilizing the various supported transports.

3.2.1  Session Model

   The first and most evident issue in supporting diverse transports
   is how to provide for their differences.  This draft proposes
   introducing an explicit session.

   A session introduces minimal protocol requirements, and provides
   for a highly useful and convenient way to manage numerous endpoint-
   related issues.  The session is a local construct; it represents a
   named, higher-layer object to which connections can refer, and
   encapsulates properties important to each associated client.

   A session is a dynamically created, long-lived server object
   created by a client, used over time from one or more transport
   connections.  Its function is to maintain the server's state
   relative to the connection(s) belonging to a client instance.  This
   state is entirely independent of the connection itself.  The
   session in effect becomes the object representing an active client
   on a connection or set of connections.

   Clients may create multiple sessions for a single clientid, and may
   wish to do so for optimization of transport resources, buffers, or
   server behavior.  A session could be created by the client to
   represent a single mount point, for separate read and write
   "channels", or for any number of other client-selected parameters.

   The session enables several things immediately.  Clients may
   disconnect and reconnect (voluntarily or not) without loss of
   context at the server.  (Of course, locks, delegations and related
   associations require special handling, and generally expire in the
   extended absence of an open connection.)  Clients may connect
   multiple transport endpoints to this common state.
   The endpoints may have all the same attributes, for instance when
   trunked on multiple physical network links for bandwidth
   aggregation or path failover.  Or, the endpoints can have specific,
   special-purpose attributes such as callback channels.

   The NFSv4 specification does not provide for any form of flow
   control; instead it relies on the windowing provided by TCP to
   throttle requests.  This unfortunately does not work with RDMA,
   which in general provides no operation flow control and will
   terminate a connection in error when limits are exceeded.  Limits
   are therefore exchanged when a session is created; these limits
   then provide maxima within which each session's connections must
   operate, and they are managed within these limits as described in
   [RPCRDMA].  The limits may also be modified dynamically at the
   server's choosing by manipulating certain parameters present in
   each NFSv4.1 request.

   The presence of a maximum request limit on the session bounds the
   requirements of the duplicate request cache.  This can be used to
   advantage by a server, which can accurately determine any storage
   needs and enable it to maintain duplicate request cache persistence
   and to provide reliable exactly-once semantics.

   Finally, given adequate connection-oriented transport security
   semantics, authentication and authorization may be cached on a per-
   session basis, enabling greater efficiency in the issuing and
   processing of requests on both client and server.  A proposal for
   transparent, server-driven implementation of this in NFSv4 has been
   made [CCM].  The existence of the session greatly facilitates the
   implementation of this approach.  This is discussed in detail in
   the Authentication Efficiencies section later in this draft.

3.2.2  Connection State

   In RFC3530, the combination of a connected transport endpoint and a
   clientid forms the basis of connection state.  While this has been
   made to work with certain limitations, there are difficulties in
   correct and robust implementation.  The NFSv4.0 protocol must
   provide a server-initiated connection for the callback channel, and
   must carefully specify the persistence of client state at the
   server in the face of transport interruptions.  The server has only
   the client's transport address binding (the IP 4-tuple) to identify
   the client RPC transaction stream and to use as a lookup tag on the
   duplicate request cache.  (A useful overview of this is in [RW96].)
   If the server listens on multiple addresses, and the client
   connects to more than one, it must employ different clientids on
   each, negating its ability to aggregate bandwidth and redundancy.
   In effect, each transport connection is used as the server's
   representation of client state.  But transport connections are
   potentially fragile and transitory.

   In this proposal, a session identifier is assigned by the server
   upon initial session negotiation on each connection.  This
   identifier is used to associate additional connections, to
   renegotiate after a reconnect, to provide an abstraction for the
   various session properties, and to address the duplicate request
   cache.  No transport-specific information is used in the duplicate
   request cache implementation of an NFSv4.1 server, nor in fact is
   the RPC XID itself.  The session identifier is unique within the
   server's scope and may be subject to certain server policies, such
   as being bounded in time.
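   The shift away from transport-derived state can be illustrated by
   contrasting the lookup keys a server might use for its duplicate
   request cache.  The structures below are an illustrative sketch,
   not this draft's XDR; the slot and sequence fields anticipate the
   mechanism of Section 3.10.2:

      #include <stdint.h>

      /* NFSv4.0-era cache key: tied to the transport and RPC XID. */
      struct drc_key_v40 {
          uint32_t client_ip;      /* fragment of the IP 4-tuple */
          uint16_t client_port;
          uint32_t rpc_xid;
      };

      /* NFSv4.1 cache key: derived entirely from session state. */
      struct drc_key_v41 {
          uint8_t  sessionid[16];  /* server-assigned, long-lived */
          uint32_t slotid;         /* bounded by the request limit */
          uint32_t sequenceid;     /* detects retransmission */
      };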
   It is envisioned that the primary transport model will be
   connection oriented.  Connection orientation brings with it certain
   potential optimizations, such as caching of per-connection
   properties, which are easily leveraged through the generality of
   the session.  However, it is possible that in the future, other
   transport models could be accommodated below the session
   abstraction.

3.2.3  NFSv4 Channels, Sessions and Connections

   There are at least two types of NFSv4 channels: the "operations"
   channel used for ordinary requests from client to server, and the
   "back" channel, used for callback requests from server to client.

   As mentioned above, different NFSv4 operations on these channels
   can lead to different resource needs.  For example, server callback
   operations (CB_RECALL) are specific, small messages which flow from
   server to client at arbitrary times, while data transfers such as
   read and write have very different sizes and asymmetric behaviors.
   It is sometimes impractical for the RDMA peers (NFSv4 client and
   NFSv4 server) to post buffers for these various operations on a
   single connection.  Commingling of requests with responses at the
   client receive queue is particularly troublesome, due both to the
   need to manage both solicited and unsolicited completions, and to
   provision buffers for both purposes.  Due to the lack of any
   ordering of callback requests versus response arrivals, without any
   other mechanisms, the client would be forced to allocate all
   buffers sized to the worst case.

   The callback requests are likely to be handled by a different task
   context from that handling the responses.  Significant
   demultiplexing and thread management may be required if both are
   received on the same queue.  However, if callbacks are relatively
   rare (perhaps due to client access patterns), many of these
   difficulties can be minimized.

   Also, the client may wish to perform trunking of operations channel
   requests for performance reasons, or multipathing for availability.
   This proposal permits both, as well as many other session and
   connection possibilities, by permitting each operation to carry
   session membership information and to share session (and clientid)
   state in order to draw upon the appropriate resources.  For
   example, reads and writes may be assigned to specific, optimized
   connections, or sorted and separated by any or all of size,
   idempotency, etc.

   To address the problems described above, this proposal allows
   multiple sessions to share a clientid, as well as for multiple
   connections to share a session.

   Single Connection model:

                     NFSv4.1 Session
                    /               \
        Operations_Channel    [Back_Channel]
                    \               /
                      Connection
                          |

   Multi-connection trunked model (2 operations channels shown):

                     NFSv4.1 Session
                    /               \
         Operations_Channels    [Back_Channel]
          |          |               |
      Connection  Connection    [Connection]
          |          |               |

   Multi-connection split-use model (2 mounts shown):

                     NFSv4.1 Session
                    /               \
            (/home)          (/usr/local - readonly)
           /       \                  |
   Operations_Channel  [Back_Channel] |
          |                  |        Operations_Channel
      Connection       [Connection]        |
          |                  |         Connection
                                           |
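   One way to read the preceding models is as a containment hierarchy.
   The following C declarations are a minimal sketch of that
   hierarchy, assuming illustrative names throughout (they correspond
   to no XDR in this draft):

      #include <stdint.h>

      /* One transport endpoint: a TCP connection or RDMA queue pair. */
      struct nfs41_connection {
          int                      fd;     /* or an RDMA endpoint handle */
          struct nfs41_connection *next;   /* trunked peers on the channel */
      };

      enum nfs41_channel_type { CH_OPERATIONS, CH_BACK };

      /* A channel groups the connections used for one purpose. */
      struct nfs41_channel {
          enum nfs41_channel_type  type;
          struct nfs41_connection *conns;
      };

      /* The session holds per-client server state, independent of any
       * particular connection; many sessions may share one clientid. */
      struct nfs41_session {
          uint8_t              sessionid[16];
          uint64_t             clientid;
          struct nfs41_channel operations;
          struct nfs41_channel back;        /* optional */
          void                *reply_cache; /* see Section 3.2.5 */
      };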
   In this way, implementation as well as resource management may be
   optimized.  Each session will have its own response caching and
   buffering, and each connection or channel will have its own
   transport resources, as appropriate.  Clients which do not require
   certain behaviors may optimize such resources away completely, by
   using specific sessions and not even creating the additional
   channels and connections.

3.2.4  Reconnection, Trunking and Failover

   Reconnection after failure references stored state on the server
   associated with lease recovery during the grace period.  The
   session provides a convenient handle for storing and managing
   information regarding the client's previous state on a per-
   connection basis, e.g. to be used upon reconnection.  Reconnection
   to a previously existing session, and its stored resources, are
   covered in the "Connection Models" section below.

   One important aspect of reconnection is that of RPC library
   support.  Traditionally, an Upper Layer RPC-based Protocol such as
   NFS leaves all transport knowledge to the RPC layer implementation
   below it.  This allows NFS to operate over a wide variety of
   transports and has proven to be a highly successful approach.  The
   session, however, introduces an abstraction which is, in a way,
   "between" RPC and NFSv4.1.  It is important that the session
   abstraction not have ramifications within the RPC layer.

   One such issue arises within the reconnection logic of RPC.
   Previously, an explicit session binding operation, which
   established session context for each new connection, was explored.
   This however required that the session binding also be performed
   during reconnect, which in turn required an RPC request.  This
   additional request requires new RPC semantics, both in
   implementation and in the fact that a new request is inserted into
   the RPC stream.  Also, the binding of a connection to a session
   required the upper layer to become "aware" of connections,
   something the RPC layer architecturally abstracts away.  Therefore
   the session binding is not handled in connection scope but is
   instead explicitly carried in each request.

   For Reliability, Availability and Serviceability (RAS) issues such
   as bandwidth aggregation and multipathing, clients frequently seek
   to make multiple connections through multiple logical or physical
   channels.  The session is a convenient point to aggregate and
   manage these resources.
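   Carrying session membership in each request can be pictured as a
   small header at the front of every COMPOUND; in this draft that
   role is played by the SEQUENCE operation (Section 14.8).  The
   argument structure below is an illustrative sketch of what such a
   header conveys, not this draft's XDR:

      #include <stdint.h>

      /* Illustrative arguments carried at the front of each request. */
      struct sequence_args {
          uint8_t  sessionid[16];  /* which session this request uses */
          uint32_t slotid;         /* index into the reply cache */
          uint32_t sequenceid;     /* per-slot retransmission detector */
          uint32_t highest_slot;   /* lets the server trim resources */
      };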
3.2.5  Server Duplicate Request Cache

   Server duplicate request caches, while not a part of an NFS
   protocol, have become a standard, even required, part of any NFS
   implementation.  First described in [CJ89], the duplicate request
   cache was initially found to reduce work at the server by avoiding
   duplicate processing for retransmitted requests.  A second and, in
   the long run, more important benefit was improved correctness, as
   the cache prevented certain destructive non-idempotent requests
   from being reinvoked.

   However, such caches do not provide correctness guarantees; they
   cannot be managed in a reliable, persistent fashion.  The reason is
   understandable: their storage requirement is unbounded due to the
   lack of any such bound in the NFS protocol, and they are dependent
   on transport addresses for request matching.

   As proposed in this draft, the presence of maximum request count
   limits and negotiated maximum sizes allows the size and duration of
   the cache to be bounded and, coupled with a long-lived session
   identifier, enables its persistent storage on a per-session basis.

   This provides a single unified mechanism which provides the
   following guarantees required in the NFSv4 specification, while
   extending them to all requests, rather than limiting them only to a
   subset of state-related requests:

      "It is critical the server maintain the last response sent to
      the client to provide a more reliable cache of duplicate non-
      idempotent requests than that of the traditional cache described
      in [CJ89]..."  [RFC3530]

   The maximum request count limit is the count of active operations,
   which bounds the number of entries in the cache.  Constraining the
   size of operations additionally serves to limit the required
   storage to the product of the current maximum request count and the
   maximum response size.  This storage requirement enables server-
   side efficiencies.

   Session negotiation allows the server to maintain other state.  An
   NFSv4.1 client invoking the session destroy operation will cause
   the server to denegotiate (close) the session, allowing the server
   to deallocate cache entries.  Clients can potentially specify that
   such caches not be kept for appropriate types of sessions (for
   example, read-only sessions).  This can enable more efficient
   server operation resulting in improved response times, and more
   efficient sizing of buffers and response caches.

   Similarly, it is important for the client to explicitly learn
   whether the server is able to implement reliable semantics.
   Knowledge of whether these semantics are in force is critical for a
   highly reliable client, one which must provide transactional
   integrity guarantees.  When clients request that the semantics be
   enabled for a given session, the session reply must inform the
   client if the mode is in fact enabled.  In this way the client can
   confidently proceed with operations without having to implement
   consistency facilities of its own.
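   A minimal sketch of the bounded cache this enables follows, in the
   spirit of the slot mechanism described later in the draft (Section
   3.10.2); the structure layout, fixed reply buffer, and function
   names are assumptions for illustration:

      #include <stdint.h>
      #include <string.h>

      #define MAX_REPLY 8192        /* negotiated max response size */

      struct drc_slot {
          uint32_t seqid;           /* last sequence seen on this slot */
          uint32_t reply_len;
          uint8_t  reply[MAX_REPLY];
      };

      struct reply_cache {
          struct drc_slot *slots;   /* one per negotiated request */
          uint32_t         nslots;  /* == maximum request count */
      };

      /* Return the cached reply for a retransmission, NULL otherwise. */
      static const struct drc_slot *
      drc_lookup(const struct reply_cache *rc, uint32_t slot,
                 uint32_t seqid)
      {
          if (slot >= rc->nslots)
              return NULL;              /* protocol error in a real server */
          if (rc->slots[slot].seqid == seqid)
              return &rc->slots[slot];  /* duplicate: replay the reply */
          return NULL;                  /* new request: execute and cache */
      }

      static void
      drc_store(struct reply_cache *rc, uint32_t slot, uint32_t seqid,
                const void *reply, uint32_t len)
      {
          struct drc_slot *s = &rc->slots[slot];

          s->seqid = seqid;
          s->reply_len = len > MAX_REPLY ? MAX_REPLY : len;
          memcpy(s->reply, reply, s->reply_len);
      }

   The storage bound noted above falls out directly: the cache can
   never exceed nslots * sizeof(struct drc_slot).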
3.3  Session Initialization and Transfer Models

   Session initialization issues, and data transfer models relevant to
   both TCP and RDMA, are discussed in this section.

3.3.1  Session Negotiation

   The following parameters are exchanged between client and server at
   session creation time.  Their values allow the server to properly
   size resources allocated in order to service the client's requests,
   and to provide the server with a way to communicate limits to the
   client for proper and optimal operation.  They are exchanged prior
   to all session-related activity, over any transport type.
   Discussion of their use is found in their descriptions as well as
   throughout this section.

   Maximum Requests

      The client's desired maximum number of concurrent requests is
      passed, in order to allow the server to size its reply cache
      storage.  The server may modify the client's requested limit
      downward (or upward) to match its local policy and/or resources.
      Over RDMA-capable RPC transports, the per-request management of
      low-level transport message credits is handled within the RPC
      layer [RPCRDMA].

   Maximum Request/Response Sizes

      The maximum request and response sizes are exchanged in order to
      permit allocation of appropriately sized buffers and request
      cache entries.  The sizes must allow for certain protocol
      minima, allowing the receipt of maximally sized operations (e.g.
      RENAME requests, which contain two name strings).  Note that the
      maximum request/response sizes cover the entire request/response
      message and not simply the data payload, as with the traditional
      NFS maximum read or write sizes.  Also note that the server
      implementation may not, and in fact probably does not, require
      the reply cache entries to be sized as large as the maximum
      response.  The server may reduce the client's requested sizes.

   Inline Padding/Alignment

      The server can inform the client of any padding which can be
      used to deliver NFSv4 inline WRITE payloads into aligned
      buffers.  Such alignment can be used to avoid data copy
      operations at the server for both TCP and inline RDMA transfers.
      For RDMA, the client informs the server in each operation when
      padding has been applied [RPCRDMA].

   Transport Attributes

      A placeholder for transport-specific attributes is provided,
      with a format to be determined.  Possible examples of
      information to be passed in this parameter include transport
      security attributes to be used on the connection, RDMA-specific
      attributes, legacy "private data" as used on existing RDMA
      fabrics, and transport Quality of Service attributes.  This
      information is to be passed to the peer's transport layer by
      local means which are currently outside the scope of this draft;
      however, one attribute is provided in the RDMA case:

      RDMA Read Resources

         RDMA implementations must explicitly provision resources to
         support RDMA Read requests from connected peers.  These
         values must be explicitly specified, to provide adequate
         resources for matching the peer's expected needs and the
         connection's delay-bandwidth parameters.  The client provides
         its chosen value to the server in the initial session
         creation; the value must be provided in each client RDMA
         endpoint.  The values are asymmetric and should be set to
         zero at the server in order to conserve RDMA resources, since
         clients do not issue RDMA Read operations in this proposal.
         The result is communicated in the session response, to permit
         matching of values across the connection.  The value may not
         be changed for the duration of the session, although a new
         value may be requested as part of a new session.
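   Gathering these parameters into one place, the session-creation
   exchange might carry something like the following; the field and
   structure names are illustrative assumptions, not this draft's XDR
   for CREATESESSION:

      #include <stdint.h>

      struct session_limits {
          uint32_t max_requests;      /* concurrent requests (cache bound) */
          uint32_t max_request_size;  /* whole COMPOUND, not just payload */
          uint32_t max_response_size;
          uint32_t write_padding;     /* inline WRITE alignment hint */
          uint32_t rdma_read_credits; /* asymmetric; zero at the server */
          /* opaque transport attributes would follow */
      };

      /* The client proposes; the server replies with adjusted values
       * that both sides then honor for the life of the session. */
      struct create_session_res {
          uint8_t               sessionid[16];
          struct session_limits granted;
      };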
3.3.2  RDMA Requirements

   A complete discussion of the operation of RPC-based protocols atop
   RDMA transports is in [RPCRDMA].  Where RDMA is considered, this
   proposal assumes the use of such a layering; it addresses only the
   upper layer issues relevant to making best use of RPC/RDMA.

   A connection-oriented (reliable, sequenced) RDMA transport will be
   required.  There are several reasons for this.  First, this model
   most closely reflects the general NFSv4 requirement of long-lived
   and congestion-controlled transports.  Second, to operate correctly
   over either an unreliable or unsequenced RDMA transport, or both,
   would require significant complexity in the implementation and
   protocol not appropriate for a strict minor version.  For example,
   retransmission on connected endpoints is explicitly disallowed in
   the current NFSv4 draft; it would again be required with these
   alternate transport characteristics.  Third, the proposal assumes a
   specific RDMA ordering semantic, which presents the same set of
   ordering and reliability issues to the RDMA layer over such
   transports.

   The RDMA implementation provides for making connections to other
   RDMA-capable peers.  In the case of the current proposals before
   the RDDP working group, these RDMA connections are preceded by a
   "streaming" phase, where ordinary TCP (or NFS) traffic might flow.
   However, this is not assumed here, and sizes and other parameters
   are explicitly exchanged upon a session entering RDMA mode.

3.3.3  RDMA Connection Resources

   On transport endpoints which support automatic RDMA mode, that is,
   endpoints which are created in the RDMA-enabled state, a single,
   preposted buffer must initially be provided by both peers, and the
   client session negotiation must be the first exchange.

   On transport endpoints supporting dynamic negotiation, a more
   sophisticated negotiation is possible, but is not discussed in the
   current draft.

   RDMA imposes several requirements on upper layer consumers.
   Registration of memory and the need to post buffers of a specific
   size and number for receive operations are a primary consideration.

   Registration of memory can be a relatively high-overhead operation,
   since it requires pinning of buffers, assignment of attributes
   (e.g. readable/writable), and initialization of hardware
   translation.  Preregistration is desirable to reduce overhead.
   These registrations are specific to hardware interfaces and even to
   RDMA connection endpoints; therefore, negotiation of their limits
   is desirable to manage resources effectively.

   Following the basic registration, these buffers must be posted by
   the RPC layer to handle receives.  These buffers remain in use by
   the RPC/NFSv4 implementation; the size and number of them must be
   known to the remote peer in order to avoid RDMA errors which would
   cause a fatal error on the RDMA connection.

   The session provides a natural way for the server to manage
   resource allocation to each client rather than to each transport
   connection itself.  This enables considerable flexibility in the
   administration of transport endpoints.
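   The register-then-post discipline can be sketched as follows.  The
   rdma_register() and rdma_post_recv() calls are hypothetical stand-
   ins for a real verbs-style API (e.g. ibv_reg_mr() and
   ibv_post_recv()); everything here is an illustrative assumption:

      #include <stdint.h>
      #include <stdlib.h>

      struct recv_buf {
          void  *base;
          size_t len;
          void  *mr;     /* registration (memory region) handle */
      };

      /* Hypothetical verbs-style wrappers, declared but not defined. */
      void *rdma_register(void *addr, size_t len);
      int   rdma_post_recv(struct recv_buf *b);

      /* Pre-post one receive per advertised credit, each of the
       * negotiated maximum message size, before the peer may send. */
      static int provision_receives(struct recv_buf *bufs,
                                    uint32_t credits, size_t max_msg)
      {
          for (uint32_t i = 0; i < credits; i++) {
              bufs[i].base = malloc(max_msg);
              if (bufs[i].base == NULL)
                  return -1;
              bufs[i].len = max_msg;
              bufs[i].mr  = rdma_register(bufs[i].base, max_msg);
              rdma_post_recv(&bufs[i]);
          }
          return 0;
      }

   Underprovisioning here is fatal: a Send arriving with no posted
   receive terminates the RDMA connection, which is why the credit and
   size values are negotiated up front rather than discovered.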
3.3.4  TCP and RDMA Inline Transfer Model

   The basic transfer model for both TCP and RDMA is referred to as
   "inline".  For TCP, this is the only transfer model supported,
   since TCP carries both the RPC header and data together in the data
   stream.

   For RDMA, the RDMA Send transfer model is used for all NFS requests
   and replies, but data is optionally carried by RDMA Writes or RDMA
   Reads.  Use of Sends is required to ensure consistency of data and
   to deliver completion notifications.  The pure-Send method is
   typically used where the data payload is small, or where for
   whatever reason target memory for RDMA is not available.

   Inline message exchange

       Client                                       Server
              :        Request                    :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Response                   :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

       Client                                       Server
              :        Read request               :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Read response with data    :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

       Client                                       Server
              :        Write request with data    :
        Send  : ------------------------------>   : untagged
              :                                   : buffer
              :        Write response             :
    untagged  : <------------------------------   : Send
      buffer  :                                   :

   Responses must be sent to the client on the same connection on
   which the request was sent.  It is important that the server not
   assume any specific client implementation, in particular whether
   connections within a session share any state at the client.  This
   is also important to preserve the ordering of RDMA operations, and
   especially RDMA consistency.  Additionally, it ensures that the RPC
   RDMA layer makes no requirement of the RDMA provider to open its
   memory registration handles (Steering Tags) beyond the scope of a
   single RDMA connection.  This is an important security
   consideration.

   Two values must be known to each peer prior to issuing Sends: the
   maximum number of sends which may be posted, and their maximum
   size.  These values are referred to, respectively, as the message
   credits and the maximum message size.  While the message credits
   might vary dynamically over the duration of the session, the
   maximum message size does not.  The server must commit to
   preserving this number of duplicate request cache entries, and to
   preparing a number of receive buffers equal to or greater than its
   currently advertised credit value, each of the advertised size.
   This ensures that sufficient transport resources are allocated to
   receive the full advertised limits.

   Note that the server must post the maximum number of session
   requests to each client operations channel.  The client is not
   required to spread its requests in any particular fashion across
   connections within a session.  If the client wishes, it may create
   multiple sessions, each with a single or small number of operations
   channels, to provide the server with this resource advantage.  Or,
   over RDMA the server may employ a "shared receive queue".  The
   server can in any case protect its resources by restricting the
   client's request credits.
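   A client-side sketch of this credit discipline follows; the helper
   post_receive_buffer() is a hypothetical stand-in for the transport
   posting described above:

      #include <stdint.h>

      struct channel_credits {
          uint32_t limit;      /* server-granted message credits */
          uint32_t in_flight;  /* sent but not yet answered */
      };

      void post_receive_buffer(void);   /* hypothetical helper */

      /* Returns 0 when the request may be sent, -1 when the client
       * must first wait for a reply to free a credit. */
      static int send_request(struct channel_credits *cc)
      {
          if (cc->in_flight >= cc->limit)
              return -1;             /* would overrun peer's receives */
          post_receive_buffer();     /* reply buffer posted pre-Send */
          cc->in_flight++;
          /* ... RDMA Send (or TCP write) of the request here ... */
          return 0;
      }

      static void on_reply(struct channel_credits *cc)
      {
          cc->in_flight--;           /* the reply returns the credit */
      }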
The limits are chosen
1010 based upon the expected needs and capabilities of the client and
1011 server, and are in fact arbitrary. Sizes may be specified by the
1012 client as zero (requesting the server's preferred or optimal value),
1013 and request limits may be chosen in proportion to the client's
1014 capabilities. For example, a limit of 1000 allows 1000 requests to
1015 be in progress, which may generally be far more than adequate to keep
1016 local networks and servers fully utilized.

1018 Both client and server have independent sizes and buffering, but over
1019 RDMA fabrics client credits are easily managed by posting a receive
1020 buffer prior to sending each request. Each such buffer will not
1021 necessarily be completed with the corresponding reply, since responses
1022 from NFSv4 servers arrive in arbitrary order. When an operations
1023 channel is also used for callbacks, the client must account for
1024 callback requests by posting additional buffers. Note that
1025 implementation-specific facilities such as a shared receive queue may
1026 also allow optimization of these allocations.

1028 When a session is created, the client requests a preferred buffer
1029 size, and the server provides its answer. The server posts all
1030 buffers of at least this size. The client must comply by not sending
1031 requests greater than this size. It is recommended that server
1032 implementations do all they can to accommodate a useful range of
1033 possible client requests. There is a provision in [RPCRDMA] to allow
1034 the sending of client requests which exceed the server's receive
1035 buffer size, but it requires the server to "pull" the client's
1036 request as a "read chunk" via RDMA Read. This introduces at least
1037 one additional network roundtrip, plus other overhead such as
1038 registering memory for RDMA Read at the client and additional RDMA
1039 operations at the server, and is to be avoided.

1041 An issue therefore arises when considering the NFSv4 COMPOUND
1042 procedures. Since an arbitrary number (total size) of operations can
1043 be specified in a single COMPOUND procedure, its size is effectively
1044 unbounded. This cannot be supported by RDMA Sends, and therefore
1045 this size negotiation places a restriction on the construction and
1046 maximum size of both COMPOUND requests and responses. If a COMPOUND
1047 results in a reply at the server that is larger than can be sent in
1048 an RDMA Send to the client, then the COMPOUND must terminate and the
1049 operation that causes the overflow will return a TOOSMALL error
1050 status result.

1052 3.3.5 RDMA Direct Transfer Model

1054 Placement of data by explicitly tagged RDMA operations is referred to
1055 as "direct" transfer. This method is typically used where the data
1056 payload is relatively large, that is, when RDMA setup has been
1057 performed prior to the operation, or when any overhead for setting up
1058 and performing the transfer is regained by avoiding the overhead of
1059 processing an ordinary receive.

1061 The client advertises RDMA buffers in this proposed model, and not
1062 the server. This means the "XDR Decoding with Read Chunks" described
1063 in [RPCRDMA] is not employed by NFSv4.1 replies, and instead all
1064 results transferred via RDMA to the client employ "XDR Decoding with
1065 Write Chunks". There are several reasons for this.

1067 First, it allows for a correct and secure mode of transfer.
The
1068 client may advertise specific memory buffers only during specific
1069 times, and may revoke access when it pleases. The server is not
1070 required to expose copies of local file buffers for individual
1071 clients, or to lock or copy them for each client access.

1073 Second, client credits based on fixed-size request buffers are easily
1074 managed on the server, but for the server additional management of
1075 buffers for client RDMA Reads is not well-bounded. For example, the
1076 client may not perform these RDMA Read operations in a timely
1077 fashion; therefore, the server would have to protect itself against
1078 denial-of-service on these resources.

1080 Third, it reduces network traffic, since buffer exposure outside the
1081 scope and duration of a single request/response exchange necessitates
1082 additional memory management exchanges.

1084 There are costs associated with this decision. Primary among them is
1085 the need for the server to employ RDMA Read for operations such as
1086 large WRITE. The RDMA Read operation is a two-way exchange at the
1087 RDMA layer, which incurs additional overhead relative to RDMA Write.
1088 Additionally, RDMA Read requires resources at the data source (the
1089 client in this proposal) to maintain state and to generate replies.
1090 These costs are overcome through use of pipelining with credits, with
1091 sufficient RDMA Read resources negotiated at session initiation, and
1092 appropriate use of RDMA for writes by the client - for example only
1093 for transfers above a certain size.

1095 A description of which NFSv4 operation results are eligible for data
1096 transfer via RDMA Write is in [NFSDDP]. There are only two such
1097 operations: READ and READLINK. When XDR encoding these requests on
1098 an RDMA transport, the NFSv4.1 client must insert the appropriate
1099 xdr_write_list entries to indicate to the server whether the results
1100 should be transferred via RDMA or inline with a Send. As described
1101 in [NFSDDP], a zero-length write chunk is used to indicate an inline
1102 result. In this way, it is unnecessary to create new operations for
1103 RDMA-mode versions of READ and READLINK.

1105 Another tool to avoid creation of new, RDMA-mode operations is the
1106 Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return
1107 large replies via RDMA as if they were inline. Reply chunks are used
1108 for operations such as READDIR, which returns large amounts of
1109 information, but in many small XDR segments. Reply chunks are
1110 offered by the client and the server can use them in preference to
1111 inline. Reply chunks are transparent to upper layers such as NFSv4.

1113 In the very rare cases where another NFSv4.1 operation requires
1114 larger buffers than were negotiated when the session was created (for
1115 example extraordinarily large RENAMEs), the underlying RPC layer may
1116 support the use of "Message as an RDMA Read Chunk" and "RDMA Write of
1117 Long Replies" as described in [RPCRDMA]. No additional support is
1118 required in the NFSv4.1 client for this. The client should be
1119 certain that its requested buffer sizes are not so small as to make
1120 this a frequent occurrence, however.

1122 All operations are initiated by a Send, and are completed with a
1123 Send. This is exactly as in conventional NFSv4, but under RDMA has a
1124 significant purpose: RDMA operations are not complete, that is,
1125 guaranteed consistent, at the data sink until followed by a
1126 successful Send completion (i.e.
a receive). These events provide a
1127 natural opportunity for the initiator (client) to enable and later
1128 disable RDMA access to the memory which is the target of each
1129 operation, in order to provide for consistent and secure operation.
1130 The RDMAP Send with Invalidate operation may be worth employing in
1131 this respect, as it relieves the client of certain overhead in this
1132 case.

1134 A "onetime" boolean advisory attached to each RDMA region might
1135 become a hint to the server that the client will use the three-tuple
1136 for only one NFSv4 operation. For a transport such as iWARP, the
1137 server can assist the client in invalidating the three-tuple by
1138 performing a Send with Solicited Event and Invalidate. The server
1139 may ignore this hint, in which case the client must perform a local
1140 invalidate after receiving the indication from the server that the
1141 NFSv4 operation is complete. This may be considered in a future
1142 version of this draft and [NFSDDP].

1144 In a trusted environment, it may be desirable for the client to
1145 persistently enable RDMA access by the server. Such a model is
1146 desirable for the highest level of efficiency and lowest overhead.

1148 RDMA message exchanges

1150      Client                                Server
1151               : Direct Read Request :
1152      Send     : ------------------------------> : untagged
1153               :                                 : buffer
1154               : Segment :
1155      tagged   : <------------------------------ : RDMA Write
1156      buffer   :                                 : :
1157               : [Segment] :
1158      tagged   : <------------------------------ : [RDMA Write]
1159      buffer   :                                 :
1160               : Direct Read Response :
1161      untagged : <------------------------------ : Send (w/Inv.)
1162      buffer   :                                 :

1164      Client                                Server
1165               : Direct Write Request :
1166      Send     : ------------------------------> : untagged
1167               :                                 : buffer
1168               : Segment :
1169      tagged   : v------------------------------ : RDMA Read
1170      buffer   : +-----------------------------> :
1171               :                                 : :
1172               : [Segment] :
1173      tagged   : v------------------------------ : [RDMA Read]
1174      buffer   : +-----------------------------> :
1175               :                                 :
1176               : Direct Write Response :
1177      untagged : <------------------------------ : Send (w/Inv.)
1178      buffer   :                                 :

1180 3.4 Connection Models

1182 There are three scenarios in which to discuss the connection model.
1183 Each will be discussed individually, after describing the common case
1184 encountered at initial connection establishment.

1186 After a successful connection, the first request proceeds, in the
1187 case of a new client association, to initial session creation, and
1188 then optionally to session callback channel binding, prior to regular
1189 operation.

1191 Commonly, each new client "mount" will be the action which drives
1192 creation of a new session. However, there are any number of other
1193 possible approaches. Clients may choose to share a single connection
1194 and session among all their mount points. Or, clients may support
1195 trunking, where additional connections are created but all within a
1196 single session. Alternatively, the client may choose to create
1197 multiple sessions, each tuned to the buffering and reliability needs
1198 of the mount point. For example, a readonly mount can sharply reduce
1199 its write buffering, and also need not require the server to support
1200 reliable duplicate request caching.

1202 Similarly, the client can choose among several strategies for
1203 clientid usage. Sessions can share a single clientid, or create new
1204 clientids as the client deems appropriate.
For kernel-based clients
1205 which service multiple authenticated users, a single clientid shared
1206 across all mount points is generally the most appropriate and
1207 flexible approach. For example, all the client's file operations may
1208 need to share locking state, with the local client kernel taking
1209 responsibility for arbitrating access locally. For clients choosing
1210 to support other authentication models, for example userspace
1211 implementations, a new clientid is indicated. Through use of session
1212 create options, both models are supported at the client's choice.

1214 Since the session is explicitly created and destroyed by the client,
1215 and each client is uniquely identified, the server may be
1216 specifically instructed to discard unneeded persistent state. For
1217 this reason, it is possible for a server to retain any previous
1218 state indefinitely, and place its destruction under administrative
1219 control. Or, a server may choose to retain state for some
1220 configurable period, provided that the period meets other NFSv4
1221 requirements such as lease reclamation time, etc. However, since
1222 discarding this state at the server may affect the correctness of the
1223 server as seen by the client across network partitioning, such
1224 discarding of state should be done only in a conservative manner.

1226 Each client request to the server carries a new SEQUENCE operation
1227 within each COMPOUND, which provides the session context. This
1228 session context then governs the request control, duplicate request
1229 caching, and other persistent parameters managed by the server for a
1230 session.

1232 3.4.1 TCP Connection Model

1234 The following is a schematic diagram of the NFSv4.1 protocol
1235 exchanges leading up to normal operation on a TCP stream.

1237      Client                                Server
1238      TCPmode  : Create Clientid(nfs_client_id4) : TCPmode
1239               : ------------------------------> :
1240               :                                 :
1241               : Clientid reply(clientid, ...) :
1242               : <------------------------------ :
1243               :                                 :
1244               : Create Session(clientid, size S, :
1245               :   maxreq N, STREAM, ...) :
1246               : ------------------------------> :
1247               :                                 :
1248               : Session reply(sessionid, size S', :
1249               :   maxreq N') :
1250               : <------------------------------ :
1251               :                                 :
1252               : :
1253               : ------------------------------> :
1254               : <------------------------------ :
1255               : : :

1257 No net additional exchange is added to the initial negotiation by
1258 this proposal. In the NFSv4.1 exchange, the CREATECLIENTID replaces
1259 SETCLIENTID (eliding the callback "clientaddr4" addressing) and
1260 CREATESESSION subsumes the function of SETCLIENTID_CONFIRM, as
1261 described elsewhere in this document. Callback channel binding is
1262 optional, as in NFSv4.0. Note that the STREAM transport type is
1263 shown above, but since the transport mode remains unchanged and
1264 transport attributes are not necessarily exchanged, DEFAULT could
1265 also be passed.

1267 3.4.2 Negotiated RDMA Connection Model

1269 One possible design which has been considered is to have a
1270 "negotiated" RDMA connection model, supported via use of a session
1271 bind operation as a required first step. However, due to issues
1272 mentioned earlier, this proved problematic. This section remains as
1273 a reminder of that fact, and it is possible such a mode can be
1274 supported.

1276 It is not considered critical that this be supported for two reasons.
1277 One, the session persistence provides a way for the server to
1278 remember important session parameters, such as sizes and maximum
1279 request counts. These values can be used to restore the endpoint
1280 prior to making the first reply. Two, there are currently no
1281 critical RDMA parameters to set in the endpoint at the server side of
1282 the connection. RDMA Read resources, which are in general not
1283 settable after entering RDMA mode, are set only at the client - the
1284 originator of the connection. Therefore, as long as the RDMA provider
1285 supports an automatic RDMA connection mode, no further support is
1286 required from the NFSv4.1 protocol for reconnection.

1288 Note that when reconnecting, the client must provide at least as many
1289 RDMA Read resources to its local queue for the benefit of the server
1290 as it used when negotiating the session. If this value is no longer
1291 appropriate, the client should resynchronize its session state,
1292 destroy the existing session, and start over with the more
1293 appropriate values.

1295 3.4.3 Automatic RDMA Connection Model

1297 The following is a schematic diagram of the NFSv4.1 protocol
1298 exchanges performed on an RDMA connection.

1300      Client                                Server
1301      RDMAmode : : : RDMAmode
1302               : : :
1303      Prepost  : : : Prepost
1304      receive  : : : receive
1305               : :
1306               : Create Clientid(nfs_client_id4) :
1307               : ------------------------------> :
1308               :                                 : Prepost
1309               : Clientid reply(clientid, ...) : receive
1310               : <------------------------------ :
1311      Prepost  :                                 :
1312      receive  : Create Session(clientid, size S, :
1313               :   maxreq N, RDMA ...) :
1314               : ------------------------------> :
1315               :                                 : Prepost <=N'
1316               : Session reply(sessionid, size S', : receives of
1317               :   maxreq N') : size S'
1318               : <------------------------------ :
1319               :                                 :
1320               : :
1321               : ------------------------------> :
1322               : <------------------------------ :
1323               : : :

1325 3.5 Buffer Management, Transfer, Flow Control

1327 Inline operations in NFSv4.1 behave effectively the same as TCP
1328 sends. Procedure results are passed in a single message, and its
1329 completion at the client signals the receiving process to inspect the
1330 message.

1332 RDMA operations are performed solely by the server in this proposal,
1333 as described in the previous "RDMA Direct Model" section. Since
1334 server RDMA operations do not result in a completion at the client,
1335 and due to ordering rules in RDMA transports, after all required RDMA
1336 operations are complete, a Send (Send with Solicited Event for iWARP)
1337 containing the procedure results is performed from server to client.
1338 This Send operation will result in a completion which will signal the
1339 client to inspect the message.

1341 In the case of client read-type NFSv4 operations, the server will
1342 have issued RDMA Writes to transfer the resulting data into client-
1343 advertised buffers. The subsequent Send operation performs two
1344 necessary functions: finalizing any active or pending DMA at the
1345 client, and signaling the client to inspect the message.

1347 In the case of client write-type NFSv4 operations, the server will
1348 have issued RDMA Reads to fetch the data from the client-advertised
1349 buffers. No data consistency issues arise at the client, but the
1350 completion of the transfer must be acknowledged, again by a Send from
1351 server to client.

1353 In either case, the client advertises buffers for direct (RDMA style)
1354 operations.
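To make the preceding sequence concrete, here is a sketch of the
client side of a direct read in C (hypothetical helper names standing
in for RDMA provider and RPC layer facilities; the rationale for the
invalidation ordering is discussed below):

      #include <stddef.h>
      #include <stdint.h>

      #define REMOTE_WRITE 0x2          /* illustrative access flag */
      typedef uint32_t rdma_stag_t;     /* Steering Tag handle */

      /* Illustrative stand-ins for provider and RPC layer calls. */
      extern rdma_stag_t rdma_register(void *buf, size_t len, int acc);
      extern void rdma_invalidate(rdma_stag_t stag);
      extern void send_read_request(uint64_t off, size_t len,
                                    rdma_stag_t stag);
      extern void wait_reply(void);
      extern void verify_integrity(void *buf, size_t len);

      void direct_read(void *buf, size_t len, uint64_t off)
      {
          /* Advertise the buffer for the duration of one request. */
          rdma_stag_t stag = rdma_register(buf, len, REMOTE_WRITE);

          send_read_request(off, len, stag); /* inline Send */
          wait_reply();            /* server RDMA Writes, then Sends */
          rdma_invalidate(stag);   /* close the RDMA window first... */
          verify_integrity(buf, len); /* ...then check the contents */
      }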
The client may desire certain advertisement limits, and
1355 may wish the server to perform remote invalidation on its behalf when
1356 the server has completed its RDMA. This may be considered in a
1357 future version of this draft.

1359 In the absence of remote invalidation, the client may perform its
1360 own, local invalidation after the operation completes. This
1361 invalidation should occur prior to any RPCSEC_GSS integrity checking,
1362 since a validly remotely accessible buffer can possibly be modified
1363 by the peer. However, once the buffer has been invalidated and its
1364 integrity checked, the contents are locally secure.

1366 Credit updates over RDMA transports are supported at the RPC layer as
1367 described in [RPCRDMA]. In each request, the client requests a
1368 desired number of credits to be made available to the connection on
1369 which it sends the request. The client must not send more requests
1370 than the number which the server has previously advertised, or, in
1371 the case of the first request, more than one. If the client exceeds
1372 its credit limit, the connection may close with a fatal RDMA error.

1374 The server then executes the request, and replies with an updated
1375 credit count accompanying its results. Since replies are sequenced
1376 by their RDMA Send order, the most recent results always reflect the
1377 server's limit. In this way the client will always know the maximum
1378 number of requests it may safely post.

1380 Because the client requests an arbitrary credit count in each
1381 request, it is relatively easy for the client to request more, or
1382 fewer, credits to match its expected need. A client that discovered
1383 itself frequently queuing outgoing requests due to lack of server
1384 credits might increase its requested credits proportionately in
1385 response. Or, a client might have a simple, configurable number.
1386 The protocol also provides a per-operation "maxslot" exchange to
1387 assist in dynamic adjustment at the session level, described in a
1388 later section.

1390 Occasionally, a server may wish to reduce the total number of credits
1391 it offers a certain client on a connection. This could be
1392 encountered if a client were found to be consuming its credits
1393 slowly, or not at all. A client might notice this itself, and reduce
1394 its requested credits in advance, for instance requesting only the
1395 count of operations it currently has queued, plus a few as a base for
1396 starting up again. Such mechanisms can, however, be potentially
1397 complicated and are implementation-defined. The protocol does not
1398 require them.

1400 Because of the way in which RDMA fabrics function, it is not possible
1401 for the server (or client back channel) to cancel outstanding receive
1402 operations. Therefore, effectively only one credit can be withdrawn
1403 per receive completion. The server (or client back channel) would
1404 simply not replenish a receive operation when replying. The server
1405 can still reduce the available credit advertisement in its replies to
1406 the target value it desires, as a hint to the client that its credit
1407 target is lower and it should expect it to be reduced accordingly.
1408 Even if the server could cancel outstanding receives, it could not
1409 safely do so, since the client may have already sent requests in
1410 expectation of the previous limit.

1412 This brings out an interesting scenario similar to the client
1413 reconnect discussed earlier in "Connection Models".
How does the
1414 server reduce the credits of an inactive client?

1416 One approach is for the server to simply close such a connection and
1417 require the client to reconnect at a new credit limit. This is
1418 acceptable, if inefficient, when the connection setup time is short
1419 and where the server supports persistent session semantics.

1421 A better approach is to provide a back channel request to return the
1422 operations channel credits. The server may request the client to
1423 return some number of credits; the client must comply by performing
1424 operations on the operations channel, provided of course that the
1425 request does not drop the client's credit count to zero (in which
1426 case the connection would deadlock). If the client finds that it has
1427 no requests with which to consume the credits it was previously
1428 granted, it must send zero-length Send RDMA operations, or NULL NFSv4
1429 operations, in order to return the resources to the server. If the
1430 client fails to comply in a timely fashion, the server can recover
1431 the resources by breaking the connection.

1433 While in principle the back channel credits could be subject to a
1434 similar resource adjustment, in practice this is not an issue, since
1435 the back channel is used purely for control and is expected to be
1436 statically provisioned.

1438 It is important to note that in addition to maximum request counts,
1439 the sizes of buffers are negotiated per-session. This permits the
1440 most efficient allocation of resources on both peers. There is an
1441 important requirement on reconnection: the sizes posted by the server
1442 at reconnect must be at least as large as previously used, to allow
1443 recovery. Any replies that are replayed from the server's duplicate
1444 request cache must be able to be received into client buffers. In
1445 the case where a client has received replies to all its retried
1446 requests (and therefore received all its expected responses), then
1447 the client may disconnect and reconnect with different buffers at
1448 will, since no cache replay will be required.

1450 3.6 Retry and Replay

1452 NFSv4.0 forbids retransmission on active connections over reliable
1453 transports; this includes connected-mode RDMA. This restriction
1454 must be maintained in NFSv4.1.

1456 If one peer were to retransmit a request (or reply), it would consume
1457 an additional credit on the other. If the server retransmitted a
1458 reply, it would certainly result in an RDMA connection loss, since
1459 the client would typically only post a single receive buffer for each
1460 request. If the client retransmitted a request, the additional
1461 credit consumed on the server might lead to RDMA connection failure
1462 unless the client accounted for it and decreased its available
1463 credit, leading to wasted resources.

1465 RDMA credits present a new issue to the duplicate request cache in
1466 NFSv4.1. The request cache may be used when a connection within a
1467 session is lost, such as after the client reconnects. Credit
1468 information is a dynamic property of the connection, and stale values
1469 must not be replayed from the cache. This implies that the request
1470 cache contents must not be blindly used when replies are issued from
1471 it, and credit information appropriate to the channel must be
1472 refreshed by the RPC layer.
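A sketch of the replay path implied above follows (hypothetical
structures): the cached reply body is reused verbatim, but the credit
advertisement within it is regenerated from the connection's current
state rather than replayed stale from the cache:

      #include <stddef.h>
      #include <stdint.h>

      struct conn;                        /* transport connection */
      extern uint32_t current_credit_limit(struct conn *c);
      extern void put_xdr_uint32(unsigned char *p, uint32_t v);
      extern void post_send(struct conn *c, unsigned char *buf,
                            size_t len);

      struct drc_entry {
          unsigned char *reply;         /* cached XDR-encoded reply */
          size_t         len;
          size_t         credit_offset; /* credit field's position */
      };

      void replay_reply(struct drc_entry *e, struct conn *c)
      {
          /* Refresh the dynamic credit field; never replay it. */
          put_xdr_uint32(e->reply + e->credit_offset,
                         current_credit_limit(c));
          post_send(c, e->reply, e->len);
      }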
1474 Finally, RDMA fabrics do not guarantee that the memory handles
1475 (Steering Tags) within each RDMA three-tuple are valid outside the
1476 scope of a single connection. Therefore, handles used by the
1477 direct operations become invalid after connection loss. The server
1478 must ensure that any RDMA operations which must be replayed from the
1479 request cache use the newly provided handle(s) from the most recent
1480 request.

1482 3.7 The Back Channel

1484 The NFSv4 callback operations present a significant resource problem
1485 for the RDMA enabled client. Clearly, callbacks must be negotiated
1486 in the way credits are for the ordinary operations channel for
1487 requests flowing from client to server. But, for callbacks to arrive
1488 on the same RDMA endpoint as operation replies would require
1489 dedicating additional resources, and specialized demultiplexing and
1490 event handling. Or, callbacks may not require RDMA service at all
1491 (they do not normally carry substantial data payloads). It is highly
1492 desirable to streamline this critical path via a second
1493 communications channel.

1495 The session callback channel binding facility is designed for exactly
1496 such a situation, by dynamically associating a new connected endpoint
1497 with the session, and separately negotiating sizes and counts for
1498 active callback channel operations. The binding operation is
1499 firewall-friendly since it does not require the server to initiate
1500 the connection.

1502 This same method serves as well for ordinary TCP connection mode. It
1503 is expected that all NFSv4.1 clients may make use of the session
1504 facility to streamline their design.

1506 The back channel functions exactly the same as the operations channel
1507 except that no RDMA operations are required to perform transfers;
1508 instead, the sizes are required to be sufficiently large to carry all
1509 data inline, and of course the client and server reverse their roles
1510 with respect to which is in control of credit management. The same
1511 rules apply for all transfers, with the server being required to flow
1512 control its callback requests.

1514 The back channel is optional. If not bound on a given session, the
1515 server must not issue callback operations to the client. This in
1516 turn implies that such a client must never put itself in the
1517 situation where the server will need to do so, lest the client lose
1518 its connection by force, or its operation be incorrect. For the same
1519 reason, if a back channel is bound, the client is subject to
1520 revocation of its delegations if the back channel is lost. Any
1521 connection loss should be corrected by the client as soon as
1522 possible.

1524 This can be convenient for the NFSv4.1 client; if the client expects
1525 to make no use of back channel facilities such as delegations, then
1526 there is no need to create it. This may save significant resources
1527 and complexity at the client.

1529 For these reasons, if the client wishes to use the back channel, that
1530 channel must be bound first, before using the operations channel. In
1531 this way, the server will not find itself in a position where it will
1532 send callbacks on the operations channel when the client is not
1533 prepared for them.

1535 There is one special case, that where the back channel is bound in
1536 fact to the operations channel's connection.
This configuration
1537 would normally be used over a TCP stream connection to implement
1538 exactly the NFSv4.0 behavior, but over RDMA would require complex
1539 resource and event management at both sides of the connection. The
1540 server is not required to accept such a bind request on an RDMA
1541 connection for this reason, though it is recommended.

1543 3.8 COMPOUND Sizing Issues

1545 Very large responses may pose duplicate request cache issues. Since
1546 servers will want to bound the storage required for such a cache, the
1547 unlimited size of response data in COMPOUND may be troublesome. If
1548 COMPOUND is used in all its generality, then the inclusion of certain
1549 non-idempotent operations within a single COMPOUND request may render
1550 the entire request non-idempotent. (For example, a single COMPOUND
1551 request which read a file or symbolic link, then removed it, would be
1552 obliged to cache the data in order to allow identical replay).
1553 Therefore, many requests might include operations that return an
1554 arbitrarily large amount of data.

1556 It is not satisfactory for the server to reject COMPOUNDs at will
1557 with NFS4ERR_RESOURCE when they pose such difficulties for the
1558 server, as this results in serious interoperability problems.
1559 Instead, any such limits must be explicitly exposed as attributes of
1560 the session, ensuring that the server can explicitly support any
1561 duplicate request cache needs at all times.

1563 3.9 Data Alignment

1565 A negotiated data alignment enables certain scatter/gather
1566 optimizations. A facility for this is supported by [RPCRDMA]. Where
1567 NFS file data is the payload, specific optimizations become highly
1568 attractive.

1570 Header padding is requested by each peer at session initiation, and
1571 may be zero (no padding). Padding leverages the useful property that
1572 RDMA receives preserve alignment of data, even when they are placed
1573 into anonymous (untagged) buffers. If requested, client inline
1574 writes will insert appropriate pad bytes within the request header to
1575 align the data payload on the specified boundary. The client is
1576 encouraged to be optimistic and simply pad all WRITEs within the RPC
1577 layer to the negotiated size, in the expectation that the server can
1578 use them efficiently.

1580 It is highly recommended that clients offer to pad headers to an
1581 appropriate size. Most servers can make good use of such padding,
1582 which allows them to chain receive buffers in such a way that any
1583 data carried by client requests will be placed into appropriate
1584 buffers at the server, ready for filesystem processing. The
1585 receiver's RPC layer encounters no overhead from skipping over pad
1586 bytes, and the RDMA layer's high performance makes the insertion and
1587 transmission of padding on the sender a significant optimization. In
1588 this way, the need for servers to perform RDMA Read to satisfy all
1589 but the largest client writes is obviated. An added benefit is the
1590 reduction of message roundtrips on the network - a potentially good
1591 trade, where latency is present.

1593 The value to choose for padding is subject to a number of criteria.
1594 A primary source of variable-length data in the RPC header is the
1595 authentication information, the form of which is client-determined,
1596 possibly in response to server specification. The contents of
1597 COMPOUNDs, sizes of strings such as those passed to RENAME, etc.
all
1598 go into the determination of a maximal NFSv4 request size and
1599 therefore minimal buffer size. The client must select its offered
1600 value carefully, so as not to overburden the server, and vice versa.
1601 The payoff of an appropriate padding value is higher performance.

1603 Sender gather:
1604 |RPC Request|Pad bytes|Length| -> |User data...|
1605 \------+---------------------/ \
1606 \ \
1607 \ Receiver scatter: \-----------+- ...
1608 /-----+----------------\ \ \
1609 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...

1611 In the above case, the server may recycle unused buffers to the next
1612 posted receive if unused by the actual received request, or may pass
1613 the now-complete buffers by reference for normal write processing.
1614 For a server which can make use of it, this removes any need for data
1615 copies of incoming data, without resorting to complicated end-to-end
1616 buffer advertisement and management. This includes most kernel-based
1617 and integrated server designs, among many others. The client may
1618 perform similar optimizations, if desired.

1620 Padding is negotiated by the session creation operation, and
1621 subsequently used by the RPC RDMA layer, as described in [RPCRDMA].

1623 3.10 NFSv4 Integration

1625 The following section discusses the integration of the proposed RDMA
1626 extensions with NFSv4.0.

1628 3.10.1 Minor Versioning

1630 Minor versioning is the existing facility to extend the NFSv4
1631 protocol, and this proposal takes that approach.

1633 Minor versioning of NFSv4 is relatively restrictive, and allows for
1634 tightly limited changes only. In particular, it does not permit
1635 adding new "procedures" (it permits adding only new "operations").
1636 Interoperability concerns make it impossible to consider additional
1637 layering to be a minor revision. This somewhat limits the changes
1638 that can be proposed when considering extensions.

1640 To support the duplicate request cache integrated with sessions and
1641 request control, it is desirable to tag each request with an
1642 identifier to be called a Slotid. This identifier must be passed by
1643 NFSv4 when running atop any transport, including traditional TCP.
1644 Therefore it is not desirable to add the Slotid to a new RPC
1645 transport, even though such a transport is indicated for support of
1646 RDMA. This draft and [RPCRDMA] do not propose such an approach.

1648 Instead, this proposal conforms to the requirements of NFSv4 minor
1649 versioning, through the use of a new operation within NFSv4 COMPOUND
1650 procedures as detailed below.

1652 If sessions are in use for a given clientid, this same clientid
1653 cannot be used for non-session NFSv4 operation, including NFSv4.0.
1654 Because the server will have allocated session-specific state to the
1655 active clientid, it would be an unnecessary burden on the server
1656 implementor to support and account for additional, non-session
1657 traffic, in addition to being of no benefit. Therefore this proposal
1658 prohibits a single clientid from doing this. Nevertheless, employing
1659 a new clientid for such traffic is supported.

1661 3.10.2 Slot Identifiers and Server Duplicate Request Cache

1663 The presence of deterministic maximum request limits on a session
1664 enables in-progress requests to be assigned unique values with useful
1665 properties.

1667 The RPC layer provides a transaction ID (xid), which, while required
1668 to be unique, is not especially convenient for tracking requests.
The transaction ID is only meaningful to the issuer (client); it
1670 cannot be interpreted at the server except to test for equality with
1671 previously issued requests. Because RPC operations may be completed
1672 by the server in any order, many transaction IDs may be outstanding
1673 at any time. The client may therefore perform a computationally
1674 expensive lookup operation in the process of demultiplexing each
1675 reply.

1677 In the proposal, there is a limit to the number of active requests.
1678 This immediately enables a convenient, computationally efficient
1679 index for each request which is designated as a Slot Identifier, or
1680 slotid.

1682 When the client issues a new request, it selects a slotid in the
1683 range 0..N-1, where N is the server's current "totalrequests" limit
1684 granted the client on the session over which the request is to be
1685 issued. The slotid must be unused by any of the requests which the
1686 client has already active on the session. "Unused" here means the
1687 client has no outstanding request for that slotid. Because the
1688 slotid is always an integer in the range 0..N-1, client implementations
1689 can use the slotid from a server response to efficiently match
1690 responses with outstanding requests, such as, for example, by using
1691 the slotid to index into an outstanding request array. This can be
1692 used to avoid expensive hashing and lookup functions in the
1693 performance-critical receive path.

1695 The sequenceid, which accompanies the slotid in each request, is
1696 required for a second check at the server: it must be possible to
1697 determine efficiently whether a request using a certain
1698 slotid is a retransmit or a new, never-before-seen request. It is
1699 not feasible for the client to assert that it is retransmitting to
1700 implement this, because for any given request the client cannot know
1701 whether the server has seen it unless the server actually replies. Of
1702 course, if the client has seen the server's reply, the client would
1703 not retransmit!

1705 The sequenceid must increase monotonically for each new transmit of a
1706 given slotid, and must remain unchanged for any retransmission. The
1707 server must in turn compare each newly received request's sequenceid
1708 with the last one previously received for that slotid, to see if the
1709 new request is:

1711 A new request, in which the sequenceid is greater than that
1712 previously seen in the slot (accounting for sequence wraparound).
1713 The server proceeds to execute the new request.

1715 A retransmitted request, in which the sequenceid is equal to that
1716 last seen in the slot. Note that this request may be either
1717 complete, or in progress. The server performs replay processing
1718 in these cases.

1720 A misordered duplicate, in which the sequenceid is less than that
1721 previously seen in the slot. The server must drop the incoming
1722 request, which may imply dropping the connection if the transport
1723 is reliable, as dictated by section 3.1.1 of [RFC3530].

1725 This last condition is possible on any connection, not just
1726 unreliable, unordered transports. Delayed behavior on abandoned TCP
1727 connections which are not yet closed at the server, or pathological
1728 client implementations, can cause it, among other causes. Therefore,
1729 the server may wish to harden itself against certain repeated
1730 occurrences of this, as it would for retransmissions in [RFC3530].
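The three-way comparison above reduces to a few lines of code. A
sketch in C (hypothetical names), using a signed difference so that
sequence wraparound is accounted for naturally:

      #include <stdint.h>

      enum seq_check { SEQ_NEW, SEQ_RETRANSMIT, SEQ_MISORDERED };

      /* Compare an arriving sequenceid with the last one seen on the
         slot.  Casting the difference to a signed type means a value
         that has wrapped past zero still compares as "greater". */
      enum seq_check classify(uint32_t last_seen, uint32_t incoming)
      {
          int32_t delta = (int32_t)(incoming - last_seen);

          if (delta == 0)
              return SEQ_RETRANSMIT; /* replay from the cache */
          if (delta > 0)
              return SEQ_NEW;        /* execute, then cache result */
          return SEQ_MISORDERED;     /* drop request, and possibly
                                        the connection */
      }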
1732 It is recommended, though not necessary for protocol correctness,
1733 that the client simply increment the sequenceid by one for each new
1734 request on each slotid. This reduces the wraparound window to a
1735 minimum, and is useful for tracing and avoidance of possible
1736 implementation errors.

1738 The client may, however, for implementation-specific reasons, choose
1739 a different algorithm. For example, it might maintain a single sequence
1740 space for all slots in the session - e.g. employing the RPC XID
1741 itself. The sequenceid, in any case, is never interpreted by the
1742 server for anything but to test by comparison with previously seen
1743 values.

1745 The server may thereby use the slotid, in conjunction with the
1746 sessionid and sequenceid, within the SEQUENCE portion of the request
1747 to maintain its duplicate request cache (DRC) for the session, as
1748 opposed to the traditional approach of ONC RPC applications that use
1749 the XID along with certain transport information [RW96].

1751 Unlike the XID, the slotid is always within a specific range; this
1752 has two implications. The first implication is that for a given
1753 session, the server need only cache the results of a limited number
1754 of COMPOUND requests. The second implication derives from the first:
1755 unlike XID-indexed DRCs, the slotid DRC by its nature cannot be
1756 overflowed. Also, through use of the sequenceid to identify
1757 retransmitted requests, the server does not need to cache the
1758 request itself, further reducing the storage requirements of the
1759 DRC. These new facilities make it practical to maintain all the
1760 required entries for an effective DRC.

1762 The slotid and sequenceid therefore take over the traditional role of
1763 the port number in the server DRC implementation, and the session
1764 replaces the IP address. This approach is considerably more portable
1765 and completely robust - it is not subject to the frequent
1766 reassignment of ports as clients reconnect over IP networks. In
1767 addition, the RPC XID is not used in the reply cache, enhancing
1768 robustness of the cache in the face of any rapid reuse of XIDs by the
1769 client.

1771 The slotid information must be encoded into each request in a way
1772 that does not violate the minor versioning rules of the NFSv4.0
1773 specification. This is accomplished here by encoding it in a control
1774 operation within each NFSv4.1 COMPOUND and CB_COMPOUND procedure.
1775 The operation easily piggybacks within existing messages. The
1776 implementation section of this document describes the specific
1777 proposal.

1779 In general, the receipt of a new sequenced request arriving on any
1780 valid slot is an indication that the previous DRC contents of that
1781 slot may be discarded. In order to further assist the server in slot
1782 management, the client is required to use the lowest available slot
1783 when issuing a new request. In this way, the server may be able to
1784 retire additional entries.

1786 However, in the case where the server is actively adjusting its
1787 granted maximum request count to the client, it may not be able to
1788 use receipt of the slotid to retire cache entries. The slotid used
1789 in an incoming request may not reflect the server's current idea of
1790 the client's session limit, because the request may have been sent
1791 from the client before the update was received.
Therefore, in the
1792 downward adjustment case, the server may have to retain a number of
1793 duplicate request cache entries at least as large as the old value,
1794 until operation sequencing rules allow it to infer that the client
1795 has seen its reply.

1797 The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot"
1798 value which carries additional client slot usage information. The
1799 client must always provide its highest-numbered outstanding slot
1800 value in the maxslot argument, and the server may reply with a new
1801 recognized value. The client should in all cases provide the most
1802 conservative value possible, although it can be increased somewhat
1803 above the actual instantaneous usage to maintain some minimum or
1804 optimal level. This provides a way for the client to yield unused
1805 request slots back to the server, which in turn can use the
1806 information to reallocate resources. Obviously, maxslot can never be
1807 zero, or the session would deadlock.

1809 The server also provides a target maxslot value to the client, which
1810 is an indication to the client of the maxslot the server wishes the
1811 client to be using. This permits the server to withdraw resources
1812 from (or add resources to) a client that has been found not to be
1813 using them, in order to more fairly share resources among a varying
1814 level of demand from other clients. The client must always comply
1815 with the server's value updates, since they indicate newly established
1816 hard limits on the client's access to session resources. However,
1817 because of request pipelining, the client may have active requests in
1818 flight reflecting prior values; therefore, the server must not
1819 immediately require the client to comply.

1821 It is worthwhile to note that Sprite RPC [BW87] defined a "channel"
1822 which in some ways is similar to the slotid proposed here. Sprite
1823 RPC used channels to implement parallel request processing and
1824 request/response cache retirement.

1826 3.10.3 COMPOUND and CB_COMPOUND

1828 Support for per-operation control can be piggybacked onto NFSv4
1829 COMPOUNDs with full transparency, by placing such facilities into
1830 their own, new operation, and placing this operation first in each
1831 COMPOUND under the new NFSv4 minor protocol revision. The contents
1832 of the operation would then apply to the entire COMPOUND.

1834 Recall that the NFSv4 minor revision is contained within the COMPOUND
1835 header, encoded prior to the COMPOUNDed operations. By simply
1836 requiring that the new operation always be contained in NFSv4 minor
1837 COMPOUNDs, the control protocol can piggyback perfectly with each
1838 request and response.

1840 In this way, the NFSv4 RDMA Extensions may stay in compliance with
1841 the minor versioning requirements specified in section 10 of
1842 [RFC3530].

1844 Referring to section 13.1 of the same document, the proposed session-
1845 enabled COMPOUND and CB_COMPOUND have the form:

1847 +-----+--------------+-----------+------------+-----------+----
1848 | tag | minorversion | numops    | control op | op + args | ...
1849 |     | (== 1)       | (limited) | + args     |           |
1850 +-----+--------------+-----------+------------+-----------+----

1852 and the reply's structure is:

1854 +------------+-----+--------+-------------------------------+--//
1855 |last status | tag | numres | status + control op + results | //
1856 +------------+-----+--------+-------------------------------+--//
1857 //-----------------------+----
1858 // status + op + results | ...
1860 //-----------------------+----

1862 The single control operation within each NFSv4.1 COMPOUND defines the
1863 context and operational session parameters which govern that COMPOUND
1864 request and reply. Placing it first in the COMPOUND encoding is
1865 required in order to allow its processing before other operations in
1866 the COMPOUND.

1868 3.10.4 eXternal Data Representation Efficiency

1870 RDMA is a copy avoidance technology, and it is important to maintain
1871 this efficiency when decoding received messages. Traditional XDR
1872 implementations frequently use generated unmarshaling code to convert
1873 objects to local form, incurring a data copy in the process (in
1874 addition to subjecting the caller to recursive calls, etc). Often,
1875 such conversions are carried out even when no size or byte order
1876 conversion is necessary.

1878 It is recommended that implementations pay close attention to the
1879 details of memory referencing in such code. It is far more efficient
1880 to inspect data in place, using native facilities to deal with word
1881 size and byte order conversion into registers or local variables,
1882 rather than formally (and blindly) performing the operation via
1883 fetch, reallocate and store.

1885 Of particular concern is the result of the READDIR operation, in
1886 which such encoding abounds.

1888 3.10.5 Effect of Sessions on Existing Operations

1890 The use of a session replaces the use of the SETCLIENTID and
1891 SETCLIENTID_CONFIRM operations, and allows certain simplification of
1892 the RENEW and callback addressing mechanisms in the base protocol.

1894 The cb_program and cb_location which are obtained by the server in
1895 SETCLIENTID_CONFIRM must not be used by the server, because the
1896 NFSv4.1 client performs callback channel designation with
1897 BIND_BACKCHANNEL. Therefore the SETCLIENTID and SETCLIENTID_CONFIRM
1898 operations become obsolete when sessions are in use, and a server
1899 should return an error to NFSv4.1 clients which issue either
1900 operation.

1902 Another favorable result of the session is that the server is able to
1903 avoid requiring the client to perform OPEN_CONFIRM operations. The
1904 existence of a reliable and effective DRC means that the server will
1905 be able to determine whether an OPEN request carrying a previously
1906 known open_owner from a client is or is not a retransmission.
1907 Because of this, the server no longer requires OPEN_CONFIRM to verify
1908 whether the client is retransmitting an open request. This in turn
1909 eliminates the server's reason for requesting OPEN_CONFIRM - the
1910 server can simply replace any previous information on this
1911 open_owner. Client OPEN operations are therefore streamlined,
1912 reducing overhead and latency through avoiding the additional
1913 OPEN_CONFIRM exchange.

1915 Since the session carries the client liveness indication with it
1916 implicitly, any request on a session associated with a given client
1917 will renew that client's leases.
Therefore the RENEW operation is
1918 made unnecessary when a session is present, as any request (including
1919 a SEQUENCE operation with or without additional NFSv4 operations)
1920 performs its function. It is possible (though this proposal does not
1921 make any recommendation) that the RENEW operation could be made
1922 obsolete.

1924 An interesting issue arises, however, if an error occurs on such a
1925 SEQUENCE operation. If the SEQUENCE operation fails, perhaps due to
1926 an invalid slotid or other non-renewal-based issue, the server may or
1927 may not have performed the RENEW. In this case, the state of any
1928 renewal is undefined, and the client should make no assumption that
1929 it has been performed. In practice, this should not occur, but even
1930 if it did, it is expected the client would perform some sort of
1931 recovery which would result in a new, successful, SEQUENCE operation
1932 being run and the client assured that the renewal took place.

1934 3.10.6 Authentication Efficiencies

1936 NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor
1937 [RFC2203] to provide authentication, integrity, and privacy via
1938 cryptography. The server dictates to the client the use of
1939 RPCSEC_GSS, the service (authentication, integrity, or privacy), and
1940 the specific GSS-API security mechanism that each remote procedure
1941 call and result will use.

1943 If the connection's integrity is protected by additional means
1944 beyond RPCSEC_GSS, such as via IPsec, then the use of RPCSEC_GSS's
1945 integrity service is nearly redundant (See the Security
1946 Considerations section for more explanation of why it is "nearly" and
1947 not completely redundant). Likewise, if the connection's privacy is
1948 protected by additional means, then the use of both RPCSEC_GSS's
1949 integrity and privacy services is nearly redundant.

1951 Connection protection schemes, such as IPsec, are more likely to be
1952 implemented in hardware than upper layer protocols like RPCSEC_GSS.
1953 Hardware-based cryptography at the IPsec layer will be more efficient
1954 than software-based cryptography at the RPCSEC_GSS layer.

1956 When transport integrity can be obtained, it is possible for server
1957 and client to downgrade their per-operation authentication, after an
1958 appropriate exchange. This downgrade can in fact be as complete as
1959 to establish security mechanisms that have zero cryptographic
1960 overhead, effectively using the underlying integrity and privacy
1961 services provided by the transport.

1963 Based on the above observations, a new GSS-API mechanism, called the
1964 Channel Conjunction Mechanism [CCM], is being defined. The CCM works
1965 by creating a GSS-API security context using as input a cookie that
1966 the initiator and target have previously agreed to be a handle for a
1967 GSS-API context created previously over another GSS-API mechanism.

1969 NFSv4.1 clients and servers should support CCM, and they must use as
1970 the cookie the handle from a successful RPCSEC_GSS context creation
1971 over a non-CCM mechanism (such as Kerberos V5). The value of the
1972 cookie will be equal to the handle field of the rpc_gss_init_res
1973 structure from the RPCSEC_GSS specification.

1975 The [CCM] Draft provides further discussion and examples.

1977 3.11 Sessions Security Considerations

1979 NFSv4 minor version 1 retains all of the existing NFSv4 security; all
1980 security considerations present in NFSv4.0 apply to it equally.
1982 Security considerations of any underlying RDMA transport are
1983 additionally important, all the more so due to the emerging nature of
1984 such transports. Examining these issues is outside the scope of this
1985 draft.

1987 When protecting a connection with RPCSEC_GSS, all data in each
1988 request and response (whether transferred inline or via RDMA)
1989 continues to receive this protection over RDMA fabrics [RPCRDMA].
1990 However, when performing data transfers via RDMA, RPCSEC_GSS
1991 protection of the data transfer portion works against the efficiency
1992 which RDMA is typically employed to achieve. This is because such
1993 data is normally managed solely by the RDMA fabric, and intentionally
1994 is not touched by software. Therefore, when employing RPCSEC_GSS
1995 under CCM, and where integrity protection has been "downgraded", the
1996 cooperation of the RDMA transport provider is critical to maintain
1997 any integrity and privacy otherwise in place for the session. The
1998 means by which the local RPCSEC_GSS implementation is integrated with
1999 the RDMA data protection facilities are outside the scope of this
2000 draft.

2002 It is logical to use the same GSS context on a session's callback
2003 channel as that used on its operations channel(s), particularly when
2004 the connection is shared by both. The client must indicate to the
2005 server:

2007 - what security flavor(s) to use in the callback. A special
2008 callback flavor might be defined for this.

2010 - if the flavor is RPCSEC_GSS, then the client must have previously
2011 created an RPCSEC_GSS session with the server. The client offers to
2012 the server the opaque handle<> value from the rpc_gss_init_res
2013 structure, the window size of RPCSEC_GSS sequence numbers, and an
2014 opaque gss_cb_handle.

2016 This exchange can be performed as part of session and clientid
2017 creation, and the issue warrants careful analysis before being
2018 specified.

2020 If the NFS client wishes to maintain full control over RPCSEC_GSS
2021 protection, it may still perform its transfer operations using either
2022 the inline or RDMA transfer model, or of course employ traditional
2023 TCP stream operation. In the RDMA inline case, header padding is
2024 recommended to optimize behavior at the server. At the client, close
2025 attention should be paid to the implementation of RPCSEC_GSS
2026 processing to minimize memory referencing and especially copying.
2027 These are well-advised in any case!

2029 The proposed session callback channel binding improves security over
2030 that provided by NFSv4 for the callback channel. The connection is
2031 client-initiated, and subject to the same firewall and routing checks
2032 as the operations channel. The connection cannot be hijacked by an
2033 attacker who connects to the client port prior to the intended
2034 server. The connection is set up by the client with its desired
2035 attributes, such as optional protection with IPsec or similar. The
2036 binding is fully authenticated before being activated.

2038 3.11.1 Authentication

2040 Proper authentication of the principal which issues any session and
2041 clientid in the proposed NFSv4.1 operations exactly follows the
2042 similar requirement on client identifiers in NFSv4.0. It must not be
2043 possible for a client to impersonate another by guessing its session
2044 identifiers for NFSv4.1 operations, nor to bind a callback channel to
2045 an existing session.
To protect against this, NFSv4.0 requires
2046 appropriate authentication and matching of the principal used. This
2047 is discussed in Section 16, Security Considerations of [RFC3530].
2048 The same requirement when using a session identifier applies to
2049 NFSv4.1 here.

2051 Going beyond NFSv4.0, the presence of a session associated with any
2052 clientid may also be used to enhance NFSv4.1 security with respect to
2053 client impersonation. In NFSv4.0, there are many operations which
2054 carry no clientid, including in particular those which employ a
2055 stateid argument. A rogue client which wished to carry out a denial
2056 of service attack on another client could perform CLOSE, DELEGRETURN,
2057 etc. operations with that client's current filehandle, sequenceid and
2058 stateid, after having obtained them by eavesdropping or other
2059 means. Locking and open downgrade operations could be similarly
2060 attacked.

2062 When an NFSv4.1 session is in place for any clientid, countermeasures
2063 are easily applied through use of authentication by the server.
2064 Because the clientid and sessionid must be present in each request
2065 within a session, the server may verify that the clientid is in fact
2066 originating from a principal with the appropriate authenticated
2067 credentials, that the sessionid belongs to the clientid, and that the
2068 stateid is valid in these contexts. This is in general not possible
2069 with the affected operations in NFSv4.0 due to the fact that the
2070 clientid is not present in the requests.

2072 In the event that authentication information is not available in the
2073 incoming request, for example after a reconnection when the security
2074 was previously downgraded using CCM, the server must require that the
2075 client re-establish the authentication in order that the server may
2076 validate the other client-provided context, prior to executing any
2077 operation. The sessionid, present in the newly retransmitted
2078 request, combined with the retransmission detection enabled by the
2079 NFSv4.1 duplicate request cache, are a convenient and reliable
2080 context for the server to use for this contingency.

2082 The server should take care to protect itself against denial of
2083 service attacks in the creation of sessions and clientids. Clients
2084 which connect and create sessions, only to disconnect and never use
2085 them, may leave significant state behind. (The same issue applies to
2086 NFSv4.0 with clients who may perform SETCLIENTID, then never perform
2087 SETCLIENTID_CONFIRM.) Careful authentication coupled with resource
2088 checks is highly recommended.

2090 4. Directory Delegations

2092 4.1 Introduction to Directory Delegations

2094 The major addition to NFS version 4 in the area of caching is the
2095 ability of the server to delegate certain responsibilities to the
2096 client. When the server grants a delegation for a file to a client,
2097 the client receives certain semantics with respect to the sharing of
2098 that file with other clients. At OPEN, the server may provide the
2099 client either a read or write delegation for the file. If the client
2100 is granted a read delegation, it is assured that no other client has
2101 the ability to write to the file for the duration of the delegation.
2102 If the client is granted a write delegation, the client is assured
2103 that no other client has read or write access to the file.
4. Directory Delegations

4.1 Introduction to Directory Delegations

The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client receives certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file. This reduces network traffic and server load by allowing the client to perform certain operations on local file data, and can also provide stronger consistency for the local data.

Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in a DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed.

This caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. An example of high miss activity is compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments. Other distributed stateful filesystem architectures such as AFS and DFS have proven that adding state around directory contents can greatly reduce network traffic in high miss environments.

Delegation of directory contents is proposed as an extension for NFSv4. Such an extension would provide traffic reduction benefits similar to those of file delegations. By allowing clients to cache directory contents (in a read-only fashion) while being notified of changes, the client can avoid making frequent requests to interrogate the contents of slowly-changing directories, reducing network traffic and improving client performance.

These extensions allow improved namespace cache consistency to be achieved through delegations and synchronous recalls alone, without asking for notifications. In addition, if time-based consistency is sufficient, asynchronous notifications can provide performance benefits for the client, and possibly the server, under some common operating conditions such as slowly-changing and/or very large directories.

4.2 Directory Delegation Design (in brief)

A new operation GET_DIR_DELEGATION is used by the client to ask for a directory delegation. The delegation covers directory attributes and all entries in the directory. If either of these changes, the delegation will be recalled synchronously. The operation causing the recall will have to wait until the recall is complete. Changes to the attributes of individual directory entries will not cause the delegation to be recalled.

In addition to asking for delegations, a client can also ask for notifications for certain events. These events include changes to directory attributes and/or its contents.
If a client asks for notification for a certain event, the server will notify the client when that event occurs. This will not result in the delegation being recalled for that client. The notifications are asynchronous and provide a way of avoiding recalls in situations where a directory is changing enough that the pure recall model may not be effective, while still allowing the client to get substantial benefit. In the absence of notifications, once the delegation is recalled the client has to refresh its directory cache, which might not be very efficient for very large directories.

The delegation is read only, and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes, so that the server has knowledge of these changes. In order to keep the client namespace in sync with the server, the server will notify the delegation-holding client of the changes made as a result. This avoids any subsequent GETATTR or READDIR calls to the server. If a client holding the delegation makes any changes to the directory, the delegation will not be recalled.

Delegations can be recalled by the server at any time. Normally, the server will recall the delegation when the directory changes in a way that is not covered by the notification, or when the directory changes and notifications have not been requested.

Also, if the server notices that handing out a delegation for a directory is causing too many notifications to be sent out, it may decide not to hand out delegations for that directory or to recall existing delegations. If another client removes the directory for which a delegation has been granted, the server will recall the delegation.

Both the notification and recall operations need a callback path to exist between the client and server. If the callback path does not exist, then a delegation cannot be granted. Note that with the session extensions [talpey] this should not be an issue. In the absence of sessions, the server will have to establish a callback path to the client to send callbacks.

4.3 Recommended Attributes in support of Directory Delegations

supp_dir_attr_notice - notification delays on directory attributes

supp_child_attr_notice - notification delays on child attributes

These attributes allow the client and server to negotiate the frequency of notifications sent due to changes in attributes. These attributes are returned as part of a GETATTR call on the directory. The supp_dir_attr_notice value covers all attribute changes to the directory, and supp_child_attr_notice covers all attribute changes to any child in the directory.

These attributes are per directory. The client needs to get these values by doing a GETATTR on the directory for which it wants notifications. However, these attributes are only required when the client is interested in getting attribute notifications. For all other types of notifications, and for delegation requests without notifications, these attributes are not required.

When the client calls the GET_DIR_DELEGATION operation and asks for attribute change notifications, it will request a notification delay that is within the server's supported range.
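As an illustration, a client might clamp its desired delay to the advertised value before issuing GET_DIR_DELEGATION. This sketch assumes the advertised attribute acts as a lower bound on the delay a server will commit to; the names are hypothetical, and nfstime4 is simplified to a scalar.

   #include <stdint.h>

   typedef int64_t nfstime4_ns;   /* simplified stand-in for nfstime4 */

   /* Choose the notification delay to request in GET_DIR_DELEGATION,
      given supp_dir_attr_notice (or supp_child_attr_notice) as fetched
      by a GETATTR on the directory. */
   nfstime4_ns choose_notification_delay(nfstime4_ns desired,
                                         nfstime4_ns supported)
   {
       if (desired < 0)
           desired = 0;         /* negative nfstime4 values are illegal */
       if (desired < supported)
           return supported;    /* stay within the server's range */
       return desired;
   }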
If the requested delay falls outside the values advertised by supp_dir_attr_notice or supp_child_attr_notice, the server should not commit to sending notifications for that change event.

A value of zero for these attributes means the server will send the notification as soon as the change occurs. Setting this value to zero is not recommended, since it can put a significant burden on the server. A value of N means that the server will make a best effort guarantee that attribute notifications are not delayed by more than that. nfstime4 values that compute to negative values are illegal.

4.4 Delegation Recall

The server will recall the directory delegation by sending a callback to the client. It will use the same callback procedure as used for recalling file delegations. The server will recall the delegation when the directory changes in a way that is not covered by the notification. However, the server will not recall the delegation if attributes of an entry within the directory change. Also, if the server notices that handing out a delegation for a directory is causing too many notifications to be sent out, it may decide not to hand out a delegation for that directory. If another client tries to remove the directory for which a delegation has been granted, the server will recall the delegation.

The server will recall the delegation by sending a CB_RECALL callback to the client. If the recall is done because of a directory changing event, the request making that change will need to wait while the client returns the delegation.

4.5 Delegation Recovery

Crash recovery has two main goals: avoiding the necessity of breaking application guarantees with respect to locked files, and delivering updates cached at the client. Neither of these applies to directories protected by read delegations and notifications. Thus, the client is required to establish a new delegation on a server or client reboot.

5. Introduction

The NFSv4 protocol [2] specifies the interaction between a client that accesses files and a server that provides access to files and is responsible for coordinating access by multiple clients. As described in the pNFS problem statement, this requires that all access to a set of files exported by a single NFSv4 server be performed by that server; at high data rates the server may become a bottleneck.

The parallel NFS (pNFS) extensions to NFSv4 allow data accesses to bypass this bottleneck by permitting direct client access to the storage devices containing the file data. When file data for a single NFSv4 server is stored on multiple and/or higher throughput storage devices (by comparison to the server's throughput capability), the result can be significantly better file access performance.
The relationship among multiple clients, a single server, and multiple storage devices for pNFS (server and clients have access to all storage devices) is shown in this diagram:

   +-----------+
   |+-----------+                                +-----------+
   ||+-----------+                               |           |
   |||           |         NFSv4 + pNFS          |           |
   +||  Clients  |<----------------------------->|  Server   |
    +|           |                               |           |
     +-----------+                               |           |
        |||                                      +-----------+
        |||                                           |
        |||                                           |
        ||| Storage         +-----------+             |
        ||| Protocol        |+-----------+            |
        ||+-----------------||+-----------+  Control  |
        |+------------------|||           |  Protocol |
        +-------------------+||  Storage  |-----------+
                             +|  Devices  |
                              +-----------+

                          Figure 9

In this structure, the responsibility for coordination of file access by multiple clients is shared among the server, clients, and storage devices. This is in contrast to NFSv4 without pNFS extensions, in which this is primarily the server's responsibility, some of which can be delegated to clients under strictly specified conditions.

The pNFS extension to NFSv4 takes the form of new operations that manage data location information called a "layout". Layouts are managed in a similar fashion to NFSv4 data delegations (e.g., they are recallable and revocable). However, they are distinct abstractions and are manipulated with new operations. When a client holds a layout, it has rights to access the data directly using the location information in the layout.

There are new attributes that describe general layout characteristics. However, much of the required information cannot be managed solely within the attribute framework, because it will need to have a strictly limited term of validity, subject to invalidation by the server. This requires the use of new operations to obtain, return, recall, and modify layouts, in addition to new attributes.

This document specifies both the NFSv4 extensions required to distribute file access coordination between the server and its clients and an NFSv4 file storage protocol that may be used to access data stored on NFSv4 storage devices.

Storage protocols used to access a variety of other storage devices are deliberately not specified here. These might include:

o Block/volume protocols such as iSCSI ([3]) and FCP ([4]). The block/volume protocol support can be independent of the addressing structure of the block/volume protocol used, allowing more than one protocol to access the same file data and enabling extensibility to other block/volume protocols.

o Object protocols such as OSD over iSCSI or Fibre Channel [5].

o Other storage protocols, including PVFS and other file systems that are in use in HPC environments.

pNFS is designed to accommodate these protocols and be extensible to new classes of storage protocols that may be of interest.

The distribution of file access coordination between the server and its clients increases the level of responsibility placed on clients. Clients are already responsible for ensuring that suitable access checks are made to cached data and that attributes are suitably propagated to the server. Generally, a misbehaving client that hosts only a single user can only impact files accessible to that single user. Misbehavior by a client hosting multiple users may impact files accessible to all of its users.
NFSv4 delegations increase the level of client responsibility, as a client that carries out actions requiring a delegation without obtaining that delegation will cause its user(s) to see unexpected and/or incorrect behavior.

Some uses of pNFS extend the responsibility of clients beyond delegations. In some configurations, the storage devices cannot perform fine-grained access checks to ensure that clients are only performing accesses within the bounds permitted to them by the pNFS operations with the server (e.g., the checks may only be possible at file system granularity rather than file granularity). In situations where this added responsibility placed on clients creates unacceptable security risks, pNFS configurations in which storage devices cannot perform fine-grained access checks SHOULD NOT be used. All pNFS server implementations MUST support NFSv4 access to any file accessible via pNFS in order to provide an interoperable means of file access in such situations. See Section 8 on Security for further discussion.

Finally, there are issues about how layouts interact with the existing NFSv4 abstractions of data delegations and byte range locking. These issues, and others, are also discussed here.

6. General Definitions

This protocol extension partitions the NFSv4 file system protocol into two parts, the control path and the data path. The control path is implemented by the extended (p)NFSv4 server. When the file system being exported by (p)NFSv4 uses storage devices that are visible to clients over the network, the data path may be implemented by direct communication between the extended (p)NFSv4 file system client and the storage devices. This leads to a few new terms used to describe the protocol extension and some clarifications of existing terms.

6.1 Metadata Server

A pNFS "server" or "metadata server" is a server as defined by RFC3530 [2], which additionally provides support of the pNFS minor extension. When using the pNFS NFSv4 minor extension, the metadata server may hold only the metadata associated with a file, while the data can be stored on the storage devices. However, as in NFSv4, data may also be written through the metadata server. Note: directory data is always accessed through the metadata server.

6.2 Client

A pNFS "client" is a client as defined by RFC3530 [2], with the addition of supporting the pNFS minor extension server protocol and with the addition of supporting at least one storage protocol for performing I/O directly to storage devices.

6.3 Storage Device

This is a device, or server, that controls the file's data, but leaves other metadata management up to the metadata server. A storage device could be another NFS server, an Object Storage Device (OSD), or a block device accessed over a SAN (e.g., either a Fibre Channel or an iSCSI SAN). The goal of this extension is to allow direct communication between clients and storage devices.

6.4 Storage Protocol

This is the protocol between the pNFS client and the storage device used to access the file data. The following three types have been described: file protocols (e.g., NFSv4), object protocols (e.g., OSD), and block/volume protocols (e.g., based on SCSI block commands). These protocols are in turn realizable over a variety of transport stacks.
We anticipate there will be variations on these storage protocols, including new protocols that are unknown at this time or experimental in nature. The details of the storage protocols will be described in other documents so that pNFS clients can be written to use these storage protocols. Use of NFSv4 itself as a file-based storage protocol is described in Section 9.

6.5 Control Protocol

This is a protocol used by the exported file system between the server and storage devices. Specification of such protocols is outside the scope of this draft. Such control protocols would be used to control such activities as the allocation and deallocation of storage and the management of state required by the storage devices to perform client access control. The control protocol should not be confused with protocols used to manage LUNs in a SAN and other sysadmin kinds of tasks.

While the pNFS protocol allows for any control protocol, in practice the control protocol is closely related to the storage protocol. For example, if the storage devices are NFS servers, then the protocol between the pNFS metadata server and the storage devices is likely to involve NFS operations. Similarly, when object storage devices are used, the pNFS metadata server will likely use iSCSI/OSD commands to manipulate storage.

However, this document does not mandate any particular control protocol. Instead, it just describes the requirements on the control protocol for maintaining attributes like modify time, the change attribute, and the end-of-file position.

6.6 Metadata

This is information about a file, like its name, owner, where it is stored, and so forth. The information is managed by the exported file system server (metadata server). Metadata also includes lower-level information like block addresses and indirect block pointers. Depending on the storage protocol, block-level metadata may not be managed by the metadata server; it may instead be managed by Object Storage Devices or other servers acting as a storage device.

6.7 Layout

A layout defines how a file's data is organized on one or more storage devices. There are many possible layout types. They vary in the storage protocol used to access the data, and in the aggregation scheme that lays out the file data on the underlying storage devices. Layouts are described in more detail below.

7. pNFS protocol semantics

This section describes the semantics of the pNFS protocol extension to NFSv4; this is the protocol between the client and the metadata server.

7.1 Definitions

This sub-section defines a number of terms necessary for describing layouts and their semantics. In addition, it more precisely defines how layouts are identified and how they can be composed of smaller granularity layout segments.

7.1.1 Layout Types

A layout describes the mapping of a file's data to the storage devices that hold the data. A layout is said to belong to a specific "layout type" (see Section 10.1 for its RPC definition). The layout type allows for variants to handle different storage protocols (e.g., block/volume [6], object [7], and file [Section 9] layout types). A metadata server, along with its control protocol, must support at least one layout type.
A private sub-range of the layout type name space is also defined. Values from the private layout type range can be used for internal testing or experimentation.

As an example, a file layout type could be an array of tuples (e.g., <deviceID, file_handle>), along with a definition of how the data is stored across the devices (e.g., striping). A block/volume layout might be an array of tuples that store <deviceID, block_number, block_count>, along with information about block size and the file offset of the first block. An object layout might be an array of tuples <deviceID, objectID> and an additional structure (i.e., the aggregation map) that defines how the logical byte sequence of the file data is serialized into the different objects. Note, the actual layouts are more complex than these simple expository examples.

This document defines an NFSv4 file layout type using a stripe-based aggregation scheme (see Section 9). Adjunct specifications are being drafted that precisely define other layout formats (e.g., block/volume [6] and object [7] layouts) to allow interoperability among clients and metadata servers.

7.1.2 Layout Iomode

The iomode indicates to the metadata server the client's intent to perform either READs (only) or a mixture of I/O possibly containing WRITEs as well as READs (i.e., READ/WRITE). For certain layout types, it is useful for a client to specify this intent at LAYOUTGET time. E.g., for block/volume based protocols, block allocation could occur when a READ/WRITE iomode is specified. A special LAYOUTIOMODE_ANY iomode is defined and can only be used for LAYOUTRETURN and LAYOUTRECALL, not for LAYOUTGET. It specifies that layouts pertaining to both READ and RW iomodes are being returned or recalled, respectively.

A storage device may validate I/O with regard to the iomode; this is dependent upon storage device implementation. Thus, if the client's layout iomode differs from the I/O being performed, the storage device may reject the client's I/O with an error indicating that a new layout with the correct I/O mode should be fetched. E.g., if a client gets a layout with a READ iomode and performs a WRITE to a storage device, the storage device is allowed to reject that WRITE.

The iomode does not conflict with OPEN share modes or lock requests; open mode checks and lock enforcement are always performed, and are logically separate from the pNFS layout level. As well, open modes and locks are the preferred method for restricting user access to data files. E.g., an OPEN of read, deny-write does not conflict with a LAYOUTGET containing an iomode of READ/WRITE performed by another client. Applications that depend on writing into the same file concurrently may use byte range locking to serialize their accesses.

7.1.3 Layout Segments

Until this point, layouts have been defined in a fairly vague manner. A layout is more precisely identified by the following tuple: <clientID, FH, layout type>; the FH refers to the FH of the file on the metadata server. Note, layouts describe a file, not a byte-range of a file.

Since a layout that describes an entire file may be very large, there is a desire to manage layouts in smaller chunks that correspond to byte-ranges of the file. For example, the entire layout need not be returned, recalled, or committed.
These chunks are called "layout segments" and are further identified by the byte-range they represent. Layout operations require the identification of the layout segment (i.e., clientID, FH, layout type, and byte-range), as well as the iomode. This structure allows clients and metadata servers to aggregate the results of layout operations into a single, maintained layout.

It is important to define when layout segments overlap and/or conflict with each other. For a layout segment to overlap another layout segment, both segments must be of the same layout type, correspond to the same filehandle, and have the same iomode; in addition, the byte-ranges of the segments must overlap. Layout segments conflict when they overlap and differ in the content of the layout (i.e., the storage device/file mapping parameters differ). Note, differing iomodes do not lead to conflicting layouts. It is permissible for layout segments with different iomodes, pertaining to the same byte range, to be held by the same client.

7.1.4 Device IDs

The "deviceID" is a short name for a storage device. In practice, a significant amount of information may be required to fully identify a storage device. Instead of embedding all that information in a layout, a level of indirection is used. Layouts embed device IDs, and a new operation (GETDEVICEINFO) is used to retrieve the complete identity information about the storage device according to its layout type. For example, the identity of a file server or object server could be an IP address and port. The identity of a block device could be a volume label. Due to multipath connectivity in a SAN environment, agreement on a volume label is considered the reliable way to locate a particular storage device.

The device ID is qualified by the layout type and unique per file system (FSID). This allows different layout drivers to generate device IDs without the need for coordination. In addition to GETDEVICEINFO, another operation, GETDEVICELIST, has been added to allow clients to fetch the mappings of multiple storage devices attached to a metadata server.

Clients cannot expect the mapping between device ID and storage device address to persist across server reboots; hence, a client MUST fetch new mappings on startup or upon detection of a metadata server reboot, unless it can revalidate its existing mappings. Not all layout types support such revalidation, and the means of doing so is layout specific. If data are reorganized from a storage device with a given device ID to a different storage device (i.e., if the mapping between storage device and data changes), the layout describing the data MUST be recalled rather than assigning the new storage device to the old device ID.

7.1.5 Aggregation Schemes

Aggregation schemes can describe layouts like simple one-to-one mapping, concatenation, and striping. A general aggregation scheme allows nested maps so that more complex layouts can be compactly described. The canonical aggregation type for this extension is striping, which allows a client to access storage devices in parallel. Even a one-to-one mapping is useful for a file server that wishes to distribute its load among a set of other file servers.
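The identification rules of this sub-section can be summarized in a short sketch. The following illustrative predicates implement the overlap and conflict definitions of Section 7.1.3; the types are hypothetical simplifications (e.g., the filehandle is reduced to a comparable scalar and the layout content to an opaque byte string).

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   struct layout_segment {
       uint32_t    layout_type;
       uint32_t    iomode;          /* READ or READ/WRITE */
       uint64_t    fh;              /* stand-in for the metadata server FH */
       uint64_t    offset, length;  /* byte-range of the segment */
       const void *content;         /* layout type specific mapping */
       size_t      content_len;
   };

   /* Overlap: same layout type, same filehandle, same iomode, and
      intersecting byte-ranges (Section 7.1.3). */
   bool segments_overlap(const struct layout_segment *a,
                         const struct layout_segment *b)
   {
       return a->layout_type == b->layout_type &&
              a->fh == b->fh &&
              a->iomode == b->iomode &&
              a->offset < b->offset + b->length &&
              b->offset < a->offset + a->length;
   }

   /* Conflict: overlapping segments whose mapping content differs.
      Differing iomodes never conflict. */
   bool segments_conflict(const struct layout_segment *a,
                          const struct layout_segment *b)
   {
       return segments_overlap(a, b) &&
              (a->content_len != b->content_len ||
               memcmp(a->content, b->content, a->content_len) != 0);
   }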
7.2 Guarantees Provided by Layouts

Layouts delegate to the client the ability to access data out of band. The layout guarantees the holder that the layout will be recalled when the state encapsulated by the layout becomes invalid (e.g., through some operation that directly or indirectly modifies the layout) or, possibly, when a conflicting layout is requested, as determined by the layout's iomode. When a layout is recalled, and then returned by the client, the client retains the ability to access file data with normal NFSv4 I/O operations through the metadata server. Only the right to do I/O out-of-band is affected.

Holding a layout does not guarantee that a user of the layout has the rights to access the data represented by the layout. All user access rights MUST be obtained through the appropriate open, lock, and access operations (i.e., those that would be used in the absence of pNFS). However, if a valid layout for a file is not held by the client, the storage device should reject all I/Os to that file's byte range that originate from that client. In summary, layouts and ordinary file access controls are independent. The act of modifying a file for which a layout is held does not necessarily conflict with the holding of the layout that describes the file being modified. However, with certain layout types (e.g., block/volume layouts), the layout's iomode must agree with the type of I/O being performed.

Depending upon the layout type and storage protocol in use, storage device access permissions may be granted by LAYOUTGET and may be encoded within the type specific layout. If access permissions are encoded within the layout, the metadata server must recall the layout when those permissions become invalid for any reason; for example, when a file becomes unwritable or inaccessible to a client. Note, clients are still required to perform the appropriate access operations as described above (e.g., open and lock ops). The degree to which it is possible for the client to circumvent these access operations must be clearly addressed by the individual layout type documents, as well as the consequences of doing so. In addition, these documents must be clear about the requirements and non-requirements for the checking performed by the server.

If the pNFS metadata server supports mandatory byte range locks, then byte range locks must behave as specified by the NFSv4 protocol, as observed by users of files. If a storage device is unable to restrict access by a pNFS client that does not hold a required mandatory byte range lock, then the metadata server must not grant that client layouts, for that storage device, that permit any access conflicting with a mandatory byte range lock held by another client. In this scenario, it is also necessary for the metadata server to ensure that byte range locks are not granted to a client if any other client holds a conflicting layout; in this case all conflicting layouts must be recalled and returned before the lock request can be granted. This requires the pNFS server to understand the capabilities of its storage devices.
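The requirement in the preceding paragraph might be realized by a check of the following form on the metadata server. This is a sketch under the assumption that the server knows, per layout type, whether its storage devices can enforce mandatory locks; all names are hypothetical.

   #include <stdbool.h>
   #include <stdint.h>

   struct byte_range { uint64_t offset, length; };

   /* Hypothetical server-side helpers. */
   bool storage_enforces_mandatory_locks(uint32_t layout_type);
   bool conflicts_with_mandatory_lock(uint64_t fileid,
                                      const struct byte_range *range,
                                      uint32_t iomode);

   /* Grant a layout only if it cannot be used to bypass a mandatory
      byte range lock held by another client. */
   bool may_grant_layout(uint64_t fileid, uint32_t layout_type,
                         uint32_t iomode, const struct byte_range *range)
   {
       if (storage_enforces_mandatory_locks(layout_type))
           return true;    /* the storage device will police access */
       return !conflicts_with_mandatory_lock(fileid, range, iomode);
   }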
7.3 Getting a Layout

A client obtains a layout through a new operation, LAYOUTGET. The metadata server will give out layouts of a particular type (e.g., block/volume, object, or file) and aggregation as requested by the client. The client selects an appropriate layout type which the server supports and the client is prepared to use. The layout returned to the client may not line up exactly with the requested byte range. A field within the LAYOUTGET request, "minlength", specifies the minimum overlap that MUST exist between the requested layout and the layout returned by the metadata server. The "minlength" field should specify a size of at least one. A metadata server may give out multiple overlapping, non-conflicting layout segments to the same client in response to a LAYOUTGET.

There is no implied ordering between getting a layout and performing a file OPEN. For example, a layout may first be retrieved by placing a LAYOUTGET operation in the same compound as the initial file OPEN. Once the layout has been retrieved, it can be held across multiple OPEN and CLOSE sequences.

The storage protocol used by the client to access the data on the storage device is determined by the layout's type. The client needs to select a "layout driver" that understands how to interpret and use that layout. The API used by the client to talk to its drivers is outside the scope of the pNFS extension. The storage protocol between the client's layout driver and the actual storage is covered by other protocol specifications such as iSCSI (block storage), OSD (object storage), or NFS (file storage).

Although the metadata server is in control of the layout for a file, the pNFS client can provide hints to the server, when a file is opened or created, about the preferred layout type and aggregation scheme. The pNFS extension introduces a LAYOUT_HINT attribute that the client can set at creation time to provide a hint to the server for new files. It is suggested that this attribute be set as one of the initial attributes to OPEN when creating a new file. Setting this attribute separately, after the file has been created, could make it difficult, or impossible, for the server implementation to comply.

7.4 Committing a Layout

Due to the nature of the protocol, the file attributes and data location mapping (e.g., which offsets store data vs. store holes) that exist on the metadata storage device may become inconsistent in relation to the data stored on the storage devices; e.g., when WRITEs occur before a layout has been committed (e.g., between a LAYOUTGET and a LAYOUTCOMMIT). Thus, it is necessary to occasionally re-sync this state and make it visible to other clients through the metadata server.

The LAYOUTCOMMIT operation is responsible for committing a modified layout segment to the metadata server. Note: the data should be written and committed to the appropriate storage devices before the LAYOUTCOMMIT occurs. If the data is being written asynchronously through the metadata server, a COMMIT to the metadata server is required to sync the data and make it visible on the storage devices (see Section 7.6 for more details). The scope of this operation depends on the storage protocol in use. For block/volume-based layouts, it may require updating the block list that comprises the file and committing this layout to stable storage. For file-based layouts, it requires some synchronization of attributes between the metadata and storage devices (i.e., mainly the size attribute; EOF).
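The ordering described in the paragraph above is sketched below for the case where the client writes through the data path; the function names are hypothetical.

   struct pnfs_file;   /* opaque per-file client state */

   /* Hypothetical client-side operations. */
   int write_dirty_data_to_storage_devices(struct pnfs_file *f);
   int commit_on_storage_devices(struct pnfs_file *f);
   int layoutcommit(struct pnfs_file *f);

   /* Write and commit data on the storage devices first, then make
      the result visible through the metadata server. */
   int flush_then_layoutcommit(struct pnfs_file *f)
   {
       int err = write_dirty_data_to_storage_devices(f);
       if (err == 0)
           err = commit_on_storage_devices(f);
       if (err == 0)
           err = layoutcommit(f);
       return err;
   }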
It is important to note that the level of synchronization is from the point of view of the client which issued the LAYOUTCOMMIT. The updated state on the metadata server need only reflect the state as of the client's last operation previous to the LAYOUTCOMMIT; it need not reflect a globally synchronized state (e.g., other clients may be performing, or may have performed, I/O since the client's last operation and the LAYOUTCOMMIT).

The control protocol is free to synchronize the attributes before it receives a LAYOUTCOMMIT; however, upon successful completion of a LAYOUTCOMMIT, state that exists on the metadata server that describes the file MUST be in sync with the state existing on the storage devices that comprise that file as of the issuing client's last operation. Thus, a client that queries the size of a file between a WRITE to a storage device and the LAYOUTCOMMIT may observe a size that does not reflect the actual data written.

7.4.1 LAYOUTCOMMIT and mtime/atime/change

The change attribute and the modify/access times may be updated by the server at LAYOUTCOMMIT time, since for some layout types the change attribute and atime/mtime cannot be updated by the appropriate I/O operation performed at a storage device. The arguments to LAYOUTCOMMIT allow the client to provide suggested access and modify time values to the server. Again, depending upon the layout type, these client provided values may or may not be used. The server should sanity check the client provided values before they are used. For example, the server should ensure that time does not flow backwards. According to the NFSv4 specification, the client always has the option to set these attributes through an explicit SETATTR operation.

As mentioned, for some layout protocols the change attribute and mtime/atime may be updated at or after the time the I/O occurred (e.g., if the storage device is able to communicate these attributes to the metadata server). If, upon receiving a LAYOUTCOMMIT, the server implementation is able to determine that the file did not change since the last time the change attribute was updated (e.g., no WRITEs or over-writes occurred), the implementation need not update the change attribute; file-based protocols may have enough state to make this determination or may update the change attribute upon each file modification. This also applies for mtime and atime; if the server implementation is able to determine that the file has not been modified since the last mtime update, the server need not update mtime at LAYOUTCOMMIT time. Once LAYOUTCOMMIT completes, the new change attribute and mtime/atime should be visible if that file was modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.

7.4.2 LAYOUTCOMMIT and size

The file's size may be updated at LAYOUTCOMMIT time as well. The LAYOUTCOMMIT operation contains an argument that indicates the last byte offset to which the client wrote ("last_write_offset"). Note: for this offset to be viewed as a file size it must be incremented by one byte (e.g., a write to offset 0 would map into a file size of 1, but the last write offset is 0).
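The offset-to-size conversion, together with the sanity check of option 1 in the list that follows, might look like this sketch (hypothetical names):

   #include <stdint.h>

   /* A last write at offset N implies a size of at least N + 1 bytes;
      never truncate based on a smaller last write offset. */
   uint64_t size_from_last_write_offset(uint64_t last_write_offset,
                                        uint64_t current_size)
   {
       uint64_t implied_size = last_write_offset + 1;
       return implied_size > current_size ? implied_size : current_size;
   }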
The metadata server may do one of the following:

1. It may update the file's size based on the last write offset. However, to the extent possible, the metadata server should sanity check any value to which the file's size is going to be set. E.g., it must not truncate the file based on the client presenting a smaller last write offset than the file's current size.

2. If it has sufficient other knowledge of the file size (e.g., by querying the storage devices through the control protocol), it may ignore the client provided argument and use the query-derived value.

3. It may use the last write offset as a hint, subject to correction when other information is available as above.

The method chosen to update the file's size will depend on the storage device's and/or the control protocol's implementation. For example, if the storage devices are block devices with no knowledge of file size, the metadata server must rely on the client to set the size appropriately. A new size flag and length are also returned in the results of a LAYOUTCOMMIT. This union indicates whether a new size was set, and to what length it was set. If a new size is set as a result of LAYOUTCOMMIT, then the metadata server must reply with the new size. As well, if the size is updated, the metadata server, in conjunction with the control protocol, SHOULD ensure that the new size is reflected by the storage devices immediately upon return of the LAYOUTCOMMIT operation; e.g., a READ up to the new file size should succeed on the storage devices (assuming no intervening truncations). Again, if the client wants to explicitly zero-extend or truncate a file, SETATTR must be used; it need not be used when simply writing past EOF.

Since client layout holders may be unaware of changes made to the file's size by other clients (through LAYOUTCOMMIT or SETATTR), an additional callback/notification has been added for pNFS. CB_SIZECHANGED is a notification that the metadata server sends to layout holders to notify them of a change in file size. This is preferred over issuing CB_LAYOUTRECALL to each of the layout holders.

7.4.3 LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT operation contains a "layoutupdate" argument. This argument is a layout type specific structure. The structure can be used to pass arbitrary layout type specific information from the client to the metadata server at LAYOUTCOMMIT time. For example, if using a block/volume layout, the client can indicate to the metadata server which reserved or allocated blocks it used and which it did not. The "layoutupdate" structure need not be the same structure as the layout returned by LAYOUTGET. The structure is defined by the layout type and is opaque to LAYOUTCOMMIT.

7.5 Recalling a Layout

7.5.1 Basic Operation

Since a layout protects a client's access to a file via a direct client-storage-device path, a layout need only be recalled when it is semantically unable to serve this function. Typically, this occurs when the layout no longer encapsulates the true location of the file over the byte range it represents. Any operation or action (e.g., server driven restriping or load balancing) that changes the layout will result in a recall of the layout.
A layout is recalled by the CB_LAYOUTRECALL callback operation (see Section 14.19). This callback can either recall a layout segment identified by a byte range, or all the layouts associated with a file system (FSID). However, there is no single operation to return all layouts associated with an FSID; multiple layout segments may be returned in a single compound operation. Section 7.5.3 discusses sequencing issues surrounding the getting, returning, and recalling of layouts.

The iomode is also specified when recalling a layout or layout segment. Generally, the iomode in the recall request must match the layout, or segment, being returned; e.g., a recall with an iomode of RW should cause the client to only return RW layout segments (not R segments). However, a special LAYOUTIOMODE_ANY enumeration is defined to enable recalling a layout of any type (i.e., the client must return both read-only and read/write layouts).

A REMOVE operation may cause the metadata server to recall the layout to prevent the client from accessing a non-existent file and to reclaim state stored on the client. Since a REMOVE may be delayed until the last close of the file has occurred, the recall may also be delayed until this time. As well, once the file has been removed, after the last reference, the client SHOULD no longer be able to perform I/O using the layout (e.g., with file-based layouts an error such as ESTALE could be returned).

Although the pNFS extension does not alter the caching capabilities of clients, or their semantics, it recognizes that some clients may perform more aggressive write-behind caching to optimize the benefits provided by pNFS. However, write-behind caching may impact the latency in returning a layout in response to a CB_LAYOUTRECALL, just as caching impacts DELEGRETURN with regard to data delegations. Client implementations should limit the amount of dirty data they have outstanding at any one time. Server implementations may fence clients from performing direct I/O to the storage devices if they perceive that the client is taking too long to return a layout once recalled. A server may be able to monitor client progress by watching client I/Os or by observing LAYOUTRETURNs of sub-portions of the recalled layout. The server can also limit the amount of dirty data to be flushed to storage devices by limiting the byte ranges covered in the layouts it gives out.

Once a layout has been returned, the client MUST NOT issue I/Os to the storage devices for the file, byte range, and iomode represented by the returned layout. If a client does issue an I/O to a storage device for which it does not hold a layout, the storage device SHOULD reject the I/O.

7.5.2 Recall Callback Robustness

For simplicity, the discussion thus far has assumed that pNFS client state for a file exactly matches the pNFS server state for that file and client regarding layout ranges and permissions. This assumption leads to the implicit assumption that any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in the callback, since both client and server agree about the state being maintained. However, it can be useful if this assumption does not always hold.
For example:

o It may be useful for clients to be able to discard layout information without calling LAYOUTRETURN. If conflicts that require callbacks are very rare, and a server can use a multi-file callback to recover per-client resources (e.g., via an FSID recall, or a multi-file recall within a single compound), the result may be significantly less client-server pNFS traffic.

o It may be similarly useful for servers to enhance the information they maintain about what layout ranges are held by a client beyond what a client actually holds. In the extreme, a server could manage conflicts on a per-file basis, only issuing whole-file callbacks even though clients may request and be granted sub-file ranges.

o As well, the synchronized state assumption is not robust to minor errors. A more robust design would allow for divergence between client and server and the ability to recover. It is vital that a client not assign itself layout permissions beyond what the server has granted, and that the server not forget layout permissions that have been granted, in order to avoid errors. On the other hand, if a server believes that a client holds a layout segment that the client does not know about, it is useful for the client to be able to issue the LAYOUTRETURN that the server is expecting in response to a recall.

Thus, in light of the above, it is useful for a server to be able to issue callbacks for layout ranges it has not granted to a client, and for a client to return ranges it does not hold. A pNFS client must always return layout segments that comprise the full range specified by the recall. Note, the full recalled layout range need not be returned as part of a single operation, but may be returned in segments. This allows the client to stage the flushing of dirty data, layout commits, and returns. Also, it indicates to the metadata server that the client is making progress.

In order to ensure client/server convergence on the layout state, the final LAYOUTRETURN operation in a sequence of returns for a particular recall SHOULD specify the entire range being recalled, even if layout segments pertaining to partial ranges were previously returned. In addition, if the client holds no layout segment that overlaps the range being recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT error code. This allows the server to update its view of the client's layout state.
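A client's handling of a recall, staged returns included, might follow this outline. It is a sketch only: the error code value and all helper names are hypothetical, and error handling is omitted for brevity.

   #include <stddef.h>
   #include <stdint.h>

   enum { NFS4ERR_NOMATCHING_LAYOUT = 10060 };   /* value illustrative */

   struct byte_range { uint64_t offset, length; };
   struct held_segment { uint64_t offset, length; };
   struct layout_state;    /* opaque list of held segments for the file */

   /* Hypothetical client-side helpers. */
   int  client_holds_overlap(struct layout_state *st,
                             const struct byte_range *r);
   struct held_segment *next_overlapping_segment(struct layout_state *st,
                                                 const struct byte_range *r);
   void flush_dirty_data_and_commit(struct held_segment *seg);
   int  layoutreturn(uint64_t offset, uint64_t length);

   int respond_to_layoutrecall(struct layout_state *st,
                               const struct byte_range *recalled)
   {
       struct held_segment *seg;

       if (!client_holds_overlap(st, recalled))
           return NFS4ERR_NOMATCHING_LAYOUT;   /* nothing overlaps the recall */

       /* Stage flushing, commits, and partial returns... */
       while ((seg = next_overlapping_segment(st, recalled)) != NULL) {
           flush_dirty_data_and_commit(seg);
           layoutreturn(seg->offset, seg->length);
       }
       /* ...then converge: the final return SHOULD cover the entire
          recalled range. */
       return layoutreturn(recalled->offset, recalled->length);
   }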
7.5.3 Recall/Return Sequencing

As with other stateful operations, pNFS requires the correct sequencing of layout operations. This proposal assumes that sessions will precede or accompany pNFS into NFSv4.x, and thus that pNFS will require the use of sessions. If the sessions proposal does not precede pNFS, then this proposal needs to be modified to provide for the correct sequencing of pNFS layout operations. Also, this specification relies on the sessions protocol to provide the correct sequencing between regular operations and callbacks. It is the server's responsibility to avoid inconsistencies regarding the layouts it hands out, and the client's responsibility to properly serialize its layout requests.

One critical issue with operation sequencing concerns callbacks. The protocol must defend against races between the reply to a LAYOUTGET operation and a subsequent CB_LAYOUTRECALL. It MUST NOT be possible for a client to process the CB_LAYOUTRECALL for a layout that it has not received in a reply message to a LAYOUTGET.

7.5.3.1 Client Side Considerations

Consider a pNFS client that has issued a LAYOUTGET and then receives an overlapping recall callback for the same file. There are two possibilities, which the client cannot distinguish when the callback arrives:

1. The server processed the LAYOUTGET before issuing the recall, so the LAYOUTGET response is in flight, and must be waited for because it may be carrying layout info that will need to be returned to deal with the recall callback.

2. The server issued the callback before receiving the LAYOUTGET. The server will not respond to the LAYOUTGET until the recall callback is processed.

This can cause deadlock, as the client must wait for the LAYOUTGET response before processing the recall in the first case, but that response will not arrive until after the recall is processed in the second case. This deadlock can be avoided by adhering to the following requirements:

o A LAYOUTGET MUST be rejected with an error (i.e., NFS4ERR_RECALLCONFLICT) if there is an overlapping outstanding recall callback to the same client.

o When processing a recall, the client MUST wait for a response to all conflicting outstanding LAYOUTGETs before performing any RETURN that could be affected by any such response.

o The client SHOULD wait for responses to all operations required to complete a recall before sending any LAYOUTGETs that would conflict with the recall, because the server is likely to return errors for them.

Now the client can wait for the LAYOUTGET response, as it will be received in both cases.

7.5.3.2 Server Side Considerations

Consider a related situation from the pNFS server's point of view. The server has issued a recall callback and receives an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond to the recall callback. Again, there are two cases:

1. The client issued the LAYOUTGET before processing the recall callback.

2. The client issued the LAYOUTGET after processing the recall callback, but it arrived before the LAYOUTRETURN that completed that processing.

The simplest approach is to always reject the overlapping LAYOUTGET. The client has two ways to avoid this result: it can issue the LAYOUTGET as a subsequent element of a COMPOUND containing the LAYOUTRETURN that completes the recall callback, or it can wait for the response to that LAYOUTRETURN.

This leads to a more general problem; in the absence of a callback, if a client issues concurrent overlapping LAYOUTGET and LAYOUTRETURN operations, it is possible for the server to process them in either order. Again, a client must take the appropriate precautions in serializing its actions.

[ASIDE: HighRoad forbids a client from doing this, as the per-file layout stateid will cause one of the two operations to be rejected with a stale layout stateid.
This approach is simpler and produces better results by comparison to allowing concurrent operations, at least for this sort of conflict case, because server execution of operations in an order not anticipated by the client may produce results that are not useful to the client (e.g., if a LAYOUTRETURN is followed by a concurrent overlapping LAYOUTGET, but they are executed in the other order, the client will not retain layout extents for the overlapping range).]

7.6 Metadata Server Write Propagation

Asynchronous writes written through the metadata server may be propagated lazily to the storage devices. For data written asynchronously through the metadata server, a client performing a read at the appropriate storage device is not guaranteed to see the newly written data until a COMMIT occurs at the metadata server. While the write is pending, reads to the storage device can return either the old data, the new data, or a mixture thereof. After either a synchronous write completes, or a COMMIT is received (for asynchronously written data), the metadata server must ensure that storage devices return the new data and that the data has been written to stable storage. If the server implements its storage in any way such that it cannot obey these constraints, then it must recall the layouts to prevent reads being done that cannot be handled correctly.

7.7 Crash Recovery

Crash recovery is complicated due to the distributed nature of the pNFS protocol. In general, crash recovery for layouts is similar to crash recovery for delegations in the base NFSv4 protocol. However, the client's ability to perform I/O without contacting the metadata server introduces subtleties that must be handled correctly if file system corruption is to be avoided.

7.7.1 Leases

The layout lease period plays a critical role in crash recovery. Depending on the capabilities of the storage protocol, it is crucial that the client be able to maintain an accurate layout lease timer to ensure that I/Os are not issued to storage devices after expiration of the layout lease period. In order for the client to do so, it must know which operations renew a lease.

7.7.1.1 Lease Renewal

The current NFSv4 specification allows for implicit lease renewals to occur upon receiving an I/O. However, due to the distributed pNFS architecture, implicit lease renewals are limited to operations performed at the metadata server; this includes I/O performed through the metadata server. So, a client must not assume that READ and WRITE I/O to storage devices implicitly renew lease state.

If sessions are required for pNFS, as has been suggested, then the SEQUENCE operation is to be used to explicitly renew leases. It is proposed that the SEQUENCE operation be extended to return all the specific information that RENEW does, but as part of a normal reply rather than as an error. When using sessions, beginning each compound with the SEQUENCE operation allows renewals to be performed without an additional operation and without an additional request. Again, the client must not rely on any operation to the storage devices to renew a lease. Using the SEQUENCE operation for renewals simplifies the client's perception of lease renewal.
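Under the sessions assumption, a client's renewal logic reduces to making sure some compound, even one containing only SEQUENCE, is sent often enough. A sketch, with hypothetical names and a deliberately conservative threshold:

   #include <time.h>

   struct session;   /* opaque */

   struct lease_state {
       struct session *session;
       time_t          last_renewing_send;   /* send time of last compound */
       time_t          lease_period;
   };

   void send_sequence_only_compound(struct session *s);

   /* Called periodically; renews via SEQUENCE if the client has been
      idle for half the lease period. */
   void lease_renewal_tick(struct lease_state *ls, time_t now)
   {
       if (now - ls->last_renewing_send >= ls->lease_period / 2) {
           send_sequence_only_compound(ls->session);   /* renews the lease */
           ls->last_renewing_send = now;
       }
   }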
7.7.1.2 Client Lease Timer

Depending on the storage protocol and layout type in use, it may be crucial that the client not issue I/Os to storage devices if the corresponding layout's lease has expired. Doing so may lead to file system corruption if the layout has been given out and used by another client. In order to prevent this, the client must maintain an accurate lease timer for all layouts held. RFC3530 has the following to say regarding the maintenance of a client lease timer:

   ...the client must track operations which will renew the lease
   period. Using the time that each such request was sent and the
   time that the corresponding reply was received, the client should
   bound the time that the corresponding renewal could have occurred
   on the server and thus determine if it is possible that a lease
   period expiration could have occurred.

To be conservative, the client should start its lease timer based on the time that it issued the operation to the metadata server, rather than based on the time of the response.

It is also necessary to take propagation delay into account when requesting a renewal of the lease:

   ...the client should subtract it from lease times (e.g., if the
   client estimates the one-way propagation delay as 200 msec, then
   it can assume that the lease is already 200 msec old when it gets
   it). In addition, it will take another 200 msec to get a response
   back to the server. So the client must send a lock renewal or
   write data back to the server 400 msec before the lease would
   expire.

Thus, the client must be aware of the one-way propagation delay and should issue renewals well in advance of lease expiration. Clients, to the extent possible, should try not to issue I/Os that may extend past the lease expiration time period. However, since this is not always possible, the storage protocol must be able to protect against the effects of in-flight I/Os, as is discussed later.

7.7.2 Client Recovery

Client recovery for layouts works in much the same way as NFSv4 client recovery works for other lock/delegation state. When an NFSv4 client reboots, it will lose all information about the layouts that it previously owned. There are two methods by which the server can reclaim these resources and allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease. If the client recovery time is longer than the lease period, the client's lease will expire and the server will know that state may be released. For layouts, the server may release the state immediately upon lease expiry, or it may allow the layout to persist, awaiting possible lease revival, as long as there are no conflicting requests.

On the other hand, the client may recover in less time than it takes for the lease period to expire. In such a case, the client will contact the server through the standard SETCLIENTID protocol. The server will find that the client's id matches the id of the previous client invocation, but that the verifier is different. The server uses this as a signal to release all the state associated with the client's previous invocation.

7.7.3 Metadata Server Recovery

The server recovery case is slightly more complex.
In general, the recovery process again follows the standard NFSv4 recovery model: the client will discover that the metadata server has rebooted when it receives an unexpected STALE_STATEID or STALE_CLIENTID reply from the server; it will then proceed to try to reclaim its previous delegations during the server's recovery grace period. However, layouts are not reclaimable in the same sense as data delegations; there is no reclaim bit, and thus no guarantee of continuity between the previous and new layout. This is not necessarily required, since a layout is not required in order to perform I/O; I/O can always be performed through the metadata server.

[NOTE: there is no reclaim bit for getting a layout. Thus, in the case of reclaiming an old layout obtained through LAYOUTGET, there is no guarantee of continuity. If a reclaim bit existed, a block/volume layout type might be happier knowing it got the layout back with the assurance of continuity. However, this would require the metadata server to trust the client's description of the exact layout it had (i.e., the full block-list); divergence is avoided by having the server tell the client what is contained within the layout.]

If the client has dirty data that it needs to write out, or an outstanding LAYOUTCOMMIT, the client should try to obtain a new layout segment covering the byte range covered by the previous layout segment. However, the client might not get the same layout segment it had. The range might be different, or it might get the same range but with different layout contents. For example, if using a block/volume-based layout, the blocks provisionally assigned by the layout might be different, in which case the client will have to write the corresponding blocks again; in the interest of simplicity, the client might decide to always write them again. Alternatively, the client might be unable to obtain a new layout, and thus must write the data through the metadata server using normal NFSv4.

There is an important safety concern associated with layouts that does not come into play in the standard NFSv4 case. If a standard NFSv4 client makes use of a stale delegation while reading, the consequence could be the delivery of stale data to an application. If writing, the use of a stale delegation or a stale stateid for an open or lock would result in the rejection of the client's write with the appropriate stale stateid error.

However, the pNFS layout enables the client to access the file system storage directly---if this access is not properly managed by the NFSv4 server, the client can potentially corrupt the file system data or metadata. Thus, it is vitally important that the client discover that the metadata server has rebooted, and that the client stop using stale layouts before the metadata server gives them away to other clients. To ensure this, the client must be implemented so that layouts are never used to access the storage after the client's lease timer has expired. It is crucial that clients have precise knowledge of the lease periods of their layouts. For specific details on lease renewal and client lease timers, see Section 7.7.1.
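Building on the renewal sketch in Section 7.7.1.1, a client might guard every storage device I/O with a conservative expiry check of the following form. This is a hypothetical illustration of the rules in Section 7.7.1.2, not required behavior; the rtt_estimate_ms value is assumed to come from the client's RPC layer.

   /* Illustrative guard, reusing the lease_state sketch from
    * Section 7.7.1.1.  Returns true only if the layout's lease is
    * safely unexpired. */
   #include <stdbool.h>
   #include <stdint.h>

   static bool layout_io_permitted(const struct lease_state *ls,
                                   uint64_t rtt_estimate_ms)
   {
       /* Treat the lease as already one-way-delay old when granted,
        * and leave room to get a renewal back to the server (see the
        * 400 msec example in Section 7.7.1.2). */
       uint64_t margin = rtt_estimate_ms;   /* both one-way delays */
       uint64_t expiry = ls->last_renewal_ms + ls->lease_period_ms;

       return now_ms() + margin < expiry;
   }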
The prohibition on using stale layouts applies to all layout-related accesses, especially the flushing of dirty data to the storage devices. If the client's lease timer expires because the client could not contact the server for any reason, the client MUST immediately stop using the layout until the server can be contacted and the layout can be officially recovered or reclaimed. However, this is only part of the solution. It is also necessary to deal with the consequences of I/Os already in flight.

The issue of the effects of I/Os started before lease expiration and possibly continuing through lease expiration is the responsibility of the data storage protocol and as such is layout type specific. There are two approaches the data storage protocol can take. The protocol may adopt a global solution which prevents all I/Os from being executed after the lease expiration, and thus is safe against a client that issues I/Os after lease expiration. This is the preferred solution, and it is the solution used by NFSv4 file-based layouts (see Section 9.6); likewise, the object storage device protocol allows storage to fence clients after lease expiration. Alternatively, the storage protocol may rely on proper client operation and only deal with the effects of lingering I/Os. These solutions may impact the client layout-driver, the metadata server layout-driver, and the control protocol.

7.7.4 Storage Device Recovery

Storage device crash recovery is mostly dependent upon the layout type in use. However, there are a few general techniques a client can use if it discovers a storage device has crashed while the client holds asynchronously written, non-committed data. First and foremost, it is important to realize that the client is the only one that has the information necessary to recover asynchronously written data, since it holds the dirty data and most probably nobody else does. Second, the best solution is for the client to err on the side of caution and attempt to re-write the dirty data through another path.

Rather than hold the asynchronously written data indefinitely, the client is encouraged to make sure that the data is written by using other paths to that data. The client may write the data to the metadata server, either synchronously or asynchronously with a subsequent COMMIT. Once it does this, there is no need to wait for the original storage device. In the event that the data range to be committed is transferred to a different storage device, as indicated in a new layout, the client may write to that storage device. Once the data has been committed at that storage device, either through a synchronous write or through a commit to that storage device (e.g., through the NFSv4 COMMIT operation for the NFSv4 file layout), the client should consider the transfer of responsibility for the data to the new server as strong evidence that this is the intended and most effective method for the client to get the data written. In either case, once the write is on stable storage (through either the storage device or the metadata server), there is no need to continue attempting to commit or synchronously write the data to the original storage device, or to wait for that storage device to become available.
That storage device may never be visible to the client again.

This approach does have a "lingering write" problem, similar to regular NFSv4. Suppose a WRITE is issued to a storage device but no response is received. The client breaks the connection, tries to re-establish a new one, and receives a recall of the layout. The client issues the I/O for the dirty data through an alternative path, for example through the metadata server, and it succeeds. The client then goes on to perform additional writes that all succeed. If the original write to the storage device succeeds at some later time, data inconsistency could result. The same problem can occur in regular NFSv4: for example, if a WRITE is held in a switch for some period of time while other writes are issued and replied to, and the original WRITE finally succeeds, the same issues can occur. However, this is solved by sessions in NFSv4.x.

8. Security Considerations

The pNFS extension partitions the NFSv4 file system protocol into two parts, the control path and the data path (i.e., the storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system (see Figure 9) is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a client, including security countermeasures to defend against threats for which NFSv4 provides defenses in environments where those threats are considered significant.

In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation not to use pNFS in an environment. For example, it is currently infeasible to provide confidentiality protection for some storage device access protocols to protect against eavesdropping; in environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from client(s) to storage device(s)) and/or a decision to forego use of pNFS (e.g., and fall back to NFSv4) may be appropriate courses of action.

In full generality, where communication with storage devices is subject to the same threats as client-server communication, the protocols used for that communication need to provide security mechanisms comparable to those available via RPCSEC_GSS for NFSv4. Many situations in which pNFS is likely to be used will not be subject to the overall threat profile for which NFSv4 is required to provide countermeasures.

pNFS implementations MUST NOT remove NFSv4's access controls. The combination of clients, storage devices, and the server is responsible for ensuring that all client-to-storage-device file data access respects NFSv4 ACLs and file open modes. This entails performing both of these checks on every access, in the client, the storage device, or both. If a pNFS configuration performs these checks only in the client, the risk of a misbehaving client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration.
Such configurations SHOULD NOT be used when client-only access checks do not provide sufficient assurance that NFSv4 access control is being applied correctly.

The following subsections describe security considerations specifically applicable to each of the three major storage device protocol types supported for pNFS.

[Requiring strict equivalence to NFSv4 security mechanisms is the wrong approach. Will need to lay down a set of statements that each protocol has to make, starting with access check location/properties.]

8.1 File Layout Security

An NFSv4 file layout type is defined in Section 9; see Section 9.7 for additional security considerations and details. In summary, the NFSv4 file layout type requires that all I/O access checks MUST be performed by the storage devices, as defined by the NFSv4 specification. If another file layout type is being used, additional access checks may be required. But in all cases, the access control performed by the storage devices must be at least as strict as that specified by the NFSv4 protocol.

8.2 Object Layout Security

The object storage protocol MUST implement the security aspects described in version 1 of the T10 OSD protocol definition [5]. The remainder of this section gives an overview of the security mechanism described in that standard. The goal is to give the reader a basic understanding of the object security model. Any discrepancies between this text and the actual standard are obviously to be resolved in favor of the OSD standard.

The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate its requests to the Object Storage Device (OSD). Capabilities thereby achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to check and enforce that policy without knowing the details (e.g., user IDs and ACLs). Since capabilities are tied to layouts, and since they are used to enforce access control, the server should recall layouts and revoke capabilities when the file ACL or mode changes, in order to signal the clients.

Each capability is specific to a particular object, an operation on that object, and a byte range within the object, and has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices (OSDs) and the metadata managers. Clients do not have device keys, so they are unable to forge capabilities. The following sketch of the algorithm should help the reader understand the basic model.

LAYOUTGET returns:

   {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

The client uses CapKey to sign all the requests it issues for that object using the respective CapArgs. In other words, the CapArgs appears in the request to the storage device, and that request is signed with the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, Nonce)

The following is sent to the OSD: {CapArgs, Req, Nonce, ReqMAC}. The OSD uses the SecretKey it shares with the metadata server to compare the ReqMAC the client sent with a locally computed

   MAC<MAC<SecretKey>(CapArgs)>(Req, Nonce)

and if they match, the OSD assumes that the capability came from an authentic metadata server and allows access to the object, as permitted by the CapArgs. Therefore, if the server's LAYOUTGET reply, holding CapKey and CapArgs, is snooped by another client, it can be used to generate valid OSD requests (within the CapArgs access restriction).
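The scheme can be restated in C as follows. This is a sketch only: it assumes an HMAC-style MAC (HMAC-SHA1 via OpenSSL is used purely for illustration), and the actual MAC algorithm, field encodings, and wire format are those defined by the T10 OSD standard [5], not the ones shown here.

   /* Illustrative only: capability generation and request signing.
    * The real algorithm and encodings are defined by T10 OSD [5]. */
   #include <openssl/crypto.h>
   #include <openssl/evp.h>
   #include <openssl/hmac.h>

   #define MAC_LEN 20                      /* SHA-1 output size */

   /* Metadata server: CapKey = MAC<SecretKey>(CapArgs); CapKey and
    * CapArgs are returned to the client by LAYOUTGET. */
   void make_capkey(const unsigned char *secret_key, int secret_len,
                    const unsigned char *capargs, size_t capargs_len,
                    unsigned char capkey[MAC_LEN])
   {
       HMAC(EVP_sha1(), secret_key, secret_len,
            capargs, capargs_len, capkey, NULL);
   }

   /* Client: ReqMAC = MAC<CapKey>(Req, Nonce); {CapArgs, Req, Nonce,
    * ReqMAC} is sent to the OSD.  Req and Nonce are assumed to be
    * concatenated into one buffer by the caller. */
   void sign_request(const unsigned char capkey[MAC_LEN],
                     const unsigned char *req_and_nonce, size_t len,
                     unsigned char reqmac[MAC_LEN])
   {
       HMAC(EVP_sha1(), capkey, MAC_LEN, req_and_nonce, len, reqmac, NULL);
   }

   /* OSD: recompute CapKey from the shared SecretKey and the CapArgs
    * presented in the request, recompute the MAC over (Req, Nonce),
    * and compare it with the ReqMAC the client sent. */
   int verify_request(const unsigned char *secret_key, int secret_len,
                      const unsigned char *capargs, size_t capargs_len,
                      const unsigned char *req_and_nonce, size_t len,
                      const unsigned char reqmac[MAC_LEN])
   {
       unsigned char capkey[MAC_LEN], expected[MAC_LEN];

       make_capkey(secret_key, secret_len, capargs, capargs_len, capkey);
       HMAC(EVP_sha1(), capkey, MAC_LEN, req_and_nonce, len, expected, NULL);
       return CRYPTO_memcmp(expected, reqmac, MAC_LEN) == 0;
   }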
To provide the privacy required for the capabilities returned by LAYOUTGET, the GSS-API can be used, e.g., by using a session key known to the file server and to the client to encrypt the whole layout or parts of it. Two general ways to provide privacy in the absence of GSS-API that are independent of NFSv4 are an isolated network, such as a VLAN, or a secure channel provided by IPsec.

8.3 Block/Volume Layout Security

As typically used, block/volume protocols rely on clients to enforce file access checks, since the storage devices are generally unaware of the files they are storing and, in particular, are unaware of which blocks belong to which file. In such environments, the physical addresses of blocks are exported to pNFS clients via layouts. An alternative method of block/volume protocol use is for the storage devices to export virtualized block addresses, which do reflect the files to which blocks belong. These virtual block addresses are exported to pNFS clients via layouts. This allows the storage device to make appropriate access checks while mapping virtual block addresses to physical block addresses.

In environments where access control is important and client-only access checks provide insufficient assurance of access control enforcement (e.g., there is concern about a malicious or malfunctioning client skipping the access checks), and where physical block addresses are exported to clients, the storage devices will generally be unable to compensate for these client deficiencies.

In such threat environments, block/volume protocols SHOULD NOT be used with pNFS, unless the storage device is able to implement the appropriate access checks, via use of virtualized block addresses or other means. NFSv4 without pNFS, or pNFS with a different type of storage protocol, would be a more suitable means to access files in such environments. Storage-device/protocol-specific methods (e.g., LUN masking/mapping) may be available to prevent malicious or high-risk clients from directly accessing storage devices.

9. The NFSv4 File Layout Type

This section describes the semantics and format of NFSv4 file-based layouts.

9.1 File Striping and Data Access

The file layout type describes a method for striping data across multiple devices. The data for each stripe unit is stored within an NFSv4 file located on a particular storage device.
The structures used to describe the stripe layout are as follows:

   enum stripetype4 {
           STRIPE_SPARSE = 1,
           STRIPE_DENSE  = 2
   };

   struct nfsv4_file_layouthint {
           stripetype4     stripe_type;
           length4         stripe_unit;
           uint32_t        stripe_width;
   };

   struct nfsv4_file_layout {             /* Per data stripe */
           pnfs_deviceid4  dev_id<>;
           nfs_fh4         fh;
   };

   struct nfsv4_file_layouttype4 {        /* Per file */
           stripetype4        stripe_type;
           length4            stripe_unit;
           length4            file_size;
           nfsv4_file_layout  dev_list<>;
   };

The file layout specifies an ordered array of <dev_id, fh> tuples, as well as the stripe size, the type of stripe layout (discussed a little later), and the file's current size as of LAYOUTGET time. The filehandle, "fh", identifies the file on the storage device identified by "dev_id" that holds a particular stripe of the file. The "dev_id" array can be used for multipathing and is discussed further in Section 9.1.3. The stripe width is determined by the stripe unit size multiplied by the number of devices in the dev_list. The stripe held by <dev_id, fh> is determined by that tuple's position within the device list, "dev_list". For example, consider a dev_list consisting of the following pairs:

   <(1,0x12), (2,0x13), (1,0x15)> and stripe_unit = 32KB

The stripe width is 32KB * 3 devices = 96KB. The first entry specifies that the data file with filehandle 0x12 on device 1 holds the first 32KB of data (and every 32KB stripe beginning where the file's offset % 96KB == 0).

Devices may be repeated multiple times within the device list array; this is shown above, where storage device 1 holds both the first and third stripes of data. Filehandles can only be repeated if a sparse stripe type is used. Data is striped across the devices in the order listed in the device list array, in increments of the stripe size. A data file stored on a storage device MUST map to a single file as defined by the metadata server; i.e., data from two files as viewed by the metadata server MUST NOT be stored within the same data file on any storage device.

The "stripe_type" field specifies how the data is laid out within the data file on a storage device. It allows for two different data layouts: sparse, and dense or packed. The stripe type determines the calculation that must be made to map the client-visible file offset to the offset within the data file located on the storage device.

The layout hint structure is described in more detail in Section 10.7. It is used by the client, via the FILE_LAYOUT_HINT attribute, to specify the type of layout to be used for a newly created file.

9.1.1 Sparse and Dense Storage Device Data Layouts

The stripe_type field allows for two storage device data file representations.
Example sparse and dense storage device data layouts are illustrated below:

   Sparse file-layout (stripe_unit = 4KB)
   ------------------

This is represented by the following file layout on the storage devices:

   Offset  ID:0      ID:1      ID:2
   0       +--+      +--+      +--+      +--+  indicates a
           |//|      |  |      |  |      |//|  stripe that
   4KB     +--+      +--+      +--+      +--+  contains data
           |  |      |//|      |  |
   8KB     +--+      +--+      +--+
           |  |      |  |      |//|
   12KB    +--+      +--+      +--+
           |//|      |  |      |  |
   16KB    +--+      +--+      +--+
           |  |      |//|      |  |
           +--+      +--+      +--+

The sparse file-layout has holes for the byte ranges not exported by that storage device. This allows clients to access data using the real offset into the file, regardless of the storage device's position within the stripe. However, if a client writes to one of the holes (e.g., offset 4-12KB on device 1), then an error MUST be returned by the storage device. This requires that the storage device have knowledge of the layout for each file.

When using a sparse layout, the offset into the storage device data file is the same as the offset into the main file.

   Dense/packed file-layout (stripe_unit = 4KB)
   ------------------------

This is represented by the following file layout on the storage devices:

   Offset  ID:0      ID:1      ID:2
   0       +--+      +--+      +--+
           |//|      |//|      |//|
   4KB     +--+      +--+      +--+
           |//|      |//|      |//|
   8KB     +--+      +--+      +--+
           |//|      |//|      |//|
   12KB    +--+      +--+      +--+
           |//|      |//|      |//|
   16KB    +--+      +--+      +--+
           |//|      |//|      |//|
           +--+      +--+      +--+

The dense or packed file-layout does not leave holes on the storage devices. Each stripe unit is spread across the storage devices. As such, the storage devices need not know the file's layout, since the client is allowed to write to any offset.

The calculation to determine the byte offset within the data file for dense storage device layouts is:

   stripe_width = stripe_unit * N; where N = |dev_list|
   dev_offset = floor(file_offset / stripe_width) * stripe_unit +
                file_offset % stripe_unit

Regardless of the storage device data file layout, the calculation to determine the index into the device array is the same:

   dev_idx = floor(file_offset / stripe_unit) mod N

Section 9.5 describes the semantics for dealing with reads to holes within the striped file. This is of particular concern, since each individual component stripe file (i.e., the component of the striped file that lives on a particular storage device) may be of a different length. Thus, clients may experience 'short' reads when reading off the end of one of these component files.
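The calculations above can be written out directly; the following is a minimal C sketch, with the geometry of the example in Section 9.1 (three devices, stripe_unit = 32KB) noted in the comments.

   #include <stdint.h>

   /* Index into the per-file device list for a given file offset.
    * With stripe_unit = 32KB and N = 3, offset 0 maps to index 0
    * (device 1, fh 0x12), offset 32KB to index 1, offset 64KB to
    * index 2, and offset 96KB wraps back to index 0. */
   uint32_t dev_index(uint64_t file_offset, uint64_t stripe_unit,
                      uint32_t n)
   {
       return (uint32_t)((file_offset / stripe_unit) % n);
   }

   /* Offset within the data file on the storage device. */
   uint64_t dev_offset(uint64_t file_offset, uint64_t stripe_unit,
                       uint32_t n, int sparse)
   {
       if (sparse)          /* STRIPE_SPARSE: same as the file offset */
           return file_offset;

       /* STRIPE_DENSE: dev_offset = floor(file_offset / stripe_width)
        * * stripe_unit + file_offset % stripe_unit */
       uint64_t stripe_width = stripe_unit * n;
       return (file_offset / stripe_width) * stripe_unit
            + file_offset % stripe_unit;
   }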
9.1.2 Metadata and Storage Device Roles

In many cases, the metadata server and the storage device will be separate pieces of physical hardware. The specification text is written as if that were always the case. However, the same physical hardware may be used to implement both a metadata server and a storage device. In that case, the specification text's references to these two entities are to be understood as referring to the same physical hardware implementing two distinct roles, and it is important that it be clearly understood on behalf of which role the hardware is executing at any given time.

Two sub-cases can be distinguished. In the first sub-case, the same physical hardware is used to implement both a metadata server and a data server, with each role addressed through a distinct network interface (e.g., the IP addresses for the metadata server and storage device are distinct). As long as the storage device address is obtained from the layout (using the device ID therein to obtain the appropriate storage device address) and is distinct from the metadata server's address, it is always clear, for any given request, to which role the request is directed, based on the destination IP address.

However, it may also be the case that even though the metadata server and storage device are distinct from one client's point of view, the roles may be reversed according to another client's point of view. For example, in the cluster file system model, a metadata server to one client may be a storage device to another client. Thus, it is safer to always mark the filehandle so that operations addressed to storage devices can be distinguished.

The second sub-case is where the metadata server and the storage device have the same network address. This requires that the distinction as to which role each request is directed be made on another basis. Since the network address is the same, the request is understood as being directed at one or the other based on the first current filehandle value of the request. If the first current filehandle is one derived from a layout (i.e., it is specified within the layout), and it is recommended that these be distinguishable, then the request is to be considered as executed by a storage device. Otherwise, the operation is to be understood as executed by the metadata server.

If a current filehandle is set that is inconsistent with the role to which the request is directed, then the error NFS4ERR_BADHANDLE should result. For example, if a request is directed at the storage device because the first current filehandle is from a layout, any attempt to set the current filehandle to a value not from a layout should be rejected. Similarly, if the first current filehandle was a value not from a layout, a subsequent attempt to set the current filehandle to a value obtained from a layout should be rejected.

9.1.3 Device Multipathing

The NFSv4 file layout supports multipathing to 'equivalent' devices. Device-level multipathing is primarily of use in the case of a data server failure: it allows the client to switch to another storage device that is exporting the same data stripe, without having to contact the metadata server for a new layout.

To support device multipathing, an array of device IDs is encoded within the data stripe portion of the file's layout. This array represents an ordered list of devices where the first element has the highest priority. Each device in the list MUST be 'equivalent' to every other device in the list, and each device must be attempted in the order specified.

Equivalent devices MUST export the same system image (e.g., the stateids and filehandles that they use are the same) and must provide the same consistency guarantees. Two equivalent storage devices must also have sufficient connections to the storage, such that writing to one storage device is equivalent to writing to another; this also applies to reading.
Also, if multiple copies of the same data exist, reading from one must provide access to all existing copies. As such, it is unlikely that multipathing will provide additional benefit in the case of an I/O error.

[NOTE: the error cases in which a client is expected to attempt an equivalent storage device should be specified.]

9.1.4 Operations Issued to Storage Devices

Clients MUST use the filehandle described within the layout when accessing data on the storage devices. When using the layout's filehandle, the client MUST only issue READ, WRITE, PUTFH, COMMIT, and NULL operations to the storage device associated with that filehandle. If a client issues an operation other than those specified above, using the filehandle and storage device listed in the client's layout, that storage device SHOULD return an error to the client. The client MUST follow the instructions implied by the layout (i.e., which filehandles to use on which devices). As described in Section 7.2, a client MUST NOT issue I/Os to storage devices for which it does not hold a valid layout. The storage devices may reject such requests.

GETATTR and SETATTR MUST be directed to the metadata server. In the case of a SETATTR of the size attribute, the control protocol is responsible for propagating size updates/truncations to the storage devices. In the case of extending WRITEs to the storage devices, the new size must be visible on the metadata server once a LAYOUTCOMMIT has completed (see Section 7.4.2). Section 9.5 describes the mechanism by which the client is to handle storage device files that do not reflect the metadata server's size.

9.2 Global Stateid Requirements

Note that there are no stateids embedded within the layout. The client MUST use the stateid representing open or lock state as returned by an earlier metadata operation (e.g., OPEN, LOCK), or a special stateid, to perform I/O on the storage devices, as in regular NFSv4. Special stateid usage for I/O is subject to the NFSv4 protocol specification. The stateid used for I/O MUST have the same effect and be subject to the same validation on a storage device as it would if the I/O were being performed on the metadata server itself in the absence of pNFS. This has the implication that stateids are globally valid on both the metadata and storage devices. This requires the metadata server to propagate changes in lock and open state to the storage devices, so that the storage devices can validate I/O accesses. This is discussed further in Section 9.4. Depending on when stateids are propagated, the existence of a valid stateid on the storage device may act as proof of a valid layout.

[NOTE: a number of proposals have been made that have the possibility of limiting the amount of validation performed by the storage device; if any of these proposals are accepted or obtain consensus, the global stateid requirement can be revisited.]

9.3 The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so.
As such, the client should set the iomode based on its intent to read or write the data. The client may default to an iomode of READ/WRITE (LAYOUTIOMODE_RW). The iomode need not be checked by the storage devices when clients perform I/O. However, the storage devices SHOULD still validate that the client holds a valid layout and return an error if the client does not.

9.4 Storage Device State Propagation

Since the metadata server, which handles lock and open-mode state changes as well as ACLs, may not be collocated with the storage devices where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the storage devices. Once the propagation to the storage devices is complete, the full effect of those changes must be in effect at the storage devices. However, some state changes need not be propagated immediately, although all changes SHOULD be propagated promptly. These state propagations have an impact on the design of the control protocol, even though the control protocol is outside the scope of this specification. Immediate propagation refers to the synchronous propagation of state from the metadata server to the storage device(s); the propagation must be complete before returning to the client.

9.4.1 Lock State Propagation

Mandatory locks MUST be made effective at the storage devices before the request that establishes them returns to the caller. Thus, mandatory lock state MUST be synchronously propagated to the storage devices. On the other hand, since advisory lock state is not used for checking I/O accesses at the storage devices, there is no semantic reason for propagating advisory lock state to them. However, since every lock, unlock, open downgrade, and upgrade affects the sequence ID stored within the stateid, the stateid changes, which may cause difficulty if this state is not propagated. Thus, when a client uses a stateid on a storage device for I/O with a newer sequence number than the one the storage device has, the storage device should query the metadata server and get any pending updates to that stateid. This allows stateid sequence number changes to be propagated lazily, on demand.

[NOTE: With the reliance on the sessions protocol, there is no real need for the sequence ID portion of the stateid to be validated on I/O accesses. It is proposed that the sequence ID checking be obsoleted.]

Since updates to advisory locks neither confer nor remove privileges, these changes need not be propagated immediately, and may not need to be propagated promptly. The updates to advisory locks need only be propagated when the storage device needs to resolve a question about a stateid. In fact, if byte-range locking is not mandatory (i.e., is advisory), clients are advised not to use lock-based stateids for I/O at all. The stateids returned by open are sufficient and eliminate overhead for this kind of state propagation.

9.4.2 Open-mode Validation

Open-mode validation MUST be performed against the open mode(s) held by the storage devices. However, the server implementation may not always require the immediate propagation of changes.
Reductions in access because of CLOSEs or DOWNGRADEs do not have to be propagated immediately, but SHOULD be propagated promptly; whereas changes due to revocation MUST be propagated immediately. On the other hand, changes that expand access (e.g., new OPENs and upgrades) do not have to be propagated immediately, but the storage device SHOULD NOT reject a request because of mode issues without making sure that an upgrade is not in flight.

9.4.3 File Attributes

Since the SETATTR operation has the ability to modify state that is visible on both the metadata and storage devices (e.g., the size), care must be taken to ensure that the resultant state across the set of storage devices is consistent, especially when truncating or growing the file.

As described earlier, the LAYOUTCOMMIT operation is used to ensure that the metadata is synced with changes made to the storage devices. For the file-based protocol, it is necessary to re-sync state such as the size attribute and the setting of mtime/atime. See Section 7.4 for a full description of the semantics regarding LAYOUTCOMMIT and attribute synchronization. It should be noted that, by using a file-based layout type, it is possible to synchronize this state before LAYOUTCOMMIT occurs. For example, the control protocol can be used to query the attributes present on the storage devices.

Any changes to file attributes that control authorization or access, as reflected by ACCESS calls or READs and WRITEs on the metadata server, MUST be propagated to the storage devices for enforcement on READ and WRITE I/O calls. If the changes made on the metadata server result in more restrictive access permissions for any user, those changes MUST be propagated to the storage devices synchronously.

Recall that the NFSv4 protocol [2] specifies that:

   ...since the NFS version 4 protocol does not impose any
   requirement that READs and WRITEs issued for an open file have
   the same credentials as the OPEN itself, the server still must do
   appropriate access checking on the READs and WRITEs themselves.

This also includes changes to ACLs. The propagation of access right changes due to changes in ACLs may be asynchronous only if the server implementation is able to determine that the updated ACL is not more restrictive for any user specified in the old ACL. Due to the relative infrequency of ACL updates, it is suggested that all changes be propagated synchronously.

[NOTE: it has been suggested that the NFSv4 specification is in error with regard to allowing principals other than those used for OPEN to be used for file I/O. If changes within a minor version alter the behavior of NFSv4 with regard to OPEN principals and stateids, some access control checking at the storage device can be made less expensive. pNFS should be altered to take full advantage of these changes.]

9.5 Storage Device Component File Size

A potential problem exists when a component data file on a particular storage device is grown past EOF; the problem exists for both dense and sparse layouts. Imagine the following scenario: a client creates a new file (size == 0) and writes to byte 128KB; the client then seeks to the beginning of the file and reads byte 100. The client should receive zeros back as a result of the read.
However, if the read falls on a different storage device than the one the client originally wrote to, the storage device servicing the READ may still believe that the file's size is 0 and return no data with the EOF flag set. The storage device can only return zeros if it knows that the file's size has been extended. This would require the immediate propagation of the file's size to all storage devices, which is potentially very costly. Instead, another approach is outlined below.

First, the file's size is returned within the layout by LAYOUTGET. This size must reflect the latest size at the metadata server as set by the most recent of either the last LAYOUTCOMMIT or SETATTR; however, it may be more recent. Second, if a client performs a read that is returned short (i.e., the read is fully within the file's size, but the storage device indicates EOF and returns partial or no data), the client must assume that it is a hole and substitute zeros for the data not read, up until its known local file size. If a client extends the file, it must update its local file size. Third, if the metadata server receives a SETATTR of the size, or a LAYOUTCOMMIT that alters the file's size, the metadata server must send CB_SIZECHANGED messages with the new size to clients holding layouts (it need not send a notification to the client that performed the operation that resulted in the size changing). Upon receipt of the CB_SIZECHANGED notification, clients must update their local size for that file. As well, if a new file size is returned as a result of LAYOUTCOMMIT, the client must update its local file size.

9.6 Crash Recovery Considerations

As described in Section 7.7, the layout type specific storage protocol is responsible for handling the effects of I/Os started before lease expiration and extending through lease expiration. The NFSv4 file layout type prevents all I/Os from being executed after lease expiration, without relying on a precise client lease timer and without requiring storage devices to maintain lease timers.

It works as follows. In the presence of sessions, each compound begins with a SEQUENCE operation that contains the "clientID". On the storage device, the clientID can be used to validate that the client has a valid layout for the I/O being performed; if it does not, the I/O is rejected. Before the metadata server takes any action to invalidate a layout given out by a previous instance, it must make sure that all layouts from that previous instance are invalidated at the storage devices. Note: it is sufficient to invalidate the stateids associated with the layout only if special stateids are not being used for I/O at the storage devices; otherwise, the layout itself must be invalidated.

This means that a metadata server may not restripe a file until it has contacted all of the storage devices to invalidate the layouts from the previous instance, nor may it give out locks that conflict with locks embodied by the stateids associated with any layout from the previous instance, without either doing a specific invalidation (as it would have to do anyway) or doing a global storage device invalidation.

9.7 Security Considerations

The NFSv4 file layout type MUST adhere to the security considerations outlined in Section 8.
More specifically, storage devices must make all of the required access checks on each READ or WRITE I/O, as determined by the NFSv4 protocol [2]. This impacts the control protocol and the propagation of state from the metadata server to the storage devices; see Section 9.4 for more details.

9.8 Alternate Approaches

Two alternate approaches exist for file-based layouts and the method used by clients to obtain stateids used for I/O. Both approaches embed stateids within the layout.

However, before examining these approaches it is important to understand the distinction between clients and owners. Delegations belong to clients, while locks (e.g., record and share reservations) are held by owners, which in turn belong to a specific client. As such, delegations can only protect against inter-client conflicts, not intra-client conflicts. Layouts are held by clients and SHOULD NOT be associated with state held by owners. Therefore, if stateids used for data access are embedded within a layout, these stateids can only act as delegation stateids, protecting against inter-client conflicts; stateids pertaining to an owner cannot be embedded within the layout. This has the implication that the client MUST arbitrate among all intra-client conflicts (e.g., arbitrating among lock requests by different processes) before issuing pNFS operations. Using the stateids stored within the layout, storage devices can only arbitrate between clients (not owners).

The first alternate approach is to do away with global stateids (stateids returned by OPEN/LOCK that are valid on the metadata server and storage devices) and use only stateids embedded within the layout. This approach has the drawback that the stateids used for I/O access cannot be validated against per-owner state, since they are only associated with the client holding the layout. It breaks the semantics of tying a stateid used for I/O to an open instance. This has the implication that clients must handle per-owner lock and open requests internally, rather than push the work onto the storage devices. The storage devices can still arbitrate and enforce inter-client lock and open state.

The second approach is a hybrid approach. This approach allows for stateids to be embedded within the layout, but also allows for the possibility of global stateids. If the stateid embedded within the layout is a special stateid of all zeros, then the stateid referring to the last successful OPEN/LOCK should be used. This approach is recommended if it is decided that using NFSv4 as a control protocol is required.

This proposal suggests the global stateid approach, due to the cleaner semantics it provides regarding the relationship between stateids used for I/O and their corresponding open instance or lock state. However, it does have a profound impact on the control protocol's implementation and the state propagation that is required (as described in Section 9.4).

10. pNFS Typed Data Structures

10.1 pnfs_layouttype4

   enum pnfs_layouttype4 {
           LAYOUT_NFSV4_FILES  = 1,
           LAYOUT_OSD2_OBJECTS = 2,
           LAYOUT_BLOCK_VOLUME = 3
   };

A layout type specifies the layout being used. The implication is that clients have "layout drivers" that support one or more layout types.
The file server advertises the layout types it supports through the LAYOUT_TYPES file system attribute. A client asks for layouts of a particular type in LAYOUTGET and passes those layouts to its layout driver. The set of well-known layout types must be defined. As well, a private range of layout types is to be defined by this document; this would allow custom installations to introduce new layout types.

[OPEN ISSUE: Determine private range of layout types]

New layout types must be specified in RFCs approved by the IESG before becoming part of the pNFS specification.

The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration specifies that the object layout, as defined in [7], is to be used. Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume layout, as defined in [6], is to be used.

10.2 pnfs_deviceid4

   typedef uint32_t pnfs_deviceid4;    /* 32-bit device ID */

Layout information includes device IDs that specify a storage device through a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation. A client must not assume that device IDs are valid across metadata server reboots. The device ID is qualified by the layout type and is unique per file system (FSID). This allows different layout drivers to generate device IDs without the need for coordination. See Section 7.1.4 for more details.

10.3 pnfs_deviceaddr4

   struct pnfs_netaddr4 {
           string   r_netid<>;   /* network ID */
           string   r_addr<>;    /* universal address */
   };

   struct pnfs_deviceaddr4 {
           pnfs_layouttype4   type;
           opaque             device_addr<>;
   };

The device address is used to set up a communication channel with the storage device. Different layout types will require different types of structures to define how they communicate with storage devices. The opaque device_addr field must be interpreted based on the specified layout type.

Currently, the only defined device address is that for the NFSv4 file layout (struct pnfs_netaddr4), which identifies a storage device by network IP address and port number. This is sufficient for the clients to communicate with the NFSv4 storage devices, and may also be sufficient for object-based storage drivers to communicate with OSDs. The other device address we expect to support is a SCSI volume identifier. The final protocol specification will detail the allowed device address types and the format of their associated location information.

[NOTE: other device addresses will be added as the respective specifications mature. It has been suggested that a separate device_type enumeration be used as a switch to the pnfs_deviceaddr4 structure (e.g., if multiple types of addresses exist for the same layout type). Until such a time as a real case is made and the respective layout types have matured, the device address structure will be left as is.]

10.4 pnfs_devlist_item4

   struct pnfs_devlist_item4 {
           pnfs_deviceid4     id;
           pnfs_deviceaddr4   addr;
   };

An array of these values is returned by the GETDEVICELIST operation. They define the set of devices associated with a file system.
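As an illustration of how a client might organize this information, the following hypothetical cache keys each device by FSID, layout type, and device ID, reflecting the uniqueness rules stated in Section 10.2; none of these structures are part of the protocol.

   /* Hypothetical client-side device cache.  Keys follow the text:
    * a device ID is unique per FSID and qualified by layout type,
    * and is not guaranteed valid across metadata server reboots. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t pnfs_deviceid4;

   struct device_entry {
       uint64_t        fsid_major, fsid_minor;  /* owning file system */
       uint32_t        layout_type;             /* pnfs_layouttype4 */
       pnfs_deviceid4  dev_id;
       void           *addr;   /* decoded pnfs_deviceaddr4 (e.g.,
                                  pnfs_netaddr4 for file layouts) */
   };

   /* Linear lookup over entries from GETDEVICELIST/GETDEVICEINFO. */
   struct device_entry *find_device(struct device_entry *tbl, size_t n,
                                    uint64_t fsid_major,
                                    uint64_t fsid_minor,
                                    uint32_t layout_type,
                                    pnfs_deviceid4 id)
   {
       for (size_t i = 0; i < n; i++) {
           if (tbl[i].fsid_major == fsid_major &&
               tbl[i].fsid_minor == fsid_minor &&
               tbl[i].layout_type == layout_type &&
               tbl[i].dev_id == id)
               return &tbl[i];
       }
       return NULL;   /* unknown: fetch via GETDEVICEINFO; a client
                         would flush the cache on server reboot */
   }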
10.5 pnfs_layout4

   struct pnfs_layout4 {
           offset4              offset;
           length4              length;
           pnfs_layoutiomode4   iomode;
           pnfs_layouttype4     type;
           opaque               layout<>;
   };

The pnfs_layout4 structure defines a layout for a file. The layout type specific data is opaque within this structure and must be interpreted based on the layout type. Currently, only the NFSv4 file layout type is defined; see Section 9.1 for its definition. Since layouts are sub-dividable, the offset and length, together with the file's filehandle, the clientid, the iomode, and the layout type, identify the layout.

[OPEN ISSUE: there is a discussion of moving the striping information, or more generally the "aggregation scheme", up to the generic layout level. This creates a two-layer system where the top level is a switch on different data placement layouts, and the next level down is a switch on different data storage types. This lets different layouts (e.g., striping, mirroring, or redundant servers) be layered over different storage devices. This would move geometry information out of nfsv4_file_layouttype4 and up into a generic pnfs_striped_layout type that would specify a set of pnfs_deviceid4 and pnfs_devicetype4 to use for storage. Instead of nfsv4_file_layouttype4, there would be pnfs_nfsv4_devicetype4.]

10.6 pnfs_layoutupdate4

   struct pnfs_layoutupdate4 {
           pnfs_layouttype4   type;
           opaque             layoutupdate_data<>;
   };

The pnfs_layoutupdate4 structure is used by the client to return 'updated' layout information to the metadata server at LAYOUTCOMMIT time. This structure provides a channel to pass layout type specific information back to the metadata server. For example, for block/volume layout types this could include the list of reserved blocks that were written. The contents of the opaque layoutupdate_data argument are determined by the layout type and are defined in their context. The NFSv4 file-based layout does not use this structure, so the layoutupdate_data field should have a zero length.

10.7 pnfs_layouthint4

   struct pnfs_layouthint4 {
           pnfs_layouttype4   type;
           opaque             layouthint_data<>;
   };

The pnfs_layouthint4 structure is used by the client to pass in a hint about the type of layout it would like created for a particular file. It is the structure specified by the FILE_LAYOUT_HINT attribute described below. The metadata server may ignore the hint, or may selectively ignore fields within the hint. This hint should be provided at create time, as part of the initial attributes within OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" structure as defined in Section 9.1.

10.8 pnfs_layoutiomode4

   enum pnfs_layoutiomode4 {
           LAYOUTIOMODE_READ = 1,
           LAYOUTIOMODE_RW   = 2,
           LAYOUTIOMODE_ANY  = 3
   };

The iomode specifies whether the client intends to read or write (with the possibility of reading) the data represented by the layout. The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies that layouts pertaining to both READ and RW iomodes are being returned or recalled, respectively. The metadata server's use of the iomode may depend on the layout type being used. The storage devices may validate I/O accesses against the iomode and reject invalid accesses.
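The iomode rules above can be captured in a small validity check; the following is a sketch, with the layout_op enumeration invented here purely for illustration.

   #include <stdbool.h>

   enum pnfs_layoutiomode4 {
       LAYOUTIOMODE_READ = 1,
       LAYOUTIOMODE_RW   = 2,
       LAYOUTIOMODE_ANY  = 3
   };

   /* Hypothetical tag for the operation being validated. */
   enum layout_op { OP_LAYOUTGET, OP_LAYOUTRETURN, OP_LAYOUTRECALL };

   /* ANY is only meaningful when returning or recalling layouts: it
    * selects layouts of both READ and RW iomodes.  It MUST NOT be
    * used when requesting a layout with LAYOUTGET. */
   bool iomode_valid(enum layout_op op, enum pnfs_layoutiomode4 iomode)
   {
       if (iomode == LAYOUTIOMODE_ANY)
           return op == OP_LAYOUTRETURN || op == OP_LAYOUTRECALL;
       return iomode == LAYOUTIOMODE_READ || iomode == LAYOUTIOMODE_RW;
   }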
11. pNFS File Attributes

11.1 pnfs_layouttype4<> FS_LAYOUT_TYPES

This attribute applies to a file system and indicates what layout types are supported by the file system. We expect this attribute to be queried when a client encounters a new fsid. This attribute is used by the client to determine if it has applicable layout drivers.

11.2 pnfs_layouttype4<> FILE_LAYOUT_TYPES

This attribute indicates the particular layout type(s) used for a file. This is for informational purposes only. The client needs to use the LAYOUTGET operation in order to get enough information (e.g., specific device information) to perform I/O.

11.3 pnfs_layouthint4 FILE_LAYOUT_HINT

This attribute may be set on newly created files to influence the metadata server's choice for the file's layout. It is suggested that this attribute be set as one of the initial attributes within the OPEN call. The metadata server may ignore this attribute. This attribute is a subset of the layout structure returned by LAYOUTGET. For example, instead of specifying particular devices, this would be used to suggest the stripe width of a file. It is up to the server implementation to determine which fields within the layout it uses.

[OPEN ISSUE: it has been suggested that the HINT be a well-defined type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]

11.4 uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE

This attribute is a file system wide attribute and indicates the preferred block size for direct storage device access.

11.5 uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT

This attribute is a file system wide attribute and indicates the preferred alignment for direct storage device access.

12. pNFS Error Definitions

NFS4ERR_BADLAYOUT           Layout specified is invalid.

NFS4ERR_BADIOMODE           Layout iomode is invalid.

NFS4ERR_LAYOUTUNAVAILABLE   Layouts are not available for the file or
                            its containing file system.

NFS4ERR_LAYOUTTRYLATER      Layouts are temporarily unavailable for
                            the file; the client should retry later.

NFS4ERR_NOMATCHING_LAYOUT   The client has no matching layout
                            (segment) to return.

NFS4ERR_RECALLCONFLICT      The layout is unavailable due to a
                            conflicting LAYOUTRECALL that is in
                            progress.

NFS4ERR_UNKNOWN_LAYOUTTYPE  Layout type is unknown.

13. Layouts and Aggregation

This section describes several aggregation schemes in a semi-formal way to provide context for layout formats. These definitions will be formalized in other protocols. However, the set of understood types is part of this protocol in order to provide for basic interoperability.

The layout descriptions include (deviceID, objectID) tuples that identify some storage object on some storage device. The addressing information associated with the deviceID is obtained with GETDEVICEINFO. The interpretation of the objectID depends on the storage protocol. The objectID could be a filehandle for an NFSv4 storage device. It could be an OSD object ID for an object server. The layout for a block device generally includes additional block map information to enumerate blocks or extents that are part of the layout.

13.1 Simple Map

The data is located on a single storage device. In this case the file server can act as the front end for several storage devices and distribute files among them.
Each file is limited in its size and performance characteristics by a single storage device. The simple map consists of (deviceID, objectID).

13.2 Block Extent Map

The data is located on a LUN in the SAN. The layout consists of an array of (deviceID, blockID, offset, length) tuples. Each entry describes a block extent.

13.3 Striped Map (RAID 0)

The data is striped across storage devices. The parameters of the stripe include the number of storage devices (N) and the size of each stripe unit (U). A full stripe of data is N * U bytes. The stripe map consists of an ordered list of (deviceID, objectID) tuples and the parameter value for U. The first stripe unit (the first U bytes) is stored on the first (deviceID, objectID), the second stripe unit on the second (deviceID, objectID), and so forth until the first complete stripe. The data layout then wraps around, so that byte (N*U) of the file is stored on the first (deviceID, objectID) in the list, but starting at offset U within that object. The striped layout allows a client to read or write to the component objects in parallel to achieve high bandwidth.

The striped map for a block device would be slightly different. The map is an ordered list of (deviceID, blockID, blocksize), where the deviceID is rotated among a set of devices to achieve striping.

13.4 Replicated Map

The file data is replicated on N storage devices. The map consists of N (deviceID, objectID) tuples. When data is written using this map, it should be written to N objects in parallel. When data is read, any component object can be used.

This map type is controversial because it highlights the issues with error recovery. Those issues get interesting with any scheme that employs redundancy. The handling of errors (e.g., only a subset of replicas gets updated) is outside the scope of this protocol extension. Instead, it is a function of the storage protocol and the metadata control protocol.

13.5 Concatenated Map

The map consists of an ordered set of N (deviceID, objectID, size) tuples. Each successive tuple describes the next segment of the file.

13.6 Nested Map

The nested map is used to compose more complex maps out of simpler ones. The map format is an ordered set of M sub-maps; each sub-map applies to a byte range within the file and has its own type, such as the ones introduced above. Any level of nesting is allowed in order to build up complex aggregation schemes.

14. NFSv4.1 Operations

14.1 LOOKUPP - Lookup Parent Directory

If the NFSv4 minor version is 1, then the following replaces Section 14.2.14 of the NFSv4.0 specification. The LOOKUPP operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

   (cfh) -> (cfh)

ARGUMENT

   /* CURRENT_FH: object */
   void;

RESULT

   struct LOOKUPP4res {
           /* CURRENT_FH: directory */
           nfsstat4        status;
   };

DESCRIPTION

   The current filehandle is assumed to refer to a regular directory
   or a named attribute directory. LOOKUPP assigns the filehandle
   for its parent directory to be the current filehandle. If there
   is no parent directory, an NFS4ERR_NOENT error must be returned.
13.4 Replicated Map

The file data is replicated on N storage devices. The map consists of N (deviceID, objectID) tuples. When data is written using this map, it should be written to N objects in parallel. When data is read, any component object can be used.

This map type is controversial because it highlights the issues with error recovery. Those issues get interesting with any scheme that employs redundancy. The handling of errors (e.g., only a subset of replicas get updated) is outside the scope of this protocol extension. Instead, it is a function of the storage protocol and the metadata control protocol.

13.5 Concatenated Map

The map consists of an ordered set of N (deviceID, objectID, size) tuples. Each successive tuple describes the next segment of the file.

13.6 Nested Map

The nested map is used to compose more complex maps out of simpler ones. The map format is an ordered set of M sub-maps; each sub-map applies to a byte range within the file and has its own type, such as the ones introduced above. Any level of nesting is allowed in order to build up complex aggregation schemes.

14. NFSv4.1 Operations

14.1 LOOKUPP - Lookup Parent Directory

If the NFSv4 minor version is 1, then the following replaces section 14.2.14 of the NFSv4.0 specification. The LOOKUPP operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

(cfh) -> (cfh)

ARGUMENT

/* CURRENT_FH: object */
void;

RESULT

struct LOOKUPP4res {
        /* CURRENT_FH: directory */
        nfsstat4        status;
};

DESCRIPTION

The current filehandle is assumed to refer to a regular directory or a named attribute directory. LOOKUPP assigns the filehandle for its parent directory to be the current filehandle. If there is no parent directory, an NFS4ERR_NOENT error must be returned. Therefore, NFS4ERR_NOENT will be returned by the server when the current filehandle is at the root or top of the server's file tree.

As for LOOKUP, LOOKUPP will also cross mountpoints.

If the current filehandle is not a directory or named attribute directory, the error NFS4ERR_NOTDIR is returned.

If the requester's security flavor does not match that configured for the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC (a future minor revision of NFSv4 may upgrade this to MUST) in the LOOKUPP response. However, if the server does so, it MUST support the new SECINFO_NO_NAME operation, so that the client can gracefully determine the correct security flavor. See the discussion of the SECINFO_NO_NAME operation for a description.

ERRORS

NFS4ERR_ACCESS NFS4ERR_BADHANDLE NFS4ERR_FHEXPIRED NFS4ERR_IO NFS4ERR_MOVED NFS4ERR_NOENT NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE NFS4ERR_WRONGSEC

14.2 SECINFO -- Obtain Available Security

If the NFSv4 minor version is 1, then the following replaces section 14.2.31 of the NFSv4.0 specification. The SECINFO operation's "over the wire" format is not altered, but the semantics are slightly modified to account for the addition of SECINFO_NO_NAME.

SYNOPSIS

(cfh), name -> { secinfo }

ARGUMENT

struct SECINFO4args {
        /* CURRENT_FH: directory */
        component4      name;
};

RESULT

enum rpc_gss_svc_t { /* From RFC 2203 */
        RPC_GSS_SVC_NONE        = 1,
        RPC_GSS_SVC_INTEGRITY   = 2,
        RPC_GSS_SVC_PRIVACY     = 3
};

struct rpcsec_gss_info {
        sec_oid4        oid;
        qop4            qop;
        rpc_gss_svc_t   service;
};

union secinfo4 switch (uint32_t flavor) {
case RPCSEC_GSS:
        rpcsec_gss_info flavor_info;
default:
        void;
};

typedef secinfo4 SECINFO4resok<>;

union SECINFO4res switch (nfsstat4 status) {
case NFS4_OK:
        SECINFO4resok resok4;
default:
        void;
};

DESCRIPTION

The SECINFO operation is used by the client to obtain a list of valid RPC authentication flavors for a specific directory filehandle, file name pair. SECINFO should apply the same access methodology used for LOOKUP when evaluating the name. Therefore, if the requester does not have the appropriate access to LOOKUP the name, then SECINFO must behave the same way and return NFS4ERR_ACCESS.

The result will contain an array which represents the security mechanisms available, with an order corresponding to the server's preferences, the most preferred being first in the array. The client is free to pick whatever security mechanism it both desires and supports, or to pick, in the server's preference order, the first one it supports. The array entries are represented by the secinfo4 structure. The field 'flavor' will contain a value of AUTH_NONE, AUTH_SYS (as defined in [RFC1831]), or RPCSEC_GSS (as defined in [RFC2203]). The flavor field can also be any other security flavor registered with IANA.

For the flavors AUTH_NONE and AUTH_SYS, no additional security information is returned. The same is true of many (if not most) other security flavors, including AUTH_DH.
For a return value of RPCSEC_GSS, a security triple is returned that contains the mechanism object id (as defined in [RFC2743]), the quality of protection (as defined in [RFC2743]), and the service type (as defined in [RFC2203]). It is possible for SECINFO to return multiple entries with flavor equal to RPCSEC_GSS with different security triple values.

On success, the current filehandle retains its value.

If the name has a length of 0 (zero), or if name does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

IMPLEMENTATION

The SECINFO operation is expected to be used by the NFS client when the error value of NFS4ERR_WRONGSEC is returned from another NFS operation. This signifies to the client that the server's security policy is different from what the client is currently using. At this point, the client is expected to obtain a list of possible security flavors and choose what best suits its policies.

As mentioned, the server's security policies will determine when a client request receives NFS4ERR_WRONGSEC. The operations which may receive this error are: LINK, LOOKUP, LOOKUPP, OPEN, PUTFH, PUTPUBFH, PUTROOTFH, RESTOREFH, RENAME, and indirectly READDIR. LINK and RENAME will only receive this error if the security used for the operation is inappropriate for the saved filehandle. With the exception of READDIR, these operations represent the point at which the client can instantiate a filehandle into the "current filehandle" at the server. The filehandle is either provided by the client (PUTFH, PUTPUBFH, PUTROOTFH) or generated as a result of a name to filehandle translation (LOOKUP and OPEN). RESTOREFH is different because the filehandle is a result of a previous SAVEFH. Even though the filehandle, for RESTOREFH, might have previously passed the server's inspection for a security match, the server will check it again on RESTOREFH to ensure that the security policy has not changed.

If the client wants to resolve an error return of NFS4ERR_WRONGSEC, the following will occur:

* For LOOKUP and OPEN, the client will use SECINFO with the same current filehandle and name as provided in the original LOOKUP or OPEN to enumerate the available security triples.

* For LINK, PUTFH, PUTROOTFH, PUTPUBFH, RENAME, and RESTOREFH, the client will use SECINFO_NO_NAME { style = current_fh }. The client will prefix the SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or PUTROOTFH operation that provides the filehandle originally provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH, or, for the failed LINK or RENAME, the SAVEFH.

  NOTE: In NFSv4.0, the client was required to use SECINFO, and had to reconstruct the parent of the original filehandle and the component name of the original filehandle.

* For LOOKUPP, the client will use SECINFO_NO_NAME { style = parent } and provide the filehandle which equals the filehandle originally provided to LOOKUPP.

The READDIR operation will not directly return the NFS4ERR_WRONGSEC error. However, if the READDIR request included a request for attributes, it is possible that the READDIR request's security triple did not match that of a directory entry. If this is the case and the client has requested the rdattr_error attribute, the server will return the NFS4ERR_WRONGSEC error in rdattr_error for the entry.
See the section "Security Considerations" for a discussion of the recommendations for the security flavor used by SECINFO and SECINFO_NO_NAME.

ERRORS

14.3 SECINFO_NO_NAME - Get Security on Unnamed Object

Obtain available security mechanisms using either the parent of an object or the current filehandle.

SYNOPSIS

(cfh), secinfo_style -> { secinfo }

ARGUMENT

enum secinfo_style_4 {
        current_fh = 0,
        parent = 1
};

typedef secinfo_style_4 SECINFO_NO_NAME4args;

RESULT

typedef SECINFO4res SECINFO_NO_NAME4res;

DESCRIPTION

Like the SECINFO operation, SECINFO_NO_NAME is used by the client to obtain a list of valid RPC authentication flavors for a specific file object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that are accessed by filehandle.

There are two styles of SECINFO_NO_NAME, as determined by the value of the secinfo_style_4 enumeration. If "current_fh" is passed, then SECINFO_NO_NAME is querying for the required security for the current filehandle. If "parent" is passed, then SECINFO_NO_NAME is querying for the required security of the current filehandle's parent. If the style selected is "parent", then SECINFO should apply the same access methodology used for LOOKUPP when evaluating the traversal to the parent directory. Therefore, if the requester does not have the appropriate access to LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and return NFS4ERR_ACCESS.

Note that if PUTFH, PUTPUBFH, or PUTROOTFH return NFS4ERR_WRONGSEC, this is tantamount to the server asserting that the client will have to guess what the required security is, because there is no way to query. Therefore, the client must iterate through the security triples available at the client and reattempt the PUTFH, PUTROOTFH, or PUTPUBFH operation. In the unfortunate event none of the MANDATORY security triples are supported by the client and server, the client SHOULD try using others that support integrity. Failing that, the client can try using other forms (e.g., AUTH_SYS and AUTH_NONE), but because such forms lack integrity checks, this puts the client at risk.

The server implementor should pay particular attention to the section "Clarification of Security Negotiation in NFSv4.1" for implementation suggestions for avoiding NFS4ERR_WRONGSEC error returns from PUTFH, PUTROOTFH, or PUTPUBFH.

Everything else about SECINFO_NO_NAME is the same as SECINFO. See the previous discussion on SECINFO.

IMPLEMENTATION

See the previous discussion on SECINFO.
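Taken together, the recovery bullets in the SECINFO discussion and the two styles above reduce to a small decision procedure. The following C sketch is illustrative only: the enum, the stub functions, and all names are hypothetical stand-ins for a real client's COMPOUND machinery, and it omits the triple-iteration fallback needed when PUTFH-class operations themselves return NFS4ERR_WRONGSEC (where, as noted above, no query is possible).

   #include <stdio.h>

   /* Hypothetical classification of the operation that failed. */
   enum failed_op {
       FAILED_LOOKUP_OR_OPEN,  /* LOOKUP, OPEN */
       FAILED_LOOKUPP,         /* LOOKUPP */
       FAILED_PUTFH_CLASS      /* LINK, PUTFH, PUTROOTFH, PUTPUBFH,
                                  RENAME, RESTOREFH */
   };

   enum secinfo_style { STYLE_CURRENT_FH = 0, STYLE_PARENT = 1 };

   /* Stubs standing in for the client's COMPOUND machinery. */
   static void send_secinfo(const char *name)
   {
       printf("SECINFO on (cfh, \"%s\")\n", name);
   }

   static void send_secinfo_no_name(enum secinfo_style style)
   {
       printf("SECINFO_NO_NAME { style = %s }\n",
              style == STYLE_PARENT ? "parent" : "current_fh");
   }

   static void recover_wrongsec(enum failed_op op, const char *name)
   {
       switch (op) {
       case FAILED_LOOKUP_OR_OPEN:
           /* Reuse the same current filehandle and name. */
           send_secinfo(name);
           break;
       case FAILED_LOOKUPP:
           /* Query the parent of the original filehandle. */
           send_secinfo_no_name(STYLE_PARENT);
           break;
       case FAILED_PUTFH_CLASS:
           /* Re-establish the filehandle, then query it directly. */
           send_secinfo_no_name(STYLE_CURRENT_FH);
           break;
       }
   }

   int main(void)
   {
       recover_wrongsec(FAILED_LOOKUP_OR_OPEN, "projects");
       recover_wrongsec(FAILED_PUTFH_CLASS, "");
       return 0;
   }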
ERRORS

NFS4ERR_ACCESS NFS4ERR_BADCHAR NFS4ERR_BADHANDLE NFS4ERR_BADNAME NFS4ERR_BADXDR NFS4ERR_FHEXPIRED NFS4ERR_INVAL NFS4ERR_MOVED NFS4ERR_NAMETOOLONG NFS4ERR_NOENT NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE

14.4 CREATECLIENTID - Instantiate Clientid

Create a clientid.

SYNOPSIS

client -> clientid

ARGUMENT

struct CREATECLIENTID4args {
        nfs_client_id4  clientdesc;
};

RESULT

struct CREATECLIENTID4resok {
        clientid4       clientid;
        verifier4       clientid_confirm;
};

union CREATECLIENTID4res switch (nfsstat4 status) {
case NFS4_OK:
        CREATECLIENTID4resok resok4;
case NFS4ERR_CLID_INUSE:
        void;
default:
        void;
};

DESCRIPTION

The client uses the CREATECLIENTID operation to register a particular client identifier with the server. The clientid returned from this operation will be necessary for requests that create state on the server and will serve as a parent object to sessions created by the client. In order to verify the clientid, it must first be used as an argument to CREATESESSION.

IMPLEMENTATION

A server's client record is a 5-tuple:

1. clientdesc.id:

   The long form client identifier, sent via the clientdesc.id subfield of the CREATECLIENTID4args structure

2. clientdesc.verifier:

   A client-specific value used to indicate reboots, sent via the clientdesc.verifier subfield of the CREATECLIENTID4args structure

3. principal:

   The RPCSEC_GSS principal sent via the RPC headers

4. clientid:

   The shorthand client identifier, generated by the server and returned via the clientid field in the CREATECLIENTID4resok structure

5. confirmed:

   A private field on the server indicating whether or not a client record has been confirmed. A client record is confirmed if there has been a successful CREATESESSION operation to confirm it. Otherwise it is unconfirmed. An unconfirmed record is established by a CREATECLIENTID call. Any unconfirmed record that is not confirmed within a lease period may be removed.
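Expressed as a data structure, such a record might look as follows. This is a non-normative C sketch; the field sizes and types are implementation choices, not protocol definitions.

   #include <stdbool.h>
   #include <stdint.h>

   #define NFS4_VERIFIER_SIZE 8

   /*
    * Illustrative server-side client record (the 5-tuple above).
    * Field names mirror the prose; the representations of id and
    * principal are implementation choices.
    */
   struct client_record {
       char     *id;                            /* clientdesc.id: long form identifier */
       uint8_t   verifier[NFS4_VERIFIER_SIZE];  /* clientdesc.verifier: reboot detector */
       char     *principal;                     /* RPCSEC_GSS principal from RPC headers */
       uint64_t  clientid;                      /* shorthand id generated by the server */
       bool      confirmed;                     /* TRUE once a CREATESESSION succeeds */
   };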
The following identifiers represent special values for the fields in the records.

id_arg:

   The value of the clientdesc.id subfield of the CREATECLIENTID4args structure of the current request.

verifier_arg:

   The value of the clientdesc.verifier subfield of the CREATECLIENTID4args structure of the current request.

old_verifier_arg:

   A value of the clientdesc.verifier field of a client record received in a previous request; this is distinct from verifier_arg.

principal_arg:

   The value of the RPCSEC_GSS principal for the current request.

old_principal_arg:

   A value of the RPCSEC_GSS principal received for a previous request. This is distinct from principal_arg.

clientid_ret:

   The value of the clientid field the server will return in the CREATECLIENTID4resok structure for the current request.

old_clientid_ret:

   The value of the clientid field the server returned in the CREATECLIENTID4resok structure for a previous request. This is distinct from clientid_ret.

Since CREATECLIENTID is a non-idempotent operation, we must consider the possibility that replays may occur as a result of a client reboot, network partition, malfunctioning router, etc. Replays are identified by the value of the clientdesc field of CREATECLIENTID4args, and the method for dealing with them is outlined in the scenarios below.

The scenarios are described in terms of which client records, whose clientdesc.id subfield equals id_arg, exist in the server's set of client records. Any case in which there is more than one record with identical values for id_arg represents a server implementation error. Operation in the potentially valid cases is summarized as follows.

1. Common case

   If no client record with clientdesc.id matching id_arg exists, a new shorthand client identifier clientid_ret is generated, and the following unconfirmed record is added to the server's state.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

   Subsequently, the server returns clientid_ret.

2. Router Replay

   If the server has the following confirmed record, then this request is likely the result of a replayed request due to a faulty router or lost connection.

   { id_arg, verifier_arg, principal_arg, clientid_ret, TRUE }

   Since the record has been confirmed, the client must have received the server's reply from the initial CREATECLIENTID request. Since this is simply a spurious request, there is no modification to the server's state, and the server makes no reply to the client.

3. Client Collision

   If the server has the following confirmed record, then this request is likely the result of a chance collision between the values of the clientdesc.id subfield of CREATECLIENTID4args for two different clients.

   { id_arg, *, old_principal_arg, clientid_ret, TRUE }

   Since the value of the clientdesc.id subfield of each client record must be unique, there is no modification of the server's state, and NFS4ERR_CLID_INUSE is returned to indicate that the client should retry with a different value for the clientdesc.id subfield of CREATECLIENTID4args.

   This scenario may also represent a malicious attempt to destroy a client's state on the server. For security reasons, the server MUST NOT remove the client's state when there is a principal mismatch.

4. Replay

   If the server has the following unconfirmed record, then this request is likely the result of a client replay due to a network partition or some other connection failure.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

   Since the response to the CREATECLIENTID request that created this record may have been lost, it is not acceptable to drop this duplicate request. However, rather than processing it normally, the existing record is left unchanged and clientid_ret, which was generated for the previous request, is returned.

5. Change of Principal

   If the server has the following unconfirmed record, then this request is likely the result of a client which has for whatever reason changed principals (possibly to change security flavor) after calling CREATECLIENTID, but before calling CREATESESSION.

   { id_arg, verifier_arg, old_principal_arg, clientid_ret, FALSE }

   Since the client has not changed, the principal field of the unconfirmed record is updated to principal_arg and clientid_ret is again returned.
   There is a small possibility that this is merely a collision on the clientdesc.id field of CREATECLIENTID4args between unrelated clients, but since that is unlikely, and an unconfirmed record does not generally have any filesystem pertinent state, we can assume it is the same client without risking the loss of any important state.

   After processing, the following record will exist on the server.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

6. Client Reboot

   If the server has the following confirmed client record, then this request is likely from a previously confirmed client which has rebooted.

   { id_arg, old_verifier_arg, principal_arg, clientid_ret, TRUE }

   Since the previous incarnation of the same client will no longer be making requests, lock and share reservations should be released immediately rather than forcing the new incarnation to wait for the lease time on the previous incarnation to expire. Furthermore, session state should be removed, since if the client had maintained that information across reboot, this request would not have been issued. If the server does not support the CLAIM_DELEGATE_PREV claim type, associated delegations should be purged as well; otherwise, delegations are retained and recovery proceeds according to RFC3530. The client record is updated with the new verifier and its status is changed to unconfirmed.

   After processing, clientid_ret is returned to the client and the following record will exist on the server.

   { id_arg, verifier_arg, principal_arg, clientid_ret, FALSE }

7. Reboot before confirmation

   If the server has the following unconfirmed record, then this request is likely from a client which rebooted before sending a CREATESESSION request.

   { id_arg, old_verifier_arg, *, clientid_ret, FALSE }

   Since this is believed to be a request from a new incarnation of the original client, the server updates the value of clientdesc.verifier and returns the original clientid_ret. After processing, the following state exists on the server.

   { id_arg, verifier_arg, *, clientid_ret, FALSE }
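Viewed procedurally, the seven scenarios amount to a lookup on clientdesc.id followed by a comparison of the verifier, principal, and confirmed fields. The following condensed C sketch is non-normative; the record type and all names are invented for illustration.

   #include <stdbool.h>
   #include <stdio.h>
   #include <string.h>

   /* Actions corresponding to the numbered scenarios above. */
   enum cc_action {
       CC_NEW_RECORD,        /* 1: add unconfirmed record, fresh clientid     */
       CC_IGNORE,            /* 2: confirmed replay (faulty router); no reply */
       CC_CLID_INUSE,        /* 3: collision with another client's id         */
       CC_RETURN_EXISTING,   /* 4: unconfirmed replay; return old clientid    */
       CC_UPDATE_PRINCIPAL,  /* 5: principal changed before confirmation      */
       CC_CLIENT_REBOOT,     /* 6: confirmed record, new verifier; purge      */
       CC_UPDATE_VERIFIER    /* 7: reboot before confirmation                 */
   };

   /* rec is the record whose clientdesc.id equals id_arg, or NULL. */
   struct rec { const char *verifier, *principal; bool confirmed; };

   static enum cc_action classify(const struct rec *rec,
                                  const char *verifier_arg,
                                  const char *principal_arg)
   {
       if (rec == NULL)
           return CC_NEW_RECORD;

       bool same_verifier  = strcmp(rec->verifier,  verifier_arg)  == 0;
       bool same_principal = strcmp(rec->principal, principal_arg) == 0;

       if (rec->confirmed) {
           if (same_verifier && same_principal) return CC_IGNORE;
           if (!same_principal)                 return CC_CLID_INUSE;
           return CC_CLIENT_REBOOT;             /* old verifier, same principal */
       }
       if (same_verifier)
           return same_principal ? CC_RETURN_EXISTING : CC_UPDATE_PRINCIPAL;
       return CC_UPDATE_VERIFIER;               /* old verifier, any principal */
   }

   int main(void)
   {
       struct rec confirmed = { "boot-1", "alice@realm", true };
       /* Prints 5 (CC_CLIENT_REBOOT): scenario 6, client reboot. */
       printf("%d\n", classify(&confirmed, "boot-2", "alice@realm"));
       return 0;
   }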
ERRORS

NFS4ERR_BADXDR NFS4ERR_CLID_INUSE NFS4ERR_INVAL NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT

14.5 CREATESESSION - Create New Session and Confirm Clientid

Start up a session and confirm the clientid.

SYNOPSIS

clientid, session_args -> sessionid, session_args

ARGUMENT

struct CREATESESSION4args {
        clientid4       clientid;
        bool            persist;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        count4          headerpadsize;
        switch (bool clientid_confirm) {
        case TRUE:
                verifier4       setclientid_confirm;
        case FALSE:
                void;
        }
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

RESULT

typedef opaque sessionid4[16];

struct CREATESESSION4resok {
        sessionid4      sessionid;
        bool            persist;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        count4          headerpadsize;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

union CREATESESSION4res switch (nfsstat4 status) {
case NFS4_OK:
        CREATESESSION4resok resok4;
default:
        void;
};

DESCRIPTION

This operation is used by the client to create new session objects on the server. Additionally, the first session created with a new shorthand client identifier serves to confirm the creation of that client's state on the server. The server returns the parameter values for the new session.

IMPLEMENTATION

To describe the implementation, the same notation for client records introduced in the description of CREATECLIENTID is used, with the following addition.

clientid_arg: The value of the clientid field of the CREATESESSION4args structure of the current request.

Since CREATESESSION is a non-idempotent operation, we must consider the possibility that replays may occur as a result of a client reboot, network partition, malfunctioning router, etc. Replays are identified by the value of the clientid and sessionid fields of CREATESESSION4args, and the method for dealing with them is outlined in the scenarios below.

The processing of this operation is divided into two phases: clientid confirmation and session creation. In case the state for the provided clientid has not been verified, it is confirmed before the session is created. Otherwise the clientid confirmation phase is skipped and only the session creation phase occurs. Note that since only confirmed clients may create sessions, the clientid confirmation stage does not depend upon sessionid_arg.

CLIENTID CONFIRMATION

The operational cases are described in terms of which client records, whose clientid field equals clientid_arg, exist in the server's set of client records. Any case in which there is more than one record with identical values for clientid represents a server implementation error. Operation in the potentially valid cases is summarized as follows.

1. Common Case

   If the server has the following unconfirmed record, then this is the expected confirmation of an unconfirmed record.

   { *, *, principal_arg, clientid_arg, FALSE }

   The confirmed field of the record is set to TRUE and processing of the operation continues normally.
2. Stale Clientid

   If the server contains no records with clientid equal to clientid_arg, then most likely the client's state has been purged during a period of inactivity, possibly due to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, and no changes are made to any client records on the server.

3. Principal Change or Collision

   If the server has the following record, then the client has changed principals after the previous CREATECLIENTID request, or there has been a chance collision between shorthand client identifiers.

   { *, *, old_principal_arg, clientid_arg, * }

   Neither of these cases is permissible. Processing stops and NFS4ERR_CLID_INUSE is returned to the client. No changes are made to any client records on the server.

SESSION CREATION

To determine whether this request is a replay, the server examines the sessionid argument provided by the client. If the sessionid matches the identifier of a previously created session, then this request must be interpreted as a replay. No new state is created, and a reply with the parameters of the existing session is returned to the client. If a session corresponding to the sessionid does not already exist, then the request is not a replay and is processed as follows.

NOTE: It is the responsibility of the client to generate appropriate values for sessionid. Since the ordering of messages sent on different transport connections is not guaranteed, immediately reusing the sessionid of a previously destroyed session may yield unpredictable results. Client implementations should avoid recently used sessionids to ensure correct behavior.

The server examines the persist, maxrequestsize, maxresponsesize, maxrequests, and headerpadsize arguments. For each argument, if the value is acceptable to the server, it is recommended that the server use the provided value to create the new session. If it is not acceptable, the server may use a different value, but it must return the value used to the client. These parameters have the following interpretation.

persist:

   True if the client desires server support for "reliable" semantics. For sessions in which only idempotent operations will be used (e.g., a read-only session), clients should set this value to false. If the server does not or cannot provide "reliable" semantics, this value must be set to false on return.

maxrequestsize:

   The maximum size of a COMPOUND request that will be sent by the client, including RPC headers.

maxresponsesize:

   The maximum size of a COMPOUND reply that the client will accept from the server, including RPC headers. The server must not increase the value of this parameter. If a client sends a COMPOUND request for which the size of the reply would exceed this value, the server will return NFS4ERR_RESOURCE.

maxrequests:

   The maximum number of concurrent COMPOUND requests that the client will issue on the session. Subsequent COMPOUND requests will each be assigned a slot identifier by the client in the range 0 to maxrequests - 1 inclusive. A slot id cannot be reused until the previous request on that slot has completed.

headerpadsize:

   The maximum amount of padding the client is willing to apply to ensure that write payloads are aligned on some boundary at the server.
   The server should reply with its preferred value, or zero if padding is not in use. The server may decrease this value but must not increase it.

The server creates the session by recording the parameter values used and, if the persist parameter is true and has been accepted by the server, allocating space for the duplicate request cache (DRC).

If the session state is created successfully, the server associates it with the session identifier provided by the client. This identifier must be unique among the client's active sessions, but there is no need for it to be globally unique. Finally, the server returns the negotiated values used to create the session to the client.
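The per-parameter rule above (use the client's value when acceptable, otherwise substitute a different value and report the value actually used, and never increase maxresponsesize or headerpadsize) can be sketched as follows. This is a non-normative illustration; the server limits are invented.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Non-normative sketch of CREATESESSION parameter negotiation. */
   struct session_params {
       bool     persist;
       uint32_t maxrequestsize, maxresponsesize, maxrequests, headerpadsize;
   };

   /* Invented server limits, for illustration only. */
   #define SRV_DRC_SUPPORTED true
   #define SRV_MAX_REQSIZE   1048576u
   #define SRV_MAX_RSPSIZE   1048576u
   #define SRV_MAX_REQUESTS  64u
   #define SRV_MAX_HDRPAD    512u

   static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

   static struct session_params negotiate(struct session_params req)
   {
       struct session_params res;
       /* persist must come back false if the server cannot honor it. */
       res.persist         = req.persist && SRV_DRC_SUPPORTED;
       res.maxrequestsize  = min_u32(req.maxrequestsize, SRV_MAX_REQSIZE);
       /* The server must not increase maxresponsesize. */
       res.maxresponsesize = min_u32(req.maxresponsesize, SRV_MAX_RSPSIZE);
       res.maxrequests     = min_u32(req.maxrequests, SRV_MAX_REQUESTS);
       /* Padding may be decreased but not increased; zero if unused. */
       res.headerpadsize   = min_u32(req.headerpadsize, SRV_MAX_HDRPAD);
       return res;
   }

   int main(void)
   {
       struct session_params req = { true, 4u << 20, 4u << 20, 128, 4096 };
       struct session_params got = negotiate(req);
       printf("maxrequests negotiated down to %u\n", got.maxrequests);
       return 0;
   }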
ERRORS

NFS4ERR_BADXDR NFS4ERR_CLID_INUSE NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE_CLIENTID

14.6 BIND_BACKCHANNEL - Create a callback channel binding

Establish a callback channel on the connection.

SYNOPSIS

ARGUMENT

struct BIND_BACKCHANNEL4args {
        clientid4       clientid;
        uint32_t        callback_program;
        uint32_t        callback_ident;
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

RESULT

struct BIND_BACKCHANNEL4resok {
        count4          maxrequestsize;
        count4          maxresponsesize;
        count4          maxrequests;
        switch (channelmode4 mode) {
        case DEFAULT:
                void;
        case STREAM:
                streamchannelattrs4 streamchanattrs;
        case RDMA:
                rdmachannelattrs4 rdmachanattrs;
        };
};

union BIND_BACKCHANNEL4res switch (nfsstat4 status) {
case NFS4_OK:
        BIND_BACKCHANNEL4resok resok4;
default:
        void;
};

DESCRIPTION

The BIND_BACKCHANNEL operation serves to establish the current connection as a designated callback channel for the specified session. Normally, only one callback channel is bound; however, if more than one is established, they are used at the server's prerogative, and no affinity or preference is specified by the client.

The arguments and results of the BIND_BACKCHANNEL call are a subset of the session parameters and are used identically to those values, applied to the callback channel only. However, not all session operation channel parameters are relevant to the callback channel, for example header padding (since writes of bulk data are not performed in callbacks).

IMPLEMENTATION

No discussion at this time.

ERRORS

TBD

14.7 DESTROYSESSION - Destroy existing session

Destroy an existing session.

SYNOPSIS

void -> status

ARGUMENT

struct DESTROYSESSION4args {
        sessionid4      sessionid;
};

RESULT

struct DESTROYSESSION4res {
        nfsstat4        status;
};

DESCRIPTION

The DESTROYSESSION operation closes the session and discards any active state such as locks, leases, and server duplicate request cache entries. Any remaining connections bound to the session are immediately unbound and may additionally be closed by the server.

This operation must be the final, or only, operation in any request. Because the operation results in destruction of the session, any duplicate request caching for this request, as well as for previously completed requests, will be lost. For this reason, it is advisable not to place this operation in a request with other state-modifying operations. In addition, a SEQUENCE operation is not required in the request.

Note that because the operation will never be replayed by the server, a client that retransmits the request may receive an error in response, even though the session may have been successfully destroyed.

IMPLEMENTATION

No discussion at this time.

ERRORS

TBD

14.8 SEQUENCE - Supply per-procedure sequencing and control

Supply per-procedure sequencing and control.

SYNOPSIS

control -> control

ARGUMENT

typedef uint32_t sequenceid4;
typedef uint32_t slotid4;

struct SEQUENCE4args {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
};

RESULT

struct SEQUENCE4resok {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
        slotid4         target_maxslot;
};

union SEQUENCE4res switch (nfsstat4 status) {
case NFS4_OK:
        SEQUENCE4resok resok4;
default:
        void;
};

DESCRIPTION

The SEQUENCE operation is used to manage operational accounting for the session on which the operation is sent. The contents include the client and session to which this request belongs; the slotid and sequenceid, used by the server to implement session request control and the duplicate reply cache semantics; and exchanged slot counts, which are used to adjust these values. This operation must appear once as the first operation in each COMPOUND sent after the channel is successfully bound, or a protocol error must result.

IMPLEMENTATION

No discussion at this time.

ERRORS

NFS4ERR_BADSESSION NFS4ERR_BADSLOT
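A client-side view of the slot rule described above (slot ids range from 0 to maxrequests - 1, and a slot may not be reused until the previous request on it completes) might be organized as in the following non-normative C sketch; all names are invented.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* Non-normative client slot table for SEQUENCE.  One sequenceid per
    * slot; a slot becomes reusable only after the reply to its previous
    * request has been processed. */
   struct slot {
       uint32_t sequenceid;   /* bumped for each new request on this slot */
       bool     in_use;
   };

   struct slot_table {
       struct slot *slots;
       uint32_t     maxrequests;
   };

   /* Acquire a free slot, bumping its sequenceid; -1 if all are busy. */
   static int slot_acquire(struct slot_table *t, uint32_t *seq_out)
   {
       for (uint32_t i = 0; i < t->maxrequests; i++) {
           if (!t->slots[i].in_use) {
               t->slots[i].in_use = true;
               t->slots[i].sequenceid++;
               *seq_out = t->slots[i].sequenceid;
               return (int)i;
           }
       }
       return -1;  /* caller must wait for an outstanding reply */
   }

   static void slot_release(struct slot_table *t, uint32_t slotid)
   {
       t->slots[slotid].in_use = false;
   }

   int main(void)
   {
       struct slot_table t = { calloc(8, sizeof(struct slot)), 8 };
       if (t.slots == NULL)
           return 1;
       uint32_t seq;
       int slotid = slot_acquire(&t, &seq);
       printf("slot %d, sequenceid %u\n", slotid, seq);  /* slot 0, sequenceid 1 */
       slot_release(&t, (uint32_t)slotid);
       free(t.slots);
       return 0;
   }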
14.9 CB_RECALLCREDIT - Change flow control limits

Change flow control limits.

SYNOPSIS

targetcount -> status

ARGUMENT

struct CB_RECALLCREDIT4args {
        sessionid4      sessionid;
        uint32_t        target;
};

RESULT

struct CB_RECALLCREDIT4res {
        nfsstat4        status;
};

DESCRIPTION

The CB_RECALLCREDIT operation requests that the client return session and transport credits to the server, by zero-length RDMA Sends or NULL NFSv4 operations.

IMPLEMENTATION

No discussion at this time.

ERRORS

NONE

14.10 CB_SEQUENCE - Supply callback channel sequencing and control

Sequence and control.

SYNOPSIS

control -> control

ARGUMENT

typedef uint32_t sequenceid4;
typedef uint32_t slotid4;

struct CB_SEQUENCE4args {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
};

RESULT

struct CB_SEQUENCE4resok {
        clientid4       clientid;
        sessionid4      sessionid;
        sequenceid4     sequenceid;
        slotid4         slotid;
        slotid4         maxslot;
        slotid4         target_maxslot;
};

union CB_SEQUENCE4res switch (nfsstat4 status) {
case NFS4_OK:
        CB_SEQUENCE4resok resok4;
default:
        void;
};

DESCRIPTION

The CB_SEQUENCE operation is used to manage operational accounting for the callback channel of the session on which the operation is sent. The contents include the client and session to which this request belongs; the slotid and sequenceid, used by the server to implement session request control and the duplicate reply cache semantics; and exchanged slot counts, which are used to adjust these values. This operation must appear once as the first operation in each CB_COMPOUND sent after the callback channel is successfully bound, or a protocol error must result.

IMPLEMENTATION

No discussion at this time.

ERRORS

NFS4ERR_BADSESSION NFS4ERR_BADSLOT

14.11 GET_DIR_DELEGATION - Get a directory delegation

Obtain a directory delegation.

SYNOPSIS

(cfh), requested notification -> (cfh), cookieverf, stateid, supported notification

ARGUMENT

struct GET_DIR_DELEGATION4args {
        dir_notification_type4  notification_type;
        attr_notice4            child_attr_delay;
        attr_notice4            dir_attr_delay;
};

/*
 * Notification types.
 */
const DIR_NOTIFICATION_NONE                    = 0x00000000;
const DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES = 0x00000001;
const DIR_NOTIFICATION_CHANGE_DIR_ATTRIBUTES   = 0x00000002;
const DIR_NOTIFICATION_REMOVE_ENTRY            = 0x00000004;
const DIR_NOTIFICATION_ADD_ENTRY               = 0x00000008;
const DIR_NOTIFICATION_RENAME_ENTRY            = 0x00000010;
const DIR_NOTIFICATION_CHANGE_COOKIE_VERIFIER  = 0x00000020;

typedef uint32_t dir_notification_type4;

typedef nfstime4 attr_notice4;

RESULT

struct GET_DIR_DELEGATION4resok {
        verifier4       cookieverf;
        /* Stateid for get_dir_delegation */
        stateid4        stateid;
        /* Which notifications can the server support */
        dir_notification_type4  supp_notification;
        bitmap4         child_attributes;
        bitmap4         dir_attributes;
};

union GET_DIR_DELEGATION4res switch (nfsstat4 status) {
case NFS4_OK:
        /* CURRENT_FH: delegated dir */
        GET_DIR_DELEGATION4resok resok4;
default:
        void;
};

DESCRIPTION

The GET_DIR_DELEGATION operation is used by a client to request a directory delegation. The directory is represented by the current filehandle. The client also specifies whether it wants the server to notify it when the directory changes in certain ways by setting one or more bits in a bitmap. The server may choose not to grant the delegation; in that case the server will return NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the delegation, it will return a cookie verifier for that directory. If the cookie verifier changes while the client is holding the delegation, the delegation will be recalled unless the client has asked for notification of this event, in which case a notification will be sent to the client.

The server will also return a directory delegation stateid, in addition to the cookie verifier, as a result of the GET_DIR_DELEGATION operation. This stateid will appear in callback messages related to the delegation, such as notifications and delegation recalls. The client will use this stateid to return the delegation voluntarily or upon recall. A delegation is returned by calling the DELEGRETURN operation.

The server may not be able to support notifications of certain events.
If the client asks for such notifications, the server must inform the client of its inability to do so as part of the GET_DIR_DELEGATION reply, by not setting the appropriate bits in the supported notifications bitmask contained in the reply.

The GET_DIR_DELEGATION operation can be used for both normal and named attribute directories. It covers all the entries in the directory except the ".." entry. That means that if a directory and its parent both hold directory delegations, any changes to the parent will not cause a notification to be sent for the child, even though the child's ".." entry points to the parent.

IMPLEMENTATION

Directory delegation provides the benefit of improving cache consistency of namespace information. This is done through synchronous callbacks. A server must support synchronous callbacks in order to support directory delegations. In addition, asynchronous notifications provide a way to reduce network traffic as well as to improve client performance in certain conditions. Notifications would not be requested when the goal is just cache consistency.

Notifications are specified in terms of potential changes to the directory. A client can ask to be notified whenever an entry is added to a directory by setting notification_type to DIR_NOTIFICATION_ADD_ENTRY. It can also ask for notifications on entry removal, renames, directory attribute changes, and cookie verifier changes by setting the notification_type flags appropriately. In addition, the client can ask for notifications upon attribute changes to children in the directory, to keep its attribute cache up to date. However, any changes made to child attributes do not cause the delegation to be recalled. If a client is interested in directory entry caching, or negative name caching, it can set the notification_type appropriately and the server will notify it of all changes that would otherwise invalidate its name cache. The kind of notification a client asks for may depend on the directory size, its rate of change, and the applications being used to access that directory. However, the conditions under which a client might ask for a notification are outside the scope of this specification.

The client will set one or more bits in a bitmap (notification_type) to let the server know what kind of notification(s) it is interested in. For attribute notifications it will set bits in another bitmap to indicate which attributes it wants to be notified of. If the server does not support notifications for changes to a certain attribute, it should not set that attribute in the supported attribute bitmap (supp_notification) specified in the reply.

In addition, the client will also let the server know whether it wants to get the notification as soon as the attribute change occurs or after a certain delay, by setting a delay factor: child_attr_delay for attribute changes to children and dir_attr_delay for attribute changes to the directory. If this delay factor is set to zero, that indicates to the server that the client wants to be notified of any attribute changes as soon as they occur. If the delay factor is set to N, the server will make a best effort guarantee that attribute updates are not out of sync by more than that.
One value covers all attribute changes for the directory and another value covers all attribute changes for all children in the directory. If the client asks for a delay factor that the server does not support, or one that may cause significant resource consumption on the server by causing the server to send a lot of notifications, the server should not commit to sending out notifications for that attribute and therefore must not set the appropriate bit in the child_attributes and dir_attributes bitmaps in the response.

The server will let the client know which notifications it can support by setting the appropriate bits in a bitmap. If it agrees to send attribute notifications, it will also set two attribute masks indicating which attributes it will send change notifications for. One of the masks covers changes in directory attributes and the other covers attribute changes to any files in the directory.

The client should use a security flavor that the filesystem is exported with. If it uses a different flavor, the server should return NFS4ERR_WRONGSEC.
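Putting the request-side pieces together, a client asking for name-cache maintenance plus batched child attribute updates might fill in the arguments roughly as follows. This non-normative C sketch flattens the delay factors to seconds for brevity; the real arguments use nfstime4, and the supp_notification value shown is an invented server reply.

   #include <stdint.h>
   #include <stdio.h>

   /* Notification bits from the ARGUMENT section above. */
   #define DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES 0x00000001
   #define DIR_NOTIFICATION_REMOVE_ENTRY            0x00000004
   #define DIR_NOTIFICATION_ADD_ENTRY               0x00000008

   /* Simplified stand-in for GET_DIR_DELEGATION4args. */
   struct get_dir_delegation_args {
       uint32_t notification_type;
       uint32_t child_attr_delay_secs;
       uint32_t dir_attr_delay_secs;
   };

   int main(void)
   {
       struct get_dir_delegation_args args = {
           .notification_type = DIR_NOTIFICATION_ADD_ENTRY
                              | DIR_NOTIFICATION_REMOVE_ENTRY
                              | DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES,
           .child_attr_delay_secs = 5,  /* tolerate 5 s of attribute skew */
           .dir_attr_delay_secs   = 0,  /* unused: no dir-attribute bit set */
       };

       /* After the reply, the client must check supp_notification: any
        * bit it set that the server cleared is a notification it will
        * not receive. */
       uint32_t supp_notification = 0x0000000C;  /* e.g., add/remove only */
       uint32_t granted = args.notification_type & supp_notification;
       printf("granted notification mask: 0x%08x\n", granted);
       return 0;
   }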
ERRORS

NFS4ERR_ACCESS NFS4ERR_BADHANDLE NFS4ERR_BADXDR NFS4ERR_FHEXPIRED NFS4ERR_INVAL NFS4ERR_MOVED NFS4ERR_NOFILEHANDLE NFS4ERR_NOTDIR NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT NFS4ERR_STALE NFS4ERR_DIRDELEG_UNAVAIL NFS4ERR_WRONGSEC NFS4ERR_EIO NFS4ERR_NOTSUPP

14.12 CB_NOTIFY - Notify directory changes

Tell the client of directory changes.

SYNOPSIS

stateid, notification -> {}

ARGUMENT

struct CB_NOTIFY4args {
        stateid4        stateid;
        dir_notification4 changes<>;
};

/*
 * Notification information sent to the client.
 */
union dir_notification4
switch (dir_notification_type4 notification_type) {
case DIR_NOTIFICATION_CHANGE_CHILD_ATTRIBUTES:
        dir_notification_attribute4 change_child_attributes;
case DIR_NOTIFICATION_CHANGE_DIR_ATTRIBUTES:
        fattr4 change_dir_attributes;
case DIR_NOTIFICATION_REMOVE_ENTRY:
        dir_notification_remove4 remove_notification;
case DIR_NOTIFICATION_ADD_ENTRY:
        dir_notification_add4 add_notification;
case DIR_NOTIFICATION_RENAME_ENTRY:
        dir_notification_rename4 rename_notification;
case DIR_NOTIFICATION_CHANGE_COOKIE_VERIFIER:
        dir_notification_verifier4 verf_notification;
};

/*
 * Changed entry information.
 */
struct dir_entry {
        component4      file;
        fattr4          attrs;
};

struct dir_notification_attribute4 {
        dir_entry       changed_entry;
};

struct dir_notification_remove4 {
        dir_entry       old_entry;
        nfs_cookie4     old_entry_cookie;
};

struct dir_notification_rename4 {
        dir_entry       old_entry;
        dir_notification_add4 new_entry;
};

struct dir_notification_verifier4 {
        verifier4       old_cookieverf;
        verifier4       new_cookieverf;
};

struct dir_notification_add4 {
        dir_entry       new_entry;
        /* what READDIR would have returned for this entry */
        nfs_cookie4     new_entry_cookie;
        bool            last_entry;
        prev_entry_info4 prev_info;
};

union prev_entry_info4 switch (bool isprev) {
case TRUE:      /* A previous entry exists */
        prev_entry4 prev_entry_info;
case FALSE:     /* we are adding to an empty directory */
        void;
};

/*
 * Previous entry information
 */
struct prev_entry4 {
        dir_entry       prev_entry;
        /* what READDIR returned for this entry */
        nfs_cookie4     prev_entry_cookie;
};

RESULT

struct CB_NOTIFY4res {
        nfsstat4        status;
};

DESCRIPTION

The CB_NOTIFY operation is used by the server to send notifications to clients about changes in a delegated directory. These notifications are sent over the callback path. The notification is sent once the original request has been processed on the server. The server will send an array of notifications for all changes that might have occurred in the directory. The dir_notification_type4 can only have one bit set for each notification in the array. If the client holding the delegation makes any changes in the directory that cause files or subdirectories to be added or removed, the server will notify that client of the resulting change(s). If the client holding the delegation is making attribute or cookie verifier changes only, the server does not need to send notifications to that client. The server will send the following information for each operation:

* ADDING A FILE: The server will send information about the new entry being created, along with the cookie for that entry. The entry information contains the NFS name of the entry and its attributes. If this entry is added to the end of the directory, the server will set the last_entry flag to true. If the file is added such that there is at least one entry before it, the server will also return the previous entry information, along with its cookie. This is to help clients find the right location in their DNLC or directory caches where this entry should be cached.

* REMOVING A FILE: The server will send information about the directory entry being deleted. The server will also send the cookie value for the deleted entry so that clients can get to the cached information for this entry.

* RENAMING A FILE: The server will send information about both the old entry and the new entry. This includes the name and attributes for each entry. This notification is only sent if both entries are in the same directory. If the rename is across directories, the server will send a remove notification to one directory and an add notification to the other directory, assuming both have a directory delegation.
* FILE/DIR ATTRIBUTE CHANGE: The client will use the attribute mask to inform the server of the attributes for which it wants to receive notifications. This change notification can be requested both for changes to the attributes of the directory and for changes to any file attributes in the directory, by using two separate attribute masks. The client cannot ask for change attribute notification per file; one attribute mask covers all the files in the directory. Upon any attribute change, the server will send back the values of the changed attributes. Notifications might not make sense for some filesystem wide attributes, and it is up to the server to decide which subset it wants to support. The client can negotiate the frequency of attribute notifications by letting the server know how often it wants to be notified of an attribute change. The server will return the supported notification frequencies, or an indication that no notification is permitted for directory or child attributes, by setting the supp_dir_attr_notice and supp_child_attr_notice attributes respectively.

* COOKIE VERIFIER CHANGE: If the cookie verifier changes while a client is holding a delegation, the server will notify the client so that it can invalidate its cookies and reissue a READDIR to get the new set of cookies.

IMPLEMENTATION

ERRORS

NFS4ERR_BAD_STATEID NFS4ERR_INVAL NFS4ERR_BADXDR NFS4ERR_SERVERFAULT

14.13 CB_RECALL_ANY - Keep any N delegations

Notify the client to return delegations, keeping N of them.

SYNOPSIS

N -> {}

ARGUMENT

struct CB_RECALLANY4args {
        uint32_t        dlgs_to_keep;
};

RESULT

struct CB_RECALLANY4res {
        nfsstat4        status;
};

DESCRIPTION

The server may decide that it cannot hold all the delegation state without running out of resources. Since the server has no knowledge of which delegations are being used more than others, it cannot implement an effective reclaim scheme that avoids reclaiming frequently used delegations. In that case the server may issue a CB_RECALL_ANY callback to the client, asking it to keep N delegations and return the rest. The reason why CB_RECALL_ANY specifies a count of delegations the client may keep, as opposed to a count of delegations the client must yield, is as follows. Were it otherwise, there would be a potential for a race between a CB_RECALL_ANY that had a count of delegations to free and a set of client-originated operations to return delegations. As a result of the race, the client and server would have differing ideas as to how many delegations to return, and hence the client could mistakenly free too many delegations. This operation applies to delegations for a regular file (read or write) as well as for a directory.

The client can choose to return any type of delegation as a result of this callback, i.e., a read, write, or directory delegation. The client can also choose to keep more delegations than the server asked for, and it is up to the server to handle this situation. The server must give the client enough time to return the delegations. This time should not be less than the lease period.
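On the client, honoring the keep-N semantics might reduce to ordering delegations by recent use and returning the tail. The following C sketch is non-normative, with invented names standing in for real stateids and the DELEGRETURN machinery.

   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* Non-normative sketch of client handling for CB_RECALL_ANY: keep
    * the dlgs_to_keep most recently used delegations (of any type:
    * read, write, or directory) and return the rest. */
   struct delegation {
       uint64_t stateid_hint;  /* stand-in for the real stateid */
       uint64_t last_used;     /* client-local LRU clock */
   };

   /* Sort most recently used first. */
   static int by_recency_desc(const void *a, const void *b)
   {
       const struct delegation *x = a, *y = b;
       return (x->last_used < y->last_used) - (x->last_used > y->last_used);
   }

   static void cb_recall_any(struct delegation *dlgs, size_t n,
                             uint32_t dlgs_to_keep)
   {
       qsort(dlgs, n, sizeof(*dlgs), by_recency_desc);
       for (size_t i = dlgs_to_keep; i < n; i++) {
           /* DELEGRETURN (or a directory delegation return) goes here. */
           printf("returning delegation %llu\n",
                  (unsigned long long)dlgs[i].stateid_hint);
       }
   }

   int main(void)
   {
       struct delegation d[] = { {1, 10}, {2, 30}, {3, 20} };
       cb_recall_any(d, 3, 1);  /* keep the most recently used (stateid 2) */
       return 0;
   }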
IMPLEMENTATION

ERRORS

NFS4ERR_RESOURCE

14.14 LAYOUTGET - Get Layout Information

SYNOPSIS

(cfh), clientid, layout_type, iomode, offset, length, minlength, maxcount -> layout

ARGUMENT

struct LAYOUTGET4args {
        /* CURRENT_FH: file */
        clientid4               clientid;
        pnfs_layouttype4        layout_type;
        pnfs_layoutiomode4      iomode;
        offset4                 offset;
        length4                 length;
        length4                 minlength;
        count4                  maxcount;
};

RESULT

struct LAYOUTGET4resok {
        pnfs_layout4 layout;
};

union LAYOUTGET4res switch (nfsstat4 status) {
case NFS4_OK:
        LAYOUTGET4resok resok4;
default:
        void;
};

DESCRIPTION

Requests a layout for reading or writing (and reading) the file given by the filehandle at the byte range specified by offset and length. Layouts are identified by the clientid, filehandle, and layout type. The use of the iomode depends upon the layout type, but it should reflect the client's data access intent.

The LAYOUTGET operation returns layout information for the specified byte range: a layout segment. To get a layout segment from a specific offset through the end-of-file, regardless of the file's length, a length field with all bits set to 1 (one) should be used. If the length is zero, or if a length which is not all bits set to one is specified and, when added to the offset, exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL will result.

The "minlength" field specifies the minimum size of the overlap with the requested offset and length that is to be returned. If this requirement cannot be met, no layout must be returned; the error NFS4ERR_LAYOUTTRYLATER can be returned instead.

The "maxcount" field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.

As well, the metadata server may adjust the range of the returned layout segment based on striping patterns and usage implied by the iomode. The client must be prepared to get a layout that does not line up exactly with its request; there MUST be at least an overlap of "minlength" between the layout returned by the server and the client's request, or the server SHOULD reject the request. See Section 7.3 for more details.

The metadata server may also return a layout segment with an iomode other than that requested by the client. If it does so, it must ensure that the iomode is more permissive than the iomode requested. E.g., this allows an implementation to upgrade read-only requests to read/write requests at its discretion, within the limits of the layout type specific protocol. An iomode of either LAYOUTIOMODE_READ or LAYOUTIOMODE_RW must be returned.

The format of the returned layout is specific to the underlying file system. Layout types other than the NFSv4 file layout type should be specified outside of this document.

If layouts are not supported for the requested file or its containing file system, the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the metadata server should return NFS4ERR_UNKNOWN_LAYOUTTYPE.
If layouts are supported but no layout matches the client provided layout identification, the server should return NFS4ERR_BADLAYOUT. If an invalid iomode is specified, or an iomode of LAYOUTIOMODE_ANY is specified, the server should return NFS4ERR_BADIOMODE.

If the layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the server must return NFS4ERR_LAYOUTTRYLATER.

If the layout request is rejected due to an overlapping layout recall, the server must return NFS4ERR_RECALLCONFLICT. See Section 7.5.3 for details.

If the layout conflicts with a mandatory byte range lock held on the file, and if the storage devices have no method of enforcing mandatory locks other than through the restriction of layouts, the metadata server should return NFS4ERR_LOCKED.

On success, the current filehandle retains its value.

IMPLEMENTATION

Typically, LAYOUTGET will be called as part of a compound RPC after an OPEN operation and results in the client having location information for the file; a client may also hold a layout across multiple OPENs. The client specifies a layout type that limits what kind of layout the server will return. This prevents servers from issuing layouts that are unusable by the client.
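As an illustration of the conventions above (an all-ones length meaning "through EOF", minlength as a floor on the returned overlap, maxcount as a cap on the reply size, and a retry on the transient NFS4ERR_LAYOUTTRYLATER), a client request loop might look as follows. This is a non-normative C sketch; the stub, the parameter values, and the numeric error value are all invented placeholders.

   #include <stdint.h>
   #include <unistd.h>

   /* Placeholder status values; only the control flow is the point. */
   #define NFS4_OK                    0
   #define NFS4ERR_LAYOUTTRYLATER 10010   /* invented numeric value */

   #define LENGTH_THROUGH_EOF  UINT64_MAX /* all bits set to 1 */

   static int layoutget(uint64_t offset, uint64_t length,
                        uint64_t minlength, uint32_t maxcount)
   {
       /* A real implementation issues the COMPOUND here. */
       (void)offset; (void)length; (void)minlength; (void)maxcount;
       return NFS4_OK;
   }

   /* Request a layout covering the whole file, retrying on the
    * transient NFS4ERR_LAYOUTTRYLATER error. */
   static int get_whole_file_layout(void)
   {
       for (int tries = 0; tries < 5; tries++) {
           int status = layoutget(0, LENGTH_THROUGH_EOF,
                                  4096,    /* minimum usable overlap */
                                  65536);  /* max layout size we can parse */
           if (status != NFS4ERR_LAYOUTTRYLATER)
               return status;
           sleep(1);  /* back off before retrying */
       }
       return NFS4ERR_LAYOUTTRYLATER;
   }

   int main(void) { return get_whole_file_layout() == NFS4_OK ? 0 : 1; }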
ERRORS

NFS4ERR_BADLAYOUT
NFS4ERR_BADIOMODE
NFS4ERR_FHEXPIRED
NFS4ERR_INVAL
NFS4ERR_LAYOUTUNAVAILABLE
NFS4ERR_LAYOUTTRYLATER
NFS4ERR_LOCKED
NFS4ERR_NOFILEHANDLE
NFS4ERR_NOTSUPP
NFS4ERR_RECALLCONFLICT
NFS4ERR_STALE
NFS4ERR_STALE_CLIENTID
NFS4ERR_TOOSMALL
NFS4ERR_UNKNOWN_LAYOUTTYPE

14.15 LAYOUTCOMMIT - Commit writes made using a layout

SYNOPSIS

(cfh), clientid, offset, length, last_write_offset, time_modify, time_access, layoutupdate -> newsize

ARGUMENT

union newtime4 switch (bool timechanged) {
case TRUE:
        nfstime4        time;
case FALSE:
        void;
};

union newsize4 switch (bool sizechanged) {
case TRUE:
        length4         size;
case FALSE:
        void;
};

struct LAYOUTCOMMIT4args {
        /* CURRENT_FH: file */
        clientid4       clientid;
        offset4         offset;
        length4         length;
        length4         last_write_offset;
        newtime4        time_modify;
        newtime4        time_access;
        pnfs_layoutupdate4 layoutupdate;
};

RESULT

struct LAYOUTCOMMIT4resok {
        newsize4        newsize;
};

union LAYOUTCOMMIT4res switch (nfsstat4 status) {
case NFS4_OK:
        LAYOUTCOMMIT4resok resok4;
default:
        void;
};

DESCRIPTION

Commits changes in the layout segment represented by the current filehandle, clientid, and byte range. Since layouts are sub-dividable, a smaller portion of a layout, retrieved via LAYOUTGET, may be committed. The region being committed is specified through the byte range (length and offset). Note: the "layoutupdate" structure does not include the length and offset, as they are already specified in the arguments.

The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have written only a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end of file. The layout referenced by LAYOUTCOMMIT is still valid after the operation completes and can continue to be referenced by the clientid, filehandle, byte range, and layout type.

The "last_write_offset" field specifies the offset of the last byte written by the client previous to the LAYOUTCOMMIT. Note: this value is never equal to the file's size (at most it is one byte less than the file's size). The metadata server may use this information to determine whether the file's size needs to be updated. If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size as part of the results.

The "time_modify" and "time_access" fields allow the client to suggest times it would like the metadata server to set. The metadata server may use these time values or it may use the time of the LAYOUTCOMMIT operation to set these time values. If the metadata server uses the client provided times, it should sanity check the values (e.g., to ensure time does not flow backwards). If the client wants to force the metadata server to set an exact time, the client should use a SETATTR operation in a compound right after LAYOUTCOMMIT. See Section 7.4 for more details. If the client desires the resultant mtime or atime, it should issue a GETATTR following the LAYOUTCOMMIT, e.g., later in the same compound.

The "layoutupdate" argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout specific updates to the metadata server. For example, the layout update can describe what regions of the original layout have been used and what regions can be deallocated. There is no NFSv4 file layout specific layoutupdate structure.

The layout information is more verbose for block devices than for objects and files, because the latter hide the details of block allocation behind their storage protocols. At a minimum, the client needs to communicate changes to the end-of-file location back to the server and, if desired, its view of the file modify and access times. For block/volume layouts, it needs to specify precisely which blocks have been used.

If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. The layout being committed may also be rejected if it does not correspond to an existing layout with an iomode of RW.

On success, the current filehandle retains its value.
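Because last_write_offset names the last byte written rather than a byte count, the server's size check is sensitive to an off-by-one: the implied size is last_write_offset plus one. A non-normative one-function sketch:

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Non-normative sketch: decide whether LAYOUTCOMMIT's
    * last_write_offset implies a new file size.  last_write_offset is
    * the offset of the last byte written, so the implied size is that
    * offset plus one. */
   static bool maybe_grow(uint64_t last_write_offset, uint64_t *size)
   {
       uint64_t implied_size = last_write_offset + 1;
       if (implied_size > *size) {
           *size = implied_size;   /* reported to the client via newsize4 */
           return true;            /* sizechanged = TRUE */
       }
       return false;               /* sizechanged = FALSE */
   }

   int main(void)
   {
       uint64_t size = 8192;
       /* Writing the last byte of an 8 KiB file does not grow it:
        * last_write_offset 8191 implies size 8192, which is not larger. */
       bool changed = maybe_grow(8191, &size);
       printf("changed=%d size=%llu\n", changed, (unsigned long long)size);
       return 0;
   }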
5968 Layouts may be returned when recalled or voluntarily (i.e., before
5969 the server has recalled them).  In either case, the client must
5970 properly propagate any state changed under the context of the layout
5971 to storage or to the server before returning the layout.

5973 If a client fails to return a layout in a timely manner, then the
5974 metadata server should use its control protocol with the storage
5975 devices to fence the client from accessing the data referenced by the
5976 layout.  See Section 7.5 for more details.

5978 If the layout identified in the arguments does not exist, the error
5979 NFS4ERR_BADLAYOUT is returned.  If a layout exists, but the iomode
5980 does not match, NFS4ERR_BADIOMODE is returned.

5982 On success, the current filehandle retains its value.

5984 [OPEN ISSUE: Should LAYOUTRETURN be modified to handle FSID
5985 callbacks?]

5987 ERRORS

5989    NFS4ERR_BADLAYOUT
5990    NFS4ERR_BADIOMODE
5991    NFS4ERR_FHEXPIRED
5992    NFS4ERR_INVAL
5993    NFS4ERR_NOFILEHANDLE
5994    NFS4ERR_STALE
5995    NFS4ERR_STALE_CLIENTID
5996    NFS4ERR_UNKNOWN_LAYOUTTYPE

5998 14.17 GETDEVICEINFO - Get Device Information

6000 SYNOPSIS

6002    (cfh), device_id, layout_type, maxcount -> device_addr

6004 ARGUMENT

6006    struct GETDEVICEINFO4args {
6007       /* CURRENT_FH: file */
6008       pnfs_deviceid4     device_id;
6009       pnfs_layouttype4   layout_type;
6010       count4             maxcount;
6011    };

6013 RESULT

6015    struct GETDEVICEINFO4resok {
6016       pnfs_deviceaddr4   device_addr;
6017    };

6019    union GETDEVICEINFO4res switch (nfsstat4 status) {
6020    case NFS4_OK:
6021       GETDEVICEINFO4resok   resok4;
6022    default:
6023       void;
6024    };

6026 DESCRIPTION

6028 Returns device type and device address information for a specified
6029 device.  The returned device_addr includes a type that indicates how
6030 to interpret the addressing information for that device.  The current
6031 filehandle (cfh) is used to identify the file system; device IDs are
6032 unique per file system (FSID) and are qualified by the layout type.

6034 See Section 7.1.4 for more details on device ID assignment.

6036 If the size of the device address exceeds maxcount bytes, the
6037 metadata server will return the error NFS4ERR_TOOSMALL.  If an
6038 invalid device ID is given, the metadata server will respond with
6039 NFS4ERR_INVAL.
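Since a client generally cannot know the encoded size of a device
address in advance, a natural (non-normative) strategy is to retry
GETDEVICEINFO with a larger maxcount whenever NFS4ERR_TOOSMALL is
returned.  In the C sketch below, getdeviceinfo() is a hypothetical
stub standing in for a real COMPOUND round trip; only the retry logic
is the point.

   #include <stdint.h>
   #include <stdio.h>

   #define NFS4_OK          0       /* nfsstat4 values as in [2] */
   #define NFS4ERR_TOOSMALL 10005

   /* Hypothetical stub: pretend the encoded device address needs
    * 3000 bytes and fail with NFS4ERR_TOOSMALL until maxcount fits. */
   static int getdeviceinfo(uint32_t device_id, uint32_t maxcount)
   {
       const uint32_t addr_size = 3000;
       (void)device_id;
       return maxcount < addr_size ? NFS4ERR_TOOSMALL : NFS4_OK;
   }

   int main(void)
   {
       uint32_t maxcount = 1024;
       int status;

       /* Double the reply budget until the device address fits. */
       while ((status = getdeviceinfo(7, maxcount)) == NFS4ERR_TOOSMALL) {
           maxcount *= 2;
           printf("NFS4ERR_TOOSMALL, retrying with maxcount=%u\n",
                  maxcount);
       }
       printf("final status=%d, maxcount=%u\n", status, maxcount);
       return 0;
   }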
6041 ERRORS

6043    NFS4ERR_FHEXPIRED
6044    NFS4ERR_INVAL
6045    NFS4ERR_TOOSMALL
6046    NFS4ERR_UNKNOWN_LAYOUTTYPE

6048 14.18 GETDEVICELIST - Get List of Devices

6050 SYNOPSIS

6052    (cfh), layout_type, maxcount, cookie, cookieverf ->
6053    cookie, cookieverf, device_addrs<>

6055 ARGUMENT

6057    struct GETDEVICELIST4args {
6058       /* CURRENT_FH: file */
6059       pnfs_layouttype4   layout_type;
6060       count4             maxcount;
6061       nfs_cookie4        cookie;
6062       verifier4          cookieverf;
6063    };

6065 RESULT

6067    struct GETDEVICELIST4resok {
6068       nfs_cookie4          cookie;
6069       verifier4            cookieverf;
6070       pnfs_devlist_item4   device_addrs<>;
6071    };

6073    union GETDEVICELIST4res switch (nfsstat4 status) {
6074    case NFS4_OK:
6075       GETDEVICELIST4resok   resok4;
6076    default:
6077       void;
6078    };

6080 DESCRIPTION

6082 In some applications, especially SAN environments, it is convenient
6083 to find out about all the devices associated with a file system.
6084 This lets a client determine if it has access to these devices, e.g.,
6085 at mount time.

6087 This operation returns an array of items (pnfs_devlist_item4) that
6088 establish the association between the short pnfs_deviceid4 and the
6089 addressing information for that device, for a particular layout type.
6090 This operation may not be able to fetch all device information at
6091 once; thus it uses a cookie-based approach, similar to READDIR, to
6092 fetch additional device information (see [2], section 14.2.24).  As
6093 in GETDEVICEINFO, the current filehandle (cfh) is used to identify
6094 the file system.

6096 As in GETDEVICEINFO, maxcount specifies the maximum number of bytes
6097 to return.  If the metadata server is unable to return a single
6098 device address, it will return the error NFS4ERR_TOOSMALL.  If an
6099 invalid device ID is given, the metadata server will respond with
6100 NFS4ERR_INVAL.

6102 ERRORS

6104    NFS4ERR_BAD_COOKIE
6105    NFS4ERR_FHEXPIRED
6106    NFS4ERR_INVAL
6107    NFS4ERR_TOOSMALL
6108    NFS4ERR_UNKNOWN_LAYOUTTYPE

6110 14.19 CB_LAYOUTRECALL

6112 SYNOPSIS

6114    layout_type, iomode, layoutchanged, layoutrecall -> -

6116 ARGUMENT

6118    enum layoutrecall_type4 {
6119       RECALL_FILE = 1,
6120       RECALL_FSID = 2
6121    };

6123    struct layoutrecall_file4 {
6124       nfs_fh4   fh;
6125       offset4   offset;
6126       length4   length;
6127    };

6129    union layoutrecall4 switch (layoutrecall_type4 recalltype) {
6130    case RECALL_FILE:
6131       layoutrecall_file4   layout;
6132    case RECALL_FSID:
6133       fsid4                fsid;
6134    };

6136    struct CB_LAYOUTRECALLargs {
6137       pnfs_layouttype4     layout_type;
6138       pnfs_layoutiomode4   iomode;
6139       bool                 layoutchanged;
6140       layoutrecall4        layoutrecall;
6141    };

6143 RESULT

6145    struct CB_LAYOUTRECALLres {
6146       nfsstat4   status;
6147    };

6149 DESCRIPTION

6151 The CB_LAYOUTRECALL operation is used to begin the process of
6152 recalling a layout, a portion thereof, or all layouts pertaining to a
6153 particular file system (FSID).  If RECALL_FILE is specified, the
6154 offset and length fields specify the portion of the layout to be
6155 returned.  The iomode specifies the set of layouts to be returned.
6156 An iomode of ANY specifies that all matching layouts, regardless of
6157 iomode, must be returned; otherwise, only layouts that exactly match
6158 the iomode must be returned.

6160 If the "layoutchanged" field is TRUE, then the client SHOULD NOT
6161 flush its dirty data to the devices specified by the layout being
6162 recalled.  Instead, it is preferable for the client to flush the
6163 dirty data through the metadata server.  Alternatively, the client
6164 may attempt to obtain a new layout.  Note: in order to obtain a new
6165 layout, the client must first return the old layout.  Since obtaining
6166 a new layout is not guaranteed to succeed, the client must be ready
6167 to flush its dirty data through the metadata server.
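The client's choices in the preceding paragraph reduce to a small
decision tree, sketched non-normatively in C below.  The helpers
flush_to_storage(), flush_through_mds(), layoutcommit(),
layoutreturn(), and try_layoutget() are hypothetical stand-ins for
the corresponding protocol actions, not real APIs.

   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical helpers standing in for protocol actions. */
   static void flush_to_storage(void)  { puts("flush dirty data via layout"); }
   static void flush_through_mds(void) { puts("flush via NFSv4 WRITEs to MDS"); }
   static void layoutcommit(void)      { puts("LAYOUTCOMMIT"); }
   static void layoutreturn(void)      { puts("LAYOUTRETURN"); }
   static bool try_layoutget(void)     { return false; /* pretend refused */ }

   static void handle_recall(bool layoutchanged)
   {
       if (!layoutchanged) {
           /* Layout contents still valid: dirty data may go straight
            * to the storage devices before the layout is returned. */
           flush_to_storage();
           layoutcommit();
           layoutreturn();
           return;
       }
       /* layoutchanged == TRUE: the client SHOULD NOT flush to the
        * recalled devices.  Return the old layout first, then try to
        * obtain a new one; fall back to the metadata server. */
       layoutreturn();
       if (try_layoutget()) {
           flush_to_storage();
           layoutcommit();
           layoutreturn();
       } else {
           flush_through_mds();
       }
   }

   int main(void)
   {
       handle_recall(true);
       return 0;
   }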
6169 If RECALL_FSID is specified, the fsid specifies the file system for
6170 which any outstanding layouts must be returned.  Layouts are returned
6171 through the LAYOUTRETURN operation.

6173 If the client does not hold any layout segment either matching or
6174 overlapping with the requested layout, it returns
6175 NFS4ERR_NOMATCHING_LAYOUT.  If a length of all 1s is specified, then
6176 the layout corresponding to the byte range from "offset" to the end-
6177 of-file MUST be returned.

6179 IMPLEMENTATION

6181 The client should reply to the callback immediately.  Replying does
6182 not complete the recall except when an error is returned.  The recall
6183 is not complete until the layout(s) are returned using a
6184 LAYOUTRETURN.

6186 The client should complete any in-flight I/O operations using the
6187 recalled layout(s) before returning them via LAYOUTRETURN.  If the
6188 client has buffered dirty data, there are a number of options for
6189 flushing that data.  If "layoutchanged" is false, the client may
6190 choose to write dirty data directly to storage before calling
6191 LAYOUTRETURN.  However, if "layoutchanged" is true, the client may
6192 either choose to write it later using normal NFSv4 WRITE operations
6193 to the metadata server or it may attempt to obtain a new layout,
6194 after first returning the recalled layout, using the new layout to
6195 flush the dirty data.  Regardless of whether the client is holding a
6196 layout, it may always write data through the metadata server.

6198 If dirty data is flushed while the layout is held, the client must
6199 still issue LAYOUTCOMMIT operations at the appropriate time,
6200 especially before issuing the LAYOUTRETURN.  If a large amount of
6201 dirty data is outstanding, the client may issue LAYOUTRETURNs for
6202 portions of the layout being recalled; this allows the server to
6203 monitor the client's progress and adherence to the callback.
6204 However, the last LAYOUTRETURN in a sequence of returns SHOULD
6205 specify the full range being recalled (see Section 7.5.2 for
6206 details).

6208 ERRORS

6210    NFS4ERR_NOMATCHING_LAYOUT

6212 14.20 CB_SIZECHANGED

6214 SYNOPSIS

6216    fh, size -> -

6218 ARGUMENT

6220    struct CB_SIZECHANGEDargs {
6221       nfs_fh4   fh;
6222       length4   size;
6223    };

6225 RESULT

6227    struct CB_SIZECHANGEDres {
6228       nfsstat4   status;
6229    };

6231 DESCRIPTION

6233 The CB_SIZECHANGED operation is used to notify the client that the
6234 size pertaining to the filehandle associated with "fh" has changed.
6235 The new size is specified.  Upon reception of this notification
6236 callback, the client should update its internal size for the file.
6237 If the layout being held for the file is of the NFSv4 file layout
6238 type, then the size field within that layout should be updated (see
6239 Section 9.5).  For other layout types, see Section 7.4.2 for more
6240 details.

6242 If the handle specified is not one for which the client holds a
6243 layout, an NFS4ERR_BADHANDLE error is returned.

6245 ERRORS

6247    NFS4ERR_BADHANDLE
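To make the callback's effect concrete, the following non-normative C
sketch updates a client's cached file size and, when an NFSv4 file
layout is held, the size field inside that layout.  lookup_by_fh()
and the structures are illustrative only, not part of the protocol.

   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   #define NFS4_OK           0       /* nfsstat4 values as in [2] */
   #define NFS4ERR_BADHANDLE 10001

   struct file_layout { uint64_t size; };            /* NFSv4 file layout */
   struct cached_file { uint64_t size; struct file_layout *layout; };

   static struct file_layout the_layout = { 0 };
   static struct cached_file the_file   = { 0, &the_layout };

   /* Hypothetical: map a filehandle to a cached file for which the
    * client holds a layout; NULL means no layout is held. */
   static struct cached_file *lookup_by_fh(const char *fh)
   {
       (void)fh;                      /* toy single-file cache */
       return &the_file;
   }

   static int cb_sizechanged(const char *fh, uint64_t new_size)
   {
       struct cached_file *f = lookup_by_fh(fh);
       if (f == NULL)
           return NFS4ERR_BADHANDLE;  /* no layout held for this handle */
       f->size = new_size;            /* client's internal size */
       if (f->layout != NULL)
           f->layout->size = new_size; /* size field inside the layout */
       return NFS4_OK;
   }

   int main(void)
   {
       cb_sizechanged("fh0", 123456);
       printf("cached size now %llu\n",
              (unsigned long long)the_file.size);
       return 0;
   }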
6249 15. References

6250 15.1 Normative References

6252    [1]  Bradner, S., "Key words for use in RFCs to Indicate
6253         Requirement Levels", BCP 14, RFC 2119, March 1997.

6255    [2]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
6256         C., Eisler, M., and D. Noveck, "Network File System (NFS)
6257         version 4 Protocol", RFC 3530, April 2003.

6259 15.2 Informative References

6261    [3]  Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
6262         Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
6263         RFC 3720, April 2004.

6265    [4]  Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version
6266         (FCP-2)", ANSI/INCITS 350-2003, Oct 2003.

6268    [5]  Weber, R., "Object-Based Storage Device Commands (OSD)",
6269         ANSI/INCITS 400-2004, July 2004.

6272    [6]  Black, D., "pNFS Block/Volume Layout", July 2005.

6275    [7]  Zelenka, J., Welch, B., and B. Halevy, "Object-based pNFS
6276         Operations", draft-zelenka-pnfs-obj-01 (work in progress),
              July 2005.

6279 Author's Address

6281    Spencer Shepler
6282    Sun Microsystems, Inc.
6283    7808 Moonflower Drive
6284    Austin, TX 78750
6285    USA

6287    Phone: +1-512-349-9376
6288    Email: spencer.shepler@sun.com

6290 Appendix A.  Acknowledgments

6292 The initial drafts for the SECINFO extensions were edited by Mike
6293 Eisler with contributions from Tom Talpey, Saadia Khan, and Jon
6294 Bauman.

6296 The initial drafts for the SESSIONS extensions were edited by Tom
6297 Talpey, Spencer Shepler, and Jon Bauman, with contributions from Charles
6298 Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak,
6299 Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk, and Mark
6300 Wittle.

6302 The initial drafts for the Directory Delegations support were
6303 contributed by Saadia Khan with input from Dave Noveck, Mike Eisler,
6304 Carl Burnett, Ted Anderson, and Tom Talpey.

6306 The initial drafts for the parallel NFS support were edited by Brent
6307 Welch and Garth Goodson.  Additional authors for those documents were
6308 Benny Halevy, David Black, and Andy Adamson.  Additional input came
6309 from the informal group which contributed to the construction of the
6310 initial pNFS drafts; specific acknowledgement goes to Gary Grider,
6311 Peter Corbett, Dave Noveck, and Peter Honeyman.  The pNFS work was
6312 inspired by the NASD and OSD work done by Garth Gibson.  Gary Grider
6313 of the national labs (LANL) has also been a champion of high-
6314 performance parallel I/O.

6316 Intellectual Property Statement

6318 The IETF takes no position regarding the validity or scope of any
6319 Intellectual Property Rights or other rights that might be claimed to
6320 pertain to the implementation or use of the technology described in
6321 this document or the extent to which any license under such rights
6322 might or might not be available; nor does it represent that it has
6323 made any independent effort to identify any such rights.  Information
6324 on the procedures with respect to rights in RFC documents can be
6325 found in BCP 78 and BCP 79.

6327 Copies of IPR disclosures made to the IETF Secretariat and any
6328 assurances of licenses to be made available, or the result of an
6329 attempt made to obtain a general license or permission for the use of
6330 such proprietary rights by implementers or users of this
6331 specification can be obtained from the IETF on-line IPR repository at
6332 http://www.ietf.org/ipr.

6334 The IETF invites any interested party to bring to its attention any
6335 copyrights, patents or patent applications, or other proprietary
6336 rights that may cover technology that may be required to implement
6337 this standard.  Please address the information to the IETF at
6338 ietf-ipr@ietf.org.
6340 Disclaimer of Validity

6342 This document and the information contained herein are provided on an
6343 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
6344 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
6345 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
6346 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
6347 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
6348 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

6350 Copyright Statement

6352 Copyright (C) The Internet Society (2005).  This document is subject
6353 to the rights, licenses and restrictions contained in BCP 78, and
6354 except as set forth therein, the authors retain all their rights.

6356 Acknowledgment

6358 Funding for the RFC Editor function is currently provided by the
6359 Internet Society.