Internet-Draft                                               Tom Talpey
Expires: August 2004                            Network Appliance, Inc.
                                                        Spencer Shepler
                                                  Sun Microsystems, Inc.

                                                          February, 2004

                   NFSv4 RDMA and Session Extensions
                    draft-talpey-nfsv4-rdma-sess-01

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   Extensions are proposed to NFS version 4 which enable it to support
   sessions, connection management, and operation atop either TCP or
   RDMA-capable RPC.  These extensions enable universal support for
   exactly-once semantics by NFSv4 servers, enhanced security, and
   multipathing and trunking of transport connections.  These
   extensions provide identical benefits over both TCP and RDMA
   connection types.

Table of Contents

   1.  Introduction
   1.1.  Motivation
   1.2.  Problem Statement
   1.3.  NFSv4 Session Extension Characteristics
   2.  Transport Issues
   2.1.  Session Model
   2.1.1.  Connection State
   2.1.2.  Channels
   2.1.3.  Reconnection, Trunking, Failover
   2.1.4.  Server Duplicate Request Cache
   2.2.  RDMA
   2.2.1.  RDMA Requirements
   2.2.2.  RDMA Negotiation
   2.2.3.  Connection Resources
   2.2.4.  Inline Transfer Model
   2.2.5.  Direct Transfer Model
   2.3.  Connection Models
   2.3.1.  TCP Connection Model
   2.3.2.  Negotiated RDMA Connection Model
   2.3.3.  Automatic RDMA Connection Model
   2.4.  Buffer Management, Transfer, Flow Control
   2.5.  Retry and Replay
   2.6.  The Back Channel
   2.7.  COMPOUND Sizing Issues
   2.8.  Data Alignment
   3.  NFSv4 Integration
   3.1.  Minor Versioning
   3.2.  Stream Identifiers and Exactly-Once Semantics
   3.3.  COMPOUND and CB_COMPOUND
   3.4.  eXternal Data Representation Efficiency
   3.5.  Effect of Sessions on Existing Operations
   3.6.  Authentication Efficiencies
   4.  Security Considerations
   5.  IANA Considerations
   6.  NFSv4 Protocol Extensions
   6.1.  SESSION_CREATE
   6.2.  SESSION_BIND
   6.3.  SESSION_DESTROY
   6.4.  OPERATION_CONTROL
   6.5.  CB_CREDITRECALL
   7.  Acknowledgements
       References
       Authors' Addresses
       Full Copyright Statement

1.  Introduction

   This draft proposes extensions to NFS version 4 enabling it to
   support sessions and connection management, and to support
   operation atop RDMA-capable RPC over transports such as iWARP
   [RDMAP, DDP].  These extensions enable universal support for
   exactly-once semantics by NFSv4 servers, multipathing and trunking
   of transport connections, and enhanced security.  The ability to
   operate over RDMA enables greatly enhanced performance.  Operation
   over existing TCP is enhanced as well.

   While discussed here with respect to IETF-chartered transports, the
   proposed protocol is intended to function over other standards,
   such as Infiniband [IB].

   The following are the major aspects of this proposal:

   o  Changes are proposed within the framework of NFSv4 minor
      versioning.  RPC, XDR, and the NFSv4 procedures and operations
      are preserved.  The proposed minor version functions equally
      well over existing transports and RDMA, and interoperates
      transparently with existing implementations, both at the local
      programmatic interface and over the wire.

   o  An explicit session is introduced to NFSv4, and four new
      operations are added to support it.  The session allows for
      enhanced trunking, failover and recovery, and authentication
      efficiency, along with necessary support for RDMA.
      The session is implemented as operations within NFSv4 COMPOUND
      and does not impact layering or interoperability with existing
      NFSv4 implementations.  The NFSv4 callback channel is associated
      with a session, and is connected by the client and not the
      server, enhancing security and operation through firewalls.  In
      fact, the callback channel will be enabled to share the same
      connection as the operations channel.

   o  An enhanced RPC layer enables NFSv4 operation atop RDMA.  The
      session is RDMA-aware, and additional facilities are provided
      for managing RDMA resources at both NFSv4 server and client.
      Existing NFSv4 operations continue to function as before, though
      certain size limits are negotiated.  A companion draft to this
      document, "RDMA Transport for ONC RPC" [RPCRDMA], is to be
      referenced for details of RPC RDMA support.

   o  Support for exactly-once semantics ("EOS") is enabled by the new
      session facilities, providing to the server a way to bound the
      size of the duplicate request cache for a single client, and to
      manage its persistent storage.

   Block Diagram

   +-------------------+------------------------------------+
   |       NFSv4       |        NFSv4 + extensions          |
   +-------------------+-----+----------------+-------------+
   |      Operations         |    Session     |             |
   +-------------------------+----------------+             |
   |             RPC/XDR              |       |             |
   +---------------------------------+--------+             |
   |        Stream Transport         |    RDMA Transport    |
   +---------------------------------+----------------------+

1.1.  Motivation

   NFS version 4 [RFC3530] has been granted "Proposed Standard"
   status.
   The NFSv4 protocol was developed along several design points,
   important among them: effective operation over wide-area networks,
   including the Internet itself; strong security integrated into the
   protocol; extensive cross-platform interoperability including
   integrated locking semantics compatible with multiple operating
   systems; and protocol extensibility.

   The NFS version 4 protocol, however, does not provide support for
   certain important transport aspects.  For example, the protocol
   does not provide a way to implement exactly-once semantics for
   clients, nor an interoperable way to support trunking and
   multipathing of connections.  This leads to inefficiencies,
   especially where trunking and multipathing are concerned, and
   presents additional difficulties in supporting RDMA fabrics, in
   which endpoints may require dedicated or specialized resources.

   Sessions can be employed to unify NFS-level constructs such as the
   clientid with transport-level constructs such as transport
   endpoints.  The transport endpoint is abstracted to be a member of
   the session.  Resource management can be more strictly maintained,
   leading to greater server efficiency in implementing the protocol.
   The enhanced operation over a session affords an opportunity to the
   server to implement highly reliable and exactly-once semantics.

   NFSv4 advances the state of high-performance local sharing, by
   virtue of its integrated security, locking, and delegation, and its
   excellent coverage of the sharing semantics of multiple operating
   systems.  It is precisely this environment where exactly-once
   semantics become a fundamental requirement.

   Additionally, efforts to standardize a set of protocols for Remote
   Direct Memory Access (RDMA) over the Internet Protocol Suite have
   made significant progress.
   RDMA is a general solution to the problem of CPU overhead incurred
   due to data copies, primarily at the receiver.  Substantial
   research has addressed this and has borne out the efficacy of the
   approach.  An overview of this is the RDDP Problem Statement
   document [RDDPPS].

   Numerous upper layer protocols achieve extremely high bandwidth and
   low overhead through the use of RDMA.  Products from a wide variety
   of vendors employ RDMA to advantage, and prototypes have
   demonstrated the effectiveness of many more.  Here, we are
   concerned specifically with NFS and NFS-style upper layer
   protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu
   Prime Software Technologies [FJNFS, FJDAFS] and Harvard University
   [KM02] are all relevant.

   By layering a session binding for NFS version 4 directly atop a
   standard RDMA transport, a greatly enhanced level of performance
   and transparency can be supported on a wide variety of operating
   system platforms.  These combined capabilities alter the landscape
   between local filesystems and network attached storage, enable a
   new level of performance, and lead new classes of applications to
   take advantage of NFS.

1.2.  Problem Statement

   Two issues drive the current proposal: correctness and performance.
   Both are instances of "raising the bar" for NFS, whereby the desire
   to use NFS in new classes of applications can be accommodated by
   providing the basic features to make such use feasible.  Such
   applications include tightly coupled sharing environments such as
   cluster computing, high performance computing (HPC) and information
   processing such as databases.  These trends are explored in depth
   in [NFSPS].

   The first issue, correctness, is exemplified by an attribute of
   local filesystems: support for exactly-once semantics.  Such
   semantics have not been reliably available with NFS.
   Server-based duplicate request caches [CJ89] help, but do not
   reliably provide strict correctness.  For the type of application
   which is expected to make extensive use of the high-performance
   RDMA-enabled environment, the reliable provision of such semantics
   is a fundamental requirement.

   Introduction of a session to NFSv4 will address these issues.  With
   higher performance and enhanced semantics comes the problem of
   enabling advanced endpoint management, for example high-speed
   trunking, multipathing and failover.  These characteristics enable
   availability and performance.  RFC3530 presents some issues in
   permitting a single clientid to access a server over multiple
   connections.

   A second issue encountered in common by NFS implementations is the
   CPU overhead required to implement the protocol.  Primary among the
   sources of this overhead is the movement of data from NFS protocol
   messages to its eventual destination in user buffers or aligned
   kernel buffers.  These data copies consume system bus bandwidth and
   CPU time, reducing the available system capacity for applications
   [RDDPPS].  Achieving zero-copy with NFS has to date required
   sophisticated, "header cracking" hardware and/or extensive
   platform-specific virtual memory mapping tricks.

   Combined in this way, NFSv4, RDMA and the emerging high-speed
   network fabrics will enable delivery of performance which matches
   that of the fastest local filesystems, while preserving the key
   existing local filesystem semantics.

   RDMA implementations generally have other interesting properties,
   such as hardware assisted protocol access, and support for user
   space access to I/O.  RDMA is compelling here for another reason:
   hardware offloaded networking support in itself does not avoid data
   copies without resorting to implementing part of the NFS protocol
   in the NIC.
   Support of RDMA by NFS enables the highest performance at the
   architecture level rather than by implementation; this enables
   ubiquitous and interoperable solutions.

   By providing file access performance equivalent to that of local
   file systems, NFSv4 over RDMA will enable applications running on a
   set of client machines to interact through an NFSv4 file system,
   just as applications running on a single machine might interact
   through a local file system.

   This raises the issue of whether additional protocol enhancements
   to enable such interaction would be desirable and what such
   enhancements would be.  This is a complicated issue which the
   working group needs to address and will not be further discussed in
   this document.

1.3.  NFSv4 Session Extension Characteristics

   This draft will present a solution based upon minor versioning of
   NFSv4.  It will introduce a session to collect transport issues
   together, which in turn enables enhancements such as trunking,
   failover and recovery.  It will describe use of RDMA by employing
   support within an underlying RPC layer [RPCRDMA].  Most
   importantly, it will focus on making the best possible use of an
   RDMA transport.

   These extensions are proposed as elements of a new minor revision
   of NFS version 4.  In this draft, NFS version 4 will be referred to
   generically as "NFSv4" when describing properties common to all
   minor versions.  When referring specifically to properties of the
   original, minor version 0 protocol, "NFSv4.0" will be used, and
   changes proposed here for minor version 1 will be referred to as
   "NFSv4.1".

   This draft proposes only changes which are strictly upward-
   compatible with existing RPC and NFS Application Programming
   Interfaces (APIs).

2.  Transport Issues

   The Transport Issues section of the document explores the details
   of utilizing the various supported transports.

2.1.  Session Model

   The first and most evident issue in supporting diverse transports
   is how to provide for their differences.  This draft proposes
   introducing an explicit session.

   An initialized session will be required before processing requests
   contained within COMPOUND and CB_COMPOUND procedures of NFSv4.1.  A
   session introduces minimal protocol requirements, and provides for
   a highly useful and convenient way to manage numerous endpoint-
   related issues.  The session is a local construct; it represents a
   named, higher-layer object to which connections can refer, and
   encapsulates properties important to each transport layer endpoint.

   A session is a dynamically created, persistent object created by a
   client and used over time from one or more transport connections.
   Its function is to maintain the server's state relative to any
   single client instance.  This state is entirely independent of the
   connection itself.  The session in effect becomes the "top-level"
   object representing an active client.

   The session enables several things immediately.  Clients may
   disconnect and reconnect (voluntarily or not) without loss of
   context at the server.  (Of course, locks, delegations and related
   associations require special handling, and generally expire without
   an open connection.)  Clients may connect multiple transport
   endpoints to this common state.  The endpoints may all have the
   same attributes, for instance when trunked on multiple physical
   network links for bandwidth aggregation or path failover.  Or, the
   endpoints can have specific, special purpose attributes such as
   channels for callbacks.

   The NFSv4 specification does not provide for any form of flow
   control; instead it relies on the windowing provided by TCP to
   throttle requests.
   This unfortunately does not work with RDMA, which in general
   provides no operation flow control and will terminate a connection
   in error when limits are exceeded.  Flow control limits are
   therefore exchanged when a connection is bound to a session; they
   are then managed within these limits as described in [RPCRDMA].
   The bound state of a connection will be described in this document
   as a "channel".

   The presence of deterministic flow control on the channels
   belonging to a given session bounds the requirements of the
   duplicate request cache.  This can be used to advantage by a
   server, which can accurately determine any storage needs, enabling
   it to maintain persistence and to provide reliable exactly-once
   semantics.

   Finally, given adequate connection-oriented transport security
   semantics, authentication and authorization may be cached on a per-
   session basis, enabling greater efficiency in the issuing and
   processing of requests on both client and server.  A proposal for
   transparent, server-driven implementation of this in NFSv4 has been
   made [CCM].  The existence of the session greatly adds to the
   convenience of this approach.  This is discussed in detail in the
   Authentication Efficiencies section later in this draft.

2.1.1.  Connection State

   In RFC3530, the combination of a connected transport endpoint and a
   clientid forms the basis of connection state.  While provably
   workable, there are difficulties in correct and robust
   implementation.  The NFSv4.0 protocol must provide a clientid
   negotiation (SETCLIENTID and SETCLIENTID_CONFIRM), must provide a
   server-initiated connection for the callback channel, and must
   carefully specify the persistence of client state at the server in
   the face of transport interruptions.  In effect, each transport
   connection is used as the server's representation of client state.
   But transport connections are potentially fragile and transitory.

   In this proposal, a session identifier is assigned by the server
   upon initial session negotiation on each connection.  This
   identifier is used to associate additional connections, to
   renegotiate after a reconnect, and to provide an abstraction for
   the various session properties.  The session identifier is unique
   within the server's scope and may be subject to certain server
   policies, such as being bounded in time.  A channel identifier is
   issued for each new connection as it binds to the session.  The
   channel identifier is unique within the session, and may be unique
   within a wider scope, at the server's choosing.

   It is envisioned that the primary transport model will be
   connection oriented.  Connection orientation brings with it certain
   potential optimizations, such as caching of per-connection
   properties, which are easily leveraged through the generality of
   the session.  However, it is possible that in the future other
   transport models could be accommodated below the session and
   channel abstractions.

2.1.2.  Channels

   As mentioned above, different NFSv4 operations can lead to
   different resource needs.  For example, server callback operations
   (CB_RECALL) are specific, small messages which flow from server to
   client at arbitrary times, while data transfers such as read and
   write have very different sizes and asymmetric behaviors.  It is
   impractical for the RDMA peers (NFSv4 client and NFSv4 server) to
   post buffers for these various operations on a single connection.
   Commingling of requests with responses at the client receive queue
   is particularly troublesome, due to the need both to manage
   solicited and unsolicited completions, and to provision buffers for
   both purposes.
   Due to the lack of any ordering of callback requests versus
   response arrivals, without any other mechanisms the client would be
   forced to allocate all buffers sized to the worst case.

   The callback requests are likely to be handled by a different task
   context from that handling the responses.  Significant
   demultiplexing and thread management may be required if both are
   received on the same queue.

   If the client explicitly binds each new connection to an existing
   session, multiple connections may be conveniently used to separate
   traffic by channel identifier within a session.  For example, reads
   and writes may be assigned to specific, optimized channels, or
   sorted and separated by any or all of size, idempotency, etc.

   To address the problems described above, this proposal defines a
   "channel" that is created by the act of binding a connection to a
   session for a specific purpose.  A new connection may be created
   for each channel, or a single connection may be bound to more than
   one channel.  There are at least two types of channels: the
   "operations" channel used for ordinary requests from client to
   server, and the "back" channel, used for callback requests from
   server to client.  The protocol does not permit binding multiple
   duplicate operations channels to a single connection.  There is no
   benefit in doing so; supporting this would require increased
   complexity in the server duplicate request cache.

   Single Connection model:

              NFSv4.1 client instance
                        |
                     Session
                    /       \
      Operations_Channel   [Back_Channel]
                    \       /
                   Connection
                        |

   Multi-connection model (2 operations channels shown):

              NFSv4.1 client instance
                        |
                     Session
                    /       \
       Operations_Channels  [Back_Channel]
          |           |           |
      Connection  Connection  [Connection]
          |           |           |

   In this way, implementation as well as resource management may be
   optimized.  Each channel (operations, back) will have its own
   credits and buffering.  Clients which do not require certain
   behaviors may optimize such resources away completely, by not even
   creating the channels.

2.1.3.  Reconnection, Trunking, Failover

   Reconnection after failure references potentially stored state on
   the server associated with lease recovery during the grace period.
   The session provides a convenient handle for storing and managing
   information regarding the client's previous state on a per-
   connection basis, e.g. to be used upon reconnection.  Reconnection
   and rebinding to a previously existing session, and its stored
   resources, are covered in the "Connection Models" section below.

   For Reliability, Availability and Serviceability (RAS) issues such
   as bandwidth aggregation and multipathing, clients frequently seek
   to make multiple connections through multiple logical or physical
   channels.  The session is a convenient point to aggregate and
   manage these resources.

2.1.4.  Server Duplicate Request Cache

   Server duplicate request caches, while not a part of an NFS
   protocol, have become a standard, even required, part of any NFS
   implementation.  First described in [CJ89], the duplicate request
   cache was initially found to reduce work at the server by avoiding
   duplicate processing for retransmitted requests.  A second, and in
   the long run more important, benefit was improved correctness, as
   the cache prevented certain destructive non-idempotent requests
   from being reinvoked.

   However, such caches do not provide correctness guarantees; they
   cannot be managed in a reliable, persistent fashion.  The reason is
   understandable: their storage requirement is unbounded due to the
   lack of any such bound in the NFS protocol.
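   The credit-based bound on the duplicate request cache, introduced
   in the Session Model section above, can be illustrated with a short
   sketch.  This is not normative and every name in it (BoundedDRC,
   slot, seq, and the dictionary-based cache) is an illustrative
   assumption rather than protocol; it shows only how one cache slot
   per credit, plus a maximum response size, yields a fixed worst-case
   storage requirement while preserving exactly-once behavior for
   retransmitted requests.

```python
# Sketch (not normative): a duplicate request cache whose size is
# bounded by the session's negotiated credit count and maximum
# response size.  All names here are illustrative only.

class BoundedDRC:
    def __init__(self, max_credits, max_response_size):
        self.max_credits = max_credits
        self.max_response_size = max_response_size
        # One cache slot per credit: slot -> (sequence, cached reply)
        self.slots = {i: (0, None) for i in range(max_credits)}

    def storage_bound(self):
        # Worst-case persistent storage the server must reserve:
        # credit count times maximum response size.
        return self.max_credits * self.max_response_size

    def process(self, slot, seq, execute):
        if slot >= self.max_credits:
            raise ValueError("request exceeds credit limit")
        cached_seq, cached_reply = self.slots[slot]
        if seq == cached_seq and cached_reply is not None:
            # Retransmission: replay the cached reply, never
            # re-executing the (possibly non-idempotent) request.
            return cached_reply
        if seq == cached_seq + 1:
            reply = execute()            # new request: execute once
            assert len(reply) <= self.max_response_size
            self.slots[slot] = (seq, reply)
            return reply
        raise ValueError("misordered request on slot")

drc = BoundedDRC(max_credits=8, max_response_size=1024)
first = drc.process(slot=0, seq=1, execute=lambda: b"reply-1")
replay = drc.process(slot=0, seq=1, execute=lambda: b"DUPLICATE!")
assert replay == first               # duplicate was not re-executed
assert drc.storage_bound() == 8 * 1024
```

   Because the credit limit caps the number of in-progress requests,
   the server can size (and, if desired, persist) this storage up
   front, which is precisely what an unbounded traditional cache
   cannot do.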
495 As proposed in this draft, the presence of message flow control 496 credits and negotiated maximum sizes allows the size and duration 497 of the cache to be bounded, and coupled with a persistent session 498 identifier, enables its persistent storage on a per-session basis. 500 This provides a single unified mechanism which provides the 501 following guarantees required in the NFSv4 specification, while 502 extending them to all requests, rather than limiting them only to a 503 subset of state-related requests: 505 "It is critical the server maintain the last response sent to 506 the client to provide a more reliable cache of duplicate non- 507 idempotent requests than that of the traditional cache 508 described in [CJ89]..." [RFC3530] 510 The credit limit is the count of active operations, which bounds 511 the number of entries in the cache. Constraining the size of 512 operations additionally serves to limit the required storage to the 513 product of the current credit count and the maximum response size. 514 This storage requirement enables server-side efficiencies. 516 Session negotiation allows the server to maintain other state. An 517 NFSv4.1 client invoking the session destroy operation will cause 518 the server to denegotiate (close) the session, allowing the server 519 to deallocate cache entries. Clients can potentially specify that 520 such caches not be kept for appropriate types of sessions (for 521 example, read-only sessions). This can enable more efficient 522 server operation resulting in improved response times. 524 Similarly, it is important for the client to explicitly learn 525 whether the server is able to implement these semantics. Knowledge 526 of whether exactly-once semantics are in force is critical for a 527 highly reliable client, one which must provide transactional 528 integrity guarantees. 
When clients request that the semantics be 529 enabled for a given session, the session reply must inform the 530 client if the mode is in fact enabled. In this way the client can 531 confidently proceed with operations without having to implement 532 consistency facilities of its own. 534 2.2. RDMA 536 2.2.1. RDMA Requirements 538 A complete discussion of the operation of RPC-based protocols atop 539 RDMA transports is in [RPCRDMA], and a general discussion of NFS 540 RDMA requirements is in [RDMAREQ]. Where RDMA is considered, this 541 proposal assumes the use of such a layering; it addresses only the 542 upper layer issues relevant to making best use of RPC/RDMA. 544 A connection oriented (reliable sequenced) RDMA transport will be 545 required. There are several reasons for this. First, this model 546 most closely reflects the general NFSv4 requirement of long-lived 547 and congestion-controlled transports. Second, to operate correctly 548 over either an unreliable or unsequenced RDMA transport, or both, 549 would require significant complexity in the implementation and 550 protocol not appropriate for a strict minor version. For example, 551 retransmission on connected endpoints is explicitly disallowed in 552 the current NFSv4 draft; it would again be required with these 553 alternate transport characteristics. Third, the proposal assumes a 554 specific RDMA ordering semantic, which presents the same set of 555 ordering and reliability issues to the RDMA layer over such 556 transports. 558 The RDMA implementation provides for making connections to other 559 RDMA-capable peers. In the case of the current proposals before 560 the RDDP working group, these RDMA connections are preceded by a 561 "streaming" phase, where ordinary TCP (or NFS) traffic might flow. 562 However, this is not assumed here and sizes and other parameters 563 are explicitly exchanges upon entering RDMA mode in all cases. 565 2.2.2. 
RDMA Negotiation 567 It is proposed that session negotiation be the method to enable 568 RDMA mode on an NFSv4 connection. 570 On transport endpoints which support automatic RDMA mode, that is, 571 endpoints which are created in the RDMA-enabled state, a single, 572 preposted buffer must initially be provided by both peers, and the 573 client session negotiation must be the first exchange. 575 On transport endpoints supporting dynamic negotiation, a more 576 sophisticated negotiation is possible. Clients may connect to the 577 server in traditional NFSv4 mode and enter RDMA mode only after a 578 successful NFSv4.1 channel binding negotiation returning the RDMA 579 capability. If RDMA capability is not indicated, the negotiation 580 still completes and the benefits of the session are available on 581 the existing TCP stream connection. 583 Some of the parameters to be exchanged at session binding time are 584 as follows. 586 Maximum Credits 587 The client's desired maximum credits (number of concurrent 588 requests) is passed, in order to allow the server to size its 589 reply cache storage. The server may modify the client's 590 requested limit downward (or upward) to match its local policy 591 and/or resources. The maximum credits available on a single 592 bound channel may also be limited by the maximum credits for 593 the session. Over RDMA-capable RPC transports, the per- 594 request management of message credits is handled within the 595 RPC layer. [RPCRDMA] 597 Maximum Request/Response Sizes 598 The maximum request and response sizes are exchanged in order 599 to permit allocation of appropriately sized buffers and 600 request cache entries. The size must allow for certain 601 protocol minima, allowing the receipt of maximally sized 602 operations (e.g. RENAME requests which contain two name 603 strings). The server may reduce the client's requested sizes.
605 RDMA Read Resources 606 RDMA implementations must explicitly provision resources to 607 support RDMA Read requests from connected peers. These values 608 must be explicitly specified, to provide adequate resources 609 for matching the peer's expected needs and the connection's 610 delay-bandwidth parameters. The values are asymmetric and 611 should be set to zero at the server in order to conserve RDMA 612 resources, since clients do not issue RDMA Read operations in 613 this proposal. The result is communicated in the session 614 response, to permit matching of values across the connection. 615 The value may not be changed in the duration of the 616 connection, although a new value may be requested as part of a 617 reconnection. 619 Inline Padding/Alignment 620 The server can inform the client of any padding which can be 621 used to deliver NFSv4 inline WRITE payloads into aligned 622 buffers. Such alignment can be used to avoid data copy 623 operations at the server, even when direct RDMA is not used. 624 The client informs the server in each operation when padding 625 has been applied [RPCRDMA]. 627 Transport Attributes 628 A placeholder for transport-specific attributes is provided, 629 with a format to be determined. Examples of information to be 630 passed in this parameter include transport security attributes 631 to be used on the connection, RDMA-specific attributes, legacy 632 "private data" as used on existing RDMA fabrics, transport 633 Quality of Service attributes, etc. This information is to be 634 passed to the peer's transport layer by local means which is 635 currently outside the scope of this draft. 637 2.2.3. Connection Resources 639 RDMA imposes several requirements on upper layer consumers. 640 Registration of memory and the need to post buffers of a specific 641 size and number for receive operations are a primary consideration. 
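The bind-time negotiation described in the preceding section can be sketched as follows. This is a minimal illustration, not the draft's XDR: the structure fields, the policy limits, and the function names are all hypothetical, chosen only to show the server clamping the client's requested values and the resulting reply cache bound (credits times maximum response size).

```python
from dataclasses import dataclass

@dataclass
class BindParams:
    max_credits: int          # concurrent requests on the channel
    max_request_size: int     # bytes
    max_response_size: int    # bytes
    rdma_read_resources: int  # incoming RDMA Read depth (zero at the server)

# Hypothetical server policy; real limits are implementation choices.
SERVER_LIMITS = BindParams(128, 32 * 1024, 32 * 1024, 8)

def negotiate(requested: BindParams) -> BindParams:
    """The server may reduce each requested value to match local policy."""
    return BindParams(
        min(requested.max_credits, SERVER_LIMITS.max_credits),
        min(requested.max_request_size, SERVER_LIMITS.max_request_size),
        min(requested.max_response_size, SERVER_LIMITS.max_response_size),
        min(requested.rdma_read_resources, SERVER_LIMITS.rdma_read_resources))

def reply_cache_bound(granted: BindParams) -> int:
    """Reply cache storage is bounded by credits x maximum response size."""
    return granted.max_credits * granted.max_response_size

# A client asking for 1000 credits and 64KB messages is clamped to policy.
granted = negotiate(BindParams(1000, 64 * 1024, 64 * 1024, 0))
```

Note that the client's RDMA Read resource request is zero here, matching the draft's guidance that clients do not issue RDMA Reads in this proposal.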
643 Registration of memory can be a relatively high-overhead operation, 644 since it requires pinning of buffers, assignment of attributes 645 (e.g. readable/writable), and initialization of hardware 646 translation. Preregistration is desirable to reduce overhead. 647 These registrations are specific to hardware interfaces and even to 648 RDMA connection endpoints, therefore negotiation of their limits is 649 desirable to manage resources effectively. 651 Following the basic registration, these buffers must be posted by 652 the RPC layer to handle receives. These buffers remain in use by 653 the RPC/NFSv4 implementation; the size and number of them must be 654 known to the remote peer in order to avoid RDMA errors which would 655 cause a fatal error on the RDMA connection. 657 Each channel within a session will potentially have different 658 requirements, negotiated per-connection but accounted for per- 659 session. The session provides a natural way for the server to 660 manage resource allocation to each client rather than to each 661 transport connection itself. This enables considerable flexibility 662 in the administration of transport endpoints. 664 2.2.4. Inline Transfer Model 666 The RDMA Send transfer model is used for all NFS requests and 667 replies. Use of Sends is required to ensure consistency of data 668 and to deliver completion notifications. 670 Sends may carry data as well as control. When a Send carries data 671 associated with a request type, the data is referred to as 672 "inline". This method is typically used where the data payload is 673 small, or where for whatever reason target memory for RDMA is not 674 available. 
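The inline-versus-direct choice just described might be sketched as below. The function name and the decision policy are illustrative assumptions, not part of the draft; they simply capture the rule that a payload fitting the negotiated Send size may travel inline, while larger payloads require the direct (RDMA) model.

```python
def choose_transfer(payload_len: int, max_message_size: int,
                    rdma_available: bool) -> str:
    """Pick a transfer model for a data payload (illustrative policy)."""
    if rdma_available and payload_len > max_message_size:
        return "direct"   # server moves data via RDMA to advertised buffers
    if payload_len <= max_message_size:
        return "inline"   # payload rides in the Send message itself
    # No RDMA target memory and the payload exceeds the negotiated Send
    # size: this is the costly read-chunk "pull" case discussed later.
    raise ValueError("payload exceeds negotiated maximum message size")
```

A real implementation would also apply a size threshold below which inline is preferred even when RDMA is available, to amortize registration overhead.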
676 Inline message exchange 678 Client Server 679 : Request : 680 Send : ------------------------------> : untagged 681 : : buffer 682 : Response : 683 untagged : <------------------------------ : Send 684 buffer : : 686 Client Server 687 : Read request : 688 Send : ------------------------------> : untagged 689 : : buffer 690 : Read response with data : 691 untagged : <------------------------------ : Send 692 buffer : : 694 Client Server 695 : Write request with data : 696 Send : ------------------------------> : untagged 697 : : buffer 698 : Write response : 699 untagged : <------------------------------ : Send 700 buffer : : 702 Responses must be sent to the client on the same channel on which the 703 request was sent. This is important to preserve ordering of 704 operations, and especially RDMA consistency. Additionally, it 705 ensures that the RPC RDMA layer makes no requirement of the RDMA 706 provider to open its memory registration handles (Steering Tags) 707 beyond the scope of a single RDMA connection. This is an important 708 security consideration. 710 Two values must be known to each peer prior to issuing Sends: the 711 maximum number of sends which may be posted, and their maximum 712 size. These values are referred to, respectively, as the message 713 credits and the maximum message size. While the message credits 714 might vary dynamically over the duration of the session, the 715 maximum message size does not. The server must commit to posting a 716 number of receive buffers equal to or greater than its currently 717 advertised credit value, each of the advertised size. If fewer 718 credits or smaller buffers are provided, the connection may fail 719 with an RDMA transport error. 721 While tempting to consider, it is not possible to use the TCP 722 window as an RDMA operation flow control mechanism. First, to do 723 so would violate layering, requiring both senders to be aware of 724 the existing TCP outbound window at all times.
Second, since 725 requests are of variable size, the TCP window can hold a widely 726 variable number of them, and since it cannot be reduced without 727 actually receiving data, the receiver cannot limit the sender. 728 Third, any middlebox interposing on the connection would wreck any 729 possible scheme. [MIDTAX] In this proposal, credits, in the form of 730 explicit operation counts, are exchanged to allow correct 731 provisioning of receive buffers. 733 When not operating over RDMA, credits and sizes are still employed 734 in NFSv4.1, but instead of being required for correctness, they 735 provide the basis for efficient server implementation of exactly- 736 once semantics. The limits are chosen based upon the expected 737 needs and capabilities of the client and server, and are in fact 738 arbitrary. Sizes may be specified as zero (no specific size limit) 739 and credits may be chosen in proportion to the client's 740 capabilities. For example, a limit of 1000 allows 1000 requests to 741 be in progress, which may generally be far more than adequate to 742 keep local networks and servers fully utilized. 744 Both client and server have independent sizes and buffering, but 745 over RDMA fabrics client credits are easily managed by posting a 746 receive buffer prior to sending each request. Each such buffer, however, may 747 not be completed with the reply to its own request, since responses from 748 NFSv4 servers arrive in arbitrary order. When the operations 749 channel is used for callbacks, the client must account for callback 750 requests by posting additional buffers. Note that implementation- 751 specific facilities such as a "shared receive queue" may allow 752 optimization of these allocations. 754 When a connection is bound to a session (creating a channel), the 755 client requests a preferred buffer size, and the server provides 756 its answer. The server posts all buffers of at least this size.
757 The client must comply by not sending requests greater than this 758 size. It is recommended that server implementations do all they 759 can to accommodate a useful range of possible client requests. 760 There is a provision in [RPCRDMA] to allow the sending of client 761 requests which exceed the server's receive buffer size, but it 762 requires the server to "pull" the client's request as a "read 763 chunk" via RDMA Read. This introduces at least one additional 764 network roundtrip, plus other overhead such as registering memory 765 for RDMA Read at the client and additional RDMA operations at the 766 server, and is to be avoided. 768 An issue therefore arises when considering the NFSv4 COMPOUND 769 procedures. Since an arbitrary number (total size) of operations 770 can be specified in a single COMPOUND procedure, its size is 771 effectively unbounded. This cannot be supported by RDMA Sends, and 772 therefore this size negotiation places a restriction on the 773 construction and maximum size of both COMPOUND requests and 774 responses. If a COMPOUND results in a reply at the server that is 775 larger than can be sent in an RDMA Send to the client, then the 776 COMPOUND must terminate and the operation which causes the overflow 777 will provide a TOOSMALL error status result. A chaining facility 778 is provided to overcome some of the resulting limitations, 779 described later in the draft. 781 2.2.5. Direct Transfer Model 783 Placement of data by explicitly tagged RDMA operations is referred 784 to as "direct" transfer. This method is typically used where the 785 data payload is relatively large, that is, when RDMA setup has been 786 performed prior to the operation, or when any overhead for setting 787 up and performing the transfer is regained by avoiding the overhead 788 of processing an ordinary receive. 790 The client advertises RDMA buffers in this proposed model, and not 791 the server. 
This means the "XDR Decoding with Read Chunks" 792 described in [RPCRDMA] is not employed by NFSv4.1 replies, and 793 instead all results transferred via RDMA to the client employ "XDR 794 Decoding with Write Chunks". There are several reasons for this. 796 First, it allows for a correct and secure mode of transfer. The 797 client may advertise specific memory buffers only during specific 798 times, and may revoke access when it pleases. The server is not 799 required to expose copies of local file buffers for individual 800 clients, or to lock or copy them for each client access. 802 Second, client credits based on fixed-size request buffers are 803 easily managed on the server, but for the server additional 804 management of buffers for client RDMA Reads is not well-bounded. 805 For example, the client may not perform these RDMA Read operations 806 in a timely fashion, therefore the server would have to protect 807 itself against denial-of-service on these resources. 809 Third, it reduces network traffic, since buffer exposure outside 810 the scope and duration of a single request/response exchange 811 necessitates additional memory management exchanges. 813 There are costs associated with this decision. Primary among them 814 is the need for the server to employ RDMA Read for operations such 815 as large WRITE. The RDMA Read operation is a two-way exchange at 816 the RDMA layer, which incurs additional overhead relative to RDMA 817 Write. Additionally, RDMA Read requires resources at the data 818 source (the client in this proposal) to maintain state and to 819 generate replies. These costs are overcome through use of 820 pipelining with credits, with sufficient RDMA Read resources 821 negotiated at session initiation, and appropriate use of RDMA for 822 writes by the client - for example only for transfers above a 823 certain size. 825 A description of which NFSv4 operation results are eligible for 826 data transfer via RDMA Write is in [NFSDDP]. 
There are only two 827 such operations: READ and READLINK. When XDR encoding these 828 requests on an RDMA transport, the NFSv4.1 client must insert the 829 appropriate xdr_write_list entries to indicate to the server 830 whether the results should be transferred via RDMA or inline with a 831 Send. As described in [NFSDDP], a zero-length write chunk is used 832 to indicate an inline result. In this way, it is unnecessary to 833 create new operations for RDMA-mode versions of READ and READLINK. 835 Another tool to avoid creation of new, RDMA-mode operations is the 836 Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return 837 large replies via RDMA as if they were inline. Reply chunks are 838 used for operations such as READDIR, which returns large amounts of 839 information, but in many small XDR segments. Reply chunks are 840 offered by the client and the server can use them in preference to 841 inline. Reply chunks are transparent to upper layers such as 842 NFSv4. 844 In any very rare cases where another NFSv4.1 operation requires 845 larger buffers than were negotiated when the channel was bound (for 846 example extraordinarily large RENAMEs), the underlying RPC layer 847 may support the use of "Message as an RDMA Read Chunk" and "RDMA 848 Write of Long Replies" as described in [RPCRDMA]. No additional 849 support is required in the NFSv4.1 client for this. The client 850 should be certain that its requested buffer sizes are not so small 851 as to make this a frequent occurrence, however. 853 All operations are initiated by a Send, and are completed with a 854 Send. This is exactly as in conventional NFSv4, but under RDMA has 855 a significant purpose: RDMA operations are not complete, that is, 856 guaranteed consistent, at the data sink until followed by a 857 successful Send completion (i.e. a receive). 
These events provide 858 a natural opportunity for the initiator (client) to enable and 859 later disable RDMA access to the memory which is the target of each 860 operation, in order to provide for consistent and secure operation. 861 The RDMAP Send with Invalidate operation may be worth employing in 862 this respect, as it relieves the client of certain overhead in this 863 case. 865 A "onetime" boolean advisory to each RDMA region might become a 866 hint to the server that the client will use the three-tuple for 867 only one NFSv4 operation. For a transport such as iWARP, the 868 server can assist the client in invalidating the three-tuple by 869 performing a Send with Solicited Event and Invalidate. The server 870 may ignore this hint, in which case the client must perform a local 871 invalidate after receiving the indication from the server that the 872 NFSv4 operation is complete. This may be considered in a future 873 version of this draft and [NFSDDP]. 875 In a trusted environment, it may be desirable for the client to 876 persistently enable RDMA access by the server. Such a model is 877 desirable for the highest level of efficiency and lowest overhead. 879 RDMA message exchanges 881 Client Server 882 : Direct Read Request : 883 Send : ------------------------------> : untagged 884 : : buffer 885 : Segment : 886 tagged : <------------------------------ : RDMA Write 887 buffer : : : 888 : [Segment] : 889 tagged : <------------------------------ : [RDMA Write] 890 buffer : : 891 : Direct Read Response : 892 untagged : <------------------------------ : Send (w/Inv.) 
893 buffer : : 895 Client Server 896 : Direct Write Request : 897 Send : ------------------------------> : untagged 898 : : buffer 899 : Segment : 900 tagged : v------------------------------ : RDMA Read 901 buffer : +-----------------------------> : 902 : : : 903 : [Segment] : 904 tagged : v------------------------------ : [RDMA Read] 905 buffer : +-----------------------------> : 906 : : 907 : Direct Write Response : 908 untagged : <------------------------------ : Send (w/Inv.) 909 buffer : : 911 2.3. Connection Models 913 There are three scenarios in which to discuss the connection model. 914 Each will be discussed individually, after describing the common 915 case encountered at initial connection establishment. 917 After a successful connection, the first request proceeds, in the 918 case of a new client association, to initial session creation, and 919 then to session binding, prior to regular operation. Session 920 binding, which creates a channel, is a required first step for 921 NFSv4.1 operation on each connection, and there is no change in 922 binding permitted. The client previously asserted that it does or 923 does not wish to negotiate RDMA mode in its session creation 924 request, and the server responded, possibly negatively in which 925 case all connections remain in traditional TCP mode. Special rules 926 apply for the RDMA cases, as described below. 928 In the case of a reconnect, the session creation step is not 929 performed and a session binding is attempted to the previously 930 established session only. If this rebinding is successful at the 931 server, the server will have located the previous session's state, 932 including any surviving locks, delegations, duplicate request cache 933 entries, etc. The previous session will be reestablished with its 934 previous state, ensuring exactly-once semantics of any previously 935 issued NFSv4 requests. If the rebinding fails, then the server has 936 restarted and does not support persistent state. 
This would have 937 been noted in the server's original reply to the session creation, 938 however. 940 Since the session is explicitly created and destroyed by the 941 client, and each client is uniquely identified, the server may be 942 specifically instructed to discard unneeded persistent state. For 943 this reason, it is possible that a server will retain any previous 944 state indefinitely, and place its destruction under administrative 945 control. Or, a server may choose to retain state for some 946 configurable period, provided that the period meets other NFSv4 947 requirements. 949 After successful session establishment, the traditional (TCP 950 stream) connection model used by NFSv4.0 and NFSv4.1 ensures the 951 connection is ready to proceed with issuing requests and returning 952 responses. This mode is arrived at when the client does not 953 request that the connection be placed into RDMA mode. 955 2.3.1. TCP Connection Model 957 The following is a schematic diagram of the NFSv4.1 protocol 958 exchanges leading up to normal operation on a TCP stream. 960 Client Server 961 TCPmode : Session Create(nfs_client_id4, ...) : TCPmode 962 : ------------------------------> : 963 : : 964 : Session reply(sessionid, ...) : 965 : <------------------------------ : 966 : : 967 : Session bind(session id, size S, : 968 : opchan, STREAM, credits N, ...): 969 : ------------------------------> : 970 : : 971 : Bind reply(size S', credits N') : 972 : <------------------------------ : 973 : : 974 : : 975 : ------------------------------> : 976 : <------------------------------ : 977 : : : 979 No net additional exchange is added to the initial negotiation by 980 this proposal. In the NFSv4.1 exchange, the SETCLIENTID and 981 SETCLIENTID_CONFIRM operations are not performed, as described 982 later in the document. 984 2.3.2.
Negotiated RDMA Connection Model 986 The following is a schematic diagram of the NFSv4.1 protocol 987 exchanges negotiating upgrade to RDMA mode on a TCP stream. 989 Client Server 990 TCPmode : Session Create(nfs_client_id4, ...) : TCPmode 991 : ------------------------------> : 992 : : 993 : Session reply(sessionid, ...) : 994 : <------------------------------ : 995 : : 996 : Session bind(session id, size S, : 997 : opchan, RDMA, credits N, ...) : 998 : ------------------------------> : 999 : : Prepost N' receives 1000 : Bind reply(size S', credits N') : of size S' 1001 : <------------------------------ : RDMAmode 1002 RDMAmode : : 1003 : : 1004 : ------------------------------> : 1005 : <------------------------------ : 1006 : : : 1008 In iWARP, the Bind reply and RDMA mode entry are combined into a 1009 single, atomic operation within the Provider, where the Bind reply 1010 is sent in TCP streaming mode and RDMA mode is enabled immediately. 1011 There is no opportunity for a race between the client's first 1012 operation, the preposting of receive descriptors, and RDMA mode 1013 entry at the server. 1015 2.3.3. Automatic RDMA Connection Model 1017 The following is a schematic diagram of the NFSv4.1 protocol 1018 exchanges performed on an RDMA connection. 1020 Client Server 1021 RDMAmode : : : RDMAmode 1022 : : : 1023 Prepost : : : Prepost 1024 receive : : : receive 1025 : : 1026 : Session Create(nfs_client_id4, ...) : 1027 : ------------------------------> : 1028 : : Prepost 1029 : Session reply(sessionid, ...) : receive 1030 : <------------------------------ : 1031 Prepost : : 1032 receive : Session bind(session id, size S, : 1033 : opchan, RDMA, credits N, ...) : 1034 : ------------------------------> : 1035 : : Prepost N' receives 1036 : Bind reply(size S', credits N') : of size S' 1037 : <------------------------------ : 1038 : : 1039 : : 1040 : ------------------------------> : 1041 : <------------------------------ : 1042 : : : 1044 2.4.
Buffer Management, Transfer, Flow Control 1046 Inline operations in NFSv4.1 behave effectively the same as TCP 1047 sends. Procedure results are passed in a single message, and its 1048 completion at the client signals the receiving process to inspect 1049 the message. 1051 RDMA operations are performed solely by the server in this 1052 proposal, as described in the previous "Direct Transfer Model" section. 1053 Since server RDMA operations do not result in a completion at the 1054 client, and due to ordering rules in RDMA transports, after all 1055 required RDMA operations are complete, a Send (Send with Solicited 1056 Event for iWARP) containing the procedure results is performed from 1057 server to client. This Send operation will result in a completion 1058 which will signal the client to inspect the message. 1060 In the case of client read-type NFSv4 operations, the server will 1061 have issued RDMA Writes to transfer the resulting data into client- 1062 advertised buffers. The subsequent Send operation performs two 1063 necessary functions: finalizing any active or pending DMA at the 1064 client, and signaling the client to inspect the message. 1066 In the case of client write-type NFSv4 operations, the server will 1067 have issued RDMA Reads to fetch the data from the client-advertised 1068 buffers. No data consistency issues arise at the client, but the 1069 completion of the transfer must be acknowledged, again by a Send 1070 from server to client. 1072 In either case, the client advertises buffers for direct (RDMA 1073 style) operations. The client may desire certain advertisement 1074 limits, and may wish the server to perform remote invalidation on 1075 its behalf when the server has completed its RDMA. This may be 1076 considered in a future version of this draft. 1078 Credit updates over RDMA transports are supported at the RPC layer 1079 as described in [RPCRDMA].
In each request, the client requests a 1080 desired number of credits to be made available to the channel on 1081 which it sends the request. The client must not send more requests 1082 than the number which the server has previously advertised, or in 1083 the case of the first request, only one. If the client exceeds its 1084 credit limit, the connection may close with a fatal RDMA error. 1086 The server then executes the request, and replies with an updated 1087 credit count accompanying its results. Since replies are sequenced 1088 by their RDMA Send order, the most recent results always reflect 1089 the server's limit. In this way the client will always know the 1090 maximum number of requests it may safely post. 1092 Because the client requests an arbitrary credit count in each 1093 request, it is relatively easy for the client to request more, or 1094 fewer, credits to match its expected need. A client that 1095 discovered itself frequently queuing outgoing requests due to lack 1096 of server credits might increase its requested credits 1097 proportionately in response. Or, a client might have a simple, 1098 configurable number. 1100 Occasionally, a server may wish to reduce the number of credits it 1101 offers a certain client channel. This could be encountered if a 1102 client were found to be consuming its credits slowly, or not at 1103 all. A client might notice this itself, and reduce its requested 1104 credits in advance, for instance requesting only the count of 1105 operations it currently has queued, plus a few as a base for 1106 starting up again. Such mechanisms are, however, potentially 1107 complicated and are implementation-defined. The protocol does not 1108 require them. 1110 Because of the way in which RDMA fabrics function, it is not 1111 possible for the server (or client back channel) to cancel 1112 outstanding receive operations. Therefore, effectively only one 1113 credit can be withdrawn per receive completion.
The server (or 1114 client back channel) would simply not replenish a receive operation 1115 when replying. The server can still reduce the available credit 1116 advertisement in its replies to the target value it desires, as a 1117 hint to the client that its credit target is lower and it should 1118 expect it to be reduced accordingly. Of course, even if the server 1119 could cancel outstanding receives, it could not safely do so, since the 1120 client may have already sent requests in expectation of the 1121 previous limit. 1123 This brings out an interesting scenario similar to the client 1124 reconnect discussed earlier in "Connection Models". How does the 1125 server reduce the credits of an inactive client? 1127 One approach is for the server to simply close such a connection 1128 and require the client to reconnect at a new credit limit. This is 1129 acceptable, if inefficient, when the connection setup time is short 1130 and where the server supports persistent session semantics. 1132 A better approach is to provide a back channel request to return 1133 the operations channel credits. The server may request the client 1134 to return some number of credits; the client must comply by 1135 performing operations on the operations channel, provided of course 1136 that the request does not drop the client's credit count to zero 1137 (in which case the channel would deadlock). If the client finds 1138 that it has no requests with which to consume the credits it was 1139 previously granted, it must send zero-length Send RDMA operations, 1140 or NULL NFSv4 operations in order to return the channel resources 1141 to the server. If the client fails to comply in a timely fashion, 1142 the server can recover the resources by breaking the connection.
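The client's side of the credit-return rule above can be sketched as follows. The function name and the tuple it returns are illustrative assumptions; the sketch only captures the two constraints the draft states: recalled credits are consumed by operations (queued real work first, then NULL operations or zero-length Sends), and the client's credit count must never be driven to zero.

```python
def plan_credit_return(recall_count: int, queued_requests: int,
                       current_credits: int) -> tuple:
    """Return (real_ops, null_ops) the client issues to satisfy a recall."""
    # Never let the recall drop the client's credit count to zero,
    # which would deadlock the channel.
    returnable = min(recall_count, current_credits - 1)
    real_ops = min(queued_requests, returnable)  # consume with real work first
    null_ops = returnable - real_ops             # then NULL / zero-length Send
    return real_ops, null_ops

# A client holding 10 credits with 2 queued requests, asked to return 5,
# sends 2 real requests and 3 NULL operations.
```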
1144 While in principle, the back channel credits could be subject to a 1145 similar resource adjustment, in practice this is not an issue, 1146 since the back channel is used purely for control and is expected 1147 to be statically provisioned. 1149 It is important to note that in addition to credits, the sizes of 1150 buffers are negotiated per-channel. This permits the most 1151 efficient allocation of resources on both peers. There is an 1152 important requirement on reconnection: the sizes offered at 1153 reconnect (session bind) must be at least as large as previously 1154 used, to allow recovery. Any replies that are replayed from the 1155 server's duplicate request cache must be able to be received into 1156 client buffers. In the case where a client has received replies to 1157 all its retried requests (and therefore received all its expected 1158 responses), then the client may disconnect and reconnect with 1159 different buffers at will, since no cache replay will be required. 1161 2.5. Retry and Replay 1163 NFSv4.0 forbids retransmission on active connections over reliable 1164 transports; this includes connected-mode RDMA. This restriction 1165 must be maintained in NFSv4.1. 1167 If one peer were to retransmit a request (or reply), it would 1168 consume an additional credit on the other. If the server 1169 retransmitted a reply, it would certainly result in an RDMA 1170 connection loss, since the client would typically only post a 1171 single receive buffer for each request. If the client 1172 retransmitted a request, the additional credit consumed on the 1173 server might lead to RDMA connection failure unless the client 1174 accounted for it and decreased its available credit, leading to 1175 wasted resources. 1177 Credits present a new issue to the duplicate request cache in 1178 NFSv4.1. The request cache may be used when a connection within a 1179 session is lost, such as after the client reconnects and rebinds. 
1180 Credit information is a dynamic property of the channel, and stale 1181 values must not be replayed from the cache. This may occur on 1182 another existing channel, or a new channel, with potentially new 1183 credits and buffers. This implies that the request cache contents 1184 must not be blindly used when replies are issued from it, and 1185 credit information appropriate to the channel must be refreshed by 1186 the RPC layer. 1188 Finally, RDMA fabrics do not guarantee that the memory handles 1189 (Steering Tags) within each RDMA three-tuple are valid on a scope 1190 outside that of a single connection. Therefore, handles used by 1191 the direct operations become invalid after connection loss. The 1192 server must ensure that any RDMA operations which must be replayed 1193 from the request cache use the newly provided handle(s) from the 1194 most recent request. 1196 2.6. The Back Channel 1198 The NFSv4 callback operations present a significant resource 1199 problem for the RDMA-enabled client. Clearly, their number must be 1200 negotiated in the way credits are for the ordinary operations 1201 channel for requests flowing from client to server. But, for 1202 callbacks to arrive on the same RDMA endpoint as operation replies 1203 would require dedicating additional resources, and specialized 1204 demultiplexing and event handling. Or, callbacks may not require 1205 RDMA service at all (they do not normally carry substantial data 1206 payloads). It is highly desirable to streamline this critical path 1207 via a second communications channel. 1209 The session binding facility is designed for exactly such a 1210 situation, by dynamically associating a new connected endpoint with 1211 the session, and separately negotiating sizes and counts for active 1212 operations. The ChannelType designation in the session bind 1213 operation serves to identify the channel.
The binding operation is 1214 firewall-friendly since it does not require the server to initiate 1215 the connection. 1217 This same method serves equally well for ordinary TCP connection mode. 1218 It is expected that all NFSv4.1 clients may make use of the session 1219 binding facility to streamline their design. 1221 The back channel functions exactly the same as the operations 1222 channel, except that no RDMA operations are required to perform 1223 transfers; instead, the sizes are required to be sufficiently large 1224 to carry all data inline, and of course the client and server 1225 reverse their roles with respect to which is in control of credit 1226 management. The same rules apply for all transfers, with the 1227 server being required to flow control its callback requests. 1229 The back channel is optional. If one is not bound on a given session, the 1230 server must not issue callback operations to the client. This in 1231 turn implies that such a client must never put itself in the 1232 situation where the server will need to do so, lest the client lose 1233 its connection by force, or its operation be incorrect. For the 1234 same reason, if a back channel is bound, the client is subject to 1235 revocation of its delegations if the back channel is lost. Any 1236 connection loss should be corrected by the client as soon as 1237 possible. 1239 This can be convenient for the NFSv4.1 client; if the client 1240 expects to make no use of back channel facilities such as 1241 delegations, then there is no need to create one. This may save 1242 significant resources and complexity at the client. 1244 For these reasons, if the client wishes to use the back channel, 1245 that channel must be bound first, before the operations channel. 1246 In this way, the server will not find itself in a position where it 1247 will send callbacks on the operations channel when the client is 1248 not prepared for them.
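The ordering constraint above can be illustrated with a small sketch. The Session class and channel-type strings here are hypothetical helpers for illustration, not part of the protocol's XDR; only the ordering rule itself comes from this draft.

```python
# Illustrative sketch of the bind-ordering rule: a client that wants
# callbacks binds its back channel before its operations channel.

class Session:
    def __init__(self, wants_callbacks):
        self.wants_callbacks = wants_callbacks
        self.bound = []  # channel types, in bind order

    def bind(self, channel_type):
        # If the client intends to use a back channel, it must be bound
        # before the operations channel, so the server never has a
        # callback to send with no channel prepared to carry it.
        if (channel_type == "OPERATION" and self.wants_callbacks
                and "BACK" not in self.bound):
            raise RuntimeError("bind the back channel first")
        self.bound.append(channel_type)

s = Session(wants_callbacks=True)
s.bind("BACK")       # back channel bound first
s.bind("OPERATION")  # then the operations channel
```

A client that makes no use of delegations simply never binds a back channel, and the server must not issue callbacks to it.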
1250 There is one special case: that in which the back channel is bound 1251 to the operations channel itself. This configuration would 1252 normally be used over a TCP stream connection to implement exactly the 1253 NFSv4.0 behavior, but over RDMA it would require complex resource and 1254 event management at both sides of the connection. The server is 1255 not required to accept such a bind request on an RDMA connection 1256 for this reason, though it is recommended. 1258 2.7. COMPOUND Sizing Issues 1260 Very large responses may pose duplicate request cache issues. 1261 Since servers will want to bound the storage required for such a 1262 cache, the unlimited size of response data in COMPOUND may be 1263 troublesome. If COMPOUND is used in all its generality, then a 1264 non-idempotent request might include operations that return any 1265 amount of data via RDMA. 1267 It is not satisfactory for the server to reject COMPOUNDs at will 1268 with NFS4ERR_RESOURCE when they pose such difficulties for the 1269 server, as this results in serious interoperability problems. 1270 Instead, any such limits must be explicitly exposed as attributes 1271 of the session, ensuring that the server can explicitly support any 1272 duplicate request cache needs at all times. 1274 A need may therefore arise to handle requests of a size 1275 greater than this maximum. When COMPOUNDed requests would exceed 1276 the provided buffer, a chaining facility may be used. 1278 Chaining, when used, provides for executing requests on the channel 1279 in strict sequence at the server. At most a single chain may be in 1280 effect on a channel at any time, and the chain is broken when any 1281 request within the chain is incomplete, for example when an error 1282 is returned or an incomplete result such as a short write occurs. A new 1283 error is provided for flushing subsequent chained requests.
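The chain lifecycle can be modeled as a small per-channel state machine. This hedged sketch uses the ChainFlags values and error names that appear in the OPERATION_CONTROL definition (Section 6.4); the ChainState class itself is illustrative, not part of the protocol, and its handling of a "begin" arriving mid-chain is an assumption.

```python
# Sketch of per-channel chain validation.  submit() models the arrival
# of one request; complete=False models an error or short result.

NOCHAIN, CHAINBEGIN, CHAINCONTINUE, CHAINEND = 0, 1, 2, 3

class ChainState:
    def __init__(self):
        self.active = False   # at most one chain per channel at a time
        self.broken = False   # an earlier chained request was incomplete

    def submit(self, flags, complete=True):
        """Return "OK", "CHAIN_INVALID" or "CHAIN_BROKEN"."""
        if flags == NOCHAIN:
            # Any unchained request implicitly terminates a chain.
            self.active = self.broken = False
            return "OK"
        if flags == CHAINBEGIN:
            if self.active:
                return "CHAIN_INVALID"      # begin while a chain is in effect
            self.active, self.broken = True, False
        elif not self.active:
            return "CHAIN_INVALID"          # continuation/end without a begin
        if self.broken:
            if flags == CHAINEND:
                self.active = self.broken = False
            return "CHAIN_BROKEN"           # flushed, not executed
        if not complete:
            # An error or short result breaks the chain for what follows.
            self.broken = True
        if flags == CHAINEND:
            # Explicit end; a new chain may immediately follow.
            self.active = self.broken = False
        return "OK"
```

Note that a request that itself completes short is still executed; only the requests that follow it in the chain are flushed with CHAIN_BROKEN.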
1285 Chained request sequences are subject to ordinary flow control 1286 since each request is a new, independent request on the channel. 1287 When a chain is in effect, the server executes requests strictly in 1288 the sequence as issued in the chain. When the chain is terminated 1289 by the client, server operation returns to normal, fully parallel 1290 mode. 1292 Chaining is implemented in the OPERATION_CONTROL operation within 1293 each compound. A ChainFlags word indicates the beginning, 1294 continuation and end of each chain. Requests which arrive in an 1295 unexpected state (for example, a "continuation" request without a 1296 "begin") result in a CHAIN_INVALID error. Requests which follow an 1297 incomplete result are not executed and result in a CHAIN_BROKEN 1298 error. The client terminates the chain by explicitly ending the 1299 chain with the "end" flag, or by transmitting any unchained 1300 request. The explicit "end" flag allows a chain to immediately 1301 follow another. 1303 When a chain is in effect, the current filehandle and saved 1304 filehandle are maintained across chained requests as for a single 1305 COMPOUND. This permits passing such results forward in the chain. 1307 The current and saved filehandles are not available outside the 1308 chain. 1310 2.8. Data Alignment 1312 A negotiated data alignment enables certain scatter/gather 1313 optimizations. A facility for this is supported by [RPCRDMA]. 1314 Where NFS file data is the payload, specific optimizations become 1315 highly attractive. 1317 Header padding is requested by each peer at session initiation, and 1318 may be zero (no padding). Padding leverages the useful property 1319 that RDMA receives preserve alignment of data, even when they are 1320 placed into anonymous (untagged) buffers. If requested, client 1321 inline writes will insert appropriate pad bytes within the request 1322 header to align the data payload on the specified boundary. 
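The pad computation itself is simple modular arithmetic. In this sketch, headerpadsize matches the SESSION_BIND argument of that name, while header_len is an illustrative name for the length of the RPC/NFS header preceding the payload:

```python
# Illustrative pad-size computation: the client inserts enough pad
# bytes after the request header that the WRITE payload begins on the
# negotiated alignment boundary.

def pad_bytes(header_len, headerpadsize):
    if headerpadsize == 0:        # zero negotiated means no padding
        return 0
    # Bytes needed to round header_len up to the next multiple of
    # headerpadsize (zero if already aligned).
    return (-header_len) % headerpadsize

# For example, a 148-byte header with a 4096-byte negotiated boundary
# needs 4096 - 148 = 3948 pad bytes before the payload.
```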
The 1323 client is encouraged to be optimistic and simply pad all WRITEs 1324 within the RPC layer to the negotiated size, in the expectation 1325 that the server can use them efficiently. 1327 It is highly recommended that clients offer to pad headers to an 1328 appropriate size. Most servers can make good use of such padding, 1329 which allows them to chain receive buffers in such a way that any 1330 data carried by client requests will be placed into appropriate 1331 buffers at the server, ready for filesystem processing. The 1332 receiver's RPC layer encounters no overhead from skipping over pad 1333 bytes, and the RDMA layer's high performance makes the insertion 1334 and transmission of padding on the sender a significant 1335 optimization. In this way, the need for servers to perform RDMA 1336 Read to satisfy all but the largest client writes is obviated. An 1337 added benefit is the reduction of message roundtrips on the network 1338 - a potentially good trade, where latency is present. 1340 The value to choose for padding is subject to a number of criteria. 1341 A primary source of variable-length data in the RPC header is the 1342 authentication information, the form of which is client-determined, 1343 possibly in response to server specification. The contents of 1344 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. 1345 all go into the determination of a maximal NFSv4 request size and 1346 therefore minimal buffer size. The client must select its offered 1347 value carefully, so as not to overburden the server, and vice- 1348 versa. The payoff of an appropriate padding value is higher 1349 performance.

1351    Sender gather:
1352        |RPC Request|Pad bytes|Length| -> |User data...|
1353        \------+---------------------/      \
1354               \                             \
1355                \    Receiver scatter:        \--------------+- ...
1356                 /-----+----------------\      \              \
1357                |RPC Request|Pad|Length| -> |FS buffer| -> |FS buffer| -> ...
1359 In the above case, the server may recycle unused buffers to the 1360 next posted receive if they are not consumed by the actual received request, or 1361 may pass the now-complete buffers by reference for normal write 1362 processing. For a server which can make use of it, this removes 1363 any need for data copies of incoming data, without resorting to 1364 complicated end-to-end buffer advertisement and management. This 1365 includes most kernel-based and integrated server designs, among 1366 many others. The client may perform similar optimizations, if 1367 desired. 1369 Padding is negotiated by the session binding operation, and 1370 subsequently used by the RPC RDMA layer, as described in [RPCRDMA]. 1372 3. NFSv4 Integration 1374 The following section discusses the integration of the proposed 1375 RDMA extensions with NFSv4.0. 1377 3.1. Minor Versioning 1379 Minor versioning is the existing facility for extending the NFSv4 1380 protocol, and this proposal takes that approach. 1382 Minor versioning of NFSv4 is relatively restrictive, and allows 1383 only tightly limited changes. In particular, it does not permit 1384 adding new "procedures" (it permits adding only new "operations"). 1385 Interoperability concerns make it impossible to consider additional 1386 layering to be a minor revision. This somewhat limits the changes 1387 that can be proposed when considering extensions. 1389 To support exactly-once semantics integrated with sessions and flow 1390 control, it is desirable to tag each request with an identifier to 1391 be called a Streamid. This identifier must be passed by NFSv4 when 1392 running atop any transport, including traditional TCP. Therefore 1393 it is not desirable to add the Streamid to a new RPC transport, 1394 even though such a transport is indicated for support of RDMA. 1395 This draft and [RPCRDMA] do not propose such an approach.
1397 Instead, this proposal follows these requirements faithfully, 1398 through the use of a new operation within NFSv4 COMPOUND procedures 1399 as detailed below. 1401 3.2. Stream Identifiers and Exactly-Once Semantics 1403 The presence of deterministic flow control on a channel enables in- 1404 progress requests to be assigned unique values with useful 1405 properties. 1407 The RPC layer provides a transaction ID (xid), which, while 1408 required to be unique, is not especially convenient for tracking 1409 requests. The transaction ID is only meaningful to the issuer 1410 (client); it cannot be interpreted at the server except to test for 1411 equality with previously issued requests. Because RPC operations 1412 may be completed by the server in any order, many transaction IDs 1413 may be outstanding at any time. The client may therefore perform a 1414 computationally expensive lookup operation in the process of 1415 demultiplexing each reply. 1417 When flow control is in effect, there is a limit to the number of 1418 active requests. This immediately enables a convenient, 1419 computationally efficient index for each request, which is 1420 designated as a Stream Identifier, or streamid. 1422 When the client issues a new request, it selects a streamid in the 1423 range 0..N-1, where N is the server's current "totalrequests" limit 1424 granted to the client on the session over which the request is to be 1425 issued. The streamid must be unused by any of the requests which 1426 the client already has active on the session. "Unused" here means 1427 the client has no outstanding request for that streamid. Because 1428 the streamid is always an integer in the range 0..N-1, client 1429 implementations can use the streamid from a server response to 1430 efficiently match responses with outstanding requests, for 1431 example by using the streamid to index into an outstanding request 1432 array.
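The constant-time matching this enables can be sketched as follows. StreamSlots is a hypothetical client-side helper, not part of the protocol; only the 0..N-1 range and reuse rules come from the draft.

```python
# Streamids 0..N-1 double as direct indexes into the client's
# outstanding-request array, avoiding any xid-keyed search on reply.

class StreamSlots:
    def __init__(self, totalrequests):
        self.pending = [None] * totalrequests    # indexed by streamid
        self.free = list(range(totalrequests))   # currently unused streamids

    def send(self, request):
        if not self.free:
            # All streamids in use: flow control forbids issuing more.
            raise RuntimeError("no streamid free; wait for a reply")
        sid = self.free.pop()
        self.pending[sid] = request
        return sid            # carried to the server in OPERATION_CONTROL

    def reply(self, sid):
        request = self.pending[sid]              # O(1) demultiplex, no search
        self.pending[sid] = None
        self.free.append(sid)
        return request

slots = StreamSlots(totalrequests=3)
sid = slots.send("GETATTR")      # sid is some value in 0..2
assert slots.reply(sid) == "GETATTR"
```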
1434 The server in turn may use this streamid, in conjunction with the 1435 transaction id within the RPC portion of the request, to maintain 1436 its duplicate request cache (DRC) for the session, as opposed to 1437 the traditional approach of ONC RPC applications that use the XID 1438 to index into the DRC. Unlike the XID, the streamid is always 1439 within a specific range; this has two implications. The first 1440 implication is that for a given session, the server need only cache 1441 the results of a limited number of COMPOUND requests. The second 1442 implication derives from the first: unlike XID-indexed 1443 DRCs, the streamid-indexed DRC by its nature cannot be overflowed. This 1444 makes it practical to maintain all the entries required for an 1445 effective, exactly-once-semantics DRC. 1447 The streamid information must be encoded in a way 1448 that does not violate the minor versioning rules of the NFSv4.0 1449 specification. This is accomplished here by encoding it in a 1450 control operation within each NFSv4.1 COMPOUND and CB_COMPOUND 1451 procedure. The operation easily piggybacks within existing 1452 messages. The implementation section of this document describes 1453 the specific proposal. 1455 Exactly-once semantics completely replace the functionality 1456 provided by NFSv4.0 sequence numbers. It is no longer necessary to 1457 employ NFS sequence numbers, and their contents must be ignored by 1458 NFSv4.1 servers when a session is in effect for the connection. As 1459 previously discussed, such a server will never request an open- 1460 confirmation response to OPEN requests, and a client must not issue 1461 an OPEN_CONFIRM operation. 1463 In the case where the server is actively adjusting its granted flow 1464 control credits to the client, it may not be able to use receipt of 1465 the streamid to retire a cache entry.
The streamid used in an 1466 incoming request may not reflect the server's current idea of the 1467 client's credit limit, because the request may have been sent from 1468 the client before the update was received. Therefore, in the 1469 case of a downward credit adjustment, the server may have to retain a 1470 number of duplicate request cache entries at least as large as the 1471 old credit value, until operation sequencing rules allow it to 1472 infer that the client has seen its reply. 1474 Finally, note that the streamid is a guarantee of uniqueness only 1475 in the scope of an unbroken connection. A channel identifier, 1476 assigned at bind time and unique within the session, provides the 1477 means by which a broken connection is detected. If a request carries a 1478 channel identifier that does not match the channel on which it is 1479 received, then the request must be handled as a potential retry on 1480 the previous channel identifier. It is possible to receive 1481 requests up to the credit limit previously in effect for the old 1482 channel, but new requests outside this range should be rejected. 1483 As in the downward flow control adjustment case, the server may 1484 finally retire the old channel's request cache entries based on 1485 operation sequencing rules. 1487 3.3. COMPOUND and CB_COMPOUND 1489 Support for per-operation control can be piggybacked onto NFSv4 1490 COMPOUNDs with full transparency, by placing such facilities into 1491 their own new operation, and placing this operation first in each 1492 COMPOUND under the new NFSv4 minor protocol revision. The contents 1493 of the operation then apply to the entire COMPOUND. 1495 Recall that the NFSv4 minor revision is contained within the 1496 COMPOUND header, encoded prior to the COMPOUNDed operations. By 1497 simply requiring that the new operation always be contained in 1498 NFSv4 minor COMPOUNDs, the control protocol can piggyback perfectly 1499 with each request and response.
1501 In this way, the NFSv4 RDMA Extensions may stay in compliance with 1502 the minor versioning requirements specified in section 10 of 1503 [RFC3530]. 1505 Referring to section 13.1 of the same document, the proposed 1506 session-enabled COMPOUND and CB_COMPOUND have the form:

1508    +-----+--------------+-----------+------------+-----------+----
1509    | tag | minorversion |  numops   | control op | op + args | ...
1510    |     |    (== 1)    | (limited) |   + args   |           |
1511    +-----+--------------+-----------+------------+-----------+----

1513    and the reply's structure is:

1515    +------------+-----+--------+-------------------------------+--//
1516    |last status | tag | numres | status + control op + results | //
1517    +------------+-----+--------+-------------------------------+--//
1518            //-----------------------+----
1519            // status + op + results | ...
1520            //-----------------------+----

1522 The single control operation within each NFSv4.1 COMPOUND defines 1523 the context and operational session parameters which govern that 1524 COMPOUND request and reply. Placing it first in the COMPOUND 1525 encoding is required in order to allow its processing before other 1526 operations in the COMPOUND. This is especially important where 1527 chaining is in effect, as the chain must be checked for correctness 1528 prior to execution. 1530 3.4. eXternal Data Representation Efficiency 1532 RDMA is a copy avoidance technology, and it is important to 1533 maintain this efficiency when decoding received messages. 1534 Traditional XDR implementations frequently use generated 1535 unmarshaling code to convert objects to local form, incurring a 1536 data copy in the process (in addition to subjecting the caller to 1537 recursive calls, etc.). Often, such conversions are carried out 1538 even when no size or byte order conversion is necessary. 1540 It is recommended that implementations pay close attention to the 1541 details of memory referencing in such code.
It is far more 1542 efficient to inspect data in place, using native facilities to deal 1543 with word size and byte order conversion into registers or local 1544 variables, rather than formally (and blindly) performing the 1545 operation via fetch, reallocate and store. 1547 Of particular concern is the result of the READDIR_DIRECT 1548 operation, in which such encoding abounds. 1550 3.5. Effect of Sessions on Existing Operations 1552 The use of a session and associated message credits to provide 1553 exactly-once semantics allows considerable simplification of a 1554 number of mechanisms in the base protocol that are all devoted in 1555 some way to providing replay protection. In particular, the use of 1556 sequence ids on many operations becomes superfluous. Rather than 1557 replace existing operations with variants that delete the sequence 1558 ids, sequence ids will still be present, but their values must not 1559 be checked for correctness, nor used for replay protection. In 1560 addition, when a session is in effect for the connection, OPENs 1561 will never require confirmation, the server must not require 1562 confirmation, and the OPEN_CONFIRM operation must not be issued by 1563 the client. 1565 Since each session will only be used by a single client, the use of 1566 a clientid in many operations will no longer be required. Rather 1567 than remove clientid parameters, the existing operations that use 1568 them will remain unchanged, but a value of zero can be used. The 1569 determination of the client will follow from the session membership 1570 of the connection on which the request arrived. 1572 A situation similar to that of sequence numbers, described earlier, exists 1573 for NFSv4.0 clientid operations. There is no longer a need for 1574 SETCLIENTID and SETCLIENTID_CONFIRM, as clientid uniqueness is 1575 managed by the server through the session, and negotiation is both 1576 unnecessary and redundant.
Additionally, the cb_program and 1577 cb_location which are obtained by the server in SETCLIENTID_CONFIRM 1578 must not be used by the server, because the NFSv4.1 client performs 1579 callback channel designation with SESSION_BIND. A server should 1580 return an error to NFSv4.1 clients which might issue either 1581 operation. 1583 Finally, the RENEW operation is made unnecessary when a session is 1584 present, and the server should return an error to clients which 1585 might issue it. 1587 In summary, the

1589    o OPEN_CONFIRM

1591    o SETCLIENTID

1593    o SETCLIENTID_CONFIRM

1595    o RENEW

1597 operations must not be issued or handled by either client or server when 1598 a session is in effect. 1600 Since the session carries the client indication with it implicitly, 1601 any request on a session associated with a given client will renew 1602 that client's leases. 1604 3.6. Authentication Efficiencies 1606 NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor 1607 [RFC2203] to provide authentication, integrity, and privacy via 1608 cryptography. The server dictates to the client the use of 1609 RPCSEC_GSS, the service (authentication, integrity, or privacy), 1610 and the specific GSS-API security mechanism that each remote 1611 procedure call and result will use. 1613 If the connection's integrity is protected by a means additional 1614 to RPCSEC_GSS, such as IPsec, then the use of RPCSEC_GSS's 1615 integrity service is nearly redundant (see the Security 1616 Considerations section for more explanation of why it is "nearly" 1617 and not completely redundant). Likewise, if the connection's 1618 privacy is protected by additional means, then the use of both 1619 RPCSEC_GSS's integrity and privacy services is nearly redundant. 1621 Connection protection schemes, such as IPsec, are more likely to be 1622 implemented in hardware than upper layer protocols like RPCSEC_GSS.
1623 Hardware-based cryptography at the IPsec layer will be more 1624 efficient than software-based cryptography at the RPCSEC_GSS layer. 1626 When transport integrity can be obtained, it is possible for server 1627 and client to downgrade their per-operation authentication, after 1628 an appropriate exchange. This downgrade can in fact be so complete 1629 as to establish security mechanisms that have zero cryptographic 1630 overhead, effectively using the underlying integrity and privacy 1631 services provided by the transport. 1633 Based on the above observations, a new GSS-API mechanism, called 1634 the Channel Conjunction Mechanism [CCM], is being defined. CCM 1635 works by creating a GSS-API security context using as input a 1636 cookie that the initiator and target have previously agreed to be a 1637 handle for a GSS-API context created previously over another GSS-API 1638 mechanism. 1640 NFSv4.1 clients and servers should support CCM, and they must use as 1641 the cookie the handle from a successful RPCSEC_GSS context creation 1642 over a non-CCM mechanism (such as Kerberos V5). The value of the 1643 cookie will be equal to the handle field of the rpc_gss_init_res 1644 structure from the RPCSEC_GSS specification. 1646 The [CCM] draft provides further discussion and examples. 1648 4. Security Considerations 1650 NFSv4 minor version 1 retains all existing NFSv4 security; 1651 all security considerations present in NFSv4.0 apply to it equally. 1653 Security considerations of any underlying RDMA transport are 1654 additionally important, all the more so due to the emerging nature 1655 of such transports. Examining these issues is outside the scope of 1656 this draft. 1658 When protecting a connection with RPCSEC_GSS, all data in each 1659 request and response (whether transferred inline or via RDMA) 1660 continues to receive this protection over RDMA fabrics [RPCRDMA].
1661 However, when performing data transfers via RDMA, RPCSEC_GSS 1662 protection of the data transfer portion works against the 1663 efficiency which RDMA is typically employed to achieve. This is 1664 because such data is normally managed solely by the RDMA fabric, 1665 and intentionally is not touched by software. Therefore, when 1666 employing RPCSEC_GSS under CCM, and where integrity protection has 1667 been "downgraded", the cooperation of the RDMA transport provider 1668 is critical to maintain any integrity and privacy otherwise in 1669 place for the session. The means by which the local RPCSEC_GSS 1670 implementation is integrated with the RDMA data protection 1671 facilities are outside the scope of this draft. 1673 It is logical to use the same GSS context on a session's callback 1674 channel as that used on its operations channel(s), but the issue 1675 warrants careful analysis. 1677 If the NFS client wishes to maintain full control over RPCSEC_GSS 1678 protection, it may still perform its transfer operations using 1679 either the inline or RDMA transfer model, or of course employ 1680 traditional TCP stream operation. In the RDMA inline case, header 1681 padding is recommended to optimize behavior at the server. At the 1682 client, close attention should be paid to the implementation of 1683 RPCSEC_GSS processing to minimize memory referencing and especially 1684 copying. These are well-advised in any case! 1686 Proper authentication of the session binding operation of the 1687 proposed NFSv4.1 protocol exactly follows the similar requirement on client 1688 identifiers in NFSv4.0. It must not be possible for a client to 1689 bind to an existing session by guessing its session identifier. To 1690 protect against this, NFSv4.0 requires appropriate authentication 1691 and matching of the principal used. This is discussed in Section 1692 16, Security Considerations, of [RFC3530]. The same requirement 1693 applies here before binding to a session identifier.
1695 The proposed session binding improves security over that provided 1696 by NFSv4 for the callback channel. The connection is client- 1697 initiated, and subject to the same firewall and routing checks as 1698 the operations channel. The connection cannot be hijacked by an 1699 attacker who connects to the client port prior to the intended 1700 server. The connection is set up by the client with its desired 1701 attributes, such as optionally securing it with IPsec or similar. The 1702 binding is fully authenticated before being activated. 1704 The server should take care to protect itself against denial of 1705 service attacks in the creation of sessions and clientids. Clients 1706 who connect and create sessions, only to disconnect and never bind 1707 to them, may leave significant state behind. (The same issue 1708 applies to NFSv4.0 with clients who may perform SETCLIENTID, then 1709 never perform SETCLIENTID_CONFIRM.) Careful authentication coupled 1710 with resource checks is highly recommended. 1712 5. IANA Considerations 1714 Because this proposal is based on a minor protocol revision, any new minor 1715 version number might be registered and reserved with the agreed-upon 1716 specification. Assigned operation numbers and any RPC constants 1717 might undergo the same process. 1719 There are no issues stemming from RDMA use itself regarding port 1720 number assignments not already specified by [RFC3530]. Initial 1721 connection is via ordinary TCP stream services, operating on the 1722 same ports and under the same set of naming services. 1724 In the Automatic RDMA connection model described above, it is 1725 possible that a new well-known port, or a new transport type 1726 assignment (netid) as described in [RFC3530], may be desirable. 1728 6. NFSv4 Protocol Extensions 1730 This section specifies details of the five extensions to NFSv4 1731 proposed by this document. Existing NFSv4 operations (under minor 1732 version 0) continue to be fully supported, unmodified. 1734 6.1.
SESSION_CREATE

1736 SYNOPSIS

1738   sessionparams -> sessionresults

1740 ARGUMENT

1742   struct SESSIONCREATE4args {
1743           nfs_client_id4  clientid;
1744           bool            persist;
1745           uint32          totalrequests;
1746   };

1748 RESULT

1750   struct SESSIONCREATE4resok {
1751           uint64          sessionid;
1752           bool            persist;
1753           uint32          totalrequests;
1754   };

1756   union SESSIONCREATE4res switch (nfsstat4 status) {
1757     case NFS4_OK:
1758           SESSIONCREATE4resok  resok4;
1759     default:
1760           void;
1761   };

1763 DESCRIPTION

1765 The SESSION_CREATE operation creates a session to which client 1766 connections may be bound with SESSION_BIND. 1768 The "persist" argument indicates to the server whether the client 1769 requires strict response caching for the session. For example, a 1770 read-only session may set persist to FALSE. The server may choose 1771 to change the returned value of "persist" to match its 1772 implementation choice. 1774 The "totalrequests" argument allows the server to size any 1775 necessary response cache storage. It is the largest number of 1776 outstanding requests to which the client will adhere, session-wide. 1778 Note that the SESSION_CREATE operation never appears with an 1779 associated streamid. Therefore the SESSION_CREATE operation may 1780 not receive the same level of exactly-once replay protection in the 1781 face of transport failure. However, because at most one 1782 SESSION_CREATE operation may be issued on a connection, servers can 1783 provide "special" caching of the result (the sessionid) to 1784 compensate for this. 1786 ... 1788 ERRORS 1790 1792 6.2.
SESSION_BIND

1794 SYNOPSIS

1796   sessionparams -> sessionresults

1798 ARGUMENT

1800   enum ChannelType {
1801           OPERATION = 0,
1802           BACK      = 1
1803   };

1805   enum ConnectionMode {
1806           STREAM = 0,
1807           RDMA   = 1
1808   };

1810   struct SESSIONBIND4args {
1811           uint64          sessionid;
1812           ChannelType     channel;
1813           ConnectionMode  mode;
1814           count4          maxrequestsize;
1815           count4          maxresponsesize;
1816           count4          headerpadsize;
1817           count4          maxrequests;
1818           count4          maxrdmareads;
1819           opaque          transportattrs<>;
1820   };

1822 RESULT

1824   struct SESSIONBIND4resok {
1825           uint32          channelid;
1826           count4          maxrequestsize;
1827           count4          maxresponsesize;
1828           count4          headerpadsize;
1829           count4          maxrequests;
1830           count4          maxrdmareads;
1831           opaque          transportattrs<>;
1832   };

1834   union SESSIONBIND4res switch (nfsstat4 status) {
1835     case NFS4_OK:
1836           SESSIONBIND4resok  resok4;
1837     default:
1838           void;
1839   };

1841 DESCRIPTION

1843 The SESSION_BIND operation causes the connection on which the 1844 operation is issued to be associated with the specified session, 1845 creating a new channel. The channel type may be specified to be 1846 for multiple purposes. Multiple channels may be bound to a single 1847 connection within a session. Normally, only one back channel is 1848 bound. 1850 Credits and sizes are interpreted relative to the initiator of each 1851 channel; that is, the operations channel specifies server credits 1852 and sizes for the operations channel, while the back channel 1853 specifies client credits and sizes for the back channel. Padding 1854 and direct operations are generally not required on the back 1855 channel. 1857 The channelid is a unique session-wide identifier for each newly 1858 bound connection. New requests must be issued on a channel with 1859 the matching identifier, while requests retried after connection 1860 failure must reissue the original identifier. 1862 When ConnectionMode is "RDMA", the channel may be promoted to RDMA 1863 mode by the server before replying, if supported.
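Server-side handling of the size and count arguments might look like the following sketch. The clamping policy and dictionary layout shown are assumptions for illustration; the draft specifies only that the reply carries the values the server grants, and that the per-channel hint is bounded by the session-wide totalrequests.

```python
# Hypothetical server-side SESSION_BIND handling (the clamping policy
# and data layout are assumptions, not from the specification).

def session_bind(session, args):
    channelid = session["next_channelid"]       # unique within the session
    session["next_channelid"] += 1
    return {
        "channelid": channelid,
        # The server may grant less than the client offered...
        "maxrequestsize": min(args["maxrequestsize"], session["server_maxreq"]),
        "maxresponsesize": min(args["maxresponsesize"], session["server_maxresp"]),
        # ...and the per-channel hint may never exceed the session-wide
        # totalrequests negotiated at SESSION_CREATE.
        "maxrequests": min(args["maxrequests"], session["totalrequests"]),
    }

session = {"next_channelid": 0, "server_maxreq": 32768,
           "server_maxresp": 32768, "totalrequests": 64}
reply = session_bind(session, {"maxrequestsize": 65536,
                               "maxresponsesize": 16384,
                               "maxrequests": 128})
```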
1865 The "maxrequests" value is a hint which the client may use to 1866 communicate to the server its expected credit use on the channel. 1867 The client must always adhere to the "totalrequests" value, 1868 aggregated on all channels within the session, which it negotiated 1869 with the server at session creation. 1871 Note that the SESSION_BIND operation never appears with an 1872 associated streamid, but it also never requires replay protection. A 1873 client which has suffered a connection loss must immediately respond 1874 with a new SESSION_BIND, and never a retransmit. Also, for this 1875 reason, it is recommended to use SESSION_BIND alone in its request. 1877 ... 1879 ERRORS 1881 1883 6.3. SESSION_DESTROY

1885 SYNOPSIS

1887   void -> status

1889 ARGUMENT

1891   void;

1893 RESULT

1895   struct SESSION_DESTROYres {
1896           nfsstat  status;
1897   };

1899 DESCRIPTION

1901 The SESSION_DESTROY operation closes the session and discards any 1902 active state such as locks, leases, and server duplicate request 1903 cache entries. Any remaining connections bound to the session are 1904 immediately unbound and may additionally be closed by the server. 1906 This operation must be the final, or only, operation after the 1907 required OPERATION_CONTROL in any request. Because the operation 1908 results in destruction of the session, any duplicate request 1909 caching for this request, as well as for previously completed requests, 1910 will be lost. For this reason, it is advisable to not place this 1911 operation in a request with other state-modifying operations. 1913 Note that because the operation will never be replayed by the 1914 server, a client that retransmits the request may receive an error 1915 in response, even though the session may have been successfully 1916 destroyed. 1918 ... 1920 ERRORS 1922 1924 6.4.
OPERATION_CONTROL 1926 SYNOPSIS 1928 control -> control 1930 ARGUMENT 1932 enum ChainFlags { 1933 NOCHAIN = 0, 1934 CHAINBEGIN = 1, 1935 CHAINCONTINUE = 2, 1936 CHAINEND = 3 1937 }; 1939 struct OPERATIONCONTROL4args { 1940 uint32 channelid; 1941 uint32 streamid; 1942 enum ChainFlags chainflags; 1943 }; 1945 RESULT 1947 union OPERATIONCONTROL4res switch (nfsstat4 status) { 1948 case NFS4_OK: 1949 uint32 streamid; 1950 default: 1951 void; 1952 }; 1954 DESCRIPTION 1956 The OPERATION_CONTROL operation is used to manage operational 1957 accounting for the channel on which the operation is sent. The 1958 contents include the Streamid, used by the server to implement 1959 exactly-once semantics, and chaining flags to implement request 1960 chaining for the operations channel. This operation must appear 1961 once as the first operation in each COMPOUND and CB_COMPOUND sent 1962 after the channel is successfully bound, or a protocol error must 1963 result. 1965 The channelid and streamid are provided in the arguments in order 1966 to permit the server to implement duplicate request cache handling. 1967 The streamid is provided in the results in order to assist the 1968 client in efficiently demultiplexing the reply. 1970 ... 1972 ERRORS 1974 Streamid out of bounds 1975 CHAIN_INVALID and CHAIN_BROKEN 1977 6.5. CB_CREDITRECALL 1979 SYNOPSIS 1981 targetcount -> status 1983 ARGUMENT 1985 count4 target; 1987 RESULT 1989 struct CB_CREDITRECALLres { 1990 nfsstat status; 1991 }; 1993 DESCRIPTION 1995 The CB_CREDITRECALL operation requests the client to return credits 1996 at the server, by zero-length RDMA Sends or NULL NFSv4 operations. 1998 ... 2000 ERRORS 2002 2004 7. Acknowledgements 2006 The authors wish to acknowledge the valuable contributions and 2007 review of Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, 2008 Dave Noveck and Mark Wittle. 2010 8. References 2012 [CCM] 2013 M. Eisler, N. 
        Williams, "The Channel Conjunction Mechanism (CCM) for GSS",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm-02

   [CJ89]
        C. Juszczak, "Improving the Performance and Correctness of an
        NFS Server," Winter 1989 USENIX Conference Proceedings, USENIX
        Association, Berkeley, CA, February 1989, pages 53-63.

   [DAFS]
        Direct Access File System, available from
        http://www.dafscollaborative.org

   [DCK+03]
        M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck,
        T. Talpey, M. Wittle, "The Direct Access File System", in
        Proceedings of 2nd USENIX Conference on File and Storage
        Technologies (FAST '03), San Francisco, CA, March 31 - April
        2, 2003

   [DDP]
        H. Shah, J. Pinkerton, R. Recio, P. Culley, "Direct Data
        Placement over Reliable Transports",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-01

   [FJDAFS]
        Fujitsu Prime Software Technologies, "Meet the DAFS
        Performance with DAFS/VI Kernel Implementation using cLAN",
        http://www.pst.fujitsu.com/english/dafsdemo/index.html

   [FJNFS]
        Fujitsu Prime Software Technologies, "An Adaptation of VIA to
        NFS on Linux",
        http://www.pst.fujitsu.com/english/nfs/index.html

   [IB]
        InfiniBand Architecture Specification, Volume 1, Release 1.1,
        available from http://www.infinibandta.org

   [KM02]
        K. Magoutis, "Design and Implementation of a Direct Access
        File System (DAFS) Kernel Server for FreeBSD", in Proceedings
        of USENIX BSDCon 2002 Conference, San Francisco, CA, February
        11-14, 2002.

   [MAF+02]
        K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase,
        D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber,
        "Structure and Performance of the Direct Access File System
        (DAFS)", in Proceedings of 2002 USENIX Annual Technical
        Conference, Monterey, CA, June 9-14, 2002.

   [MIDTAX]
        B. Carpenter, S.
        Brim, "Middleboxes: Taxonomy and Issues", Informational RFC,
        http://www.ietf.org/rfc/rfc3234

   [NFSDDP]
        B. Callaghan, T. Talpey, "NFS Direct Data Placement",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-nfsdirect-01

   [NFSPS]
        T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-talpey-nfs-rdma-problem-statement-01

   [RDMAREQ]
        B. Callaghan, M. Wittle, "NFS RDMA Requirements",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-nfs-rdmareq-00

   [RFC3530]
        S. Shepler, et al., "NFS Version 4 Protocol", Standards Track
        RFC, http://www.ietf.org/rfc/rfc3530

   [RDDP]
        Remote Direct Data Placement Working Group charter,
        http://www.ietf.org/html.charters/rddp-charter.html

   [RDDPPS]
        A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct
        Data Placement Working Group Problem Statement",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-statement-03

   [RDMAP]
        R. Recio, P. Culley, D. Garcia, J. Hilland, "An RDMA Protocol
        Specification",
        http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-01

   [RPCRDMA]
        B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
        Internet-Draft Work in Progress,
        http://www.ietf.org/internet-drafts/draft-callaghan-rpc-rdma-01

   [RFC2203]
        M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
        Specification", Standards Track RFC,
        http://www.ietf.org/rfc/rfc2203

Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Spencer Shepler
   Sun Microsystems, Inc.
   7808 Moonflower Drive
   Austin, TX 78750 USA

   Phone: +1 512 349 9376
   EMail: spencer.shepler@sun.com

Full Copyright Statement

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.