2   NFSv4 Working Group                                       Tom Talpey
3   Internet-Draft                                                NetApp
4   Intended status: Standards Track                     Brent Callaghan
5   Expires: June 3, 2009                                          Apple
6                                                       December 3, 2008

8     Remote Direct Memory Access Transport for Remote Procedure Call
9                        draft-ietf-nfsv4-rpcrdma-09

11  Status of this Memo

13  This Internet-Draft is submitted to IETF in full conformance with
14  the provisions of BCP 78 and BCP 79.

16  Internet-Drafts are working documents of the Internet Engineering
17  Task Force (IETF), its areas, and its working groups.  Note that
18  other groups may also distribute working documents as Internet-
19  Drafts.

21  Internet-Drafts are draft documents valid for a maximum of six
22  months and may be updated, replaced, or obsoleted by other
23  documents at any time.  It is inappropriate to use Internet-Drafts
24  as reference material or to cite them other than as "work in
25  progress."

27  The list of current Internet-Drafts can be accessed at
28  http://www.ietf.org/ietf/1id-abstracts.txt.
30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on June 3, 2009. 35 Abstract 37 A protocol is described providing Remote Direct Memory Access 38 (RDMA) as a new transport for Remote Procedure Call (RPC). The 39 RDMA transport binding conveys the benefits of efficient, bulk data 40 transport over high speed networks, while providing for minimal 41 change to RPC applications and with no required revision of the 42 application RPC protocol, or the RPC protocol itself. 44 Table of Contents 46 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 47 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 48 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 49 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 50 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 51 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 52 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 53 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10 54 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 55 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 56 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 57 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 58 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 59 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 60 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 61 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 62 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 63 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 64 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 65 6. Connection Configuration Protocol . . . . . . . . . . . . 25 66 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 67 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 68 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 69 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 70 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 71 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 72 11. Security Considerations . . . . . . . . . . . . . . . . . 30 73 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 31 74 13. Acknowledgments . . . . . . . . . . . . . . . . . . . . . 33 75 14. Normative References . . . . . . . . . . . . . . . . . . 33 76 15. Informative References . . . . . . . . . . . . . . . . . 34 77 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 35 78 Intellectual Property Statement . . . . . . . . . . . . . . . 35 79 Disclaimer of Validity . . . . . . . . . . . . . . . . . . . . 36 80 Copyright Statement . . . . . . . . . . . . . . . . . . . . . 36 82 Requirements Language 84 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 85 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 86 this document are to be interpreted as described in [RFC2119]. 88 1. Introduction 90 Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a 91 technique for efficient movement of data between end nodes, which 92 becomes increasingly compelling over high speed transports. 
By 93 directing data into destination buffers as it is sent on a network, 94 and placing it via direct memory access by hardware, the double 95 benefit of faster transfers and reduced host overhead is obtained. 97 Open Network Computing Remote Procedure Call (ONC RPC, or simply, 98 RPC) [RFC1831bis] is a remote procedure call protocol that has been 99 run over a variety of transports. Most RPC implementations today 100 use UDP or TCP. RPC messages are defined in terms of an eXternal 101 Data Representation (XDR) [RFC4506] which provides a canonical data 102 representation across a variety of host architectures. An XDR data 103 stream is conveyed differently on each type of transport. On UDP, 104 RPC messages are encapsulated inside datagrams, while on a TCP byte 105 stream, RPC messages are delineated by a record marking protocol. 106 An RDMA transport also conveys RPC messages in a unique fashion 107 that must be fully described if client and server implementations 108 are to interoperate. 110 RDMA transports present new semantics unlike the behaviors of 111 either UDP or TCP alone. They retain message delineations like UDP 112 while also providing a reliable, sequenced data transfer like TCP. 113 And, they provide the new efficient, bulk transfer service of RDMA. 114 RDMA transports are therefore naturally viewed as a new transport 115 type by RPC. 117 RDMA as a transport will benefit the performance of RPC protocols 118 that move large "chunks" of data, since RDMA hardware excels at 119 moving data efficiently between host memory and a high speed 120 network with little or no host CPU involvement. In this context, 121 the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] 122 [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem 123 statement is discussed in [NFSRDMAPS], and related NFSv4 issues are 124 discussed in [NFSv4.1]. Many other RPC-based protocols will also 125 benefit. 127 Although the RDMA transport described here provides relatively 128 transparent support for any RPC application, the proposal goes 129 further in describing mechanisms that can optimize the use of RDMA 130 with more active participation by the RPC application. 132 2. Abstract RDMA Requirements 134 An RPC transport is responsible for conveying an RPC message from a 135 sender to a receiver. An RPC message is either an RPC call from a 136 client to a server, or an RPC reply from the server back to the 137 client. An RPC message contains an RPC call header followed by 138 arguments if the message is an RPC call, or an RPC reply header 139 followed by results if the message is an RPC reply. The call 140 header contains a transaction ID (XID) followed by the program and 141 procedure number as well as a security credential. An RPC reply 142 header begins with an XID that matches that of the RPC call 143 message, followed by a security verifier and results. All data in 144 an RPC message is XDR encoded. For a complete description of the 145 RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. 147 This protocol assumes the following abstract model for RDMA 148 transports. These terms, common in the RDMA lexicon, are used in 149 this document. A more complete glossary of RDMA terms can be found 150 in [RFC5040]. 152 o Registered Memory 153 All data moved via tagged RDMA operations is resident in 154 registered memory at its destination. 
This protocol assumes 155 that each segment of registered memory MUST be identified with 156 a steering tag of no more than 32 bits and memory addresses of 157 up to 64 bits in length. 159 o RDMA Send 160 The RDMA provider supports an RDMA Send operation with 161 completion signalled at the receiver when data is placed in a 162 pre-posted buffer. The amount of transferred data is limited 163 only by the size of the receiver's buffer. Sends complete at 164 the receiver in the order they were issued at the sender. 166 o RDMA Write 167 The RDMA provider supports an RDMA Write operation to directly 168 place data in the receiver's buffer. An RDMA Write is 169 initiated by the sender and completion is signalled at the 170 sender. No completion is signalled at the receiver. The 171 sender uses a steering tag, memory address and length of the 172 remote destination buffer. RDMA Writes are not necessarily 173 ordered with respect to one another, but are ordered with 174 respect to RDMA Sends; a subsequent RDMA Send completion 175 obtained at the receiver guarantees that prior RDMA Write data 176 has been successfully placed in the receiver's memory. 178 o RDMA Read 179 The RDMA provider supports an RDMA Read operation to directly 180 place peer source data in the requester's buffer. An RDMA 181 Read is initiated by the receiver and completion is signalled 182 at the receiver. The receiver provides steering tags, memory 183 addresses and a length for the remote source and local 184 destination buffers. Since the peer at the data source 185 receives no notification of RDMA Read completion, there is an 186 assumption that on receiving the data the receiver will signal 187 completion with an RDMA Send message, so that the peer can 188 free the source buffers and the associated steering tags. 190 This protocol is designed to be carried over all RDMA transports 191 meeting the stated requirements. This protocol conveys to the RPC 192 peer, information sufficient for that RPC peer to direct an RDMA 193 layer to perform transfers containing RPC data, and to communicate 194 their result(s). For example, it is readily carried over RDMA 195 transports such as iWARP [RFC5040, RFC5041] or Infiniband [IB]. 197 3. Protocol Outline 199 An RPC message can be conveyed in identical fashion, whether it is 200 a call or reply message. In each case, the transmission of the 201 message proper is preceded by transmission of a transport-specific 202 header for use by RPC over RDMA transports. This header is 203 analogous to the record marking used for RPC over TCP, but is more 204 extensive, since RDMA transports support several modes of data 205 transfer and it is important to allow the upper layer protocol to 206 specify the most efficient mode for each of the segments in a 207 message. Multiple segments of a message may thereby be transferred 208 in different ways to different remote memory destinations. 210 All transfers of a call or reply begin with an RDMA Send which 211 transfers at least the RPC over RDMA header, usually with the call 212 or reply message appended, or at least some part thereof. Because 213 the size of what may be transmitted via RDMA Send is limited by the 214 size of the receiver's pre-posted buffer, the RPC over RDMA 215 transport provides a number of methods to reduce the amount 216 transferred by means of the RDMA Send, when necessary, by 217 transferring various parts of the message using RDMA Read and RDMA 218 Write. 
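The abstract model above can be restated as a brief C sketch. The type and function names are hypothetical and imply no particular RDMA provider API; the sketch only records that a registered segment is described by a steering tag of at most 32 bits, an address of up to 64 bits and a length, and that Send, Write and Read differ in where completion is signalled.

   /* Illustrative only: hypothetical names, not a provider API. */
   #include <stdint.h>
   #include <stddef.h>

   /* A segment of registered memory as assumed by this protocol. */
   struct rdma_segment {
       uint32_t stag;      /* steering tag, no more than 32 bits */
       uint64_t addr;      /* memory address, up to 64 bits      */
       uint32_t length;    /* length of the segment in bytes     */
   };

   /* Send: data lands in a pre-posted buffer at the peer; completion
    * is signalled at the receiver.  Sends complete at the receiver in
    * the order they were issued at the sender. */
   int rdma_send(void *conn, const void *buf, size_t len);

   /* Write: the sender places data directly in the peer's registered
    * memory; completion is signalled only at the sender. */
   int rdma_write(void *conn, const struct rdma_segment *dst,
                  const void *src, size_t len);

   /* Read: the receiver pulls data from the peer's registered memory;
    * completion is signalled only at the receiver, so a later Send is
    * needed to tell the peer that its source buffers may be freed. */
   int rdma_read(void *conn, const struct rdma_segment *src,
                 void *dst, size_t len);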
220 RPC over RDMA framing replaces all other RPC framing (such as TCP 221 record marking) when used atop an RPC/RDMA association, even though 222 the underlying RDMA protocol may itself be layered atop a protocol 223 with a defined RPC framing (such as TCP). It is however possible 224 for RPC/RDMA to be dynamically enabled, in the course of 225 negotiating the use of RDMA via an upper layer exchange. Because 226 RPC framing delimits an entire RPC request or reply, the resulting 227 shift in framing must occur between distinct RPC messages, and in 228 concert with the transport. 230 3.1. Short Messages 232 Many RPC messages are quite short. For example, the NFS version 3 233 GETATTR request, is only 56 bytes: 20 bytes of RPC header, plus a 234 32 byte file handle argument and 4 bytes of length. The reply to 235 this common request is about 100 bytes. 237 There is no benefit in transferring such small messages with an 238 RDMA Read or Write operation. The overhead in transferring 239 steering tags and memory addresses is justified only by large 240 transfers. The critical message size that justifies RDMA transfer 241 will vary depending on the RDMA implementation and network, but is 242 typically of the order of a few kilobytes. It is appropriate to 243 transfer a short message with an RDMA Send to a pre-posted buffer. 244 The RPC over RDMA header with the short message (call or reply) 245 immediately following is transferred using a single RDMA Send 246 operation. 248 Short RPC messages over an RDMA transport: 250 RPC Client RPC Server 251 | RPC Call | 252 Send | ------------------------------> | 253 | | 254 | RPC Reply | 255 | <------------------------------ | Send 257 3.2. Data Chunks 259 Some protocols, like NFS, have RPC procedures that can transfer 260 very large "chunks" of data in the RPC call or reply and would 261 cause the maximum send size to be exceeded if one tried to transfer 262 them as part of the RDMA Send. These large chunks typically range 263 from a kilobyte to a megabyte or more. An RDMA transport can 264 transfer large chunks of data more efficiently via the direct 265 placement of an RDMA Read or RDMA Write operation. Using direct 266 placement instead of inline transfer not only avoids expensive data 267 copies, but provides correct data alignment at the destination. 269 3.3. Flow Control 271 It is critical to provide RDMA Send flow control for an RDMA 272 connection. RDMA receive operations will fail if a pre-posted 273 receive buffer is not available to accept an incoming RDMA Send, 274 and repeated occurrences of such errors can be fatal to the 275 connection. This is a departure from conventional TCP/IP 276 networking where buffers are allocated dynamically on an as-needed 277 basis, and where pre-posting is not required. 279 It is not practical to provide for fixed credit limits at the RPC 280 server. Fixed limits scale poorly, since posted buffers are 281 dedicated to the associated connection until consumed by receive 282 operations. Additionally for protocol correctness, the RPC server 283 must always be able to reply to client requests, whether or not new 284 buffers have been posted to accept future receives. (Note that the 285 RPC server may in fact be a client at some other layer. For 286 example, NFSv4 callbacks are processed by the NFSv4 client, acting 287 as an RPC server. The credit discussions apply equally in either 288 case.) 
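A practical consequence of the above is that each peer must keep enough receive buffers posted for the Sends it is prepared to accept. A minimal C sketch of such pre-posting, using a hypothetical provider call, might look as follows.

   /* Hypothetical pre-posting of receive buffers; an incoming RDMA
    * Send fails if no posted buffer is available to accept it. */
   #include <stdlib.h>
   #include <stddef.h>

   /* Hypothetical provider call that posts one receive buffer. */
   extern int post_receive(void *conn, void *buf, size_t len);

   /* Post one receive buffer per message the peer may send; posted
    * buffers remain dedicated to this connection until consumed by
    * incoming Sends, which is why fixed limits scale poorly. */
   int post_receives(void *conn, unsigned int count, size_t bufsize)
   {
       for (unsigned int i = 0; i < count; i++) {
           void *buf = malloc(bufsize);
           if (buf == NULL || post_receive(conn, buf, bufsize) != 0)
               return -1;
       }
       return 0;
   }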
290 Flow control for RDMA Send operations is implemented as a simple 291 request/grant protocol in the RPC over RDMA header associated with 292 each RPC message. The RPC over RDMA header for RPC call messages 293 contains a requested credit value for the RPC server, which MAY be 294 dynamically adjusted by the caller to match its expected needs. 295 The RPC over RDMA header for the RPC reply messages provides the 296 granted result, which MAY have any value except it MUST NOT be zero 297 when no in-progress operations are present at the server, since 298 such a value would result in deadlock. The value MAY be adjusted 299 up or down at each opportunity to match the server's needs or 300 policies. 302 The RPC client MUST NOT send unacknowledged requests in excess of 303 this granted RPC server credit limit. If the limit is exceeded, 304 the RDMA layer may signal an error, possibly terminating the 305 connection. Even if an error does not occur, it is OPTIONAL that 306 the server handle the excess request(s), and it MAY return an RPC 307 error to the client. Also note that the never-zero requirement 308 implies that an RPC server MUST always provide at least one credit 309 to each connected RPC client from which no requests are 310 outstanding. The client would deadlock otherwise, unable to send 311 another request. 313 While RPC calls complete in any order, the current flow control 314 limit at the RPC server is known to the RPC client from the Send 315 ordering properties. It is always the most recent server-granted 316 credit value minus the number of requests in flight. 318 Certain RDMA implementations may impose additional flow control 319 restrictions, such as limits on RDMA Read operations in progress at 320 the responder. Because these operations are outside the scope of 321 this protocol, they are not addressed and SHOULD be provided for by 322 other layers. For example, a simple upper layer RPC consumer might 323 perform single-issue RDMA Read requests, while a more 324 sophisticated, multithreaded RPC consumer might implement its own 325 FIFO queue of such operations. For further discussion of possible 326 protocol implementations capable of negotiating these values, see 327 section 6 "Connection Configuration Protocol" of this draft, or 328 [NFSv4.1]. 330 3.4. XDR Encoding with Chunks 332 The data comprising an RPC call or reply message is marshaled or 333 serialized into a contiguous stream by an XDR routine. XDR data 334 types such as integers, strings, arrays and linked lists are 335 commonly implemented over two very simple functions that encode 336 either an XDR data unit (32 bits) or an array of bytes. 338 Normally, the separate data items in an RPC call or reply are 339 encoded as a contiguous sequence of bytes for network transmission 340 over UDP or TCP. However, in the case of an RDMA transport, local 341 routines such as XDR encode can determine that (for instance) an 342 opaque byte array is large enough to be more efficiently moved via 343 an RDMA data transfer operation like RDMA Read or RDMA Write. 345 Semantically speaking, the protocol has no restriction regarding 346 data types which may or may not be represented by a read or write 347 chunk. In practice however, efficiency considerations lead to the 348 conclusion that certain data types are not generally "chunkable". 349 Typically, only those opaque and aggregate data types that may 350 attain substantial size are considered to be eligible. With 351 today's hardware this size may be a kilobyte or more. 
However any 352 object MAY be chosen for chunking in any given message. 354 The eligibility of XDR data items to be candidates for being moved 355 as data chunks (as opposed to being marshaled inline) is not 356 specified by the RPC over RDMA protocol. Chunk eligibility 357 criteria MUST be determined by each upper layer in order to provide 358 for an interoperable specification. One such example with 359 rationale, for the NFS protocol family, is provided in [NFSDDP]. 361 The interface by which an upper layer implementation communicates 362 the eligibility of a data item locally to RPC for chunking is out 363 of scope for this specification. In many implementations, it is 364 possible to implement a transparent RPC chunking facility. 365 However, such implementations may lead to inefficiencies, either 366 because they require the RPC layer to perform expensive 367 registration and deregistration of memory "on the fly", or they may 368 require using RDMA chunks in reply messages, along with the 369 resulting additional handshaking with the RPC over RDMA peer. 370 However, these issues are internal and generally confined to the 371 local interface between RPC and its upper layers, one in which 372 implementations are free to innovate. The only requirement is that 373 the resulting RPC RDMA protocol sent to the peer is valid for the 374 upper layer. See for example [NFSDDP]. 376 When sending any message (request or reply) that contains an 377 eligible large data chunk, the XDR encoding routine avoids moving 378 the data into the XDR stream. Instead, it does not encode the data 379 portion, but records the address and size of each chunk in a 380 separate "read chunk list" encoded within RPC RDMA transport- 381 specific headers. Such chunks will be transferred via RDMA Read 382 operations initiated by the receiver. 384 When the read chunks are to be moved via RDMA, the memory for each 385 chunk is registered. This registration may take place within XDR 386 itself, providing for full transparency to upper layers, or it may 387 be performed by any other specific local implementation. 389 Additionally, when making an RPC call that can result in bulk data 390 transferred in the reply, write chunks MAY be provided to accept 391 the data directly via RDMA Write. These write chunks will 392 therefore be pre-filled by the RPC server prior to responding, and 393 XDR decode of the data at the client will not be required. These 394 chunks undergo a similar registration and advertisement via "write 395 chunk lists" built as a part of XDR encoding. 397 Some RPC client implementations are not able to determine where an 398 RPC call's results reside during the "encode" phase. This makes it 399 difficult or impossible for the RPC client layer to encode the 400 write chunk list at the time of building the request. In this 401 case, it is difficult for the RPC implementation to provide 402 transparency to the RPC consumer, which may require recoding to 403 provide result information at this earlier stage. 405 Therefore if the RPC client does not make a write chunk list 406 available to receive the result, then the RPC server MAY return 407 data inline in the reply, or if the upper layer specification 408 permits, it MAY be returned via a read chunk list. 
It is NOT 409 RECOMMENDED that upper layer RPC client protocol specifications 410 omit write chunk lists for eligible replies, due to the lower 411 performance of the additional handshaking to perform data transfer, 412 and the requirement that the RPC server must expose (and preserve) 413 the reply data for a period of time. In the absence of a server- 414 provided read chunk list in the reply, if the encoded reply 415 overflows the posted receive buffer, the RPC will fail with an RDMA 416 transport error. 418 When any data within a message is provided via either read or write 419 chunks, the chunk itself refers only to the data portion of the XDR 420 stream element. In particular, for counted fields (e.g., a "<>" 421 encoding) the byte count which is encoded as part of the field 422 remains in the XDR stream, and is also encoded in the chunk list. 423 The data portion is however elided from the encoded XDR stream, and 424 is transferred as part of chunk list processing. This is important 425 to maintain upper layer implementation compatibility - both the 426 count and the data must be transferred as part of the logical XDR 427 stream. While the chunk list processing results in the data being 428 available to the upper layer peer for XDR decoding, the length 429 present in the chunk list entries is not. Any byte count in the 430 XDR stream MUST match the sum of the byte counts present in the 431 corresponding read or write chunk list. If they do not agree, an 432 RPC protocol encoding error results. 434 The following items are contained in a chunk list entry. 436 Handle 437 Steering tag or handle obtained when the chunk memory is 438 registered for RDMA. 440 Length 441 The length of the chunk in bytes. 443 Offset 444 The offset or beginning memory address of the chunk. In order 445 to support the widest array of RDMA implementations, as well 446 as the most general steering tag scheme, this field is 447 unconditionally included in each chunk list entry. 449 While zero-based offset schemes are available in many RDMA 450 implementations, their use by RPC requires individual 451 registration of each read or write chunk. On many such 452 implementations this can be a significant overhead. By 453 providing an offset in each chunk, many pre-registration or 454 region-based registrations can be readily supported, and by 455 using a single, universal chunk representation, the RPC RDMA 456 protocol implementation is simplified to its most general 457 form. 459 Position 460 For data which is to be encoded, the position in the XDR 461 stream where the chunk would normally reside. Note that the 462 chunk therefore inserts its data into the XDR stream at this 463 position, but its transfer is no longer "inline". Also note 464 therefore that all chunks belonging to a single RPC argument 465 or result will have the same position. For data which is to 466 be decoded, no position is used. 468 When XDR marshaling is complete, the chunk list is XDR encoded, 469 then sent to the receiver prepended to the RPC message. Any source 470 data for a read chunk, or the destination of a write chunk, remain 471 behind in the sender's registered memory and their actual payload 472 is not marshaled into the request or reply. 474 +----------------+----------------+------------- 475 | RPC over RDMA | | 476 | header w/ | RPC Header | Non-chunk args/results 477 | chunks | | 478 +----------------+----------------+------------- 480 Read chunk lists and write chunk lists are structured somewhat 481 differently. 
This is due to the different usage - read chunks are 482 decoded and indexed by their argument's or result's position in the 483 XDR data stream; their size is always known. Write chunks on the 484 other hand are used only for results, and have neither a 485 preassigned offset in the XDR stream, nor a size until the results 486 are produced, since the buffers may be only partially filled, or 487 may not be used for results at all. Their presence in the XDR 488 stream is therefore not known until the reply is processed. The 489 mapping of Write chunks onto designated NFS procedures and their 490 results is described in [NFSDDP]. 492 Therefore, read chunks are encoded into a read chunk list as a 493 single array, with each entry tagged by its (known) size and its 494 argument's or result's position in the XDR stream. Write chunks 495 are encoded as a list of arrays of RDMA buffers, with each list 496 element (an array) providing buffers for a separate result. 497 Individual write chunk list elements MAY thereby result in being 498 partially or fully filled, or in fact not being filled at all. 499 Unused write chunks, or unused bytes in write chunk buffer lists, 500 are not returned as results, and their memory is returned to the 501 upper layer as part of RPC completion. However, the RPC layer MUST 502 NOT assume that the buffers have not been modified. 504 3.5. XDR Decoding with Read Chunks 506 The XDR decode process moves data from an XDR stream into a data 507 structure provided by the RPC client or server application. Where 508 elements of the destination data structure are buffers or strings, 509 the RPC application can either pre-allocate storage to receive the 510 data, or leave the string or buffer fields null and allow the XDR 511 decode stage of RPC processing to automatically allocate storage of 512 sufficient size. 514 When decoding a message from an RDMA transport, the receiver first 515 XDR decodes the chunk lists from the RPC over RDMA header, then 516 proceeds to decode the body of the RPC message (arguments or 517 results). Whenever the XDR offset in the decode stream matches 518 that of a chunk in the read chunk list, the XDR routine initiates 519 an RDMA Read to bring over the chunk data into locally registered 520 memory for the destination buffer. 522 When processing an RPC request, the RPC receiver (RPC server) 523 acknowledges its completion of use of the source buffers by simply 524 replying to the RPC sender (client), and the peer may then free all 525 source buffers advertised by the request. 527 When processing an RPC reply, after completing such a transfer the 528 RPC receiver (client) MUST issue an RDMA_DONE message (described in 529 Section 3.8) to notify the peer (server) that the source buffers 530 can be freed. 532 The read chunk list is constructed and used entirely within the 533 RPC/XDR layer. Other than specifying the minimum chunk size, the 534 management of the read chunk list is automatic and transparent to 535 an RPC application. 537 3.6. XDR Decoding with Write Chunks 539 When a "write chunk list" is provided for the results of the RPC 540 call, the RPC server MUST provide any corresponding data via RDMA 541 Write to the memory referenced in the chunk list entries. The RPC 542 reply conveys this by returning the write chunk list to the client 543 with the lengths rewritten to match the actual transfer. The XDR 544 "decode" of the reply therefore performs no local data transfer but 545 merely returns the length obtained from the reply. 
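To illustrate, a client-side sketch of this step follows; the structure and function names are hypothetical, and the segment fields mirror the chunk list entry items of Section 3.4, whose formal XDR definition appears in Section 4.3.

   /* Hypothetical client-side "decode" of a result returned in a
    * write chunk: no data movement occurs here, only recovery of the
    * length the server wrote back into the chunk list entry. */
   #include <stdint.h>
   #include <stddef.h>

   struct chunk_segment {
       uint32_t handle;    /* steering tag                  */
       uint32_t length;    /* bytes actually written        */
       uint64_t offset;    /* address or offset of the data */
   };

   struct write_chunk {
       size_t                nsegments;
       struct chunk_segment *segments;
   };

   /* The decoded result length is the sum of the rewritten segment
    * lengths in the write chunk entry used for this result. */
   size_t decoded_result_length(const struct write_chunk *wc)
   {
       size_t total = 0;
       for (size_t i = 0; i < wc->nsegments; i++)
           total += wc->segments[i].length;
       return total;
   }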
547 Each decoded result consumes one entry in the write chunk list, 548 which in turn consists of an array of RDMA segments. The length is 549 therefore the sum of all returned lengths in all segments 550 comprising the corresponding list entry. As each list entry is 551 "decoded", the entire entry is consumed. 553 The write chunk list is constructed and used by the RPC 554 application. The RPC/XDR layer simply conveys the list between 555 client and server and initiates the RDMA Writes back to the client. 556 The mapping of write chunk list entries to procedure arguments MUST 557 be determined for each protocol. An example of a mapping is 558 described in [NFSDDP]. 560 3.7. XDR Roundup and Chunks 562 The XDR protocol requires 4-byte alignment of each new encoded 563 element in any XDR stream. This requirement is for efficiency and 564 ease of decode/unmarshaling at the receiver - if the XDR stream 565 buffer begins on a native machine boundary, then the XDR elements 566 will lie on similarly predictable offsets in memory. 568 Within XDR, when non-4-byte encodes (such as an odd-length string 569 or bulk data) are marshaled, their length is encoded literally, 570 while their data is padded to begin the next element at a 4-byte 571 boundary in the XDR stream. For TCP or RDMA inline encoding, this 572 minimal overhead is required because the transport-specific framing 573 relies on the fact that the relative offset of the elements in the 574 XDR stream from the start of the message determines the XDR 575 position during decode. 577 On the other hand, RPC/RDMA Read chunks carry the XDR position of 578 each chunked element and length of the Chunk segment, and can be 579 placed by the receiver exactly where they belong in the receiver's 580 memory without regard to the alignment of their position in the XDR 581 stream. Since any rounded-up data is not actually part of the 582 upper layer's message, the receiver will not reference it, and 583 there is no reason to set it to any particular value in the 584 receiver's memory. 586 When roundup is present at the end of a sequence of chunks, the 587 length of the sequence will terminate it at a non-4-byte XDR 588 position. When the receiver proceeds to decode the remaining part 589 of the XDR stream, it inspects the XDR position indicated by the 590 next chunk. Because this position will not match (else roundup 591 would not have occurred), the receiver decoding will fall back to 592 inspecting the remaining inline portion. If in turn, no data 593 remains to be decoded from the inline portion, then the receiver 594 MUST conclude that roundup is present, and therefore advances the 595 XDR decode position to that indicated by the next chunk (if any). 596 In this way, roundup is passed without ever actually transferring 597 additional XDR bytes. 599 Some protocol operations over RPC/RDMA, for instance NFS writes of 600 data encountered at the end of a file or in direct i/o situations, 601 commonly yield these roundups within RDMA Read Chunks. Because any 602 roundup bytes are not actually present in the data buffers being 603 written, memory for these bytes would come from noncontiguous 604 buffers, either as an additional memory registration segment, or as 605 an additional Chunk. The overhead of these operations can be 606 significant to both the sender to marshal them, and even higher to 607 the receiver which to transfer them. Senders SHOULD therefore 608 avoid encoding individual RDMA Read Chunks for roundup whenever 609 possible. 
It is acceptable, but not necessary, to include roundup 610 data in an existing RDMA Read Chunk, but only if it is already 611 present in the XDR stream to carry upper layer data. 613 Note that there is no exposure of additional data at the sender due 614 to eliding roundup data from the XDR stream, since any additional 615 sender buffers are never exposed to the peer. The data is 616 literally not there to be transferred. 618 For RDMA Write Chunks, a simpler encoding method applies. Again, 619 roundup bytes are not transferred, instead the chunk length sent to 620 the receiver in the reply is simply increased to include any 621 roundup. Because of the requirement that the RDMA Write chunks are 622 filled sequentially without gaps, this situation can only occur on 623 the final chunk receiving data. Therefore there is no opportunity 624 for roundup data to insert misalignment or positional gaps into the 625 XDR stream. 627 3.8. RPC Call and Reply 629 The RDMA transport for RPC provides three methods of moving data 630 between RPC client and server: 632 Inline 633 Data are moved between RPC client and server within an RDMA 634 Send. 636 RDMA Read 637 Data are moved between RPC client and server via an RDMA Read 638 operation via steering tag, address and offset obtained from a 639 read chunk list. 641 RDMA Write 642 Result data is moved from RPC server to client via an RDMA 643 Write operation via steering tag, address and offset obtained 644 from a write chunk list or reply chunk in the client's RPC 645 call message. 647 These methods of data movement may occur in combinations within a 648 single RPC. For instance, an RPC call may contain some inline data 649 along with some large chunks to be transferred via RDMA Read to the 650 server. The reply to that call may have some result chunks that 651 the server RDMA Writes back to the client. The following protocol 652 interactions illustrate RPC calls that use these methods to move 653 RPC message data: 655 An RPC with write chunks in the call message: 657 RPC Client RPC Server 658 | RPC Call + Write Chunk list | 659 Send | ------------------------------> | 660 | | 661 | Chunk 1 | 662 | <------------------------------ | Write 663 | : | 664 | Chunk n | 665 | <------------------------------ | Write 666 | | 667 | RPC Reply | 668 | <------------------------------ | Send 670 In the presence of write chunks, RDMA ordering provides the 671 guarantee that all data in the RDMA Write operations has been 672 placed in memory prior to the client's RPC reply processing. 
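As a small worked example of the roundup rule, the C fragment below computes the number of pad bytes XDR would normally require after an odd-length item; these are exactly the bytes a sender can avoid transferring when the item is moved as a chunk. The helper name is illustrative only.

   #include <stddef.h>

   /* XDR requires each element to begin on a 4-byte boundary, so an
    * item of length len is normally followed by this many pad bytes. */
   static size_t xdr_pad_bytes(size_t len)
   {
       return (4 - (len & 3)) & 3;
   }

   /* Example: a 5-byte opaque item.  Encoded inline it occupies a
    * 4-byte length, 5 data bytes and xdr_pad_bytes(5) == 3 pad bytes.
    * Moved as a read chunk, only the 5 data bytes are transferred;
    * the receiver, finding nothing further to decode inline and a
    * non-matching position on the next chunk, simply advances its
    * XDR position past the roundup as described above. */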
674 An RPC with read chunks in the call message: 676 RPC Client RPC Server 677 | RPC Call + Read Chunk list | 678 Send | ------------------------------> | 679 | | 680 | Chunk 1 | 681 | +------------------------------ | Read 682 | v-----------------------------> | 683 | : | 684 | Chunk n | 685 | +------------------------------ | Read 686 | v-----------------------------> | 687 | | 688 | RPC Reply | 689 | <------------------------------ | Send 691 An RPC with read chunks in the reply message: 693 RPC Client RPC Server 694 | RPC Call | 695 Send | ------------------------------> | 696 | | 697 | RPC Reply + Read Chunk list | 698 | <------------------------------ | Send 699 | | 700 | Chunk 1 | 701 Read | ------------------------------+ | 702 | <-----------------------------v | 703 | : | 704 | Chunk n | 705 Read | ------------------------------+ | 706 | <-----------------------------v | 707 | | 708 | Done | 709 Send | ------------------------------> | 711 The final Done message allows the RPC client to signal the server 712 that it has received the chunks, so the server can de-register and 713 free the memory holding the chunks. A Done completion is not 714 necessary for an RPC call, since the RPC reply Send is itself a 715 receive completion notification. In the event that the client 716 fails to return the Done message within some timeout period, the 717 server MAY conclude that a protocol violation has occurred and 718 close the RPC connection, or it MAY proceed with a de-register and 719 free its chunk buffers. This may result in a fatal RDMA error if 720 the client later attempts to perform an RDMA Read operation, which 721 amounts to the same thing. 723 The use of read chunks in RPC reply messages is much less efficient 724 than providing write chunks in the originating RPC calls, due to 725 the additional message exchanges, the need for the RPC server to 726 advertise buffers to the peer, the necessity of the server 727 maintaining a timer for the purpose of recovery from misbehaving 728 clients, and the need for additional memory registration. Their 729 use is NOT RECOMMENDED by upper layers where efficiency is a 730 primary concern. [NFSDDP] However, they MAY be employed by upper 731 layer protocol bindings which are primarily concerned with 732 transparency, since they can frequently be implemented completely 733 within the RPC lower layers. 735 It is important to note that the Done message consumes a credit at 736 the RPC server. The RPC server SHOULD provide sufficient credits 737 to the client to allow the Done message to be sent without deadlock 738 (driving the outstanding credit count to zero). The RPC client 739 MUST account for its required Done messages to the server in its 740 accounting of available credits, and the server SHOULD replenish 741 any credit consumed by its use of such exchanges at its earliest 742 opportunity. 744 Finally, it is possible to conceive of RPC exchanges that involve 745 any or all combinations of write chunks in the RPC call, read 746 chunks in the RPC call, and read chunks in the RPC reply. Support 747 for such exchanges is straightforward from a protocol perspective, 748 but in practice such exchanges would be quite rare, limited to 749 upper layer protocol exchanges which transferred bulk data in both 750 the call and corresponding reply. 752 3.9. Padding 754 Alignment of specific opaque data enables certain scatter/gather 755 optimizations. 
Padding leverages the useful property that RDMA 756 transfers preserve alignment of data, even when they are placed 757 into pre-posted receive buffers by Sends. 759 Many servers can make good use of such padding. Padding allows the 760 chaining of RDMA receive buffers such that any data transferred by 761 RDMA on behalf of RPC requests will be placed into appropriately 762 aligned buffers on the system that receives the transfer. In this 763 way, the need for servers to perform RDMA Read to satisfy all but 764 the largest client writes is obviated. 766 The effect of padding is demonstrated below showing prior bytes on 767 an XDR stream (XXX) followed by an opaque field consisting of four 768 length bytes (LLLL) followed by data bytes (DDDD). The receiver of 769 the RDMA Send has posted two chained receive buffers. Without 770 padding, the opaque data is split across the two buffers. With the 771 addition of padding bytes ("ppp" in the figure below) prior to the 772 first data byte, the data can be forced to align correctly in the 773 second buffer. 775 Buffer 1 Buffer 2 776 Unpadded -------------- -------------- 778 XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD 780 Padded 782 XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD 784 Padding is implemented completely within the RDMA transport 785 encoding, flagged with a specific message type. Where padding is 786 applied, two values are passed to the peer: an "rdma_align" which 787 is the padding value used, and "rdma_thresh", which is the opaque 788 data size at or above which padding is applied. For instance, if 789 the server is using chained 4 KB receive buffers, then up to (4 KB 790 - 1) padding bytes could be used to achieve alignment of the data. 791 The XDR routine at the peer MUST consult these values when decoding 792 opaque values. Where the decoded length exceeds the rdma_thresh, 793 the XDR decode MUST skip over the appropriate padding as indicated 794 by rdma_align and the current XDR stream position. 796 4. RPC RDMA Message Layout 798 RPC call and reply messages are conveyed across an RDMA transport 799 with a prepended RPC over RDMA header. The RPC over RDMA header 800 includes data for RDMA flow control credits, padding parameters and 801 lists of addresses that provide direct data placement via RDMA Read 802 and Write operations. The layout of the RPC message itself is 803 unchanged from that described in [RFC1831bis] except for the 804 possible exclusion of large data chunks that will be moved by RDMA 805 Read or Write operations. If the RPC message (along with the RPC 806 over RDMA header) is too long for the posted receive buffer (even 807 after any large chunks are removed), then the entire RPC message 808 MAY be moved separately as a chunk, leaving just the RPC over RDMA 809 header in the RDMA Send. 811 4.1. RPC over RDMA Header 813 The RPC over RDMA header begins with four 32-bit fields that are 814 always present and which control the RDMA interaction including 815 RDMA-specific flow control. These are then followed by a number of 816 items such as chunk lists and padding which MAY or MUST NOT be 817 present depending on the type of transmission. The four fields 818 which are always present are: 820 1. Transaction ID (XID). 821 The XID generated for the RPC call and reply. Having the XID 822 at the beginning of the message makes it easy to establish the 823 message context. This XID MUST be the same as the XID in the 824 RPC header. 
The receiver MAY perform its processing based
825  solely on the XID in the RPC over RDMA header, and thereby
826  ignore the XID in the RPC header, if it so chooses.

828  2. Version number.
829     This version of the RPC RDMA message protocol is 1.  The
830     version number MUST be increased by one whenever the format of
831     the RPC RDMA messages is changed.

833  3. Flow control credit value.
834     When sent in an RPC call message, the requested value is
835     provided.  When sent in an RPC reply message, the granted
836     value is returned.  RPC calls SHOULD NOT be sent in excess of
837     the currently granted limit.

839  4. Message type.

841     o  RDMA_MSG = 0 indicates that chunk lists and RPC message
842        follow.

844     o  RDMA_NOMSG = 1 indicates that after the chunk lists there
845        is no RPC message.  In this case, the chunk lists provide
846        information to allow the message proper to be transferred
847        using RDMA Read or Write and thus is not appended to the
848        RPC over RDMA header.

850     o  RDMA_MSGP = 2 indicates that a chunk list and RPC message
851        with some padding follow.

853     o  RDMA_DONE = 3 indicates that the message signals the
854        completion of a chunk transfer via RDMA Read.

856     o  RDMA_ERROR = 4 is used to signal any detected error(s) in
857        the RPC RDMA chunk encoding.

859  Because the version number is encoded as part of this header, and
860  the RDMA_ERROR message type is used to indicate errors, these first
861  four fields and the start of the following message body MUST always
862  remain aligned at these fixed offsets for all versions of the RPC
863  over RDMA header.

865  For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
866  chunk lists follow.  If the Read chunk list is null (a 32 bit word
867  of zeros), then there are no chunks to be transferred separately
868  and the RPC message follows in its entirety.  If non-null, then
869  it's the beginning of an XDR encoded sequence of Read chunk list
870  entries.  If the Write chunk list is non-null, then an XDR encoded
871  sequence of Write chunk entries follows.

873  If the message type is RDMA_MSGP, then two additional fields that
874  specify the padding alignment and threshold are inserted prior to
875  the Read and Write chunk lists.

877  A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
878  the RPC call or RPC reply message body, beginning with the XID.
879  The XID in the RDMA_MSG or RDMA_MSGP header MUST match this.

881  +--------+---------+---------+-----------+-------------+----------
882  |        |         |         |  Message  |    NULLs    | RPC Call
883  |  XID   | Version | Credits |   Type    |     or      |   or
884  |        |         |         |           | Chunk Lists | Reply Msg
885  +--------+---------+---------+-----------+-------------+----------

887  Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
888  RPC message follows.  As an implementation hint: a gather operation
889  on the Send of the RDMA RPC message can be used to marshal the
890  initial header, the chunk list, and the RPC message itself.

892  4.2. RPC over RDMA header errors

894  When a peer receives an RPC RDMA message, it MUST perform the
895  following basic validity checks on the header and chunk contents.
896  If such errors are detected in the request, an RDMA_ERROR reply
897  MUST be generated.

899  Two types of errors are defined: version mismatch and invalid chunk
900  format.
When the peer detects an RPC over RDMA header version 901 which it does not support (currently this draft defines only 902 version 1), it replies with an error code of ERR_VERS, and provides 903 the low and high inclusive version numbers it does, in fact, 904 support. The version number in this reply MUST be any value 905 otherwise valid at the receiver. When other decoding errors are 906 detected in the header or chunks, either an RPC decode error MAY be 907 returned, or the ROC/RDMA error code ERR_CHUNK MUST be returned. 909 4.3. XDR Language Description 911 Here is the message layout in XDR language. 913 struct xdr_rdma_segment { 914 uint32 handle; /* Registered memory handle */ 915 uint32 length; /* Length of the chunk in bytes */ 916 uint64 offset; /* Chunk virtual address or offset */ 917 }; 919 struct xdr_read_chunk { 920 uint32 position; /* Position in XDR stream */ 921 struct xdr_rdma_segment target; 922 }; 924 struct xdr_read_list { 925 struct xdr_read_chunk entry; 926 struct xdr_read_list *next; 927 }; 929 struct xdr_write_chunk { 930 struct xdr_rdma_segment target<>; 931 }; 933 struct xdr_write_list { 934 struct xdr_write_chunk entry; 935 struct xdr_write_list *next; 936 }; 938 struct rdma_msg { 939 uint32 rdma_xid; /* Mirrors the RPC header xid */ 940 uint32 rdma_vers; /* Version of this protocol */ 941 uint32 rdma_credit; /* Buffers requested/granted */ 942 rdma_body rdma_body; 943 }; 945 enum rdma_proc { 946 RDMA_MSG=0, /* An RPC call or reply msg */ 947 RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */ 948 RDMA_MSGP=2, /* An RPC call or reply msg with padding */ 949 RDMA_DONE=3, /* Client signals reply completion */ 950 RDMA_ERROR=4 /* An RPC RDMA encoding error */ 951 }; 952 union rdma_body switch (rdma_proc proc) { 953 case RDMA_MSG: 954 rpc_rdma_header rdma_msg; 955 case RDMA_NOMSG: 956 rpc_rdma_header_nomsg rdma_nomsg; 957 case RDMA_MSGP: 958 rpc_rdma_header_padded rdma_msgp; 959 case RDMA_DONE: 960 void; 961 case RDMA_ERROR: 962 rpc_rdma_error rdma_error; 963 }; 965 struct rpc_rdma_header { 966 struct xdr_read_list *rdma_reads; 967 struct xdr_write_list *rdma_writes; 968 struct xdr_write_chunk *rdma_reply; 969 /* rpc body follows */ 970 }; 972 struct rpc_rdma_header_nomsg { 973 struct xdr_read_list *rdma_reads; 974 struct xdr_write_list *rdma_writes; 975 struct xdr_write_chunk *rdma_reply; 976 }; 978 struct rpc_rdma_header_padded { 979 uint32 rdma_align; /* Padding alignment */ 980 uint32 rdma_thresh; /* Padding threshold */ 981 struct xdr_read_list *rdma_reads; 982 struct xdr_write_list *rdma_writes; 983 struct xdr_write_chunk *rdma_reply; 984 /* rpc body follows */ 985 }; 986 enum rpc_rdma_errcode { 987 ERR_VERS = 1, 988 ERR_CHUNK = 2 989 }; 991 union rpc_rdma_error switch (rpc_rdma_errcode err) { 992 case ERR_VERS: 993 uint32 rdma_vers_low; 994 uint32 rdma_vers_high; 995 case ERR_CHUNK: 996 void; 997 default: 998 uint32 rdma_extra[8]; 999 }; 1001 5. Long Messages 1003 The receiver of RDMA Send messages is required by RDMA to have 1004 previously posted one or more adequately sized buffers. The RPC 1005 client can inform the server of the maximum size of its RDMA Send 1006 messages via the Connection Configuration Protocol described later 1007 in this document. 1009 Since RPC messages are frequently small, memory savings can be 1010 achieved by posting small buffers. Even large messages like NFS 1011 READ or WRITE will be quite small once the chunks are removed from 1012 the message. 
However, there may be large messages that would 1013 demand a very large buffer be posted, where the contents of the 1014 buffer may not be a chunkable XDR element. A good example is an 1015 NFS READDIR reply which may contain a large number of small 1016 filename strings. Also, the NFS version 4 protocol [RFC3530] 1017 features COMPOUND request and reply messages of unbounded length. 1019 Ideally, each upper layer will negotiate these limits. However, it 1020 is frequently necessary to provide a transparent solution. 1022 5.1. Message as an RDMA Read Chunk 1024 One relatively simple method is to have the client identify any RPC 1025 message that exceeds the RPC server's posted buffer size and move 1026 it separately as a chunk, i.e., reference it as the first entry in 1027 the read chunk list with an XDR position of zero. 1029 Normal Message 1031 +--------+---------+---------+------------+-------------+---------- 1032 | | | | | | RPC Call 1033 | XID | Version | Credits | RDMA_MSG | Chunk Lists | or 1034 | | | | | | Reply Msg 1035 +--------+---------+---------+------------+-------------+---------- 1037 Long Message 1039 +--------+---------+---------+------------+-------------+ 1040 | | | | | | 1041 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1042 | | | | | | 1043 +--------+---------+---------+------------+-------------+ 1044 | 1045 | +---------- 1046 | | Long RPC Call 1047 +->| or 1048 | Reply Message 1049 +---------- 1051 If the receiver gets an RPC over RDMA header with a message type of 1052 RDMA_NOMSG and finds an initial read chunk list entry with a zero 1053 XDR position, it allocates a registered buffer and issues an RDMA 1054 Read of the long RPC message into it. The receiver then proceeds 1055 to XDR decode the RPC message as if it had received it inline with 1056 the Send data. Further decoding may issue additional RDMA Reads to 1057 bring over additional chunks. 1059 Although the handling of long messages requires one extra network 1060 turnaround, in practice these messages will be rare if the posted 1061 receive buffers are correctly sized, and of course they will be 1062 non-existent for RDMA-aware upper layers. 1064 A long call RPC with request supplied via RDMA Read 1066 RPC Client RPC Server 1067 | RDMA over RPC Header | 1068 Send | ------------------------------> | 1069 | | 1070 | Long RPC Call Msg | 1071 | +------------------------------ | Read 1072 | v-----------------------------> | 1073 | | 1074 | RDMA over RPC Reply | 1075 | <------------------------------ | Send 1077 An RPC with long reply returned via RDMA Read 1079 RPC Client RPC Server 1080 | RPC Call | 1081 Send | ------------------------------> | 1082 | | 1083 | RDMA over RPC Header | 1084 | <------------------------------ | Send 1085 | | 1086 | Long RPC Reply Msg | 1087 Read | ------------------------------+ | 1088 | <-----------------------------v | 1089 | | 1090 | Done | 1091 Send | ------------------------------> | 1093 It is possible for a single RPC procedure to employ both a long 1094 call for its arguments, and a long reply for its results. However, 1095 such an operation is atypical, as few upper layers define such 1096 exchanges. 1098 5.2. RDMA Write of Long Replies (Reply Chunks) 1100 A superior method of handling long RPC replies is to have the RPC 1101 client post a large buffer into which the server can write a large 1102 RPC reply. 
This has the advantage that an RDMA Write may be
1103  slightly faster in network latency than an RDMA Read, and does not
1104  require the server to wait for the completion as it must for RDMA
1105  Read.  Additionally, for a reply it removes the need for an
1106  RDMA_DONE message if the large reply is returned as a Read chunk.

1108  This protocol supports direct return of a large reply via the
1109  inclusion of an OPTIONAL rdma_reply write chunk after the read
1110  chunk list and the write chunk list.  The client allocates a buffer
1111  sized to receive a large reply and enters its steering tag, address
1112  and length in the rdma_reply write chunk.  If the reply message is
1113  too long to return inline with an RDMA Send (exceeds the size of
1114  the client's posted receive buffer), even with read chunks removed,
1115  then the RPC server performs an RDMA Write of the RPC reply message
1116  into the buffer indicated by the rdma_reply chunk.  If the client
1117  doesn't provide an rdma_reply chunk, or if it's too small, then if
1118  the upper layer specification permits, the message MAY be returned
1119  as a Read chunk.

1121  An RPC with long reply returned via RDMA Write

1123      RPC Client                           RPC Server
1124          |     RPC Call with rdma_reply       |
1125     Send | ------------------------------>    |
1126          |                                    |
1127          |       Long RPC Reply Msg           |
1128          | <------------------------------    | Write
1129          |                                    |
1130          |       RPC over RDMA Header         |
1131          | <------------------------------    | Send

1133  The use of RDMA Write to return long replies requires that the
1134  client application anticipate a long reply and have some knowledge
1135  of its size so that an adequately sized buffer can be allocated.
1136  This is certainly true of NFS READDIR replies, where the client
1137  already provides an upper bound on the size of the encoded
1138  directory fragment to be returned by the server.

1140  The use of these "reply chunks" is highly efficient and convenient
1141  for both RPC client and server.  Their use is encouraged for
1142  eligible RPC operations such as NFS READDIR, which would otherwise
1143  require extensive chunk management within the results or use of
1144  RDMA Read and a Done message.  [NFSDDP]

1146  6. Connection Configuration Protocol

1148  RDMA Send operations require the receiver to post one or more
1149  buffers at the RDMA connection endpoint, each large enough to
1150  receive the largest Send message.  Buffers are consumed as Send
1151  messages are received.  If a buffer is too small, or if there are
1152  no buffers posted, the RDMA transport MAY return an error and break
1153  the RDMA connection.  The receiver MUST post sufficient, adequately
1154  sized buffers to avoid buffer overrun or capacity errors.

1156  The protocol described above includes only a mechanism for managing
1157  the number of such receive buffers, and no explicit features to
1158  allow the RPC client and server to provision or control buffer
1159  sizing, nor any other session parameters.

1161  In the past, this type of connection management has not been
1162  necessary for RPC.  RPC over UDP or TCP does not have a protocol to
1163  negotiate the link.  The server can get a rough idea of the maximum
1164  size of messages from the server protocol code.  However, a
1165  protocol to negotiate transport features on a more dynamic basis is
1166  desirable.

1168  The Connection Configuration Protocol allows the client to pass its
1169  connection requirements to the server, and allows the server to
1170  inform the client of its connection limits.
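Before turning to that protocol, the server's choice among the reply methods described in Sections 3.8, 5.1 and 5.2 can be summarized in a short C sketch. The names and values are hypothetical; the ordering simply restates the rules above: reply inline when the message fits the client's posted receive buffer, prefer a client-provided rdma_reply chunk for long replies, fall back to a read chunk only where the upper layer permits it, and otherwise fail with a transport error.

   /* Hypothetical server-side reply path selection. */
   #include <stdbool.h>
   #include <stddef.h>

   enum reply_method {
       REPLY_INLINE,       /* RDMA Send into the posted receive buffer */
       REPLY_WRITE_CHUNK,  /* RDMA Write into the rdma_reply chunk     */
       REPLY_READ_CHUNK,   /* expose as a read chunk; needs RDMA_DONE  */
       REPLY_FAIL          /* reply cannot be delivered                */
   };

   enum reply_method
   choose_reply_method(size_t reply_len,          /* encoded reply size */
                       size_t client_recv_size,   /* posted buffer size */
                       size_t reply_chunk_size,   /* 0 if not provided  */
                       bool   upper_layer_allows_read_chunk)
   {
       if (reply_len <= client_recv_size)
           return REPLY_INLINE;
       if (reply_chunk_size >= reply_len)
           return REPLY_WRITE_CHUNK;
       if (upper_layer_allows_read_chunk)
           return REPLY_READ_CHUNK;
       return REPLY_FAIL;    /* reply overflows the posted buffer */
   }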
1146   6.  Connection Configuration Protocol

1148   RDMA Send operations require the receiver to post one or more
1149   buffers at the RDMA connection endpoint, each large enough to
1150   receive the largest Send message.  Buffers are consumed as Send
1151   messages are received.  If a buffer is too small, or if there are
1152   no buffers posted, the RDMA transport MAY return an error and break
1153   the RDMA connection.  The receiver MUST post sufficient, adequately
1154   sized buffers to avoid buffer overrun or capacity errors.

1156   The protocol described above includes only a mechanism for managing
1157   the number of such receive buffers, and no explicit features to
1158   allow the RPC client and server to provision or control buffer
1159   sizing, nor any other session parameters.

1161   In the past, this type of connection management has not been
1162   necessary for RPC.  RPC over UDP or TCP does not have a protocol to
1163   negotiate the link.  The server can get a rough idea of the maximum
1164   size of messages from the server protocol code.  However, a
1165   protocol to negotiate transport features on a more dynamic basis is
1166   desirable.

1168   The Connection Configuration Protocol allows the client to pass its
1169   connection requirements to the server, and allows the server to
1170   inform the client of its connection limits.

1172   Use of the Connection Configuration Protocol by an upper layer is
1173   OPTIONAL.

1175   6.1.  Initial Connection State

1177   This protocol MAY be used for connection setup prior to the use of
1178   another RPC protocol that uses the RDMA transport.  It operates in-
1179   band, i.e., it uses the connection itself to negotiate the
1180   connection parameters.  To provide a basis for connection
1181   negotiation, the connection is assumed to provide a basic level of
1182   interoperability: the ability to exchange at least one RPC message
1183   at a time that is at least 1 KB in size.  The server MAY exceed
1184   this basic level of configuration, but the client MUST NOT assume
1185   more than one, and MUST receive a valid reply from the server
1186   carrying the actual number of available receive messages, prior to
1187   sending its next request.

1189   6.2.  Protocol Description

1191   Version 1 of the Connection Configuration Protocol consists of a
1192   single procedure that allows the client to inform the server of its
1193   connection requirements and the server to return connection
1194   information to the client.

1196   The maxcall_sendsize argument is the maximum size of an RPC call
1197   message that the client MAY send inline in an RDMA Send message to
1198   the server.  The server MAY return a maxcall_sendsize value that is
1199   smaller or larger than the client's request.  The client MUST NOT
1200   send an inline call message larger than what the server will
1201   accept.  The maxcall_sendsize limits only the size of inline RPC
1202   calls.  It does not limit the size of long RPC messages transferred
1203   as an initial chunk in the Read chunk list.

1205   The maxreply_sendsize is the maximum size of an inline RPC message
1206   that the client will accept from the server.

1208   The maxrdmaread is the maximum number of RDMA Reads which may be
1209   active at the peer.  This number correlates to the incoming RDMA
1210   Read count ("IRD") configured into each originating endpoint by the
1211   client or server.  If more than this number of RDMA Read
1212   operations by the connected peer are issued simultaneously,
1213   connection loss or suboptimal flow control may result; therefore,
1214   the value SHOULD be observed at all times.  The peers' values need
1215   not be equal.  If zero, the peer MUST NOT issue requests that
1216   require RDMA Read to satisfy, as no transfer will be possible.

1218   The align value is the alignment recommended by the server for
1219   opaque data values such as strings and counted byte arrays.  The
1220   client MAY use this value to compute the number of prepended pad
1221   bytes when XDR encoding opaque values in the RPC call message.

1223      typedef unsigned int uint32;

1225      struct config_rdma_req {
1226              uint32  maxcall_sendsize;
1227                      /* max size of inline RPC call */
1228              uint32  maxreply_sendsize;
1229                      /* max size of inline RPC reply */
1230              uint32  maxrdmaread;
1231                      /* max active RDMA Reads at client */
1232      };

1234      struct config_rdma_reply {
1235              uint32  maxcall_sendsize;
1236                      /* max call size accepted by server */
1237              uint32  align;
1238                      /* server's receive buffer alignment */
1239              uint32  maxrdmaread;
1240                      /* max active RDMA Reads at server */
1241      };

1243      program CONFIG_RDMA_PROG {
1244              version VERS1 {
1245                      /*
1246                       * Config call/reply
1247                       */
1248                      config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
1249              } = 1;
1250      } = 100417;
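   The pad computation implied by the align value is a one-line
   calculation.  The following non-normative C sketch shows one way a
   client might compute the number of prepended pad bytes for an opaque
   value; the function name and the example figures are illustrative
   only.

      /*
       * Sketch: pad bytes to prepend so that opaque data beginning at
       * xdr_offset within the Send buffer starts on an align-byte
       * boundary in the server's receive buffer (see section 3.9,
       * Padding).  Illustrative only.
       */
      #include <stdint.h>

      static uint32_t rpcrdma_pad_bytes(uint32_t xdr_offset, uint32_t align)
      {
              if (align <= 1)
                      return 0;            /* no alignment preference */
              uint32_t misalign = xdr_offset % align;
              return misalign ? align - misalign : 0;
      }

      /*
       * Example: with align = 4096 and opaque data that would otherwise
       * begin at XDR offset 148, the client prepends 3948 pad bytes so
       * that the data lands page-aligned at the server.
       */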
1252   7.  Memory Registration Overhead

1254   RDMA requires that all data be transferred between registered
1255   memory regions at the source and destination.  All protocol headers
1256   as well as separately transferred data chunks use registered
1257   memory.  Since the cost of registering and de-registering memory
1258   can be a large proportion of the RDMA transaction cost, it is
1259   important to minimize registration activity.  This is easily
1260   achieved within RPC-controlled memory by allocating chunk list data
1261   and RPC headers in a reusable way from pre-registered pools.

1263   The data chunks transferred via RDMA MAY occupy memory that
1264   persists outside the bounds of the RPC transaction.  Hence, the
1265   default behavior of an RPC over RDMA transport is to register and
1266   de-register these chunks on every transaction.  However, this is
1267   not a limitation of the protocol, only of the existing local RPC
1268   API.  The API is easily extended through such functions as
1269   rpc_control(3) to change the default behavior so that the
1270   application can assume responsibility for controlling memory
1271   registration through an RPC-provided registered memory allocator.

1273   8.  Errors and Error Recovery

1275   RPC RDMA protocol errors are described in section 4.  RPC errors
1276   and RPC error recovery are not affected by the protocol, and
1277   proceed as for any RPC error condition.  RDMA transport error
1278   reporting and recovery are outside the scope of this protocol.

1280   It is assumed that the link itself will provide some degree of
1281   error detection and retransmission.  iWARP's MPA layer (when used
1282   over TCP), SCTP, and the Infiniband link layer all provide CRC
1283   protection of the RDMA payload, and CRC-class protection is a
1284   general attribute of such transports.  Additionally, the RPC layer
1285   itself can accept errors from the link level and recover via
1286   retransmission.  RPC recovery can handle complete loss and re-
1287   establishment of the link.

1289   See section 11 for further discussion of the use of RPC-level
1290   integrity schemes to detect errors, and related efficiency issues.

1292   9.  Node Addressing

1294   In setting up a new RDMA connection, the first action by an RPC
1295   client will be to obtain a transport address for the server.  The
1296   mechanism used to obtain this address, and to open an RDMA
1297   connection, is dependent on the type of RDMA transport, and is the
1298   responsibility of each RPC protocol binding and its local
1299   implementation.

1301   10.  RPC Binding

1303   RPC services normally register with a portmap or rpcbind [RFC1833]
1304   service, which associates an RPC program number with a service
1305   address.  (In the case of UDP or TCP, the service address for NFS
1306   is normally port 2049.)  This policy is no different with RDMA
1307   interconnects, although it may require the allocation of port
1308   numbers appropriate to each upper layer binding that uses the RPC
1309   framing defined here.

1311   When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
1312   IP port addressing due to its layering on TCP and/or SCTP, port
1313   mapping is trivial and consists merely of issuing the port in the
1314   connection process.  The NFS/RDMA protocol service address has been
1315   assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
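   As an illustration of issuing the port in the connection process,
   the non-normative C sketch below uses the librdmacm connection
   manager to address the NFS/RDMA service at its IANA-assigned port
   20049.  Protection domain selection, queue pair creation, and the
   subsequent rdma_connect() call are elided; nothing in the sketch is
   specific to this protocol, and the routine name is illustrative.

      /*
       * Sketch: resolving the NFS/RDMA service at port 20049 with the
       * RDMA connection manager (librdmacm).  Queue pair setup and
       * rdma_connect() are left to the caller.
       */
      #include <rdma/rdma_cma.h>
      #include <string.h>

      struct rdma_cm_id *resolve_nfs_rdma(const char *server)
      {
              struct rdma_addrinfo hints, *res = NULL;
              struct rdma_cm_id *id = NULL;

              memset(&hints, 0, sizeof(hints));
              hints.ai_port_space = RDMA_PS_TCP;   /* IP port addressing */

              /* The well-known port alone names the service. */
              if (rdma_getaddrinfo(server, "20049", &hints, &res))
                      return NULL;

              if (rdma_create_ep(&id, res, NULL, NULL))
                      id = NULL;
              rdma_freeaddrinfo(res);
              return id;  /* caller creates a QP on 'id', then connects */
      }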
1317   When mapped atop Infiniband [IB], which uses a GID-based service
1318   endpoint naming scheme, a translation MUST be employed.  One such
1319   translation is defined in the Infiniband Port Addressing Annex
1320   [IBPORT], which is appropriate for translating IP port addressing
1321   to the Infiniband network.  Therefore, in this case, IP port
1322   addressing may be readily employed by the upper layer.

1324   When a mapping standard or convention exists for IP ports on an
1325   RDMA interconnect, there are several possibilities for each upper
1326   layer to consider:

1328      One possibility is to have an upper layer server register its
1329      mapped IP port with the rpcbind service, under the netid (or
1330      netids) defined here.  An RPC/RDMA-aware client can then
1331      resolve its desired service to a mappable port, and proceed to
1332      connect.  This is the most flexible and compatible approach
1333      for those upper layers that are defined to use the rpcbind
1334      service.

1336      A second possibility is to have the server's portmapper
1337      register itself on the RDMA interconnect at a "well known"
1338      service address.  (On UDP or TCP, this corresponds to port
1339      111.)  A client could connect to this service address and use
1340      the portmap protocol to obtain a service address in response
1341      to a program number, e.g., an iWARP port number or an
1342      Infiniband GID.

1344      Alternatively, the client could simply connect to the mapped
1345      well-known port for the service itself, if it is appropriately
1346      defined.  By convention, the NFS/RDMA service, when operating
1347      atop such an Infiniband fabric, will use the same 20049
1348      assignment as for iWARP.

1350   Historically, different RPC protocols have taken different
1351   approaches to their port assignment; therefore, the specific method
1352   is left to each RPC/RDMA-enabled upper layer binding, and not
1353   addressed here.

1355   This specification defines two new "netid" values, to be used for
1356   registration of upper layers atop iWARP [RFC5040, RFC5041] and
1357   (when a suitable port translation service is available) Infiniband
1358   [IB], in section 12, "IANA Considerations."  Additional RDMA-capable
1359   networks MAY define their own netids, or, if they provide a port
1360   translation, MAY share the ones defined here.

1362   11.  Security Considerations

1364   RPC provides its own security via the RPCSEC_GSS framework
1365   [RFC2203].  RPCSEC_GSS can provide message authentication,
1366   integrity checking, and privacy.  This security mechanism will be
1367   unaffected by the RDMA transport.  The data integrity and privacy
1368   features alter the body of the message, presenting it as a single
1369   chunk.  For large messages, the chunk may be large enough to
1370   qualify for RDMA Read transfer.  However, there is much data
1371   movement associated with computation and verification of integrity,
1372   or encryption/decryption, so certain performance advantages may be
1373   lost.

1375   For efficiency, a more appropriate security mechanism for RDMA
1376   links may be link-level protection, such as certain configurations
1377   of IPsec, which may be co-located in the RDMA hardware.  The use of
1378   link-level protection MAY be negotiated through the use of the new
1379   RPCSEC_GSS mechanism defined in [RPCSECGSSV2] in conjunction with
1380   the Channel Binding mechanism [RFC5056] and IPsec Channel
1381   Connection Latching [BTNSLATCH].  Use of such mechanisms is
1382   REQUIRED where integrity and/or privacy is desired, and where
1383   efficiency is required.

1385   An additional consideration is the protection of the integrity and
1386   privacy of local memory by the RDMA transport itself.  The use of
1387   RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
1388   contents, or to memory owned by user processes.  These protections
1389   are provided by the RDMA layer specifications, and specifically
1390   their security models.  It is REQUIRED that any RDMA provider used
1391   for RPC transport be conformant to the requirements of [RFC5042] in
1392   order to satisfy these protections.

1394   Once delivered securely by the RDMA provider, any RDMA-exposed
1395   addresses will contain only RPC payloads in the chunk lists,
1396   transferred under the protection of RPCSEC_GSS integrity and
1397   privacy.  By these means, the data will be protected end-to-end, as
1398   required by the RPC layer security model.

1400   Where upper layer protocols choose to supply results to the
1401   requester via Read chunks, a server resource deficit can arise if
1402   the client does not promptly acknowledge their retrieval via the
1403   RDMA_DONE message.  This can potentially lead to a denial of
1404   service situation, with a single client unfairly (and
1405   unnecessarily) consuming server RDMA resources.  Servers for such
1406   upper layer protocols MUST protect against this situation,
1407   originating from one or many clients.  For example, a time-based
1408   window of buffer availability may be offered; if the client fails
1409   to obtain the data within the window, it simply retries using
1410   ordinary RPC retry semantics.  A more severe method would be
1411   for the server to simply close the client's RDMA connection,
1412   freeing the RDMA resources so that the server can reclaim them.

1414   A fairer and more useful method is provided by the protocol itself.
1415   The server MAY use the rdma_credit value to limit the number of
1416   outstanding requests for each client.  By including the number of
1417   outstanding RDMA_DONE completions in the computation of available
1418   client credits, the server can limit its exposure to each client,
1419   and therefore provide uninterrupted service as its resources
1420   permit.

1422   However, the server must ensure that it does not decrease the
1423   credit count to zero with this method, since the RDMA_DONE message
1424   is not acknowledged.  If the credit count were to drop to zero
1425   solely due to outstanding RDMA_DONE messages, the client would
1426   deadlock, since it would never obtain a new credit with which to
1427   continue.  Therefore, if the server adjusts credits to zero for
1428   outstanding RDMA_DONE, it MUST withhold its reply to at least one
1429   message in order to provide the next credit.  The time-based window
1430   (or any other appropriate method) SHOULD be used by the server to
1431   recover resources in the event that the client never returns.

1433   The "Connection Configuration Protocol", when used, MUST be
1434   protected by an appropriate RPC security flavor, to ensure it is
1435   not attacked in the process of initiating an RPC/RDMA connection.

1437   12.  IANA Considerations

1439   Three new assignments are specified by this document:

1441   - A new set of RPC "netids" for resolving RPC/RDMA services

1443   - Optional service port assignments for upper layer bindings

1444   - An RPC program number assignment for the configuration protocol

1446   These assignments have been established, as described below.

1448   The new RPC transport has been assigned an RPC "netid", which is an
1449   rpcbind [RFC1833] string used to describe the underlying protocol
1450   in order for RPC to select the appropriate transport framing, as
1451   well as the format of the service addresses and ports.
1453 The following "nc_proto" registry strings are defined for this 1454 purpose: 1456 NC_RDMA "rdma" 1457 NC_RDMA6 "rdma6" 1459 These netids MAY be used for any RDMA network satisfying the 1460 requirements of section 2, and able to identify service endpoints 1461 using IP port addressing, possibly through use of a translation 1462 service as described above in section 10, RPC Binding. The "rdma" 1463 netid is to be used when IPv4 addressing is employed by the 1464 underlying transport, and "rdma6" for IPv6 addressing. 1466 The netid assignment policy and registry are defined in [IANA- 1467 NETID]. 1469 As a new RPC transport, this protocol has no effect on RPC program 1470 numbers or existing registered port numbers. However, new port 1471 numbers MAY be registered for use by RPC/RDMA-enabled services, as 1472 appropriate to the new networks over which the services will 1473 operate. 1475 For example, the NFS/RDMA service defined in [NFSDDP] has been 1476 assigned the port 20049, in the IANA registry: 1478 nfsrdma 20049/tcp Network File System (NFS) over RDMA 1479 nfsrdma 20049/udp Network File System (NFS) over RDMA 1480 nfsrdma 20049/sctp Network File System (NFS) over RDMA 1482 The OPTIONAL Connection Configuration protocol described herein 1483 requires an RPC program number assignment. The value "100417" has 1484 been assigned: 1486 rdmaconfig 100417 rpc.rdmaconfig 1488 The RPC program number assignment policy and registry are defined 1489 in [RFC1831bis]. 1491 13. Acknowledgments 1493 The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, 1494 Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve 1495 Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David 1496 Robinson and Mallikarjun Chadalapaka for their contributions to 1497 this document. 1499 14. Normative References 1501 [RFC2119] 1502 S. Bradner, "Key words for use in RFCs to Indicate Requirement 1503 Levels", Best Current Practice, BCP 14, RFC 2119, March 1997. 1505 [RFC1831bis] 1506 R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol 1507 Specification Version 2", Internet Draft Work in Progress, 1508 draft-ietf-nfsv4-rfc1831bis 1510 [RFC4506] 1511 M. Eisler Ed., "XDR: External Data Representation Standard", 1512 Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt 1514 [RFC1833] 1515 R. Srinivasan, "Binding Protocols for ONC RPC Version 2", 1516 Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt 1518 [RFC2203] 1519 M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol 1520 Specification", Standards Track RFC, 1521 http://www.ietf.org/rfc/rfc2203.txt 1523 [RPCSECGSSV2] 1524 M. Eisler, "RPCSEC_GSS Version 2", Internet Draft Work in 1525 Progress, draft-ietf-nfsv4-rpcsec-gss-v2 1527 [RFC5056] 1528 N. Williams, "On the Use of Channel Bindings to Secure 1529 Channels", Standards Track RFC 1530 http://www.ietf.org/rfc/rfc5056.txt 1532 [BTNSLATCH] 1533 N. Williams, "IPsec Channels: Connection Latching", Internet 1534 Draft Work in Progress, draft-ietf-btns-connection-latching 1536 [RFC5042] 1537 J. Pinkerton, E. Deleganes, "Direct Data Placement Protocol 1538 (DDP) / Remote Direct Memory Access Protocol (RDMAP) 1539 Security", Standards Track RFC, 1540 http://www.ietf.org/rfc/rfc5042.txt 1542 [IANA-NETID] 1543 M. Eisler, "IANA Considerations for RPC Net Identifiers and 1544 Universal Address Formats", Internet Draft Work in Progress, 1545 draft-ietf-nfsv4-rpc-netid 1547 15. 
Informative References 1549 [RFC1094] 1550 Sun Microsystems, "NFS: Network File System Protocol 1551 Specification", (NFS version 2) Informational RFC, 1552 http://www.ietf.org/rfc/rfc1094.txt 1554 [RFC1813] 1555 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 1556 Protocol Specification", Informational RFC, 1557 http://www.ietf.org/rfc/rfc1813.txt 1559 [RFC3530] 1560 S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, 1561 M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards 1562 Track RFC, http://www.ietf.org/rfc/rfc3530.txt 1564 [NFSDDP] 1565 B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet 1566 Draft Work in Progress, draft-ietf-nfsv4-nfsdirect 1568 [RFC5040] 1569 R. Recio et al., "A Remote Direct Memory Access Protocol 1570 Specification", Standards Track RFC, 1571 http://www.ietf.org/rfc/rfc5040.txt 1573 [RFC5041] 1574 H. Shah et al., "Direct Data Placement over Reliable 1575 Transports", Standards Track RFC, 1576 http://www.ietf.org/rfc/rfc5041.txt 1578 [NFSRDMAPS] 1579 T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet 1580 Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- 1581 statement 1583 [NFSv4.1] 1584 S. Shepler et al., ed., "NFSv4 Minor Version 1", Internet 1585 Draft Work in Progress, draft-ietf-nfsv4-minorversion1 1587 [IB] 1588 Infiniband Architecture Specification, available from 1589 http://www.infinibandta.org 1591 [IBPORT] 1592 Infiniband Trade Association, "IP Addressing Annex", available 1593 from http://www.infinibandta.org 1595 Authors' Addresses 1597 Tom Talpey 1598 Network Appliance, Inc. 1599 1601 Trapelo Road, #16 1600 Waltham, MA 02451 USA 1602 Phone: +1 781 768 5329 1603 EMail: thomas.talpey@netapp.com 1605 Brent Callaghan 1606 Apple Computer, Inc. 1607 MS: 302-4K 1608 2 Infinite Loop 1609 Cupertino, CA 95014 USA 1611 EMail: brentc@apple.com 1613 Intellectual Property Statement 1615 The IETF Trust takes no position regarding the validity or scope of 1616 any Intellectual Property Rights or other rights that might be 1617 claimed to pertain to the implementation or use of the technology 1618 described in any IETF Document or the extent to which any license 1619 under such rights might or might not be available; nor does it 1620 represent that it has made any independent effort to identify any 1621 such rights. 1623 Copies of Intellectual Property disclosures made to the IETF 1624 Secretariat and any assurances of licenses to be made available, or 1625 the result of an attempt made to obtain a general license or 1626 permission for the use of such proprietary rights by implementers 1627 or users of this specification can be obtained from the IETF on- 1628 line IPR repository at http://www.ietf.org/ipr 1630 The IETF invites any interested party to bring to its attention any 1631 copyrights, patents or patent applications, or other proprietary 1632 rights that may cover technology that may be required to implement 1633 any standard or specification contained in an IETF Document. Please 1634 address the information to the IETF at ietf-ipr@ietf.org 1636 The definitive version of an IETF Document is that published by, or 1637 under the auspices of, the IETF. Versions of IETF Documents that 1638 are published by third parties, including those that are translated 1639 into other languages, should not be considered to be definitive 1640 versions of IETF Documents. The definitive version of these Legal 1641 Provisions is that published by, or under the auspices of, the 1642 IETF. 
Versions of these Legal Provisions that are published by 1643 third parties, including those that are translated into other 1644 languages, should not be considered to be definitive versions of 1645 these Legal Provisions. 1647 For the avoidance of doubt, each Contributor to the IETF Standards 1648 Process licenses each Contribution that he or she makes as part of 1649 the IETF Standards Process to the IETF Trust pursuant to the 1650 provisions of RFC 5378. No language to the contrary, or terms, 1651 conditions or rights that differ from or are inconsistent with the 1652 rights and licenses granted under RFC 5378, shall have any effect 1653 and shall be null and void, whether published or posted by such 1654 Contributor, or included with or in such Contribution. 1656 Disclaimer of Validity 1658 All IETF Documents and the information contained therein are 1659 provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION 1660 HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET 1661 SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE 1662 DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 1663 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION THEREIN 1664 WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1665 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1667 Copyright Statement 1669 Copyright (c) 2008 IETF Trust and the persons identified as the 1670 document authors. All rights reserved. 1672 This document is subject to BCP 78 and the IETF Trust's Legal 1673 Provisions Relating to IETF Documents 1674 (http://trustee.ietf.org/license-info) in effect on the date of 1675 publication of this document. Please review these documents 1676 carefully, as they describe your rights and restrictions with 1677 respect to this document.