Internet-Draft                                               Tom Talpey
Expires: December 2006                                  Brent Callaghan
Document: draft-ietf-nfsv4-rpcrdma-03                        June, 2006

                      RDMA Transport for ONC RPC

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   A protocol is described providing RDMA as a new transport for ONC
   RPC.  The RDMA transport binding conveys the benefits of efficient,
   bulk data transport over high speed networks, while providing for
   minimal change to RPC applications and with no required revision of
   the application RPC protocol, or the RPC protocol itself.

Table of Contents

   1.  Introduction
   2.  Abstract RDMA Requirements
   3.  Protocol Outline
   3.1.  Short Messages
   3.2.  Data Chunks
   3.3.  Flow Control
   3.4.  XDR Encoding with Chunks
   3.5.  Padding
   3.6.  XDR Decoding with Read Chunks
   3.7.  XDR Decoding with Write Chunks
   3.8.  RPC Call and Reply
   4.  RPC RDMA Message Layout
   4.1.  RPC over RDMA Header
   4.2.  RPC over RDMA header errors
   4.3.  XDR Language Description
   5.  Long Messages
   5.1.  Message as an RDMA Read Chunk
   5.2.  RDMA Write of Long Replies (Reply Chunks)
   6.  Connection Configuration Protocol
   6.1.  Initial Connection State
   6.2.  Protocol Description
   7.  Memory Registration Overhead
   8.  Errors and Error Recovery
   9.  Node Addressing
   10.  RPC Binding
   11.  Security
   12.  IANA Considerations
   13.  Acknowledgements
   14.  Normative References
   15.  Informative References
   16.  Authors' Addresses
   17.  Intellectual Property and Copyright Statements
   Acknowledgement

1.  Introduction

   RDMA is a technique for efficient movement of data between end
   nodes, which becomes increasingly compelling over high speed
   transports.  By directing data into destination buffers as it is
   sent on a network, and placing it via direct memory access by
   hardware, the double benefit of faster transfers and reduced host
   overhead is obtained.

   ONC RPC [RFC1831] is a remote procedure call protocol that has been
   run over a variety of transports.  Most RPC implementations today
   use UDP or TCP.  RPC messages are defined in terms of an eXternal
   Data Representation (XDR) [RFC1832] which provides a canonical data
   representation across a variety of host architectures.  An XDR data
   stream is conveyed differently on each type of transport.  On UDP,
   RPC messages are encapsulated inside datagrams, while on a TCP byte
   stream, RPC messages are delineated by a record marking protocol.
   An RDMA transport also conveys RPC messages in a unique fashion
   that must be fully described if client and server implementations
   are to interoperate.

   RDMA transports present new semantics unlike the behaviors of
   either UDP or TCP alone.  They retain message delineations like
   UDP while also providing a reliable, sequenced data transfer like
   TCP.  And, they provide the new efficient, bulk transfer service of
   RDMA.  RDMA transports are therefore naturally viewed as a new
   transport type by ONC RPC.

   RDMA as a transport will benefit the performance of RPC protocols
   that move large "chunks" of data, since RDMA hardware excels at
   moving data efficiently between host memory and a high speed
   network with little or no host CPU involvement.  In this context,
   the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530]
   [NFSv4.1], is an obvious beneficiary of RDMA.  A complete problem
   statement is discussed in [NFSRDMAPS], and related NFSv4 issues are
   discussed in [NFSv4.1].  Many other RPC-based protocols will also
   benefit.

   Although the RDMA transport described here provides relatively
   transparent support for any RPC application, the proposal goes
   further in describing mechanisms that can optimize the use of RDMA
   with more active participation by the RPC application.

2.  Abstract RDMA Requirements

   An RPC transport is responsible for conveying an RPC message from a
   sender to a receiver.  An RPC message is either an RPC call from a
   client to a server, or an RPC reply from the server back to the
   client.  An RPC message contains an RPC call header followed by
   arguments if the message is an RPC call, or an RPC reply header
   followed by results if the message is an RPC reply.  The call
   header contains a transaction ID (XID) followed by the program and
   procedure number as well as a security credential.  An RPC reply
   header begins with an XID that matches that of the RPC call
   message, followed by a security verifier and results.  All data in
   an RPC message is XDR encoded.  For a complete description of the
   RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].

   This protocol assumes the following abstract model for RDMA
   transports.
   These terms, common in the RDMA lexicon, are used in
   this document.  A more complete glossary of RDMA terms can be found
   in [RDMAP].

   o  Registered Memory
      All data moved via tagged RDMA operations must be resident in
      registered memory at its destination.  This protocol assumes
      that each segment of registered memory may be identified with
      a steering tag of no more than 32 bits and memory addresses of
      up to 64 bits in length.

   o  RDMA Send
      The RDMA provider supports an RDMA Send operation with
      completion signalled at the receiver when data is placed in a
      pre-posted buffer.  The amount of transferred data is limited
      only by the size of the receiver's buffer.  Sends complete at
      the receiver in the order they were issued at the sender.

   o  RDMA Write
      The RDMA provider supports an RDMA Write operation to directly
      place data in the receiver's buffer.  An RDMA Write is
      initiated by the sender and completion is signalled at the
      sender.  No completion is signalled at the receiver.  The
      sender uses a steering tag, memory address and length of the
      remote destination buffer.  RDMA Writes are not necessarily
      ordered with respect to one another, but are ordered with
      respect to RDMA Sends; a subsequent RDMA Send completion must
      be obtained at the receiver to notify that prior RDMA Write
      data has been successfully placed in the receiver's memory.

   o  RDMA Read
      The RDMA provider supports an RDMA Read operation to directly
      place peer source data in the requester's buffer.  An RDMA
      Read is initiated by the receiver and completion is signalled
      at the receiver.  The receiver provides steering tags, memory
      addresses and a length for the remote source and local
      destination buffers.  Since the peer at the data source
      receives no notification of RDMA Read completion, there is an
      assumption that on receiving the data the receiver will signal
      completion with an RDMA Send message, so that the peer can
      free the source buffers and the associated steering tags.

   This protocol is designed to be carried over all RDMA transports
   meeting the stated requirements.  This protocol conveys to the RPC
   peer information sufficient for that RPC peer to direct an RDMA
   layer to perform transfers containing RPC data, and to communicate
   their result(s).  For example, it is readily carried over RDMA
   transports such as iWARP [RDDP] or Infiniband [IB].
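
   To make the abstract model concrete, the following is a minimal
   sketch, in C, of the operations an RDMA provider is assumed to
   supply.  The names (rdma_send, rdma_write, rdma_read, struct
   rdma_segment) are hypothetical illustrations only; they are not
   part of this protocol nor of any particular RDMA API.

      /*
       * A hypothetical provider interface capturing only the
       * abstract requirements above; real APIs differ in detail.
       */
      #include <stdint.h>
      #include <stddef.h>

      struct rdma_segment {
          uint32_t handle;   /* steering tag (at most 32 bits) */
          uint32_t length;   /* length in bytes */
          uint64_t offset;   /* memory address (up to 64 bits) */
      };

      /* Send: completes at the receiver into a pre-posted buffer. */
      int rdma_send(void *conn, const void *buf, size_t len);

      /* Write: sender places data directly at the remote segment;
       * no completion is signalled at the receiver. */
      int rdma_write(void *conn, const struct rdma_segment *dst,
                     const void *src, size_t len);

      /* Read: receiver pulls data from the remote segment into a
       * local registered buffer; completes at the requester. */
      int rdma_read(void *conn, const struct rdma_segment *src,
                    void *dst, size_t len);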

3.  Protocol Outline

   An RPC message can be conveyed in identical fashion, whether it is
   a call or reply message.  In each case, the transmission of the
   message proper is preceded by transmission of a transport-specific
   header for use by RPC over RDMA transports.  This header is
   analogous to the record marking used for RPC over TCP, but is more
   extensive, since RDMA transports support several modes of data
   transfer and it is important to allow the client and server to use
   the most efficient mode for any given transfer.  Multiple segments
   of a message may be transferred in different ways to different
   remote memory destinations.

   All transfers of a call or reply begin with an RDMA Send which
   transfers at least the RPC over RDMA header, usually with the call
   or reply message appended, or at least some part thereof.  Because
   the size of what may be transmitted via RDMA Send is limited by the
   size of the receiver's pre-posted buffer, the RPC over RDMA
   transport provides a number of methods to reduce the amount
   transferred by means of the RDMA Send, when necessary, by
   transferring various parts of the message using RDMA Read and RDMA
   Write.

3.1.  Short Messages

   Many RPC messages are quite short.  For example, the NFS version 3
   GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
   byte filehandle argument and 4 bytes of length.  The reply to this
   common request is about 100 bytes.

   There is no benefit in transferring such small messages with an
   RDMA Read or Write operation.  The overhead in transferring
   steering tags and memory addresses is justified only by large
   transfers.  The critical message size that justifies RDMA transfer
   will vary depending on the RDMA implementation and network, but is
   typically of the order of a few kilobytes.  It is appropriate to
   transfer a short message with an RDMA Send to a pre-posted buffer.
   The RPC over RDMA header with the short message (call or reply)
   immediately following is transferred using a single RDMA Send
   operation.

   Short RPC messages over an RDMA transport will look like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

3.2.  Data Chunks

   Some protocols, like NFS, have RPC procedures that can transfer
   very large "chunks" of data in the RPC call or reply and would
   cause the maximum send size to be exceeded if one tried to transfer
   them as part of the RDMA Send.  These large chunks typically range
   from a kilobyte to a megabyte or more.  An RDMA transport can
   transfer large chunks of data more efficiently via the direct
   placement of an RDMA Read or RDMA Write operation.  Using direct
   placement instead of inline transfer not only avoids expensive data
   copies, but provides correct data alignment at the destination.

3.3.  Flow Control

   It is critical to provide RDMA Send flow control for an RDMA
   connection.  RDMA receive operations will fail if a pre-posted
   receive buffer is not available to accept an incoming RDMA Send,
   and repeated occurrences of such errors can be fatal to the
   connection.  This is a departure from conventional TCP/IP
   networking where buffers are allocated dynamically on an as-needed
   basis, and pre-posting is not required.

   It is not practical to provide for fixed credit limits at the RPC
   server.  Fixed limits scale poorly, since posted buffers are
   dedicated to the associated connection until consumed by receive
   operations.  Additionally for protocol correctness, the RPC server
   must always be able to reply to client requests, whether or not new
   buffers have been posted to accept future receives.  (Note that the
   RPC server may in fact be a client at some other layer.  For
   example, NFSv4 callbacks are processed by the NFSv4 client, acting
   as an RPC server.  The credit discussions apply equally in either
   case.)

   Flow control for RDMA Send operations is implemented as a simple
   request/grant protocol in the RPC over RDMA header associated with
   each RPC message.
   The RPC over RDMA header for RPC call messages
   contains a requested credit value for the RPC server, which may be
   dynamically adjusted by the caller to match its expected needs.
   The RPC over RDMA header for RPC reply messages provides the
   granted result, which may be any value except that it must not be
   zero when no operations are in progress at the server, since such
   a value would result in deadlock.  The value may be adjusted up or
   down at each opportunity to match the server's needs or policies.

   The RPC client must not send unacknowledged requests in excess of
   this granted RPC server credit limit.  If the limit is exceeded,
   the RDMA layer may signal an error, possibly terminating the
   connection.  Even if an error does not occur, there is no
   requirement that the server must handle the excess request(s), and
   it may return an RPC error to the client.  Also note that the
   never-zero requirement implies that an RPC server must always
   provide at least one credit to each connected RPC client.  It does
   not however require that the server must always be prepared to
   receive a request from each client, for example when the server is
   busy processing all granted client requests.

   While RPC calls may complete in any order, the current flow control
   limit at the RPC server is known to the RPC client from the Send
   ordering properties.  It is always the most recent server-granted
   credit value minus the number of requests in flight.

   Certain RDMA implementations may impose additional flow control
   restrictions, such as limits on RDMA Read operations in progress at
   the responder.  Because these operations are outside the scope of
   this protocol, they are not addressed and must be provided for by
   other layers.  For example, a simple upper layer RPC consumer might
   perform single-issue RDMA Read requests, while a more
   sophisticated, multithreaded RPC consumer may implement its own
   FIFO queue of such operations.  For further discussion of possible
   protocol implementations capable of negotiating these values, see
   section 6 "Connection Configuration Protocol" of this draft, or
   [NFSv4.1].
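
   As an illustration of the credit rule above, the following is a
   minimal client-side sketch in C; the structure and function names
   are hypothetical and not part of the protocol.

      #include <stdint.h>
      #include <stdbool.h>

      /* Hypothetical per-connection credit state. */
      struct rpcrdma_credits {
          uint32_t granted;    /* most recent server-granted value */
          uint32_t in_flight;  /* unacknowledged requests pending */
      };

      /* A call may be sent only while in-flight requests are below
       * the grant; the grant is refreshed from each reply header. */
      static bool can_send_call(const struct rpcrdma_credits *c)
      {
          return c->in_flight < c->granted;
      }

      static void on_reply_received(struct rpcrdma_credits *c,
                                    uint32_t rdma_credit_from_header)
      {
          c->in_flight--;                        /* one completed */
          c->granted = rdma_credit_from_header;  /* server adjusts */
      }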

3.4.  XDR Encoding with Chunks

   The data comprising an RPC call or reply message is marshaled or
   serialized into a contiguous stream by an XDR routine.  XDR data
   types such as integers, strings, arrays and linked lists are
   commonly implemented over two very simple functions that encode
   either an XDR data unit (32 bits) or an array of bytes.

   Normally, the separate data items in an RPC call or reply are
   encoded as a contiguous sequence of bytes for network transmission
   over UDP or TCP.  However, in the case of an RDMA transport, local
   routines such as XDR encode can determine that (for instance) an
   opaque byte array is large enough to be more efficiently moved via
   an RDMA data transfer operation like RDMA Read or RDMA Write.

   Semantically speaking, the protocol has no restriction regarding
   data types which may or may not be represented by a read or write
   chunk.  In practice however, efficiency considerations lead to the
   conclusion that certain data types are not generally "chunkable".
   Typically, only opaque and aggregate data types which may attain
   substantial size are considered to be eligible.  With today's
   hardware this size may be a kilobyte or more.  However any object
   may be chosen for chunking in any given message.

   The eligibility of XDR data items to be candidates for being moved
   as data chunks (as opposed to being marshalled inline) is not
   specified by the RPC over RDMA protocol.  Chunk eligibility
   criteria must be determined by each upper layer in order to provide
   for an interoperable specification.  One such example with
   rationale, for the NFS protocol family, is provided in [NFSDDP].

   The interface by which an upper layer implementation communicates
   the eligibility of a data item locally to RPC for chunking is out
   of scope for this specification.  In many implementations, it is
   possible to implement a transparent RPC chunking facility.
   However, such implementations may lead to inefficiencies, either
   because they require the RPC layer to perform expensive
   registration and deregistration of memory "on the fly", or they may
   require using RDMA chunks in reply messages, along with the
   resulting additional handshaking with the RPC over RDMA peer.
   However, these issues are internal and generally confined to the
   local interface between RPC and its upper layers, one in which
   implementations are free to innovate.  The only requirement is that
   the resulting RPC RDMA protocol sent to the peer is valid for the
   upper layer.  See for example [NFSDDP].

   When sending any message (request or reply) that contains an
   eligible large data chunk, the XDR encoding routine avoids moving
   the data into the XDR stream.  Instead, it does not encode the data
   portion, but records the address and size of each chunk in a
   separate "read chunk list" encoded within RPC RDMA transport-
   specific headers.  Such chunks will be transferred via RDMA Read
   operations initiated by the receiver.

   When the read chunks are to be moved via RDMA, the memory for each
   chunk must be registered.  This registration may take place within
   XDR itself, providing for full transparency to upper layers, or it
   may be performed by any other specific local implementation.

   Additionally, when making an RPC call that can result in bulk data
   transferred in the reply, it is desirable to provide chunks to
   accept the data directly via RDMA Write.  These write chunks will
   therefore be pre-filled by the RPC server prior to responding, and
   XDR decode at the client will not be required.  These chunks
   undergo a similar registration and advertisement via "write chunk
   lists" built as a part of XDR encoding.

   Some RPC client implementations are not able to determine where an
   RPC call's results reside during the "encode" phase.  This makes it
   difficult or impossible for the RPC client layer to encode the
   write chunk list at the time of building the request.  In this
   case, it is difficult for the RPC implementation to provide
   transparency to the RPC consumer, which may require recoding to
   provide result information at this earlier stage.

   Therefore if the RPC client does not make a write chunk list
   available to receive the result, then the RPC server must return
   data inline in the reply, or if it so chooses, via a read chunk
   list.  RPC clients are discouraged from omitting write chunk lists
   for eligible replies, due to the lower performance of the
   additional handshaking to perform data transfer, and the
   requirement that the RPC server must expose (and preserve) the
   reply data for a period of time.
   In the absence of a server-
   provided read chunk list in the reply, if the encoded reply
   overflows the posted receive buffer, the RPC will fail.

   When any data within a message is provided via either read or write
   chunks, the chunk itself refers only to the data portion of the XDR
   stream element.  In particular, for counted fields (e.g. a "<>"
   encoding) the byte count which is encoded as part of the field
   remains in the XDR stream, and is also encoded in the chunk list.
   The data portion is however elided from the encoded XDR stream, and
   is transferred as part of chunk list processing.  This is important
   to maintain upper layer implementation compatibility - both the
   count and the data must be transferred as part of the logical XDR
   stream.  While the chunk list processing results in the data being
   available to the upper layer peer for XDR decoding, the length
   present in the chunk list entries is not.  Any byte count in the
   XDR stream must match the sum of the byte counts present in the
   corresponding read or write chunk list.  If they do not agree, an
   RPC protocol encoding error results.
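
   For illustration, a receiver might check this agreement as in the
   following minimal sketch in C; the structure and function names
   are hypothetical.

      #include <stdint.h>

      /* Hypothetical decoded chunk entry (see the list items
       * that follow). */
      struct chunk_entry {
          uint32_t handle;
          uint32_t length;
          uint64_t offset;
      };

      /* The byte count decoded from the XDR stream for a counted
       * field must equal the sum of the lengths of its chunks. */
      static int check_chunk_count(uint32_t xdr_count,
                                   const struct chunk_entry *chunks,
                                   unsigned nchunks)
      {
          uint64_t sum = 0;
          for (unsigned i = 0; i < nchunks; i++)
              sum += chunks[i].length;
          return sum == xdr_count ? 0 : -1;  /* -1: encoding error */
      }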

   The following items are contained in a chunk list entry.

   Handle
      Steering tag or handle obtained when the chunk memory is
      registered for RDMA.

   Length
      The length of the chunk in bytes.

   Offset
      The offset or beginning memory address of the chunk.  In order
      to support the widest array of RDMA implementations, as well
      as the most general steering tag scheme, this field is
      unconditionally included in each chunk list entry.

      While zero-based offset schemes are available in many RDMA
      implementations, their use by RPC requires individual
      registration of each read or write chunk.  On many such
      implementations this can be a significant overhead.  By
      providing an offset in each chunk, many pre-registration or
      region-based registrations can be readily supported, and by
      using a single, universal chunk representation, the RPC RDMA
      protocol implementation is simplified to its most general
      form.

   Position
      For data which is to be encoded, the position in the XDR
      stream where the chunk would normally reside.  Note that the
      chunk therefore inserts its data into the XDR stream at this
      position, but its transfer is no longer "inline".  Also note
      it is possible that a contiguous sequence of chunks might all
      have the same position.  For data which is to be decoded, no
      "position" is used.

   When XDR marshaling is complete, the chunk list is XDR encoded,
   then sent to the receiver prepended to the RPC message.  Any source
   data for a read chunk, or the destination of a write chunk, remain
   behind in the sender's registered memory and their actual payload
   is not marshalled into the request or reply.

      +----------------+----------------+-------------
      |  RPC over RDMA |                |
      |    header w/   |   RPC Header   | Non-chunk args/results
      |     chunks     |                |
      +----------------+----------------+-------------

   Read chunk lists and write chunk lists are structured somewhat
   differently.  This is due to the different usage - read chunks are
   decoded and indexed by their position in the XDR data stream, their
   size is always known, and may be used for both arguments and
   results.  Write chunks on the other hand are used only for results,
   and have neither a preassigned offset in the XDR stream, nor a size
   until the results are produced, since the buffers may not be used
   for results at all, or may be partially filled.  Their presence in
   the XDR stream is therefore not known until the reply is processed.
   The mapping of Write chunks onto designated NFS procedures and
   their results is described in [NFSDDP].

   Therefore, read chunks are encoded into a read chunk list as a
   single array, with each entry tagged by its (known) size and
   position in the XDR stream.  Write chunks are encoded as a list of
   arrays of RDMA buffers, with each list element (an array) providing
   buffers for a separate result.  Individual write chunk list
   elements may thereby result in being partially or fully filled, or
   in fact not being filled at all.  Unused write chunks, or unused
   bytes in write chunk buffer lists, are not returned as results, and
   their memory is returned to the upper layer as part of RPC
   completion.  However, the RPC layer should not assume that the
   buffers have not been modified.

3.5.  Padding

   Alignment of specific opaque data enables certain scatter/gather
   optimizations.  Padding leverages the useful property that RDMA
   transfers preserve alignment of data, even when they are placed
   into pre-posted receive buffers by Sends.

   Many servers can make good use of such padding.  Padding allows the
   chaining of RDMA receive buffers such that any data transferred by
   RDMA on behalf of RPC requests will be placed into appropriately
   aligned buffers on the system that receives the transfer.  In this
   way, the need for servers to perform RDMA Read to satisfy all but
   the largest client writes is obviated.

   The effect of padding is demonstrated below showing prior bytes on
   an XDR stream (XXX) followed by an opaque field consisting of four
   length bytes (LLLL) followed by data bytes (DDDD).  The receiver of
   the RDMA Send has posted two chained receive buffers.  Without
   padding, the opaque data is split across the two buffers.  With the
   addition of padding bytes (ppp) prior to the first data byte, the
   data can be forced to align correctly in the second buffer.

                 Buffer 1                 Buffer 2
   Unpadded   --------------           --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD    --->   XXXXXXXLLLLDDD DDDDDDDDDDD

   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD --->   XXXXXXXLLLLppp DDDDDDDDDDDDDD

   Padding is implemented completely within the RDMA transport
   encoding, flagged with a specific message type.  Where padding is
   applied, two values are passed to the peer: an "rdma_align" which
   is the padding value used, and "rdma_thresh", which is the opaque
   data size at or above which padding is applied.  For instance, if
   the server is using chained 4 KB receive buffers, then up to (4 KB
   - 1) padding bytes could be used to achieve alignment of the data.
   If padding is to apply only to chunks at least 1 KB in size, then
   the threshold should be set to 1 KB.  The XDR routine at the peer
   will consult these values when decoding opaque values.  Where the
   decoded length exceeds the rdma_thresh, the XDR decode will skip
   over the appropriate padding as indicated by rdma_align and the
   current XDR stream position.
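
   A decoder might apply these two values as in the following sketch
   in C.  This is one plausible reading of the rule above, under the
   assumption that the pad brings the data portion up to the next
   rdma_align boundary; the function name is hypothetical.

      #include <stdint.h>

      /* When a decoded opaque length meets rdma_thresh, skip
       * enough pad bytes to reach an rdma_align boundary. */
      static uint32_t pad_to_skip(uint32_t xdr_stream_pos,
                                  uint32_t opaque_len,
                                  uint32_t rdma_align,
                                  uint32_t rdma_thresh)
      {
          if (opaque_len < rdma_thresh || rdma_align == 0)
              return 0;              /* field was not padded */
          return (rdma_align - xdr_stream_pos % rdma_align)
                 % rdma_align;
      }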

3.6.  XDR Decoding with Read Chunks

   The XDR decode process moves data from an XDR stream into a data
   structure provided by the RPC client or server application.  Where
   elements of the destination data structure are buffers or strings,
   the RPC application can either pre-allocate storage to receive the
   data, or leave the string or buffer fields null and allow the XDR
   decode stage of RPC processing to automatically allocate storage of
   sufficient size.

   When decoding a message from an RDMA transport, the receiver first
   XDR decodes the chunk lists from the RPC over RDMA header, then
   proceeds to decode the body of the RPC message (arguments or
   results).  Whenever the XDR offset in the decode stream matches
   that of a chunk in the read chunk list, the XDR routine initiates
   an RDMA Read to bring over the chunk data into locally registered
   memory for the destination buffer.

   When processing an RPC request, the RPC receiver (RPC server)
   acknowledges its completion of use of the source buffers by simply
   replying to the RPC sender (client), and the peer may free all
   source buffers advertised by the request.

   When processing an RPC reply, after completing such a transfer the
   RPC receiver (client) must issue an RDMA_DONE message (described in
   Section 3.8) to notify the peer (server) that the source buffers
   can be freed.

   The read chunk list is constructed and used entirely within the
   RPC/XDR layer.  Other than specifying the minimum chunk size, the
   management of the read chunk list is automatic and transparent to
   an RPC application.

3.7.  XDR Decoding with Write Chunks

   When a "write chunk list" is provided for the results of the RPC
   call, the RPC server must provide any corresponding data via RDMA
   Write to the memory referenced in the chunk list entries.  The RPC
   reply conveys this by returning the write chunk list to the client
   with the lengths rewritten to match the actual transfer.  The XDR
   "decode" of the reply therefore performs no local data transfer but
   merely returns the length obtained from the reply.

   Each decoded result consumes one entry in the write chunk list,
   which in turn consists of an array of RDMA segments.  The length is
   therefore the sum of all returned lengths in all segments
   comprising the corresponding list entry.  As each list entry is
   "decoded", the entire entry is consumed.

   The write chunk list is constructed and used by the RPC
   application.  The RPC/XDR layer simply conveys the list between
   client and server and initiates the RDMA Writes back to the client.
   The mapping of write chunk list entries to procedure arguments must
   be determined for each protocol.  An example of a mapping is
   described in [NFSDDP].
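
   A sketch in C of the length computation just described follows;
   the structure mirrors xdr_rdma_segment of section 4.3, and the
   helper function is a hypothetical illustration.

      #include <stdint.h>

      /* C mirror of xdr_rdma_segment (section 4.3). */
      struct xdr_rdma_segment {
          uint32_t handle;
          uint32_t length;   /* rewritten by server in the reply */
          uint64_t offset;
      };

      /* A decoded result's length is the sum over all segments
       * of its write chunk list entry. */
      static uint64_t write_chunk_result_length(
              const struct xdr_rdma_segment *segs, unsigned nsegs)
      {
          uint64_t total = 0;
          for (unsigned i = 0; i < nsegs; i++)
              total += segs[i].length;  /* bytes actually Written */
          return total;
      }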

3.8.  RPC Call and Reply

   The RDMA transport for RPC provides three methods of moving data
   between RPC client and server:

   Inline
      Data are moved between RPC client and server within an RDMA
      Send.

   RDMA Read
      Data are moved between RPC client and server via an RDMA Read
      operation via steering tag, address and offset obtained from a
      read chunk list.

   RDMA Write
      Result data is moved from RPC server to client via an RDMA
      Write operation via steering tag, address and offset obtained
      from a write chunk list or reply chunk in the client's RPC
      call message.

   These methods of data movement may occur in combinations within a
   single RPC.  For instance, an RPC call may contain some inline data
   along with some large chunks to be transferred via RDMA Read to the
   server.  The reply to that call may have some result chunks that
   the server RDMA Writes back to the client.  The following protocol
   interactions illustrate RPC calls that use these methods to move
   RPC message data:

   An RPC with write chunks in the call message looks like this:

      RPC Client                           RPC Server
          |     RPC Call + Write Chunk list      |
     Send |   ------------------------------>    |
          |                                      |
          |               Chunk 1                |
          |   <------------------------------    | Write
          |                  :                   |
          |               Chunk n                |
          |   <------------------------------    | Write
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

   In the presence of write chunks, RDMA ordering provides the
   guarantee that all data in the RDMA Write operations has been
   placed in memory prior to the client's RPC reply processing.

   An RPC with read chunks in the call message looks like this:

      RPC Client                           RPC Server
          |     RPC Call + Read Chunk list       |
     Send |   ------------------------------>    |
          |                                      |
          |               Chunk 1                |
          |   +------------------------------    | Read
          |   v----------------------------->    |
          |                  :                   |
          |               Chunk n                |
          |   +------------------------------    | Read
          |   v----------------------------->    |
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

   And an RPC with read chunks in the reply message looks like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |     RPC Reply + Read Chunk list      |
          |   <------------------------------    | Send
          |                                      |
          |               Chunk 1                |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                  :                   |
          |               Chunk n                |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                                      |
          |                Done                  |
     Send |   ------------------------------>    |

   The final Done message allows the RPC client to signal the server
   that it has received the chunks, so the server can de-register and
   free the memory holding the chunks.  A Done completion is not
   necessary for an RPC call, since the RPC reply Send is itself a
   receive completion notification.  In the event that the client
   fails to return the Done message within some timeout period, the
   server may conclude that a protocol violation has occurred and
   close the RPC connection, or it may proceed to de-register and
   free its chunk buffers.  The latter may result in a fatal RDMA
   error if the client later attempts to perform an RDMA Read
   operation, which has the same effect as closing the connection.

   The use of read chunks in RPC reply messages is much less efficient
   than providing write chunks in the originating RPC calls, due to
   the additional message exchanges, the need for the RPC server to
   advertise buffers to the peer, the necessity of the server
   maintaining a timer for the purpose of recovery from misbehaving
   clients, and the need for additional memory registration.  Their
   use is not recommended by upper layers where efficiency is a
   primary concern [NFSDDP].  However, they may be employed by upper
   layer protocol bindings which are primarily concerned with
   transparency, since they can frequently be implemented completely
   within the RPC lower layers.

   It is important to note that the Done message consumes a credit at
   the RPC server.  The RPC server should provide sufficient credits
   to the client to allow the Done message to be sent without deadlock
   (driving the outstanding credit count to zero).
   The RPC client
   must account for its required Done messages to the server in its
   accounting of available credits, and the server should replenish
   any credit consumed by its use of such exchanges at its earliest
   opportunity.

   Finally, it is possible to conceive of RPC exchanges that involve
   any or all combinations of write chunks in the RPC call, read
   chunks in the RPC call, and read chunks in the RPC reply.  Support
   for such exchanges is straightforward from a protocol perspective,
   but in practice such exchanges would be quite rare, limited to
   upper layer protocol exchanges which transferred bulk data in both
   the call and corresponding reply.

4.  RPC RDMA Message Layout

   RPC call and reply messages are conveyed across an RDMA transport
   with a prepended RPC over RDMA header.  The RPC over RDMA header
   includes data for RDMA flow control credits, padding parameters and
   lists of addresses that provide direct data placement via RDMA Read
   and Write operations.  The layout of the RPC message itself is
   unchanged from that described in [RFC1831] except for the possible
   exclusion of large data chunks that will be moved by RDMA Read or
   Write operations.  If the RPC message (along with the RPC over RDMA
   header) is too long for the posted receive buffer (even after any
   large chunks are removed), then the entire RPC message can be moved
   separately as a chunk, leaving just the RPC over RDMA header in the
   RDMA Send.

4.1.  RPC over RDMA Header

   The RPC over RDMA header begins with four 32-bit fields that are
   always present and which control the RDMA interaction including
   RDMA-specific flow control.  These are then followed by a number of
   items such as chunk lists and padding which may or may not be
   present depending on the type of transmission.  The four fields
   which are always present are:

   1. Transaction ID (XID).
      The XID generated for the RPC call and reply.  Having the XID
      at the beginning of the message makes it easy to establish the
      message context.  This XID mirrors the XID in the RPC header,
      and takes precedence.  The receiver may ignore the XID in the
      RPC header, if it so chooses.

   2. Version number.
      This version of the RPC RDMA message protocol is 1.  The
      version number must be increased by one whenever the format of
      the RPC RDMA messages is changed.

   3. Flow control credit value.
      When sent in an RPC call message, the requested value is
      provided.  When sent in an RPC reply message, the granted
      value is returned.  RPC calls must not be sent in excess of
      the currently granted limit.

   4. Message type.

      o  RDMA_MSG = 0 indicates that chunk lists and RPC message
         follow.

      o  RDMA_NOMSG = 1 indicates that after the chunk lists there
         is no RPC message.  In this case, the chunk lists provide
         information to allow the message proper to be transferred
         using RDMA Read or Write and thus is not appended to the
         RPC over RDMA header.

      o  RDMA_MSGP = 2 indicates that a chunk list and RPC message
         with some padding follow.

      o  RDMA_DONE = 3 indicates that the message signals the
         completion of a chunk transfer via RDMA Read.

      o  RDMA_ERROR = 4 is used to signal any detected error(s) in
         the RPC RDMA chunk encoding.
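
   As an illustration, a sender could marshal the four fixed fields
   just listed as in this minimal sketch in C, using network byte
   order; the function name is hypothetical.

      #include <stdint.h>
      #include <stddef.h>
      #include <arpa/inet.h>  /* htonl() */

      /* Hypothetical encoder for the four fixed 32-bit header
       * fields.  Returns the number of bytes written (16). */
      static size_t encode_rpcrdma_fixed(uint32_t *out, uint32_t xid,
                                         uint32_t credits,
                                         uint32_t msg_type)
      {
          out[0] = htonl(xid);       /* mirrors the RPC header XID */
          out[1] = htonl(1);         /* protocol version, now 1 */
          out[2] = htonl(credits);   /* requested or granted */
          out[3] = htonl(msg_type);  /* RDMA_MSG ... RDMA_ERROR */
          return 4 * sizeof(uint32_t);
      }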

   Because the version number is encoded as part of this header, and
   the RDMA_ERROR message type is used to indicate errors, these first
   four fields and the start of the following message body must always
   remain aligned at these fixed offsets for all versions of the RPC
   over RDMA header.

   For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
   chunk lists follow.  If the Read chunk list is null (a 32 bit word
   of zeros), then there are no chunks to be transferred separately
   and the RPC message follows in its entirety.  If non-null, it marks
   the beginning of an XDR encoded sequence of Read chunk list
   entries.  If the Write chunk list is non-null, then an XDR encoded
   sequence of Write chunk entries follows.

   If the message type is RDMA_MSGP, then two additional fields that
   specify the padding alignment and threshold are inserted prior to
   the Read and Write chunk lists.

   A header of message type RDMA_MSG or RDMA_MSGP will be followed by
   the RPC call or RPC reply message body, beginning with the XID.
   The XID in the RDMA_MSG or RDMA_MSGP header must match this.

   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |   NULLs     | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------

   Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
   RPC message follows.  As an implementation hint: a gather operation
   on the Send of the RDMA RPC message can be used to marshal the
   initial header, the chunk list, and the RPC message itself.

4.2.  RPC over RDMA header errors

   When a peer receives an RPC RDMA message, it must perform certain
   basic validity checks on the header and chunk contents.  If errors
   are detected in an RPC request, an RDMA_ERROR reply should be
   generated.

   Two types of errors are defined, version mismatch and invalid chunk
   format.  When the peer detects an RPC over RDMA header version
   which it does not support (currently this draft defines only
   version 1), it replies with an error code of ERR_VERS, and provides
   the low and high inclusive version numbers it does, in fact,
   support.  The version number in this reply can be any value
   otherwise valid at the receiver.  When other decoding errors are
   detected in the header or chunks, either an RPC decode error may be
   returned, or the error code ERR_CHUNK.

4.3.  XDR Language Description

   Here is the message layout in XDR language.

      struct xdr_rdma_segment {
         uint32 handle;        /* Registered memory handle */
         uint32 length;        /* Length of the chunk in bytes */
         uint64 offset;        /* Chunk virtual address or offset */
      };

      struct xdr_read_chunk {
         uint32 position;      /* Position in XDR stream */
         struct xdr_rdma_segment target;
      };

      struct xdr_read_list {
         struct xdr_read_chunk entry;
         struct xdr_read_list *next;
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list *next;
      };

      struct rdma_msg {
         uint32 rdma_xid;      /* Mirrors the RPC header xid */
         uint32 rdma_vers;     /* Version of this protocol */
         uint32 rdma_credit;   /* Buffers requested/granted */
         rdma_body rdma_body;
      };

      enum rdma_proc {
         RDMA_MSG=0,    /* An RPC call or reply msg */
         RDMA_NOMSG=1,  /* An RPC call or reply msg - separate body */
         RDMA_MSGP=2,   /* An RPC call or reply msg with padding */
         RDMA_DONE=3,   /* Client signals reply completion */
         RDMA_ERROR=4   /* An RPC RDMA encoding error */
      };

      union rdma_body switch (rdma_proc proc) {
         case RDMA_MSG:
            rpc_rdma_header rdma_msg;
         case RDMA_NOMSG:
            rpc_rdma_header_nomsg rdma_nomsg;
         case RDMA_MSGP:
            rpc_rdma_header_padded rdma_msgp;
         case RDMA_DONE:
            void;
         case RDMA_ERROR:
            rpc_rdma_error rdma_error;
      };

      struct rpc_rdma_header {
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };

      struct rpc_rdma_header_nomsg {
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
      };

      struct rpc_rdma_header_padded {
         uint32 rdma_align;    /* Padding alignment */
         uint32 rdma_thresh;   /* Padding threshold */
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };

      enum rpc_rdma_errcode {
         ERR_VERS = 1,
         ERR_CHUNK = 2
      };

      union rpc_rdma_error switch (rpc_rdma_errcode) {
         case ERR_VERS:
            uint32 rdma_vers_low;
            uint32 rdma_vers_high;
         case ERR_CHUNK:
            void;
         default:
            uint32 rdma_extra[8];
      };
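
   Since the read chunk list is a linked list whose entries carry XDR
   stream positions (see section 3.4), a receiver might locate the
   chunks for the current decode offset as in this sketch, a C mirror
   of the XDR above with hypothetical names.

      #include <stdint.h>
      #include <stddef.h>

      /* C mirror of xdr_read_chunk / xdr_read_list; hypothetical. */
      struct read_chunk {
          uint32_t position;           /* XDR stream position */
          uint32_t handle, length;
          uint64_t offset;
          struct read_chunk *next;
      };

      /* Return the first chunk at a given XDR position; note that a
       * contiguous sequence of chunks may share one position. */
      static struct read_chunk *find_read_chunk(
              struct read_chunk *list, uint32_t xdr_pos)
      {
          for (; list != NULL; list = list->next)
              if (list->position == xdr_pos)
                  return list;
          return NULL;
      }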

5.  Long Messages

   The receiver of RDMA Send messages is required to have previously
   posted one or more adequately sized buffers.  The RPC client can
   inform the server of the maximum size of its RDMA Send messages via
   the Connection Configuration Protocol described later in this
   document.

   Since RPC messages are frequently small, memory savings can be
   achieved by posting small buffers.  Even large messages like NFS
   READ or WRITE will be quite small once the chunks are removed from
   the message.  However, there may be large messages that would
   demand a very large buffer be posted, where the contents of the
   buffer may not be a chunkable XDR element.  A good example is an
   NFS READDIR reply which may contain a large number of small
   filename strings.  Also, the NFS version 4 protocol [RFC3530]
   features COMPOUND request and reply messages of unbounded length.

   Ideally, each upper layer will negotiate these limits.  However, it
   is frequently necessary to provide a transparent solution.

5.1.  Message as an RDMA Read Chunk

   One relatively simple method is to have the client identify any RPC
   message that exceeds the RPC server's posted buffer size and move
   it separately as a chunk, i.e. reference it as the first entry in
   the read chunk list with an XDR position of zero.

   Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------

   Long Message

   +--------+---------+---------+------------+-------------+
   |        |         |         |            |             |
   |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
   |        |         |         |            |             |
   +--------+---------+---------+------------+-------------+
                                                   |
                                                   |   +----------
                                                   |   | Long RPC Call
                                                   +-->|     or
                                                       | Reply Message
                                                       +----------

   If the receiver gets an RPC over RDMA header with a message type of
   RDMA_NOMSG and finds an initial read chunk list entry with a zero
   XDR position, it allocates a registered buffer and issues an RDMA
   Read of the long RPC message into it.  The receiver then proceeds
   to XDR decode the RPC message as if it had received it inline with
   the Send data.  Further decoding may issue additional RDMA Reads to
   bring over additional chunks.
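
   A receiver's dispatch path might recognize a long message as in
   this sketch in C; the structures are minimal hypothetical views of
   the decoded header.

      #include <stdint.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define RDMA_NOMSG 1

      struct read_chunk {
          uint32_t position;         /* XDR stream position */
          struct read_chunk *next;   /* other fields omitted */
      };

      struct rpcrdma_hdr {
          uint32_t msg_type;
          struct read_chunk *reads;
      };

      /* A message was sent "long" when the header carries
       * RDMA_NOMSG and the first read chunk has position zero. */
      static bool is_long_message(const struct rpcrdma_hdr *h)
      {
          return h->msg_type == RDMA_NOMSG &&
                 h->reads != NULL && h->reads->position == 0;
      }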

   Although the handling of long messages requires one extra network
   turnaround, in practice these messages should be rare if the posted
   receive buffers are correctly sized, and of course they will be
   non-existent for RDMA-aware upper layers.

   An RPC with long reply returned via RDMA Read looks like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |         RPC over RDMA Header         |
          |   <------------------------------    | Send
          |                                      |
          |          Long RPC Reply Msg          |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                                      |
          |                Done                  |
     Send |   ------------------------------>    |

5.2.  RDMA Write of Long Replies (Reply Chunks)

   A superior method of handling long RPC replies is to have the RPC
   client post a large buffer into which the server can write a large
   RPC reply.  This has the advantage that an RDMA Write may be
   slightly faster in network latency than an RDMA Read.
   Additionally, it removes the need for the RDMA_DONE message that is
   required when a large reply is returned as a Read chunk.

   This protocol supports direct return of a large reply via the
   inclusion of an optional rdma_reply write chunk after the read
   chunk list and the write chunk list.  The client allocates a buffer
   sized to receive a large reply and enters its steering tag, address
   and length in the rdma_reply write chunk.  If the reply message is
   too long to return inline with an RDMA Send (exceeds the size of
   the client's posted receive buffer), even with read chunks removed,
   then the RPC server performs an RDMA Write of the RPC reply message
   into the buffer indicated by the rdma_reply chunk.  If the client
   does not provide an rdma_reply chunk, or if it is too small, then
   the message must be returned as a Read chunk.

   An RPC with long reply returned via RDMA Write looks like this:

      RPC Client                           RPC Server
          |       RPC Call with rdma_reply       |
     Send |   ------------------------------>    |
          |                                      |
          |          Long RPC Reply Msg          |
          |   <------------------------------    | Write
          |                                      |
          |         RPC over RDMA Header         |
          |   <------------------------------    | Send

   The use of RDMA Write to return long replies requires that the
   client application anticipate a long reply and have some knowledge
   of its size so that an adequately sized buffer can be allocated.
   This is certainly true of NFS READDIR replies, where the client
   already provides an upper bound on the size of the encoded
   directory fragment to be returned by the server.

   The use of these "reply chunks" is highly efficient and convenient
   for both RPC client and server.  Their use is encouraged for
   eligible RPC operations such as NFS READDIR, which would otherwise
   require extensive chunk management within the results or use of
   RDMA Read and a Done message [NFSDDP].

6.  Connection Configuration Protocol

   RDMA Send operations require the receiver to post one or more
   buffers at the RDMA connection endpoint, each large enough to
   receive the largest Send message.  Buffers are consumed as Send
   messages are received.  If a buffer is too small, or if there are
   no buffers posted, the RDMA transport may return an error and break
   the RDMA connection.  The receiver must post sufficient, adequately
   sized buffers to avoid buffer overrun or capacity errors.

   The protocol described above includes only a mechanism for managing
   the number of such receive buffers, and no explicit features to
   allow the RPC client and server to provision or control buffer
   sizing, nor any other session parameters.

   In the past, this type of connection management has not been
   necessary for RPC.  RPC over UDP or TCP does not have a protocol to
   negotiate the link.  The server can get a rough idea of the maximum
   size of messages from the server protocol code.  However, a
   protocol to negotiate transport features on a more dynamic basis is
   desirable.

   The Connection Configuration Protocol allows the client to pass its
   connection requirements to the server, and allows the server to
   inform the client of its connection limits.

6.1.  Initial Connection State

   This protocol will be used for connection setup prior to the use of
   another RPC protocol that uses the RDMA transport.  It operates in-
   band, i.e. it uses the connection itself to negotiate the
   connection parameters.  To provide a basis for connection
   negotiation, the connection is assumed to provide a basic level of
   interoperability: the ability to exchange at least one RPC message
   at a time that is at least 1 KB in size.  The server may exceed
   this basic level of configuration, but the client must not assume
   it.

6.2.  Protocol Description

   Version 1 of the Connection Configuration protocol consists of a
   single procedure that allows the client to inform the server of its
   connection requirements and the server to return connection
   information to the client.

   The maxcall_sendsize argument is the maximum size of an RPC call
   message that the client will send inline in an RDMA Send message to
   the server.  The server may return a maxcall_sendsize value that is
   smaller or larger than the client's request.
The client must not 1085 send an inline call message larger than what the server will 1086 accept. The maxcall_sendsize limits only the size of inline RPC 1087 calls. It does not limit the size of long RPC messages transferred 1088 as an initial chunk in the Read chunk list. 1090 The maxreply_sendsize is the maximum size of an inline RPC message 1091 that the client will accept from the server. 1093 The maxrdmaread is the maximum number of RDMA Reads which may be 1094 active at the peer. This number correlates to the RDMA incoming 1095 RDMA Read count ("IRD") configured into each originating endpoint 1096 by the client or server. If more than this number of RDMA Read 1097 operations by the connected peer are issued simultaneously, 1098 connection loss or suboptimal flow control may result, therefore 1099 the value should be observed at all times. The peers' values need 1100 not be equal. If zero, the peer must not issue requests which 1101 require RDMA Read to satisfy, as no transfer will be possible. 1103 The align value is the value recommended by the server for opaque 1104 data values such as strings and counted byte arrays. The client 1105 can use this value to compute the number of prepended pad bytes 1106 when XDR encoding opaque values in the RPC call message. 1108 typedef unsigned int uint32; 1110 struct config_rdma_req { 1111 uint32 maxcall_sendsize; 1112 /* max size of inline RPC call */ 1113 uint32 maxreply_sendsize; 1114 /* max size of inline RPC reply */ 1115 uint32 maxrdmaread; 1116 /* max active RDMA Reads at client */ 1117 }; 1119 struct config_rdma_reply { 1120 uint32 maxcall_sendsize; 1121 /* max call size accepted by server */ 1122 uint32 align; 1123 /* server's receive buffer alignment */ 1124 uint32 maxrdmaread; 1125 /* max active RDMA Reads at server */ 1126 }; 1128 program CONFIG_RDMA_PROG { 1129 version VERS1 { 1130 /* 1131 * Config call/reply 1132 */ 1133 config_rdma_reply CONF_RDMA(config_rdma_req) = 1; 1134 } = 1; 1135 } = nnnnnn; <-- Need program number assigned 1137 7. Memory Registration Overhead 1139 RDMA requires that all data be transferred between registered 1140 memory regions at the source and destination. All protocol headers 1141 as well as separately transferred data chunks must use registered 1142 memory. Since the cost of registering and de-registering memory 1143 can be a large proportion of the RDMA transaction cost, it is 1144 important to minimize registration activity. This is easily 1145 achieved within RPC controlled memory by allocating chunk list data 1146 and RPC headers in a reusable way from pre-registered pools. 1148 The data chunks transferred via RDMA may occupy memory that 1149 persists outside the bounds of the RPC transaction. Hence, the 1150 default behavior of an RPC over RDMA transport is to register and 1151 de-register these chunks on every transaction. However, this is 1152 not a limitation of the protocol - only of the existing local RPC 1153 API. The API is easily extended through such functions as 1154 rpc_control(3) to change the default behavior so that the 1155 application can assume responsibility for controlling memory 1156 registration through an RPC-provided registered memory allocator. 1158 8. Errors and Error Recovery 1160 RPC RDMA protocol errors are described in section 4. RPC errors 1161 and RPC error recovery are not affected by the protocol, and 1162 proceed as for any RPC error condition. RDMA Transport error 1163 reporting and recovery are outside the scope of this protocol. 
7. Memory Registration Overhead

   RDMA requires that all data be transferred between registered
   memory regions at the source and destination.  All protocol
   headers as well as separately transferred data chunks must use
   registered memory.  Since the cost of registering and
   de-registering memory can be a large proportion of the RDMA
   transaction cost, it is important to minimize registration
   activity.  This is easily achieved within RPC-controlled memory by
   allocating chunk list data and RPC headers in a reusable way from
   pre-registered pools.

   The data chunks transferred via RDMA may occupy memory that
   persists outside the bounds of the RPC transaction.  Hence, the
   default behavior of an RPC over RDMA transport is to register and
   de-register these chunks on every transaction.  However, this is a
   limitation of the existing local RPC API, not of the protocol
   itself.  The API is easily extended through such functions as
   rpc_control(3) to change the default behavior, so that the
   application can assume responsibility for controlling memory
   registration through an RPC-provided registered memory allocator.
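   As a non-normative sketch of this pooling approach (the
   rdma_register() stub stands in for whatever registration call the
   local RDMA provider offers, and the sizes are illustrative), RPC
   headers can be carved from a single region registered once at
   startup, so that per-RPC allocation touches no registration
   machinery:

      #include <stddef.h>
      #include <stdint.h>
      #include <stdlib.h>

      /* Illustrative stand-in for the provider's registration
       * call. */
      extern void *rdma_register(void *addr, size_t len);

      #define HDR_SLOT_SIZE  1024  /* fits any inline RPC header */
      #define HDR_SLOT_COUNT 128   /* hypothetical pool depth */

      struct hdr_pool {
          uint8_t *base;                    /* registered once */
          void    *mr;                      /* registration handle */
          int      free_stack[HDR_SLOT_COUNT];
          int      top;                     /* count of free slots */
      };

      static int hdr_pool_init(struct hdr_pool *p)
      {
          size_t len = (size_t)HDR_SLOT_SIZE * HDR_SLOT_COUNT;

          p->base = malloc(len);
          if (p->base == NULL)
              return -1;

          /* One registration amortized over every header sent. */
          p->mr = rdma_register(p->base, len);
          if (p->mr == NULL)
              return -1;

          for (p->top = 0; p->top < HDR_SLOT_COUNT; p->top++)
              p->free_stack[p->top] = p->top;
          return 0;
      }

      /* Per-RPC allocation: pop a slot, no registration activity. */
      static void *hdr_alloc(struct hdr_pool *p)
      {
          if (p->top == 0)
              return NULL;
          return p->base +
                 (size_t)p->free_stack[--p->top] * HDR_SLOT_SIZE;
      }

      static void hdr_free(struct hdr_pool *p, void *buf)
      {
          p->free_stack[p->top++] =
              (int)(((uint8_t *)buf - p->base) / HDR_SLOT_SIZE);
      }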
8. Errors and Error Recovery

   RPC RDMA protocol errors are described in section 4.  RPC errors
   and RPC error recovery are not affected by the protocol, and
   proceed as for any RPC error condition.  RDMA transport error
   reporting and recovery are outside the scope of this protocol.

   It is assumed that the link itself will provide some degree of
   error detection and retransmission.  iWARP's MPA layer (when used
   over TCP), SCTP, and the Infiniband link layer all provide CRC
   protection of the RDMA payload, and CRC-class protection is a
   general attribute of such transports.  Additionally, the RPC layer
   itself can accept errors from the link level and recover via
   retransmission.  RPC recovery can handle complete loss and
   re-establishment of the link.

   See section 11 for further discussion of the use of RPC-level
   integrity schemes to detect errors, and related efficiency issues.

9. Node Addressing

   In setting up a new RDMA connection, the first action by an RPC
   client will be to obtain a transport address for the server.  The
   mechanism used to obtain this address, and to open an RDMA
   connection, is dependent on the type of RDMA transport, and is the
   responsibility of each RPC protocol binding and its local
   implementation.

10. RPC Binding

   RPC services normally register with a portmap or rpcbind service,
   which associates an RPC program number with a service address.
   (In the case of UDP or TCP, the service address for NFS is
   normally port 2049.)  This policy should be no different for RDMA
   interconnects, although it may require the allocation of port
   numbers appropriate to each.

   When a mapping standard or convention exists for IP ports on an
   RDMA interconnect, there are two possibilities:

      One possibility is to have the server's portmapper register
      itself on the RDMA interconnect at a "well-known" service
      address.  (On UDP or TCP, this corresponds to port 111.)  A
      client could connect to this service address and use the
      portmap protocol to obtain a service address in response to a
      program number, e.g., an iWARP port number or an Infiniband
      GID.

      Alternatively, the client could simply connect to the mapped
      well-known port for the service itself, if it is appropriately
      defined.

   Historically, different RPC protocols have taken different
   approaches to their port assignment; the specific method is
   therefore left to each RPC/RDMA-enabled upper layer binding, and
   is not addressed here.

11. Security

   ONC RPC provides its own security via the RPCSEC_GSS framework
   [RFC2203].  RPCSEC_GSS can provide message authentication,
   integrity checking, and privacy.  This security mechanism will be
   unaffected by the RDMA transport.  The data integrity and privacy
   features alter the body of the message, presenting it as a single
   chunk.  For large messages the chunk may be large enough to
   qualify for RDMA Read transfer.  However, there is much data
   movement associated with computation and verification of
   integrity, or with encryption and decryption, so certain
   performance advantages may be lost.

   For efficiency, a more appropriate security mechanism for RDMA
   links may be link-level protection such as IPsec, which may be
   co-located in the RDMA hardware.  The use of link-level protection
   may be negotiated through the use of a new RPCSEC_GSS mechanism
   like the Credential Cache GSS Mechanism [CCM].  Use of such
   mechanisms is recommended where end-to-end integrity and/or
   privacy is desired and where efficiency is required.

   There are no new issues with exposed addresses: the only addresses
   exposed are in the chunk list and in the transport packets
   transferred via RDMA, and the data contained at these addresses
   continues to be protected by RPCSEC_GSS integrity and privacy.

12. IANA Considerations

   The new RPC transport should be assigned a new RPC "netid".

   As a new RPC transport, this protocol should have no effect on RPC
   program numbers or existing registered port numbers.  However, new
   port numbers may be registered for use by RPC/RDMA-enabled
   services, as appropriate to the new networks over which the
   services will operate.

   If adopted, the Connection Configuration protocol described herein
   will require an RPC program number assignment.

13. Acknowledgements

   The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
   Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
   Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
   Robinson, and Mallikarjun Chadalapaka for their contributions to
   this document.

14. Normative References

   [RFC1094]
        Sun Microsystems, "NFS: Network File System Protocol
        Specification", (NFS version 2) Informational RFC,
        http://www.ietf.org/rfc/rfc1094.txt

   [RFC1831]
        R. Srinivasan, "RPC: Remote Procedure Call Protocol
        Specification Version 2", Standards Track RFC,
        http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
        R. Srinivasan, "XDR: External Data Representation Standard",
        Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

   [RFC1813]
        B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
        Protocol Specification", Informational RFC,
        http://www.ietf.org/rfc/rfc1813.txt

   [RFC3530]
        S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
        M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards
        Track RFC, http://www.ietf.org/rfc/rfc3530.txt

   [RFC2203]
        M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
        Specification", Standards Track RFC,
        http://www.ietf.org/rfc/rfc2203.txt

15. Informative References

   [RDMAP]
        R. Recio et al., "An RDMA Protocol Specification", Internet
        Draft Work in Progress, draft-ietf-rddp-rdmap

   [CCM]
        M. Eisler, N. Williams, "CCM: The Credential Cache GSS
        Mechanism", Internet Draft Work in Progress,
        draft-ietf-nfsv4-ccm

   [NFSDDP]
        B. Callaghan, T. Talpey, "NFS Direct Data Placement",
        Internet Draft Work in Progress, draft-ietf-nfsv4-nfsdirect

   [RDDP]
        Remote Direct Data Placement Working Group Charter,
        http://www.ietf.org/html.charters/rddp-charter.html

   [NFSRDMAPS]
        T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
        Internet Draft Work in Progress,
        draft-ietf-nfsv4-nfs-rdma-problem-statement

   [NFSv4.1]
        S. Shepler, ed., "NFSv4 Minor Version 1", Internet Draft Work
        in Progress, draft-ietf-nfsv4-minorversion1

   [IB]
        Infiniband Architecture Specification,
        http://www.infinibandta.org

16. Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com
17. Intellectual Property and Copyright Statements

   Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology
   described in this document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.  Information on the procedures with respect to rights
   in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention
   any copyrights, patents or patent applications, or other
   proprietary rights that may cover technology that may be required
   to implement this standard.  Please address the information to the
   IETF at ietf-ipr@ietf.org.

   Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

   Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.