idnits 2.17.1 

draft-cel-nfsv4-rpcrdma-reliable-reply-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (July 16, 2018) is 2109 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 5661
     (Obsoleted by RFC 8881)

  -- Obsolete informational reference (is this intentional?): RFC 5666
     (Obsoleted by RFC 8166)


     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network File System Version 4                                   C. Lever
3	Internet-Draft                                                    Oracle
4	Intended status: Experimental                              July 16, 2018
5	Expires: January 17, 2019

7	 Improving the Performance and Reliability of RPC Replies on RPC-over-
8	                            RDMA Transports
9	               draft-cel-nfsv4-rpcrdma-reliable-reply-03

11	Abstract

13	   RPC transports such as RPC-over-RDMA version 1 require reply buffers
14	   to be in place before an RPC Call is sent.  However, RPC consumers
15	   sometimes have difficulty estimating the expected maximum size of a
16	   particular RPC reply.  This introduces the risk that an RPC Reply
17	   message can overrun reply resources provided by the requester,
18	   preventing delivery of the message, through no fault of the
19	   requester.  This document describes a mechanism that eliminates the
20	   need for pre-allocation of reply resources for unpredictably large
21	   replies.

23	Status of This Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at https://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on January 17, 2019.

40	Copyright Notice

42	   Copyright (c) 2018 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (https://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.  Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction
58	   2.  Requirements Language
59	   3.  Problem Statement
60	     3.1.  Reply Chunk Overrun
61	     3.2.  Reply Size Calculation
62	     3.3.  Requester Registration Costs
63	     3.4.  Denial of Service
64	     3.5.  Estimating Transport Header Size
65	   4.  Responder-Provided Read Chunks
66	     4.1.  Specification
67	   5.  Analysis
68	     5.1.  Benefits
69	     5.2.  Costs
70	     5.3.  Selecting a Reply Mechanism
71	     5.4.  Implementation Complexity
72	     5.5.  Alternatives
73	   6.  Interoperation Considerations
74	   7.  Security Considerations
75	   8.  IANA Considerations
76	   9.  References
77	     9.1.  Normative References
78	     9.2.  Informative References
79	   Acknowledgments
80	   Author's Address

82	1.  Introduction

84	   One way in which RPC-over-RDMA version 1 improves transport
85	   efficiency is by ensuring resources for RPC replies are available in
86	   advance of each RPC transaction [RFC8166].  These resources are
87	   typically provisioned before a requester sends each RPC Call message.
88	   They are provided to the responder to use for transmitting the
89	   associated RPC Reply message back to the requester.

91	   In particular, when the Payload Stream of an RPC Reply message is
92	   expected to be large, the requester allocates and registers a Reply
93	   chunk.  The responder transfers the RPC Reply message's Payload
94	   stream directly into the requester memory associated with that chunk,
95	   then indicates that the RPC Reply is ready.  The requester
96	   invalidates the memory region.

98	   In most cases, Upper Layer Protocols are capable of accurately
99	   calculating the maximum size of RPC Reply messages.  In addition, the
100	   average size of RPC Reply messages is small, making the risk of Reply
101	   chunk overrun exceptionally small.

103	   However, on rare occasions an Upper Layer Protocol might not be able
104	   to derive a reply size upper bound.  An example of this is the NFS
105	   version 4.2 READ_PLUS operation [RFC7862], where a reply can
106	   contain an unpredictable number of data content and hole descriptors.

108	   Further, since the average size of actual RPC Replies is small,
109	   requesters frequently allocate and register a Reply chunk for a reply
110	   that, once it has been constructed by the responder, is small enough
111	   to be sent inline.  In this case, a responder is free to either
112	   populate the Reply chunk or send the RPC Reply without the use of the
113	   Reply chunk.  The requester's cost of preparing the Reply chunk has
114	   been wasted, and the extra registration and invalidation adds
115	   unwanted latency to the operation.

117	   A better method of handling RPC replies could ensure that RPC Replies
118	   can be received even when the maximum possible size of some replies
119	   cannot be calculated in advance.  This method could also ensure that
120	   no extra memory registration/invalidation operations are necessary to
121	   make this guarantee.

123	   This document resurrects the responder-provided Read chunk mechanism
124	   that was briefly outlined in [RFC5666] to achieve these goals.  The
125	   discussion in this document assumes the reader is familiar with
126	   [RFC8166].

128	2.  Requirements Language

130	   The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT",
131	   "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
132	   interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only
133	   when, they appear in all capitals, as shown here.

135	3.  Problem Statement

137	   RPC-over-RDMA version 1 uses an RDMA Send request to transmit
138	   transport headers and small RPC messages.

140	   Each peer on an RPC-over-RDMA transport connection provisions Receive
141	   buffers in which to capture incoming RDMA Send messages.  There is a
142	   limited number of these buffers, necessitating accounting in the
143	   transport protocol to prevent a peer from emitting more Send
144	   operations than the receiver is prepared for.

146	   Because the selection of Receive Work Request to handle an incoming
147	   Send is outside the control of the host O/S, the smallest buffer in
148	   this pool determines the largest size message that can be received.
149	   The size of the largest message that can be received via RDMA Send is
150	   known as the receiver's "inline threshold" [RFC8166].

152	   When marshaling an RPC transaction, a requester allocates and
153	   registers a Reply chunk whenever the maximum possible size of the
154	   corresponding RPC-over-RDMA reply is larger than the requester's
155	   receive inline threshold.  The Reply chunk is presented to the
156	   responder as part of the RPC Call.  The responder may place the
157	   associated RPC Reply message in the memory region linked with this
158	   Reply chunk.

160	3.1.  Reply Chunk Overrun

162	   If a responder overruns a Reply chunk during an RDMA Write, a memory
163	   protection error occurs.  This typically results in connection loss.
164	   Any RPC transactions running on that connection must be
165	   retransmitted.  The failing RPC transaction will never get a reply,
166	   and retransmitting it may result in additional connection loss
167	   events.

169	   A smart responder compares the size of an RPC Reply with the size of
170	   the target Reply chunk before placing data in that chunk.  If the
171	   reply does not fit, a generic RDMA_ERROR message reports the problem
172	   and the requester can terminate the RPC transaction.

174	   In either case, the RPC is executed by the responder, but the
175	   requester does not receive the results or acknowledgement of its
176	   completion.
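
   The size check performed by a careful responder can be sketched as
   follows.  This is an illustrative Python sketch only, not part of
   any protocol definition.  The names Outcome and check_reply_chunk
   are hypothetical, and the Reply chunk is modeled simply as a list of
   segment lengths.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    RDMA_WRITE = "place the reply in the Reply chunk"
    RDMA_ERROR = "send RDMA_ERROR instead of writing"


def check_reply_chunk(reply_len: int, segment_lengths: List[int]) -> Outcome:
    """Compare the actual reply size with the Reply chunk's capacity."""
    capacity = sum(segment_lengths)  # total bytes the requester registered
    if reply_len <= capacity:
        return Outcome.RDMA_WRITE
    # An RDMA Write here would overrun the chunk, so report the problem
    # rather than risk a memory protection error and connection loss.
    return Outcome.RDMA_ERROR
```

   For example, a reply of 8193 bytes does not fit a Reply chunk made
   of two 4096-byte segments, so the sketch selects RDMA_ERROR.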

178	3.2.  Reply Size Calculation

180	   To determine when a Reply chunk is needed, requesters calculate the
181	   maximum possible size of the RPC Reply message expected for each
182	   transaction.  Upper Layer Bindings, such as [RFC8267], provide
183	   guidance on how to calculate Reply sizes and in what cases the Upper
184	   Layer Protocol might have difficulty giving an exact upper bound.

186	   Unfortunately, there are rare cases where an upper bound cannot be
187	   computed.  For instance, there is no way to know how large an NFS
188	   Access Control List (ACL) is until it is retrieved from an NFS server
189	   [RFC5661].  There is no protocol-specified limit on the size of NFS
190	   ACLs.  When retrieving an NFS ACL, there is always a risk, albeit a
191	   small one, that the NFS client has not provided a large enough Reply
192	   chunk, and that therefore the NFS server will not be able to return
193	   that ACL to the client (unless somehow a larger Reply chunk can be
194	   provided).

196	3.3.  Requester Registration Costs

198	   For an Upper Layer Protocol such as NFS version 4.2 [RFC7862], NFS
199	   COMPOUND Call and Reply messages can be large on occasion.  For
200	   instance, an NFSv4.2 COMPOUND can contain a LOOKUP operation together
201	   with a GETATTR operation.  The size of a LOOKUP result is relatively
202	   small.  However, the GETATTR in that COMPOUND may request attributes,
203	   such as ACLs or security labels, that can grow arbitrarily large and
204	   whose size is not known in advance.

206	   Thus a requester can be responsible for provisioning quite a large
207	   reply buffer for each LOOKUP COMPOUND, which is a frequent request.
208	   If the maximum possible reply message can be large, the requester is
209	   required to provide a Reply chunk.  Most of the time, however, the
210	   actual size of a LOOKUP COMPOUND reply is small enough to be sent
211	   using one RDMA Send.

213	   In other words, an NFS version 4 client provides a Reply chunk quite
214	   frequently during RPC transactions, but NFS version 4 servers almost
215	   never need to use it because the actual size of replies is typically
216	   less than the inline threshold.  The overhead of registering and
217	   invalidating this chunk is significant.  Moreover, it is unnecessary
218	   whenever the size of an actual RPC reply is small.

220	   Before an RPC transaction is terminated, a requester is responsible
221	   for fencing the Reply chunk from the responder [RFC8166].  That makes
222	   RPC completion synchronous with Reply chunk invalidation.  Therefore
223	   the latency of Reply chunk invalidation adds to the total execution
224	   time of the RPC transaction.

226	3.4.  Denial of Service

228	   When an RPC transaction is canceled or aborted (for instance, because
229	   an application process exited prematurely), a requester must
230	   invalidate or set aside Write and Reply chunks associated with that
231	   transaction [RFC8166].

233	   This is because that RPC transaction is still running on the
234	   responder.  The responder remains obligated to return the result of
235	   that transaction via RDMA Write, if there are Write or Reply chunks.
236	   If memory registered on behalf of that transaction is re-used, the
237	   requester must protect that memory from server RDMA Writes associated
238	   with previous transactions by fencing it from the responder.  The
239	   responder triggers a memory protection error when it writes into
240	   those memory regions, and the connection is lost.

242	   A malfunctioning application or a malicious user on the requester can
243	   create a situation where RPCs are continuously initiated and then
244	   aborted, resulting in responder replies that repeatedly terminate the
245	   underlying RPC-over-RDMA connection.

247	   A rogue responder can purposely overrun a Reply chunk to kill a
248	   connection.  Repeated connection loss can result in a Denial of
249	   Service.

251	3.5.  Estimating Transport Header Size

253	   To determine whether a Reply chunk is needed, a requester computes
254	   the size of the Reply's Transport Header and the maximum possible
255	   size of the RPC Reply message, and sums the two.  If the sum is
256	   smaller than the requester's receive inline threshold, a Reply chunk
257	   is not required.

259	   The size of a Transport Header depends on how many Write chunks the
260	   requester provides, whether a Reply chunk is needed, and how many
261	   segments are contained in provided Write and Reply chunks.

263	   When the total size of the Reply message is already near the inline
264	   threshold, therefore, a requester has to know whether a Reply chunk
265	   is needed (and how many segments it contains) before it can determine
266	   if a Reply chunk is needed.

268	   A requester can resort to limiting Transport Header size to a fixed
269	   value that ensures this computation does not become a recursion.
270	   However, as in earlier sections, this can mean that some RPC
271	   transactions where a Reply chunk is not strictly necessary must incur
272	   the cost of preparing a Reply chunk.
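
   The dependency described above can be made concrete with a small
   calculation.  The Python sketch below assumes the XDR layout of the
   RPC-over-RDMA version 1 transport header given in [RFC8166]: four
   fixed 32-bit fields, a Read list, a Write list, and an optional
   Reply chunk.  The function and constant names are hypothetical, and
   a message whose size exactly equals the inline threshold is assumed
   to fit.

```python
from typing import List

XDR_WORD = 4
FIXED_FIELDS = 4 * XDR_WORD   # rdma_xid, rdma_vers, rdma_credit, rdma_proc
PLAIN_SEGMENT = 4 * XDR_WORD  # handle, length, 64-bit offset


def reply_header_size(write_chunk_segments: List[int],
                      reply_chunk_segments: int) -> int:
    """Size in bytes of a Reply's transport header (empty Read list)."""
    size = FIXED_FIELDS
    size += XDR_WORD                       # Read list terminator
    for nsegs in write_chunk_segments:
        # list discriminator + segment count + the segments themselves
        size += 2 * XDR_WORD + nsegs * PLAIN_SEGMENT
    size += XDR_WORD                       # Write list terminator
    size += XDR_WORD                       # Reply chunk discriminator
    if reply_chunk_segments:
        size += XDR_WORD + reply_chunk_segments * PLAIN_SEGMENT
    return size


def needs_reply_chunk(max_reply: int, inline_threshold: int,
                      write_chunk_segments: List[int]) -> bool:
    """First pass of the estimate: assume no Reply chunk is provided."""
    header = reply_header_size(write_chunk_segments, 0)
    return header + max_reply > inline_threshold
```

   With no chunks at all the header is 28 bytes.  Adding a one-segment
   Reply chunk grows it to 48 bytes, which is why the decision can feed
   back into its own input when the total is close to the threshold.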

274	4.  Responder-Provided Read Chunks

276	   A potential mechanism for resolving these issues is suggested in
277	   Section 3.4 of [RFC5666]:

279	      In the absence of a server-provided read chunk list in the reply,
280	      if the encoded reply overflows the posted receive buffer, the RPC
281	      will fail with an RDMA transport error.

283	   When sending a large RPC Call message, requesters already employ Read
284	   chunks.  There is no advance indication or limit on the size of any
285	   RPC Call message.  To achieve the same flexibility for RPC Replies,
286	   Read chunks can be used in the reverse direction (i.e., responder
287	   exposes memory, requester initiates RDMA Read).

289	   Rather than a requester providing a Reply chunk for conveying an as-
290	   yet-unconstructed large reply, a responder can expose a Read chunk
291	   containing the actual Payload stream of the RPC Reply message.  A
292	   responder would employ a Read chunk to return a reply any time
293	   requester-provided reply resources are not adequate.

295	   The requester does not have to calculate a reply size maximum or
296	   register and invalidate a Reply chunk in these cases.  Without a
297	   requester-provided Reply chunk, the responder sends each reply
298	   inline, except when the actual size of an RPC Reply message is larger
299	   than the receiver's inline threshold.

301	   This results in no wasted activity on the requester and arbitrarily
302	   large RPC Replies can be received reliably.

304	   Current RPC-over-RDMA version 1 implementations do not support
305	   responder-provided Read chunks, although RPC-over-RDMA version 1 did
306	   have this support in the past [RFC5666].  Adapting this deprecated
307	   mechanism for new RPC-over-RDMA transports is straightforward.

309	4.1.  Specification

311	   A responder MAY choose to send an RPC Reply using a Position Zero
312	   Read chunk composed of one or more RDMA segments.  Position Zero
313	   Read chunks are defined in Section 3.5.3 of [RFC8166].

315	   Similar to its use in an RPC Call, a Position Zero Read chunk in an
316	   RPC Reply contains an RPC Reply's Payload stream.  Position Zero Read
317	   chunks are always sent using an RPC-over-RDMA RDMA_NOMSG message.

319	   In other words, a responder-provided Read chunk can replace the use
320	   of a Reply chunk in Long Replies.  And, as with Reply chunks, a
321	   responder must still make use of Write chunks provided by the
322	   requester.

324	4.1.1.  Responder Duties

326	   A responder MUST send a Position Zero Read chunk when the actual size
327	   of the RPC Reply's Payload stream exceeds all requester-provided
328	   reply resources; that is, when the inline threshold and any provided
329	   Reply chunk are both too small to accommodate the Payload stream of
330	   the reply.

332	   If a responder does not support responder-provided Read chunks in
333	   this case, it MUST return an appropriate permanent transport error to
334	   terminate the requester's RPC transaction.

336	4.1.2.  Requester Duties

338	   Upon receipt of an RDMA_NOMSG message containing a Position Zero Read
339	   chunk, the requester pulls the RPC Reply's Payload stream from the
340	   responder.

342	   After RDMA Read operations have completed (successfully or in error),
343	   the requester MUST inform the responder that it may invalidate the
344	   Read chunk containing the RPC Reply message.  This is referred to as
345	   "pull completion notification".

347	4.1.3.  Pull Completion Notification

349	   Pull completion notification is accomplished in one of two ways:

351	   o  The requester can send an RDMA_DONE message with the rdma_xid
352	      field set to the same value as the rdma_xid field in the
353	      RDMA_NOMSG message.  Or,

355	   o  The requester can piggyback the pull completion notification in
356	      the transport header of a subsequent RPC Call, if the transport
357	      protocol has such a facility.

359	   When an RPC transaction is aborted on a requester, the requester
360	   normally forgets its XID.  If a requester receives a reply bearing a
361	   Position Zero Read chunk and does not recognize the XID, the
362	   requester MUST notify the responder of pull completion.

364	   Whenever a responder receives a pull completion notification for an
365	   XID for which there is no Read chunk waiting to be invalidated, the
366	   responder MUST silently drop the notification.

368	   If a requester receives an RPC Reply via a responder-provided Read
369	   chunk, but does not support such chunks, it MUST inform the responder
370	   of pull completion and terminate the RPC transaction.

372	   A malicious or broken requester might neglect to send pull completion
373	   notifications for one or more RPC transactions that included
374	   responder-provided Read chunks.  To prevent exhaustion of responder
375	   resources, a responder can choose to invalidate its Read chunks after
376	   waiting for a short period.  If the requester attempts additional
377	   RDMA Read operations against that Read chunk, a remote access error
378	   occurs and the connection is lost.
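
   The requester-side rules in this section can be summarized in a
   short Python sketch.  It is illustrative only: the function name and
   the action strings are hypothetical, and actions are returned as a
   list rather than performed, so the ordering is easy to inspect.

```python
from typing import List, Set


def on_read_chunk_reply(xid: int, pending_xids: Set[int],
                        supports_read_chunks: bool) -> List[str]:
    """Decide what a requester does with a Position Zero Read chunk."""
    if not supports_read_chunks:
        # Cannot pull the reply, but the responder must still be told
        # that it may invalidate the Read chunk.
        return ["notify_pull_completion", "terminate_rpc"]
    if xid not in pending_xids:
        # The RPC was aborted and its XID forgotten: notify anyway.
        return ["notify_pull_completion"]
    pending_xids.discard(xid)
    # Notification follows the RDMA Reads whether they succeed or fail.
    return ["rdma_read", "notify_pull_completion", "complete_rpc"]
```

   In every branch the requester emits a pull completion notification,
   which is what allows the responder to release the exposed memory.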

380	4.1.4.  Remote Invalidation

382	   Remote Invalidation can reduce or eliminate the need for the
383	   responder to explicitly invalidate memory containing an RPC Reply
384	   message.

386	   Remote Invalidation might be done by transmitting an RDMA_DONE
387	   message using RDMA Send With Invalidate.  If instead pull completion
388	   notification is piggybacked on a subsequent RPC Call, a facility for
389	   Remote Invalidation would have to be built into RPC Call processing.

391	   If Remote Invalidation support is not indicated by one or both peers,
392	   messages carrying pull completion notification MUST be transmitted
393	   using RDMA Send.  If Remote Invalidation support is indicated by both
394	   peers, messages carrying pull completion notification SHOULD be
395	   transmitted using RDMA Send With Invalidate.

397	   The rule for choosing the value of the Send With Invalidate Work
398	   Request's inv_handle field depends on the version of the transport
399	   protocol that is in use.  If the responder has provided an R_key
400	   that may be invalidated, the requester MUST present only that R_key
401	   when using RDMA Send With Invalidate.
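
   A minimal Python sketch of this choice follows.  The function name
   and the returned tuple are hypothetical.  Falling back to a plain
   RDMA Send when the responder has offered no R_key is an assumption
   of this sketch, not a rule stated above.

```python
from typing import Optional, Tuple


def completion_send_op(requester_supports_inv: bool,
                       responder_supports_inv: bool,
                       responder_rkey: Optional[int]
                       ) -> Tuple[str, Optional[int]]:
    """Pick the Send variant for a pull completion notification."""
    both = requester_supports_inv and responder_supports_inv
    if both and responder_rkey is not None:
        # Present only the R_key that the responder offered.
        return ("SEND_WITH_INVALIDATE", responder_rkey)
    return ("SEND", None)
```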

403	5.  Analysis

405	5.1.  Benefits

407	5.1.1.  Less Frequent Use of Explicit RDMA

409	   The vast majority of RPC Replies can be conveyed via RDMA_MSG.  No
410	   extra Reply chunk registration and invalidation cost is incurred when
411	   a large RPC Reply message is possible but the actual reply size is
412	   small.  This reduces or even eliminates the use of explicit RDMA for
413	   frequent small-to-moderate-size replies, improving the average
414	   latency of individual RPCs and allowing RNIC and platform resources
415	   to scale better.

417	5.1.2.  Support for Arbitrarily Large Replies

419	   The responder-provided Read chunk approach accommodates arbitrarily
420	   large replies.  Requesters no longer need to calculate the maximum
421	   size of RPC Reply messages, even if a Reply chunk is provided.

423	5.1.3.  Protection of Connection After RPC Cancellation

425	   When an RPC is canceled on the requester (say, because the requesting
426	   application has been terminated), and no Reply chunk is provided, the
427	   requester is no longer responsible for invalidating that RPC's Reply
428	   chunk.  When the responder sends the reply, it provides a Position
429	   Zero Read chunk and does not use RDMA Write to transmit the RPC Reply
430	   message.  The transport connection is preserved because no memory
431	   protection violation can occur.

433	5.1.4.  Asynchronous Chunk Invalidation

435	   Registration of a responder-provided Read chunk must be completed
436	   before sending the RDMA_NOMSG message conveying the chunk
437	   information.  However, pull completion notification and subsequent
438	   responder-side memory invalidation can be performed after the RPC
439	   transaction has completed on the requester.  Because those are
440	   asynchronous to RPC completion, the additional latency is not
441	   attributed to the execution time of the RPC transaction.

443	5.2.  Costs

445	5.2.1.  Responder Memory Exposure

447	   Responder memory is registered and exposed to requesters when
448	   replying.  When a responder has properly allocated a Protection
449	   Domain for each connection and uses appropriate R_key rotation
450	   techniques (see Section 7), the exposure is minimal.  However,
451	   because current RPC-over-RDMA responder implementations do not expose
452	   memory to requesters, they typically share one Protection Domain
453	   among all connections.

455	5.2.2.  Round Trip Penalty

457	   Using a Read chunk for large replies introduces a round-trip penalty.
458	   A requester can provide a Reply chunk to avoid this penalty.
459	   However:

461	   o  The Read chunk round-trip penalty would be paid much less often
462	      than the Reply chunk registration cost is paid today, since
463	      responder-provided Read chunks are used only when necessary

465	   o  Read chunk frequency is reduced even further as the inline
466	      threshold is increased past the average size of the Upper Layer
467	      Protocol's RPC Replies

469	   o  Invalidation of a Reply chunk is synchronous with RPC completion,
470	      and may take as long as a round trip to the responder

472	   o  Read chunks are typically used for large payloads, where it is
473	      likely that data transmission time greatly exceeds the round-trip
474	      time

476	   There are a few particular situations where the frequency of large
477	   replies is high.  For example, the use of the krb5i or krb5p GSS
478	   services with RPC-over-RDMA requires that Payload reduction is not
479	   used.  Thus, RPC-over-RDMA peers use only pure RDMA Sends or Long
480	   messages when these services are in use.  The actual size of a
481	   READDIR reply is often unpredictable but is frequently large.  In
482	   these two cases, using a Reply chunk could be the more efficient
483	   default choice.

485	5.2.3.  Credit Accounting Complexity

487	   Credit accounting is made more complex by the use of RDMA_DONE
488	   messages after RDMA Read operations have completed.  Sending an
489	   RDMA_DONE message consumes one credit, temporarily reducing RPC
490	   concurrency on the connection.  There is no response to RDMA_DONE, so
491	   it is not clear to the sender when that credit becomes available
492	   again.  One way to resolve this is to add a new message type to the
493	   protocol, RDMA_ACK, which could be used any time there is a uni-
494	   directional transport message to maintain the proper balance of
495	   credit grants and responses.

497	   Alternatively, if the transport protocol supports piggybacking pull
498	   completion notification on RPC Call messages, the requester can
499	   piggyback in most cases to simplify credit accounting.  An explicit
500	   RDMA_DONE would be necessary only during light workloads, or the ULP
501	   could post an RPC NULL containing a piggybacked pull completion
502	   notification in these cases.

504	5.3.  Selecting a Reply Mechanism

506	   This section illustrates some possible implementation choices.

508	5.3.1.  Requester

510	   As an RPC Call is constructed, a requester might choose a reply
511	   mechanism based on its estimation of the range of possible sizes of
512	   the reply.

514	   Responder-provided Read chunk
515	      The requester knows the minimum size of the reply is smaller than
516	      the inline threshold, but the maximum size of the reply is larger
517	      than the inline threshold; or the requester cannot calculate the
518	      maximum size of the reply.  The requester does not provide a Reply
519	      chunk, and relies on a responder-provided Read chunk to handle
520	      large replies.

522	   Reply chunk
523	      The requester knows the minimum and maximum sizes of the reply are
524	      larger than the inline threshold.  The requester provides a Reply
525	      chunk.

527	   Send-only
528	      The requester knows the maximum size of the reply is smaller than
529	      the inline threshold.  The requester does not provide a Reply
530	      chunk, because every possible reply can be conveyed inline with
531	      a single RDMA Send.
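
   The three requester cases above might be coded along these lines.
   This Python sketch is one possible policy, not a required one.  The
   names ReplyPlan and plan_reply are hypothetical, an unknown maximum
   is represented by None, and a reply exactly at the inline threshold
   is assumed to fit inline.

```python
from enum import Enum
from typing import Optional


class ReplyPlan(Enum):
    SEND_ONLY = "no Reply chunk; reply always fits inline"
    REPLY_CHUNK = "provide a Reply chunk"
    RESPONDER_READ_CHUNK = "no Reply chunk; rely on responder Read chunk"


def plan_reply(min_size: int, max_size: Optional[int],
               inline_threshold: int) -> ReplyPlan:
    """Choose a reply mechanism from the estimated reply size range."""
    if max_size is None:
        # No upper bound can be calculated for this reply.
        return ReplyPlan.RESPONDER_READ_CHUNK
    if max_size <= inline_threshold:
        return ReplyPlan.SEND_ONLY
    if min_size > inline_threshold:
        # The reply is certain to be large, so registration is not wasted.
        return ReplyPlan.REPLY_CHUNK
    # The reply is usually small but occasionally large.
    return ReplyPlan.RESPONDER_READ_CHUNK
```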

533	   A requester whose design requires Reply chunk invalidation after an
534	   RPC transaction is canceled might choose to never use Reply chunks,
535	   in favor of minimizing opportunities for connection loss.

537	5.3.2.  Responder

539	   After a responder has constructed an RPC Reply, it might choose which
540	   reply mechanism to employ based on the actual size of the Payload
541	   stream of the RPC Reply message.

543	   Responder-provided Read chunk
544	      The Payload stream is larger than the inline threshold and either
545	      no Reply chunk was provided or the provided Reply chunk is too
546	      small.  The responder uses a responder-provided Read chunk.

548	   Reply chunk
549	      If a usable Reply chunk is available, the responder uses the Reply
550	      chunk.

552	   Send-only
553	      If no Reply chunk is available and the Payload stream fits within
554	      the inline threshold, the responder uses only Send or Send With
555	      Invalidate to transmit the reply.
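
   The responder's choice can be sketched in the same style.  This
   Python sketch is illustrative only, and its names are hypothetical.
   The parameter inline_payload_limit is assumed to be the space left
   for the Payload stream after the transport header, and a missing
   Reply chunk is represented by None.  The TRANSPORT_ERROR outcome
   reflects the rule in Section 4.1.1 for responders that do not
   support responder-provided Read chunks.

```python
from enum import Enum
from typing import Optional


class ReplyMechanism(Enum):
    SEND_ONLY = "Send or Send With Invalidate"
    REPLY_CHUNK = "RDMA Write into the Reply chunk"
    READ_CHUNK = "responder-provided Read chunk"
    TRANSPORT_ERROR = "permanent transport error"


def choose_reply_mechanism(payload_len: int, inline_payload_limit: int,
                           reply_chunk_capacity: Optional[int],
                           supports_read_chunks: bool = True
                           ) -> ReplyMechanism:
    """Pick a mechanism from the actual size of the Payload stream."""
    usable = (reply_chunk_capacity is not None
              and payload_len <= reply_chunk_capacity)
    if usable:
        return ReplyMechanism.REPLY_CHUNK
    if payload_len <= inline_payload_limit:
        return ReplyMechanism.SEND_ONLY
    # The reply exceeds every requester-provided reply resource.
    if supports_read_chunks:
        return ReplyMechanism.READ_CHUNK
    return ReplyMechanism.TRANSPORT_ERROR
```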

557	5.4.  Implementation Complexity

559	5.4.1.  RPC Call Path

561	   Implementation of responder-provided Read chunks introduces little or
562	   no additional complexity to the end-to-end RPC Call path.  Unless a
563	   requester implementer chooses to implement support for both Reply
564	   chunks and responder-provided Read chunks, there could be a net loss
565	   of code and run-time complexity in the RPC Call hot path.

567	   The responder's RPC Call path needs to recognize RDMA_DONE messages
568	   and initiate invalidation of Read chunks.  Because invalidation can
569	   be asynchronous, it is possible to perform Read chunk invalidation in
570	   a separate worker thread.

572	5.4.2.  RPC Reply Path

574	   On the RPC Reply path, logic to initiate registration of Read
575	   chunks and wait for completion is added to the responder.  This path
576	   is not part of the hot path because it is used only infrequently.

578	   The requester's reply handling hot path must recognize when Read
579	   chunks are present in an RDMA_NOMSG message, and shunt execution to
580	   code that can initiate an RDMA Read and wait for completion.  Once
581	   complete, the requester posts an RDMA_DONE message.

583	5.4.3.  Managing RDMA_DONE messages

585	   In order for a responder to match incoming RDMA_DONE messages to
586	   reply buffers waiting to be invalidated, it might keep references to
587	   these buffers in a data structure searchable by XID.  This is similar
588	   to managing a set of pending backchannel replies.

590	   When an RDMA_DONE message arrives, the responder matches the XID in
591	   the message to a waiting reply buffer, invalidates that buffer, and
592	   removes the XID from the data structure.

594	   This data structure can also be used for housekeeping tasks such as:

596	   o  Invalidating waiting buffers after a timeout, in case the
597	      requester never sends RDMA_DONE

599	   o  Ignoring retransmitted or garbage RDMA_DONE requests

601	   o  Explicitly invalidating waiting Read chunks after a connection
602	      loss, if necessary

604	   o  Invalidating waiting buffers on device removal
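
   One possible shape for such a data structure is sketched below,
   assuming a simple dictionary keyed by XID.  The class PendingReplies,
   its method names, and the invalidate callback are hypothetical; a
   real implementation would also need locking and would invalidate
   registered memory rather than call a Python function.

```python
import time


class PendingReplies:
    """Reply buffers awaiting RDMA_DONE, searchable by XID."""

    def __init__(self, timeout, invalidate, clock=time.monotonic):
        self._timeout = timeout        # seconds to wait for RDMA_DONE
        self._invalidate = invalidate  # assumed callback: invalidate(buffer)
        self._clock = clock
        self._by_xid = {}              # xid -> (buffer, deadline)

    def add(self, xid, buffer):
        """Record a buffer exposed to the requester for this XID."""
        self._by_xid[xid] = (buffer, self._clock() + self._timeout)

    def on_rdma_done(self, xid):
        """Invalidate the matching buffer; ignore unknown or repeated XIDs."""
        entry = self._by_xid.pop(xid, None)
        if entry is None:
            return False               # retransmitted or garbage RDMA_DONE
        self._invalidate(entry[0])
        return True

    def expire(self):
        """Invalidate buffers whose requester never sent RDMA_DONE."""
        now = self._clock()
        stale = [xid for xid, (_, deadline) in self._by_xid.items()
                 if deadline <= now]
        for xid in stale:
            self._invalidate(self._by_xid.pop(xid)[0])
        return stale

    def flush(self):
        """Invalidate everything, e.g. on connection loss or device removal."""
        for buffer, _ in self._by_xid.values():
            self._invalidate(buffer)
        self._by_xid.clear()


# Usage with a fake clock so that the timeout is deterministic.
now = [0.0]
released = []
table = PendingReplies(timeout=5.0, invalidate=released.append,
                       clock=lambda: now[0])
for xid, buf in ((0x1, "buf-1"), (0x2, "buf-2"), (0x3, "buf-3")):
    table.add(xid, buf)
first = table.on_rdma_done(0x1)      # matches a waiting buffer
duplicate = table.on_rdma_done(0x1)  # already removed, so ignored
now[0] = 6.0
table.add(0x4, "buf-4")
expired = table.expire()             # 0x2 and 0x3 have timed out
table.flush()                        # releases the remaining buffer
```

   Removing the XID before invalidating the buffer means that a
   retransmitted RDMA_DONE can never trigger a second invalidation.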

606	5.5.  Alternatives

608	   Increasing the inline threshold reduces the likelihood of needing a
609	   Reply chunk, but does not eliminate the risks associated with
610	   unpredictably large replies.

612	   Message Continuation is more efficient than an explicit RDMA
613	   operation, and does not require the exposure of requester or
614	   responder memory [I-D.dnoveck-nfsv4-rpcrdma-rtrext].

616	   However, Message Continuation still limits the maximum size of a
617	   conveyed message.  As with a larger inline threshold, without
618	   responder-provided Read chunks, reply size estimation is still
619	   required to determine when a Reply chunk is required, and therefore
620	   there is still risk associated with unpredictably large replies.

622	   Message Continuation introduces complexity in the management of RPC-
623	   over-RDMA credit grants because the relationship between RPC
624	   transactions and credits is no longer one-to-one.  Credit management
625	   logic is an integral part of the RPC Call and Reply hot path on the
626	   requester.

628	6.  Interoperation Considerations

630	   When a requester supports responder-provided Read chunks, it is
631	   likely to omit Reply chunks in some cases.  A responder
632	   that does not support responder-provided Read chunks can convey a
633	   transport-level error when it has generated an RPC Reply that is
634	   larger than the available reply resources.

636	   The situation is more problematic if a responder supports responder-
637	   provided Read chunks and sends them to a requester that is not able
638	   to recognize and unmarshal them.  The RPC transaction would never
639	   complete, and the requester would never send a pull completion
640	   notification.

642	   Thus responder-provided Read chunks MUST be used only when both peers
643	   support them: Either the base protocol version always has support
644	   enabled, or the base protocol provides an extension mechanism that
645	   indicates when support is available.
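
   As an illustration of the requirement above, the sketch below shows
   how a responder might select a conveyance method for an RPC Reply.
   The function choose_conveyance, its parameters, and the Conveyance
   type are hypothetical.  The order of preference is an assumption,
   and the reply length is compared directly to the inline threshold,
   ignoring the size of the transport header.

```python
from enum import Enum


class Conveyance(Enum):
    INLINE = "inline"
    REPLY_CHUNK = "reply chunk"
    RESPONDER_READ_CHUNK = "responder-provided read chunk"
    TRANSPORT_ERROR = "transport error"


def choose_conveyance(reply_len, inline_threshold, reply_chunk_len,
                      local_support, peer_support):
    """Pick how to convey a reply of reply_len bytes.

    reply_chunk_len is None when the requester provided no Reply chunk.
    Responder-provided Read chunks are chosen only when both peers
    support them.
    """
    if reply_len <= inline_threshold:
        return Conveyance.INLINE
    if reply_chunk_len is not None and reply_len <= reply_chunk_len:
        return Conveyance.REPLY_CHUNK
    if local_support and peer_support:
        return Conveyance.RESPONDER_READ_CHUNK
    # Reply exceeds the available reply resources and no fallback exists.
    return Conveyance.TRANSPORT_ERROR


# A 64 KB reply, a 4 KB inline threshold, and no Reply chunk provided.
with_support = choose_conveyance(65536, 4096, None, True, True)
without_support = choose_conveyance(65536, 4096, None, True, False)
```

   In this sketch a responder never emits Read chunks toward a peer
   that has not indicated support; it conveys a transport-level error
   instead.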

647	7.  Security Considerations

649	   The less frequent use of RDMA Write reduces opportunities for memory
650	   overrun on the requester, and reduces the risk of connection loss
651	   after an application is terminated prematurely.  This reduces
652	   exposure to accidental or malicious Denial of Service attacks.

654	   Responder-provided Read chunks are exposed for read-only access.
655	   Remote actors cannot alter the contents of exposed read-only memory,
656	   though a man-in-the-middle can read or alter RDMA payloads while they
657	   are in transit.  The use of RPCSEC_GSS or a transport-layer
658	   confidentiality service completely blocks payload access by
659	   unintended recipients.

661	   Recommendations about adequate R_key rotation and the appropriate use
662	   of Protection Domains can be found in Section 8.1 of [RFC8166].
663	   These recommendations apply when responders expose memory to convey
664	   the Payload stream of an RPC Reply message.

666	   Otherwise, this mechanism does not alter the attack surface of a
667	   transport protocol that employs it.

669	8.  IANA Considerations

671	   This document does not require actions by IANA.

673	9.  References

675	9.1.  Normative References

677	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
678	              Requirement Levels", BCP 14, RFC 2119,
679	              DOI 10.17487/RFC2119, March 1997,
680	              <https://www.rfc-editor.org/info/rfc2119>.

682	   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
683	              Memory Access Transport for Remote Procedure Call Version
684	              1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
685	              <https://www.rfc-editor.org/info/rfc8166>.

687	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
688	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
689	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

691	9.2.  Informative References

693	   [I-D.dnoveck-nfsv4-rpcrdma-rtrext]
694	              Noveck, D., "RPC-over-RDMA Extensions to Reduce Internode
695	              Round-trips", draft-dnoveck-nfsv4-rpcrdma-rtrext-03 (work
696	              in progress), December 2017.

698	   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
699	              "Network File System (NFS) Version 4 Minor Version 1
700	              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
701	              <https://www.rfc-editor.org/info/rfc5661>.

703	   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
704	              Transport for Remote Procedure Call", RFC 5666,
705	              DOI 10.17487/RFC5666, January 2010,
706	              <https://www.rfc-editor.org/info/rfc5666>.

708	   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
709	              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
710	              November 2016, <https://www.rfc-editor.org/info/rfc7862>.

712	   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
713	              to RPC-over-RDMA Version 1", RFC 8267,
714	              DOI 10.17487/RFC8267, October 2017,
715	              <https://www.rfc-editor.org/info/rfc8267>.

717	Acknowledgments

719	   Many thanks go to Karen Dietke, Chunli Zhang, Dai Ngo, and Tom
720	   Talpey.  The author also wishes to thank Bill Baker and Greg Marsden
721	   for their support of this work.

723	   Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
724	   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
725	   Working Group Secretary Thomas Haynes for their support.

727	Author's Address

729	   Charles Lever
730	   Oracle Corporation
731	   1015 Granger Avenue
732	   Ann Arbor, MI  48104
733	   United States of America

735	   Phone: +1 248 816 6463
736	   Email: chuck.lever@oracle.com