Network File System Version 4                                 D. Noveck
Internet-Draft                                                   NetApp
Intended status: Informational                          August 21, 2017
Expires: February 22, 2018

         Issues Related to RPC-over-RDMA Internode Round Trips
                draft-dnoveck-nfsv4-rpcrdma-rtissues-04

Abstract

As currently designed and implemented, the RPC-over-RDMA protocol
requires use of multiple internode round trips to process some common
operations.  For example, NFS WRITE operations require use of three
internode round trips.  This document looks at this issue and
discusses what can and what should be done to address it, both within
the context of an extensible version of RPC-over-RDMA and potentially
outside that framework.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 22, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1.  Preliminaries
    1.1.  Requirements Language
    1.2.  Introduction
2.  Review of the Current Situation
    2.1.  Troublesome Requests
    2.2.  WRITE Request Processing Details
    2.3.  READ Request Processing Details
3.  Near-term Work
    3.1.  Target Performance
    3.2.  Message Continuation
    3.3.  Send-based DDP
    3.4.  Feature Synergy
    3.5.  Feature Selection and Negotiation
4.  Possible Future Development
5.  Summary
6.  Security Considerations
7.  IANA Considerations
8.  References
    8.1.  Normative References
    8.2.  Informative References
Appendix A.  Acknowledgements
Author's Address
1.  Preliminaries

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

1.2.  Introduction

When many common operations are performed using RPC-over-RDMA, taking
advantage of the performance benefits provided by RDMA functionality
requires additional internode round trips.

While the latencies involved are generally small, they are a concern
for two reasons:

o  With the ongoing improvement of persistent memory technologies,
   such internode latencies, being fixed, can be expected to consume
   an increasing portion of the total latency required for processing
   NFS requests using RPC-over-RDMA.

o  High-performance transfers using NFS may be needed outside of a
   machine-room environment.  As RPC-over-RDMA is used in networks of
   campus and metropolitan scale, the internode round-trip time of
   sixteen microseconds per mile becomes an issue, as the rough
   calculation after this list illustrates.
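As a check on that figure, here is the underlying arithmetic (a
back-of-the-envelope sketch, assuming a typical fiber refractive
index of roughly 1.5):

   speed of light in vacuum:    ~300,000 km/s
   speed of light in fiber:     ~200,000 km/s  (refractive index ~1.5)
   one-way delay per mile:      1.609 km / 200,000 km/s  =  ~8 us
   round-trip delay per mile:   ~16 us

Over a thirty-mile metropolitan path, each internode round trip thus
costs roughly half a millisecond of propagation delay, a cost no
protocol or implementation improvement can remove.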
Given this background, round trips beyond the minimum necessary need
to be justified by corresponding benefits.  If they are not, work
needs to be done to eliminate those excess round trips.

We are going to look at the existing situation with regard to round-
trip latency and make some suggestions as to how the issue might be
best addressed.  We will consider things that could be done in the
near future and also explore further possibilities that would require
a longer-term approach to be adopted.

2.  Review of the Current Situation

2.1.  Troublesome Requests

We will be looking at four sorts of situations:

o  An RPC operation involving Direct Data Placement of request data
   (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).

o  An RPC operation involving Direct Data Placement of response data
   (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).

o  An RPC operation where the request data is longer than the inline
   buffer limit.

o  An RPC operation where the response data is longer than the inline
   buffer limit.

These are all simple examples of situations in which explicit RDMA
operations are used, either to effect Direct Data Placement or to
respond to message size limits that derive from a limited receive
buffer size.

We will survey the resulting latency and overhead issues in an RPC-
over-RDMA Version One environment in Sections 2.2 and 2.3 below.

2.2.  WRITE Request Processing Details

We'll start with the case of a request involving direct placement of
request data.  In this case, an RDMA READ is used to transfer a DDP-
eligible data item (e.g., the data to be written) from its location
in requester memory to a location selected by the responder.

Processing proceeds as described below.  Although we are focused on
internode latency, the time to perform a request also includes such
things as interrupt latency, overhead involved in interacting with
the RNIC, and the time for the server to execute the requested
operation.

o  First, the memory to be accessed remotely is registered.  This is
   a local operation.

o  Once the registration has been done, the initial send of the
   request can proceed.  Since this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received by the responder.  As a result, only the responder-bound
   side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   uses RDMA READ to fetch the bulk data.  This involves an internode
   round-trip latency.  After the fetch of the data, the responder
   needs to be notified of the completion of the explicit RDMA
   operation.

o  The responder (after performing the requested operation) sends the
   response.  Again, as this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received by the requester.

o  The memory registered before the request was issued needs to be
   deregistered before the request is considered complete and the
   sending process restarted.  When remote invalidation is not
   available, the requester, after being notified of the receipt of
   the response, performs a local operation to deregister the memory
   in question.  Alternatively, the responder will use Send With
   Invalidate and the requester's RNIC will effect the deregistration
   before notifying the requester of the response which has been
   received.

To summarize, if we exclude the actual server execution of the
request, the latency consists of two internode round-trip latencies
plus two responder-side interrupt latencies plus one requester-side
interrupt latency plus any necessary registration/deregistration
overhead.  This is in contrast to a request not using explicit RDMA
operations, in which there is a single internode round-trip latency
and one interrupt latency on the requester and the responder.
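The requester's side of this sequence can be sketched in terms of the
verbs interface.  This is an illustrative outline only, not taken
from any implementation: error handling is omitted, and
post_rpc_send() and wait_for_reply() are hypothetical stand-ins for
the surrounding transport code.

   #include <stdint.h>
   #include <infiniband/verbs.h>

   /* Hypothetical helpers standing in for the RPC transport layer. */
   extern void post_rpc_send(struct ibv_qp *qp, uint32_t rkey,
                             uint64_t addr, size_t len);
   extern void wait_for_reply(struct ibv_qp *qp);

   /* Requester-side outline of an NFS WRITE over RPC-over-RDMA
    * Version One. */
   void issue_write_request(struct ibv_pd *pd, struct ibv_qp *qp,
                            void *data, size_t len)
   {
       /* Step 1 (local): register the bulk data so the responder
        * can fetch it with RDMA READ. */
       struct ibv_mr *mr = ibv_reg_mr(pd, data, len,
                                      IBV_ACCESS_LOCAL_WRITE |
                                      IBV_ACCESS_REMOTE_READ);

       /* Step 2: send the RPC call, advertising the read chunk.
        * Only the responder-bound half of this exchange contributes
        * to operation latency. */
       post_rpc_send(qp, mr->rkey, (uint64_t)(uintptr_t)data, len);

       /* Steps 3-4 happen on the responder: it takes an interrupt,
        * posts an RDMA READ (a full internode round trip), takes a
        * second interrupt when that READ completes, performs the
        * write, and sends the reply. */

       /* Step 5: wait for the reply. */
       wait_for_reply(qp);

       /* Step 6 (local): deregister, unless the responder used Send
        * With Invalidate, in which case the requester's RNIC has
        * already invalidated mr before delivering the reply. */
       ibv_dereg_mr(mr);
   }

The two responder-side interrupts (steps 3 and 4 in the comments) and
the RDMA READ round trip between them are exactly the extra latency
identified in the summary above.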
The processing of the other sorts of requests mentioned in
Section 2.1 shows both similarities and differences:

o  Handling of a long request is similar to the above.  The memory
   associated with a position-zero read chunk is registered,
   transferred using RDMA READ, and deregistered.  As a result, we
   have the same overhead and latency issues noted in the case of
   direct data placement, without the corresponding benefits.

o  The case of direct data placement of response data follows a
   similar pattern.  The important difference is that the transfer of
   the bulk data is performed using RDMA WRITE, rather than RDMA
   READ.  However, because of the way that RDMA WRITE is effected
   over the wire, the latency consequences are different.  See
   Section 2.3 for a detailed discussion.

o  Handling of a long response is similar to the previous case.

2.3.  READ Request Processing Details

We'll now discuss the case of a request involving direct placement of
response data.  In this case, an RDMA WRITE is used to transfer a
DDP-eligible data item (e.g., the data being read) from its location
in responder memory to a location previously selected by the
requester.

Processing proceeds as described below.  Although we are focused on
internode latency, the time to perform a request also includes such
things as interrupt latency, overhead involved in interacting with
the RNIC, and the time for the server to execute the requested
operation.

o  First, the memory to be accessed remotely is registered.  This is
   a local operation.

o  Once the registration has been done, the initial send of the
   request can proceed.  Since this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received.  As a result, only the responder-bound side of the
   transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   proceeds to process the request until the data to be read is
   available in its own memory, with its location determined and
   fixed.  It then uses RDMA WRITE to transfer the bulk data to the
   location in requester memory selected previously.  This involves
   an internode latency, but there is no round trip and thus no
   round-trip latency.

o  The responder continues processing and sends the inline portion of
   the response.  Again, as this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed immediately.  If the RDMA WRITE or the
   send of the inline portion of the response were to fail, the
   responder can be notified subsequently.

o  The requester, after being notified of the receipt of the
   response, can analyze it and can access the data written into its
   memory.  Deregistration of the memory originally registered before
   the request was issued can be done using remote invalidation or
   can be done by the requester as a local operation.

To summarize, in this case the additional latency that we saw in
Section 2.2 does not arise.  Except for the additional overhead due
to memory registration and invalidation, the situation is the same as
for a request not using explicit RDMA operations, in which there is a
single internode round-trip latency and one interrupt latency on the
requester and the responder.
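The reason no round-trip latency appears is visible at the verbs
level: the responder can post the RDMA WRITE and the reply send back
to back, without waiting for any acknowledgment in between.  A
minimal sketch follows (illustrative only; the work-request setup is
abbreviated and the remote address and rkey are assumed to have come
from the request's write chunk):

   #include <stdint.h>
   #include <string.h>
   #include <infiniband/verbs.h>

   /* Responder side of an NFS READ: push the bulk data with RDMA
    * WRITE, then send the inline reply, in a single chained post. */
   void post_read_reply(struct ibv_qp *qp,
                        struct ibv_sge *data_sge,
                        struct ibv_sge *reply_sge,
                        uint64_t remote_addr, uint32_t rkey)
   {
       struct ibv_send_wr wr_write, wr_send, *bad_wr = NULL;

       memset(&wr_write, 0, sizeof(wr_write));
       wr_write.opcode              = IBV_WR_RDMA_WRITE;
       wr_write.sg_list             = data_sge;
       wr_write.num_sge             = 1;
       wr_write.wr.rdma.remote_addr = remote_addr; /* from write chunk */
       wr_write.wr.rdma.rkey        = rkey;

       memset(&wr_send, 0, sizeof(wr_send));
       wr_send.opcode     = IBV_WR_SEND;
       wr_send.sg_list    = reply_sge;
       wr_send.num_sge    = 1;
       wr_send.send_flags = IBV_SEND_SIGNALED;

       /* Chain the two work requests: the RDMA WRITE goes onto the
        * wire ahead of the reply, but the responder never blocks
        * between them. */
       wr_write.next = &wr_send;
       ibv_post_send(qp, &wr_write, &bad_wr);
   }

On a reliable connection, send ordering on a single queue pair
guarantees the requester will not see the reply before the written
data has been placed, which is what makes this fire-and-forget
pattern safe.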
3.  Near-term Work

We are going to consider how the latency and overhead issues
discussed in Section 2 might be addressed in the context of an
extensible version of RPC-over-RDMA, such as that proposed in
[rpcrdmav2].

In Section 3.1, we will establish a performance target for the
troublesome requests, based on the performance of requests that do
not involve long messages or direct data placement.

We will then consider how extensions might be defined to bring
latency and overhead for the requests discussed in Section 2.1 into
line with those for other requests.  There will be two specific
classes of requests to address:

o  Those that do not involve direct data placement will be addressed
   in Section 3.2.  In this case, there are no compensating benefits
   justifying the higher overhead and, in some cases, latency.

o  The more complicated case of requests that do involve direct data
   placement is discussed in Section 3.3.  In this case, direct data
   placement could serve as a compensating benefit, and the important
   question to be addressed is whether direct data placement can be
   effected without the use of explicit RDMA operations.

The optional features to deal with each of the classes of messages
discussed above could be implemented separately.  However, in the
handling of RPCs with very large amounts of bulk data, the two
features are synergistic.  This fact makes it desirable to define the
features as part of the same extension.  See Sections 3.4 and 3.5 for
details.

3.1.  Target Performance

As our target, we will look at the latency and overhead associated
with other sorts of RPC requests, i.e., those that do not use DDP and
that have request and response messages which fit within the receive
buffer limit.

Processing proceeds as follows:

o  The initial send of the request is done.  Since this is in the
   context of connected operation, there is an internode round trip
   involved.  However, the next step can proceed after the initial
   transmission is received.  As a result, only the responder-bound
   side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   performs the requested operation and sends the reply.  As in the
   case of the request, there is an internode round trip involved.
   However, the request can be considered complete upon receipt of
   the requester-bound transmission.  The responder-bound
   acknowledgment does not contribute to request latency.

In this case, there is only a single internode round-trip latency
necessary to effect the RPC.  Total request latency includes this
round-trip latency plus interrupt latency on the requester and
responder, plus the time for the responder to actually perform the
requested operation.

Thus the delta between the operations discussed in Section 2 and our
baseline consists of two portions, one of which applies to all the
requests we are concerned with and the second of which only applies
to requests which involve use of RDMA READ, as discussed in
Section 2.2.  The latter category consists of:

o  One additional internode round-trip latency.

o  One additional instance of responder-side interrupt latency.

The additional overhead necessary to do memory registration and
deregistration applies to all requests using explicit RDMA
operations.  The costs will vary with implementation characteristics.
As a result, in some implementations, it may be desirable to replace
use of RDMA Write with send-based alternatives, while in others, use
of RDMA Write may be preferable.
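The delta can be summarized in a rough latency model (the notation is
ours, not drawn from any specification): let RTT be the internode
round-trip time, Ireq and Iresp the requester- and responder-side
interrupt latencies, Texec the server execution time, and Treg the
registration/deregistration overhead.  Then, per the analyses in
Sections 2.2, 2.3, and this section:

   baseline request:         L = RTT   + Ireq + Iresp          + Texec
   RDMA-READ-based request:  L = 2*RTT + Ireq + 2*Iresp + Treg + Texec
   RDMA-WRITE-based request: L = RTT   + Ireq + Iresp   + Treg + Texec

The extensions discussed below aim to eliminate the extra RTT and
Iresp terms where they are not buying direct data placement, and to
make Treg avoidable where registration costs dominate.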
3.2.  Message Continuation

Using multiple RPC-over-RDMA transmissions, in sequence, to send a
single RPC message avoids the additional latency and overhead
associated with the use of explicit RDMA operations to transfer
position-zero read chunks.  In the case of reply chunks, only
overhead is reduced.

Although transfer of a single request or reply in N transmissions
will involve N+1 internode latencies, overall request latency is not
increased, since there is no requirement that the operations
involving multiple nodes be serialized.  Generally, these
transmissions are pipelined.

As an illustration, let's consider the case of a request involving a
response consisting of two RPC-over-RDMA transmissions.  Even though
each of these transmissions is acknowledged, that acknowledgment does
not contribute to request latency.  The second transmission can be
received by the requester and acted upon without waiting for either
acknowledgment.

This situation would require multiple receive-side interrupts, but it
is unlikely to result in extended interrupt latency.  With 1K sends
(Version One), the second receive will complete about 200 nanoseconds
after the first, assuming a 40Gb/s transmission rate.  Given likely
interrupt latencies, the first interrupt routine would be able to
note that the completion of the second receive had already occurred.

3.3.  Send-based DDP

In order to effect proper placement of request or reply data within
the context of individual RPC-over-RDMA transmissions, receive
buffers need to be structured to accommodate this function.

To illustrate the considerations that could lead clients and servers
to choose particular buffer structures, we will use, as examples, the
cases of NFS READs and WRITEs of 8K data blocks (or the corresponding
NFSv4 COMPOUNDs).

In such cases, the client and server need to have the DDP-eligible
bulk data placed in appropriately aligned 8K buffer segments.  Rather
than being transferred in separate transmissions using explicit RDMA
operations, a message can be sent so that bulk data is received into
an appropriate buffer segment.  In this case, it would be excised
from the XDR payload stream, just as it is in the case of existing
DDP facilities.

Consider a server expecting write requests which are usually X bytes
long or less, exclusive of an 8K bulk data area.  In this case the
payload stream will most likely be less than X bytes and will fit in
a buffer segment devoted to that purpose.  The bulk data needs to be
placed in the subsequent buffer segment so that it arrives with the
appropriate alignment in the DDP target buffer.  In order to place
the data appropriately, the sender (in this case, the client) needs
to add padding of length X-Y bytes, where Y is the length of the
payload stream for the current request.  The case of reads is exactly
the same, except that the sender adding the padding is the server.
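A sketch of that sender-side layout computation follows, under the
assumptions of the example above.  The constants and the helper are
illustrative only, not part of any defined protocol:

   #include <stddef.h>

   #define SEG_PAYLOAD_MAX  1024   /* "X": receiver's payload segment */
   #define SEG_BULK_SIZE    8192   /* aligned DDP-targetable segment  */

   /* Compute the on-wire layout for a send-based-DDP WRITE: the
    * payload stream occupies the first receive-buffer segment,
    * padded out to the segment boundary so that the bulk data lands
    * in the aligned 8K segment.  Returns the total transmission
    * length, or 0 if the payload does not fit. */
   size_t layout_write(size_t payload_len /* "Y" */, size_t *pad_out)
   {
       if (payload_len > SEG_PAYLOAD_MAX)
           return 0;               /* would need message continuation */

       *pad_out = SEG_PAYLOAD_MAX - payload_len;  /* X - Y bytes      */

       /* payload, then pad, then the bulk data, which the receiving
        * RNIC places directly into the aligned segment with no
        * explicit RDMA operation. */
       return payload_len + *pad_out + SEG_BULK_SIZE;
   }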
To provide send-based DDP as an RPC-over-RDMA extension, the
framework defined in [rpcrdmav2] could be used.  A new "transport
characteristic" could be defined which allowed a participant to
expose the structure of its receive buffers and to identify the
buffer segments capable of being used as DDP targets.  In addition, a
new optional message header would have to be defined, providing:

o  A way to designate a DDP-eligible data item as corresponding to
   target buffer segments, rather than memory registered for RDMA.

o  A way to indicate to the responder that it should place DDP-
   eligible data items in DDP-targetable buffer segments, rather than
   in memory registered for RDMA.

o  A way to designate a limited portion of an RPC-over-RDMA
   transmission as constituting the payload stream.

3.4.  Feature Synergy

While message continuation and send-based DDP each address an
important class of commonly used messages, their combination allows
simpler handling of some important classes of messages:

o  READs and WRITEs transferring larger I/Os.

o  COMPOUNDs containing multiple I/O operations.

o  Operations whose associated payload stream is longer than the
   typical value.

To accommodate these situations, it would be best to have the headers
supporting message continuation interact with the data structures
supporting send-based DDP as follows:

o  The header type used for the initial transmission of a message
   continued across multiple transmissions would contain DDP-
   directing structures which support both send-based DDP and DDP
   using explicit RDMA operations.

o  Buffer references for send-based DDP should be relative to the
   start of the group of transmissions and should allow transitions
   between buffer segments in different receive buffers.

o  The header type for messages continuing a group of transmissions
   should not have DDP-related fields but should rely on the initial
   transmission of the group for DDP-related functions.

o  The portion of each received transmission devoted to the payload
   stream should be indicated in the header for each message within a
   group of transmissions devoted to a single RPC message.  The
   payload stream for the message as a whole should be the
   concatenation of the streams for each transmission.

A potential extension supporting these features, interacting as
described above, can be found in [rtrext].
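To make that interaction concrete, the following sketch shows what
such a pair of header types might look like.  It is purely
hypothetical: the names, fields, and layout are ours, not those of
[rtrext] or [rpcrdmav2], and C structs are used rather than XDR for
brevity.

   #include <stdint.h>

   /* Hypothetical header for the FIRST transmission of a continued
    * message: it carries all DDP-directing information for the
    * whole group, supporting both send-based and chunk-based DDP. */
   struct rtr_first_hdr {
       uint32_t xid;          /* RPC transaction, as in Version One  */
       uint32_t total_xmits;  /* transmissions in the group          */
       uint32_t payload_len;  /* payload-stream bytes in THIS xmit   */
       uint32_t ddp_count;    /* DDP-directing entries that follow;  */
                              /* buffer references are relative to   */
                              /* the start of the group and may      */
                              /* cross receive-buffer boundaries     */
   };

   /* Hypothetical header for CONTINUING transmissions: no
    * DDP-related fields; it relies on the group's initial
    * transmission for those. */
   struct rtr_cont_hdr {
       uint32_t xid;
       uint32_t xmit_index;   /* position within the group           */
       uint32_t payload_len;  /* payload-stream bytes in THIS xmit;  */
                              /* the message's payload stream is the */
                              /* concatenation across the group      */
   };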
3.5.  Feature Selection and Negotiation

Given that an appropriate extension is likely to support multiple
OPTIONAL features, special attention will have to be given to
defining how implementations which might not support the same subset
of OPTIONAL features can successfully interact.  The goal is to allow
interacting implementations to get the benefit of features that they
both support, while allowing implementation pairs that do not share
support for any of the OPTIONAL features to operate just as base
Version Two implementations could do in the absence of the potential
extension.

It is helpful if each implementation provides characteristics
defining its level of feature support, which the peer implementation
can test before attempting to use a particular feature.  In other
similar contexts, the support level concerns the implementation in
its role as responder, i.e., whether it is prepared to execute a
given request.  In the case of the potential extension discussed
here, most characteristics concern an implementation in its role as
receiver.  One might define characteristics which indicate:

o  The ability of the implementation, in its role as receiver, to
   process messages continued across multiple RPC-over-RDMA
   transmissions.

o  The ability of the implementation, in its role as receiver, to
   process messages containing DDP-eligible data items, directly
   placed using a send-based DDP approach.

Use of such characteristics might allow asymmetric implementations.
For example, a client might send requests containing DDP-eligible
data items using send-based DDP without being able to accept messages
whose data items are placed using send-based DDP.  That is a likely
implementation pattern, given the greater performance benefits of
avoiding use of RDMA Read.

Further useful characteristics would apply to the implementation in
its role as responder.  For instance:

o  The ability of the implementation, in its role as responder, to
   accept and process requests which require that DDP-eligible data
   items in the response be sent using send-based DDP.  The presence
   of this characteristic would allow a requester to avoid
   registering memory to be used to accommodate DDP-eligible data
   items in the response.

o  The ability of the implementation, in its role as responder, to
   send responses using message continuation, as opposed to using a
   reply chunk.

Because of the potentially different needs of operations in the
forward and backward directions, it may be desirable to separate the
receiver-based characteristics according to the direction of
operation to which they apply.

A further issue relates to the role of explicit RDMA operations in
connection with backward operation.  Although no current protocols
require support for DDP or transfer of large messages when operating
in the backward direction, the protocol is designed to allow such
support to be developed in the future.  Since the protocol, with the
extension discussed here, is likely to have multiple methods of
providing these functions, there are a number of possible choices
regarding the role of chunk-based methods:

o  Support for chunk-based operation remains a REQUIREMENT for
   responders, and requesters always have the option of using it,
   regardless of the direction of operation.

   Requesters could select alternatives to the use of explicit RDMA
   operations when these are supported by the responder.

o  When operating in the forward direction, support for chunk-based
   operation remains a REQUIREMENT for responders (i.e., servers),
   and requesters (i.e., clients) retain the option of using it.

   When operating in the backward direction, support for chunk-based
   operation is OPTIONAL for responders (i.e., clients), allowing
   requesters (i.e., servers) to select use of explicit RDMA
   operations or alternatives when each of these is supported by the
   responder.

o  Support for chunk-based operation is treated as OPTIONAL for
   responders, regardless of the direction of operation.

   In this case, requesters would select use of explicit RDMA
   operations or alternatives when each of these is supported by the
   responder.  For a considerable time, support for explicit RDMA
   operations would be a practical necessity, even if not a
   REQUIREMENT, for operation in the forward direction.
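On the requester side, the negotiation described above reduces to
checking the peer's advertised characteristics before choosing a
transfer method.  A sketch of that decision follows; the
characteristic names and types are invented for illustration:

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical characteristics a peer might advertise. */
   struct peer_caps {
       bool recv_msg_continuation;  /* can reassemble continued msgs */
       bool recv_send_based_ddp;    /* can place data from sends     */
   };

   typedef enum {
       USE_SEND_DDP, USE_CONTINUATION, USE_CHUNKS
   } xfer_method;

   /* Choose how to move a message too large for one inline buffer:
    * prefer send-based DDP for DDP-eligible data, then message
    * continuation, and fall back to chunk-based explicit RDMA,
    * which base implementations must support. */
   xfer_method choose_method(const struct peer_caps *peer,
                             bool ddp_eligible)
   {
       if (ddp_eligible && peer->recv_send_based_ddp)
           return USE_SEND_DDP;
       if (peer->recv_msg_continuation)
           return USE_CONTINUATION;
       return USE_CHUNKS;
   }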
4.  Possible Future Development

Although reducing the use of explicit RDMA operations reduces the
number of internode round trips and eliminates sequences of
operations in which multiple round-trip latencies are serialized with
server interrupt latencies, the use of connected operation means that
round-trip latencies will always be present, since each message is
acknowledged.

One avenue that has been considered is use of unreliable-datagram
(UD) transmission in environments where the "unreliable" transmission
is sufficiently reliable that RPC replay can deal with a very low
rate of message loss.  For example, UD in InfiniBand specifies a low
enough rate of frame loss to make this a viable approach,
particularly for use in supporting protocols, such as NFSv4.1, that
contain their own facilities to ensure exactly-once semantics.

With this sort of arrangement, request latency is still the same.
However, since the acknowledgments are not serving any substantial
function, it is tempting to consider removing them, as they do take
up some transmission bandwidth that might otherwise be put to use if
the protocol were to reach the goal of effectively using the
underlying medium.

The amount of wasted transmission bandwidth depends on the average
message size and many implementation considerations regarding how
acknowledgments are done.  In any case, given expected message sizes,
the wasted transmission bandwidth will be very small.

When RPC messages are quite small, acknowledgments may be of concern.
However, in that situation, a better response would be to transfer
multiple RPC messages within a single RPC-over-RDMA transmission.

When multiple RPC messages are combined into a single transmission,
the overhead of interfacing with the RNIC, particularly the interrupt
handling overhead, is amortized over multiple RPC messages.

Although this technique is quite outside the spirit of existing RPC-
over-RDMA implementations, it appears possible to define new header
types capable of supporting this sort of transmission, using the
extension framework described in [rpcrdmav2].
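As with the earlier sketches, one can imagine what such a header type
might look like.  The following is purely hypothetical and serves
only to illustrate amortizing per-transmission costs over several RPC
messages:

   #include <stdint.h>

   /* Hypothetical header for a transmission carrying SEVERAL small
    * RPC messages, so that RNIC interfacing and interrupt-handling
    * costs are paid once per transmission rather than once per RPC. */
   struct rtr_multi_hdr {
       uint32_t msg_count;      /* RPC messages in this transmission */
       /* followed by msg_count rtr_multi_entry records */
   };

   struct rtr_multi_entry {
       uint32_t xid;            /* RPC transaction id                */
       uint32_t offset;         /* start of this message's payload   */
       uint32_t length;         /* length of this message's payload  */
   };

A receiver would walk the entry table and hand each payload to the
RPC layer, taking a single interrupt for the whole batch.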
5.  Summary

We've examined the issue of round-trip latency and concluded:

o  That the number of round trips per se is not as important as the
   contribution of any extra round trips to overall request latency.

o  That the latency issue can be addressed using the extension
   mechanism provided for in [rpcrdmav2].

o  That in many cases in which latency is not an issue, there may be
   overhead issues that can be addressed using the same sorts of
   techniques as those useful in latency reduction, again using the
   extension mechanism provided for in [rpcrdmav2].

As it seems that the features sketched out could put internode
latencies and overhead for a large class of requests back to the
baseline value for the RPC paradigm, more detailed definition of the
required extension functionality is in order.

We've also looked at round trips at the physical level, in that
acknowledgments are sent in circumstances where there is no obvious
need for them.  With regard to these, we have concluded:

o  That these acknowledgments do not contribute to request latency.

o  That while UD transmission can remove acknowledgments of limited
   value, the performance benefits are not sufficient to justify the
   disruption that this would entail.

o  That issues with transmission bandwidth overhead in a small-
   message environment are better addressed by combining multiple RPC
   messages in a single RPC-over-RDMA transmission.  This is
   particularly so because such a step is likely to reduce overhead
   in such environments as well.

As the features described involve the use of alternatives to explicit
RDMA operations, in performing direct data placement and in
transferring messages that are larger than the receive buffer limit,
it is appropriate to understand the role that such operations are
expected to have once the extensions discussed in this document are
fully specified and implemented.

It is important to note that these extensions are OPTIONAL and are
expected to remain so, while support for explicit RDMA operations
will remain an integral part of RPC-over-RDMA.

Given this framework, the degree to which explicit RDMA operations
will be used will reflect future implementation choices and needs.
While we have been focusing on cases in which other options might be
more efficient, it is worth looking also at the cases in which
explicit RDMA operations are likely to remain preferable:

o  In some environments, direct data placement to memory of a certain
   alignment does not meet application requirements, and data needs
   to be read into a particular address on the client.  Large
   physically contiguous buffers may also be required in some
   environments.  In these situations, send-based DDP is not an
   option.

o  Where large transfers are to be done, there will be limits to the
   capacity of send-based DDP to provide the required functionality,
   since the basic pattern using send/receive is to allocate a pool
   of memory to contain receive buffers in advance of issuing
   requests.  While this issue can be mitigated by use of message
   continuation, tying up large numbers of credits for a single
   request can cause difficulties as well.  As a result, send-based
   DDP may be restricted to I/Os of limited size, although the
   specific limits will depend on the details of the specific
   implementation.

6.  Security Considerations

This document does not raise any security issues.

7.  IANA Considerations

This document does not require any actions by IANA.

8.  References

8.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
           Memory Access Transport for Remote Procedure Call Version
           1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
           <https://www.rfc-editor.org/info/rfc8166>.

8.2.  Informative References

[RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
           Transport for Remote Procedure Call", RFC 5666,
           DOI 10.17487/RFC5666, January 2010,
           <https://www.rfc-editor.org/info/rfc5666>.

[rfc5667bis]
           Lever, C., Ed., "Network File System (NFS) Upper Layer
           Binding To RPC-Over-RDMA", Work in Progress, August 2017.

[rpcrdmav2]
           Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version Two",
           Work in Progress, July 2017.

[rtrext]   Noveck, D., "RPC-over-RDMA Extensions to Reduce Internode
           Round-trips", Work in Progress, June 2017.
Appendix A.  Acknowledgements

The author gratefully acknowledges the work of Brent Callaghan and
Tom Talpey in producing the original RPC-over-RDMA Version One
specification [RFC5666], and also Tom's work in helping to clarify
that specification.

The author also wishes to thank Chuck Lever for his work reviving
RDMA support for NFS in [RFC8166] and [rfc5667bis], for providing a
path for incremental improvement of that support through his work on
[rpcrdmav2], and for helpful discussions regarding RPC-over-RDMA
latency issues.

Author's Address

David Noveck
NetApp
1601 Trapelo Road
Waltham, MA 02451
US

Phone: +1 781-572-8038
Email: davenoveck@gmail.com