Network File System Version 4                                  D. Noveck
Internet-Draft                                         February 27, 2017
Intended status: Informational
Expires: August 31, 2017

         Issues Related to RPC-over-RDMA Internode Round Trips
                 draft-dnoveck-nfsv4-rpcrdma-rtissues-03

Abstract

   As currently designed and implemented, the RPC-over-RDMA protocol
   requires the use of multiple internode round trips to process some
   common operations.  For example, NFS WRITE operations require three
   internode round trips.  This document examines this issue and
   discusses what can, and what should, be done to address it, both
   within the context of an extensible version of RPC-over-RDMA and
   potentially outside that framework.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 31, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Preliminaries
   1.1.  Requirements Language
   1.2.  Introduction
   2.  Review of the Current Situation
   2.1.  Troublesome Requests
   2.2.  WRITE Request Processing Details
   2.3.  READ Request Processing Details
   3.  Near-term Work
   3.1.  Target Performance
   3.2.  Message Continuation
   3.3.  Send-based DDP
   3.4.  Feature Synergy
   3.5.  Feature Selection and Negotiation
   4.  Possible Future Development
   5.  Summary
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
   8.1.  Normative References
   8.2.  Informative References
   Appendix A.  Acknowledgements
   Author's Address

1.  Preliminaries

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2.  Introduction

   When many common operations are performed using RPC-over-RDMA,
   additional internode round trips are required in order to take
   advantage of the performance benefits provided by RDMA
   functionality.

   While the latencies involved are generally small, they are a concern
   for two reasons:

   o  With the ongoing improvement of persistent memory technologies,
      such internode latencies, being fixed, can be expected to consume
      an increasing portion of the total latency required to process
      NFS requests using RPC-over-RDMA.

   o  High-performance transfers using NFS may be needed outside of a
      machine-room environment.  As RPC-over-RDMA is used in networks
      of campus and metropolitan scale, the internode round-trip
      propagation time of roughly sixteen microseconds per mile becomes
      an issue (see the worked example below).
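   The sixteen-microsecond figure follows directly from the propagation
   speed of light in fiber, roughly two-thirds of c.  The fragment
   below, which is purely illustrative and uses approximate constants,
   performs the check:

      #include <stdio.h>

      int main(void)
      {
          const double c    = 299792458.0;  /* speed of light, m/s   */
          const double v    = c / 1.5;      /* approx. speed in fiber */
          const double mile = 1609.344;     /* meters per mile        */

          /* Round trip: out and back over one mile of fiber. */
          double rtt = 2.0 * mile / v;
          printf("%.1f microseconds per mile\n", rtt * 1e6); /* ~16.1 */
          return 0;
      }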
   Given this background, round trips beyond the minimum necessary need
   to be justified by corresponding benefits.  If they are not, work
   needs to be done to eliminate those excess round trips.

   We are going to look at the existing situation with regard to
   round-trip latency and make some suggestions as to how the issue
   might best be addressed.  We will consider things that could be done
   in the near future and also explore further possibilities that would
   require a longer-term approach to be adopted.

2.  Review of the Current Situation

2.1.  Troublesome Requests

   We will be looking at four sorts of situations:

   o  An RPC operation involving Direct Data Placement of request data
      (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).

   o  An RPC operation involving Direct Data Placement of response data
      (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).

   o  An RPC operation where the request data is longer than the inline
      buffer limit.

   o  An RPC operation where the response data is longer than the
      inline buffer limit.

   These are all simple examples of situations in which explicit RDMA
   operations are used, either to effect Direct Data Placement or to
   respond to message size limits that derive from a limited receive
   buffer size.

   We will survey the resulting latency and overhead issues in an
   RPC-over-RDMA Version One environment in Sections 2.2 and 2.3 below.

2.2.  WRITE Request Processing Details

   We'll start with the case of a request involving direct placement of
   request data.  In this case, an RDMA READ is used to transfer a
   DDP-eligible data item (e.g., the data to be written) from its
   location in requester memory to a location selected by the
   responder.

   Processing proceeds as described below.  Although we are focused on
   internode latency, the time to perform a request also includes such
   things as interrupt latency, overhead involved in interacting with
   the RNIC, and the time for the server to execute the requested
   operation.

   o  First, the memory to be accessed remotely is registered.  This is
      a local operation.

   o  Once the registration has been done, the initial send of the
      request can proceed.  Since this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received by the responder.  As a result, only the responder-bound
      side of the transmission contributes to overall operation
      latency.

   o  The responder, after being notified of the receipt of the
      request, uses RDMA READ to fetch the bulk data.  This involves an
      internode round-trip latency.  After the fetch of the data, the
      responder needs to be notified of the completion of the explicit
      RDMA operation.

   o  The responder (after performing the requested operation) sends
      the response.  Again, as this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received by the requester.

   o  The memory registered before the request was issued needs to be
      deregistered before the request is considered complete and the
      sending process restarted.  When remote invalidation is not
      available, the requester, after being notified of the receipt of
      the response, performs a local operation to deregister the memory
      in question.  Alternatively, the responder will use Send With
      Invalidate, and the requester's RNIC will effect the
      deregistration before notifying the requester of the received
      response.

   To summarize, if we exclude the actual server execution of the
   request, the latency consists of two internode round-trip latencies,
   plus two responder-side interrupt latencies, plus one requester-side
   interrupt latency, plus any necessary registration/deregistration
   overhead.  This is in contrast to a request not using explicit RDMA
   operations, in which there is a single internode round-trip latency
   and one interrupt latency on each of the requester and the
   responder.

   The processing of the other sorts of requests mentioned in
   Section 2.1 shows both similarities and differences:

   o  Handling of a long request is similar to the above.  The memory
      associated with a position-zero read chunk is registered,
      transferred using RDMA READ, and deregistered.  As a result, we
      have the same overhead and latency issues noted in the case of
      direct data placement, without the corresponding benefits.

   o  The case of direct data placement of response data follows a
      similar pattern.  The important difference is that the transfer
      of the bulk data is performed using RDMA WRITE, rather than RDMA
      READ.  However, because of the way that RDMA WRITE is effected
      over the wire, the latency consequences are different.  See
      Section 2.3 for a detailed discussion.

   o  Handling of a long response is similar to the previous case.
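   To make the WRITE processing steps above concrete, the sketch below
   shows the requester side using the libibverbs API.  It is a minimal
   illustration, not part of any RPC-over-RDMA specification: it
   assumes an already-connected queue pair and a pre-registered buffer
   holding the inline request, and it elides completion processing and
   most error handling.

      #include <stddef.h>
      #include <stdint.h>
      #include <infiniband/verbs.h>

      /* Requester-side steps for a WRITE request.  On success,
         *bulk_mrp holds the MR to be invalidated (remotely or
         locally) once the reply arrives. */
      static int post_write_request(struct ibv_qp *qp, struct ibv_pd *pd,
                                    void *bulk, size_t bulk_len,
                                    struct ibv_mr *hdr_mr, size_t hdr_len,
                                    struct ibv_mr **bulk_mrp)
      {
          struct ibv_send_wr wr = { 0 }, *bad_wr;
          struct ibv_sge sge;

          /* Step 1 (local): register the bulk data so the responder
             can fetch it with RDMA READ; the rkey, address, and
             length would be advertised in the request's read chunk. */
          struct ibv_mr *mr = ibv_reg_mr(pd, bulk, bulk_len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ);
          if (mr == NULL)
              return -1;

          /* Step 2: send the inline request.  Only the responder-bound
             half of this exchange contributes to operation latency. */
          sge.addr   = (uint64_t)(uintptr_t)hdr_mr->addr;
          sge.length = (uint32_t)hdr_len;
          sge.lkey   = hdr_mr->lkey;
          wr.opcode     = IBV_WR_SEND;
          wr.sg_list    = &sge;
          wr.num_sge    = 1;
          wr.send_flags = IBV_SEND_SIGNALED;
          if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
              ibv_dereg_mr(mr);
              return -1;
          }

          /* Steps 3 and 4 occur on the responder: an RDMA READ of the
             bulk data (one full internode round trip plus a
             responder-side interrupt), then a Send of the reply.
             Step 5: on reply receipt, the requester calls
             ibv_dereg_mr(), unless the responder used Send With
             Invalidate, in which case the requester's RNIC has
             already invalidated the rkey. */
          *bulk_mrp = mr;
          return 0;
      }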
2.3.  READ Request Processing Details

   We'll now discuss the case of a request involving direct placement
   of response data.  In this case, an RDMA WRITE is used to transfer a
   DDP-eligible data item (e.g., the data being read) from its location
   in responder memory to a location previously selected by the
   requester.

   Processing proceeds as described below.  Although we are focused on
   internode latency, the time to perform a request also includes such
   things as interrupt latency, overhead involved in interacting with
   the RNIC, and the time for the server to execute the requested
   operation.

   o  First, the memory to be accessed remotely is registered.  This is
      a local operation.

   o  Once the registration has been done, the initial send of the
      request can proceed.  Since this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received.  As a result, only the responder-bound side of the
      transmission contributes to overall operation latency.

   o  The responder, after being notified of the receipt of the
      request, proceeds to process the request until the data to be
      read is available in its own memory, with its location determined
      and fixed.  It then uses RDMA WRITE to transfer the bulk data to
      the location in requester memory selected previously.  This
      involves an internode latency, but there is no round trip and
      thus no round-trip latency.

   o  The responder continues processing and sends the inline portion
      of the response.  Again, as this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed immediately.  If the RDMA WRITE or the
      send of the inline portion of the response were to fail, the
      responder can be notified subsequently.

   o  The requester, after being notified of the receipt of the
      response, can analyze it and can access the data written into its
      memory.  Deregistration of the memory originally registered
      before the request was issued can be done using remote
      invalidation or can be done by the requester as a local
      operation.

   To summarize, in this case the additional latency that we saw in
   Section 2.2 does not arise.  Except for the additional overhead due
   to memory registration and invalidation, the situation is the same
   as for a request not using explicit RDMA operations, in which there
   is a single internode round-trip latency and one interrupt latency
   on each of the requester and the responder.
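   The absence of an additional round trip can be seen in how a
   responder might post the two operations involved.  In the
   illustrative libibverbs sketch below (again, not part of any
   specification; setup and error handling are elided), the RDMA WRITE
   and the reply Send are chained on the same queue pair, where they
   execute in order without waiting on any intervening acknowledgment:

      #include <stddef.h>
      #include <stdint.h>
      #include <infiniband/verbs.h>

      /* Responder-side completion of a READ request.  "data_mr"
         covers the bulk data in responder memory; "peer_addr" and
         "peer_rkey" come from the write chunk in the request;
         "reply_mr" holds the inline portion of the reply. */
      static int post_read_reply(struct ibv_qp *qp,
                                 struct ibv_mr *data_mr, size_t data_len,
                                 uint64_t peer_addr, uint32_t peer_rkey,
                                 struct ibv_mr *reply_mr, size_t reply_len)
      {
          struct ibv_sge dsge = {
              .addr   = (uint64_t)(uintptr_t)data_mr->addr,
              .length = (uint32_t)data_len,
              .lkey   = data_mr->lkey,
          };
          struct ibv_sge rsge = {
              .addr   = (uint64_t)(uintptr_t)reply_mr->addr,
              .length = (uint32_t)reply_len,
              .lkey   = reply_mr->lkey,
          };
          struct ibv_send_wr wwr = { 0 }, swr = { 0 }, *bad_wr;

          /* RDMA WRITE of the bulk data into requester memory.  This
             is a one-way transfer; no round trip is serialized into
             the request. */
          wwr.opcode              = IBV_WR_RDMA_WRITE;
          wwr.sg_list             = &dsge;
          wwr.num_sge             = 1;
          wwr.wr.rdma.remote_addr = peer_addr;
          wwr.wr.rdma.rkey        = peer_rkey;
          wwr.next                = &swr;  /* chain the Send behind it */

          /* Send of the inline reply.  The QP executes it after the
             WRITE, so the data is placed before the requester is
             notified of the reply. */
          swr.opcode     = IBV_WR_SEND;
          swr.sg_list    = &rsge;
          swr.num_sge    = 1;
          swr.send_flags = IBV_SEND_SIGNALED;

          return ibv_post_send(qp, &wwr, &bad_wr);
      }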
3.  Near-term Work

   We are going to consider how the latency and overhead issues
   discussed in Section 2 might be addressed in the context of an
   extensible version of RPC-over-RDMA, such as that proposed in
   [rpcrdmav2].

   In Section 3.1, we will establish a performance target for the
   troublesome requests, based on the performance of requests that do
   not involve long messages or direct data placement.

   We will then consider how extensions might be defined to bring
   latency and overhead for the requests discussed in Section 2.1 into
   line with those for other requests.  There will be two specific
   classes of requests to address:

   o  Those that do not involve direct data placement will be addressed
      in Section 3.2.  In this case, there are no compensating benefits
      justifying the higher overhead and, in some cases, latency.

   o  The more complicated case of requests that do involve direct data
      placement is discussed in Section 3.3.  In this case, direct data
      placement could serve as a compensating benefit, and the
      important question to be addressed is whether Direct Data
      Placement can be effected without the use of explicit RDMA
      operations.

   The optional features to deal with each of the classes of messages
   discussed above could be implemented separately.  However, in the
   handling of RPCs with very large amounts of bulk data, the two
   features are synergistic.  This fact makes it desirable to define
   the features as part of the same extension.  See Sections 3.4 and
   3.5 for details.

3.1.  Target Performance

   As our target, we will look at the latency and overhead associated
   with other sorts of RPC requests, i.e., those that do not use DDP
   and whose request and response messages fit within the receive
   buffer limit.

   Processing proceeds as follows:

   o  The initial send of the request is done.  Since this is in the
      context of connected operation, there is an internode round trip
      involved.  However, the next step can proceed after the initial
      transmission is received.  As a result, only the responder-bound
      side of the transmission contributes to overall operation
      latency.

   o  The responder, after being notified of the receipt of the
      request, performs the requested operation and sends the reply.
      As in the case of the request, there is an internode round trip
      involved.  However, the request can be considered complete upon
      receipt of the requester-bound transmission.  The responder-bound
      acknowledgment does not contribute to request latency.

   In this case, there is only a single internode round-trip latency
   necessary to effect the RPC.  Total request latency includes this
   round-trip latency plus interrupt latency on the requester and
   responder, plus the time for the responder to actually perform the
   requested operation.

   Thus the delta between the operations discussed in Section 2 and our
   baseline consists of two portions, one of which applies to all the
   requests we are concerned with and the second of which applies only
   to requests that involve use of RDMA READ, as discussed in
   Section 2.2.  The latter category consists of:

   o  One additional internode round-trip latency.

   o  One additional instance of responder-side interrupt latency.

   The additional overhead necessary to do memory registration and
   deregistration applies to all requests using explicit RDMA
   operations.  The costs will vary with implementation
   characteristics.  As a result, in some implementations it may be
   desirable to replace use of RDMA Write with send-based alternatives,
   while in others, use of RDMA Write may be preferable.
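   The accounting above can be restated as a small model.  The fragment
   below does no more than restate this section's arithmetic; the input
   values themselves would be implementation-specific measurements.

      /* Rough model of the latency accounting in Sections 2.2 and
         3.1.  All inputs are in the same time unit. */
      static double request_latency(double round_trip,
                                    double interrupt_latency,
                                    int serialized_rdma_reads,
                                    double registration_overhead)
      {
          /* Baseline: one internode round trip plus one interrupt on
             each of the requester and the responder. */
          double t = round_trip + 2.0 * interrupt_latency;

          /* Each serialized RDMA READ adds one more round trip and
             one more responder-side interrupt (Section 2.2). */
          t += serialized_rdma_reads * (round_trip + interrupt_latency);

          /* Registration and deregistration costs apply to any
             request using explicit RDMA operations. */
          return t + registration_overhead;
      }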
3.2.  Message Continuation

   Using multiple RPC-over-RDMA transmissions, in sequence, to send a
   single RPC message avoids the additional latency and overhead
   associated with the use of explicit RDMA operations to transfer
   position-zero read chunks.  In the case of reply chunks, only
   overhead is reduced.

   Although transfer of a single request or reply in N transmissions
   will involve N+1 internode latencies, overall request latency is not
   increased, because there is no requirement that operations involving
   multiple nodes be serialized.  Generally, these transmissions are
   pipelined.

   As an illustration, let's consider the case of a request involving a
   response consisting of two RPC-over-RDMA transmissions.  Even though
   each of these transmissions is acknowledged, that acknowledgment
   does not contribute to request latency.  The second transmission can
   be received by the requester and acted upon without waiting for
   either acknowledgment.

   This situation would require multiple receive-side interrupts, but
   it is unlikely to result in extended interrupt latency.  With 1K
   sends (Version One), the second receive will complete about 200
   nanoseconds after the first, assuming a 40 Gb/s transmission rate.
   Given likely interrupt latencies, the first interrupt routine would
   be able to note that the completion of the second receive had
   already occurred.
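   The 200-nanosecond figure is simply the serialization delay of a 1K
   transmission.  As a quick, purely illustrative check:

      /* Serialization delay of one transmission.  Bits divided by
         gigabits per second yields nanoseconds. */
      static double serialization_delay_ns(double bytes, double gbps)
      {
          return bytes * 8.0 / gbps;
      }

      /* serialization_delay_ns(1024, 40.0) == 204.8, i.e., the
         second 1K receive completes about 200 ns after the first. */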
3.3.  Send-based DDP

   In order to effect proper placement of request or reply data within
   the context of individual RPC-over-RDMA transmissions, receive
   buffers need to be structured to accommodate this function.

   To illustrate the considerations that could lead clients and servers
   to choose particular buffer structures, we will use, as examples,
   the cases of NFS READs and WRITEs of 8K data blocks (or the
   corresponding NFSv4 COMPOUNDs).

   In such cases, the client and server need to have the DDP-eligible
   bulk data placed in appropriately aligned 8K buffer segments.
   Rather than being transferred in separate transmissions using
   explicit RDMA operations, a message can be sent so that bulk data is
   received into an appropriate buffer segment.  In this case, it would
   be excised from the XDR payload stream, just as it is in the case of
   existing DDP facilities.

   Consider a server expecting write requests which are usually X bytes
   long or less, exclusive of an 8K bulk data area.  In this case, the
   payload stream will most likely be less than X bytes and will fit in
   a buffer segment devoted to that purpose.  The bulk data needs to be
   placed in the subsequent buffer segment in order to arrive, with the
   appropriate alignment, in the DDP target buffer.  To place the data
   appropriately, the sender (in this case, the client) needs to add
   padding of length X-Y bytes, where Y is the length of the payload
   stream for the current request.  The case of reads is exactly the
   same, except that the sender adding the padding is the server.  (A
   sketch of this layout appears after the list below.)

   To provide send-based DDP as an RPC-over-RDMA extension, the
   framework defined in [rpcrdmav2] could be used.  A new "transport
   characteristic" could be defined which allowed a participant to
   expose the structure of its receive buffers and to identify the
   buffer segments capable of being used as DDP targets.  In addition,
   a new optional message header would have to be defined.  It would
   provide:

   o  A way to designate a DDP-eligible data item as corresponding to
      target buffer segments, rather than memory registered for RDMA.

   o  A way to indicate to the responder that it should place
      DDP-eligible data items in DDP-targetable buffer segments, rather
      than in memory registered for RDMA.

   o  A way to designate a limited portion of an RPC-over-RDMA
      transmission as constituting the payload stream.
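   The sketch below illustrates the sender-side layout arithmetic
   described earlier in this section.  The function name and parameters
   are purely illustrative; nothing here is defined by any existing
   RPC-over-RDMA specification.  The receiver is assumed to have
   exposed a payload-stream segment of X bytes followed by an aligned,
   DDP-targetable bulk segment.

      #include <stddef.h>
      #include <string.h>

      /* Lay out a send so that the bulk data lands, properly
         aligned, in the receiver's DDP-targetable buffer segment.
         x_len is X, the receiver's payload-stream segment size;
         y_len is Y, the length of this request's payload stream. */
      static size_t layout_with_padding(unsigned char *send_buf,
                                        size_t x_len,
                                        const void *payload, size_t y_len,
                                        const void *bulk, size_t bulk_len)
      {
          if (y_len > x_len)
              return 0;   /* payload stream does not fit the segment */

          memcpy(send_buf, payload, y_len);           /* payload stream */
          memset(send_buf + y_len, 0, x_len - y_len); /* X-Y pad bytes  */

          /* The bulk data begins exactly at offset X and is therefore
             received into the aligned bulk segment. */
          memcpy(send_buf + x_len, bulk, bulk_len);
          return x_len + bulk_len;
      }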
3.4.  Feature Synergy

   While message continuation and send-based DDP each address an
   important class of commonly used messages, their combination allows
   simpler handling of some important classes of messages:

   o  READs and WRITEs transferring larger I/Os.

   o  COMPOUNDs containing multiple I/O operations.

   o  Operations whose associated payload stream is longer than the
      typical value.

   To accommodate these situations, it would be best to have the
   headers defined to support message continuation interact with the
   data structures supporting send-based DDP as follows:

   o  The header type used for the initial transmission of a message
      continued across multiple transmissions would contain
      DDP-directing structures which support both send-based DDP and
      DDP using explicit RDMA operations.

   o  Buffer references for send-based DDP should be relative to the
      start of the group of transmissions and should allow transitions
      between buffer segments in different receive buffers.

   o  The header type for messages continuing a group of transmissions
      should not have DDP-related fields but should rely on the initial
      transmission of the group for DDP-related functions.

   o  The portion of each received transmission devoted to the payload
      stream should be indicated in the header for each message within
      a group of transmissions devoted to a single RPC message.  The
      payload stream for the message as a whole should be the
      concatenation of the streams for each transmission.

   A potential extension supporting these features, interacting as
   described above, can be found in [rtrext].

3.5.  Feature Selection and Negotiation

   Given that an appropriate extension is likely to support multiple
   OPTIONAL features, special attention will have to be given to
   defining how implementations which might not support the same subset
   of OPTIONAL features can successfully interact.  The goal is to
   allow interacting implementations to get the benefit of features
   that they both support, while allowing implementation pairs that do
   not share support for any of the OPTIONAL features to operate just
   as base Version Two implementations could do in the absence of the
   potential extension.

   It is helpful if each implementation provides characteristics
   defining its level of feature support, which the peer implementation
   can test before attempting to use a particular feature.  In other
   similar contexts, the support level concerns the implementation in
   its role as responder, i.e., whether it is prepared to execute a
   given request.  In the case of the potential extension discussed
   here, most characteristics concern an implementation in its role as
   receiver.  One might define characteristics which indicate:

   o  The ability of the implementation, in its role as receiver, to
      process messages continued across multiple RPC-over-RDMA
      transmissions.

   o  The ability of the implementation, in its role as receiver, to
      process messages containing DDP-eligible data items, directly
      placed using a send-based DDP approach.

   Use of such characteristics might allow asymmetric implementations.
   For example, a client might send requests containing DDP-eligible
   data items using send-based DDP without being able to accept
   messages containing data items placed using send-based DDP.  That is
   a likely implementation pattern, given the greater performance
   benefits of avoiding use of RDMA Read.

   Further useful characteristics would apply to the implementation in
   its role as responder.  For instance:

   o  The ability of the implementation, in its role as responder, to
      accept and process requests which REQUIRE that DDP-eligible data
      items in the response be sent using send-based DDP.  The presence
      of this characteristic would allow a requester to avoid
      registering memory to accommodate DDP-eligible data items in the
      response.

   o  The ability of the implementation, in its role as responder, to
      send responses using message continuation, as opposed to using a
      reply chunk.

   Because of the potentially different needs of operations in the
   forward and backward directions, it may be desirable to separate the
   receiver-based characteristics according to the direction of
   operation to which they apply.

   A further issue relates to the role of explicit RDMA operations in
   connection with backward operation.  Although no current protocols
   require support for DDP or transfer of large messages when operating
   in the backward direction, the protocol is designed to allow such
   support to be developed in the future.  Since the protocol, with the
   extension discussed here, is likely to have multiple methods of
   providing these functions, there are a number of possible choices
   regarding the role of chunk-based methods of providing them:

   o  Support for chunk-based operation remains a REQUIREMENT for
      responders, and requesters always have the option of using it,
      regardless of the direction of operation.

      Requesters could select alternatives to the use of explicit RDMA
      operations when these are supported by the responder.

   o  When operating in the forward direction, support for chunk-based
      operation remains a REQUIREMENT for responders (i.e., servers)
      and requesters (i.e., clients).

      When operating in the backward direction, support for chunk-based
      operation is OPTIONAL for responders (i.e., clients), allowing
      requesters (i.e., servers) to select use of explicit RDMA
      operations or alternatives when each of these is supported by the
      responder.

   o  Support for chunk-based operation is treated as OPTIONAL for
      responders, regardless of the direction of operation.

      In this case, requesters would select use of explicit RDMA
      operations or alternatives when each of these is supported by the
      responder.  For a considerable time, support for explicit RDMA
      operations would be a practical necessity, even if not a
      REQUIREMENT, for operation in the forward direction.
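   One way to picture such characteristics is as a set of support flags
   tested before a feature is used.  The flags and the test below are
   hypothetical; the actual characteristics, and the way they are
   exchanged, would be defined by the extension itself.

      /* Hypothetical per-direction feature-support flags. */
      enum rt_support {
          RT_RECV_CONTINUED = 1 << 0, /* can receive continued msgs  */
          RT_RECV_SEND_DDP  = 1 << 1, /* can receive send-based DDP  */
          RT_RESP_SEND_DDP  = 1 << 2, /* as responder, honors requests
                                         requiring send-based DDP in
                                         the reply                   */
          RT_RESP_CONTINUED = 1 << 3  /* as responder, can continue a
                                         reply instead of using a
                                         reply chunk                 */
      };

      /* A requester may skip registering memory for reply data only
         when the responder commits to send-based DDP for the reply
         and the requester itself can receive such replies. */
      static int reply_registration_unneeded(unsigned peer,
                                             unsigned self)
      {
          return (peer & RT_RESP_SEND_DDP) &&
                 (self & RT_RECV_SEND_DDP);
      }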
4.  Possible Future Development

   Although reducing the use of explicit RDMA operations reduces the
   number of internode round trips and eliminates sequences of
   operations in which multiple round-trip latencies are serialized
   with server interrupt latencies, the use of connected operation
   means that round-trip latencies will always be present, since each
   message is acknowledged.

   One avenue that has been considered is use of unreliable-datagram
   (UD) transmission in environments where the "unreliable"
   transmission is sufficiently reliable that RPC replay can deal with
   a very low rate of message loss.  For example, UD in InfiniBand
   specifies a low enough rate of frame loss to make this a viable
   approach, particularly for use in supporting protocols, such as
   NFSv4.1, that contain their own facilities to ensure exactly-once
   semantics.

   With this sort of arrangement, request latency is still the same.
   However, since the acknowledgments are not serving any substantial
   function, it is tempting to consider removing them, as they do take
   up some transmission bandwidth that might otherwise be used, if the
   protocol were to reach the goal of effectively using the underlying
   medium.

   The amount of wasted transmission bandwidth depends on the average
   message size and many implementation considerations regarding how
   acknowledgments are done.  In any case, given expected message
   sizes, the wasted transmission bandwidth will be very small.

   When RPC messages are quite small, acknowledgments may be of
   concern.  However, in that situation, a better response would be to
   transfer multiple RPC messages within a single RPC-over-RDMA
   transmission.

   When multiple RPC messages are combined into a single transmission,
   the overhead of interfacing with the RNIC, particularly the
   interrupt handling overhead, is amortized over multiple RPC
   messages.

   Although this technique is quite outside the spirit of existing
   RPC-over-RDMA implementations, it appears possible to define new
   header types capable of supporting this sort of transmission, using
   the extension framework described in [rpcrdmav2].
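   A purely hypothetical framing for such a header type is sketched
   below; no such structure exists in any current specification, and
   the field names are the sketch's own.

      #include <stdint.h>

      /* One such header per transmission. */
      struct batch_hdr {
          uint32_t bh_msg_count;  /* RPC messages in this transmission */
      };

      /* One slot per contained RPC message. */
      struct batch_slot {
          uint32_t bs_offset;     /* byte offset within transmission */
          uint32_t bs_length;     /* length of that RPC message      */
      };

      /* On receipt, a single completion covers bh_msg_count RPC
         messages, amortizing interrupt handling across all of them. */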
5.  Summary

   We've examined the issue of round-trip latency and concluded:

   o  That the number of round trips per se is not as important as the
      contribution of any extra round trips to overall request latency.

   o  That the latency issue can be addressed using the extension
      mechanism provided for in [rpcrdmav2].

   o  That in many cases in which latency is not an issue, there may be
      overhead issues that can be addressed using the same sorts of
      techniques as those useful in latency reduction, again using the
      extension mechanism provided for in [rpcrdmav2].

   As it seems that the features sketched out could put internode
   latencies and overhead for a large class of requests back to the
   baseline value for the RPC paradigm, more detailed definition of the
   required extension functionality is in order.

   We've also looked at round trips at the physical level, in that
   acknowledgments are sent in circumstances where there is no obvious
   need for them.  With regard to these, we have concluded:

   o  That these acknowledgments do not contribute to request latency.

   o  That while UD transmission can remove acknowledgments of limited
      value, the performance benefits are not sufficient to justify the
      disruption that this would entail.

   o  That issues with transmission bandwidth overhead in a
      small-message environment are better addressed by combining
      multiple RPC messages in a single RPC-over-RDMA transmission.
      This is particularly so because such a step is likely to reduce
      overhead in such environments as well.

   As the features described involve the use of alternatives to
   explicit RDMA operations, both in performing direct data placement
   and in transferring messages that are larger than the receive buffer
   limit, it is appropriate to understand the role that such operations
   are expected to have once the extensions discussed in this document
   are fully specified and implemented.

   It is important to note that these extensions are OPTIONAL and are
   expected to remain so, while support for explicit RDMA operations
   will remain an integral part of RPC-over-RDMA.

   Given this framework, the degree to which explicit RDMA operations
   will be used will reflect future implementation choices and needs.
   While we have been focusing on cases in which other options might be
   more efficient, it is worth looking also at the cases in which
   explicit RDMA operations are likely to remain preferable:

   o  In some environments, direct data placement into memory of a
      certain alignment does not meet application requirements: data
      may need to be read into a particular address on the client, or
      large physically contiguous buffers may be required.  In these
      situations, send-based DDP is not an option.

   o  Where large transfers are to be done, there will be limits to the
      capacity of send-based DDP to provide the required functionality,
      since the basic pattern using send/receive is to allocate a pool
      of memory to contain receive buffers in advance of issuing
      requests.  While this issue can be mitigated by use of message
      continuation, tying up large numbers of credits for a single
      request can cause difficult issues as well.  As a result,
      send-based DDP may be restricted to I/Os of limited size,
      although the specific limits will depend on the details of the
      particular implementation.

6.  Security Considerations

   This document does not raise any security issues.

7.  IANA Considerations

   This document does not require any actions by IANA.

8.  References

8.1.  Normative References

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119,
                DOI 10.17487/RFC2119, March 1997,
                <https://www.rfc-editor.org/info/rfc2119>.

   [rfc5666bis] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote
                Direct Memory Access Transport for Remote Procedure
                Call", February 2017, Work in Progress.

8.2.  Informative References

   [RFC5666]    Talpey, T. and B. Callaghan, "Remote Direct Memory
                Access Transport for Remote Procedure Call", RFC 5666,
                DOI 10.17487/RFC5666, January 2010,
                <https://www.rfc-editor.org/info/rfc5666>.

   [rfc5667bis] Lever, C., Ed., "Network File System (NFS) Upper Layer
                Binding To RPC-Over-RDMA", February 2017, Work in
                Progress.

   [rpcrdmav2]  Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version
                Two", December 2016, Work in Progress.

   [rtrext]     Noveck, D., "RPC-over-RDMA Extensions to Reduce
                Internode Round-trips", December 2016, Work in
                Progress.
Appendix A.  Acknowledgements

   The author gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey in producing the original RPC-over-RDMA Version One
   specification [RFC5666], and also Tom's work in helping to clarify
   that specification.

   The author also wishes to thank Chuck Lever for his work reviving
   RDMA support for NFS in [rfc5666bis], [rfc5667bis], and [rpcrdmav2],
   and for helpful discussion regarding RPC-over-RDMA latency issues.

Author's Address

   David Noveck
   26 Locust Avenue
   Lexington, MA 02421
   USA

   Phone: +1 781-572-8038
   Email: davenoveck@gmail.com