Network File System Version 4                                 D. Noveck
Internet-Draft                                                    NetApp
Intended status: Informational                        February 22, 2018
Expires: August 26, 2018

         Issues Related to RPC-over-RDMA Internode Round Trips
                draft-dnoveck-nfsv4-rpcrdma-rtissues-05

Abstract

As currently designed and implemented, the RPC-over-RDMA protocol requires use of multiple internode round trips to process some common operations.  For example, NFS WRITE operations require use of three internode round trips.  This document looks at this issue and discusses what can and what should be done to address it, both within the context of an extensible version of RPC-over-RDMA and potentially outside that framework.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 26, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Preliminaries
   1.1. Requirements Language
   1.2. Introduction
2. Review of the Current Situation
   2.1. Potentially Troublesome Requests
   2.2. WRITE Request Processing Details
   2.3. READ Request Processing Details
3. Near-term Work
   3.1. Target Performance
   3.2. Message Continuation
   3.3. Send-based Data Placement
   3.4. Feature Synergy
   3.5. Feature Selection and Negotiation
4. Possible Future Development of RPC-over-RDMA
5. Other Possible Approaches
6. Summary
7. Security Considerations
8. IANA Considerations
9. References
   9.1. Normative References
   9.2. Informative References
Appendix A. Acknowledgements
Author's Address

1. Preliminaries

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.2. Introduction

When many common operations are performed using RPC-over-RDMA, additional internode round-trip latencies are incurred in order to take advantage of the performance benefits provided by RDMA functionality.

While the latencies involved are generally small, they are a cause for concern for two reasons:

o  With the ongoing improvement of persistent memory technologies, such internode latencies, being fixed, can be expected to consume an increasing portion of the total latency required to process NFS requests using RPC-over-RDMA.

o  High-performance transfers using NFS may be needed outside of a machine-room environment.  As RPC-over-RDMA is used in networks of campus and metropolitan scale, the internode round-trip time of roughly sixteen microseconds per mile becomes an issue (a rough illustration of the scale involved follows this list).
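As a back-of-the-envelope illustration of the second point above, the sketch below tabulates the propagation component of round-trip delay for a few path lengths, using the sixteen-microseconds-per-mile figure just cited.  The distances and resulting numbers are illustrative only and assume signal propagation in fiber at roughly two-thirds of the speed of light.

   /* Illustrative only: propagation component of internode round-trip
    * delay, using the ~16 microseconds per mile (round trip) figure
    * cited above; all other latency sources are ignored. */
   #include <stdio.h>

   int main(void)
   {
       const double us_per_mile_rt = 16.0;              /* assumed figure */
       const double miles[] = { 0.1, 1.0, 10.0, 40.0 }; /* sample paths   */

       for (int i = 0; i < 4; i++)
           printf("%5.1f miles: ~%7.1f us round-trip propagation delay\n",
                  miles[i], miles[i] * us_per_mile_rt);
       return 0;
   }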
Given this background, round trips beyond the minimum necessary need to be justified by corresponding benefits.  If they are not, work needs to be done to eliminate those excess round trips.

We are going to look at the existing situation with regard to round-trip latency and make some suggestions as to how the issue might best be addressed.  We will consider things that could be done in the near future and also explore further possibilities that would require a longer-term approach to be adopted.

2. Review of the Current Situation

2.1. Potentially Troublesome Requests

We will be looking at four sorts of situations:

o  An RPC operation involving Direct Data Placement of request data (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).

o  An RPC operation involving Direct Data Placement of response data (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).

o  An RPC operation where the request data is longer than the inline buffer limit.

o  An RPC operation where the response data is longer than the inline buffer limit.

These are all simple examples of situations in which explicit RDMA operations are used, either to effect Direct Data Placement or to respond to message size limits that derive from a limited receive buffer size.

We will survey the resulting latency and overhead issues in an RPC-over-RDMA Version One environment in Sections 2.2 and 2.3 below.

2.2. WRITE Request Processing Details

We'll start with the case of a request involving direct placement of request data.  In this case, an RDMA READ is used to transfer a DDP-eligible data item (e.g., the data to be written) from its location in requester memory to a location selected by the responder.

Processing proceeds as described below.  Although we are focused on internode latency, the time to perform a request also includes such things as interrupt latency, overhead involved in interacting with the RNIC, and the time for the server to execute the requested operation.

o  First, the memory to be accessed remotely is registered.  This is a local operation.

o  Once the registration has been done, the initial send of the request can proceed.  Since this is in the context of connected operation, there is an internode round trip involved.  However, the next step can proceed after the initial transmission is received by the responder.  As a result, only the responder-bound side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request, uses RDMA READ to fetch the bulk data.  This involves an internode round-trip latency.  After the fetch of the data, the responder needs to be notified of the completion of the explicit RDMA operation.

o  The responder (after performing the requested operation) sends the response.  Again, as this is in the context of connected operation, there is an internode round trip involved.  However, the next step can proceed after the initial transmission is received by the requester.

o  The memory registered before the request was issued needs to be deregistered before the request is considered complete and the sending process restarted.  When remote invalidation is not available, the requester, after being notified of the receipt of the response, performs a local operation to deregister the memory in question.  Alternatively, the responder will use Send With Invalidate, and the requester's RNIC will effect the deregistration before notifying the requester of the response that has been received.

To summarize, if we exclude the actual server execution of the request, the latency consists of two internode round-trip latencies plus two responder-side interrupt latencies plus one requester-side interrupt latency, plus any necessary registration/deregistration overhead.  This is in contrast to a request not using explicit RDMA operations, in which there is a single internode round-trip latency and one interrupt latency on each of the requester and the responder.
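To make the middle steps above concrete, the fragment below sketches how a responder might fetch the data named by a read chunk, using libibverbs-style calls.  Queue pair and completion queue setup, chunk parsing, and error handling are omitted; the function and variable names are illustrative and do not come from any particular implementation.

   /* Sketch: responder pulls the DDP-eligible data named by a read
    * chunk into local memory.  This is the step that adds a full
    * internode round trip and a second responder-side interrupt. */
   #include <infiniband/verbs.h>
   #include <stdint.h>
   #include <string.h>

   static int fetch_write_data(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_sge *local,
                               uint64_t chunk_addr, uint32_t chunk_rkey)
   {
       struct ibv_send_wr wr, *bad = NULL;
       struct ibv_wc wc;
       int n;

       memset(&wr, 0, sizeof(wr));
       wr.opcode              = IBV_WR_RDMA_READ;
       wr.sg_list             = local;
       wr.num_sge             = 1;
       wr.send_flags          = IBV_SEND_SIGNALED;
       wr.wr.rdma.remote_addr = chunk_addr;
       wr.wr.rdma.rkey        = chunk_rkey;

       if (ibv_post_send(qp, &wr, &bad))
           return -1;

       /* The request cannot be executed until this completion arrives:
        * one internode round trip plus one interrupt/poll on this side. */
       do {
           n = ibv_poll_cq(cq, 1, &wc);
       } while (n == 0);

       return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
   }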
The processing of the other sorts of requests mentioned in Section 2.1 shows both similarities and differences:

o  Handling of a long request is similar to the above.  The memory associated with a position-zero read chunk is registered, transferred using RDMA READ, and deregistered.  As a result, we have the same overhead and latency issues noted in the case of direct data placement, without the corresponding benefits.

o  The case of direct data placement of response data follows a similar pattern.  The important difference is that the transfer of the bulk data is performed using RDMA WRITE, rather than RDMA READ.  However, because of the way that RDMA WRITE is effected over the wire, the latency consequences are different.  See Section 2.3 for a detailed discussion.

o  Handling of a long response is similar to the previous case.

2.3. READ Request Processing Details

We'll now discuss the case of a request involving direct placement of response data.  In this case, an RDMA WRITE is used to transfer a DDP-eligible data item (e.g., the data being read) from its location in responder memory to a location previously selected by the requester.

Processing proceeds as described below.  Although we are focused on internode latency, the time to perform a request also includes such things as interrupt latency, overhead involved in interacting with the RNIC, and the time for the server to execute the requested operation.

o  First, the memory to be accessed remotely is registered.  This is a local operation.

o  Once the registration has been done, the initial send of the request can proceed.  Since this is in the context of connected operation, there is an internode round trip involved.  However, the next step can proceed after the initial transmission is received.  As a result, only the responder-bound side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request, proceeds to process the request until the data to be read is available in its own memory, with its location determined and fixed.  It then uses RDMA WRITE to transfer the bulk data to the location in requester memory selected previously.  This involves an internode latency, but there is no round trip and thus no round-trip latency.

o  The responder continues processing and sends the inline portion of the response.  Again, as this is in the context of connected operation, there is an internode round trip involved.  However, the next step can proceed immediately.  If the RDMA WRITE or the send of the inline portion of the response were to fail, the responder can be notified subsequently.

o  The requester, after being notified of the receipt of the response, can analyze it and can access the data written into its memory.  Deregistration of the memory originally registered before the request was issued can be done using remote invalidation or can be done by the requester as a local operation.

To summarize, in this case the additional latency that we saw in Section 2.2 does not arise.  Except for the additional overhead due to memory registration and invalidation, the situation is the same as for a request not using explicit RDMA operations, in which there is a single internode round-trip latency and one interrupt latency on each of the requester and the responder.
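The reason no additional round trip arises can be seen in how a responder typically transmits such a reply: the RDMA WRITE of the bulk data and the send of the inline portion of the response can be posted on the same send queue, so the two flow to the requester back to back.  The fragment below is a simplified sketch using libibverbs-style calls; connection setup, completion handling, and error handling are omitted, and the function and parameter names are illustrative only.

   /* Sketch: responder-side transmit path for the read case above.
    * On a reliable connection, the transport places the WRITE's
    * payload before the subsequent Send is reported to the peer, so
    * no extra responder-side wait or round trip is needed. */
   #include <infiniband/verbs.h>
   #include <stdint.h>
   #include <string.h>

   static int send_read_reply(struct ibv_qp *qp,
                              struct ibv_sge *bulk,
                              uint64_t chunk_addr, uint32_t chunk_rkey,
                              struct ibv_sge *inline_reply)
   {
       struct ibv_send_wr wr_data, wr_reply, *bad = NULL;

       /* RDMA WRITE of the DDP-eligible data into the buffer the
        * requester advertised. */
       memset(&wr_data, 0, sizeof(wr_data));
       wr_data.opcode              = IBV_WR_RDMA_WRITE;
       wr_data.sg_list             = bulk;
       wr_data.num_sge             = 1;
       wr_data.wr.rdma.remote_addr = chunk_addr;
       wr_data.wr.rdma.rkey        = chunk_rkey;

       /* Send of the inline portion of the reply, chained immediately
        * after the WRITE rather than serialized with an interrupt. */
       memset(&wr_reply, 0, sizeof(wr_reply));
       wr_reply.opcode     = IBV_WR_SEND;
       wr_reply.sg_list    = inline_reply;
       wr_reply.num_sge    = 1;
       wr_reply.send_flags = IBV_SEND_SIGNALED;

       wr_data.next = &wr_reply;   /* post the WRITE and SEND together */
       return ibv_post_send(qp, &wr_data, &bad);
   }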
3. Near-term Work

We are going to consider how the latency and overhead issues discussed in Section 2 might be addressed in the context of an extensible version of RPC-over-RDMA, such as that proposed in [I-D.cel-nfsv4-rpcrdma-version-two].

In Section 3.1, we will establish a performance target for the troublesome requests, based on the performance of requests that do not involve long messages or direct data placement.

We will then consider how extensions might be defined to bring latency and overhead for the requests discussed in Section 2.1 into line with those for other requests.  There will be two specific classes of requests to address:

o  Those that do not involve direct data placement will be addressed in Section 3.2.  In this case, there are no compensating benefits justifying the higher overhead and, in some cases, latency.

o  The more complicated case of requests that do involve direct data placement is discussed in Section 3.3.  In this case, direct data placement could serve as a compensating benefit, and the important question to be addressed is whether Direct Data Placement can be effected without the use of explicit RDMA operations.

The optional features to deal with each of the classes of messages discussed above could be implemented separately.  However, in the handling of RPCs with very large amounts of bulk data, the two features are synergistic.  This fact makes it desirable to define the features as part of the same extension.  See Sections 3.4 and 3.5 for details.

3.1. Target Performance

As our target, we will look at the latency and overhead associated with other sorts of RPC requests, i.e., those that do not use direct data placement and whose request and response messages fit within the receive buffer limit.

Processing proceeds as follows:

o  The initial send of the request is done.  Since this is in the context of connected operation, there is an internode round trip involved.  However, the next step can proceed after the initial transmission is received.  As a result, only the responder-bound side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request, performs the requested operation and sends the reply.  As in the case of the request, there is an internode round trip involved.  However, the request can be considered complete upon receipt of the requester-bound transmission.  The responder-bound acknowledgment does not contribute to request latency.

In this case, there is only a single internode round-trip latency necessary to effect the RPC.  Total request latency includes this round-trip latency plus interrupt latency on the requester and responder, plus the time for the responder to actually perform the requested operation.

Thus the delta between the operations discussed in Section 2 and our baseline consists of two portions, one of which applies to all the requests we are concerned with and the second of which only applies to requests that involve use of RDMA READ, as discussed in Section 2.2.  The latter category consists of:

o  One additional internode round-trip latency.

o  One additional instance of responder-side interrupt latency.

The additional overhead necessary to do memory registration and deregistration applies to all requests using explicit RDMA operations.  The costs will vary with implementation characteristics.  As a result, in some implementations, it may be desirable to replace use of RDMA Write with send-based alternatives, while in others, use of RDMA Write may be preferable.
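The accounting above can be made concrete with a small model.  The sketch below simply adds up the components named in Sections 2.2, 2.3, and this section, excluding the actual execution of the requested operation; the numeric values are placeholders to be replaced with figures measured for a particular deployment, not measurements of any real implementation.

   /* Simple model of per-request latency, following the accounting in
    * Sections 2.2, 2.3, and 3.1.  All values are illustrative. */
   #include <stdio.h>

   int main(void)
   {
       double rtt       = 10.0;  /* internode round-trip time (us)     */
       double intr_rsp  =  3.0;  /* responder-side interrupt latency   */
       double intr_req  =  3.0;  /* requester-side interrupt latency   */
       double reg_dereg =  2.0;  /* registration/invalidation overhead */

       /* Section 3.1 baseline: one round trip, one interrupt per side. */
       double baseline = rtt + intr_rsp + intr_req;

       /* Section 2.2 write path: two round trips, two responder-side
        * interrupts, one requester-side interrupt, plus registration. */
       double write_path = 2 * rtt + 2 * intr_rsp + intr_req + reg_dereg;

       /* Section 2.3 read path: baseline plus registration overhead.  */
       double read_path = baseline + reg_dereg;

       printf("baseline  : %5.1f us\n", baseline);
       printf("write path: %5.1f us (delta %.1f us)\n",
              write_path, write_path - baseline);
       printf("read path : %5.1f us (delta %.1f us)\n",
              read_path, read_path - baseline);
       return 0;
   }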
3.2. Message Continuation

Using multiple RPC-over-RDMA transmissions, in sequence, to send a single RPC message avoids the additional latency and overhead associated with the use of explicit RDMA operations to transfer position-zero read chunks.  In the case of reply chunks, only overhead is reduced.

Although transfer of a single request or reply in N transmissions will involve N+1 internode latencies, overall request latency is not increased, because the operations involving multiple nodes are not required to be serialized.  Generally, these transmissions are pipelined.

As an illustration, let's consider the case of a request involving a response consisting of two RPC-over-RDMA transmissions.  Even though each of these transmissions is acknowledged, that acknowledgement does not contribute to request latency.  The second transmission can be received by the requester and acted upon without waiting for either acknowledgment.

This situation would require multiple receive-side interrupts, but it is unlikely to result in extended interrupt latency.  With 1K sends (Version One), the second receive will complete about 200 nanoseconds after the first, assuming a 40 Gb/s transmission rate.  Given likely interrupt latencies, the first interrupt routine would be able to note that the completion of the second receive had already occurred.

3.3. Send-based Data Placement

In order to effect proper placement of request or reply data within the context of individual RPC-over-RDMA transmissions, receive buffers need to be structured to accommodate this function.

To illustrate the considerations that could lead clients and servers to choose particular buffer structures, we will use, as examples, the cases of NFS READs and WRITEs of 8K data blocks (or the corresponding NFSv4 COMPOUNDs).

In such cases, the client and server need to have the DDP-eligible bulk data placed in appropriately aligned 8K buffer segments.  Rather than being transferred in separate transmissions using explicit RDMA operations, a message can be sent so that bulk data is received into an appropriate buffer segment.  In this case, it would be excised from the XDR payload stream, just as it is in the case of existing DDP facilities.

Consider a server expecting write requests which are usually X bytes long or less, exclusive of an 8K bulk data area.  In this case, the payload stream will most likely be less than X bytes and will fit in a buffer segment devoted to that purpose.  The bulk data needs to be placed in the subsequent buffer segment so that it arrives with the appropriate alignment in the data placement target buffer.  In order to place the data appropriately, the sender (in this case, the client) needs to add padding of length X-Y bytes, where Y is the length of the payload stream for the current request.  The case of reads is exactly the same, except that the sender adding the padding is the server.
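A minimal sketch of the sender-side arithmetic just described follows.  The segment sizes (the "X" threshold and the 8K bulk area) are assumptions chosen for the example rather than protocol constants, and the function name is purely illustrative.

   /* Sketch: laying out one transmission so that the DDP-eligible data
    * item lands in the receiver's aligned bulk-data buffer segment.
    * SEG_PAYLOAD_MAX plays the role of "X" above; payload_len is "Y". */
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define SEG_PAYLOAD_MAX  1024u   /* assumed payload-stream segment size */
   #define SEG_BULK_SIZE    8192u   /* assumed aligned bulk-data segment   */

   /* Returns the total length to send, or 0 if the message does not fit
    * this buffer structure and must be sent some other way. */
   static size_t build_send(uint8_t *txbuf,
                            const uint8_t *payload, size_t payload_len,
                            const uint8_t *bulk, size_t bulk_len)
   {
       if (payload_len > SEG_PAYLOAD_MAX || bulk_len > SEG_BULK_SIZE)
           return 0;

       memcpy(txbuf, payload, payload_len);

       /* X - Y bytes of padding so the bulk data begins exactly at the
        * receiver's aligned buffer segment. */
       memset(txbuf + payload_len, 0, SEG_PAYLOAD_MAX - payload_len);

       memcpy(txbuf + SEG_PAYLOAD_MAX, bulk, bulk_len);
       return SEG_PAYLOAD_MAX + bulk_len;
   }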
To provide send-based data placement as an RPC-over-RDMA extension, the framework defined in [I-D.cel-nfsv4-rpcrdma-version-two] could be used.  A new "transport characteristic" could be defined which allowed a participant to expose the structure of its receive buffers and to identify the buffer segments capable of being used as data placement targets.  In addition, a new optional message header would have to be defined to provide:

o  A way to designate a DDP-eligible data item as corresponding to target buffer segments, rather than memory registered for RDMA.

o  A way to indicate to the responder that it should place DDP-eligible data items in DDP-targetable buffer segments, rather than in memory registered for RDMA.

o  A way to designate a limited portion of an RPC-over-RDMA transmission as constituting the payload stream.

3.4. Feature Synergy

While message continuation and send-based data placement each address an important class of commonly used messages, their combination allows simpler handling of some important classes of messages:

o  READs and WRITEs transferring larger I/Os.

o  COMPOUNDs containing multiple I/O operations.

o  Operations whose associated payload stream is longer than the typical value.

To accommodate these situations, it would be best to have the headers defined to support message continuation interact with the data structures supporting send-based data placement as follows:

o  The header type used for the initial transmission of a message continued across multiple transmissions would contain placement-directing structures which support both send-based data placement and DDP using explicit RDMA operations.

o  Buffer references for send-based data placement should be relative to the start of the group of transmissions and should allow transitions between buffer segments in different receive buffers.

o  The header type for messages continuing a group of transmissions should not have DDP-related fields but should rely on the initial transmission of the group for DDP-related functions.

o  An indication of the portion of each received transmission devoted to the payload stream should be part of the header for each transmission within a group devoted to a single RPC message.  The payload stream for the message as a whole should be the concatenation of the streams for each transmission.

A potential extension supporting these features, interacting as described above, can be found in [I-D.dnoveck-nfsv4-rpcrdma-rtrext].

3.5. Feature Selection and Negotiation

Given that an appropriate extension is likely to support multiple OPTIONAL features, special attention will have to be given to defining how implementations which might not support the same subset of OPTIONAL features can successfully interact.  The goal is to allow interacting implementations to get the benefit of features that they both support, while allowing implementation pairs that do not share support for any of the OPTIONAL features to operate just as base Version Two implementations would in the absence of the potential extension.

It is helpful if each implementation provides characteristics defining its level of feature support, which the peer implementation can test before attempting to use a particular feature.  In other similar contexts, the support level concerns the implementation in its role as responder, i.e., whether it is prepared to execute a given request.  In the case of the potential extension discussed here, most characteristics concern an implementation in its role as receiver.  One might define characteristics which indicate the following (a hypothetical encoding is sketched after this list):

o  The ability of the implementation, in its role as receiver, to process messages continued across multiple RPC-over-RDMA transmissions.

o  The ability of the implementation, in its role as receiver, to process messages containing DDP-eligible data items placed using a send-based data placement approach.
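The structure below is one purely hypothetical way such characteristics might be grouped; neither the names nor the fields correspond to anything defined in [I-D.cel-nfsv4-rpcrdma-version-two] or any other specification.  It is shown only to make the receiver-role and responder-role distinction concrete.

   /* Hypothetical characteristic groupings for the features discussed
    * in this section; illustrative only. */
   #include <stdbool.h>
   #include <stdint.h>

   struct recv_role_support {        /* implementation as receiver  */
       bool     msg_continuation;    /* can reassemble continued messages   */
       bool     send_based_ddp;      /* can accept send-placed DDP items    */
       uint32_t ddp_segment_size;    /* size of its DDP-targetable segments */
       uint32_t max_group_credits;   /* transmissions usable by one message */
   };

   struct rsp_role_support {         /* implementation as responder */
       bool reply_send_based_ddp;    /* will place reply items without RDMA  */
       bool reply_continuation;      /* will continue replies vs. reply chunk */
   };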
Use of such characteristics might allow asymmetric implementations.  For example, a client might send requests containing DDP-eligible data items using send-based data placement without being able to accept messages containing data items placed using send-based data placement.  That is a likely implementation pattern, given the greater performance benefits of avoiding use of RDMA Read.

Further useful characteristics would apply to the implementation in its role as responder.  For instance:

o  The ability of the implementation, in its role as responder, to accept and process requests which REQUIRE that DDP-eligible data items in the response be sent using send-based DDP.  The presence of this characteristic would allow a requester to avoid registering memory to be used to accommodate DDP-eligible data items in the response.

o  The ability of the implementation, in its role as responder, to send responses using message continuation, as opposed to using a reply chunk.

Because of the potentially different needs of operations in the forward and backward directions, it may be desirable to separate the receiver-based characteristics according to the direction of operation to which they apply.

A further issue relates to the role of explicit RDMA operations in connection with backward-direction operation.  Although no current protocols require support for DDP or transfer of large messages when operating in the backward direction, the protocol is designed to allow such support to be developed in the future.  Since the protocol, with the extension discussed here, is likely to have multiple methods of providing these functions, there are a number of possible choices regarding the role of chunk-based methods of providing them:

o  Support for chunk-based operation remains a REQUIREMENT for responders, and requesters always have the option of using it, regardless of the direction of operation.

   Requesters could select alternatives to the use of explicit RDMA operations when these are supported by the responder.

o  When operating in the forward direction, support for chunk-based operation remains a REQUIREMENT for responders (i.e., servers) and requesters (i.e., clients).

   When operating in the backward direction, support for chunk-based operation is OPTIONAL for responders (i.e., clients), allowing requesters (i.e., servers) to select use of explicit RDMA operations or alternatives when each of these is supported by the responder.

o  Support for chunk-based operation is treated as OPTIONAL for responders, regardless of the direction of operation.

   In this case, requesters would select use of explicit RDMA operations or alternatives when each of these is supported by the responder (a sketch of such selection logic appears at the end of this section).  For a considerable time, support for explicit RDMA operations would be a practical necessity, even if not a REQUIREMENT, for operation in the forward direction.
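Whichever of the options above is chosen, a requester would make a per-data-item choice based on what its peer advertises.  The fragment below is a sketch of what such selection logic might look like; the parameters stand for hypothetical advertised characteristics like those illustrated earlier in this section and are not drawn from any specification.

   /* Hypothetical requester-side selection of a transfer method for a
    * DDP-eligible data item; assumes at least one method is usable. */
   #include <stdbool.h>
   #include <stddef.h>

   enum xfer_method {
       USE_EXPLICIT_RDMA,          /* classic read/write chunks          */
       USE_SEND_BASED_DDP,         /* place the item via plain sends     */
       USE_MESSAGE_CONTINUATION    /* spread the message over many sends */
   };

   static enum xfer_method choose_method(bool peer_send_based_ddp,
                                         bool peer_chunk_based,
                                         size_t item_len,
                                         size_t peer_segment_max)
   {
       if (peer_send_based_ddp && item_len <= peer_segment_max)
           return USE_SEND_BASED_DDP;   /* avoids explicit RDMA entirely */
       if (peer_chunk_based)
           return USE_EXPLICIT_RDMA;
       return USE_MESSAGE_CONTINUATION;
   }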
4. Possible Future Development of RPC-over-RDMA

Although the reduction in the use of explicit RDMA operations reduces the number of internode round trips and eliminates sequences of operations in which multiple round-trip latencies are serialized with server interrupt latencies, the use of connected operation means that round-trip latencies will always be present, since each message is acknowledged.

One avenue that has been considered is use of unreliable-datagram (UD) transmission in environments where the "unreliable" transmission is sufficiently reliable that RPC replay can deal with a very low rate of message loss.  For example, UD in InfiniBand specifies a low enough rate of frame loss to make this a viable approach, particularly in supporting protocols, such as NFSv4.1, that contain their own facilities to ensure exactly-once semantics.

With this sort of arrangement, request latency is still the same.  However, since the acknowledgements are not serving any substantial function, it is tempting to consider removing them, as they do take up some transmission bandwidth that might otherwise be used, if the protocol were to reach the goal of effectively using the underlying medium.

The size of such wasted transmission bandwidth depends on the average message size and many implementation considerations regarding how acknowledgments are done.  In any case, given expected message sizes, the wasted transmission bandwidth will be very small.

When RPC messages are quite small, acknowledgments may be of concern.  However, in that situation, a better response would be to transfer multiple RPC messages within a single RPC-over-RDMA transmission.

When multiple RPC messages are combined into a single transmission, the overhead of interfacing with the RNIC, particularly the interrupt handling overhead, is amortized over multiple RPC messages.

Although this technique is quite outside the spirit of existing RPC-over-RDMA implementations, it appears possible to define new header types capable of supporting this sort of transmission, using the extension framework described in [I-D.cel-nfsv4-rpcrdma-version-two].
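As a purely hypothetical illustration of what such a header type would have to convey, the structure below describes a transmission carrying several small RPC messages back to back.  Nothing here corresponds to a header type defined in any existing or proposed specification; it is shown only to indicate the sort of bookkeeping involved.

   /* Hypothetical layout for a transmission carrying several small RPC
    * messages; illustrative only. */
   #include <stdint.h>

   #define MAX_BATCHED_MSGS 8

   struct batched_xmit_hdr {
       uint32_t msg_count;                 /* RPC messages in this send     */
       uint32_t offset[MAX_BATCHED_MSGS];  /* start of each payload stream  */
       uint32_t length[MAX_BATCHED_MSGS];  /* length of each payload stream */
       /* The payload streams themselves follow the header, back to back. */
   };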
5. Other Possible Approaches

It is possible that the additional round trips associated with writing data to the server might be addressed outside the context of RPC-over-RDMA, by avoiding use of the RDMA paradigm for such transfers.

One possibility that has been discussed is the use of an RDMA-based pNFS mapping type, in which areas in server memory are presented via RDMA-based layouts so that the client could obtain file data using RDMA Read and modify it using RDMA Write.  Only a single round trip would be required to effect each transfer, assuming that the appropriate layouts have been obtained.  Although some I-Ds have been written presenting the outlines of this approach, none are currently active.

6. Summary

We've examined the issue of round-trip latency and concluded:

o  That the number of round trips per se is not as important as the contribution of any extra round trips to overall request latency.

o  That the latency issue can be addressed using the extension mechanism provided for in [I-D.cel-nfsv4-rpcrdma-version-two].

o  That in many cases in which latency is not an issue, there may be overhead issues that can be addressed using the same sorts of techniques as those useful in latency reduction, again using the extension mechanism provided for in [I-D.cel-nfsv4-rpcrdma-version-two].

As it seems that the features sketched out could put internode latencies and overhead for a large class of requests back to the baseline value for the RPC paradigm, more detailed definition of the required extension functionality is in order.

We've also looked at round trips at the physical level, in that acknowledgments are sent in circumstances where there is no obvious need for them.  With regard to these, we have concluded:

o  That these acknowledgements do not contribute to request latency.

o  That while UD transmission can remove acknowledgements of limited value, the performance benefits are not sufficient to justify the disruption that this would entail.

o  That issues with transmission bandwidth overhead in a small-message environment are better addressed by combining multiple RPC messages in a single RPC-over-RDMA transmission.  This is particularly so because such a step is likely to reduce overhead in such environments as well.

As the features described involve the use of alternatives to explicit RDMA operations, both in performing direct data placement and in transferring messages that are larger than the receive buffer limit, it is appropriate to understand the role that such operations are expected to have once the extensions discussed in this document are fully specified and implemented.

It is important to note that these extensions are OPTIONAL and are expected to remain so, while support for explicit RDMA operations will remain an integral part of RPC-over-RDMA.

Given this framework, the degree to which explicit RDMA operations will be used will reflect future implementation choices and needs.  While we have been focusing on cases in which other options might be more efficient, it is worth looking also at the cases in which explicit RDMA operations are likely to remain preferable:

o  In some environments, direct data placement to memory of a certain alignment does not meet application requirements, and data needs to be read into a particular address on the client.  Large physically contiguous buffers may also be required in some environments.  In these situations, send-based data placement is not an option.

o  Where large transfers are to be done, there will be limits to the capacity of send-based data placement to provide the required functionality, since the basic pattern using send/receive is to allocate a pool of memory to contain receive buffers in advance of issuing requests.  While this issue can be mitigated by use of message continuation, tying up large numbers of credits for a single request can cause difficult issues as well.  As a result, send-based data placement may be restricted to I/Os of limited size, although the specific limits will depend on the details of the implementation.
7. Security Considerations

This document does not raise any security issues.

8. IANA Considerations

This document does not require any actions by IANA.

9. References

9.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, <https://www.rfc-editor.org/info/rfc8166>.

[RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding to RPC-over-RDMA Version 1", RFC 8267, DOI 10.17487/RFC8267, October 2017, <https://www.rfc-editor.org/info/rfc8267>.

9.2. Informative References

[I-D.cel-nfsv4-rpcrdma-version-two]  Lever, C. and D. Noveck, "RPC-over-RDMA Version 2 Protocol", draft-cel-nfsv4-rpcrdma-version-two-06 (work in progress), January 2018.

[I-D.dnoveck-nfsv4-rpcrdma-rtrext]  Noveck, D., "RPC-over-RDMA Extensions to Reduce Internode Round-trips", draft-dnoveck-nfsv4-rpcrdma-rtrext-03 (work in progress), December 2017.

[RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access Transport for Remote Procedure Call", RFC 5666, DOI 10.17487/RFC5666, January 2010, <https://www.rfc-editor.org/info/rfc5666>.

Appendix A. Acknowledgements

The author gratefully acknowledges the work of Brent Callaghan and Tom Talpey in producing the original RPC-over-RDMA Version One specification [RFC5666], and also Tom's work in helping to clarify that specification.

The author also wishes to thank Chuck Lever for his work reviving RDMA support for NFS in [RFC8166] and [RFC8267], for providing a path for incremental improvement of that support by his work on [I-D.cel-nfsv4-rpcrdma-version-two], and for helpful discussions regarding RPC-over-RDMA latency issues.

Author's Address

David Noveck
NetApp
1601 Trapelo Road
Waltham, MA 02451
United States of America

Phone: +1 781-572-8038
Email: davenoveck@gmail.com