Network File System Version 4                                  D. Noveck
Internet-Draft                                         February 27, 2017
Intended status: Informational
Expires: August 31, 2017

         Issues Related to RPC-over-RDMA Internode Round Trips
                 draft-dnoveck-nfsv4-rpcrdma-rtissues-03

Abstract

   As currently designed and implemented, the RPC-over-RDMA protocol
   requires the use of multiple internode round trips to process some
   common operations.  For example, NFS WRITE operations require three
   internode round trips.  This document examines this issue and
   discusses what can, and what should, be done to address it, both
   within the context of an extensible version of RPC-over-RDMA and
   potentially outside that framework.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 31, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Preliminaries
   1.1.  Requirements Language
   1.2.  Introduction
   2.  Review of the Current Situation
   2.1.  Troublesome Requests
   2.2.  WRITE Request Processing Details
   2.3.  READ Request Processing Details
   3.  Near-term Work
   3.1.  Target Performance
   3.2.  Message Continuation
   3.3.  Send-based DDP
   3.4.  Feature Synergy
   3.5.  Feature Selection and Negotiation
   4.  Possible Future Development
   5.  Summary
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
   8.1.  Normative References
   8.2.  Informative References
   Appendix A.  Acknowledgements
   Author's Address

1.  Preliminaries

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2.  Introduction

   When many common operations are performed using RPC-over-RDMA,
   additional internode round trips are required in order to take
   advantage of the performance benefits provided by RDMA
   functionality.

   While the latencies involved are generally small, they are a concern
   for two reasons:

   o  With the ongoing improvement of persistent memory technologies,
      such internode latencies, being fixed, can be expected to consume
      an increasing portion of the total latency required to process
      NFS requests using RPC-over-RDMA.

   o  High-performance transfers using NFS may be needed outside of a
      machine-room environment.  As RPC-over-RDMA is used in networks
      of campus and metropolitan scale, the internode round-trip
      propagation time of roughly sixteen microseconds per mile becomes
      an issue (see the worked example below).
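   The sixteen-microsecond figure follows directly from the propagation
   speed of light in fiber, roughly two-thirds of c.  The fragment
   below, which is purely illustrative and uses approximate constants,
   performs the check:

      #include <stdio.h>

      int main(void)
      {
          const double c    = 299792458.0;  /* speed of light, m/s   */
          const double v    = c / 1.5;      /* approx. speed in fiber */
          const double mile = 1609.344;     /* meters per mile        */

          /* Round trip: out and back over one mile of fiber. */
          double rtt = 2.0 * mile / v;
          printf("%.1f microseconds per mile\n", rtt * 1e6); /* ~16.1 */
          return 0;
      }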
   Given this background, round trips beyond the minimum necessary need
   to be justified by corresponding benefits.  If they are not, work
   needs to be done to eliminate those excess round trips.

   We are going to look at the existing situation with regard to
   round-trip latency and make some suggestions as to how the issue
   might best be addressed.  We will consider things that could be done
   in the near future and also explore further possibilities that would
   require a longer-term approach to be adopted.

2.  Review of the Current Situation

2.1.  Troublesome Requests

   We will be looking at four sorts of situations:

   o  An RPC operation involving Direct Data Placement of request data
      (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).

   o  An RPC operation involving Direct Data Placement of response data
      (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).

   o  An RPC operation where the request data is longer than the inline
      buffer limit.

   o  An RPC operation where the response data is longer than the
      inline buffer limit.

   These are all simple examples of situations in which explicit RDMA
   operations are used, either to effect Direct Data Placement or to
   respond to message size limits that derive from a limited receive
   buffer size.

   We will survey the resulting latency and overhead issues in an
   RPC-over-RDMA Version One environment in Sections 2.2 and 2.3 below.

2.2.  WRITE Request Processing Details

   We'll start with the case of a request involving direct placement of
   request data.  In this case, an RDMA READ is used to transfer a
   DDP-eligible data item (e.g., the data to be written) from its
   location in requester memory to a location selected by the
   responder.

   Processing proceeds as described below.  Although we are focused on
   internode latency, the time to perform a request also includes such
   things as interrupt latency, overhead involved in interacting with
   the RNIC, and the time for the server to execute the requested
   operation.

   o  First, the memory to be accessed remotely is registered.  This is
      a local operation.

   o  Once the registration has been done, the initial send of the
      request can proceed.  Since this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received by the responder.  As a result, only the responder-bound
      side of the transmission contributes to overall operation
      latency.

   o  The responder, after being notified of the receipt of the
      request, uses RDMA READ to fetch the bulk data.  This involves an
      internode round-trip latency.  After the fetch of the data, the
      responder needs to be notified of the completion of the explicit
      RDMA operation.

   o  The responder (after performing the requested operation) sends
      the response.  Again, as this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received by the requester.

   o  The memory registered before the request was issued needs to be
      deregistered before the request is considered complete and the
      sending process restarted.  When remote invalidation is not
      available, the requester, after being notified of the receipt of
      the response, performs a local operation to deregister the memory
      in question.  Alternatively, the responder will use Send With
      Invalidate, and the requester's RNIC will effect the
      deregistration before notifying the requester of the received
      response.

   To summarize, if we exclude the actual server execution of the
   request, the latency consists of two internode round-trip latencies,
   plus two responder-side interrupt latencies, plus one requester-side
   interrupt latency, plus any necessary registration/deregistration
   overhead.  This is in contrast to a request not using explicit RDMA
   operations, in which there is a single internode round-trip latency
   and one interrupt latency on each of the requester and the
   responder.

   The processing of the other sorts of requests mentioned in
   Section 2.1 shows both similarities and differences:

   o  Handling of a long request is similar to the above.  The memory
      associated with a position-zero read chunk is registered,
      transferred using RDMA READ, and deregistered.  As a result, we
      have the same overhead and latency issues noted in the case of
      direct data placement, without the corresponding benefits.

   o  The case of direct data placement of response data follows a
      similar pattern.  The important difference is that the transfer
      of the bulk data is performed using RDMA WRITE, rather than RDMA
      READ.  However, because of the way that RDMA WRITE is effected
      over the wire, the latency consequences are different.  See
      Section 2.3 for a detailed discussion.

   o  Handling of a long response is similar to the previous case.
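   To make the WRITE processing steps above concrete, the sketch below
   shows the requester side using the libibverbs API.  It is a minimal
   illustration, not part of any RPC-over-RDMA specification: it
   assumes an already-connected queue pair and a pre-registered buffer
   holding the inline request, and it elides completion processing and
   most error handling.

      #include <stddef.h>
      #include <stdint.h>
      #include <infiniband/verbs.h>

      /* Requester-side steps for a WRITE request.  On success,
         *bulk_mrp holds the MR to be invalidated (remotely or
         locally) once the reply arrives. */
      static int post_write_request(struct ibv_qp *qp, struct ibv_pd *pd,
                                    void *bulk, size_t bulk_len,
                                    struct ibv_mr *hdr_mr, size_t hdr_len,
                                    struct ibv_mr **bulk_mrp)
      {
          struct ibv_send_wr wr = { 0 }, *bad_wr;
          struct ibv_sge sge;

          /* Step 1 (local): register the bulk data so the responder
             can fetch it with RDMA READ; the rkey, address, and
             length would be advertised in the request's read chunk. */
          struct ibv_mr *mr = ibv_reg_mr(pd, bulk, bulk_len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ);
          if (mr == NULL)
              return -1;

          /* Step 2: send the inline request.  Only the responder-bound
             half of this exchange contributes to operation latency. */
          sge.addr   = (uint64_t)(uintptr_t)hdr_mr->addr;
          sge.length = (uint32_t)hdr_len;
          sge.lkey   = hdr_mr->lkey;
          wr.opcode     = IBV_WR_SEND;
          wr.sg_list    = &sge;
          wr.num_sge    = 1;
          wr.send_flags = IBV_SEND_SIGNALED;
          if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
              ibv_dereg_mr(mr);
              return -1;
          }

          /* Steps 3 and 4 occur on the responder: an RDMA READ of the
             bulk data (one full internode round trip plus a
             responder-side interrupt), then a Send of the reply.
             Step 5: on reply receipt, the requester calls
             ibv_dereg_mr(), unless the responder used Send With
             Invalidate, in which case the requester's RNIC has
             already invalidated the rkey. */
          *bulk_mrp = mr;
          return 0;
      }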
2.3.  READ Request Processing Details

   We'll now discuss the case of a request involving direct placement
   of response data.  In this case, an RDMA WRITE is used to transfer a
   DDP-eligible data item (e.g., the data being read) from its location
   in responder memory to a location previously selected by the
   requester.

   Processing proceeds as described below.  Although we are focused on
   internode latency, the time to perform a request also includes such
   things as interrupt latency, overhead involved in interacting with
   the RNIC, and the time for the server to execute the requested
   operation.

   o  First, the memory to be accessed remotely is registered.  This is
      a local operation.

   o  Once the registration has been done, the initial send of the
      request can proceed.  Since this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed after the initial transmission is
      received.  As a result, only the responder-bound side of the
      transmission contributes to overall operation latency.

   o  The responder, after being notified of the receipt of the
      request, proceeds to process the request until the data to be
      read is available in its own memory, with its location determined
      and fixed.  It then uses RDMA WRITE to transfer the bulk data to
      the location in requester memory selected previously.  This
      involves an internode latency, but there is no round trip and
      thus no round-trip latency.

   o  The responder continues processing and sends the inline portion
      of the response.  Again, as this is in the context of connected
      operation, there is an internode round trip involved.  However,
      the next step can proceed immediately.  If the RDMA WRITE or the
      send of the inline portion of the response were to fail, the
      responder can be notified subsequently.

   o  The requester, after being notified of the receipt of the
      response, can analyze it and can access the data written into its
      memory.  Deregistration of the memory originally registered
      before the request was issued can be done using remote
      invalidation or can be done by the requester as a local
      operation.

   To summarize, in this case the additional latency that we saw in
   Section 2.2 does not arise.  Except for the additional overhead due
   to memory registration and invalidation, the situation is the same
   as for a request not using explicit RDMA operations, in which there
   is a single internode round-trip latency and one interrupt latency
   on each of the requester and the responder.
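   The absence of an additional round trip can be seen in how a
   responder might post the two operations involved.  In the
   illustrative libibverbs sketch below (again, not part of any
   specification; setup and error handling are elided), the RDMA WRITE
   and the reply Send are chained on the same queue pair, where they
   execute in order without waiting on any intervening acknowledgment:

      #include <stddef.h>
      #include <stdint.h>
      #include <infiniband/verbs.h>

      /* Responder-side completion of a READ request.  "data_mr"
         covers the bulk data in responder memory; "peer_addr" and
         "peer_rkey" come from the write chunk in the request;
         "reply_mr" holds the inline portion of the reply. */
      static int post_read_reply(struct ibv_qp *qp,
                                 struct ibv_mr *data_mr, size_t data_len,
                                 uint64_t peer_addr, uint32_t peer_rkey,
                                 struct ibv_mr *reply_mr, size_t reply_len)
      {
          struct ibv_sge dsge = {
              .addr   = (uint64_t)(uintptr_t)data_mr->addr,
              .length = (uint32_t)data_len,
              .lkey   = data_mr->lkey,
          };
          struct ibv_sge rsge = {
              .addr   = (uint64_t)(uintptr_t)reply_mr->addr,
              .length = (uint32_t)reply_len,
              .lkey   = reply_mr->lkey,
          };
          struct ibv_send_wr wwr = { 0 }, swr = { 0 }, *bad_wr;

          /* RDMA WRITE of the bulk data into requester memory.  This
             is a one-way transfer; no round trip is serialized into
             the request. */
          wwr.opcode              = IBV_WR_RDMA_WRITE;
          wwr.sg_list             = &dsge;
          wwr.num_sge             = 1;
          wwr.wr.rdma.remote_addr = peer_addr;
          wwr.wr.rdma.rkey        = peer_rkey;
          wwr.next                = &swr;  /* chain the Send behind it */

          /* Send of the inline reply.  The QP executes it after the
             WRITE, so the data is placed before the requester is
             notified of the reply. */
          swr.opcode     = IBV_WR_SEND;
          swr.sg_list    = &rsge;
          swr.num_sge    = 1;
          swr.send_flags = IBV_SEND_SIGNALED;

          return ibv_post_send(qp, &wwr, &bad_wr);
      }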
3.  Near-term Work

   We are going to consider how the latency and overhead issues
   discussed in Section 2 might be addressed in the context of an
   extensible version of RPC-over-RDMA, such as that proposed in
   [rpcrdmav2].

   In Section 3.1, we will establish a performance target for the
   troublesome requests, based on the performance of requests that do
   not involve long messages or direct data placement.

   We will then consider how extensions might be defined to bring
   latency and overhead for the requests discussed in Section 2.1 into
   line with those for other requests.  There will be two specific
   classes of requests to address:

   o  Those that do not involve direct data placement will be addressed
      in Section 3.2.  In this case, there are no compensating benefits
      justifying the higher overhead and, in some cases, latency.

   o  The more complicated case of requests that do involve direct data
      placement is discussed in Section 3.3.  In this case, direct data
      placement could serve as a compensating benefit, and the
      important question to be addressed is whether Direct Data
      Placement can be effected without the use of explicit RDMA
      operations.

   The optional features to deal with each of the classes of messages
   discussed above could be implemented separately.  However, in the
   handling of RPCs with very large amounts of bulk data, the two
   features are synergistic.  This fact makes it desirable to define
   the features as part of the same extension.  See Sections 3.4 and
   3.5 for details.

3.1.  Target Performance

   As our target, we will look at the latency and overhead associated
   with other sorts of RPC requests, i.e., those that do not use DDP
   and whose request and response messages fit within the receive
   buffer limit.

   Processing proceeds as follows:

   o  The initial send of the request is done.  Since this is in the
      context of connected operation, there is an internode round trip
      involved.  However, the next step can proceed after the initial
      transmission is received.  As a result, only the responder-bound
      side of the transmission contributes to overall operation
      latency.

   o  The responder, after being notified of the receipt of the
      request, performs the requested operation and sends the reply.
      As in the case of the request, there is an internode round trip
      involved.  However, the request can be considered complete upon
      receipt of the requester-bound transmission.  The responder-bound
      acknowledgment does not contribute to request latency.

   In this case, there is only a single internode round-trip latency
   necessary to effect the RPC.  Total request latency includes this
   round-trip latency plus interrupt latency on the requester and
   responder, plus the time for the responder to actually perform the
   requested operation.

   Thus the delta between the operations discussed in Section 2 and our
   baseline consists of two portions, one of which applies to all the
   requests we are concerned with and the second of which applies only
   to requests that involve use of RDMA READ, as discussed in
   Section 2.2.  The latter category consists of:

   o  One additional internode round-trip latency.

   o  One additional instance of responder-side interrupt latency.

   The additional overhead necessary to do memory registration and
   deregistration applies to all requests using explicit RDMA
   operations.  The costs will vary with implementation
   characteristics.  As a result, in some implementations it may be
   desirable to replace use of RDMA Write with send-based alternatives,
   while in others, use of RDMA Write may be preferable.
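   The accounting above can be restated as a small model.  The fragment
   below does no more than restate this section's arithmetic; the input
   values themselves would be implementation-specific measurements.

      /* Rough model of the latency accounting in Sections 2.2 and
         3.1.  All inputs are in the same time unit. */
      static double request_latency(double round_trip,
                                    double interrupt_latency,
                                    int serialized_rdma_reads,
                                    double registration_overhead)
      {
          /* Baseline: one internode round trip plus one interrupt on
             each of the requester and the responder. */
          double t = round_trip + 2.0 * interrupt_latency;

          /* Each serialized RDMA READ adds one more round trip and
             one more responder-side interrupt (Section 2.2). */
          t += serialized_rdma_reads * (round_trip + interrupt_latency);

          /* Registration and deregistration costs apply to any
             request using explicit RDMA operations. */
          return t + registration_overhead;
      }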
3.2.  Message Continuation

   Using multiple RPC-over-RDMA transmissions, in sequence, to send a
   single RPC message avoids the additional latency and overhead
   associated with the use of explicit RDMA operations to transfer
   position-zero read chunks.  In the case of reply chunks, only
   overhead is reduced.

   Although transfer of a single request or reply in N transmissions
   will involve N+1 internode latencies, overall request latency is not
   increased, because there is no requirement that operations involving
   multiple nodes be serialized.  Generally, these transmissions are
   pipelined.

   As an illustration, let's consider the case of a request involving a
   response consisting of two RPC-over-RDMA transmissions.  Even though
   each of these transmissions is acknowledged, that acknowledgment
   does not contribute to request latency.  The second transmission can
   be received by the requester and acted upon without waiting for
   either acknowledgment.

   This situation would require multiple receive-side interrupts, but
   it is unlikely to result in extended interrupt latency.  With 1K
   sends (Version One), the second receive will complete about 200
   nanoseconds after the first, assuming a 40 Gb/s transmission rate.
   Given likely interrupt latencies, the first interrupt routine would
   be able to note that the completion of the second receive had
   already occurred.
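   The 200-nanosecond figure is simply the serialization delay of a 1K
   transmission.  As a quick, purely illustrative check:

      /* Serialization delay of one transmission.  Bits divided by
         gigabits per second yields nanoseconds. */
      static double serialization_delay_ns(double bytes, double gbps)
      {
          return bytes * 8.0 / gbps;
      }

      /* serialization_delay_ns(1024, 40.0) == 204.8, i.e., the
         second 1K receive completes about 200 ns after the first. */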
3.3.  Send-based DDP

   In order to effect proper placement of request or reply data within
   the context of individual RPC-over-RDMA transmissions, receive
   buffers need to be structured to accommodate this function.

   To illustrate the considerations that could lead clients and servers
   to choose particular buffer structures, we will use, as examples,
   the cases of NFS READs and WRITEs of 8K data blocks (or the
   corresponding NFSv4 COMPOUNDs).

   In such cases, the client and server need to have the DDP-eligible
   bulk data placed in appropriately aligned 8K buffer segments.
   Rather than being transferred in separate transmissions using
   explicit RDMA operations, a message can be sent so that bulk data is
   received into an appropriate buffer segment.  In this case, it would
   be excised from the XDR payload stream, just as it is in the case of
   existing DDP facilities.

   Consider a server expecting write requests which are usually X bytes
   long or less, exclusive of an 8K bulk data area.  In this case, the
   payload stream will most likely be less than X bytes and will fit in
   a buffer segment devoted to that purpose.  The bulk data needs to be
   placed in the subsequent buffer segment in order to arrive, with the
   appropriate alignment, in the DDP target buffer.  To place the data
   appropriately, the sender (in this case, the client) needs to add
   padding of length X-Y bytes, where Y is the length of the payload
   stream for the current request.  The case of reads is exactly the
   same, except that the sender adding the padding is the server.  (A
   sketch of this layout appears after the list below.)

   To provide send-based DDP as an RPC-over-RDMA extension, the
   framework defined in [rpcrdmav2] could be used.  A new "transport
   characteristic" could be defined which allowed a participant to
   expose the structure of its receive buffers and to identify the
   buffer segments capable of being used as DDP targets.  In addition,
   a new optional message header would have to be defined.  It would
   provide:

   o  A way to designate a DDP-eligible data item as corresponding to
      target buffer segments, rather than memory registered for RDMA.

   o  A way to indicate to the responder that it should place
      DDP-eligible data items in DDP-targetable buffer segments, rather
      than in memory registered for RDMA.

   o  A way to designate a limited portion of an RPC-over-RDMA
      transmission as constituting the payload stream.
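   The sketch below illustrates the sender-side layout arithmetic
   described earlier in this section.  The function name and parameters
   are purely illustrative; nothing here is defined by any existing
   RPC-over-RDMA specification.  The receiver is assumed to have
   exposed a payload-stream segment of X bytes followed by an aligned,
   DDP-targetable bulk segment.

      #include <stddef.h>
      #include <string.h>

      /* Lay out a send so that the bulk data lands, properly
         aligned, in the receiver's DDP-targetable buffer segment.
         x_len is X, the receiver's payload-stream segment size;
         y_len is Y, the length of this request's payload stream. */
      static size_t layout_with_padding(unsigned char *send_buf,
                                        size_t x_len,
                                        const void *payload, size_t y_len,
                                        const void *bulk, size_t bulk_len)
      {
          if (y_len > x_len)
              return 0;   /* payload stream does not fit the segment */

          memcpy(send_buf, payload, y_len);           /* payload stream */
          memset(send_buf + y_len, 0, x_len - y_len); /* X-Y pad bytes  */

          /* The bulk data begins exactly at offset X and is therefore
             received into the aligned bulk segment. */
          memcpy(send_buf + x_len, bulk, bulk_len);
          return x_len + bulk_len;
      }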
3.4.  Feature Synergy

   While message continuation and send-based DDP each address an
   important class of commonly used messages, their combination allows
   simpler handling of some important classes of messages:

   o  READs and WRITEs transferring larger I/Os.

   o  COMPOUNDs containing multiple I/O operations.

   o  Operations whose associated payload stream is longer than the
      typical value.

   To accommodate these situations, it would be best to have the
   headers defined to support message continuation interact with the
   data structures supporting send-based DDP as follows:

   o  The header type used for the initial transmission of a message
      continued across multiple transmissions would contain
      DDP-directing structures which support both send-based DDP and
      DDP using explicit RDMA operations.

   o  Buffer references for send-based DDP should be relative to the
      start of the group of transmissions and should allow transitions
      between buffer segments in different receive buffers.

   o  The header type for messages continuing a group of transmissions
      should not have DDP-related fields but should rely on the initial
      transmission of the group for DDP-related functions.

   o  The portion of each received transmission devoted to the payload
      stream should be indicated in the header for each message within
      a group of transmissions devoted to a single RPC message.  The
      payload stream for the message as a whole should be the
      concatenation of the streams for each transmission.

   A potential extension supporting these features, interacting as
   described above, can be found in [rtrext].

3.5.  Feature Selection and Negotiation

   Given that an appropriate extension is likely to support multiple
   OPTIONAL features, special attention will have to be given to
   defining how implementations which might not support the same subset
   of OPTIONAL features can successfully interact.  The goal is to
   allow interacting implementations to get the benefit of features
   that they both support, while allowing implementation pairs that do
   not share support for any of the OPTIONAL features to operate just
   as base Version Two implementations could do in the absence of the
   potential extension.

   It is helpful if each implementation provides characteristics
   defining its level of feature support, which the peer implementation
   can test before attempting to use a particular feature.  In other
   similar contexts, the support level concerns the implementation in
   its role as responder, i.e., whether it is prepared to execute a
   given request.  In the case of the potential extension discussed
   here, most characteristics concern an implementation in its role as
   receiver.  One might define characteristics which indicate:

   o  The ability of the implementation, in its role as receiver, to
      process messages continued across multiple RPC-over-RDMA
      transmissions.

   o  The ability of the implementation, in its role as receiver, to
      process messages containing DDP-eligible data items, directly
      placed using a send-based DDP approach.

   Use of such characteristics might allow asymmetric implementations.
   For example, a client might send requests containing DDP-eligible
   data items using send-based DDP without being able to accept
   messages containing data items placed using send-based DDP.  That is
   a likely implementation pattern, given the greater performance
   benefits of avoiding use of RDMA Read.

   Further useful characteristics would apply to the implementation in
   its role as responder.  For instance:

   o  The ability of the implementation, in its role as responder, to
      accept and process requests which REQUIRE that DDP-eligible data
      items in the response be sent using send-based DDP.  The presence
      of this characteristic would allow a requester to avoid
      registering memory to accommodate DDP-eligible data items in the
      response.

   o  The ability of the implementation, in its role as responder, to
      send responses using message continuation, as opposed to using a
      reply chunk.

   Because of the potentially different needs of operations in the
   forward and backward directions, it may be desirable to separate the
   receiver-based characteristics according to the direction of
   operation to which they apply.

   A further issue relates to the role of explicit RDMA operations in
   connection with backward operation.  Although no current protocols
   require support for DDP or transfer of large messages when operating
   in the backward direction, the protocol is designed to allow such
   support to be developed in the future.  Since the protocol, with the
   extension discussed here, is likely to have multiple methods of
   providing these functions, there are a number of possible choices
   regarding the role of chunk-based methods of providing them:

   o  Support for chunk-based operation remains a REQUIREMENT for
      responders, and requesters always have the option of using it,
      regardless of the direction of operation.

      Requesters could select alternatives to the use of explicit RDMA
      operations when these are supported by the responder.

   o  When operating in the forward direction, support for chunk-based
      operation remains a REQUIREMENT for responders (i.e., servers)
      and requesters (i.e., clients).

      When operating in the backward direction, support for chunk-based
      operation is OPTIONAL for responders (i.e., clients), allowing
      requesters (i.e., servers) to select use of explicit RDMA
      operations or alternatives when each of these is supported by the
      responder.

   o  Support for chunk-based operation is treated as OPTIONAL for
      responders, regardless of the direction of operation.

      In this case, requesters would select use of explicit RDMA
      operations or alternatives when each of these is supported by the
      responder.  For a considerable time, support for explicit RDMA
      operations would be a practical necessity, even if not a
      REQUIREMENT, for operation in the forward direction.
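   One way to picture such characteristics is as a set of support flags
   tested before a feature is used.  The flags and the test below are
   hypothetical; the actual characteristics, and the way they are
   exchanged, would be defined by the extension itself.

      /* Hypothetical per-direction feature-support flags. */
      enum rt_support {
          RT_RECV_CONTINUED = 1 << 0, /* can receive continued msgs  */
          RT_RECV_SEND_DDP  = 1 << 1, /* can receive send-based DDP  */
          RT_RESP_SEND_DDP  = 1 << 2, /* as responder, honors requests
                                         requiring send-based DDP in
                                         the reply                   */
          RT_RESP_CONTINUED = 1 << 3  /* as responder, can continue a
                                         reply instead of using a
                                         reply chunk                 */
      };

      /* A requester may skip registering memory for reply data only
         when the responder commits to send-based DDP for the reply
         and the requester itself can receive such replies. */
      static int reply_registration_unneeded(unsigned peer,
                                             unsigned self)
      {
          return (peer & RT_RESP_SEND_DDP) &&
                 (self & RT_RECV_SEND_DDP);
      }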
4.  Possible Future Development

   Although reducing the use of explicit RDMA operations reduces the
   number of internode round trips and eliminates sequences of
   operations in which multiple round-trip latencies are serialized
   with server interrupt latencies, the use of connected operation
   means that round-trip latencies will always be present, since each
   message is acknowledged.

   One avenue that has been considered is use of unreliable-datagram
   (UD) transmission in environments where the "unreliable"
   transmission is sufficiently reliable that RPC replay can deal with
   a very low rate of message loss.  For example, UD in InfiniBand
   specifies a low enough rate of frame loss to make this a viable
   approach, particularly for use in supporting protocols, such as
   NFSv4.1, that contain their own facilities to ensure exactly-once
   semantics.

   With this sort of arrangement, request latency is still the same.
   However, since the acknowledgments are not serving any substantial
   function, it is tempting to consider removing them, as they do take
   up some transmission bandwidth that might otherwise be used, if the
   protocol were to reach the goal of effectively using the underlying
   medium.

   The amount of wasted transmission bandwidth depends on the average
   message size and many implementation considerations regarding how
   acknowledgments are done.  In any case, given expected message
   sizes, the wasted transmission bandwidth will be very small.

   When RPC messages are quite small, acknowledgments may be of
   concern.  However, in that situation, a better response would be to
   transfer multiple RPC messages within a single RPC-over-RDMA
   transmission.

   When multiple RPC messages are combined into a single transmission,
   the overhead of interfacing with the RNIC, particularly the
   interrupt handling overhead, is amortized over multiple RPC
   messages.

   Although this technique is quite outside the spirit of existing
   RPC-over-RDMA implementations, it appears possible to define new
   header types capable of supporting this sort of transmission, using
   the extension framework described in [rpcrdmav2].
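   A purely hypothetical framing for such a header type is sketched
   below; no such structure exists in any current specification, and
   the field names are the sketch's own.

      #include <stdint.h>

      /* One such header per transmission. */
      struct batch_hdr {
          uint32_t bh_msg_count;  /* RPC messages in this transmission */
      };

      /* One slot per contained RPC message. */
      struct batch_slot {
          uint32_t bs_offset;     /* byte offset within transmission */
          uint32_t bs_length;     /* length of that RPC message      */
      };

      /* On receipt, a single completion covers bh_msg_count RPC
         messages, amortizing interrupt handling across all of them. */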
5.  Summary

   We've examined the issue of round-trip latency and concluded:

   o  That the number of round trips per se is not as important as the
      contribution of any extra round trips to overall request latency.

   o  That the latency issue can be addressed using the extension
      mechanism provided for in [rpcrdmav2].

   o  That in many cases in which latency is not an issue, there may be
      overhead issues that can be addressed using the same sorts of
      techniques as those useful in latency reduction, again using the
      extension mechanism provided for in [rpcrdmav2].

   As it seems that the features sketched out could put internode
   latencies and overhead for a large class of requests back to the
   baseline value for the RPC paradigm, more detailed definition of the
   required extension functionality is in order.

   We've also looked at round trips at the physical level, in that
   acknowledgments are sent in circumstances where there is no obvious
   need for them.  With regard to these, we have concluded:

   o  That these acknowledgments do not contribute to request latency.

   o  That while UD transmission can remove acknowledgments of limited
      value, the performance benefits are not sufficient to justify the
      disruption that this would entail.

   o  That issues with transmission bandwidth overhead in a
      small-message environment are better addressed by combining
      multiple RPC messages in a single RPC-over-RDMA transmission.
      This is particularly so because such a step is likely to reduce
      overhead in such environments as well.

   As the features described involve the use of alternatives to
   explicit RDMA operations, both in performing direct data placement
   and in transferring messages that are larger than the receive buffer
   limit, it is appropriate to understand the role that such operations
   are expected to have once the extensions discussed in this document
   are fully specified and implemented.

   It is important to note that these extensions are OPTIONAL and are
   expected to remain so, while support for explicit RDMA operations
   will remain an integral part of RPC-over-RDMA.

   Given this framework, the degree to which explicit RDMA operations
   will be used will reflect future implementation choices and needs.
   While we have been focusing on cases in which other options might be
   more efficient, it is worth looking also at the cases in which
   explicit RDMA operations are likely to remain preferable:

   o  In some environments, direct data placement into memory of a
      certain alignment does not meet application requirements: data
      may need to be read into a particular address on the client, or
      large physically contiguous buffers may be required.  In these
      situations, send-based DDP is not an option.

   o  Where large transfers are to be done, there will be limits to the
      capacity of send-based DDP to provide the required functionality,
      since the basic pattern using send/receive is to allocate a pool
      of memory to contain receive buffers in advance of issuing
      requests.  While this issue can be mitigated by use of message
      continuation, tying up large numbers of credits for a single
      request can cause difficult issues as well.  As a result,
      send-based DDP may be restricted to I/Os of limited size,
      although the specific limits will depend on the details of the
      particular implementation.

6.  Security Considerations

   This document does not raise any security issues.

7.  IANA Considerations

   This document does not require any actions by IANA.

8.  References

8.1.  Normative References

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119,
                DOI 10.17487/RFC2119, March 1997,
                <https://www.rfc-editor.org/info/rfc2119>.

   [rfc5666bis] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote
                Direct Memory Access Transport for Remote Procedure
                Call", February 2017, Work in Progress.

8.2.  Informative References

   [RFC5666]    Talpey, T. and B. Callaghan, "Remote Direct Memory
                Access Transport for Remote Procedure Call", RFC 5666,
                DOI 10.17487/RFC5666, January 2010,
                <https://www.rfc-editor.org/info/rfc5666>.

   [rfc5667bis] Lever, C., Ed., "Network File System (NFS) Upper Layer
                Binding To RPC-Over-RDMA", February 2017, Work in
                Progress.

   [rpcrdmav2]  Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version
                Two", December 2016, Work in Progress.

   [rtrext]     Noveck, D., "RPC-over-RDMA Extensions to Reduce
                Internode Round-trips", December 2016, Work in
                Progress.
Appendix A.  Acknowledgements

   The author gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey in producing the original RPC-over-RDMA Version One
   specification [RFC5666], and also Tom's work in helping to clarify
   that specification.

   The author also wishes to thank Chuck Lever for his work reviving
   RDMA support for NFS in [rfc5666bis], [rfc5667bis], and [rpcrdmav2],
   and for helpful discussion regarding RPC-over-RDMA latency issues.

Author's Address

   David Noveck
   26 Locust Avenue
   Lexington, MA 02421
   USA

   Phone: +1 781-572-8038
   Email: davenoveck@gmail.com