Network File System Version 4                                 D. Noveck
Internet-Draft                                                   NetApp
Intended status: Informational                          August 21, 2017
Expires: February 22, 2018

         Issues Related to RPC-over-RDMA Internode Round Trips
                draft-dnoveck-nfsv4-rpcrdma-rtissues-04

Abstract

As currently designed and implemented, the RPC-over-RDMA protocol
requires use of multiple internode round trips to process some common
operations.  For example, NFS WRITE operations require use of three
internode round trips.  This document looks at this issue and
discusses what can and what should be done to address it, both within
the context of an extensible version of RPC-over-RDMA and potentially
outside that framework.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 22, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1.  Preliminaries
    1.1.  Requirements Language
    1.2.  Introduction
2.  Review of the Current Situation
    2.1.  Troublesome Requests
    2.2.  WRITE Request Processing Details
    2.3.  READ Request Processing Details
3.  Near-term Work
    3.1.  Target Performance
    3.2.  Message Continuation
    3.3.  Send-based DDP
    3.4.  Feature Synergy
    3.5.  Feature Selection and Negotiation
4.  Possible Future Development
5.  Summary
6.  Security Considerations
7.  IANA Considerations
8.  References
    8.1.  Normative References
    8.2.  Informative References
Appendix A.  Acknowledgements
Author's Address
1.  Preliminaries

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

1.2.  Introduction

When many common operations are performed using RPC-over-RDMA, taking
advantage of the performance benefits provided by RDMA functionality
requires additional internode round trips.

While the latencies involved are generally small, they are a concern
for two reasons:

o  With the ongoing improvement of persistent memory technologies,
   such internode latencies, being fixed, can be expected to consume
   an increasing portion of the total latency required for processing
   NFS requests using RPC-over-RDMA.

o  High-performance transfers using NFS may be needed outside of a
   machine-room environment.  As RPC-over-RDMA is used in networks of
   campus and metropolitan scale, the internode round-trip time of
   sixteen microseconds per mile becomes an issue, as the rough
   calculation after this list illustrates.
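As a check on that figure, here is the underlying arithmetic (a
back-of-the-envelope sketch, assuming a typical fiber refractive
index of roughly 1.5):

   speed of light in vacuum:    ~300,000 km/s
   speed of light in fiber:     ~200,000 km/s  (refractive index ~1.5)
   one-way delay per mile:      1.609 km / 200,000 km/s  =  ~8 us
   round-trip delay per mile:   ~16 us

Over a thirty-mile metropolitan path, each internode round trip thus
costs roughly half a millisecond of propagation delay, a cost no
protocol or implementation improvement can remove.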
Given this background, round trips beyond the minimum necessary need
to be justified by corresponding benefits.  If they are not, work
needs to be done to eliminate those excess round trips.

We are going to look at the existing situation with regard to round-
trip latency and make some suggestions as to how the issue might be
best addressed.  We will consider things that could be done in the
near future and also explore further possibilities that would require
a longer-term approach to be adopted.

2.  Review of the Current Situation

2.1.  Troublesome Requests

We will be looking at four sorts of situations:

o  An RPC operation involving Direct Data Placement of request data
   (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).

o  An RPC operation involving Direct Data Placement of response data
   (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).

o  An RPC operation where the request data is longer than the inline
   buffer limit.

o  An RPC operation where the response data is longer than the inline
   buffer limit.

These are all simple examples of situations in which explicit RDMA
operations are used, either to effect Direct Data Placement or to
respond to message size limits that derive from a limited receive
buffer size.

We will survey the resulting latency and overhead issues in an RPC-
over-RDMA Version One environment in Sections 2.2 and 2.3 below.

2.2.  WRITE Request Processing Details

We'll start with the case of a request involving direct placement of
request data.  In this case, an RDMA READ is used to transfer a DDP-
eligible data item (e.g., the data to be written) from its location
in requester memory to a location selected by the responder.

Processing proceeds as described below.  Although we are focused on
internode latency, the time to perform a request also includes such
things as interrupt latency, overhead involved in interacting with
the RNIC, and the time for the server to execute the requested
operation.

o  First, the memory to be accessed remotely is registered.  This is
   a local operation.

o  Once the registration has been done, the initial send of the
   request can proceed.  Since this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received by the responder.  As a result, only the responder-bound
   side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   uses RDMA READ to fetch the bulk data.  This involves an internode
   round-trip latency.  After the fetch of the data, the responder
   needs to be notified of the completion of the explicit RDMA
   operation.

o  The responder (after performing the requested operation) sends the
   response.  Again, as this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received by the requester.

o  The memory registered before the request was issued needs to be
   deregistered before the request is considered complete and the
   sending process restarted.  When remote invalidation is not
   available, the requester, after being notified of the receipt of
   the response, performs a local operation to deregister the memory
   in question.  Alternatively, the responder will use Send With
   Invalidate and the requester's RNIC will effect the deregistration
   before notifying the requester of the response which has been
   received.

To summarize, if we exclude the actual server execution of the
request, the latency consists of two internode round-trip latencies
plus two responder-side interrupt latencies plus one requester-side
interrupt latency plus any necessary registration/deregistration
overhead.  This is in contrast to a request not using explicit RDMA
operations, in which there is a single internode round-trip latency
and one interrupt latency on the requester and the responder.
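The requester's side of this sequence can be sketched in terms of the
verbs interface.  This is an illustrative outline only, not taken
from any implementation: error handling is omitted, and
post_rpc_send() and wait_for_reply() are hypothetical stand-ins for
the surrounding transport code.

   #include <stdint.h>
   #include <infiniband/verbs.h>

   /* Hypothetical helpers standing in for the RPC transport layer. */
   extern void post_rpc_send(struct ibv_qp *qp, uint32_t rkey,
                             uint64_t addr, size_t len);
   extern void wait_for_reply(struct ibv_qp *qp);

   /* Requester-side outline of an NFS WRITE over RPC-over-RDMA
    * Version One. */
   void issue_write_request(struct ibv_pd *pd, struct ibv_qp *qp,
                            void *data, size_t len)
   {
       /* Step 1 (local): register the bulk data so the responder
        * can fetch it with RDMA READ. */
       struct ibv_mr *mr = ibv_reg_mr(pd, data, len,
                                      IBV_ACCESS_LOCAL_WRITE |
                                      IBV_ACCESS_REMOTE_READ);

       /* Step 2: send the RPC call, advertising the read chunk.
        * Only the responder-bound half of this exchange contributes
        * to operation latency. */
       post_rpc_send(qp, mr->rkey, (uint64_t)(uintptr_t)data, len);

       /* Steps 3-4 happen on the responder: it takes an interrupt,
        * posts an RDMA READ (a full internode round trip), takes a
        * second interrupt when that READ completes, performs the
        * write, and sends the reply. */

       /* Step 5: wait for the reply. */
       wait_for_reply(qp);

       /* Step 6 (local): deregister, unless the responder used Send
        * With Invalidate, in which case the requester's RNIC has
        * already invalidated mr before delivering the reply. */
       ibv_dereg_mr(mr);
   }

The two responder-side interrupts (steps 3 and 4 in the comments) and
the RDMA READ round trip between them are exactly the extra latency
identified in the summary above.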
The processing of the other sorts of requests mentioned in
Section 2.1 shows both similarities and differences:

o  Handling of a long request is similar to the above.  The memory
   associated with a position-zero read chunk is registered,
   transferred using RDMA READ, and deregistered.  As a result, we
   have the same overhead and latency issues noted in the case of
   direct data placement, without the corresponding benefits.

o  The case of direct data placement of response data follows a
   similar pattern.  The important difference is that the transfer of
   the bulk data is performed using RDMA WRITE, rather than RDMA
   READ.  However, because of the way that RDMA WRITE is effected
   over the wire, the latency consequences are different.  See
   Section 2.3 for a detailed discussion.

o  Handling of a long response is similar to the previous case.

2.3.  READ Request Processing Details

We'll now discuss the case of a request involving direct placement of
response data.  In this case, an RDMA WRITE is used to transfer a
DDP-eligible data item (e.g., the data being read) from its location
in responder memory to a location previously selected by the
requester.

Processing proceeds as described below.  Although we are focused on
internode latency, the time to perform a request also includes such
things as interrupt latency, overhead involved in interacting with
the RNIC, and the time for the server to execute the requested
operation.

o  First, the memory to be accessed remotely is registered.  This is
   a local operation.

o  Once the registration has been done, the initial send of the
   request can proceed.  Since this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed after the initial transmission is
   received.  As a result, only the responder-bound side of the
   transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   proceeds to process the request until the data to be read is
   available in its own memory, with its location determined and
   fixed.  It then uses RDMA WRITE to transfer the bulk data to the
   location in requester memory selected previously.  This involves
   an internode latency, but there is no round trip and thus no
   round-trip latency.

o  The responder continues processing and sends the inline portion of
   the response.  Again, as this is in the context of connected
   operation, there is an internode round trip involved.  However,
   the next step can proceed immediately.  If the RDMA WRITE or the
   send of the inline portion of the response were to fail, the
   responder can be notified subsequently.

o  The requester, after being notified of the receipt of the
   response, can analyze it and can access the data written into its
   memory.  Deregistration of the memory originally registered before
   the request was issued can be done using remote invalidation or
   can be done by the requester as a local operation.

To summarize, in this case the additional latency that we saw in
Section 2.2 does not arise.  Except for the additional overhead due
to memory registration and invalidation, the situation is the same as
for a request not using explicit RDMA operations, in which there is a
single internode round-trip latency and one interrupt latency on the
requester and the responder.
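The reason no round-trip latency appears is visible at the verbs
level: the responder can post the RDMA WRITE and the reply send back
to back, without waiting for any acknowledgment in between.  A
minimal sketch follows (illustrative only; the work-request setup is
abbreviated and the remote address and rkey are assumed to have come
from the request's write chunk):

   #include <stdint.h>
   #include <string.h>
   #include <infiniband/verbs.h>

   /* Responder side of an NFS READ: push the bulk data with RDMA
    * WRITE, then send the inline reply, in a single chained post. */
   void post_read_reply(struct ibv_qp *qp,
                        struct ibv_sge *data_sge,
                        struct ibv_sge *reply_sge,
                        uint64_t remote_addr, uint32_t rkey)
   {
       struct ibv_send_wr wr_write, wr_send, *bad_wr = NULL;

       memset(&wr_write, 0, sizeof(wr_write));
       wr_write.opcode              = IBV_WR_RDMA_WRITE;
       wr_write.sg_list             = data_sge;
       wr_write.num_sge             = 1;
       wr_write.wr.rdma.remote_addr = remote_addr; /* from write chunk */
       wr_write.wr.rdma.rkey        = rkey;

       memset(&wr_send, 0, sizeof(wr_send));
       wr_send.opcode     = IBV_WR_SEND;
       wr_send.sg_list    = reply_sge;
       wr_send.num_sge    = 1;
       wr_send.send_flags = IBV_SEND_SIGNALED;

       /* Chain the two work requests: the RDMA WRITE goes onto the
        * wire ahead of the reply, but the responder never blocks
        * between them. */
       wr_write.next = &wr_send;
       ibv_post_send(qp, &wr_write, &bad_wr);
   }

On a reliable connection, send ordering on a single queue pair
guarantees the requester will not see the reply before the written
data has been placed, which is what makes this fire-and-forget
pattern safe.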
3.  Near-term Work

We are going to consider how the latency and overhead issues
discussed in Section 2 might be addressed in the context of an
extensible version of RPC-over-RDMA, such as that proposed in
[rpcrdmav2].

In Section 3.1, we will establish a performance target for the
troublesome requests, based on the performance of requests that do
not involve long messages or direct data placement.

We will then consider how extensions might be defined to bring
latency and overhead for the requests discussed in Section 2.1 into
line with those for other requests.  There will be two specific
classes of requests to address:

o  Those that do not involve direct data placement will be addressed
   in Section 3.2.  In this case, there are no compensating benefits
   justifying the higher overhead and, in some cases, latency.

o  The more complicated case of requests that do involve direct data
   placement is discussed in Section 3.3.  In this case, direct data
   placement could serve as a compensating benefit, and the important
   question to be addressed is whether direct data placement can be
   effected without the use of explicit RDMA operations.

The optional features to deal with each of the classes of messages
discussed above could be implemented separately.  However, in the
handling of RPCs with very large amounts of bulk data, the two
features are synergistic.  This fact makes it desirable to define the
features as part of the same extension.  See Sections 3.4 and 3.5 for
details.

3.1.  Target Performance

As our target, we will look at the latency and overhead associated
with other sorts of RPC requests, i.e., those that do not use DDP and
that have request and response messages which fit within the receive
buffer limit.

Processing proceeds as follows:

o  The initial send of the request is done.  Since this is in the
   context of connected operation, there is an internode round trip
   involved.  However, the next step can proceed after the initial
   transmission is received.  As a result, only the responder-bound
   side of the transmission contributes to overall operation latency.

o  The responder, after being notified of the receipt of the request,
   performs the requested operation and sends the reply.  As in the
   case of the request, there is an internode round trip involved.
   However, the request can be considered complete upon receipt of
   the requester-bound transmission.  The responder-bound
   acknowledgment does not contribute to request latency.

In this case, there is only a single internode round-trip latency
necessary to effect the RPC.  Total request latency includes this
round-trip latency plus interrupt latency on the requester and
responder, plus the time for the responder to actually perform the
requested operation.

Thus the delta between the operations discussed in Section 2 and our
baseline consists of two portions, one of which applies to all the
requests we are concerned with and the second of which only applies
to requests which involve use of RDMA READ, as discussed in
Section 2.2.  The latter category consists of:

o  One additional internode round-trip latency.

o  One additional instance of responder-side interrupt latency.

The additional overhead necessary to do memory registration and
deregistration applies to all requests using explicit RDMA
operations.  The costs will vary with implementation characteristics.
As a result, in some implementations, it may be desirable to replace
use of RDMA Write with send-based alternatives, while in others, use
of RDMA Write may be preferable.
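The delta can be summarized in a rough latency model (the notation is
ours, not drawn from any specification): let RTT be the internode
round-trip time, Ireq and Iresp the requester- and responder-side
interrupt latencies, Texec the server execution time, and Treg the
registration/deregistration overhead.  Then, per the analyses in
Sections 2.2, 2.3, and this section:

   baseline request:         L = RTT   + Ireq + Iresp          + Texec
   RDMA-READ-based request:  L = 2*RTT + Ireq + 2*Iresp + Treg + Texec
   RDMA-WRITE-based request: L = RTT   + Ireq + Iresp   + Treg + Texec

The extensions discussed below aim to eliminate the extra RTT and
Iresp terms where they are not buying direct data placement, and to
make Treg avoidable where registration costs dominate.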
3.2.  Message Continuation

Using multiple RPC-over-RDMA transmissions, in sequence, to send a
single RPC message avoids the additional latency and overhead
associated with the use of explicit RDMA operations to transfer
position-zero read chunks.  In the case of reply chunks, only
overhead is reduced.

Although transfer of a single request or reply in N transmissions
will involve N+1 internode latencies, overall request latency is not
increased, since there is no requirement that the operations
involving multiple nodes be serialized.  Generally, these
transmissions are pipelined.

As an illustration, let's consider the case of a request involving a
response consisting of two RPC-over-RDMA transmissions.  Even though
each of these transmissions is acknowledged, that acknowledgment does
not contribute to request latency.  The second transmission can be
received by the requester and acted upon without waiting for either
acknowledgment.

This situation would require multiple receive-side interrupts, but it
is unlikely to result in extended interrupt latency.  With 1K sends
(Version One), the second receive will complete about 200 nanoseconds
after the first, assuming a 40Gb/s transmission rate.  Given likely
interrupt latencies, the first interrupt routine would be able to
note that the completion of the second receive had already occurred.

3.3.  Send-based DDP

In order to effect proper placement of request or reply data within
the context of individual RPC-over-RDMA transmissions, receive
buffers need to be structured to accommodate this function.

To illustrate the considerations that could lead clients and servers
to choose particular buffer structures, we will use, as examples, the
cases of NFS READs and WRITEs of 8K data blocks (or the corresponding
NFSv4 COMPOUNDs).

In such cases, the client and server need to have the DDP-eligible
bulk data placed in appropriately aligned 8K buffer segments.  Rather
than being transferred in separate transmissions using explicit RDMA
operations, a message can be sent so that bulk data is received into
an appropriate buffer segment.  In this case, it would be excised
from the XDR payload stream, just as it is in the case of existing
DDP facilities.

Consider a server expecting write requests which are usually X bytes
long or less, exclusive of an 8K bulk data area.  In this case the
payload stream will most likely be less than X bytes and will fit in
a buffer segment devoted to that purpose.  The bulk data needs to be
placed in the subsequent buffer segment so that it arrives with the
appropriate alignment in the DDP target buffer.  In order to place
the data appropriately, the sender (in this case, the client) needs
to add padding of length X-Y bytes, where Y is the length of the
payload stream for the current request.  The case of reads is exactly
the same, except that the sender adding the padding is the server.
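A sketch of that sender-side layout computation follows, under the
assumptions of the example above.  The constants and the helper are
illustrative only, not part of any defined protocol:

   #include <stddef.h>

   #define SEG_PAYLOAD_MAX  1024   /* "X": receiver's payload segment */
   #define SEG_BULK_SIZE    8192   /* aligned DDP-targetable segment  */

   /* Compute the on-wire layout for a send-based-DDP WRITE: the
    * payload stream occupies the first receive-buffer segment,
    * padded out to the segment boundary so that the bulk data lands
    * in the aligned 8K segment.  Returns the total transmission
    * length, or 0 if the payload does not fit. */
   size_t layout_write(size_t payload_len /* "Y" */, size_t *pad_out)
   {
       if (payload_len > SEG_PAYLOAD_MAX)
           return 0;               /* would need message continuation */

       *pad_out = SEG_PAYLOAD_MAX - payload_len;  /* X - Y bytes      */

       /* payload, then pad, then the bulk data, which the receiving
        * RNIC places directly into the aligned segment with no
        * explicit RDMA operation. */
       return payload_len + *pad_out + SEG_BULK_SIZE;
   }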
To provide send-based DDP as an RPC-over-RDMA extension, the
framework defined in [rpcrdmav2] could be used.  A new "transport
characteristic" could be defined which allowed a participant to
expose the structure of its receive buffers and to identify the
buffer segments capable of being used as DDP targets.  In addition, a
new optional message header would have to be defined, providing:

o  A way to designate a DDP-eligible data item as corresponding to
   target buffer segments, rather than memory registered for RDMA.

o  A way to indicate to the responder that it should place DDP-
   eligible data items in DDP-targetable buffer segments, rather than
   in memory registered for RDMA.

o  A way to designate a limited portion of an RPC-over-RDMA
   transmission as constituting the payload stream.

3.4.  Feature Synergy

While message continuation and send-based DDP each address an
important class of commonly used messages, their combination allows
simpler handling of some important classes of messages:

o  READs and WRITEs transferring larger I/Os.

o  COMPOUNDs containing multiple I/O operations.

o  Operations whose associated payload stream is longer than the
   typical value.

To accommodate these situations, it would be best to have the headers
supporting message continuation interact with the data structures
supporting send-based DDP as follows:

o  The header type used for the initial transmission of a message
   continued across multiple transmissions would contain DDP-
   directing structures which support both send-based DDP and DDP
   using explicit RDMA operations.

o  Buffer references for send-based DDP should be relative to the
   start of the group of transmissions and should allow transitions
   between buffer segments in different receive buffers.

o  The header type for messages continuing a group of transmissions
   should not have DDP-related fields but should rely on the initial
   transmission of the group for DDP-related functions.

o  The portion of each received transmission devoted to the payload
   stream should be indicated in the header for each message within a
   group of transmissions devoted to a single RPC message.  The
   payload stream for the message as a whole should be the
   concatenation of the streams for each transmission.

A potential extension supporting these features, interacting as
described above, can be found in [rtrext].
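To make that interaction concrete, the following sketch shows what
such a pair of header types might look like.  It is purely
hypothetical: the names, fields, and layout are ours, not those of
[rtrext] or [rpcrdmav2], and C structs are used rather than XDR for
brevity.

   #include <stdint.h>

   /* Hypothetical header for the FIRST transmission of a continued
    * message: it carries all DDP-directing information for the
    * whole group, supporting both send-based and chunk-based DDP. */
   struct rtr_first_hdr {
       uint32_t xid;          /* RPC transaction, as in Version One  */
       uint32_t total_xmits;  /* transmissions in the group          */
       uint32_t payload_len;  /* payload-stream bytes in THIS xmit   */
       uint32_t ddp_count;    /* DDP-directing entries that follow;  */
                              /* buffer references are relative to   */
                              /* the start of the group and may      */
                              /* cross receive-buffer boundaries     */
   };

   /* Hypothetical header for CONTINUING transmissions: no
    * DDP-related fields; it relies on the group's initial
    * transmission for those. */
   struct rtr_cont_hdr {
       uint32_t xid;
       uint32_t xmit_index;   /* position within the group           */
       uint32_t payload_len;  /* payload-stream bytes in THIS xmit;  */
                              /* the message's payload stream is the */
                              /* concatenation across the group      */
   };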
3.5.  Feature Selection and Negotiation

Given that an appropriate extension is likely to support multiple
OPTIONAL features, special attention will have to be given to
defining how implementations which might not support the same subset
of OPTIONAL features can successfully interact.  The goal is to allow
interacting implementations to get the benefit of features that they
both support, while allowing implementation pairs that do not share
support for any of the OPTIONAL features to operate just as base
Version Two implementations could do in the absence of the potential
extension.

It is helpful if each implementation provides characteristics
defining its level of feature support, which the peer implementation
can test before attempting to use a particular feature.  In other
similar contexts, the support level concerns the implementation in
its role as responder, i.e., whether it is prepared to execute a
given request.  In the case of the potential extension discussed
here, most characteristics concern an implementation in its role as
receiver.  One might define characteristics which indicate:

o  The ability of the implementation, in its role as receiver, to
   process messages continued across multiple RPC-over-RDMA
   transmissions.

o  The ability of the implementation, in its role as receiver, to
   process messages containing DDP-eligible data items, directly
   placed using a send-based DDP approach.

Use of such characteristics might allow asymmetric implementations.
For example, a client might send requests containing DDP-eligible
data items using send-based DDP without being able to accept messages
whose data items are placed using send-based DDP.  That is a likely
implementation pattern, given the greater performance benefits of
avoiding use of RDMA Read.

Further useful characteristics would apply to the implementation in
its role as responder.  For instance:

o  The ability of the implementation, in its role as responder, to
   accept and process requests which require that DDP-eligible data
   items in the response be sent using send-based DDP.  The presence
   of this characteristic would allow a requester to avoid
   registering memory to be used to accommodate DDP-eligible data
   items in the response.

o  The ability of the implementation, in its role as responder, to
   send responses using message continuation, as opposed to using a
   reply chunk.

Because of the potentially different needs of operations in the
forward and backward directions, it may be desirable to separate the
receiver-based characteristics according to the direction of
operation to which they apply.

A further issue relates to the role of explicit RDMA operations in
connection with backward operation.  Although no current protocols
require support for DDP or transfer of large messages when operating
in the backward direction, the protocol is designed to allow such
support to be developed in the future.  Since the protocol, with the
extension discussed here, is likely to have multiple methods of
providing these functions, there are a number of possible choices
regarding the role of chunk-based methods:

o  Support for chunk-based operation remains a REQUIREMENT for
   responders, and requesters always have the option of using it,
   regardless of the direction of operation.

   Requesters could select alternatives to the use of explicit RDMA
   operations when these are supported by the responder.

o  When operating in the forward direction, support for chunk-based
   operation remains a REQUIREMENT for responders (i.e., servers),
   and requesters (i.e., clients) retain the option of using it.

   When operating in the backward direction, support for chunk-based
   operation is OPTIONAL for responders (i.e., clients), allowing
   requesters (i.e., servers) to select use of explicit RDMA
   operations or alternatives when each of these is supported by the
   responder.

o  Support for chunk-based operation is treated as OPTIONAL for
   responders, regardless of the direction of operation.

   In this case, requesters would select use of explicit RDMA
   operations or alternatives when each of these is supported by the
   responder.  For a considerable time, support for explicit RDMA
   operations would be a practical necessity, even if not a
   REQUIREMENT, for operation in the forward direction.
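On the requester side, the negotiation described above reduces to
checking the peer's advertised characteristics before choosing a
transfer method.  A sketch of that decision follows; the
characteristic names and types are invented for illustration:

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical characteristics a peer might advertise. */
   struct peer_caps {
       bool recv_msg_continuation;  /* can reassemble continued msgs */
       bool recv_send_based_ddp;    /* can place data from sends     */
   };

   typedef enum {
       USE_SEND_DDP, USE_CONTINUATION, USE_CHUNKS
   } xfer_method;

   /* Choose how to move a message too large for one inline buffer:
    * prefer send-based DDP for DDP-eligible data, then message
    * continuation, and fall back to chunk-based explicit RDMA,
    * which base implementations must support. */
   xfer_method choose_method(const struct peer_caps *peer,
                             bool ddp_eligible)
   {
       if (ddp_eligible && peer->recv_send_based_ddp)
           return USE_SEND_DDP;
       if (peer->recv_msg_continuation)
           return USE_CONTINUATION;
       return USE_CHUNKS;
   }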
4.  Possible Future Development

Although reducing the use of explicit RDMA operations reduces the
number of internode round trips and eliminates sequences of
operations in which multiple round-trip latencies are serialized with
server interrupt latencies, the use of connected operation means that
round-trip latencies will always be present, since each message is
acknowledged.

One avenue that has been considered is use of unreliable-datagram
(UD) transmission in environments where the "unreliable" transmission
is sufficiently reliable that RPC replay can deal with a very low
rate of message loss.  For example, UD in InfiniBand specifies a low
enough rate of frame loss to make this a viable approach,
particularly for use in supporting protocols, such as NFSv4.1, that
contain their own facilities to ensure exactly-once semantics.

With this sort of arrangement, request latency is still the same.
However, since the acknowledgments are not serving any substantial
function, it is tempting to consider removing them, as they do take
up some transmission bandwidth that might otherwise be put to use if
the protocol were to reach the goal of effectively using the
underlying medium.

The amount of wasted transmission bandwidth depends on the average
message size and many implementation considerations regarding how
acknowledgments are done.  In any case, given expected message sizes,
the wasted transmission bandwidth will be very small.

When RPC messages are quite small, acknowledgments may be of concern.
However, in that situation, a better response would be to transfer
multiple RPC messages within a single RPC-over-RDMA transmission.

When multiple RPC messages are combined into a single transmission,
the overhead of interfacing with the RNIC, particularly the interrupt
handling overhead, is amortized over multiple RPC messages.

Although this technique is quite outside the spirit of existing RPC-
over-RDMA implementations, it appears possible to define new header
types capable of supporting this sort of transmission, using the
extension framework described in [rpcrdmav2].
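As with the earlier sketches, one can imagine what such a header type
might look like.  The following is purely hypothetical and serves
only to illustrate amortizing per-transmission costs over several RPC
messages:

   #include <stdint.h>

   /* Hypothetical header for a transmission carrying SEVERAL small
    * RPC messages, so that RNIC interfacing and interrupt-handling
    * costs are paid once per transmission rather than once per RPC. */
   struct rtr_multi_hdr {
       uint32_t msg_count;      /* RPC messages in this transmission */
       /* followed by msg_count rtr_multi_entry records */
   };

   struct rtr_multi_entry {
       uint32_t xid;            /* RPC transaction id                */
       uint32_t offset;         /* start of this message's payload   */
       uint32_t length;         /* length of this message's payload  */
   };

A receiver would walk the entry table and hand each payload to the
RPC layer, taking a single interrupt for the whole batch.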
5.  Summary

We've examined the issue of round-trip latency and concluded:

o  That the number of round trips per se is not as important as the
   contribution of any extra round trips to overall request latency.

o  That the latency issue can be addressed using the extension
   mechanism provided for in [rpcrdmav2].

o  That in many cases in which latency is not an issue, there may be
   overhead issues that can be addressed using the same sorts of
   techniques as those useful in latency reduction, again using the
   extension mechanism provided for in [rpcrdmav2].

As it seems that the features sketched out could put internode
latencies and overhead for a large class of requests back to the
baseline value for the RPC paradigm, more detailed definition of the
required extension functionality is in order.

We've also looked at round trips at the physical level, in that
acknowledgments are sent in circumstances where there is no obvious
need for them.  With regard to these, we have concluded:

o  That these acknowledgments do not contribute to request latency.

o  That while UD transmission can remove acknowledgments of limited
   value, the performance benefits are not sufficient to justify the
   disruption that this would entail.

o  That issues with transmission bandwidth overhead in a small-
   message environment are better addressed by combining multiple RPC
   messages in a single RPC-over-RDMA transmission.  This is
   particularly so because such a step is likely to reduce overhead
   in such environments as well.

As the features described involve the use of alternatives to explicit
RDMA operations, in performing direct data placement and in
transferring messages that are larger than the receive buffer limit,
it is appropriate to understand the role that such operations are
expected to have once the extensions discussed in this document are
fully specified and implemented.

It is important to note that these extensions are OPTIONAL and are
expected to remain so, while support for explicit RDMA operations
will remain an integral part of RPC-over-RDMA.

Given this framework, the degree to which explicit RDMA operations
will be used will reflect future implementation choices and needs.
While we have been focusing on cases in which other options might be
more efficient, it is worth looking also at the cases in which
explicit RDMA operations are likely to remain preferable:

o  In some environments, direct data placement to memory of a certain
   alignment does not meet application requirements, and data needs
   to be read into a particular address on the client.  Large
   physically contiguous buffers may also be required in some
   environments.  In these situations, send-based DDP is not an
   option.

o  Where large transfers are to be done, there will be limits to the
   capacity of send-based DDP to provide the required functionality,
   since the basic pattern using send/receive is to allocate a pool
   of memory to contain receive buffers in advance of issuing
   requests.  While this issue can be mitigated by use of message
   continuation, tying up large numbers of credits for a single
   request can cause difficulties as well.  As a result, send-based
   DDP may be restricted to I/Os of limited size, although the
   specific limits will depend on the details of the specific
   implementation.

6.  Security Considerations

This document does not raise any security issues.

7.  IANA Considerations

This document does not require any actions by IANA.

8.  References

8.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
           Memory Access Transport for Remote Procedure Call Version
           1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
           <https://www.rfc-editor.org/info/rfc8166>.

8.2.  Informative References

[RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
           Transport for Remote Procedure Call", RFC 5666,
           DOI 10.17487/RFC5666, January 2010,
           <https://www.rfc-editor.org/info/rfc5666>.

[rfc5667bis]
           Lever, C., Ed., "Network File System (NFS) Upper Layer
           Binding To RPC-Over-RDMA", Work in Progress, August 2017.

[rpcrdmav2]
           Lever, C., Ed. and D. Noveck, "RPC-over-RDMA Version Two",
           Work in Progress, July 2017.

[rtrext]   Noveck, D., "RPC-over-RDMA Extensions to Reduce Internode
           Round-trips", Work in Progress, June 2017.
Appendix A.  Acknowledgements

The author gratefully acknowledges the work of Brent Callaghan and
Tom Talpey in producing the original RPC-over-RDMA Version One
specification [RFC5666], and also Tom's work in helping to clarify
that specification.

The author also wishes to thank Chuck Lever for his work reviving
RDMA support for NFS in [RFC8166] and [rfc5667bis], for providing a
path for incremental improvement of that support through his work on
[rpcrdmav2], and for helpful discussions regarding RPC-over-RDMA
latency issues.

Author's Address

David Noveck
NetApp
1601 Trapelo Road
Waltham, MA 02451
US

Phone: +1 781-572-8038
Email: davenoveck@gmail.com