Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Experimental                               May 20, 2019
Expires: November 21, 2019

      Improving the Performance and Reliability of RPC Replies on
                        RPC-over-RDMA Transports
               draft-cel-nfsv4-rpcrdma-reliable-reply-05

Abstract

   RPC transports such as RPC-over-RDMA version 1 require reply buffers
   to be in place before an RPC Call is sent. However, RPC consumers
   sometimes have difficulty estimating the expected maximum size of a
   particular RPC reply.
   This introduces the risk that an RPC Reply
   message can overrun reply resources provided by the requester,
   preventing delivery of the message, through no fault of the
   requester. This document describes a mechanism that eliminates the
   need for pre-allocation of reply resources for unpredictably large
   replies.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 21, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Statement
     3.1.  Reply Chunk Overrun
     3.2.  Reply Size Calculation
     3.3.  Requester Registration Costs
     3.4.  Denial of Service
     3.5.  Estimating Transport Header Size
   4.  Responder-Provided Read Chunks
     4.1.  Specification
       4.1.1.  Responder Duties
       4.1.2.  Requester Duties
       4.1.3.  Pull Completion Notification
       4.1.4.  Remote Invalidation
   5.  Analysis
     5.1.  Benefits
       5.1.1.  Less Frequent Use of Explicit RDMA
       5.1.2.  Support for Arbitrarily Large Replies
       5.1.3.  Protection of Connection After RPC Cancellation
       5.1.4.  Asynchronous Chunk Invalidation
     5.2.  Costs
       5.2.1.  Responder Memory Exposure
       5.2.2.  Round Trip Penalty
       5.2.3.  Credit Accounting Complexity
     5.3.  Selecting a Reply Mechanism
       5.3.1.  Requester
       5.3.2.  Responder
     5.4.  Implementation Complexity
       5.4.1.  RPC Call Path
       5.4.2.  RPC Reply Path
       5.4.3.  Managing RDMA_DONE messages
     5.5.  Alternatives
   6.  Interoperation Considerations
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   One way in which RPC-over-RDMA version 1 improves transport
   efficiency is by ensuring resources for RPC replies are available in
   advance of each RPC transaction [RFC8166]. These resources are
   typically provisioned before a requester sends each RPC Call message.
   They are provided to the responder to use for transmitting the
   associated RPC Reply message back to the requester.

   In particular, when the Payload stream of an RPC Reply message is
   expected to be large, the requester allocates and registers a Reply
   chunk. The responder transfers the RPC Reply message's Payload
   stream directly into the requester memory associated with that chunk,
   then indicates that the RPC Reply is ready. The requester
   invalidates the memory region.

   In most cases, Upper Layer Protocols are capable of accurately
   calculating the maximum size of RPC Reply messages. In addition, the
   average size of RPC Reply messages is small, making the risk of Reply
   chunk overrun exceptionally small.

   However, on rare occasions an Upper Layer Protocol might not be able
   to derive a reply size upper bound. An example of this is the NFS
   version 4.2 READ_PLUS operation [RFC7862], where a reply can
   contain an unpredictable number of data content and hole descriptors.

   Further, since the average size of actual RPC Replies is small,
   requesters frequently allocate and register a Reply chunk for a reply
   that, once it has been constructed by the responder, is small enough
   to be sent inline. In this case, a responder is free to either
   populate the Reply chunk or send the RPC Reply without the use of the
   Reply chunk. The requester's cost of preparing the Reply chunk has
   been wasted, and the extra registration and invalidation add
   unwanted latency to the operation.

   A better method of handling RPC replies could ensure that RPC Replies
   can be received even when the maximum possible size of some replies
   cannot be calculated in advance. This method could also ensure that
   no extra memory registration/invalidation operations are necessary to
   make this guarantee.

   This document resurrects the responder-provided Read chunk mechanism
   that was briefly outlined in [RFC5666] to achieve these goals. The
   discussion in this document assumes the reader is familiar with
   [RFC8166].

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Problem Statement

   RPC-over-RDMA version 1 uses an RDMA Send request to transmit
   transport headers and small RPC messages.

   Each peer on an RPC-over-RDMA transport connection provisions Receive
   buffers in which to capture incoming RDMA Send messages. There is a
   limited number of these buffers, necessitating accounting in the
   transport protocol to prevent a peer from emitting more Send
   operations than the receiver is prepared for.

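   The accounting mentioned above can be pictured with a small sketch.
   It is illustrative only: the class name CreditGate and its methods
   are hypothetical, and an actual transport would follow the credit
   rules given in [RFC8166].

```python
class CreditGate:
    """Limits outstanding Sends to the peer's credit grant (illustrative)."""

    def __init__(self, granted_credits: int) -> None:
        if granted_credits < 1:
            raise ValueError("at least one credit is required")
        self.granted = granted_credits   # Receive buffers the peer has posted
        self.in_flight = 0               # Sends that have not been answered yet

    def try_send(self) -> bool:
        """Consume one credit if available; otherwise refuse to send."""
        if self.in_flight >= self.granted:
            return False                 # the receiver has no buffer for this Send
        self.in_flight += 1
        return True

    def on_reply(self, new_grant: int) -> None:
        """A reply returns one credit and may carry an updated grant."""
        if self.in_flight > 0:
            self.in_flight -= 1
        self.granted = max(1, new_grant)
```

   A sender in this sketch calls try_send() before posting each Send
   and queues the message when the call returns False.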
   Because the selection of the Receive Work Request to handle an
   incoming Send is outside the control of the host O/S, the smallest
   buffer in this pool determines the largest size message that can be
   received. The size of the largest message that can be received via
   RDMA Send is known as the receiver's "inline threshold" [RFC8166].

   When marshaling an RPC transaction, a requester allocates and
   registers a Reply chunk whenever the maximum possible size of the
   corresponding RPC-over-RDMA reply is larger than the requester's
   receive inline threshold. The Reply chunk is presented to the
   responder as part of the RPC Call. The responder may place the
   associated RPC Reply message in the memory region linked with this
   Reply chunk.

3.1.  Reply Chunk Overrun

   If a responder overruns a Reply chunk during an RDMA Write, a memory
   protection error occurs. This typically results in connection loss.
   Any RPC transactions running on that connection must be
   retransmitted. The failing RPC transaction will never get a reply,
   and retransmitting it may result in additional connection loss
   events.

   A smart responder compares the size of an RPC Reply with the size of
   the target Reply chunk before initiating the placement of data in
   that chunk. If the reply does not fit, the responder reports the
   problem with a generic RDMA_ERROR message, and the requester can
   terminate the RPC transaction.

   In either case, the RPC is executed by the responder, but the
   requester does not receive the results or acknowledgement of its
   completion.

3.2.  Reply Size Calculation

   To determine when a Reply chunk is needed, requesters calculate the
   maximum possible size of the RPC Reply message expected for each
   transaction. Upper Layer Bindings, such as [RFC8267], provide
   guidance on how to calculate Reply sizes and in what cases the Upper
   Layer Protocol might have difficulty giving an exact upper bound.

   Unfortunately, there are rare cases where an upper bound cannot be
   computed. For instance, there is no way to know how large an NFS
   Access Control List (ACL) is until it is retrieved from an NFS server
   [RFC5661]. There is no protocol-specified limit on the size of NFS
   ACLs. When retrieving an NFS ACL, there is always a risk, albeit a
   small one, that the NFS client has not provided a large enough Reply
   chunk, and that therefore the NFS server will not be able to return
   that ACL to the client (unless somehow a larger Reply chunk can be
   provided).

3.3.  Requester Registration Costs

   For an Upper Layer Protocol such as NFS version 4.2 [RFC7862], NFS
   COMPOUND Call and Reply messages can be large on occasion. For
   instance, an NFSv4.2 COMPOUND can contain a LOOKUP operation together
   with a GETATTR operation. The size of a LOOKUP result is relatively
   small. However, the GETATTR in that COMPOUND may request attributes,
   such as ACLs or security labels, that can grow arbitrarily large and
   whose size is not known in advance.

   Thus a requester can be responsible for provisioning quite a large
   reply buffer for each LOOKUP COMPOUND, which is a frequent request.
   If the maximum possible reply message can be large, the requester is
   required to provide a Reply chunk. Most of the time, however, the
   actual size of a LOOKUP COMPOUND reply is small enough to be sent
   using one RDMA Send.

   In other words, an NFS version 4 client provides a Reply chunk quite
   frequently during RPC transactions, but NFS version 4 servers almost
   never need to use it because the actual size of replies is typically
   less than the inline threshold. The overhead of registering and
   invalidating this chunk is significant. Moreover, it is unnecessary
   whenever the size of an actual RPC reply is small.

   Before an RPC transaction is terminated, a requester is responsible
   for fencing the Reply chunk from the responder [RFC8166]. That makes
   RPC completion synchronous with Reply chunk invalidation. Therefore
   the latency of Reply chunk invalidation adds to the total execution
   time of the RPC transaction.

3.4.  Denial of Service

   When an RPC transaction is canceled or aborted (for instance, because
   an application process exited prematurely), a requester must
   invalidate or set aside Write and Reply chunks associated with that
   transaction [RFC8166].

   This is because that RPC transaction is still running on the
   responder. The responder remains obligated to return the result of
   that transaction via RDMA Write, if there are Write or Reply chunks.
   If memory registered on behalf of that transaction is re-used, the
   requester must protect that memory from server RDMA Writes associated
   with previous transactions by fencing it from the responder. The
   responder triggers a memory protection error when it writes into
   those memory regions, and the connection is lost.

   A malfunctioning application or a malicious user on the requester can
   create a situation where RPCs are continuously initiated and then
   aborted, resulting in responder replies that repeatedly terminate the
   underlying RPC-over-RDMA connection.

   A rogue responder can purposely overrun a Reply chunk to kill a
   connection. Repeated connection loss can result in a Denial of
   Service.

3.5.  Estimating Transport Header Size

   To determine whether a Reply chunk is needed, a requester computes
   the size of the Reply's Transport Header and the maximum possible
   size of the RPC Reply message, and sums the two. If the sum is
   smaller than the requester's receive inline threshold, a Reply chunk
   is not required.

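   The computation described above can be sketched as follows. This is
   only an illustration: the function names are hypothetical, and the
   byte counts reflect one reading of the XDR definition of the version
   1 Transport Header in [RFC8166] (four fixed 4-byte fields, 16-byte
   segments, and 4-byte list discriminators). They are assumptions and
   should be checked against that specification.

```python
XDR_WORD = 4                     # XDR encodes items in 4-byte units
FIXED_FIELDS = 4 * XDR_WORD      # assumed: rdma_xid, rdma_vers, rdma_credit, rdma_proc
SEGMENT = 4 * XDR_WORD           # assumed: handle (4) + length (4) + offset (8)


def reply_header_size(write_chunk_segments, reply_chunk_segments=None):
    """Estimate the size in bytes of a Reply's Transport Header.

    write_chunk_segments: segment count of each Write chunk provided.
    reply_chunk_segments: segment count of the Reply chunk, or None.
    """
    size = FIXED_FIELDS
    size += XDR_WORD                             # empty Read list
    for nsegs in write_chunk_segments:
        # list-item discriminator + segment count + the segments themselves
        size += 2 * XDR_WORD + nsegs * SEGMENT
    size += XDR_WORD                             # end of the Write list
    size += XDR_WORD                             # Reply chunk present or absent
    if reply_chunk_segments is not None:
        size += XDR_WORD + reply_chunk_segments * SEGMENT
    return size


def reply_chunk_needed(max_reply_payload, inline_threshold,
                       write_chunk_segments=()):
    """Apply the rule in the text: only a sum strictly smaller than the
    inline threshold lets the requester omit the Reply chunk."""
    header = reply_header_size(write_chunk_segments)
    return header + max_reply_payload >= inline_threshold
```

   With an inline threshold of 1024 bytes and no Write chunks, this
   sketch reports that a 900-byte maximum reply needs no Reply chunk,
   while a 1000-byte maximum reply does.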
   The size of a Transport Header depends on how many Write chunks the
   requester provides, whether a Reply chunk is needed, and how many
   segments are contained in provided Write and Reply chunks.

   Therefore, when the total size of the Reply message is already near
   the inline threshold, the computation is circular: a requester has
   to know whether a Reply chunk is present (and how many segments it
   contains) before it can determine whether a Reply chunk is needed.

   A requester can resort to limiting Transport Header size to a fixed
   value that ensures this computation does not become a recursion.

   However, as in earlier sections, this can mean that some RPC
   transactions where a Reply chunk is not strictly necessary must incur
   the cost of preparing a Reply chunk.

4.  Responder-Provided Read Chunks

   A potential mechanism for resolving these issues is suggested in
   Section 3.4 of [RFC5666]:

      In the absence of a server-provided read chunk list in the reply,
      if the encoded reply overflows the posted receive buffer, the RPC
      will fail with an RDMA transport error.

   When sending a large RPC Call message, requesters already employ Read
   chunks. There is no advance indication or limit on the size of any
   RPC Call message. To achieve the same flexibility for RPC Replies,
   Read chunks can be used in the reverse direction (i.e., responder
   exposes memory, requester initiates RDMA Read).

   Rather than a requester providing a Reply chunk for conveying an
   as-yet-unconstructed large reply, a responder can expose a Read chunk
   containing the actual Payload stream of the RPC Reply message. A
   responder would employ a Read chunk to return a reply any time
   requester-provided reply resources are not adequate.

   The requester does not have to calculate a reply size maximum or
   register and invalidate a Reply chunk in these cases. Without a
   requester-provided Reply chunk, the responder sends each reply
   inline, except when the actual size of an RPC Reply message is larger
   than the receiver's inline threshold.

   This results in no wasted activity on the requester, and arbitrarily
   large RPC Replies can be received reliably.

   Current RPC-over-RDMA version 1 implementations do not support
   responder-provided Read chunks, although RPC-over-RDMA version 1 did
   have this support in the past [RFC5666]. Adapting this deprecated
   mechanism for new RPC-over-RDMA transports is straightforward.

4.1.  Specification

   A responder MAY choose to send an RPC Reply using a Position Zero
   Read chunk composed of one or more RDMA segments. Position Zero
   Read chunks are defined in Section 3.5.3 of [RFC8166].

   Similar to its use in an RPC Call, a Position Zero Read chunk in an
   RPC Reply contains an RPC Reply's Payload stream. Position Zero Read
   chunks are always sent using an RPC-over-RDMA RDMA_NOMSG message.

   In other words, a responder-provided Read chunk can replace the use
   of a Reply chunk in Long Replies. And, as with Reply chunks, a
   responder must still make use of Write chunks provided by the
   requester.

4.1.1.  Responder Duties

   A responder MUST send a Position Zero Read chunk when the actual size
   of the RPC Reply's Payload stream exceeds all requester-provided
   reply resources; that is, when the inline threshold and any provided
   Reply chunk are both too small to accommodate the Payload stream of
   the reply.

   If a responder does not support responder-provided Read chunks in
   this case, it MUST return an appropriate permanent transport error to
   terminate the requester's RPC transaction.

4.1.2.  Requester Duties

   Upon receipt of an RDMA_NOMSG message containing a Position Zero Read
   chunk, the requester pulls the RPC Reply's Payload stream from the
   responder.

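   As a rough sketch of this requester-side handling, the following
   Python fragment pulls a Position Zero Read chunk and always signals
   completion afterwards. The names ReadSegment, pull_reply, rdma_read,
   and notify_pull_complete are hypothetical; the two callables stand in
   for an implementation's RDMA Read and pull completion notification
   (Section 4.1.3) machinery.

```python
from dataclasses import dataclass
from typing import Callable, List

RDMA_NOMSG = 1   # rdma_proc value for RDMA_NOMSG in RPC-over-RDMA version 1


@dataclass
class ReadSegment:
    position: int   # XDR position; zero marks the Payload stream itself
    handle: int     # R_key naming the responder's memory region
    length: int     # number of bytes exposed by this segment
    offset: int     # offset within the responder's memory region


def pull_reply(proc: int, read_list: List[ReadSegment],
               rdma_read: Callable[[ReadSegment], bytes],
               notify_pull_complete: Callable[[], None]) -> bytes:
    """Return the Payload stream conveyed by a Position Zero Read chunk."""
    chunk = [seg for seg in read_list if seg.position == 0]
    if proc != RDMA_NOMSG or not chunk:
        raise ValueError("not a responder-provided Read chunk reply")
    try:
        # Pull each segment in order and reassemble the Payload stream.
        return b"".join(rdma_read(seg) for seg in chunk)
    finally:
        # Runs whether the Reads succeeded or failed, so that the
        # responder can invalidate the memory it exposed.
        notify_pull_complete()
```

   The try/finally form is one way to make sure the notification is
   issued on both the success and the error path.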
   After RDMA Read operations have completed (successfully or in error),
   the requester MUST inform the responder that it may invalidate the
   Read chunk containing the RPC Reply message. This is referred to as
   "pull completion notification".

4.1.3.  Pull Completion Notification

   Pull completion notification is accomplished in one of two ways:

   o  The requester can send an RDMA_DONE message with the rdma_xid
      field set to the same value as the rdma_xid field in the
      RDMA_NOMSG message that conveyed the Read chunk. Or,

   o  The requester can piggyback the pull completion notification in
      the transport header of a subsequent RPC Call, if the transport
      protocol has such a facility.

   When an RPC transaction is aborted on a requester, the requester
   normally forgets its XID. If a requester receives a reply bearing a
   Position Zero Read chunk and does not recognize the XID, the
   requester MUST notify the responder of pull completion.

   Whenever a responder receives a pull completion notification for an
   XID for which there is no Read chunk waiting to be invalidated, the
   responder MUST silently drop the notification.

   If a requester receives an RPC Reply via a responder-provided Read
   chunk, but does not support such chunks, it MUST inform the responder
   of pull completion and terminate the RPC transaction.

   A malicious or broken requester might neglect to send pull completion
   notifications for one or more RPC transactions that included
   responder-provided Read chunks. To prevent exhaustion of responder
   resources, a responder can choose to invalidate its Read chunks after
   waiting for a short period. If the requester attempts additional
   RDMA Read operations against that Read chunk, a remote access error
   occurs and the connection is lost.

4.1.4.  Remote Invalidation

   Remote Invalidation can reduce or eliminate the need for the
   responder to explicitly invalidate memory containing an RPC Reply
   message.

   Remote Invalidation might be done by transmitting an RDMA_DONE
   message using RDMA Send With Invalidate. If instead pull completion
   notification is piggybacked on a subsequent RPC Call, a facility for
   Remote Invalidation would have to be built into RPC Call processing.

   If Remote Invalidation support is not indicated by one or both peers,
   messages carrying pull completion notification MUST be transmitted
   using RDMA Send. If Remote Invalidation support is indicated by both
   peers, messages carrying pull completion notification SHOULD be
   transmitted using RDMA Send With Invalidate.

   The rule for choosing the value of the Send With Invalidate Work
   Request's inv_handle field depends on the version of the transport
   protocol that is in use. If the responder has provided an R_key that
   may be invalidated, the requester MUST present only that R_key when
   using RDMA Send With Invalidate.

5.  Analysis

5.1.  Benefits

5.1.1.  Less Frequent Use of Explicit RDMA

   The vast majority of RPC Replies can be conveyed via RDMA_MSG. No
   extra Reply chunk registration and invalidation cost is incurred when
   a large RPC Reply message is possible but the actual reply size is
   small. This reduces or even eliminates the use of explicit RDMA for
   frequent small-to-moderate-size replies, improving the average
   latency of individual RPCs and allowing RNIC and platform resources
   to scale better.

5.1.2.  Support for Arbitrarily Large Replies

   The responder-provided Read chunk approach accommodates arbitrarily
   large replies. Requesters no longer need to calculate the maximum
   size of RPC Reply messages, even if a Reply chunk is provided.

5.1.3.  Protection of Connection After RPC Cancellation

   When an RPC is canceled on the requester (say, because the requesting
   application has been terminated), and no Reply chunk is provided, the
   requester is no longer responsible for invalidating that RPC's Reply
   chunk. When the responder sends the reply, it provides a Position
   Zero Read chunk and does not use RDMA Write to transmit the RPC Reply
   message. The transport connection is preserved because no memory
   protection violation can occur.

5.1.4.  Asynchronous Chunk Invalidation

   Registration of a responder-provided Read chunk must be completed
   before sending the RDMA_NOMSG message conveying the chunk
   information. However, pull completion notification and subsequent
   responder-side memory invalidation can be performed after the RPC
   transaction has completed on the requester. Because those are
   asynchronous to RPC completion, the additional latency is not
   attributed to the execution time of the RPC transaction.

5.2.  Costs

5.2.1.  Responder Memory Exposure

   Responder memory is registered and exposed to requesters when
   replying. When a responder has properly allocated a Protection
   Domain for each connection and uses appropriate R_key rotation
   techniques (see Section 7), the exposure is minimal. However,
   because current RPC-over-RDMA responder implementations do not expose
   memory to requesters, they typically share one Protection Domain
   among all connections.

5.2.2.  Round Trip Penalty

   Using a Read chunk for large replies introduces a round-trip penalty.
   A requester can provide a Reply chunk to avoid this penalty.
   However:

   o  The Read chunk round-trip penalty would be paid much less often
      than the Reply chunk registration cost is paid today, since
      responder-provided Read chunks are used only when necessary.

   o  Read chunk frequency is reduced even further as the inline
      threshold is increased past the average size of the Upper Layer
      Protocol's RPC Replies.

   o  Invalidation of a Reply chunk is synchronous with RPC completion,
      and may take as long as a round trip to the responder.

   o  Read chunks are typically used for large payloads, where it is
      likely that data transmission time greatly exceeds the round-trip
      time.

   There are a few particular situations where the frequency of large
   replies is high. For example, the use of the krb5i or krb5p GSS
   services with RPC-over-RDMA requires that Payload reduction not be
   used. Thus, RPC-over-RDMA peers use only pure RDMA Sends or Long
   messages when these services are in use. The actual size of a
   READDIR reply is often unpredictable but is frequently large. In
   these two cases, using a Reply chunk could be the more efficient
   default choice.

5.2.3.  Credit Accounting Complexity

   Credit accounting is made more complex by the use of RDMA_DONE
   messages after RDMA Read operations have completed. Sending an
   RDMA_DONE message consumes one credit, temporarily reducing RPC
   concurrency on the connection. There is no response to RDMA_DONE, so
   it is not clear to the sender when that credit becomes available
   again. One way to resolve this is to add a new message type to the
   protocol, RDMA_ACK, which could be used any time there is a
   unidirectional transport message to maintain the proper balance of
   credit grants and responses.

   Alternately, if the transport protocol supports piggybacking pull
   completion notification on RPC Call messages, the requester can
   piggyback in most cases to simplify credit accounting. An explicit
   RDMA_DONE would be necessary only during light workloads, or the ULP
   could post an RPC NULL containing a piggybacked pull completion
   notification in these cases.

5.3.  Selecting a Reply Mechanism

   This section illustrates some possible implementation choices.

5.3.1.  Requester

   As an RPC Call is constructed, a requester might choose a reply
   mechanism based on its estimation of the range of possible sizes of
   the reply.

   Responder-provided Read chunk
      The requester knows the minimum size of the reply is smaller than
      the inline threshold, but the maximum size of the reply is larger
      than the inline threshold; or the requester cannot calculate the
      maximum size of the reply. The requester does not provide a Reply
      chunk, and relies on a responder-provided Read chunk to handle
      large replies.

   Reply chunk
      The requester knows the minimum and maximum sizes of the reply are
      both larger than the inline threshold. The requester provides a
      Reply chunk.

   Send-only
      The requester knows the maximum size of the reply is smaller than
      the inline threshold. The requester does not provide a Reply
      chunk.

   A requester whose design requires Reply chunk invalidation after an
   RPC transaction is canceled might choose to never use Reply chunks,
   in favor of minimizing opportunities for connection loss.

5.3.2.  Responder

   After a responder has constructed an RPC Reply, it might choose which
   reply mechanism to employ based on the actual size of the Payload
   stream of the RPC Reply message.

   Responder-provided Read chunk
      The Payload stream is larger than the inline threshold and either
      no Reply chunk was provided or the provided Reply chunk is too
      small. The responder uses a responder-provided Read chunk.

   Reply chunk
      If a usable Reply chunk is available, the responder uses the Reply
      chunk.

   Send-only
      If no Reply chunk is available and the Payload stream fits within
      the inline threshold, the responder uses only Send or Send With
      Invalidate to transmit the reply.

5.4.  Implementation Complexity

5.4.1.  RPC Call Path

   Implementation of responder-provided Read chunks introduces little or
   no additional complexity to the end-to-end RPC Call path. Unless a
   requester implementer chooses to implement support for both Reply
   chunks and responder-provided Read chunks, there could be a net
   reduction of code and run-time complexity in the RPC Call hot path.

   The responder's RPC Call path needs to recognize RDMA_DONE messages
   and initiate invalidation of Read chunks. Because invalidation can
   be asynchronous, it is possible to perform Read chunk invalidation in
   a separate worker thread.

5.4.2.  RPC Reply Path

   On the RPC Reply path, logic to initiate registration of Read
   chunks and wait for completion is added to the responder. This logic
   is not part of the hot path because it is used only infrequently.

   The requester's reply handling hot path must recognize when Read
   chunks are present in an RDMA_NOMSG message, and shunt execution to
   code that can initiate an RDMA Read and wait for completion. Once
   complete, the requester posts an RDMA_DONE message.

5.4.3.  Managing RDMA_DONE messages

   In order for a responder to match incoming RDMA_DONE messages to
   reply buffers waiting to be invalidated, it might keep references to
   these buffers in a data structure searchable by XID. This is similar
   to managing a set of pending backchannel replies.

   When an RDMA_DONE message arrives, the responder matches the XID in
   the message to a waiting reply buffer, invalidates that buffer, and
   removes the XID from the data structure.

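   One possible shape for such a data structure is sketched below. The
   class PendingReadChunks and the invalidate callable are hypothetical
   names chosen for illustration; a real responder would likely use a
   hash table or tree with appropriate locking.

```python
from typing import Callable, Dict


class PendingReadChunks:
    """Reply buffers exposed as Read chunks, keyed by XID (illustrative)."""

    def __init__(self, invalidate: Callable[[object], None]) -> None:
        self._invalidate = invalidate            # hypothetical: unregisters memory
        self._waiting: Dict[int, object] = {}    # XID -> exposed reply buffer

    def expose(self, xid: int, buffer: object) -> None:
        """Remember a buffer once it has been sent in an RDMA_NOMSG reply."""
        self._waiting[xid] = buffer

    def on_rdma_done(self, xid: int) -> bool:
        """Handle an RDMA_DONE message; unknown XIDs are silently dropped."""
        if xid not in self._waiting:
            return False
        buffer = self._waiting.pop(xid)          # remove the XID from the table
        self._invalidate(buffer)                 # then invalidate its buffer
        return True
```

   Because on_rdma_done() ignores XIDs it does not know, a retransmitted
   RDMA_DONE is harmless in this sketch.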
   This data structure can also be used for housekeeping tasks such as:

   o  Invalidating waiting buffers after a timeout, in case the
      requester never sends RDMA_DONE

   o  Ignoring retransmitted or garbage RDMA_DONE requests

   o  Explicitly invalidating waiting Read chunks after a connection
      loss, if necessary

   o  Invalidating waiting buffers on device removal

5.5.  Alternatives

   Increasing the inline threshold reduces the likelihood of needing a
   Reply chunk, but does not eliminate the risks associated with
   unpredictably large replies.

   Message Continuation is more efficient than an explicit RDMA
   operation, and does not require the exposure of requester or
   responder memory.

   However, Message Continuation still limits the maximum size of a
   conveyed message. As with a larger inline threshold, without
   responder-provided Read chunks, reply size estimation is still
   required to determine when a Reply chunk is required, and therefore
   there is still risk associated with unpredictably large replies.

   Message Continuation introduces complexity in the management of
   RPC-over-RDMA credit grants because the relationship between RPC
   transactions and credits is no longer one-to-one. Credit management
   logic is an integral part of the RPC Call and Reply hot path on the
   requester.

6.  Interoperation Considerations

   When a requester supports responder-provided Read chunks, it is
   likely to omit Reply chunks in some cases. A responder
   that does not support responder-provided Read chunks can convey a
   transport-level error when it has generated an RPC Reply that is
   larger than the available reply resources.

   The situation is more problematic if a responder supports
   responder-provided Read chunks and sends them to a requester that is
   not able to recognize and unmarshal them. The RPC transaction would
   never complete, and the requester would never send a pull completion
   notification.

   Thus responder-provided Read chunks MUST be used only when both peers
   support them: Either the base protocol version always has support
   enabled, or the base protocol provides an extension mechanism that
   indicates when support is available.

7.  Security Considerations

   The less frequent use of RDMA Write reduces opportunities for memory
   overrun on the requester, and reduces the risk of connection loss
   after an application is terminated prematurely. This reduces
   exposure to accidental or malicious Denial of Service attacks.

   Responder-provided Read chunks are exposed for read-only access.
   Remote actors cannot alter the contents of exposed read-only memory,
   though a man-in-the-middle can read or alter RDMA payloads while they
   are in transit. The use of RPCSEC_GSS or a transport-layer
   confidentiality service can prevent payload access by
   unintended recipients.

   Recommendations about adequate R_key rotation and the appropriate use
   of Protection Domains can be found in Section 8.1 of [RFC8166].
   These recommendations apply when responders expose memory to convey
   the Payload stream of an RPC Reply message.

   Otherwise, this mechanism does not alter the attack surface of a
   transport protocol that employs it.

8.  IANA Considerations

   This document has no IANA actions.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2.  Informative References

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
              to RPC-over-RDMA Version 1", RFC 8267,
              DOI 10.17487/RFC8267, October 2017.

Acknowledgments

   Many thanks go to Karen Dietke, Chunli Zhang, Dai Ngo, and Tom
   Talpey. The author also wishes to thank Bill Baker and Greg Marsden
   for their support of this work.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes for their support.

Author's Address

   Charles Lever
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com