Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                         November 19, 2018
Expires: May 23, 2019

        Using Remote Invalidation With RPC-Over-RDMA Transports
                    draft-cel-nfsv4-reminv-design-09

Abstract

   Remote Invalidation relieves RDMA responders of some of the burden of
   preparing memory to be accessed remotely, thus reducing the latency
   of RDMA Read and Write operations. This document considers how to
   introduce generic support for Remote Invalidation to RPC-over-RDMA
   transport protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 23, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  General Requirements
     3.1.  Memory Management Extensions
     3.2.  Registration Types
     3.3.  Selecting STags to Invalidate Remotely
     3.4.  Future Enhancements
   4.  Remote Invalidation in Operation
     4.1.  Determining Remote Invalidation Support Status
     4.2.  Selection of Which STag to Invalidate Remotely
     4.3.  Reverse-Direction Operation
   5.  Protocol Elements
     5.1.  Per Protocol Version Remote Invalidation
     5.2.  Per Connection Remote Invalidation
     5.3.  Fixed Protocol Remote Invalidation
     5.4.  Per RPC Remote Invalidation (Single STag)
     5.5.  Per RPC Remote Invalidation (Multiple STags)
     5.6.  Inter-RPC Remote Invalidation
   6.  Recommendations
     6.1.  General Considerations
     6.2.  Choosing a Protocol Extension
     6.3.  Example Remote Invalidation Protocol
     6.4.  Corner Cases
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   Similar to RDMA-enabled storage protocols such as iSER [RFC7145], an
   RPC-over-RDMA version 1 requester exposes regions of its memory to an
   RPC-over-RDMA responder. The responder then uses RDMA Read and Write
   operations to transfer bulk data payloads [RFC8166].

   In preparation for a bulk data transfer, a requester asks its RNIC to
   assign a steering tag, or STag, to a region of memory containing the
   data to be moved.
   At this time, access rights are granted that allow
   the RNIC to access or update that memory on behalf of a remote peer.
   This act is referred to as "memory registration." The RNIC uses this
   STag to steer data to and from the registered memory region.

   When data transfer is complete, each STag is dissociated from its
   memory region. This act is referred to as "memory invalidation." It
   prevents further responder access to that memory region by revoking
   its remote access rights. Invalidation should be done before RPC
   applications on the requester are allowed access to memory that was
   involved in an explicit RDMA operation.

   Before an RPC transaction is terminated, the requester is responsible
   for fencing memory from the responder. This is a hard requirement by
   the transport protocol [RFC8166]. Fencing serializes the completion
   of RPC transactions with the invalidation of RPC-over-RDMA chunks.
   Therefore the latency of invalidation adds to the total execution
   time of each RPC transaction.

   Remote Invalidation is a mechanism by which an RDMA peer can request
   that the remote peer RNIC invalidate an STag associated with memory
   on that remote peer [RFC5042]. An RDMA consumer requests Remote
   Invalidation by posting an RDMA Send With Invalidate Work Request in
   place of an RDMA Send Work Request. RDMA Send With Invalidate is
   similar to RDMA Send, but takes one additional argument: a single
   STag to be invalidated by the RNIC that receives the sent message.
   The resulting RDMA Send operation is transmitted with additional
   header information that conveys the STag that is to be invalidated
   [RFC5040].

   The benefit of Remote Invalidation is that the requester is not
   required to post an additional Work Request, context switch, and
   handle an interrupt to perform memory invalidation as part of
   completing an RPC transaction. Memory invalidation is essentially
   offloaded to the RNIC.
   The upshot is faster completion of RPC
   transactions that involve registered memory.

   This mechanism has the most impact when explicit RDMA operations are
   needed to move moderate amounts of data. Invalidation latency is
   quite small compared to the time it takes to convey a large payload
   with an explicit RDMA operation. Small RPCs are already conveyed
   entirely via RDMA Send, thus Remote Invalidation is unnecessary for
   them. When the time it takes to invalidate a memory region is on the
   same order as the time it takes to move the contents of that region,
   Remote Invalidation has its greatest impact.

   Remote Invalidation confers benefits similar to the benefits of
   increasing the size of Send and Receive buffers. However, Remote
   Invalidation does not incur the cost of maintaining a pool of large
   Receive buffers on either the requester or responder. Moderate-sized
   RPC payloads can be transferred without much of the cost of memory
   registration. Requesters can rely on RDMA Write to structure their
   Receive buffers without introducing additional latency.

   There are some downsides, however. Remote Invalidation is not
   available on all RNIC devices. In addition, Remote Invalidation does
   not address the extra round trip latency incurred when using RDMA
   Read. This extra latency can be eliminated using a large inline
   threshold for transmitting RPC Calls.

   The purpose of this document is to explore at a high level how Remote
   Invalidation can be introduced into the RPC-over-RDMA transport
   protocol. The primary design considerations for the transport
   protocol are to provide a mechanism to indicate when Remote
   Invalidation is safe to use, and to provide selection criteria for
   choosing which STag (when there is more than one) to invalidate
   remotely.
   Elements of the XDR definition of the RPC-over-RDMA
   protocol will need to be altered to some degree, depending on desired
   flexibility of operation, invasiveness of XDR changes, and breadth of
   hardware support.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  General Requirements

3.1.  Memory Management Extensions

   Remote Invalidation was not available in the original RDMA Verbs API.
   New verbs API objects were specified that include operations that
   enable Remote Invalidation, now described in [IBARCH]. The Verbs API
   provides a capabilities flag, MEM_MGT_EXTENSIONS, that indicates that
   an RNIC and the local verbs implementation can support these new APIs
   and objects.

   An STag that is registered using the FRWR mechanism (in a privileged
   execution context) or is registered via a Memory Window (in a
   non-privileged context) may be invalidated remotely [RFC5040]. These
   mechanisms are available when an RNIC supports MEM_MGT_EXTENSIONS.

   RDMA Send With Invalidate is available only with MEM_MGT_EXTENSIONS.

3.2.  Registration Types

   For the purposes of this discussion, there are two classes of STags.
   Dynamically-registered STags are used in a single RPC, then
   invalidated. Persistently-registered STags are used in multiple RPC
   transactions. They may persist for the life of an RPC-over-RDMA
   connection, or longer.

   In RPC-over-RDMA version 1, a requester may provide more than one
   STag in the chunk lists of an RPC.
   Therefore a requester may provide
   any combination of the following registration types in one RPC, any
   combination of these in a series of RPCs on the same connection, or
   it may use some other registration model.

   Examples of persistently-registered STags include:

   o  The device's reserved DMA R_key

   o  An STag registered for a connection that doesn't change from RPC
      to RPC (for a utility buffer, say)

   o  An STag registered for a fixed memory region that is updated after
      each time it is advertised

   o  An STag covering a large single region that is utilized in small
      segments by many RPCs

   Examples of dynamically-registered STags include:

   o  An STag registered for a single RPC transaction using a legacy
      registration mechanism, then invalidated when the RPC is retired

   o  An STag registered for a single RPC transaction using either
      Memory Windows or FRWR, then invalidated when the RPC is retired

   Among these examples, only dynamically-registered STags using Memory
   Windows or FRWR may be invalidated remotely.

3.3.  Selecting STags to Invalidate Remotely

   Remote Invalidation protocol mechanisms come in different styles:

   Fixed Protocol
      The rules by which a responder selects which STag to invalidate
      remotely are fixed in the protocol specification.

   Responder's Choice
      The responder chooses an STag to invalidate remotely from among
      all the STags in incoming requests.

   Requester's Choice
      The requester chooses one or more STags that may be invalidated
      remotely, indicating its choices in each request. The responder
      chooses an STag to invalidate remotely from among the requester's
      picks.

   There is no RDMA layer mechanism by which a responder can determine
   how a requester-provided STag was registered.
   Thus a requester that
   mixes persistently- and dynamically-registered STags in one RPC, or
   mixes them across RPCs on the same connection, cannot tolerate
   Responder's Choice.

3.4.  Future Enhancements

   There are two related enhancements that further reduce the effort
   needed to invalidate STags associated with complex RPCs:

   o  The ability for one registered STag to represent a list of memory
      regions that are not contiguous

   o  The ability to specify more than one remote STag in a single Send
      Work Request to be remotely invalidated

   At this time, the first mechanism has been implemented in at least
   one RNIC on the market. The second is speculative (i.e., has not yet
   been implemented anywhere).

   Given support for registering non-contiguous memory regions with one
   STag, when an RPC-over-RDMA requester constructs an RPC that has both
   a Read list and a Write list, the requester has a choice:

   o  The requester can register a separate STag for each access mode
      (one STag for memory regions needing read access, and one STag for
      those needing write access) to provide good data security

   o  The requester can register a single STag with read and write
      access enabled for the whole set of memory regions, to allow RDMA
      Send With Invalidate to work optimally

   Having the ability to remotely invalidate multiple STags at once
   enables the combination of optimal performance and optimal security.

4.  Remote Invalidation in Operation

   When requester memory is registered for remote access, an
   RPC-over-RDMA implementation can use Remote Invalidation by following
   these steps:

   1.  The requester DMA-maps a memory region that will participate in
       an RPC transaction, then registers an STag for that region.

   2.  The requester transmits the RPC Call to the responder. This
       request conveys the STag to the responder.

   3.  The responder processes the RPC transaction.
       The peer RNICs use
       the STag to move RPC arguments and/or results.

   4.  The responder transmits the RPC Reply using an RDMA Send With
       Invalidate Work Request, setting the Work Request's inv_handle
       field to the value of the STag.

   5.  A Receive Work Request completes on the requester, carrying this
       RPC reply. The completion reports the invalidated STag.

   6.  The requester skips invalidation of the STag, then DMA-unmaps the
       memory region associated with the STag.

   The requester no longer needs to invalidate the STag involved with
   this RPC. However, there are additional details that must be
   resolved before the use of Remote Invalidation can commence.

4.1.  Determining Remote Invalidation Support Status

   A requester that does not support Remote Invalidation might not
   tolerate the use of RDMA Send With Invalidate by a responder. Such a
   requester performs Local Invalidation on STags that already happen to
   be invalid. In some cases this results in protection errors or other
   issues.

   Thus, to avoid spurious connection termination, a responder must not
   post an RDMA Send With Invalidate Work Request unless it is sure the
   following three conditions are met:

   o  The requester's RNIC is prepared to receive the additional header
      information associated with Remote Invalidation

   o  The requester has used an appropriate registration mechanism to
      register STags it wants invalidated remotely

   o  The requester is prepared to recognize remotely invalidated STags
      during Receive processing to avoid invalidating them a second time

   When all three of these conditions are met, a requester can report
   positive Remote Invalidation support status to responders using an
   Upper Layer Protocol mechanism. When a responder does not know the
   requester's Remote Invalidation support status, it cannot use Remote
   Invalidation without endangering the connection.

4.2.  Selection of Which STag to Invalidate Remotely

   The RDMA Send With Invalidate Work Request invalidates only one STag.
   RPC-over-RDMA requesters may register more than one STag to handle
   the movement of payloads for a single RPC. Either the client will
   have to specify which STag may be remotely invalidated, the protocol
   will have to specify a fixed way to select which STag to invalidate,
   or the responder will have to choose arbitrarily which STag to
   remotely invalidate.

   In some circumstances, requesters may wish to utilize STags during
   transactions that are registered using a mechanism that does not
   tolerate Remote Invalidation. For example, an STag that is the
   requester's local DMA R_key should never be invalidated remotely. If
   a responder attempts to invalidate such an STag, the result is
   undefined, but the connection may be terminated or other failures can
   occur.

   Even with Remote Invalidation enabled, requesters remain responsible
   for ensuring all STags are invalid before RPC transactions complete.
   To avoid leaving STags registered, a requester must be prepared for
   the responder or the requester's own RNIC to have not invalidated any
   of an RPC's STags. When there are multiple STags associated with a
   single RPC, a requester must be prepared for any one of the STags to
   have been remotely invalidated, or for all of the RPC's STags to
   remain registered.

4.3.  Reverse-Direction Operation

   As of this writing, no current RPC-over-RDMA implementation supports
   direct data placement in the reverse direction. However, existing
   protocol specifications do not forbid it [RFC8166] [RFC8167]
   [I-D.cel-nfsv4-rpcrdma-version-two].

   When chunks are present in a reverse-direction RPC request, Remote
   Invalidation allows the responder to trigger invalidation of a
   requester's STags as part of sending a reply, the same as in the
   forward direction.
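
   In either direction, the peer acting as requester still has to fence
   every STag that Remote Invalidation did not cover (Section 4.2). The
   C fragment below is only a sketch of that bookkeeping, under stated
   assumptions: the names hyp_rpc, local_invalidate, and hyp_finish_rpc
   are hypothetical and belong to no verbs API or protocol definition,
   and a real implementation would post Local Invalidate Work Requests
   instead of clearing a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HYP_MAX_STAGS 8

/* Hypothetical per-RPC bookkeeping; not taken from any real API. */
struct hyp_rpc {
    uint32_t stag[HYP_MAX_STAGS];       /* STags advertised in the RPC Call */
    bool     registered[HYP_MAX_STAGS]; /* true while the STag is still valid */
    size_t   nstags;
};

/* Stand-in for posting a Local Invalidate Work Request and awaiting it. */
static void local_invalidate(struct hyp_rpc *rpc, size_t i)
{
    rpc->registered[i] = false;
}

/*
 * Fence all of an RPC's memory before the RPC is allowed to complete.
 * remote_inv is true when the Receive completion reported an invalidated
 * STag; in that case inv_stag holds the reported value.
 * Returns the number of Local Invalidations that were still needed.
 */
static size_t hyp_finish_rpc(struct hyp_rpc *rpc, bool remote_inv,
                             uint32_t inv_stag)
{
    size_t posted = 0;

    for (size_t i = 0; i < rpc->nstags; i++) {
        if (!rpc->registered[i])
            continue;
        if (remote_inv && rpc->stag[i] == inv_stag) {
            /* Already invalidated by the RNIC; do not do it a second time. */
            rpc->registered[i] = false;
            continue;
        }
        local_invalidate(rpc, i);
        posted++;
    }
    return posted;
}
```

   The same routine covers both outcomes the requester must tolerate:
   when the completion reports no invalidated STag, every STag is
   invalidated locally.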

   However, in the reverse direction, the server acts as the requester,
   and the client is the responder. The server's RNIC, therefore, must
   support receiving an Invalidate Extended Transport Header (IETH), and
   the server must have registered the STags with an appropriate
   registration mechanism. Thus the server must indicate its Remote
   Invalidation support status to the client (the opposite of forward
   direction Remote Invalidation).

5.  Protocol Elements

   In this section, a number of abstract protocol variations are
   considered. These vary in functionality and the invasiveness of
   changes to the transport protocol's XDR definition. Some of these
   variations might be appropriate to use in combination.

5.1.  Per Protocol Version Remote Invalidation

5.1.1.  Description

   When a higher protocol version number is negotiated, Remote
   Invalidation is always enabled. Both peers assume that Remote
   Invalidation may be used in either direction.

5.1.2.  Similar Existing Implementations

   SMB Direct [MS-SMBD]

5.1.3.  Advantages

   No XDR changes or protocol extensions are required.

   Reverse direction use of Remote Invalidation is automatically
   supported.

5.1.4.  Disadvantages

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated, or the protocol must specify a fixed choice of
   which STag(s) in each request are allowed to be invalidated remotely.

   This new protocol version would then be usable only with RNICs that
   support Remote Invalidation. Other features and benefits of the new
   protocol version would not be available when an implementation
   employs an RNIC that does not support Remote Invalidation. In
   particular, RNICs that do not support MEM_MGT_EXTENSIONS could not
   use the new protocol version.
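
   As a rough illustration of this coupling, the sketch below gates the
   protocol version on the device capability. The capability bit and
   the version numbers are hypothetical placeholders chosen for this
   example; they are not values defined by any verbs implementation or
   by an RPC-over-RDMA specification.

```c
#include <stdint.h>

/* Hypothetical bit standing in for the verbs MEM_MGT_EXTENSIONS flag. */
#define HYP_CAP_MEM_MGT_EXTENSIONS (1u << 0)

/* Hypothetical version numbers: the higher one always enables
 * Remote Invalidation, as in the per-protocol-version design. */
enum {
    HYP_VERS_BASE   = 1,
    HYP_VERS_REMINV = 2
};

/* Return the highest protocol version this device could negotiate. */
static uint32_t hyp_highest_usable_version(uint32_t device_caps)
{
    if (device_caps & HYP_CAP_MEM_MGT_EXTENSIONS)
        return HYP_VERS_REMINV;
    /* Without the extensions, every other feature of the newer
     * version is lost along with Remote Invalidation. */
    return HYP_VERS_BASE;
}
```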

   An extension or an additional protocol version bump is required to
   indicate support for transport-level mechanisms that can invalidate
   multiple STags at once.

5.2.  Per Connection Remote Invalidation

5.2.1.  Description

   At connection initiation time, messages are exchanged that indicate
   each peer's Remote Invalidation support status. Without these
   messages, peers assume Remote Invalidation is not supported.

5.2.2.  Similar Existing Implementations

   iSER [RFC7145]. Information is exchanged in RDMA-CM connection
   requests to report an implementation's Remote Invalidation support
   status.

5.2.3.  Advantages

   No changes to the base protocol XDR are required.

5.2.4.  Disadvantages

   Out-of-band messages are required to establish Remote Invalidation
   support status.

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated.

   To support reverse-direction operation, the server must separately
   indicate that it supports Remote Invalidation.

   To enable support for multiple STag invalidation, this negotiation
   protocol would have to be extended again to indicate when mechanisms
   other than RDMA Send With Invalidate are supported by the requester's
   RNIC.

5.3.  Fixed Protocol Remote Invalidation

5.3.1.  Description

   Protocol specification determines how the responder chooses which
   STag is to be invalidated remotely. Some other means is used to
   determine whether Remote Invalidation can be used or not.

5.3.2.  Similar Existing Implementations

   iSER [RFC7145]. Two STag fields appear in each request: one
   advertises Read data and one advertises Write data. When only one
   STag is used in the request, it may be invalidated remotely. When
   both STags are used, only the Read STag may be invalidated remotely.

   SMB Direct [MS-SMBD].
   The responder always chooses the first STag in
   each request to be invalidated remotely.

5.3.3.  Advantages

   No changes to the base protocol XDR are required.

5.3.4.  Disadvantages

   Out-of-band messages are required to establish support status.

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated.

   This mechanism may not work well for transport protocols that allow
   multiple read and write STags.

5.4.  Per RPC Remote Invalidation (Single STag)

5.4.1.  Description

   A field is added to the transport header that contains an STag which
   may be invalidated by the responder. A special value can be chosen
   to mean "no STag may be invalidated" for use by requesters that have
   no support for Remote Invalidation.

5.4.2.  Similar Existing Implementations

   None.

5.4.3.  Advantages

   A requester may advertise STags that cannot be invalidated remotely,
   as long as they are never marked as "may invalidate."

   No out-of-band support status negotiation is needed.

   Reverse-direction RPCs can each indicate whether a reverse-direction
   requester desires or does not support Remote Invalidation.

   The responder needs no special logic or assumptions to choose the
   STag to invalidate remotely.

5.4.4.  Disadvantages

   Either the base RPC-over-RDMA header XDR definition is altered, or a
   protocol extension is required.

   Requesters transmit a little extra data per RPC, making RPC-over-RDMA
   messages slightly more costly to send and parse.

   This mechanism cannot support the remote invalidation of multiple
   STags at once.

5.5.  Per RPC Remote Invalidation (Multiple STags)

5.5.1.  Description

   A new data structure is added to the transport header that indicates
   which STags may be invalidated by the responder.

   This information might appear as a new field in the RDMA segment data
   structure, as each segment has its own STag field. The field
   indicates whether or not that STag may be invalidated by the
   responder. Perhaps that field is a boolean, though in XDR, a boolean
   is a full 32 bits.

   Or, this information could appear in the header as an array of STags,
   to reduce the amount of extra data contained in the RPC-over-RDMA
   header. Zero array elements means the requester does not support
   Remote Invalidation.

5.5.2.  Similar Existing Implementations

   NVMe/Fabrics [NVME]. Each STag in a request has an associated bit
   flag that indicates whether the responder is allowed to invalidate it
   remotely.

5.5.3.  Advantages

   A requester may advertise STags that cannot be invalidated remotely,
   as long as they are never marked as "may invalidate."

   The mechanism allows a requester to request either invalidation of
   multiple STags at once, or to choose one STag to invalidate remotely.

   No out-of-band support status negotiation is needed.

   Each reverse-direction RPC can indicate whether a reverse-direction
   requester desires or does not support Remote Invalidation.

   The responder needs no special logic or assumptions to choose the
   STag to invalidate remotely.

5.5.4.  Disadvantages

   The RPC-over-RDMA header XDR definition is possibly extensively
   altered.

   Requesters transmit extra data per RPC. However, it is limited to
   only one or two 32-bit words in most cases.

5.6.  Inter-RPC Remote Invalidation

5.6.1.  Description

   As a subfeature of support for Remote Invalidation, it is possible
   that a responder can remotely invalidate an STag (using RDMA Send
   With Invalidate) that refers to registered memory being used in the
   Read chunk of a different RPC. Such Remote Invalidation would be
   requested only after the responder has already completed its RDMA
   Read.
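
   The eligibility rule in the preceding paragraph could be sketched as
   follows. This is a minimal illustration, assuming the responder
   keeps a table of Read chunks from pending RPCs; the structure and
   function names are hypothetical and appear in no existing
   implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical responder-side record of one Read chunk from a pending RPC. */
struct hyp_read_chunk {
    uint32_t stag;        /* STag advertised by the requester */
    bool     read_done;   /* the responder's RDMA Read has completed */
    bool     invalidated; /* Remote Invalidation was already requested */
};

/*
 * Pick an STag to carry in the next RDMA Send With Invalidate.
 * Only chunks whose RDMA Read has completed are eligible.
 * Returns true and stores the STag in *out when one is found.
 */
static bool hyp_pick_inter_rpc_stag(struct hyp_read_chunk *chunks, size_t n,
                                    uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (chunks[i].read_done && !chunks[i].invalidated) {
            chunks[i].invalidated = true; /* never pick the same STag twice */
            *out = chunks[i].stag;
            return true;
        }
    }
    return false;
}
```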

   This can be useful when a responder is replying to an RPC via an
   inline message, but notices there are other RPC replies pending that
   have multiple STags, some of which are Read chunks.

5.6.2.  Similar Existing Implementations

   None.

5.6.3.  Advantages

   This is one way to enable remote invalidation of multiple STags per
   RPC, using only RDMA Send With Invalidate.

5.6.4.  Disadvantages

   Additional requester and responder complexity would be required to
   keep track of STags.

6.  Recommendations

6.1.  General Considerations

   When constructing a protocol to support Remote Invalidation, one of
   the above designs, or some combination of them, may be chosen.

   In no particular order, the author feels that the design priorities
   are:

   o  Do not prevent the efficient operation of RNICs that do not handle
      RDMA Send With Invalidate

   o  Introduce as little impact on header XDR and header length as
      possible, to keep collateral performance and complexity impacts
      low

   o  Enable support for Remote Invalidation when explicit RDMA is used
      in reverse-direction RPCs

   An important question is whether the base RPC-over-RDMA protocol
   should support Remote Invalidation, whether Remote Invalidation
   support should be carried entirely on the shoulders of protocol
   extensions, or whether some combination of the two is best.

   Upper Layer Protocols will likely always be responsible for some
   degree of signaling Remote Invalidation capabilities, as long as
   innovation continues at the transport layer (e.g., new RDMA
   operations that enable multi-STag Remote Invalidation). Predicting
   future hardware capabilities is challenging, limiting the ability to
   design long-lived protocol support for them.

   Lastly, it is difficult to estimate how long the industry must
   continue to support less capable devices.

6.2.  Choosing a Protocol Extension

   All things being equal, making no changes to the base XDR definition
   has great appeal. If the mechanism in Section 5.2 can be broadly
   effective at enabling Remote Invalidation in the current set of
   RPC-over-RDMA implementations, it would be the proper choice.

   Unfortunately, among current RPC-over-RDMA client implementations,
   there is one client that can immediately use a per-connection style
   protocol, and one that can use only a per-RPC style protocol such as
   Section 5.4. A third known client resides in user space and uses FMR
   registration, thus it is incapable of immediately employing Remote
   Invalidation.

   Because there is a wide latitude of implementation choice already
   allowed by the RPC-over-RDMA transport protocol, the author's
   preference is to implement Section 5.4. The target STag can be added
   to the RPC-over-RDMA transport as a single field in a new version of
   the RPC-over-RDMA transport protocol. No further changes or
   extensions are needed.

   In the longer term, the requester appears to be in the better
   position to determine which STag may be invalidated remotely. With
   this mechanism, the requester can choose based on which STags may be
   invalidated remotely, or may use criteria based on the strengths of
   its RNIC. For instance, choosing the largest registered memory
   region might be beneficial in some cases.

   Allowing the responder to select from among several choices does not
   seem to bring additional value, and burdens the responder with
   additional header parsing costs for each chunk-bearing RPC
   transaction.

   Furthermore, the ability to request Remote Invalidation of multiple
   STags in a single Work Request appears to be somewhat distant.
It 679 would require additional Upper Layer Protocol mechanisms to 680 distinguish the new mechanism from using RDMA Send With Invalidate, 681 which we are not in a position to design today. Thus it does not 682 seem worth the extra implementation and protocol complexity of having 683 the requester provide a list of STags for the responder to choose 684 from. 686 As an alternative to modifying the XDR definition for the RDMA_MSG 687 and RDMA_NOMSG message types, a new RDMA message type could be 688 introduced in a new version of RPC-over-RDMA that provides similar 689 functionality to RDMA_MSG and RDMA_NOMSG but adds one or more new 690 fields. This has the advantage of leaving the version 1-compatible 691 parts of the the new XDR definition unchanged. It is an open 692 question whether this introduces more complexity to existing 693 implementations than adding new fields to RDMA_MSG and RDMA_NOMSG. 694 However, this approach is similar to the introduction of READ_PLUS in 695 the specification of NFSv4.2 [RFC7862]. 697 Allowing the feature described in Section 5.6 is likely to increase 698 the complexity of responder and especially requester implementations, 699 as they would have to remember invalidated STags independently of RPC 700 completions. Because it does not require any XDR changes, it could 701 easily be enabled in a future protocol extension. The author's 702 preference is to forbid this behavior in the initial specification, 703 but allow for a future extension to introduce it. 705 6.3. 
Example Remote Invalidation Protocol

As an example of how to proceed, the simplest approach would replace
struct rpcrdma2_chunk_lists (as defined in
[I-D.cel-nfsv4-rpcrdma-version-two]) with the following:

   struct rpcrdma2_chunk_lists {
           enum msg_type                 rdma_direction;
           u32                           rdma_inv_handle;
           struct rpcrdma2_read_list     *rdma_reads;
           struct rpcrdma2_write_list    *rdma_writes;
           struct rpcrdma2_write_chunk   *rdma_reply;
   };

The following language describes how to utilize the new field:

   The requester sets the value of the rdma_inv_handle field to the
   value of any one of the rdma_handle fields in the RPC-over-RDMA
   header of the RPC Call that may be invalidated remotely.  If the
   RPC-over-RDMA header of the RPC Call contains no rdma_handles that
   may be invalidated remotely, the requester MUST set the value of
   the rdma_inv_handle field to zero.

   If the rdma_inv_handle field in the RPC-over-RDMA header of an RPC
   Call contains zero, the responder MUST NOT use RDMA Send With
   Invalidate to transmit the matching RPC Reply.  Otherwise, the
   responder SHOULD use RDMA Send With Invalidate to transmit the RPC
   Reply, specifying the value in the RPC-over-RDMA header's
   rdma_inv_handle field as the Send With Invalidate Work Request's
   inv_rkey.

6.4.  Corner Cases

A remote invalidation-enabled client remains responsible for
protecting its registered memory even when there is no Reply.
Consider these important corner cases:

o  The responder never sends a response to Call-only procedures, thus
   there is no opportunity for remote invalidation.  Moreover, if the
   transport protocol has no RDMA_DONE message, requesters cannot
   know when they may safely invalidate registered memory used for
   Call arguments.  Therefore memory registration should not be used
   for RPC procedures that do not expect a Reply.

o  The RPC Reply is lost but the responder is still functional.
   In some cases, the Upper Layer Protocol requires that the responder
   close the connection to signal the loss of an RPC transaction.
   This renders existing STags invalid.

o  An application on a client is interrupted before the RPC Reply
   completes on the requester, or the RPC transaction times out
   waiting for a Reply.  This exposes a race condition:

   *  The MR has already been invalidated by the requester when the
      RPC Reply arrives at the RNIC.  Typically this results in a
      Memory Management Operation error, and the QP is placed in the
      Error state.

   *  The MR has already been invalidated by the RNIC when the
      requester invalidates locally.  This also typically results in
      a Memory Management Operation error, and the QP is placed in
      the Error state.

   A protocol mechanism that enables the requester to indicate to the
   responder that an RPC transaction has been canceled can be used to
   avoid this race.  Otherwise, the requester and responder
   implementations must tolerate connection loss and
   re-establishment.

7.  Security Considerations

Remote Invalidation metadata is conveyed in the clear in
RPC-over-RDMA headers.  This does not expose any new information to
attackers.

A man-in-the-middle can alter Remote Invalidation metadata while it
is in transit.  Requesters are prepared to handle the case where
responders have not invalidated any STags associated with an RPC.  An
attacker can cause other STags in flight to be invalidated before the
responder is finished with the associated memory.  Or an attacker can
replace the "to-be invalidated" STag with an STag in the same RPC
that should not be invalidated remotely.  Any of these might cause
loss of connection or other failures, triggering a denial-of-service
situation.

A connection relationship is required to exist between a requester
and a responder.
The requester's RNIC has associated a Protection
Domain with that connection.  The STag on the requester to be
invalidated is associated with that Protection Domain.  This protects
against arbitrary invalidation of STags by network nodes not part of
the connection.

Further discussion appears in [RFC5042].

8.  IANA Considerations

This document does not require actions by IANA.

9.  References

9.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997.

[RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
           Garcia, "A Remote Direct Memory Access Protocol
           Specification", RFC 5040, DOI 10.17487/RFC5040, October
           2007.

[RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
           Protocol (DDP) / Remote Direct Memory Access Protocol
           (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
           2007.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
           Memory Access Transport for Remote Procedure Call Version
           1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

[RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on RPC-
           over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
           June 2017.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017.

9.2.  Informative References

[I-D.cel-nfsv4-rpcrdma-version-two]
           Lever, C. and D. Noveck, "RPC-over-RDMA Version 2
           Protocol", draft-cel-nfsv4-rpcrdma-version-two-08 (work in
           progress), November 2018.

[IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
           Specification Volume 1", Release 1.3, March 2015.

[MS-SMBD]  Microsoft Corporation, "SMB Remote Direct Memory Access
           (RDMA) Transport Protocol Specification", July 2016.

[NVME]     NVM Express, Inc., "NVM Express Revision 1.2.1", July
           2016.

[RFC7145]  Ko, M. and A. Nezhinsky, "Internet Small Computer System
           Interface (iSCSI) Extensions for the Remote Direct Memory
           Access (RDMA) Specification", RFC 7145,
           DOI 10.17487/RFC7145, April 2014.

[RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
           Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
           November 2016.

Acknowledgments

The author wishes to thank Sagi Grimberg, Christoph Hellwig, Karen
Deitke, Dave Noveck, and Tom Talpey.  The author also wishes to thank
Bill Baker and Greg Marsden for their support of this work.

Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
Working Group Secretary Thomas Haynes for their support.

Author's Address

Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI 48104
United States of America

Email: chuck.lever@oracle.com