Network File System Version 4                                   C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                             July 17, 2018
Expires: January 18, 2019

        Using Remote Invalidation With RPC-Over-RDMA Transports
                    draft-cel-nfsv4-reminv-design-08

Abstract

   Remote Invalidation relieves RDMA responders of some of the burden of
   preparing memory to be accessed remotely, thus reducing the latency
   of RDMA Read and Write operations. This document considers how to
   introduce generic support for Remote Invalidation to RPC-over-RDMA
   transport protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 18, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  General Requirements
     3.1.  Memory Management Extensions
     3.2.  Registration Types
     3.3.  Selecting STags to Invalidate Remotely
     3.4.  Future Enhancements
   4.  Remote Invalidation in Operation
     4.1.  Determining Remote Invalidation Support Status
     4.2.  Selection of Which STag to Invalidate Remotely
     4.3.  Reverse-Direction Operation
   5.  Protocol Elements
     5.1.  Per Protocol Version Remote Invalidation
     5.2.  Per Connection Remote Invalidation
     5.3.  Fixed Protocol Remote Invalidation
     5.4.  Per RPC Remote Invalidation (Single STag)
     5.5.  Per RPC Remote Invalidation (Multiple STags)
     5.6.  Inter-RPC Remote Invalidation
   6.  Recommendations
     6.1.  General Considerations
     6.2.  Choosing a Protocol Extension
     6.3.  Example Remote Invalidation Protocol
     6.4.  Corner Cases
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Acknowledgments
   Author's Address

1.  Introduction

   Similar to RDMA-enabled storage protocols such as iSER [RFC7145], an
   RPC-over-RDMA version 1 requester exposes regions of its memory to an
   RPC-over-RDMA responder. The responder then uses RDMA Read and Write
   operations to transfer bulk data payloads [RFC8166].

   In preparation for a bulk data transfer, a requester asks its RNIC to
   assign a steering tag, or STag, to a region of memory containing the
   data to be moved.
   At this time, access rights are granted that allow
   the RNIC to access or update that memory on behalf of a remote peer.
   This act is referred to as "memory registration." The RNIC uses this
   STag to steer data to and from the registered memory region.

   When data transfer is complete, each STag is dissociated from its
   memory region. This act is referred to as "memory invalidation." It
   prevents further responder access to that memory region by revoking
   its remote access rights. Invalidation should be done before RPC
   applications on the requester are allowed access to memory that was
   involved in an explicit RDMA operation.

   Before an RPC transaction is terminated, the requester is responsible
   for fencing memory from the responder. This is a hard requirement of
   the transport protocol [RFC8166]. Fencing serializes the completion
   of RPC transactions with the invalidation of RPC-over-RDMA chunks.
   Therefore the latency of invalidation adds to the total execution
   time of each RPC transaction.

   Remote Invalidation is a mechanism by which an RDMA peer can request
   that the remote peer RNIC invalidate an STag associated with memory
   on that remote peer [RFC5042]. An RDMA consumer requests Remote
   Invalidation by posting an RDMA Send With Invalidate Work Request in
   place of an RDMA Send Work Request. RDMA Send With Invalidate is
   similar to RDMA Send, but takes one additional argument: a single
   STag to be invalidated by the RNIC that receives the sent message.
   The resulting RDMA Send operation is transmitted with additional
   header information that conveys the STag that is to be invalidated
   [RFC5040].

   The benefit of Remote Invalidation is that the requester is not
   required to post an additional Work Request, context switch, and
   handle an interrupt to perform memory invalidation as part of
   completing an RPC transaction. Memory invalidation is essentially
   offloaded to the RNIC.
   The upshot is faster completion of RPC
   transactions that involve registered memory.

   This mechanism has the most impact when explicit RDMA operations are
   needed to move moderate amounts of data. Invalidation latency is
   quite small compared to the time it takes to convey a large payload
   with an explicit RDMA operation. Small RPCs are already conveyed
   entirely via RDMA Send, thus Remote Invalidation is unnecessary for
   them. When the time it takes to invalidate a memory region is on the
   same order as the time it takes to move the contents of that region,
   Remote Invalidation has its greatest impact.

   Remote Invalidation confers benefits similar to the benefits of
   increasing the size of Send and Receive buffers. However, Remote
   Invalidation does not incur the cost of maintaining a pool of large
   Receive buffers on either the requester or responder. Moderate-sized
   RPC payloads can be transferred without much of the cost of memory
   registration. Requesters can rely on RDMA Write to structure their
   Receive buffers without introducing additional latency.

   There are some downsides, however. Remote Invalidation is not
   available on all RNIC devices. And, Remote Invalidation does not
   address the extra round trip latency incurred when using RDMA Read.
   This extra latency can be eliminated using a large inline threshold
   for transmitting RPC Calls.

   The purpose of this document is to explore at a high level how Remote
   Invalidation can be introduced into the RPC-over-RDMA transport
   protocol. The primary design considerations for the transport
   protocol are to provide a mechanism to indicate when Remote
   Invalidation is safe to use, and to provide selection criteria for
   choosing which STag (when there is more than one) to invalidate
   remotely.
   Elements of the XDR definition of the RPC-over-RDMA
   protocol will need to be altered to some degree, depending on desired
   flexibility of operation, invasiveness of XDR changes, and breadth of
   hardware support.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119] [RFC8174]
   when, and only when, they appear in all capitals, as shown here.

3.  General Requirements

3.1.  Memory Management Extensions

   Remote Invalidation was not available in the original RDMA Verbs API.
   New verbs API objects were specified that include operations that
   enable Remote Invalidation, now described in [IBARCH]. The Verbs API
   provides a capabilities flag, MEM_MGT_EXTENSIONS, that indicates that
   an RNIC and the local verbs implementation can support these new APIs
   and objects.

   An STag that is registered using the FRWR mechanism (in a privileged
   execution context) or is registered via a Memory Window (in a
   non-privileged context) may be invalidated remotely [RFC5040]. These
   mechanisms are available when an RNIC supports MEM_MGT_EXTENSIONS.

   RDMA Send With Invalidate is available only with MEM_MGT_EXTENSIONS.

3.2.  Registration Types

   For the purposes of this discussion, there are two classes of STags.
   Dynamically-registered STags are used in a single RPC, then
   invalidated. Persistently-registered STags are used in multiple RPC
   transactions. They may persist for the life of an RPC-over-RDMA
   connection, or longer.

   In RPC-over-RDMA version 1, a requester may provide more than one
   STag in the chunk lists of an RPC.
   Therefore a requester may provide
   any combination of the following registration types in one RPC, any
   combination of these in a series of RPCs on the same connection, or
   it may use some other registration model.

   Examples of persistently-registered STags include:

   o  The device's reserved DMA R_key

   o  An STag registered for a connection that doesn't change from RPC
      to RPC (for a utility buffer, say)

   o  An STag registered for a fixed memory region that is updated after
      each time it is advertised

   o  An STag covering a large single region that is utilized in small
      segments by many RPCs

   Examples of dynamically-registered STags include:

   o  An STag registered for a single RPC transaction using a legacy
      registration mechanism, then invalidated when the RPC is retired

   o  An STag registered for a single RPC transaction using either
      Memory Windows or FRWR, then invalidated when the RPC is retired

   Among these examples, only dynamically-registered STags using Memory
   Windows or FRWR may be invalidated remotely.

3.3.  Selecting STags to Invalidate Remotely

   Remote Invalidation protocol mechanisms come in different styles:

   Fixed Protocol
      The rules by which a responder selects which STag to invalidate
      remotely are fixed in the protocol specification.

   Responder's Choice
      The responder chooses an STag to invalidate remotely from among
      all the STags in incoming requests.

   Requester's Choice
      The requester chooses one or more STags that may be invalidated
      remotely, indicating its choices in each request. The responder
      chooses an STag to invalidate remotely from among the requester's
      picks.

   There is no RDMA layer mechanism by which a responder can determine
   how a requester-provided STag was registered.
   Thus a requester that
   mixes persistently- and dynamically-registered STags in one RPC, or
   mixes them across RPCs on the same connection, cannot tolerate
   Responder's Choice.

3.4.  Future Enhancements

   There are two related enhancements that further reduce the effort
   needed to invalidate STags associated with complex RPCs:

   o  The ability for one registered STag to represent a list of memory
      regions that are not contiguous

   o  The ability to specify more than one remote STag in a single Send
      Work Request to be remotely invalidated

   At this time, the first mechanism has been implemented in at least
   one RNIC on the market. The second is speculative (i.e., has not yet
   been implemented anywhere).

   Given support for registering non-contiguous memory regions with one
   STag, when an RPC-over-RDMA requester constructs an RPC that has both
   a Read list and a Write list, the requester has a choice:

   o  The requester can register a separate STag for each access mode
      (one STag for memory regions needing read access, and one STag for
      those needing write access) to provide good data security

   o  The requester can register a single STag with read and write
      access enabled for the whole set of memory regions, to allow RDMA
      Send With Invalidate to work optimally

   Having the ability to remotely invalidate multiple STags at once
   enables the combination of optimal performance and optimal security.

4.  Remote Invalidation in Operation

   When requester memory is registered for remote access, an
   RPC-over-RDMA implementation can use Remote Invalidation by following
   these steps:

   1.  The requester DMA-maps a memory region that will participate in
       an RPC transaction, then registers an STag for that region.

   2.  The requester transmits the RPC Call to the responder. This
       request conveys the STag to the responder.

   3.  The responder processes the RPC transaction.
       The peer RNICs use
       the STag to move RPC arguments and/or results.

   4.  The responder transmits the RPC Reply using an RDMA Send With
       Invalidate Work Request, setting the Work Request's inv_handle
       field to the value of the STag.

   5.  A Receive Work Request completes on the requester, carrying this
       RPC reply. The completion reports the invalidated STag.

   6.  The requester skips invalidation of the STag, then DMA-unmaps the
       memory region associated with the STag.

   The requester no longer needs to invalidate the STag involved with
   this RPC. However, there are additional details that must be
   resolved before the use of Remote Invalidation can commence.

4.1.  Determining Remote Invalidation Support Status

   A requester that does not support Remote Invalidation might not
   tolerate the use of RDMA Send With Invalidate by a responder. Such a
   requester performs Local Invalidation on STags that already happen to
   be invalid. In some cases this results in protection errors or other
   issues.

   Thus, to avoid spurious connection termination, a responder must not
   post an RDMA Send With Invalidate Work Request unless it is sure the
   following three conditions are met:

   o  The requester's RNIC is prepared to receive the additional header
      information associated with Remote Invalidation

   o  The requester has used an appropriate registration mechanism to
      register STags it wants invalidated remotely

   o  The requester is prepared to recognize remotely invalidated STags
      during Receive processing to avoid invalidating them a second time

   When all three of these conditions are met, a requester can report
   positive Remote Invalidation support status to responders using an
   Upper Layer Protocol mechanism. When a responder does not know the
   requester's Remote Invalidation support status, it cannot use Remote
   Invalidation without endangering the connection.

4.2.  Selection of Which STag to Invalidate Remotely

   The RDMA Send With Invalidate Work Request invalidates only one STag.
   RPC-over-RDMA requesters may register more than one STag to handle
   the movement of payloads for a single RPC. Either the client will
   have to specify which STag may be remotely invalidated, the protocol
   will have to specify a fixed way to select which STag to invalidate,
   or the responder will have to choose arbitrarily which STag to
   remotely invalidate.

   In some circumstances, requesters may wish to utilize STags during
   transactions that are registered using a mechanism that does not
   tolerate Remote Invalidation. For example, an STag that is the
   requester's local DMA R_key should never be invalidated remotely. If
   a responder attempts to invalidate such an STag, the result is
   undefined, but the connection may be terminated or other failures can
   occur.

   Even with Remote Invalidation enabled, requesters remain responsible
   for ensuring all STags are invalid before RPC transactions complete.
   To avoid leaving STags registered, a requester must be prepared for
   the responder or the requester's own RNIC to have not invalidated any
   of an RPC's STags. When there are multiple STags associated with a
   single RPC, a requester must be prepared for any one of the STags to
   have been remotely invalidated, or for all of the RPC's STags to
   remain registered.

4.3.  Reverse-Direction Operation

   As of this writing, no current RPC-over-RDMA implementation supports
   direct data placement in the reverse direction. However, existing
   protocol specifications do not forbid it [RFC8166] [RFC8167]
   [I-D.cel-nfsv4-rpcrdma-version-two].

   When chunks are present in a reverse-direction RPC request, Remote
   Invalidation allows the responder to trigger invalidation of a
   requester's STags as part of sending a reply, the same as in the
   forward direction.
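
   In either direction, the peer that registered the STags processes
   the reply in the same way: it treats every STag that the Receive
   completion did not report as invalidated as still registered. The
   following is a minimal sketch of that rule, not an implementation.
   The ReceiveCompletion record and the function name are hypothetical
   and do not come from any verbs API or from [RFC8166].

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ReceiveCompletion:
    """Hypothetical summary of the Receive completion that carried an RPC Reply."""
    # STag reported as remotely invalidated, or None when the peer
    # used a plain RDMA Send.
    invalidated_stag: Optional[int] = None


def stags_needing_local_invalidation(
    rpc_stags: Iterable[int], completion: ReceiveCompletion
) -> List[int]:
    """Return the STags of one RPC that must still be invalidated locally.

    The requester cannot assume which STag (if any) was chosen, so every
    STag not reported by the completion is treated as still registered.
    """
    return [s for s in rpc_stags if s != completion.invalidated_stag]
```

   Invalidating only the returned STags avoids both hazards named in
   Section 4.2: leaving an STag registered, and invalidating an STag a
   second time.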

   However, in the reverse direction, the server acts as the requester,
   and the client is the responder. The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered the
   STags with an appropriate registration mechanism. Thus the server
   must indicate its Remote Invalidation support status to the client
   (the opposite of forward direction Remote Invalidation).

5.  Protocol Elements

   In this section, a number of abstract protocol variations are
   considered. These vary in functionality and the invasiveness of
   changes to the transport protocol's XDR definition. Some of these
   variations might be appropriate to use in combination.

5.1.  Per Protocol Version Remote Invalidation

5.1.1.  Description

   When a higher protocol version number is negotiated, Remote
   Invalidation is always enabled. Both peers assume that Remote
   Invalidation may be used in either direction.

5.1.2.  Similar Existing Implementations

   SMB Direct [MS-SMBD]

5.1.3.  Advantages

   No XDR changes or protocol extensions are required.

   Reverse direction use of Remote Invalidation is automatically
   supported.

5.1.4.  Disadvantages

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated, or the protocol must specify a fixed choice of
   which STag(s) in each request are allowed to be invalidated remotely.

   This new protocol version would then be usable only with RNICs that
   support Remote Invalidation. Other features and benefits of the new
   protocol version would not be available when an implementation
   employs an RNIC that does not support Remote Invalidation. In
   particular, RNICs that do not support MEM_MGT_EXTENSIONS could not
   use the new protocol version.
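
   The coupling between device capability and protocol version can be
   sketched as a version-selection rule. This is an illustration under
   stated assumptions: the version numbers, the capability bit value,
   and the function name are hypothetical and are not taken from any
   verbs implementation or RPC-over-RDMA specification.

```python
CAP_MEM_MGT_EXTENSIONS = 1 << 0  # hypothetical capability bit

VERSION_BASE = 1    # no Remote Invalidation
VERSION_REMINV = 2  # hypothetical version that always enables it


def highest_usable_version(local_caps: int, peer_max_version: int) -> int:
    """Pick a protocol version under a per-version Remote Invalidation design."""
    if not local_caps & CAP_MEM_MGT_EXTENSIONS:
        # Without the extensions, the new version is off limits, along
        # with every unrelated feature that version carries.
        return VERSION_BASE
    return min(VERSION_REMINV, peer_max_version)
```

   A device lacking the capability is held at the base version even
   when the peer offers the newer one.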

   An extension or additional protocol version bump is required to
   indicate support for transport-level mechanisms that can invalidate
   multiple STags at once.

5.2.  Per Connection Remote Invalidation

5.2.1.  Description

   At connection initiation time, messages are exchanged that indicate
   each peer's Remote Invalidation support status. Without these
   messages, peers assume Remote Invalidation is not supported.

5.2.2.  Similar Existing Implementations

   iSER [RFC7145]. Information is exchanged in RDMA-CM connection
   requests to report an implementation's Remote Invalidation support
   status.

5.2.3.  Advantages

   No changes to the base protocol XDR are required.

5.2.4.  Disadvantages

   Out-of-band messages are required to establish Remote Invalidation
   support status.

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated.

   To support reverse-direction operation, the server must separately
   indicate that it supports Remote Invalidation.

   To enable support for multiple STag invalidation, this negotiation
   protocol would have to be extended again to indicate when mechanisms
   other than RDMA Send With Invalidate are supported by the requester's
   RNIC.

5.3.  Fixed Protocol Remote Invalidation

5.3.1.  Description

   Protocol specification determines how the responder chooses which
   STag is to be invalidated remotely. Some other means is used to
   determine whether Remote Invalidation can be used or not.

5.3.2.  Similar Existing Implementations

   iSER [RFC7145]. Two STag fields appear in each request: one
   advertises Read data and one advertises Write data. When only one
   STag is used in the request, it may be invalidated remotely. When
   both STags are used, only the Read STag may be invalidated remotely.

   SMB Direct [MS-SMBD].
   The responder always chooses the first STag in
   each request to be invalidated remotely.

5.3.3.  Advantages

   No changes to the base protocol XDR are required.

5.3.4.  Disadvantages

   Out-of-band messages are required to establish support status.

   The requester is not in control of which STags in an RPC may be
   invalidated. Thus, a requester must not advertise STags which must
   never be invalidated.

   This mechanism may not work well for transport protocols that allow
   multiple read and write STags.

5.4.  Per RPC Remote Invalidation (Single STag)

5.4.1.  Description

   A field is added to the transport header that contains an STag which
   may be invalidated by the responder. A special value can be chosen
   to mean "no STag may be invalidated" for use by requesters that have
   no support for Remote Invalidation.

5.4.2.  Similar Existing Implementations

   None.

5.4.3.  Advantages

   A requester may advertise STags that cannot be invalidated remotely,
   as long as they are never marked as "may invalidate."

   No out-of-band support status negotiation is needed.

   Reverse-direction RPCs can each indicate whether a reverse-direction
   requester desires or does not support Remote Invalidation.

   The responder needs no special logic or assumptions to choose the
   STag to invalidate remotely.

5.4.4.  Disadvantages

   Either the base RPC-over-RDMA header XDR definition is altered, or a
   protocol extension is required.

   Requesters transmit a little extra data per RPC, making RPC-over-RDMA
   messages slightly more costly to send and parse.

   This mechanism cannot support the remote invalidation of multiple
   STags at once.

5.5.  Per RPC Remote Invalidation (Multiple STags)

5.5.1.  Description

   A new data structure is added to the transport header that indicates
   which STags may be invalidated by the responder.

   This information might appear as a new field in the RDMA segment data
   structure, as each segment has its own STag field. The field
   indicates whether or not that STag may be invalidated by the
   responder. Perhaps that field is a boolean, though in XDR, a boolean
   is a full 32 bits.

   Or, this information could appear in the header as an array of STags,
   to reduce the amount of extra data contained in the RPC-over-RDMA
   header. Zero array elements means the requester does not support
   Remote Invalidation.

5.5.2.  Similar Existing Implementations

   NVMe/Fabrics [NVME]. Each STag in a request has an associated bit
   flag that indicates whether the responder is allowed to invalidate it
   remotely.

5.5.3.  Advantages

   A requester may advertise STags that cannot be invalidated remotely,
   as long as they are never marked as "may invalidate."

   The mechanism allows a requester to request either invalidation of
   multiple STags at once, or to choose one STag to invalidate remotely.

   No out-of-band support status negotiation is needed.

   Each reverse-direction RPC can indicate whether a reverse-direction
   requester desires or does not support Remote Invalidation.

   The responder needs no special logic or assumptions to choose the
   STag to invalidate remotely.

5.5.4.  Disadvantages

   The RPC-over-RDMA header XDR definition is possibly extensively
   altered.

   Requesters transmit extra data per RPC. However, it is limited to
   only one or two 32-bit words in most cases.

5.6.  Inter-RPC Remote Invalidation

5.6.1.  Description

   As a subfeature of support for Remote Invalidation, it is possible
   that a responder can remotely invalidate an STag (using RDMA Send
   With Invalidate) that refers to registered memory being used in the
   Read chunk of a different RPC. Such Remote Invalidation would be
   requested only after the responder has already completed its RDMA
   Read.
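
   The ordering constraint (invalidate only after the RDMA Read has
   completed) implies per-connection bookkeeping on the responder. A
   rough sketch follows, assuming a hypothetical ReadStagTracker class
   keyed by XID. None of these names appear in any RPC-over-RDMA
   specification or implementation.

```python
from typing import Dict, Optional, Set


class ReadStagTracker:
    """Hypothetical responder-side bookkeeping for one connection."""

    def __init__(self) -> None:
        self._pending: Dict[int, Set[int]] = {}  # XID -> Read STags still in use
        self._finished: Set[int] = set()         # Read STags whose RDMA Read is done

    def advertise(self, xid: int, read_stags: Set[int]) -> None:
        """Record the Read chunk STags advertised by an incoming RPC Call."""
        self._pending[xid] = set(read_stags)

    def read_completed(self, xid: int, stag: int) -> None:
        """Mark one STag as eligible once its RDMA Read has completed."""
        if stag in self._pending.get(xid, set()):
            self._pending[xid].remove(stag)
            self._finished.add(stag)

    def pick_for_reply(self) -> Optional[int]:
        """Return a finished Read STag from any RPC, or None if there is none."""
        return self._finished.pop() if self._finished else None
```

   An STag becomes a candidate for any later reply on the connection
   only after read_completed() has been called for it, which keeps the
   responder from invalidating memory it is still reading.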

   This can be useful when a responder is replying to an RPC via an
   inline message, but notices there are other RPC replies pending that
   have multiple STags, some of which are Read chunks.

5.6.2.  Similar Existing Implementations

   None

5.6.3.  Advantages

   This is one way to enable remote invalidation of multiple STags per
   RPC, using only RDMA Send With Invalidate.

5.6.4.  Disadvantages

   Additional requester and responder complexity would be required to
   keep track of STags.

6.  Recommendations

6.1.  General Considerations

   When constructing a protocol to support Remote Invalidation, one of
   the above designs, or some combination of them, may be chosen.

   In no particular order, the author feels that the design priorities
   are:

   o  Do not prevent the efficient operation of RNICs that do not handle
      RDMA Send With Invalidate

   o  Introduce as little impact on header XDR and header length as
      possible, to keep collateral performance and complexity impacts
      low

   o  Enable support for Remote Invalidation when explicit RDMA is used
      in reverse-direction RPCs

   An important question is whether the base RPC-over-RDMA protocol
   should support Remote Invalidation, whether Remote Invalidation
   support should be carried entirely on the shoulders of protocol
   extensions, or whether some combination of the two is best.

   Upper Layer Protocols will likely always be responsible for some
   degree of signaling Remote Invalidation capabilities, as long as
   innovation continues at the transport layer (e.g., new RDMA
   operations that enable multi-STag Remote Invalidation). Predicting
   future hardware capabilities is challenging, limiting the ability to
   design long-lived protocol support for them.

   Lastly, it is difficult to estimate how long the industry must
   continue to support less capable devices.

6.2.  Choosing a Protocol Extension

   All things being equal, making no changes to the base XDR definition
   has great appeal. If the mechanism in Section 5.2 can be broadly
   effective at enabling Remote Invalidation in the current set of
   RPC-over-RDMA implementations, it would be the proper choice.

   Unfortunately, among current RPC-over-RDMA client implementations,
   there is one client that can immediately use a per-connection style
   protocol, and one that can use only a per-RPC style protocol such as
   Section 5.4. A third known client resides in user space and uses FMR
   registration, thus it is incapable of immediately employing Remote
   Invalidation.

   Because there is a wide latitude of implementation choice already
   allowed by the RPC-over-RDMA transport protocol, the author's
   preference is to implement Section 5.4. The target STag can be added
   to the RPC-over-RDMA transport as a single field in a new version of
   the RPC-over-RDMA transport protocol. No further changes or
   extensions are needed.

   In the longer term, the requester appears to be in the better
   position to determine which STag may be invalidated remotely. With
   this mechanism, the requester can choose based on which STags may be
   invalidated remotely, or may use criteria based on the strengths of
   its RNIC. For instance, choosing the largest registered memory
   region might be beneficial in some cases.

   Allowing the responder to select from among several choices does not
   seem to bring additional value, and burdens the responder with
   additional header parsing costs for each chunk-bearing RPC
   transaction.

   Furthermore, the ability to request Remote Invalidation of multiple
   STags in a single Work Request appears to be somewhat distant.
   It
   would require additional Upper Layer Protocol mechanisms to
   distinguish the new mechanism from using RDMA Send With Invalidate,
   which we are not in a position to design today. Thus it does not
   seem worth the extra implementation and protocol complexity of having
   the requester provide a list of STags for the responder to choose
   from.

   As an alternative to modifying the XDR definition for the RDMA_MSG
   and RDMA_NOMSG message types, a new RDMA message type could be
   introduced in a new version of RPC-over-RDMA that provides similar
   functionality to RDMA_MSG and RDMA_NOMSG but adds one or more new
   fields. This has the advantage of leaving the version 1-compatible
   parts of the new XDR definition unchanged. It is an open
   question whether this introduces more complexity to existing
   implementations than adding new fields to RDMA_MSG and RDMA_NOMSG.
   However, this approach is similar to the introduction of READ_PLUS in
   the specification of NFSv4.2 [RFC7862].

   Allowing the feature described in Section 5.6 is likely to increase
   the complexity of responder and especially requester implementations,
   as they would have to remember invalidated STags independently of RPC
   completions. Because it does not require any XDR changes, it could
   easily be enabled in a future protocol extension. The author's
   preference is to forbid this behavior in the initial specification,
   but allow for a future extension to introduce it.

6.3.  Example Remote Invalidation Protocol

   As an example of how to proceed, the simplest approach would replace
   struct rpcrdma2_chunk_lists (as defined in
   [I-D.cel-nfsv4-rpcrdma-version-two]) with the following:

   struct rpcrdma2_chunk_lists {
           enum msg_type                   rdma_direction;
           u32                             rdma_inv_handle;
           struct rpcrdma2_read_list       *rdma_reads;
           struct rpcrdma2_write_list      *rdma_writes;
           struct rpcrdma2_write_chunk     *rdma_reply;
   };

   The following language describes how to utilize the new field:

      The requester sets the value of the rdma_inv_handle field to the
      value of any one of the rdma_handle fields in the RPC-over-RDMA
      header of the RPC Call that may be invalidated remotely. If the
      RPC-over-RDMA header of the RPC Call contains no rdma_handles that
      may be invalidated remotely, the requester MUST set the value of
      the rdma_inv_handle field to zero.

      If the rdma_inv_handle field in the RPC-over-RDMA header of an RPC
      Call contains zero, the responder MUST NOT use RDMA Send With
      Invalidate to transmit the matching RPC Reply. Otherwise, the
      responder SHOULD use RDMA Send With Invalidate to transmit the RPC
      Reply, specifying the value in the RPC-over-RDMA header's
      rdma_inv_handle field as the Send With Invalidate Work Request's
      inv_rkey.

6.4.  Corner Cases

   A remote invalidation-enabled client remains responsible for
   protecting its registered memory even when there is no Reply.
   Consider these important corner cases:

   o  The responder never sends a response to Call-only procedures, thus
      there is no opportunity for remote invalidation. Moreover, if the
      transport protocol has no RDMA_DONE message, requesters cannot
      know when they may safely invalidate registered memory used for
      Call arguments. Therefore memory registration should not be used
      for RPC procedures that do not expect a Reply.

   o  The RPC Reply is lost but the responder is still functional.
      In some cases, the Upper Layer Protocol requires that the
      responder close the connection to signal the loss of an RPC
      transaction. This renders existing STags invalid.

   o  An application on a client is interrupted before the RPC Reply
      completes on the requester, or the RPC transaction times out
      waiting for a Reply. This exposes a race condition:

      *  The MR has already been invalidated by the requester when the
         RPC Reply arrives at the RNIC. Typically this results in a
         Memory Management Operation error, and the QP is placed in the
         Error state.

      *  The MR has already been invalidated by the RNIC when the
         requester invalidates locally. This also typically results in
         a Memory Management Operation error, and the QP is placed in
         the Error state.

      A protocol mechanism that enables the requester to indicate to the
      responder that an RPC transaction has been canceled can be used to
      avoid this race. Otherwise, the requester and responder
      implementations must tolerate connection loss and
      re-establishment.

7. Security Considerations

   Remote Invalidation metadata is conveyed in the clear in
   RPC-over-RDMA headers. This does not expose any new information to
   attackers.

   A man-in-the-middle can alter Remote Invalidation metadata while it
   is in transit. Requesters are prepared to handle the case where
   responders have not invalidated any STags associated with an RPC. An
   attacker can cause other STags in flight to be invalidated before the
   responder is finished with the associated memory. Or an attacker can
   replace the "to-be invalidated" STag with an STag in the same RPC
   that should not be invalidated remotely. Any of these might cause
   loss of connection or other failures, triggering a denial-of-service
   situation.

   A connection relationship is required to exist between a requester
   and a responder.
   The requester's RNIC has associated a Protection
   Domain with that connection. The STag on the requester to be
   invalidated is associated with that Protection Domain. This protects
   against arbitrary invalidation of STags by network nodes not part of
   the connection.

   Further discussion appears in [RFC5042].

8. IANA Considerations

   This document does not require actions by IANA.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on
              RPC-over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
              June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2. Informative References

   [I-D.cel-nfsv4-rpcrdma-version-two]
              Lever, C. and D. Noveck, "RPC-over-RDMA Version 2
              Protocol", draft-cel-nfsv4-rpcrdma-version-two-07 (work in
              progress), July 2018.

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

   [MS-SMBD]  Microsoft Corporation, "SMB Remote Direct Memory Access
              (RDMA) Transport Protocol Specification", July 2016.
   [NVME]     NVM Express, Inc., "NVM Express Revision 1.2.1", July
              2016.

   [RFC7145]  Ko, M. and A. Nezhinsky, "Internet Small Computer System
              Interface (iSCSI) Extensions for the Remote Direct Memory
              Access (RDMA) Specification", RFC 7145,
              DOI 10.17487/RFC7145, April 2014.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

Acknowledgments

   The author wishes to thank Sagi Grimberg, Christoph Hellwig, Karen
   Deitke, Dave Noveck, and Tom Talpey. The author also wishes to thank
   Bill Baker and Greg Marsden for their support of this work.

   Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes for their support.

Author's Address

   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   United States of America

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com