Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: 7 January 2022                                           NetApp
                                                             6 July 2021

                    RPC-over-RDMA Version 2 Protocol
                 draft-ietf-nfsv4-rpcrdma-version-two-05

Abstract

This document specifies the second version of a transport protocol that conveys Remote Procedure Call (RPC) messages using Remote Direct Memory Access (RDMA). This version of the protocol is extensible.

Note

This note is to be removed before publishing as an RFC.

Discussion of this draft takes place on the NFSv4 working group mailing list (nfsv4@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group information can be found at https://datatracker.ietf.org/wg/nfsv4/about/.

The source for this draft is maintained in GitHub. Suggested changes can be submitted as pull requests at https://github.com/chucklever/i-d-rpcrdma-version-two. Instructions are on that page as well.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 7 January 2022.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

Table of Contents

1.  Introduction
   1.1.  Design Goals
   1.2.  Motivation for a New Version
2.  Requirements Language
3.  Terminology
   3.1.  Remote Procedure Calls
   3.2.  Remote Direct Memory Access
4.  RPC-over-RDMA Framework
   4.1.  Message Framing
   4.2.  Reliable Message Delivery
   4.3.  Initial Connection State
   4.4.  Using Direct Data Placement
   4.5.  Encoding Chunks
   4.6.  Reverse-Direction Operation
5.  Transport Properties
   5.1.  Transport Properties Model
   5.2.  Current Transport Properties
6.  Transport Messages
   6.1.  Transport Header Types
   6.2.  Headers and Chunks
   6.3.  Header Types
   6.4.  Transport Header Prefix
   6.5.  Remote Invalidation
   6.6.  Payload Formats
7.  Error Handling
   7.1.  Basic Transport Stream Parsing Errors
   7.2.  XDR Errors
   7.3.  Responder RDMA Operational Errors
   7.4.  Other Operational Errors
   7.5.  RDMA Transport Errors
8.  XDR Protocol Definition
   8.1.  Code Component License
   8.2.  Extraction of the XDR Definition
   8.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures
   8.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types
   8.5.  Use of the XDR Description
9.  RPC Bind Parameters
10.  Implementation Status
11.  Security Considerations
   11.1.  Memory Protection
   11.2.  RPC Message Security
   11.3.  Transport Properties
   11.4.  Host Authentication
12.  IANA Considerations
13.  References
   13.1.  Normative References
   13.2.  Informative References
Appendix A.  ULB Specifications
   A.1.  DDP-Eligibility
   A.2.  Maximum Reply Size
   A.3.  Reverse-Direction Operation
   A.4.  Additional Considerations
   A.5.  ULP Extensions
Appendix B.  Extending RPC-over-RDMA Version 2
   B.1.  Documentation Requirements
   B.2.  Adding New Header Types to RPC-over-RDMA Version 2
   B.3.  Adding New Transport properties to the Protocol
   B.4.  Adding New Error Codes to the Protocol
Appendix C.  Differences from RPC-over-RDMA Version 1
   C.1.  Changes to the XDR Definition
   C.2.  Transport Properties
   C.3.  Credit Management Changes
   C.4.  Inline Threshold Changes
   C.5.  Message Continuation Changes
   C.6.  Host Authentication Changes
   C.7.  Support for Remote Invalidation
   C.8.  Integration of Reverse-Direction Operation
   C.9.  Error Reporting Changes
   C.10.  Changes in Terminology
Acknowledgments
Authors' Addresses

1.  Introduction

Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a technique for moving data efficiently between network nodes. By placing transferred data directly into destination buffers using Direct Memory Access, RDMA delivers the reciprocal benefits of faster data transfer and reduced host CPU overhead.

Open Network Computing Remote Procedure Call (ONC RPC, often shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure Call protocol that runs over a variety of transports. Most RPC implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, a datagram encapsulates each RPC message. Within a TCP byte stream, a record marking protocol delineates RPC messages.

An RDMA transport, too, conveys RPC messages in a fashion that must be fully defined if RPC implementations are to interoperate when using RDMA to transport RPC transactions. Although RDMA transports encapsulate messages like UDP, they deliver them reliably and in order, like TCP. Further, they implement a bulk data transfer service not provided by traditional network transports. Therefore, we treat RDMA as a novel transport type for RPC.

1.1.  Design Goals

The general mission of RPC-over-RDMA transports is to leverage network hardware capabilities to reduce host CPU needs related to the transport of RPC messages.
In particular, this includes mitigating host interrupt rates and limiting the necessity to copy RPC payload bytes on receivers.

These hardware capabilities benefit both RPC clients and servers. On balance, however, the RPC-over-RDMA protocol design approach has been to bolster clients more than servers, as the client is typically where applications are most hungry for CPU resources.

Additionally, RPC-over-RDMA transports are designed to support RPC applications transparently. However, such transports can also provide mechanisms that enable further optimization of data transfer when RPC applications are structured to exploit direct data placement. In this context, the Network File System (NFS) family of protocols (as described in [RFC1094], [RFC1813], [RFC7530], [RFC7862], [RFC8881], and subsequent NFSv4 minor versions) are all potential beneficiaries of RPC-over-RDMA.

A complete problem statement appears in [RFC5532].

1.2.  Motivation for a New Version

Storage administrators have broadly deployed the RPC-over-RDMA version 1 protocol specified in [RFC8166]. However, there are known shortcomings to this protocol:

*  The protocol's default size of Receive buffers forces the use of RDMA Read and Write transfers for small payloads, and limits the size of reverse-direction messages.

*  It is difficult to make optimizations or protocol fixes that require changes to on-the-wire behavior.

*  For some RPC procedures, the maximum reply size is difficult or impossible for an RPC client to estimate in advance.

To address these issues in a way that preserves interoperation with existing RPC-over-RDMA version 1 deployments, the current document presents an updated version of the RPC-over-RDMA transport protocol.

This version of RPC-over-RDMA is extensible, enabling the introduction of OPTIONAL extensions without impacting existing implementations.
See Appendix C.1 for further discussion. It introduces a mechanism to exchange implementation properties to automatically provide further optimization of data transfer.

This version also contains incremental changes that relieve performance constraints and enable recovery from unusual corner cases. These changes are outlined in Appendix C and include a larger default inline threshold, the ability to convey a single RPC message using multiple RDMA Send operations, support for authentication of connection peers, richer error reporting, improved credit-based flow control, and support for Remote Invalidation.

2.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3.  Terminology

3.1.  Remote Procedure Calls

This section highlights critical elements of the RPC protocol [RFC5531] and the External Data Representation (XDR) [RFC4506] it uses. RPC-over-RDMA version 2 enables the transmission of RPC messages built using XDR and also uses XDR internally to describe its header format.

3.1.1.  Upper-Layer Protocols

RPCs are an abstraction used to implement the operations of an Upper-Layer Protocol (ULP). For RPC-over-RDMA, "ULP" refers to an RPC Program and Version tuple, which is a versioned set of procedure calls that comprise a single well-defined API. One example of a ULP is the Network File System Version 4.0 [RFC7530]. In the current document, the term "RPC consumer" refers to an implementation of a ULP running on an RPC client.

3.1.2.  Requesters and Responders

Like a local procedure call, every RPC procedure has a set of "arguments" and a set of "results".
A calling context invokes a procedure, passing arguments to it, and the procedure subsequently returns a set of results. Unlike a local procedure call, the called procedure is executed remotely rather than in the local application's execution context.

The RPC protocol as described in [RFC5531] is fundamentally a message-passing protocol between one or more clients, where RPC consumers are running, and a server, where a remote execution context is available to process RPC transactions on behalf of these consumers.

ONC RPC transactions consist of two types of messages:

*  A CALL message, or "Call", requests work. An RPC Call message is designated by the value zero (0) in the message's msg_type field. The sender places a unique 32-bit value in the message's XID field to match this RPC Call message to a corresponding RPC Reply message.

*  A REPLY message, or "Reply", reports the results of work requested by an RPC Call message. An RPC Reply message is designated by the value one (1) in the message's msg_type field. The sender copies the value contained in an RPC Reply message's XID field from the RPC Call message whose results the sender is reporting.

Each RPC client endpoint acts as a "Requester", which serializes the procedure's arguments and conveys them to a server endpoint via an RPC Call message. A Call message contains an RPC protocol header, a header describing the requested upper-layer operation, and all arguments.

An RPC server endpoint acts as a "Responder", which deserializes the arguments and processes the requested operation. It then serializes the operation's results into an RPC Reply message. An RPC Reply message contains an RPC protocol header, a header describing the upper-layer reply, and all results.

The Requester deserializes the results and allows the RPC consumer to proceed.
At this point, the RPC transaction designated by the XID in the RPC Call message is complete, and the XID is retired.

In summary, Requesters send RPC Call messages to Responders to initiate RPC transactions. Responders send RPC Reply messages to Requesters to complete the processing of an RPC transaction.

3.1.3.  RPC Transports

The role of an "RPC transport" is to mediate the exchange of RPC messages between Requesters and Responders. An RPC transport bridges the gap between the RPC message abstraction and the native operations of a network transport (e.g., a socket).

RPC-over-RDMA is a connection-oriented RPC transport. When a transport type is connection-oriented, clients initiate transport connections, while servers wait passively to accept incoming connection requests.

3.1.3.1.  Transport Failure Recovery

So that appropriate and timely recovery action can be taken, the transport implementation is responsible for notifying a Requester when an RPC Call or Reply could not be conveyed. Recovery can take the form of establishing a new connection, re-sending RPC Calls, or terminating RPC transactions pending on the Requester.

For instance, a connection loss may occur after a Responder has received an RPC Call but before it can send the matching RPC Reply. Once the transport notifies the Requester of the connection loss, the Requester can re-send all pending RPC Calls on a fresh connection.

3.1.3.2.  Forward Direction

Traditionally, an RPC client acts as a Requester, while an RPC service acts as a Responder. The current document refers to this direction of RPC message passing as "forward-direction" operation.

3.1.3.3.  Reverse-Direction

The RPC specification [RFC5531] does not forbid performing RPC transactions in the other direction. An RPC service endpoint can act as a Requester, in which case an RPC client endpoint acts as a Responder.
This direction of RPC message passing is known as "reverse-direction" operation.

During reverse-direction operation, an RPC client is responsible for establishing transport connections, even though the RPC server originates RPC Calls.

RPC clients and servers are usually optimized to perform and scale well when handling traffic in the forward direction. They might not be prepared to handle operation in the reverse direction. Not until NFS version 4.1 [RFC8881] has there been a strong need to handle reverse-direction operation.

3.1.3.4.  Bi-directional Operation

A pair of connected RPC endpoints may choose to use only forward-direction or only reverse-direction operation on a particular transport connection. Or, these endpoints may send Calls in both directions concurrently on the same transport connection.

"Bi-directional operation" occurs when both transport endpoints act as a Requester and a Responder at the same time on a single connection.

Bi-directionality is an extension of RPC transport connection sharing. Two RPC endpoints wish to exchange independent RPC messages over a shared connection but in opposite directions. These messages may or may not be related to the same workloads or RPC Programs.

3.1.3.5.  XID Values

Section 9 of [RFC5531] introduces the RPC transaction identifier, or "XID" for short. A connection peer interprets the value of an XID in the context of the message's msg_type field.

*  The XID of a Call is arbitrary but is unique among outstanding Calls from that Requester on that connection.

*  The XID of a Reply always matches that of the initiating Call.

After receiving a Reply, a Requester matches the XID value in that Reply with a Call it previously sent.

During bi-directional operation, forward- and reverse-direction XIDs are typically generated on distinct hosts by possibly different algorithms.
There is no coordination between the generation of XIDs used in forward-direction and reverse-direction operation.

Therefore, a forward-direction Requester MAY use the same XID value at the same time as a reverse-direction Requester on the same transport connection. Although such concurrent requests use the same XID value, they represent distinct RPC transactions.

3.1.4.  External Data Representation

One cannot assume that all Requesters and Responders represent data objects in the same way internally. RPC uses External Data Representation (XDR) to translate native data types and serialize arguments and results [RFC4506].

XDR encodes data independently of the endianness or size of host-native data types, enabling unambiguous decoding of data by a receiver.

XDR assumes only that the number of bits in a byte (octet) and their order are the same on both endpoints and the physical network. The smallest indivisible unit of XDR encoding is a group of four octets. XDR can also flatten lists, arrays, and other complex data types into a stream of bytes.

We refer to a serialized stream of bytes that is the result of XDR encoding as an "XDR stream". A sender encodes native data into an XDR stream and then transmits that stream to a receiver. The receiver decodes incoming XDR byte streams into its native data representation format.

3.1.4.1.  XDR Opaque Data

Sometimes, a data item is to be transferred as-is, without encoding or decoding. We refer to the contents of such a data item as "opaque data". XDR encoding places the content of opaque data items directly into an XDR stream without altering it in any way. ULPs or applications perform any needed data translation in this case. Examples of opaque data items include the content of files or generic byte strings.

3.1.4.2.  XDR Roundup

The number of octets in a variable-length data item precedes that item in an XDR stream. If the size of an encoded data item is not a multiple of four octets, the sender appends octets containing zero after the end of the data item. These zero octets shift the next encoded data item in the XDR stream so that it always starts on a four-octet boundary. The addition of extra octets does not change the encoded size of the data item. Receivers do not expose the extra octets to ULPs.

We refer to this technique as "XDR roundup", and the extra octets as "XDR roundup padding".

3.2.  Remote Direct Memory Access

When a third party transfers large RPC payloads, RPC Requesters and Responders can become more efficient. An example of such a third party might be an intelligent network interface (data movement offload), which places data in the receiver's memory so that no additional adjustment of data alignment is necessary (direct data placement or "DDP"). RDMA transports enable both of these optimizations.

In the current document, the standalone term "RDMA" refers to the physical mechanism an RDMA transport utilizes when moving data.

3.2.1.  Direct Data Placement

Typically, RPC implementations copy the contents of RPC messages into a buffer before sending them. An efficient RPC implementation sends bulk data without first copying it into a separate send buffer.

However, socket-based RPC implementations are often unable to receive data directly into its final place in memory. Receivers often need to copy incoming data to finish an RPC operation, if only to adjust data alignment.

Although it may not be efficient, before an RDMA transfer, a sender may copy data into an intermediate buffer. After an RDMA transfer, a receiver may copy that data again to its final destination.
In this document, the term "DDP" refers to any optimized data transfer where a receiving host's CPU does not move transferred data to another location after arrival.

RPC-over-RDMA version 2 enables the use of RDMA Read and Write operations to achieve both data movement offload and DDP. However, note that not all RDMA-based data transfer qualifies as DDP, and some mechanisms that do not employ explicit RDMA can place data directly.

3.2.2.  RDMA Transport Operation

RDMA transports require that RDMA consumers provision resources in advance to achieve good performance during receive operations. An RDMA consumer might provide Receive buffers in advance by posting an RDMA Receive Work Request for every expected RDMA Send from a remote peer. These buffers are provided before the remote peer posts RDMA Send Work Requests. Thus, this is often referred to as "pre-posting" buffers.

An RDMA Receive Work Request remains outstanding until the RDMA provider matches it to an inbound Send operation. The resources associated with that Receive must be retained in host memory, or "pinned", until the Receive completes.

Given these tenets of operation, the RPC-over-RDMA version 2 protocol assumes each transport provides the following abstract operations. A more complete discussion of these operations appears in [RFC5040].

3.2.2.1.  Memory Registration

Memory registration assigns a steering tag to a region of memory, permitting the RDMA provider to perform data-transfer operations. The RPC-over-RDMA version 2 protocol assumes that a steering tag of no more than 32 bits and memory addresses of up to 64 bits in length identify each registered memory region.

3.2.2.2.  RDMA Send

The RDMA provider supports an RDMA Send operation, with completion signaled on the receiving peer after the RDMA provider has placed data in a pre-posted buffer.
Sends complete at the receiver in the order they were posted at the sender. The size of the remote peer's pre-posted buffers limits the amount of data that can be transferred by a single RDMA Send operation.

3.2.2.3.  RDMA Receive

The RDMA provider supports an RDMA Receive operation to receive data conveyed by incoming RDMA Send operations. To reduce the amount of memory that must remain pinned awaiting incoming Sends, the amount of memory posted per Receive is limited. The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol) provides flow control to prevent overrunning receiver resources.

3.2.2.4.  RDMA Write

The RDMA provider supports an RDMA Write operation to place data directly into a remote memory region. The local host initiates an RDMA Write and the RDMA provider signals completion there. The remote RDMA provider does not signal completion on the remote peer. The local host provides the steering tag, the memory address, and the length of the remote peer's memory region.

RDMA Writes are not ordered relative to one another, but are ordered relative to RDMA Sends. Thus, a subsequent RDMA Send completion signaled on the local peer guarantees that prior RDMA Write data has been successfully placed in the remote peer's memory.

3.2.2.5.  RDMA Read

The RDMA provider supports an RDMA Read operation to place remote source data directly into local memory. The local host initiates an RDMA Read and the RDMA provider signals completion there. The remote RDMA provider does not signal completion on the remote peer. The local host provides the steering tags, the memory addresses, and the lengths for the remote source and local destination memory regions.

The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol) signals Read completion to the remote peer as part of a subsequent RDMA Send message.
The remote peer can then invalidate steering tags and subsequently free associated source memory regions.

4.  RPC-over-RDMA Framework

Before an RDMA data transfer can occur, an endpoint first exposes regions of its memory to a remote endpoint. The remote endpoint then initiates RDMA Read and Write operations against the exposed memory. A "transfer model" designates which endpoint exposes its memory and which is responsible for initiating the transfer of data.

In RPC-over-RDMA version 2, only Requesters expose their memory to the Responder, and only Responders initiate RDMA Read and Write operations. Read access to memory regions enables the Responder to pull RPC arguments or whole RPC Calls from each Requester. The Responder pushes RPC results or whole RPC Replies to a Requester's memory regions to which it has write access.

4.1.  Message Framing

Each RPC-over-RDMA version 2 message consists of at most two XDR streams:

*  The "Transport stream" contains a header that describes and controls the transfer of the Payload stream in this RPC-over-RDMA message. Every RDMA Send on an RPC-over-RDMA version 2 connection MUST begin with a Transport stream.

*  The "Payload stream" contains part or all of a single RPC message. The sender MAY divide an RPC message at any convenient boundary but MUST send RPC message fragments in XDR stream order and MUST NOT interleave Payload streams from multiple RPC messages.

The RPC-over-RDMA framing mechanism described in this section replaces all other RPC framing mechanisms. Connection peers use RPC-over-RDMA framing even when the underlying RDMA protocol runs on a transport type with well-defined RPC framing, such as TCP. However, a ULP can negotiate the use of RDMA, dynamically enabling the use of RPC-over-RDMA on a connection established on some other transport type.
Because RPC framing delimits an entire RPC request or reply, the resulting shift in framing must occur between distinct RPC messages, and in concert with the underlying transport.

4.2.  Reliable Message Delivery

RPC-over-RDMA provides a reliable and in-order data transport service for RPC Calls and Replies.

RPC-over-RDMA transports MUST operate only on a reliable Queue Pair (QP) such as the RDMA RC (Reliable Connected) QP type as defined in Section 9.7.7 of [IBA]. The Marker PDU Aligned (MPA) protocol [RFC5044], when deployed on a reliable transport such as TCP, provides similar functionality. Using a reliable QP type ensures in-transit data integrity and proper recovery from packet loss in the lower layers.

If any pre-posted Receive buffer on the connection is not large enough to contain an incoming message, the receiving RDMA provider cannot deliver that message to the upper-layer consumer. Likewise, if no pre-posted Receive buffer is available to accept an incoming message, the receiving RDMA provider cannot pass that message to the consumer. Exceeding these limits results in a transition to a QP error state, the loss of an in-flight message, and the potential loss of the connection.

Therefore, senders need to respect peer receiver resource limits to ensure that the transport service can deliver every message reliably. Two operational parameters communicate these limits between RPC-over-RDMA peers: credits and inline threshold.

4.2.1.  Flow Control

RPC-over-RDMA version 2 employs end-to-end credit-based flow control on each connection to prevent senders from transmitting more messages than a receiver is prepared to accept [CBFC]. Credit-based flow control is relatively simple, providing robust operation in the face of bursty traffic and automated management of receive buffer allocation while enabling effective pipelining.
A simplified sliding
window approach is all that is necessary for our purposes because the
underlying RDMA transport service already guarantees reliable and
in-order message delivery.

An RPC-over-RDMA version 2 credit represents the capability to convey
exactly one RPC-over-RDMA version 2 message, regardless of its size,
via an RDMA Send/Receive pair. This arrangement enables
RPC-over-RDMA version 2 transport connections to support multiple
unacknowledged messages in each direction.

Because an RPC-over-RDMA version 2 connection is full-duplex, each
connection peer has its own set of credits. The two receivers manage
their credits independently, although they typically communicate
these values by piggy-backing them on a payload-bearing message in
the opposite direction.

Each RPC-over-RDMA version 2 message header contains two fields that
handle credit accounting:

* The rdma_credit field contains the receiver's credit window size.
  This field functions in much the same way as the "credit grants"
  used in RPC-over-RDMA version 1. When sending an RPC-over-RDMA
  message, a peer fills in this field with the number of
  unacknowledged messages it is prepared to receive on this
  connection.

* A second field, which is yet to be defined, reports the number of
  messages received so far. When sending an RPC-over-RDMA message,
  a peer fills in this field with the count of messages it has
  already received on this connection.

A sender also keeps track of the count of messages it has sent on the
connection. The sender MUST stop transmitting messages when the
number of messages it has already sent is about to exceed the number
of messages the receiver has acknowledged plus the receiver's credit
window.

A receiver MAY adjust the credit window to match the needs or
policies in effect on either peer.
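
The sender-side rule above can be sketched as follows. This is an
illustration only, not part of the protocol definition. The class and
attribute names are hypothetical; in particular, `peer_received`
stands in for the second header field, which this document has not yet
defined. The sketch assumes plain integer counters and ignores
sequence number wraparound. The initial window of one message reflects
the rule in Section 4.3 that a client sends only a single message
until it receives a credit window update.

```python
from dataclasses import dataclass


@dataclass
class SendCreditState:
    """Sender-side credit accounting for one direction (hypothetical)."""

    sent: int = 0           # messages this sender has transmitted so far
    peer_received: int = 0  # last "messages received" count from the peer
    peer_window: int = 1    # last rdma_credit value from the peer

    def may_send(self) -> bool:
        # One more message is allowed only if the total sent would not
        # exceed the acknowledged count plus the peer's credit window.
        return self.sent + 1 <= self.peer_received + self.peer_window

    def on_send(self) -> None:
        if not self.may_send():
            raise RuntimeError("credit window exhausted")
        self.sent += 1

    def on_header(self, received: int, window: int) -> None:
        # Record the credit fields carried by a header from the peer.
        self.peer_received = received
        self.peer_window = window
```

Because on_header() simply overwrites the stored window, the sketch
follows whatever window size the receiver currently advertises,
including a reduced one.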
For instance, a peer may reduce
the size of its credit window to accommodate the available resources
in a Shared Receive Queue. Certain RDMA implementations may impose
additional flow-control restrictions, such as limits on RDMA Read
operations in progress at the Responder. Accommodation of such
checks is considered the responsibility of each RPC-over-RDMA version
2 implementation.

The credit window size MUST be less than the total sequence number
range (a more quantitative bound is to be specified in a later
revision of this document). The receiver generally chooses a credit
window that is large enough to maximize throughput, given the
bandwidth-delay product of the connection, while not overwhelming
memory resources on the local system.

4.2.1.1. Asynchronous Credit Grants

Credit accounting information is usually piggy-backed on data-bearing
messages. However, on occasion, a receiver might need to refresh its
credit window without sending an RPC payload. A receiving peer can
use an alternate header type when the sender's credit window is
exhausted during a stream of unacknowledged messages. See
Section 6.3.2 for information about this header type.

Unlike RPC-over-RDMA version 1, the credit window on an RPC-over-RDMA
version 2 connection MAY be zero. In that case, the sender waits
until the receiver sends it an asynchronous credit refresh.

Therefore, receivers MUST always be in a position to receive one
asynchronous credit update message, in addition to payload-bearing
messages, to prevent transport deadlock. A receiver can do this by
posting one more RDMA Receive than the advertised credit window.

4.2.2. Inline Threshold

An "inline threshold" value is the largest message size (in octets)
that can be conveyed in one direction between peer implementations
using RDMA Send and Receive channel operations.
An inline threshold
value is less than or equal to the largest number of octets the
sender can post in a single RDMA Send operation. It is also less
than or equal to the largest number of octets the receiver can
reliably accept via a single RDMA Receive operation.

Each connection has two inline threshold values. There is one for
messages flowing from Requester-to-Responder (referred to as the
"call inline threshold"), and one for messages flowing from
Responder-to-Requester (referred to as the "reply inline threshold").

Peers can advertise their inline threshold values via RPC-over-RDMA
version 2 Transport Properties (see Section 5). In the absence of an
exchange of Transport Properties, connection peers MUST assume both
inline thresholds are 4096 octets.

4.3. Initial Connection State

When an RPC-over-RDMA version 2 client establishes a connection to a
server, its first order of business is to determine the server's
highest supported protocol version.

Upon connection establishment, a client MUST send only a single
RPC-over-RDMA message until it receives a valid RPC-over-RDMA message
from the server that provides a credit window update.

The second word of each transport header conveys the transport
protocol version. In the interest of clarity, the current document
refers to that word as rdma_vers even though in the RPC-over-RDMA
version 2 XDR definition, it appears as rdma_start.rdma_vers.

Immediately after the client establishes a connection, it sends a
single valid RPC-over-RDMA message with the value two (2) in the
rdma_vers field. Because the server might support only RPC-over-RDMA
version 1, this initial message MUST NOT be larger than the version 1
default inline threshold of 1024 octets.

4.3.1. Server Supports RPC-over-RDMA Version 2

If the server supports RPC-over-RDMA version 2, it sends
RPC-over-RDMA messages back to the client with the value two (2) in
the rdma_vers field. Both peers may assume the default inline
threshold value for RPC-over-RDMA version 2 connections (4096 octets).

4.3.2. Server Does Not Support RPC-over-RDMA Version 2

If the server does not support RPC-over-RDMA version 2, it MUST send
an RPC-over-RDMA message to the client with an XID that matches the
client's first message, RDMA2_ERROR in the rdma_start.rdma_htype
field, and with the error code RDMA2_ERR_VERS. This message also
reports the range of RPC-over-RDMA protocol versions that the server
supports. To continue operation, the client selects a protocol
version in that range for subsequent messages on this connection.

If the connection is dropped immediately after an
RDMA2_ERROR/RDMA2_ERR_VERS message is received, the client should try
to avoid a version negotiation loop when re-establishing another
connection. It can assume that the server does not support
RPC-over-RDMA version 2. A client can assume the same situation
(i.e., no server support for RPC-over-RDMA version 2) if the initial
negotiation message is lost or dropped. Once the version negotiation
exchange is complete, both peers may use the default inline threshold
value for the negotiated transport protocol version.

4.3.3. Client Does Not Support RPC-over-RDMA Version 2

The server examines the RPC-over-RDMA protocol version used in the
first RPC-over-RDMA message it receives. If it supports this
protocol version, it MUST use it in all subsequent messages it sends
on that connection. The client MUST NOT change the protocol version
for the duration of the connection.

4.4. Using Direct Data Placement

RPC-over-RDMA version 2 provides a mechanism for moving part of an
RPC message via a data transfer distinct from RDMA Send and Receive.
For example, a sender can remove one or more XDR data items from the
Payload stream. These items are then conveyed via other mechanisms,
such as one or more RDMA Read or Write operations.

4.4.1. Chunks and Segments

A Requester records the location information for each registered
memory region associated with an RPC payload in the transport header
of an RPC-over-RDMA message. With this information, the Responder
uses RDMA Read and Write operations to retrieve arguments contained
in the specified region of the Requester's memory or place results in
that region.

A "segment" is a transport header data object that contains the
precise coordinates of a contiguous registered memory region. Each
segment contains the following information:

Handle: A steering Tag (STag) or R_key generated by registering this
   memory with the RDMA provider.

Length: The length of the segment's memory region, in octets. The
   length of a segment MAY be aligned to a single octet. An "empty
   segment" is defined as a segment with the value zero (0) in its
   length field.

Offset: The offset or beginning memory address of the segment's
   memory region.

The meaning of the values contained in these fields is elaborated in
[RFC5040].

A "chunk" is simply a set of segments that have a related purpose. A
Requester MAY divide a chunk into segments using any convenient
boundaries. The length of a chunk is defined as the sum of the
lengths of the segments that comprise it.

4.4.2. Reducing a Payload Stream

We refer to a data item that a sender removes from a Payload stream
to transmit separately as a "reduced" data item.
After a sender has
finished removing XDR data items from a Payload stream, we refer to
it as a "reduced" Payload stream. A set of segments that describe
memory regions containing a single reduced data item is categorized
as a "data item chunk."

Not all XDR data items benefit from Direct Data Placement. For
example, small data items or data items that require XDR unmarshaling
by the receiver do not benefit from DDP. Moreover, it is impractical
for receivers to prepare for every possible XDR data item in a
protocol to appear in a data item chunk.

Specifying which data items are DDP-eligible is done in separate
standards track documents known as "Upper Layer Bindings". A ULB
identifies which XDR data items a peer MAY transfer using DDP. We
refer to such data items as "DDP-eligible." Senders MUST NOT reduce
any other XDR data items. Detailed requirements for ULB
specifications appear in Appendix A of the current document.

4.4.3. Moving Whole RPC Messages using Explicit RDMA

RPC-over-RDMA version 2 also enables the movement of a whole RPC
message via data transfer distinct from RDMA Send and Receive. A
sender registers the memory containing a Payload stream without
regard to data item boundaries or DDP-eligibility. The Payload
stream is then conveyed via other mechanisms, such as one or more
RDMA Read or Write operations. A set of segments that describe
memory regions containing a Payload stream is categorized as a "body
chunk".

A sender may first reduce that Payload stream if it contains one or
more DDP-eligible data items. The sender moves these data items
using data item chunks, and the reduced Payload stream using a body
chunk.

4.5. Encoding Chunks

The RPC-over-RDMA version 2 transport protocol does not place a limit
on chunk size. However, each ULP may cap the amount of data that can
be transferred by a single RPC transaction.
For example, NFS
implementations typically have settings that restrict the payload
size of NFS READ and WRITE operations. The Responder can use such
limits to sanity check chunk sizes before using them in RDMA
operations.

4.5.1. Read Chunks

A "Read chunk" contains data that its receiver pulls from the sender.
Each Read chunk is a set of one or more "Read segments" encoded as a
list. A Read segment consists of a Position field followed by a
segment, as defined in Section 4.4.1.

Position: The byte offset in the unreduced Payload stream where the
   receiver reinserts the data item conveyed in the chunk. The
   sender MUST compute the Position value from the beginning of the
   unreduced Payload stream, which begins at Position zero. All
   segments in the same Read chunk share the same Position value,
   even if one or more of the segments have a non-four-byte-aligned
   length. The value in this field MUST be a multiple of four.

When constructing an RPC-over-RDMA message, the sender registers
memory regions containing data intended for RDMA Read operations. It
advertises the coordinates of these regions in Read chunks added to
the transport header of an RPC-over-RDMA message.

The receiver of this message then pulls the chunk's data from the
sender using RDMA Read operations. When receiving a Read chunk, the
receiver inserts the first Read segment in a Read chunk into the
Payload stream at the byte offset indicated by its Position field.
The receiver concatenates Read segments whose Position field value
matches this offset until there are no more Read segments at that
Position value.

4.5.1.1. The Read List

Each RPC-over-RDMA message carries a list of Read segments that make
up the set of Read chunks for that message. When no RDMA Read
operations are needed to complete the transmission of the message's
Payload stream, the message's Read list is empty.
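
The grouping rule from Section 4.5.1 can be sketched as follows: a
receiver walks the Read list and collects consecutive Read segments
that share a Position value into one Read chunk. This is a minimal
illustration, not a wire-format decoder. The ReadSegment class and the
function names are hypothetical, and the field types are assumed to be
plain integers.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class ReadSegment:
    position: int  # byte offset in the unreduced Payload stream
    handle: int    # STag or R_key of the registered region
    length: int    # length of the region, in octets
    offset: int    # beginning address of the region


def read_chunks(
    read_list: List[ReadSegment],
) -> List[Tuple[int, List[ReadSegment]]]:
    """Group consecutive segments sharing a Position into Read chunks."""
    return [
        (position, list(segments))
        for position, segments in groupby(read_list, key=lambda s: s.position)
    ]


def chunk_length(segments: List[ReadSegment]) -> int:
    """A chunk's length is the sum of its segment lengths."""
    return sum(s.length for s in segments)
```

An empty Read list yields no chunks, which corresponds to a message
that needs no RDMA Read operations.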

If a Responder receives a Read list whose segment position values do
not appear in monotonically increasing order, it MUST discard the
message without processing it and respond with an RDMA2_ERROR message
with the rdma_xid field set to the XID of the malformed message and
the rdma_err field set to RDMA2_ERR_BAD_XDR.

4.5.1.2. The Call Chunk

The Call chunk is a Read chunk that acts as a body chunk containing
an RPC Call message. A Requester can utilize a Call chunk at any
time. However, using a Call chunk is less efficient than an RDMA
Send.

A Read chunk may act as either a data item chunk or a body chunk.
When the chunk's position is zero, it acts as a body chunk.
Otherwise, it is a data item chunk containing exactly one XDR data
item.

4.5.1.3. Read Completion

A Responder acknowledges that it is finished with the Requester's
Read chunk memory regions when it sends the corresponding RPC Reply
message. The Requester may then invalidate memory regions belonging
to Read chunks associated with the corresponding RPC Call message.

4.5.2. Write Chunks

Each "Write chunk" consists of a counted array of zero or more
segments, as defined in Section 4.4.1. The function of a Write chunk
depends on the direction of the containing RPC-over-RDMA message. In
a Call message, a Write chunk advertises registered memory regions
into which the Responder may push data. In a Reply message, a Write
chunk reports how much data has been pushed.

A Requester provisions Write chunks for an RPC transaction long
before the Responder has constructed a corresponding Reply message.
A Requester typically does not know the actual length of the result
data items or Reply to be returned, since the Reply does not yet
exist. Thus, a Requester MUST provision Write chunks large enough to
accommodate the maximum possible size of each returned data item.
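
As an illustration of the provisioning requirement, the sketch below
splits the maximum possible size of a result into segment lengths
whose sum covers that size. It is an assumption-laden example, not a
prescribed algorithm: the function name is hypothetical, the split at
a fixed maximum segment size is just one convenient boundary choice,
and a real Requester would also register memory and record a handle
and offset for each segment.

```python
from typing import List


def provision_write_chunk(max_item_size: int, max_segment_size: int) -> List[int]:
    """Return segment lengths that together cover max_item_size octets."""
    if max_item_size < 0 or max_segment_size <= 0:
        raise ValueError("sizes must be non-negative and segment size positive")
    lengths = []
    remaining = max_item_size
    while remaining > 0:
        # Fill each segment up to the assumed per-segment limit.
        n = min(remaining, max_segment_size)
        lengths.append(n)
        remaining -= n
    return lengths
```

The chunk length, as defined in Section 4.4.1, is the sum of the
returned values, so it always equals max_item_size.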

An "empty Write chunk" is a Write chunk with a zero segment count.
By definition, the length of an empty Write chunk is zero. An
"unused Write chunk" has a non-zero segment count, but all of its
segments are empty segments.

4.5.2.1. The Write List

Each RPC-over-RDMA message carries a list of Write chunks. When no
DDP-eligible data items are to appear in the Reply to an RPC
transaction, the Requester provides an empty Write list in the RPC
Call, and the Responder leaves the Write list empty in the matching
RPC Reply. When a Write chunk appears in the Write list, it acts
only as a data item chunk.

For each Write chunk in the Write list, the Responder pushes one
DDP-eligible data item to the Requester. It fills the chunk
contiguously and in segment array order until the Responder has
written that data item to the Requester in its entirety. The
Responder MUST copy the segment count and all segments from the
Requester-provided Write chunk into the RPC Reply message's transport
header. As it does so, the Responder updates each segment length
field to reflect the actual amount of data returned in that segment.

The Responder then sends the RPC Reply message via an RDMA Send
operation.

4.5.2.2. The Reply Chunk

The Reply chunk is a single Write chunk that acts as a body chunk
that contains an RPC Reply message. When a Requester estimates that
the Reply message can exceed the connection's ability to convey that
Reply using RDMA Send operations, it should provision a Reply chunk.

4.5.2.3. Write Completion

A Responder acknowledges that it is finished updating the Requester's
Write chunk memory regions when it sends the corresponding RPC Reply
message. The RDMA provider guarantees that the written data is at
rest before the next Receive operation, which typically contains the
corresponding RPC Reply, completes.
The Requester may then
invalidate memory regions belonging to Write chunks associated with
the corresponding RPC Call message.

4.5.2.4. Write Chunk Roundup

When provisioning a Write chunk for a variable-length result data
item, the Requester MUST NOT include additional space for XDR roundup
padding. A Responder MUST NOT write XDR roundup padding into a Write
chunk, even if the result is shorter than the available space in the
chunk.

4.5.3. Reducing Complex XDR Data Types

XDR data items may appear in body chunks without regard to their
DDP-eligibility. As body chunks contain a Payload stream, they MUST
include all appropriate XDR roundup padding to maintain proper XDR
alignment of their contents.

However, a data item chunk MUST contain only one XDR data item, and
the chunk MUST occupy a four-byte aligned length in the Payload
stream so that subsequent data items remain properly aligned once the
reduced data item is removed from the Payload stream.

4.5.3.1. Variable-Length Data Items

When a sender reduces a variable-length XDR data item, the length of
the item MUST remain in the Payload stream. The sender MUST omit the
item's XDR roundup padding from the Payload stream and the chunk.
The chunk's total length MUST be the same as the encoded length of
the data item.

4.5.3.2. Counted Arrays

When reducing a data item that is a counted array data type, the
count of array elements MUST remain in the Payload stream. The
sender MUST move the array elements into the chunk. For example,
when encoding an opaque byte array as a chunk, the count of bytes
stays in the Payload stream, and the sender places the bytes in the
array in the chunk.

Individual array elements appear in a chunk in their entirety. For
example, when encoding an array of arrays as a chunk, the count of
items in the enclosing array stays in the Payload stream.
But each
enclosed array, including its item count, is transferred as part of
the chunk.

4.5.3.3. Optional-Data

Similar to a counted array, when reducing an optional-data data type,
the discriminator field MUST remain in the Payload stream. The
sender MUST place the data, when present, in the chunk.

4.5.3.4. XDR Unions

A union data type MUST NOT be made DDP-eligible. However, one or
more of its arms MAY be made DDP-eligible, subject to the other
requirements in this section.

4.6. Reverse-Direction Operation

The terminology used in this section is introduced in
Section 3.1.3.3.

4.6.1. Sending a Reverse-Direction RPC Call

An RPC-over-RDMA server endpoint constructs the transport header for
a reverse-direction RPC Call as follows:

* The server generates a new XID value (see Section 3.1.3.5 for full
  requirements) and places it in the rdma_xid field of the transport
  header and the xid field of the RPC Call message. The RPC Call
  header MUST start with the same XID value that is present in the
  transport header.

* The rdma_vers field of each reverse-direction Call MUST contain
  the same value as forward-direction Calls on the same connection.

* The server fills in the rdma_credit field with the credit values
  for the connection, as described in Section 4.2.1.

* The server determines the Payload format for the RPC message and
  fills in the rdma_htype field as appropriate (see Sections 6.6 and
  4.6.4). Section 4.6.4 also covers the disposition of the chunk
  lists.

4.6.2. Sending a Reverse-Direction RPC Reply

An RPC-over-RDMA client endpoint constructs the transport header for
a reverse-direction RPC Reply as follows:

* The client copies the XID value from the matching RPC Call and
  places it in the rdma_xid field of the transport header and the
  xid field of the RPC Reply message.
  The RPC Reply header MUST
  start with the same XID value that is present in the transport
  header.

* The rdma_vers field of each reverse-direction Reply MUST contain
  the same value as forward-direction Replies on the same
  connection.

* The client fills in the rdma_credit field with the credit values
  for the connection, as described in Section 4.2.1.

* The client determines the Payload format for the RPC message and
  fills in the rdma_htype field as appropriate (see Sections 6.6 and
  4.6.4). Section 4.6.4 also covers the disposition of the chunk
  lists.

4.6.3. When Reverse-Direction Operation is Not Supported

An RPC-over-RDMA transport endpoint does not have to support
reverse-direction operation. There might be no mechanism in the
transport implementation to do so. Or, the transport implementation
might support operation in the reverse direction, but the Upper-Layer
Protocol might not configure the transport to handle
reverse-direction traffic.

If an endpoint is unprepared to receive a reverse-direction message,
loss of the RDMA connection might result. Thus, a denial of service
can occur if an RPC server continues to send reverse-direction
messages after a client that is not prepared to receive them
reconnects to that server.

Connection peers indicate their support for reverse-direction
operation as part of the exchange of Transport Properties just after
a connection is established (see Section 5.2.5).

When dealing with the possibility that the remote peer has no
transport level support for reverse-direction operation, the
Upper-Layer Protocol is responsible for informing peers when
reverse-direction operation is supported. Otherwise, even a simple
reverse-direction RPC NULL procedure from a peer could result in a
lost connection.
Therefore, an Upper-Layer Protocol MUST NOT perform
reverse-direction RPC operations until the RPC client indicates
support for them.

4.6.4. Using Chunks During Reverse-Direction Operation

Reverse-direction operations can use chunks for DDP-eligible data
items and Special payload formats the same way chunks are used in
forward-direction operation. Connection peers indicate their support
for using chunks in the reverse direction as part of the exchange of
Transport Properties just after a connection is established (see
Section 5.2.5).

However, an implementation might support only Upper-Layer Protocols
that have no DDP-eligible data items. Such Upper-Layer Protocols can
use only small messages, or they might have a native mechanism for
restricting the size of reverse-direction RPC messages, obviating the
need to handle chunks in the reverse direction.

When there is no Upper-Layer Protocol need for chunks in the reverse
direction, implementers MAY choose not to provide support for chunks
in the reverse direction, thus avoiding the complexity of
implementing support for RDMA Reads and Writes in the reverse
direction. When an RPC-over-RDMA transport implementation does not
support chunks in the reverse direction, RPC endpoints use only the
Simple Payload format without data item chunks or the Continued
Payload format without data item chunks to send RPC messages in the
reverse direction.

If a reverse-direction Requester provides a non-empty chunk list to a
Responder that does not support chunks, the Responder MUST report its
lack of support using one of the error values defined in Section 7.3.

4.6.5. Reverse-Direction Retransmission

In rare cases, an RPC server cannot complete an RPC transaction and
cannot send a Reply. In these cases, the Requester may send the RPC
transaction again using the same RPC XID.
We refer to this as an
"RPC retransmission" or a "replay."

In the forward direction, an RPC client is the Requester. The client
is always responsible for ensuring a transport connection is in place
before sending a dropped Call again.

With reverse-direction operation, an RPC server is the Requester.
Because an RPC server is not responsible for establishing transport
connections with clients, the Requester is unable to retransmit a
reverse-direction Call whenever there is no transport connection. In
this case, the RPC server must wait for the RPC client to
re-establish a transport connection before it can retransmit
reverse-direction RPC Calls.

If the forward-direction Requester has no work to do, it can be some
time before the RPC client re-establishes a transport connection. An
RPC server may need to abandon a pending reverse-direction RPC Call
to avoid waiting indefinitely for the client to re-establish a
transport connection.

Therefore, forward-direction Requesters SHOULD maintain a transport
connection as long as the RPC server might send reverse-direction
Calls. For example, while an NFS version 4.1 client has open
delegated files or active pNFS layouts, it maintains one or more
transport connections to enable the NFS server to perform callback
operations.

5. Transport Properties

RPC-over-RDMA version 2 enables connection endpoints to exchange
information about implementation properties. Compatible endpoints
use this information to optimize data transfer. Initially, only a
small set of transport properties are defined. The protocol provides
header types to exchange transport properties (see Sections 6.3.3
and 6.3.4).

Both the set of transport properties and the operations used to
communicate them may be extended. Within RPC-over-RDMA version 2,
such extensions are OPTIONAL.
A discussion of extending the set of
transport properties appears in Appendix B.3.

5.1. Transport Properties Model

The current document specifies a basic set of receiver and sender
properties. Such properties are specified using a code point that
identifies the particular transport property and a nominally opaque
array containing the XDR encoding of the property.

The following XDR types handle transport properties:

   typedef uint32 rpcrdma2_propid;

   struct rpcrdma2_propval {
           rpcrdma2_propid rdma_which;
           opaque          rdma_data<>;
   };

   typedef rpcrdma2_propval rpcrdma2_propset<>;

   typedef uint32 rpcrdma2_propsubset<>;

The rpcrdma2_propid type specifies a distinct transport property.
The property code points are defined as const values rather than
elements in an enum type to enable the extension by concatenating XDR
definition files.

The rpcrdma2_propval type carries the value of a transport property.
The rdma_which field identifies the particular property, and the
rdma_data field contains the associated value of that property. A
zero-length rdma_data field represents the default value of the
property specified by rdma_which.

Although the rdma_data field is opaque, receivers interpret its
contents using the XDR type associated with the property specified by
rdma_which. When the contents of the rdma_data field do not conform
to that XDR type, the receiver MUST return the error
RDMA2_ERR_BAD_PROPVAL using the header type RDMA2_ERROR, as described
in Section 6.3.1.

For example, the receiver of a message containing a valid
rpcrdma2_propval returns this error if the length of rdma_data is
greater than the length of the transferred message.
Also, when the
receiver recognizes the rpcrdma2_propid contained in rdma_which, it
MUST report the error RDMA2_ERR_BAD_PROPVAL if either of the
following occurs:

* The nominally opaque data within rdma_data is not valid when
  interpreted using the property-associated typedef.

* The length of rdma_data is insufficient to contain the data
  represented by the property-associated typedef.

A receiver does not report an error if it does not recognize the
value contained in rdma_which. In that case, the receiver does not
process that rpcrdma2_propval. Processing continues with the next
rpcrdma2_propval, if any.

The rpcrdma2_propset type specifies a set of transport properties.
The protocol does not impose a particular ordering of the
rpcrdma2_propval items within it.

The rpcrdma2_propsubset type identifies a subset of the properties in
a rpcrdma2_propset. Each bit in the mask denotes a particular
element in a previously specified rpcrdma2_propset. If a particular
rpcrdma2_propval is at position N in the array, then bit number N mod
32 in word N div 32 specifies whether the defined subset includes
that particular rpcrdma2_propval. Words beyond the last one
specified are assumed to contain zero.

5.2. Current Transport Properties

Table 1 specifies a basic set of transport properties. The columns
contain the following information:

* The column labeled "Property" contains a name of the transport
  property described by the current row.

* The column labeled "Code" specifies the code point that identifies
  this property.

* The column labeled "XDR type" gives the XDR type of the data used
  to communicate the value of this property. This data type
  overlays the data portion of the nominally opaque rdma_data field.

* The column labeled "Default" gives the default value for the
  property.

* The column labeled "Section" indicates the section within the
  current document that explains the use of this property.

+===========================+======+==========+=========+=========+
| Property                  | Code | XDR type | Default | Section |
+===========================+======+==========+=========+=========+
| Maximum Send Size         | 1    | uint32   | 4096    | 5.2.1   |
+---------------------------+------+----------+---------+---------+
| Receive Buffer Size       | 2    | uint32   | 4096    | 5.2.2   |
+---------------------------+------+----------+---------+---------+
| Maximum Segment Size      | 3    | uint32   | 1048576 | 5.2.3   |
+---------------------------+------+----------+---------+---------+
| Maximum Segment Count     | 4    | uint32   | 16      | 5.2.4   |
+---------------------------+------+----------+---------+---------+
| Reverse-Direction Support | 5    | uint32   | 0       | 5.2.5   |
+---------------------------+------+----------+---------+---------+
| Host Auth Message         | 6    | opaque<> | N/A     | 5.2.6   |
+---------------------------+------+----------+---------+---------+

                              Table 1

5.2.1. Maximum Send Size

The value of this property specifies the maximum size, in octets, of
Send payloads. The endpoint receiving this value can size its
Receive buffers based on the value of this property.

   const uint32 RDMA2_PROPID_SBSIZ = 1;
   typedef uint32 rpcrdma2_prop_sbsiz;

5.2.2. Receive Buffer Size

The value of this property specifies the minimum size, in octets, of
pre-posted receive buffers.

   const uint32 RDMA2_PROPID_RBSIZ = 2;
   typedef uint32 rpcrdma2_prop_rbsiz;

A sender can subsequently use this value to determine when a message
to be sent fits in pre-posted receive buffers that the receiver has
set up. In particular:

* Requesters may use the value to determine when to use a Call chunk
  or Message Continuation when sending a Call.

   * Requesters may use the value to determine when to provide a Reply
     chunk when sending a Call, based on the maximum possible size of
     the Reply.

   * Responders may use the value to determine when to use a Reply
     chunk provided by the Requester, given the actual size of a Reply.

5.2.3.  Maximum Segment Size

   The value of this property specifies the maximum size, in octets, of
   a segment this endpoint is prepared to send or receive.

      const uint32 RDMA2_PROPID_RSSIZ = 3;
      typedef uint32 rpcrdma2_prop_rssiz;

5.2.4.  Maximum Segment Count

   The value of this property specifies the maximum number of segments
   that can appear in a Requester's transport header.

      const uint32 RDMA2_PROPID_RCSIZ = 4;
      typedef uint32 rpcrdma2_prop_rcsiz;

5.2.5.  Reverse-Direction Support

   The value of this property specifies a client implementation's
   readiness to process messages that are part of reverse-direction RPC
   requests.

      const uint32 RDMA2_RVRSDIR_NONE = 0;
      const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
      const uint32 RDMA2_RVRSDIR_CONT = 2;
      const uint32 RDMA2_RVRSDIR_GENL = 3;

      const uint32 RDMA2_PROPID_BRS = 5;
      typedef uint32 rpcrdma2_prop_brs;

   Multiple levels of support are distinguished:

   * The value RDMA2_RVRSDIR_NONE indicates that the sender does not
     support reverse-direction operation.

   * The value RDMA2_RVRSDIR_SIMPLE indicates that the sender supports
     using only Simple Format messages without data item chunks for
     reverse-direction messages.

   * The value RDMA2_RVRSDIR_CONT indicates that the sender supports
     using either Simple Format without data item chunks or Continued
     Format messages without data item chunks for reverse-direction
     messages.

   * The value RDMA2_RVRSDIR_GENL indicates that the sender supports
     reverse-direction messages in the same way as forward-direction
     messages.
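
   The support levels listed above can be read as a small decision
   table. The following Python sketch is illustrative only: the class
   RvrsDir, the function reverse_direction_permits, and its parameters
   are hypothetical names introduced here, not part of the protocol's
   XDR. It assumes the caller already knows which payload format
   ("simple", "continued", or "special") it intends to use.

```python
from enum import IntEnum


class RvrsDir(IntEnum):
    """Values of the Reverse-Direction Support property (code 5)."""
    NONE = 0    # no reverse-direction operation
    SIMPLE = 1  # Simple Format only, no data item chunks
    CONT = 2    # Simple or Continued Format, no data item chunks
    GENL = 3    # same as forward-direction messages


def reverse_direction_permits(level, payload_format, has_data_item_chunks):
    """Hypothetical helper: may a reverse-direction message of this shape
    be sent to a peer that advertised `level`?"""
    level = RvrsDir(level)  # raises ValueError for an undefined level
    if level is RvrsDir.NONE:
        return False
    if level is RvrsDir.GENL:
        return True
    # SIMPLE and CONT both exclude data item chunks.
    if has_data_item_chunks:
        return False
    if payload_format == "simple":
        return True
    if payload_format == "continued":
        return level is RvrsDir.CONT
    return False  # e.g. "special" is not covered by SIMPLE or CONT
```

   A sender would consult such a check before choosing a payload format
   for a reverse-direction message.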

   When a peer does not provide this property, the default is that the
   peer does not support reverse-direction operation.

5.2.6.  Host Authentication Message

   The value of this transport property enables the exchange of host
   authentication material. This property can accommodate
   authentication handshakes that require multiple challenge-response
   interactions and potentially large amounts of material.

      const uint32 RDMA2_PROPID_HOSTAUTH = 6;
      typedef opaque rpcrdma2_prop_hostauth<>;

   When this property is not present, the peer(s) remain
   unauthenticated. Local security policy on each peer determines
   whether the connection is permitted to continue.

6.  Transport Messages

   Each transport message consists of multiple sections.

   * A transport header prefix, as defined in Section 6.4. Among other
     things, this structure indicates the header type.

   * The transport header proper, as defined by one of the sub-sections
     below. See Section 6.1 for the mapping between header types and
     the corresponding header structure.

   * Potentially, all or part of an RPC message payload.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which defined the first and second
   of the items above as a single XDR data structure. The new
   organization is in keeping with RPC-over-RDMA version 2's
   extensibility model, which enables the definition of new header types
   without modifying the XDR definition of existing header types.

6.1.  Transport Header Types

   Table 2 lists the RPC-over-RDMA version 2 header types. The columns
   contain the following information:

   * The column labeled "Operation" names the particular operation.

   * The column labeled "Code" specifies the value of the header type
     for this operation.

   * The column labeled "XDR type" gives the XDR type of the data
     structure used to organize the information in this new header
     type. This data immediately follows the universal portion of the
     transport header present in every RPC-over-RDMA transport header.

   * The column labeled "Msg" indicates whether this operation is
     followed (or not) by an RPC message payload.

   * The column labeled "Section" refers to the section within the
     current document that explains the use of this header type.

   +==============+======+=============================+=====+=========+
   | Operation    | Code | XDR type                    | Msg | Section |
   +==============+======+=============================+=====+=========+
   | Report       | 4    | rpcrdma2_hdr_error          | No  | 6.3.1   |
   | Transport    |      |                             |     |         |
   | Error        |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Grant        | 5    | void                        | No  | 6.3.2   |
   | Credits      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Specify      | 6    | rpcrdma2_hdr_connprop       | No  | 6.3.3   |
   | Properties   |      |                             |     |         |
   | (Middle)     |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Specify      | 7    | rpcrdma2_hdr_connprop       | No  | 6.3.4   |
   | Properties   |      |                             |     |         |
   | (Final)      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 8    | rpcrdma2_hdr_call_external  | No  | 6.3.5   |
   | External     |      |                             |     |         |
   | RPC Call     |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 9    | rpcrdma2_hdr_call_middle    | Yes | 6.3.6   |
   | Continued    |      |                             |     |         |
   | RPC Call     |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 10   | rpcrdma2_hdr_call_inline    | Yes | 6.3.7   |
   | Inline RPC   |      |                             |     |         |
   | Call         |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 11   | rpcrdma2_hdr_reply_external | No  | 6.3.8   |
   | External     |      |                             |     |         |
   | RPC Reply    |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 12   | rpcrdma2_hdr_reply_middle   | Yes | 6.3.9   |
   | Continued    |      |                             |     |         |
   | RPC Reply    |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+
   | Convey       | 13   | rpcrdma2_hdr_reply_inline   | Yes | 6.3.10  |
   | Inline RPC   |      |                             |     |         |
   | Reply        |      |                             |     |         |
   | Message      |      |                             |     |         |
   +--------------+------+-----------------------------+-----+---------+

                                Table 2

   RPC-over-RDMA version 2 peers are REQUIRED to support all message
   header types in Table 2. RPC-over-RDMA version 2 implementations
   that receive an unrecognized header type MUST respond with an
   RDMA2_ERROR message with an rdma_err field containing
   RDMA2_ERR_INVAL_HTYPE and drop the incoming message without
   processing it further.

6.2.  Headers and Chunks

   Most RPC-over-RDMA version 2 data structures have antecedents in
   corresponding structures in RPC-over-RDMA version 1. As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names, and there are a few small changes in content. In
   some cases, there have been structural re-organizations to enable
   protocol extensibility.

6.2.1.  Common Transport Header Prefix

   The rpcrdma_common structure defines the initial part of each
   RPC-over-RDMA transport header for RPC-over-RDMA version 2 and
   subsequent versions.

      struct rpcrdma_common {
              uint32 rdma_xid;
              uint32 rdma_vers;
              uint32 rdma_credit;
              uint32 rdma_htype;
      };

   RPC-over-RDMA version 2's use of these first four words aligns with
   that of version 1 as required by Section 4.2 of [RFC8166].
   However, there are crucial structural differences in the way that
   these words are described by the respective XDR descriptions:

   * The header type is represented as a uint32 rather than as an enum
     type. An enum would need to be modified to reflect additions to
     the set of header types made by later extensions.

   * The header type field is part of an XDR structure devoted to
     representing the transport header prefix, rather than being part
     of a discriminated union that includes the body of each transport
     header type.

   * There is now a prefix structure (see Section 6.4) of which the
     rpcrdma_common structure is the initial segment. This prefix is a
     newly defined XDR object within the protocol description, which
     constrains the universal portion of all header types to the four
     words in rpcrdma_common.

   These changes are part of a more considerable structural change in
   the XDR definition of RPC-over-RDMA version 2 that facilitates a
   cleaner treatment of protocol extension. The XDR appearing in
   Section 8 reflects these changes, which Appendix C.1 discusses in
   further detail.

6.3.  Header Types

   The header types defined and used in RPC-over-RDMA version 1 are not
   carried over into RPC-over-RDMA version 2, although there are easy
   equivalents to the version 1 procedures:

   * The RDMA2_ERROR header (defined in Section 6.3.1) has an XDR
     definition that differs from that in RPC-over-RDMA version 1, and
     its modifications are all compatible extensions.

   * Senders use RDMA2_CALL_INLINE or RDMA2_REPLY_INLINE (defined in
     Sections 6.3.7 and 6.3.10) in place of RDMA_MSG. There are minor
     differences in the on-the-wire format between the version 1
     procedure and the version 2 header types.

   * Senders use RDMA2_CALL_EXTERNAL or RDMA2_REPLY_EXTERNAL (defined
     in Sections 6.3.5 and 6.3.8) in place of RDMA_NOMSG. There are
     minor differences in the on-the-wire format between the version 1
     procedure and the version 2 header types.

   * RDMA2_CONNPROP_MIDDLE and RDMA2_CONNPROP_FINAL (defined in
     Sections 6.3.3 and 6.3.4) are new header types devoted to enabling
     connection peers to exchange information about their transport
     properties.

6.3.1.  RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR reports a transport layer error on a previous
   transmission.

      const rpcrdma2_proc RDMA2_ERROR = 4;

      struct rpcrdma2_err_vers {
              uint32 rdma_vers_low;
              uint32 rdma_vers_high;
      };

      struct rpcrdma2_err_write {
              uint32 rdma_chunk_index;
              uint32 rdma_length_needed;
      };

      union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) {
              case RDMA2_ERR_VERS:
                      rpcrdma2_err_vers rdma_vrange;
              case RDMA2_ERR_READ_CHUNKS:
                      uint32 rdma_max_chunks;
              case RDMA2_ERR_WRITE_CHUNKS:
                      uint32 rdma_max_chunks;
              case RDMA2_ERR_SEGMENTS:
                      uint32 rdma_max_segments;
              case RDMA2_ERR_WRITE_RESOURCE:
                      rpcrdma2_err_write rdma_writeres;
              case RDMA2_ERR_REPLY_RESOURCE:
                      uint32 rdma_length_needed;
              default:
                      void;
      };

   See Section 7 for details on the use of this header type.

6.3.2.  RDMA2_GRANT: Grant Credits

   The RDMA2_GRANT header type enables a connection peer to update
   credit information without conveying a payload.

      const rpcrdma2_proc RDMA2_GRANT = 5;

   This message carries no payload except for a struct
   rpcrdma2_hdr_prefix. The rdma_xid field is unused. Senders MUST set
   the rdma_xid field to zero and receivers MUST ignore the value in
   this field.

6.3.3.  RDMA2_CONNPROP_MIDDLE: Exchange Transport Properties

   The RDMA2_CONNPROP_MIDDLE header type enables a connection peer to
   publish the properties of its implementation to its remote peer.

      const rpcrdma2_proc RDMA2_CONNPROP_MIDDLE = 6;

      struct rpcrdma2_hdr_connprop {
              rpcrdma2_propset rdma_props;
      };

   A peer sends an RDMA2_CONNPROP_MIDDLE header type when it has one or
   more properties to send that do not fit within the default inline
   threshold for the RPC-over-RDMA version that is in effect.

   A peer may encounter properties that it does not recognize or
   support. In such cases, the receiver ignores unsupported properties
   without generating an error response.

   If a peer follows an RDMA2_CONNPROP_MIDDLE header type with anything
   other than another RDMA2_CONNPROP_MIDDLE message or an
   RDMA2_CONNPROP_FINAL message, the receiver MUST respond with an
   RDMA2_ERROR header type and set its rdma_err field to
   RDMA2_ERR_INVAL_CONT and drop the incoming message without processing
   it further.

6.3.4.  RDMA2_CONNPROP_FINAL: Exchange Transport Properties

   The RDMA2_CONNPROP_FINAL header type enables a connection peer to
   publish the properties of its implementation to its remote peer.

      const rpcrdma2_proc RDMA2_CONNPROP_FINAL = 7;

      struct rpcrdma2_hdr_connprop {
              rpcrdma2_propset rdma_props;
      };

   Each peer sends an RDMA2_CONNPROP_FINAL header type as the final
   CONNPROP-type message after the client has established a connection.
   The size of this message is limited to the default inline threshold
   for the RPC-over-RDMA version that is in effect.

   A peer may encounter properties that it does not recognize or
   support. In such cases, the receiver ignores unsupported properties
   without generating an error response.
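
   As a rough illustration of how a receiver might apply a received
   property set, the Python sketch below walks a list of properties,
   skips code points it does not know, and falls back to the defaults
   from Table 1. It is a sketch under stated assumptions, not a
   definitive implementation: apply_propset, BadPropval, and the
   (rdma_which, rdma_data) tuple representation are hypothetical, only
   the uint32 properties (codes 1 through 5) are decoded, and code 6
   (Host Auth Message) is left out for brevity, so this sketch treats
   it like an unrecognized property.

```python
import struct

# Code points and defaults for the uint32 properties in Table 1.
UINT32_PROPS = {
    1: ("max_send_size", 4096),
    2: ("recv_buffer_size", 4096),
    3: ("max_segment_size", 1048576),
    4: ("max_segment_count", 16),
    5: ("reverse_direction", 0),
}


class BadPropval(Exception):
    """Stands in for reporting RDMA2_ERR_BAD_PROPVAL to the sender."""


def apply_propset(propvals):
    """Return the effective uint32 property values for a peer.

    `propvals` is a hypothetical, already-unmarshalled form of an
    rpcrdma2_propset: an iterable of (rdma_which, rdma_data) pairs,
    where rdma_data is a bytes object.
    """
    effective = {name: default for name, default in UINT32_PROPS.values()}
    for which, data in propvals:
        if which not in UINT32_PROPS:
            continue  # unrecognized property: skip it, report no error
        # An XDR uint32 occupies exactly four octets. Rejecting excess
        # length as well as insufficient length is an assumption here.
        if len(data) != 4:
            raise BadPropval(which)
        name, _default = UINT32_PROPS[which]
        (effective[name],) = struct.unpack(">I", data)  # big-endian
    return effective
```

   Properties the peer never sent keep their default values, which
   matches the role of the "Default" column in Table 1.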

   If a peer sends a CONNPROP-type message on a connection after it has
   sent an RDMA2_CONNPROP_FINAL message, the receiver MUST respond with
   an RDMA2_ERROR header type and set its rdma_err field to
   RDMA2_ERR_INVAL_CONT and drop the incoming message without processing
   it further.

6.3.5.  RDMA2_CALL_EXTERNAL: Convey External RPC Call Message

   RDMA2_CALL_EXTERNAL conveys an RPC Call message payload using
   explicit RDMA operations. The Responder reads the Payload stream
   from a memory area specified by the Call chunk. The sender MUST set
   the rdma_xid field to the same value as the xid of the RPC Call
   message payload.

      const rpcrdma2_proc RDMA2_CALL_EXTERNAL = 8;

      struct rpcrdma2_hdr_call_external {
              uint32 rdma_inv_handle;

              struct rpcrdma2_read_list *rdma_call;
              struct rpcrdma2_read_list *rdma_reads;
              struct rpcrdma2_write_list *rdma_provisional_writes;
              struct rpcrdma2_write_chunk *rdma_provisional_reply;
      };

   rdma_inv_handle: The rdma_inv_handle field contains a 32-bit RDMA
      handle that the Responder may use in a Send With Invalidation
      operation. See Section 6.5.

   rdma_call: The rdma_call field anchors a list of one or more Read
      segments that contain the RPC Call's Payload stream.

   rdma_reads: The rdma_reads field anchors a list of zero or more Read
      segments that contain data item chunks.

   rdma_provisional_writes: The rdma_provisional_writes field anchors a
      list of zero or more provisional Write chunks.

   rdma_provisional_reply: The rdma_provisional_reply field is a list
      containing zero or one provisional Reply chunk.

6.3.6.  RDMA2_CALL_MIDDLE: Convey Continued RPC Call Message

   RDMA2_CALL_MIDDLE conveys a beginning or middle portion of an RPC
   Call message immediately following the transport header in the send
   buffer. The sender MUST set the rdma_xid field to the same value as
   the xid of the RPC Call message payload.
   The sender sets the rdma_remaining field to the number of bytes in
   the RPC Call message payload that remain to be sent. The
   rdma_rpc_first_word field demarks the first word of the Payload
   stream.

      const rpcrdma2_proc RDMA2_CALL_MIDDLE = 9;

      struct rpcrdma2_hdr_call_middle {
              uint32 rdma_remaining;

              /* The rpc message starts here and continues
               * through the end of the transmission. */
              uint32 rdma_rpc_first_word;
      };

   If a peer follows an RDMA2_CALL_MIDDLE header type with anything
   other than an RDMA2_CALL_MIDDLE message or an RDMA2_CALL_INLINE
   message, the receiver MUST respond with an RDMA2_ERROR header type
   and set its rdma_err field to RDMA2_ERR_INVAL_CONT and drop the
   incoming message without processing it further.

6.3.7.  RDMA2_CALL_INLINE: Convey Inline RPC Call Message

   RDMA2_CALL_INLINE conveys the only or final portion of an RPC Call
   message. The rdma_rpc_first_word field demarks the first word of
   this Payload stream.

      const rpcrdma2_proc RDMA2_CALL_INLINE = 10;

      struct rpcrdma2_hdr_call_inline {
              uint32 rdma_inv_handle;

              struct rpcrdma2_read_list *rdma_reads;
              struct rpcrdma2_write_list *rdma_provisional_writes;
              struct rpcrdma2_write_chunk *rdma_provisional_reply;

              /* The rpc message starts here and continues
               * through the end of the transmission. */
              uint32 rdma_rpc_first_word;
      };

   rdma_inv_handle: The rdma_inv_handle field contains a 32-bit RDMA
      handle that the Responder may use in a Send With Invalidation
      operation. See Section 6.5.

   rdma_reads: The rdma_reads field anchors a list of zero or more Read
      segments that contain only data item chunks. A Requester MUST NOT
      insert Position-zero Read chunks in this list.

   rdma_provisional_writes: The rdma_provisional_writes field anchors a
      list of zero or more provisional Write chunks.

   rdma_provisional_reply: The rdma_provisional_reply field is a list
      containing zero or one provisional Reply chunk.

6.3.8.  RDMA2_REPLY_EXTERNAL: Convey External RPC Reply Message

   RDMA2_REPLY_EXTERNAL conveys an RPC Reply message payload using
   explicit RDMA operations. In particular, it is referred to as a
   Special Format Reply when the Responder writes the RPC payload into a
   memory area specified by a Reply chunk. The sender MUST set the
   rdma_xid field to the same value as the xid of the RPC Reply message
   payload.

      const rpcrdma2_proc RDMA2_REPLY_EXTERNAL = 11;

      struct rpcrdma2_hdr_reply_external {
              struct rpcrdma2_write_list *rdma_writes;
              struct rpcrdma2_write_chunk *rdma_reply;
      };

   rdma_writes: The rdma_writes field anchors a list of zero or more
      Write chunks that are either empty or contain reduced data items.

   rdma_reply: The rdma_reply field is a list that MUST contain exactly
      one Reply chunk.

6.3.9.  RDMA2_REPLY_MIDDLE: Convey Continued RPC Reply Message

   RDMA2_REPLY_MIDDLE conveys a beginning or middle portion of an RPC
   Reply message immediately following the transport header in the send
   buffer. The sender MUST set the rdma_xid field to the same value as
   the xid of the RPC Reply message payload. The sender sets the
   rdma_remaining field to the number of bytes in the RPC Reply message
   payload that remain to be sent. The rdma_rpc_first_word field
   demarks the first word of the Payload stream.

      const rpcrdma2_proc RDMA2_REPLY_MIDDLE = 12;

      struct rpcrdma2_hdr_reply_middle {
              uint32 rdma_remaining;

              /* The rpc message starts here and continues
               * through the end of the transmission.
               */
              uint32 rdma_rpc_first_word;
      };

   If a peer follows an RDMA2_REPLY_MIDDLE header type with anything
   other than an RDMA2_REPLY_MIDDLE message or an RDMA2_REPLY_INLINE
   message, the receiver MUST respond with an RDMA2_ERROR header type
   and set its rdma_err field to RDMA2_ERR_INVAL_CONT and drop the
   incoming message without processing it further.

6.3.10.  RDMA2_REPLY_INLINE: Convey RPC Reply Message Inline

   RDMA2_REPLY_INLINE conveys the only or final portion of an RPC Reply
   message immediately following the transport header in the send
   buffer. If the Reply message payload has been reduced, the
   rdma_writes list carries the reduced data item chunks.

      const rpcrdma2_proc RDMA2_REPLY_INLINE = 13;

      struct rpcrdma2_hdr_reply_inline {
              struct rpcrdma2_write_list *rdma_writes;

              /* The rpc message starts here and continues
               * through the end of the transmission. */
              uint32 rdma_rpc_first_word;
      };

   rdma_writes: The rdma_writes field anchors a list of zero or more
      Write chunks that are either empty or contain reduced data items.

6.4.  Transport Header Prefix

   The following prefix structure appears at the start of each
   RPC-over-RDMA version 2 transport header.

      struct rpcrdma2_hdr_prefix {
              struct rpcrdma_common rdma_start;
      };

6.5.  Remote Invalidation

   To solicit the use of Remote Invalidation, a Requester sets the value
   of the rdma_inv_handle field in an RPC Call's transport header to a
   non-zero value that matches one of the rdma_handle fields in that
   header. If the Responder is not permitted to invalidate any of the
   rdma_handle values in the header conveying the Call, the Requester
   sets the RPC Call's rdma_inv_handle field to the value zero.
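
   The Requester-side rule above can be sketched in a few lines of
   Python. This is a minimal illustration, not part of the protocol:
   choose_inv_handle and inv_handle_is_consistent are hypothetical
   helpers, handles are modeled as plain integers, and picking the
   first eligible handle is an arbitrary choice made for this sketch.

```python
def choose_inv_handle(rdma_handles, invalidatable):
    """Pick the rdma_inv_handle value for a Call.

    rdma_handles:  handles of the segments in this Call's transport
                   header, in order of appearance
    invalidatable: the handles the Requester is willing to let the
                   Responder invalidate (may be empty)
    """
    for handle in rdma_handles:
        # The solicited handle must be non-zero and must appear in the
        # same transport header.
        if handle != 0 and handle in invalidatable:
            return handle
    return 0  # zero: Remote Invalidation is not solicited


def inv_handle_is_consistent(inv_handle, rdma_handles):
    """Zero means "do not invalidate"; any other value has to match one
    of the rdma_handle fields carried in the same header."""
    return inv_handle == 0 or inv_handle in rdma_handles
```

   The second helper shows the consistency condition a Requester can
   check on its own header before sending it.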

   If the Responder chooses not to use Remote Invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the Responder simply uses RDMA Send to
   transmit the matching RPC Reply. However, if the Responder chooses
   to use Remote Invalidation, it uses RDMA Send With Invalidate to
   transmit the RPC Reply. It MUST use the value in the corresponding
   Call's rdma_inv_handle field to construct the Send With Invalidate
   Work Request.

   A Responder never uses a Send With Invalidate Work Request when
   sending a control plane header type. This includes the RDMA2_ERROR
   header type, the RDMA2_GRANT header type, the RDMA2_CONNPROP_MIDDLE
   header type, and the RDMA2_CONNPROP_FINAL header type.

6.6.  Payload Formats

   RPC-over-RDMA version 2 provides several ways, known as "payload
   formats", to convey an RPC-over-RDMA message. A sender chooses the
   payload format for each message based on several factors:

   * The existence of DDP-eligible data items in the RPC message
     payload

   * The size of the RPC message payload

   * The direction of the RPC message (i.e., Call or Reply)

   * The available hardware resources

   * The arrangement of source and sink memory buffers

   The following subsections describe in detail how Requesters and
   Responders format RPC-over-RDMA message payloads.

6.6.1.  Simple Format

   All RPC messages conveyed via RPC-over-RDMA version 2 need at least
   one RDMA Send operation to convey. Thus, the most efficient way to
   send an RPC message that is smaller than the inline threshold is to
   append the Payload stream directly to the Transport stream and use an
   RDMA Send to convey both. When no chunks are present, senders
   construct Calls and Replies the same way, and no other operations are
   needed.

6.6.1.1.  Simple Format with Data Item Chunks

   If DDP-eligible data items are present in a Payload stream, a sender
   MAY reduce some or all of these items, removing them from the Payload
   stream. The sender then uses a separate mechanism to transfer the
   reduced data items. The Transport stream immediately followed by the
   reduced Payload stream is then transferred using one RDMA Send
   operation.

   When data item chunks are present, senders construct Calls
   differently than Replies.

   Simple Call
      After receiving the Transport and Payload streams of an RPC Call
      message with Read chunks, the Responder uses RDMA Read operations
      to move the reduced data items contained in the Read chunks.
      RPC-over-RDMA Calls can carry Write chunks for the Responder to
      use when sending the matching Reply.

   Simple Reply
      The Responder uses RDMA Write operations to move reduced data
      items contained in Write chunks. Afterward, it sends the
      Transport and Payload streams of the RPC Reply message using one
      RDMA Send. RPC-over-RDMA Replies always carry an empty Read chunk
      list.

6.6.1.2.  Simple Format Examples

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_INLINE)    |
       Call | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- | Reply

     Figure 1: A Simple Call without data item chunks and a Simple
                    Reply without data item chunks

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_INLINE)    |
       Call | ----------------------------------> |
            |              RDMA Read              |
            | <---------------------------------- |
            |       RDMA Response (arg data)      |
            | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- | Reply

      Figure 2: A Simple Call with a Read chunk and a Simple Reply
                       without data item chunks

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_INLINE)    |
       Call | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |       RDMA Write (result data)      |
            | <---------------------------------- |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- | Reply

     Figure 3: A Simple Call without data item chunks and a Simple
                       Reply with a Write chunk

6.6.2.  Continued Format

   For various reasons, a sender can choose to split a message payload
   over multiple RPC-over-RDMA messages. The Payload stream of each
   RPC-over-RDMA message contains a part of the RPC message. The
   receiver reconstructs the original RPC message by concatenating the
   Payload stream of each RPC-over-RDMA message in received order. A
   sender MAY split the Payload stream on any convenient boundary.

6.6.2.1.  Continued Format with Data Item Chunks

   If DDP-eligible data items are present in the Payload stream, a
   sender MAY reduce some or all of these items, removing them from the
   Payload stream.
   The sender then uses a separate mechanism to transfer the reduced
   data items. The Transport stream immediately followed by the reduced
   Payload stream is then transferred using one RDMA Send operation.

   As with Simple Format messages, when chunks are present, senders
   construct Calls differently than Replies.

   Continued Call
      After receiving the Transport and Payload streams of an RPC Call
      message with Read chunks, the Responder uses RDMA Read operations
      to move the reduced data items contained in Read chunks.
      RPC-over-RDMA Calls can carry Write chunks for the Responder to
      use when sending the matching Reply.

   Continued Reply
      The Responder uses RDMA Write operations to move reduced data
      items contained in Write chunks. Afterward, it sends the
      Transport and Payload streams of the RPC Reply message using
      multiple RDMA Sends. RPC-over-RDMA Replies always carry an empty
      Read chunk list.

6.6.2.2.  Continued Format Examples

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_MIDDLE)    |
       Call | ----------------------------------> |
            |    RDMA Send (RDMA2_CALL_MIDDLE)    |
            | ----------------------------------> |
            |    RDMA Send (RDMA2_CALL_INLINE)    |
            | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |    RDMA Send (RDMA2_REPLY_MIDDLE)   |
            | <---------------------------------- | Reply
            |    RDMA Send (RDMA2_REPLY_MIDDLE)   |
            | <---------------------------------- |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- |

       Figure 4: A Continued Call without data item chunks and a
               Continued Reply without data item chunks

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_MIDDLE)    |
       Call | ----------------------------------> |
            |    RDMA Send (RDMA2_CALL_MIDDLE)    |
            | ----------------------------------> |
            |    RDMA Send (RDMA2_CALL_INLINE)    |
            | ----------------------------------> |
            |              RDMA Read              |
            | <---------------------------------- |
            |       RDMA Response (arg data)      |
            | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- | Reply

    Figure 5: A Continued Call with a Read chunk and a Simple Reply
                       without data item chunks

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_INLINE)    |
       Call | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |       RDMA Write (result data)      |
            | <---------------------------------- |
            |    RDMA Send (RDMA2_REPLY_MIDDLE)   |
            | <---------------------------------- | Reply
            |    RDMA Send (RDMA2_REPLY_MIDDLE)   |
            | <---------------------------------- |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- |

    Figure 6: A Simple Call without data item chunks and a Continued
                       Reply with a Write chunk

6.6.3.  Special Format

   Even after DDP-eligible data items have been removed, a Payload
   stream can sometimes be too large to send using only RDMA Send
   operations. In those cases, the sender can use RDMA Read or Write
   operations to convey the entire RPC message. We refer to this as a
   "Special Format" message.

   To transmit a Special Format message, the sender transmits only the
   Transport stream with an RDMA Send operation. The sender does not
   include the Payload stream in the send buffer. Instead, the
   Requester provides a body chunk that the Responder uses to move the
   Payload stream.

   Because chunks are always present in Special Format messages, the
   sender always handles Calls and Replies differently.

   Special Call
      The Requester provides a Read chunk that contains the RPC Call
      message's Payload stream. Every Read segment in this chunk MUST
      contain zero (0) in its Position field. This type of Read chunk
      is a body chunk known as a Call chunk.

   Special Reply
      The Requester provisions a Reply chunk in advance. This body
      chunk is a Write chunk into which the Responder places the RPC
      Reply message's Payload stream. The Requester provisions the
      Reply chunk to accommodate the maximum expected reply size for
      that upper-layer operation.

   One purpose of a Special Format message is to handle large RPC
   messages. However, Requesters MAY use a Special Format message at
   any time to convey an RPC Call message.

   When it has alternatives, a Responder chooses which Format to use
   based on the chunks provided by the Requester. If a Requester
   provided a Write chunk and the Responder has a DDP-eligible result,
   it first reduces the reply Payload stream. If a Requester provided a
   Reply chunk and the reduced Payload stream is larger than the reply
   inline threshold, the Responder MUST use the Requester-provided Reply
   chunk for the reply.

6.6.3.1.  Special Format Examples

        Requester                               Responder
            |   RDMA Send (RDMA2_CALL_EXTERNAL)   |
       Call | ----------------------------------> |
            |              RDMA Read              |
            | <---------------------------------- |
            |       RDMA Response (RPC call)      |
            | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |    RDMA Send (RDMA2_REPLY_INLINE)   |
            | <---------------------------------- | Reply

   Figure 7: A Special Call and a Simple Reply without data item chunks

        Requester                               Responder
            |    RDMA Send (RDMA2_CALL_INLINE)    |
       Call | ----------------------------------> |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Write (RPC reply)       |
            | <---------------------------------- |
            |   RDMA Send (RDMA2_REPLY_EXTERNAL)  |
            | <---------------------------------- | Reply

   Figure 8: A Simple Call without data item chunks and a Special Reply

6.6.4.  Choosing a Reply Payload Format

   A Requester provisions all necessary registered memory resources for
   both an RPC Call and its matching RPC Reply. A Requester constructs
   each RPC Call; thus, it can compute the exact memory resources needed
   to send every Call. However, the Requester allocates memory
   resources to receive the corresponding Reply before the Responder has
   constructed it. Occasionally, it is challenging for the Requester to
   know in advance precisely what resources are needed to receive the
   Reply.

   In RPC-over-RDMA version 2, a Requester can provide a Reply chunk for
   any transaction. The Responder can use the provided Reply chunk or
   it can decide to use another means to convey the RPC Reply. If the
   combination of the provided Write chunk list and Reply chunk is not
   adequate to convey a Reply, the Responder SHOULD use Message
   Continuation to send that Reply. If even that is not possible, the
   Responder sends an RDMA2_ERROR message to the Requester, as described
   in Section 6.3.1:

   * If the Write chunk list cannot accommodate the ULP's DDP-eligible
     data payload, the Responder sends an RDMA2_ERR_WRITE_RESOURCE
     error.

   * If the Reply chunk cannot accommodate the parts of the Reply that
     are not DDP-eligible, the Responder sends an
     RDMA2_ERR_REPLY_RESOURCE error.

   When receiving such errors, the Requester can retry the ULP call
   using more substantial reply resources. In cases where retrying the
   ULP request is not possible (e.g., the request is non-idempotent),
   the Requester terminates the RPC transaction and presents an error to
   the RPC consumer.

7.  Error Handling

   A receiver performs validity checks on each ingress RPC-over-RDMA
   message before it assembles that message's Payload stream and passes
   it to the RPC layer.
   For example, if an ingress RPC-over-RDMA message is not as long as
   the size of struct rpcrdma2_hdr_prefix (20 octets), the receiver
   cannot trust the value of the rdma_xid field. In this case, the
   receiver MUST silently discard the ingress message without
   processing it further, and without a response to the sender.

   When a request (for instance, an RPC Call or a control plane
   operation) is made, typically an RPC consumer blocks while waiting
   for the response. Thus, when an incoming message conveys a request
   and that request cannot be acted upon, the receiver of that request
   needs to report the problem to its sender in order to unblock
   waiters. Likewise, if, after processing a request, a sender is
   unable to transmit the response on an otherwise healthy connection,
   the sender needs to report that problem for the same reason.

   The RDMA2_ERROR header type is used for this purpose. To form an
   RDMA2_ERROR type header:

   *  The rdma_xid field MUST contain the same XID that was in the
      rdma_xid field in the ingress request.

   *  The rdma_vers field MUST contain the same version that was in the
      rdma_vers field in the ingress request.

   *  The sender sets the rdma_credit field to the credit values in
      effect for this connection.

   *  The rdma_htype field MUST contain the value RDMA2_ERROR.

   *  The rdma_err field contains a value that reflects the type of
      error that occurred, as described in the subsections below.

   When a peer receives an RDMA2_ERROR message type with an unrecognized
   or unsupported value in its rdma_err field, it MUST silently discard
   the message without processing it further.

7.1.  Basic Transport Stream Parsing Errors

7.1.1.  RDMA2_ERR_VERS

   When a Responder detects an RPC-over-RDMA header version that it does
   not support (the current document defines version 2), it MUST respond
   with an RDMA2_ERROR message type and set its rdma_err field to
   RDMA2_ERR_VERS. The Responder then fills in the rpcrdma2_err_vers
   structure with the RPC-over-RDMA versions it supports. The Responder
   MUST silently discard the ingress message without passing it to the
   RPC layer.

   When a Requester receives this error message, it uses the information
   in the rpcrdma2_err_vers structure to select an RPC-over-RDMA version
   that both peers support for subsequent operations on the connection.
   A Requester MUST NOT subsequently send a message that uses a version
   that the Responder has indicated it does not support. RDMA2_ERR_VERS
   indicates a permanent error. Receipt of this error completes the RPC
   transaction associated with the XID in the rdma_xid field.

7.1.2.  RDMA2_ERR_VERS_MISMATCH

   When a Responder receives a message with a transport protocol version
   that does not match the protocol version that was used in previous
   successful exchanges on the same connection, it MUST respond with an
   RDMA2_ERROR message type and set its rdma_err field to
   RDMA2_ERR_VERS_MISMATCH. The Responder MUST silently discard the
   ingress message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message that uses a protocol
   version that the Responder has indicated it does not recognize on
   this connection. The Requester can recover by sending the message
   again using a corrected protocol version, or it can terminate the RPC
   transaction associated with the XID in the rdma_xid field with an
   error.

7.1.3.  RDMA2_ERR_INVAL_HTYPE

   If a Responder recognizes the value in an ingress rdma_vers field,
   but it does not recognize the value in the rdma_htype field or does
   not support that header type, it MUST set the rdma_err field to
   RDMA2_ERR_INVAL_HTYPE. The Responder MUST silently discard the
   incoming message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message on the connection
   that uses an htype that the Responder has indicated it does not
   support. RDMA2_ERR_INVAL_HTYPE indicates a permanent error. Receipt
   of this error completes the RPC transaction associated with the XID
   in the rdma_xid field.

7.1.4.  RDMA2_ERR_INVAL_CONT

   If a Responder detects a problem with an ingress RPC-over-RDMA
   message that is part of a Message Continuation sequence, the
   Responder MUST set the rdma_err field to RDMA2_ERR_INVAL_CONT. The
   Responder MUST silently discard all ingress messages with an rdma_xid
   field that matches the failing message without reassembling the
   payload.

   RDMA2_ERR_INVAL_CONT indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.2.  XDR Errors

   A receiver might encounter an XDR parsing error that prevents it from
   processing an ingress Transport stream. Examples of such errors
   include:

   *  The value of the rdma_xid field does not match the value of the
      XID field in the accompanying RPC message.

   *  The receive buffer ends before the end of a data object contained
      in the Transport stream.

   Moreover, when a Responder receives a valid RPC-over-RDMA header but
   the Responder's ULP implementation cannot parse the RPC arguments in
   the RPC Call, the Responder returns an RPC Reply with status
   GARBAGE_ARGS, using an RDMA2_REPLY_INLINE message type.
   This type of parsing failure might be due to mismatches between
   chunk sizes or offsets and the contents of the Payload stream, for
   example. In this case, the error is permanent, but the Requester
   has no way to know how much processing the Responder has completed
   for this RPC transaction.

7.2.1.  RDMA2_ERR_BAD_XDR

   If a Responder recognizes the value in the rdma_vers field, but it
   cannot otherwise parse the ingress Transport stream, it MUST set the
   rdma_err field to RDMA2_ERR_BAD_XDR. The Responder MUST silently
   discard the ingress message without passing it to the RPC layer.

   RDMA2_ERR_BAD_XDR indicates a permanent error. Receipt of this error
   completes the RPC transaction associated with the XID in the rdma_xid
   field.

7.2.2.  RDMA2_ERR_BAD_PROPVAL

   If a receiver recognizes the value in an ingress rdma_which field,
   but it cannot parse the accompanying propval, it MUST set the
   rdma_err field to RDMA2_ERR_BAD_PROPVAL (see Section 5.1). The
   receiver MUST silently discard the ingress message without applying
   any of its property settings.

7.3.  Responder RDMA Operational Errors

   In RPC-over-RDMA version 2, the Responder initiates RDMA Read and
   Write operations that target the Requester's memory. Problems might
   arise as the Responder attempts to use Requester-provided resources
   for RDMA operations. For example:

   *  Usually, chunks can be validated only by using their contents to
      perform data transfers. If chunk contents are invalid (e.g., a
      memory region is no longer registered or a chunk length exceeds
      the end of the registered memory region), a Remote Access Error
      occurs.

   *  If a Requester's Receive buffer is too small, the Responder's Send
      operation completes with a Local Length Error.

   *  If the Requester-provided Reply chunk is too small to accommodate
      a large RPC Reply message, a Remote Access Error occurs. A
      Responder might detect this problem before attempting to write
      past the end of the Reply chunk.

   RDMA operational errors can be fatal to the connection. To avoid a
   retransmission loop and repeated connection loss that deadlocks the
   connection, once the Requester has re-established a connection, the
   Responder SHOULD send an RDMA2_ERROR response to indicate that no
   RPC-level reply is possible for that transaction.

7.3.1.  RDMA2_ERR_READ_CHUNKS

   If a Requester presents more DDP-eligible arguments than a Responder
   is prepared to Read, the Responder MUST set the rdma_err field to
   RDMA2_ERR_READ_CHUNKS and set the rdma_max_chunks field to the
   maximum number of Read chunks the Responder can process. If the
   Responder implementation cannot handle any Read chunks for a request,
   it MUST set the rdma_max_chunks to zero in this response. The
   Responder MUST silently discard the ingress message without
   processing it further.

   The Requester can reconstruct the Call using Message Continuation or
   a Special Format payload and resend it. If the Requester chooses not
   to resend the Call, it MUST terminate this RPC transaction with an
   error.

7.3.2.  RDMA2_ERR_WRITE_CHUNKS

   If a Requester has constructed an RPC Call with more DDP-eligible
   results than the Responder is prepared to Write, the Responder MUST
   set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS and set the
   rdma_max_chunks field to the maximum number of Write chunks the
   Responder can return. The Requester can reconstruct the Call with no
   Write chunks and a Reply chunk of appropriate size. If the Requester
   does not resend the Call, it MUST terminate this RPC transaction with
   an error.
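
   The chunk-count checks in Section 7.3.1 and in this section can be
   sketched from the Responder's point of view. The following Python
   fragment is only an illustration under stated assumptions, not part
   of the protocol: the function name check_chunk_counts, its
   parameters, and the idea that a Responder tracks simple per-request
   limits are all hypothetical. The error code values are the ones
   listed in the XDR definition in Section 8.4. The sketch also covers
   the case, stated in this section, in which a Responder that can
   handle no Write chunks and cannot use Message Continuation reports
   RDMA2_ERR_REPLY_RESOURCE instead.

```python
# Error codes as listed in the XDR definition (Section 8.4).
RDMA2_ERR_READ_CHUNKS = 6
RDMA2_ERR_WRITE_CHUNKS = 7
RDMA2_ERR_REPLY_RESOURCE = 10


def check_chunk_counts(n_read, n_write, max_read, max_write,
                       can_continue=True, reply_len_needed=0):
    """Return None if the Call's chunk counts are acceptable.

    Otherwise return (rdma_err, value), where value is what the
    Responder would place in rdma_max_chunks or rdma_length_needed.
    All parameter names are hypothetical.
    """
    if n_read > max_read:
        # Section 7.3.1: report how many Read chunks can be processed.
        return (RDMA2_ERR_READ_CHUNKS, max_read)
    if n_write > max_write:
        if max_write == 0 and not can_continue:
            # Section 7.3.2: no Write chunks can be handled and
            # Message Continuation is unavailable.
            return (RDMA2_ERR_REPLY_RESOURCE, reply_len_needed)
        return (RDMA2_ERR_WRITE_CHUNKS, max_write)
    return None
```

   A Requester that receives one of these results can use the returned
   limit to rebuild the Call, or terminate the RPC transaction with an
   error if it does not resend.
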

   If the Responder implementation cannot handle any Write chunks for a
   request and cannot send the Reply using Message Continuation, it MUST
   return a response of RDMA2_ERR_REPLY_RESOURCE instead (see below).

7.3.3.  RDMA2_ERR_SEGMENTS

   If a Requester has constructed an RPC Call with a chunk that contains
   more segments than the Responder supports, the Responder MUST set the
   rdma_err field to RDMA2_ERR_SEGMENTS and set the rdma_max_segments
   field to the maximum number of segments the Responder can process.
   The Requester can reconstruct the Call and resend it. If the
   Requester does not resend the Call, it MUST terminate this RPC
   transaction with an error.

7.3.4.  RDMA2_ERR_WRITE_RESOURCE

   If a Requester has provided a Write chunk that is not large enough to
   contain a DDP-eligible result, the Responder MUST set the rdma_err
   field to RDMA2_ERR_WRITE_RESOURCE. The Responder MUST set the
   rdma_chunk_index field to point to the first Write chunk in the
   transport header that is too short, or to zero to indicate that it
   was not possible to determine which chunk is too small. Indexing
   starts at one (1), which represents the first Write chunk. The
   Responder MUST set the rdma_length_needed to the number of bytes
   needed in that chunk to convey the result data item.

   The Requester can reconstruct the Call with more reply resources and
   resend it. If the Requester does not resend the Call (for instance,
   if the Responder set the index and length fields to zero), it MUST
   terminate this RPC transaction with an error.

7.3.5.  RDMA2_ERR_REPLY_RESOURCE

   If a Responder cannot send an RPC Reply using Message Continuation
   and the Reply does not fit in the Reply chunk, the Responder MUST set
   the rdma_err field to RDMA2_ERR_REPLY_RESOURCE. The Responder MUST
   set the rdma_length_needed to the number of Reply chunk bytes needed
   to convey the reply. The Requester can reconstruct the Call with
   more reply resources and resend it. If the Requester does not resend
   the Call (for instance, if the Responder set the length field to
   zero), it MUST terminate this RPC transaction with an error.

7.4.  Other Operational Errors

   While a Requester is constructing an RPC Call message, an
   unrecoverable problem might occur that prevents the Requester from
   posting further RDMA Work Requests on behalf of that message. As
   with other transports, if a Requester is unable to construct and
   transmit an RPC Call, the associated RPC transaction fails
   immediately.

   After a Requester has received a Reply, if it is unable to invalidate
   a memory region due to an unrecoverable problem, the Requester MUST
   close the connection to protect that memory from Responder access
   before the associated RPC transaction is complete.

   While a Responder is constructing an RPC Reply message or error
   message, an unrecoverable problem might occur that prevents the
   Responder from posting further RDMA Work Requests on behalf of that
   message. If a Responder is unable to construct and transmit an RPC
   Reply or RPC-over-RDMA error message, the Responder MUST close the
   connection to signal to the Requester that a reply was lost.

7.4.1.  RDMA2_ERR_SYSTEM

   If some problem occurs on a Responder that does not fit into the
   above categories, the Responder MAY report it to the Requester by
   setting the rdma_err field to RDMA2_ERR_SYSTEM. The Responder MUST
   silently discard the message(s) associated with the failing
   transaction without further processing.

   RDMA2_ERR_SYSTEM is a permanent error. This error does not indicate
   how much of the transaction the Responder has processed, nor does it
   indicate a particular recovery action for the Requester.
   A Requester that receives this error MUST terminate the RPC
   transaction associated with the XID value in the RDMA2_ERROR
   message's rdma_xid field.

7.5.  RDMA Transport Errors

   The RDMA connection and physical link provide some degree of error
   detection and retransmission. The Marker PDU Aligned Framing (MPA)
   protocol (as described in Section 7.1 of [RFC5044]) as well as the
   InfiniBand link layer [IBA] provide Cyclic Redundancy Check (CRC)
   protection of RDMA payloads. CRC-class protection is a general
   attribute of such transports.

   Additionally, the RPC layer itself can accept errors from the
   transport and recover via retransmission. RPC recovery can typically
   handle complete loss and re-establishment of a transport connection.

   The details of reporting and recovery from RDMA link-layer errors are
   described in specific link-layer APIs and operational specifications
   and are outside the scope of this protocol specification. See
   Section 11 for further discussion of RPC-level integrity schemes.

8.  XDR Protocol Definition

   This section contains a description of the core features of the RPC-
   over-RDMA version 2 protocol expressed in the XDR language [RFC4506].
   It organizes the description to make it simple to extract into a form
   that is ready to compile or combine with similar descriptions
   published later as extensions to RPC-over-RDMA version 2.

8.1.  Code Component License

   Code Components extracted from the current document must include the
   following license text. When combining the extracted XDR code with
   other XDR code which has an identical license, only a single copy of
   the license text needs to be retained.

   /// /*
   ///  * Copyright (c) 2010, 2020 IETF Trust and the persons
   ///  * identified as authors of the code. All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///

8.2.  Extraction of the XDR Definition

   Implementers can apply the following sed script to the current
   document to produce a machine-readable XDR description of the base
   RPC-over-RDMA version 2 protocol.

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

   That is, if this document is in a file called "spec.txt", then
   implementers can do the following to extract an XDR description file
   and store it in the file rpcrdma-v2.x.

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
        < spec.txt > rpcrdma-v2.x

   Although this file is a usable description of the base protocol, when
   extensions are to be supported, it may be desirable to divide the
   description into multiple files. The following script achieves that
   purpose:

   #!/usr/local/bin/perl
   # Split rpcrdma-v2.x at each "FILE ENDS: <name>;" marker line.
   open(IN,"rpcrdma-v2.x");
   open(OUT,">temp.x");
   while(<IN>)
   {
     # Capture only the file name, not the trailing "; */".
     if (m/FILE ENDS: ([^;\s]+)/)
     {
       close(OUT);
       rename("temp.x", $1);
       open(OUT,">temp.x");
     }
     else
     {
       print OUT $_;
     }
   }
   close(IN);
   close(OUT);

   Running the above script results in two files:

   *  The file common.x, containing the license plus the shared XDR
      definitions that need to be made available to both the base
      protocol and any subsequent extensions.

   *  The file baseops.x containing the XDR definitions for the base
      protocol defined in this document.

   Extensions to RPC-over-RDMA version 2, published as Standards Track
   documents, should have similarly structured XDR definitions. Once an
   implementer has extracted the XDR for all desired extensions and the
   base XDR definition contained in the current document, she can
   concatenate them to produce a consolidated XDR definition that
   reflects the set of extensions selected for her RPC-over-RDMA version
   2 implementation.

   Alternatively, the XDR descriptions can be compiled separately.
   In that case, the combination of common.x and baseops.x defines the
   base transport. The combination of common.x and the XDR description
   of each extension produces a full XDR definition of that extension.

8.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures

   /// /***************************************************************
   ///  *    Transport Header Prefixes
   ///  ***************************************************************/
   ///
   /// struct rpcrdma_common {
   ///         uint32         rdma_xid;
   ///         uint32         rdma_vers;
   ///         uint32         rdma_credit;
   ///         uint32         rdma_htype;
   /// };
   ///
   /// struct rpcrdma2_hdr_prefix {
   ///         struct rpcrdma_common       rdma_start;
   /// };
   ///
   /// /***************************************************************
   ///  *    Chunks and Chunk Lists
   ///  ***************************************************************/
   ///
   /// struct rpcrdma2_segment {
   ///         uint32 rdma_handle;
   ///         uint32 rdma_length;
   ///         uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///         uint32                  rdma_position;
   ///         struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///         struct rpcrdma2_read_segment rdma_entry;
   ///         struct rpcrdma2_read_list    *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///         struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///         struct rpcrdma2_write_chunk rdma_entry;
   ///         struct rpcrdma2_write_list  *rdma_next;
   /// };
   ///
   /// /***************************************************************
   ///  *    Transport Properties
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Types for transport properties model
   ///  */
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///         rpcrdma2_propid rdma_which;
   ///         opaque          rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const RDMA2_PROPID_SBSIZ = 1;
   /// const RDMA2_PROPID_RBSIZ = 2;
   /// const RDMA2_PROPID_RSSIZ = 3;
   /// const RDMA2_PROPID_RCSIZ = 4;
   /// const RDMA2_PROPID_BRS = 5;
   /// const RDMA2_PROPID_HOSTAUTH = 6;
   ///
   /// /*
   ///  * Types specific to particular properties
   ///  */
   /// typedef uint32 rpcrdma2_prop_sbsiz;
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef uint32 rpcrdma2_prop_rssiz;
   /// typedef uint32 rpcrdma2_prop_rcsiz;
   /// typedef uint32 rpcrdma2_prop_brs;
   /// typedef opaque rpcrdma2_prop_hostauth<>;
   ///
   /// const RDMA2_RVRSDIR_NONE = 0;
   /// const RDMA2_RVRSDIR_SIMPLE = 1;
   /// const RDMA2_RVRSDIR_CONT = 2;
   /// const RDMA2_RVRSDIR_GENL = 3;
   ///
   /// /* FILE ENDS: common.x; */

8.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types

   /// /***************************************************************
   ///  *    Descriptions of RPC-over-RDMA Header Types
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Header Type Codes: Control plane operations.
   ///  */
   /// const RDMA2_ERROR = 4;
   /// const RDMA2_GRANT = 5;
   /// const RDMA2_CONNPROP_MIDDLE = 6;
   /// const RDMA2_CONNPROP_FINAL = 7;
   ///
   /// /*
   ///  * Header Type Codes: Call messages.
   ///  */
   /// const RDMA2_CALL_EXTERNAL = 8;
   /// const RDMA2_CALL_MIDDLE = 9;
   /// const RDMA2_CALL_INLINE = 10;
   ///
   /// /*
   ///  * Header Type Codes: Reply messages.
   ///  */
   /// const RDMA2_REPLY_EXTERNAL = 11;
   /// const RDMA2_REPLY_MIDDLE = 12;
   /// const RDMA2_REPLY_INLINE = 13;
   ///
   /// /*
   ///  * Header Type to Report Errors.
   ///  */
   /// const RDMA2_ERR_VERS = 1;
   /// const RDMA2_ERR_BAD_XDR = 2;
   /// const RDMA2_ERR_BAD_PROPVAL = 3;
   /// const RDMA2_ERR_INVAL_HTYPE = 4;
   /// const RDMA2_ERR_INVAL_CONT = 5;
   /// const RDMA2_ERR_READ_CHUNKS = 6;
   /// const RDMA2_ERR_WRITE_CHUNKS = 7;
   /// const RDMA2_ERR_SEGMENTS = 8;
   /// const RDMA2_ERR_WRITE_RESOURCE = 9;
   /// const RDMA2_ERR_REPLY_RESOURCE = 10;
   /// const RDMA2_ERR_VERS_MISMATCH = 11;
   /// const RDMA2_ERR_SYSTEM = 100;
   ///
   /// struct rpcrdma2_err_vers {
   ///         uint32 rdma_vers_low;
   ///         uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///         uint32 rdma_chunk_index;
   ///         uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) {
   ///         case RDMA2_ERR_VERS:
   ///           rpcrdma2_err_vers rdma_vrange;
   ///         case RDMA2_ERR_READ_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_WRITE_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_SEGMENTS:
   ///           uint32 rdma_max_segments;
   ///         case RDMA2_ERR_WRITE_RESOURCE:
   ///           rpcrdma2_err_write rdma_writeres;
   ///         case RDMA2_ERR_REPLY_RESOURCE:
   ///           uint32 rdma_length_needed;
   ///         default:
   ///           void;
   /// };
   ///
   /// /*
   ///  * Header Type to Exchange Transport Properties.
   ///  */
   /// struct rpcrdma2_hdr_connprop {
   ///         rpcrdma2_propset rdma_props;
   /// };
   ///
   /// /*
   ///  * Header Types to Convey RPC Messages.
   ///  */
   /// struct rpcrdma2_hdr_call_external {
   ///         uint32 rdma_inv_handle;
   ///
   ///         struct rpcrdma2_read_list   *rdma_call;
   ///         struct rpcrdma2_read_list   *rdma_reads;
   ///         struct rpcrdma2_write_list  *rdma_provisional_writes;
   ///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
   /// };
   ///
   /// struct rpcrdma2_hdr_call_middle {
   ///         uint32 rdma_remaining;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_call_inline {
   ///         uint32 rdma_inv_handle;
   ///
   ///         struct rpcrdma2_read_list   *rdma_reads;
   ///         struct rpcrdma2_write_list  *rdma_provisional_writes;
   ///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_external {
   ///         struct rpcrdma2_write_list  *rdma_writes;
   ///         struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_middle {
   ///         uint32 rdma_remaining;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_inline {
   ///         struct rpcrdma2_write_list  *rdma_writes;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// /* FILE ENDS: baseops.x; */

8.5.  Use of the XDR Description

   The files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.

   Although this XDR description can generate encoders and decoders for
   the Transport and Payload streams, there are elements of the
   operation of RPC-over-RDMA version 2 that cannot be expressed within
   the XDR language. Implementations that use the output of an
   automated XDR processor need to provide additional code to bridge
   these gaps.

   *  The Transport stream is not a single XDR object. Instead, the
      header prefix is one XDR data item, and the rest of the header is
      a separate XDR data item. Table 2 expresses the mapping between
      the header type in the header prefix and the XDR object
      representing the header type.

   *  The relationship between the Transport stream and the Payload
      stream is not specified using XDR. Comments within the XDR text
      make clear where transported messages, described by their own XDR
      definitions, need to appear. Such data is opaque to the
      transport.

   *  Continuation of RPC messages across transport message boundaries
      requires that message assembly facilities not specifiable within
      XDR are part of transport implementations.

   *  Transport properties are constant integer values. Table 1
      expresses the mapping between each property's code point and the
      XDR typedef that represents the structure of the property's value.
      XDR does not possess the facility to express that mapping in an
      extensible way.

   The role of XDR in RPC-over-RDMA specifications is more limited than
   for protocols where the totality of the protocol is expressible
   within XDR. XDR lacks the facility to represent the embedding of
   XDR-encoded payload material.
   Also, the need to cleanly accommodate extensions has meant that
   those using rpcgen in their applications need to take an active
   role to provide the facilities that cannot be expressed within XDR.

9.  RPC Bind Parameters

   Before establishing a new connection, an RPC client obtains a
   transport address for the RPC server. The means used to obtain this
   address and to open an RDMA connection is dependent on the type of
   RDMA transport and is the responsibility of each RPC protocol binding
   and its local implementation.

   RPC services typically register with a portmap or rpcbind service
   [RFC1833], which associates an RPC Program number with a service
   address. This policy is no different with RDMA transports. However,
   a distinct service address (port number) is sometimes required for
   operation on RPC-over-RDMA.

   When mapped atop MPA [RFC5044], which uses IP port addressing due to
   its layering on TCP or SCTP, port mapping is trivial and consists
   merely of issuing the port in the connection process. The NFS/RDMA
   protocol service address has been assigned port 20049 by IANA for
   this deployment scenario [RFC8267].

   When mapped atop InfiniBand [IBA], which uses a service endpoint
   naming scheme based on a Group Identifier (GID), a translation MUST
   be employed. One such translation is described in Annexes A3
   (Application Specific Identifiers), A4 (Sockets Direct Protocol
   (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate
   for translating IP port addressing to the InfiniBand network.
   Therefore, in this case, IP port addressing may be readily employed
   by the upper layer.
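
   As an illustration of the service address that rpcbind associates
   with a program number, the sketch below formats an IPv4 address and
   port as a universal address string. It assumes the dotted
   "h1.h2.h3.h4.p1.p2" form used with rpcbind for TCP-style netids,
   where p1 and p2 are the high and low octets of the port. The
   function name to_uaddr and the example address 192.0.2.1 are
   hypothetical, and whether a given deployment registers its service
   this way depends on the Upper-Layer Binding.

```python
import ipaddress

NFS_RDMA_PORT = 20049  # IANA-assigned NFS/RDMA port cited above


def to_uaddr(ipv4, port):
    """Format an IPv4 address and port as 'h1.h2.h3.h4.p1.p2'."""
    addr = ipaddress.IPv4Address(ipv4)  # raises ValueError if malformed
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    # p1 is the high octet of the port, p2 the low octet.
    return "%s.%d.%d" % (addr, port >> 8, port & 0xFF)
```

   For example, to_uaddr("192.0.2.1", NFS_RDMA_PORT) yields the string
   "192.0.2.1.78.81", since 20049 = 78 * 256 + 81.
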

   When a mapping standard or convention exists for IP ports on an RDMA
   interconnect, there are several possibilities for each upper layer to
   consider:

   *  One possibility is to have the server register its mapped IP port
      with the rpcbind service under the netid (or netids) defined in
      [RFC8166]. An RPC-over-RDMA-aware RPC client can then resolve its
      desired service to a mappable port and proceed to connect. This
      method is the most flexible and compatible approach for those
      upper layers that are defined to use the rpcbind service.

   *  A second possibility is to have the RPC server's portmapper
      register itself on the RDMA interconnect at a "well-known" service
      address (on UDP or TCP, this corresponds to port 111). An RPC
      client can connect to this service address and use the portmap
      protocol to obtain a service address in response to a program
      number (e.g., a TCP port number or an InfiniBand GID).

   *  Alternately, an RPC client can connect to the mapped well-known
      port for the service itself, if it is appropriately defined. By
      convention, the NFS/RDMA service, when operating atop an
      InfiniBand fabric, uses the same 20049 assignment as for MPA.

   Historically, different RPC protocols have taken different approaches
   to their port assignments. The current document leaves the specific
   method to each RPC-over-RDMA-enabled ULB.

   [RFC8166] defines two new netid values to be used for registration of
   upper layers atop MPA and (when a suitable port translation service
   is available) InfiniBand. Additional RDMA-capable networks MAY
   define their own netids, or if they provide a port translation, they
   MAY share the one defined in [RFC8166].

10.  Implementation Status

   This section is to be removed before publishing as an RFC.

   This section records the status of known implementations of the
   protocol defined by this specification at the time of posting of this
   Internet-Draft, and is based on a proposal described in [RFC7942].
   The description of implementations in this section is intended to
   assist the IETF in its decision processes in progressing drafts to
   RFCs.

   Please note that the listing of any individual implementation here
   does not imply endorsement by the IETF. Furthermore, no effort has
   been spent to verify the information presented here that was supplied
   by IETF contributors. This is not intended as, and must not be
   construed to be, a catalog of available implementations or their
   features. Readers are advised to note that other implementations may
   exist.

   At this time, no known implementations of the protocol described in
   the current document exist.

11. Security Considerations

11.1. Memory Protection

   A primary consideration is the protection of the integrity and
   confidentiality of host memory by an RPC-over-RDMA transport. The
   use of an RPC-over-RDMA transport protocol MUST NOT introduce
   vulnerabilities to system memory contents nor memory owned by user
   processes. Any RDMA provider used for RPC transport MUST conform to
   the requirements of [RFC5042] to satisfy these protections.

11.1.1. Protection Domains

   The use of a Protection Domain to limit the exposure of memory
   regions to a single connection is critical. Any attempt by an
   endpoint not participating in that connection to reuse memory handles
   needs to result in immediate failure of that connection. Because ULP
   security mechanisms rely on this aspect of Reliable Connected
   behavior, implementations SHOULD cryptographically authenticate
   connection endpoints.

11.1.2. Handle (STag) Predictability

   Implementations should use unpredictable memory handles for any
   operation requiring exposed memory regions. Exposing a continuously
   registered memory region allows a remote host to read or write to
   that region even when an RPC involving that memory is not underway.
   Therefore, implementations should avoid the use of persistently
   registered memory.

11.1.3. Memory Protection

   Requesters should register memory regions for remote access only when
   they are about to be the target of an RPC transaction that involves
   an RDMA Read or Write.

   Requesters should invalidate memory regions as soon as related RPC
   operations are complete. Invalidation and DMA unmapping of memory
   regions should complete before the receiver checks message integrity,
   and before the RPC consumer can use or alter the contents of the
   exposed memory region.

   An RPC transaction on a Requester can terminate before a Reply
   arrives, for example, if the RPC consumer is signaled, or a
   segmentation fault occurs. When an RPC terminates abnormally, memory
   regions associated with that RPC should be invalidated before the
   Requester reuses those regions for other purposes.

11.1.4. Denial of Service

   A detailed discussion of denial-of-service exposures that can result
   from the use of an RDMA transport appears in Section 6.4 of
   [RFC5042].

   A Responder is not obliged to pull unreasonably large Read chunks. A
   Responder can use an RDMA2_ERROR response to terminate RPCs with
   unreadable Read chunks. If a Responder transmits more data than a
   Requester is prepared to receive in a Write or Reply chunk, the RDMA
   provider typically terminates the connection. For further
   discussion, see Section 6.3.1. Such repeated connection termination
   can deny service to other users sharing the connection from the
   errant Requester.
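
   As an informal illustration of the Responder-side policy described
   in this subsection, the sketch below sums the segment lengths in a
   Read list and declines to pull the data when the total exceeds a
   local limit. The limit value, the ReadSegment field names, and the
   function name are assumptions made for this example only; they are
   not defined by this document. A Responder that declines would then
   report the failure with an RDMA2_ERROR response.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical local policy limit (1 MiB); not a value from this document.
MAX_READ_CHUNK_BYTES = 1 << 20


@dataclass
class ReadSegment:
    """Illustrative view of one Read segment; field names are assumed."""
    position: int  # XDR stream position of the reduced data item
    handle: int    # memory handle (STag) advertised by the Requester
    length: int    # number of octets to pull
    offset: int    # offset within the registered memory region


def should_pull(read_list: List[ReadSegment],
                limit: int = MAX_READ_CHUNK_BYTES) -> bool:
    """Return True if the Responder is willing to pull the whole Read list."""
    total = sum(seg.length for seg in read_list)
    return total <= limit
```

   A Responder might call should_pull() before posting any RDMA Read,
   so that an oversized request is rejected without consuming RDMA
   resources.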

   An RPC-over-RDMA transport implementation is not responsible for
   throttling the RPC request rate, other than to keep the number of
   concurrent RPC transactions at or under the per-connection credit
   window (see Section 4.2.1). A sender can trigger a self-denial of
   service by exceeding the credit window repeatedly.

   When an RPC transaction terminates due to a signal or premature exit
   of an application process, a Requester should invalidate the RPC's
   Write and Reply chunks. Invalidation prevents the subsequent arrival
   of the Responder's Reply from altering the memory regions associated
   with those chunks after the Requester has released that memory.

   On the Requester, a malfunctioning application or a malicious user
   can create a situation where RPCs initiate and abort continuously,
   resulting in Responder replies that terminate the underlying
   RPC-over-RDMA connection repeatedly. Such situations can deny
   service to other users sharing the connection from that Requester.

11.2. RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework
   [RFC7861]. RPCSEC_GSS implements message authentication
   (rpc_gss_svc_none), per-message integrity checking
   (rpc_gss_svc_integrity), and per-message confidentiality
   (rpc_gss_svc_privacy) in a layer above the RPC-over-RDMA transport.
   The integrity and privacy services require significant computation
   and movement of data on each endpoint host. As a result, some of the
   performance benefits enabled by RDMA transports can be lost.

11.2.1. RPC-over-RDMA Protection at Other Layers

   For any RPC transport, utilizing RPCSEC_GSS integrity or privacy
   services has performance implications. Protection below the RPC
   implementation is often a better choice in performance-sensitive
   deployments, especially if it, too, can be offloaded.
   Certain implementations of IPsec can be co-located in RDMA hardware,
   for example, without change to RDMA consumers and with little loss of
   data movement efficiency. Such arrangements can also provide a
   higher degree of privacy by hiding endpoint identity or altering the
   frequency at which messages are exchanged, at a performance cost.

   Implementations MAY negotiate the use of protection in another layer
   through the use of an RPCSEC_GSS security flavor defined in [RFC7861]
   in conjunction with the Channel Binding mechanism [RFC5056] and IPsec
   Channel Connection Latching [RFC5660].

11.2.2. RPCSEC_GSS on RPC-over-RDMA Transports

   Not all RDMA devices and fabrics support the above protection
   mechanisms. Also, NFS clients, where multiple users can access NFS
   files, still require per-message authentication. In these cases,
   RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
   connections.

   RPCSEC_GSS extends the ONC RPC protocol without changing the format
   of RPC messages. By observing the conventions described in this
   section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected
   RPC messages interoperably.

   Senders MUST NOT reduce protocol elements of RPCSEC_GSS that appear
   in the Payload stream of an RPC-over-RDMA message. Such elements
   include control messages exchanged as part of establishing or
   destroying a security context, or data items that are part of
   RPCSEC_GSS authentication material.

11.2.2.1. RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to
   establish a Generic Security Service (GSS) context for NFS operation.
   Such clients use TCP and the standard NFS port (2049) for context
   establishment. Therefore, an NFS server MUST also provide a
   TCP-based NFS service on port 2049 to enable the use of RPCSEC_GSS
   with NFS/RDMA.

11.2.2.2. RPC-over-RDMA with RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the
   DDP-eligibility of data items in a ULP.

   However, RPCSEC_GSS authentication material appearing in an RPC
   message header can be larger than, say, an AUTH_SYS authenticator.
   In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester
   needs to accommodate a larger RPC credential when marshaling RPC
   Calls and needs to provide for a maximum size RPCSEC_GSS verifier
   when allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are larger on average as a
   result. ULP operations that fit in a Simple Format message when a
   simpler form of authentication is in use might need to be reduced or
   conveyed via a Special Format message when RPCSEC_GSS authentication
   is in use. It is therefore more likely that a Requester provisions
   both a Read list and a Reply chunk in the same RPC-over-RDMA
   Transport header to convey a Special Format Call and provision a
   receptacle for a Special Format Reply.

   In addition to this cost, the XDR encoding and decoding of each RPC
   message using RPCSEC_GSS authentication requires per-message host
   compute resources to construct the GSS verifier.

11.2.2.3. RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect the
   modification of RPC messages in flight. The RPCSEC_GSS privacy
   service prevents all but the intended recipient from viewing the
   cleartext content of RPC arguments and results. RPCSEC_GSS integrity
   and privacy services are end-to-end. They protect RPC arguments and
   results from application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC
   messages after they have been XDR encoded, and before they have been
   XDR decoded after receipt.
   Connection endpoints use intermediate buffers to prevent exposure of
   encrypted or unverified cleartext data to RPC consumers. After a
   sender has verified, encrypted, and wrapped a message, the transport
   layer MAY use RDMA data transfer between these intermediate buffers.

   The process of reducing a DDP-eligible data item removes the data
   item and its XDR padding from an encoded Payload stream. In a
   non-protected RPC-over-RDMA message, a reduced data item does not
   include XDR padding. After reduction, the Payload stream contains
   fewer octets than the whole XDR stream did beforehand. XDR padding
   octets are often zero bytes, but they don't have to be. Thus,
   reducing DDP-eligible items affects the result of message integrity
   verification and encryption.

   Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
   integrity or encryption services are in use. Effectively, no data
   item is DDP-eligible in this situation. Senders can use only Simple
   and Continued Formats without data item chunks, or Special Format.
   In this mode, an RPC-over-RDMA transport operates in the same manner
   as a transport that does not support DDP.

11.2.2.4. Protecting RPC-over-RDMA Transport Headers

   RPCSEC_GSS does not protect the RPC-over-RDMA Transport stream, just
   as it does not protect the header fields in an RPC message (e.g., the
   xid and mtype fields). XIDs, connection credit limits, and chunk
   lists (though not the content of the data items they refer to) are
   exposed to malicious behavior, which can redirect data that is
   transferred by the RPC-over-RDMA message, result in spurious
   retransmits, or trigger connection loss.

   In particular, if an attacker alters the information contained in the
   chunk lists of an RPC-over-RDMA Transport header, data contained in
   those chunks can be redirected to other registered memory regions on
   Requesters.
   An attacker might alter the arguments of RDMA Read and RDMA Write
   operations on the wire to gain a similar effect. If such
   alterations occur, the use of RPCSEC_GSS integrity or privacy
   services enables a Requester to detect unexpected material in a
   received RPC message.

   Encryption at other layers, as described in Section 11.2.1, protects
   the content of the Transport stream. RDMA transport implementations
   should conform to [RFC5042] to address attacks on RDMA protocols
   themselves.

11.3. Transport Properties

   Like other fields that appear in the Transport stream, transport
   properties are sent in the clear with no integrity protection, making
   them vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size, it could reduce connection performance or
   trigger loss of connection. Repeated connection loss can impact
   performance or even prevent a new connection from being established.
   The recourse is to deploy on a private network or use transport layer
   encryption.

11.4. Host Authentication

   [ cel: This subsection is unfinished. ]

   Wherein we use the relevant sections of [RFC3552] to analyze the
   addition of host authentication to this RPC-over-RDMA transport.

   The authors refer readers to Appendix C of [RFC8446] for information
   on how to design and test a secure authentication handshake
   implementation.

12. IANA Considerations

   The RPC-over-RDMA family of transports has been assigned RPC netids
   by [RFC8166]. A netid is an rpcbind [RFC1833] string used to
   identify the underlying protocol in order for RPC to select
   appropriate transport framing and the format of the service addresses
   and ports.

   The following netid registry strings are already defined for this
   purpose:

      NC_RDMA    "rdma"
      NC_RDMA6   "rdma6"

   The "rdma" netid is to be used when IPv4 addressing is employed by
   the underlying transport, and "rdma6" when IPv6 addressing is
   employed. The netid assignment policy and registry are defined in
   [RFC5665]. The current document does not alter these netid
   assignments.

   These netids MAY be used for any RDMA network that satisfies the
   requirements of Section 3.2.2 and that is able to identify service
   endpoints using IP port addressing, possibly through use of a
   translation service as described in Section 9.

13. References

13.1. Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching",
              RFC 5660, DOI 10.17487/RFC5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call
              (RPC) Network Identifiers and Universal Address Formats",
              RFC 5665, DOI 10.17487/RFC5665, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", BCP 205,
              RFC 7942, DOI 10.17487/RFC7942, July 2016.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
              to RPC-over-RDMA Version 1", RFC 8267,
              DOI 10.17487/RFC8267, October 2017.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
              Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018.

13.2. Informative References

   [CBFC]     Kung, H.T., Blackwell, T., and A. Chapman, "Credit-Based
              Flow Control for ATM Networks: Credit Update Protocol,
              Adaptive Credit Allocation, and Statistical Multiplexing",
              Proc. ACM SIGCOMM '94 Symposium on Communications
              Architectures, Protocols and Applications, pp. 101-114,
              August 1994.

   [I-D.ietf-nfsv4-rpc-tls]
              Myklebust, T. and C. Lever, "Towards Remote Procedure Call
              Encryption By Default", Work in Progress, Internet-Draft,
              draft-ietf-nfsv4-rpc-tls-11, 23 November 2020.

   [IBA]      InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.
              Available from https://www.infinibandta.org/

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              DOI 10.17487/RFC3552, July 2003.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5044]  Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
              Carrier, "Marker PDU Aligned Framing for TCP
              Specification", RFC 5044, DOI 10.17487/RFC5044, October
              2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement",
              RFC 5532, DOI 10.17487/RFC5532, May 2009.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on
              RPC-over-RDMA Transports", RFC 8167,
              DOI 10.17487/RFC8167, June 2017.

   [RFC8881]  Noveck, D., Ed. and C. Lever, "Network File System (NFS)
              Version 4 Minor Version 1 Protocol", RFC 8881,
              DOI 10.17487/RFC8881, August 2020.

Appendix A. ULB Specifications

   Typically, an Upper-Layer Protocol (ULP) is defined without regard to
   a particular RPC transport. An Upper-Layer Binding (ULB)
   specification provides guidance that helps a ULP interoperate
   correctly and efficiently over a particular transport. For
   RPC-over-RDMA version 2, a ULB may provide:

   *  A taxonomy of XDR data items that are eligible for DDP

   *  Constraints on which upper-layer procedures a sender may reduce,
      and on how many chunks may appear in a single RPC message

   *  A method enabling a Requester to determine the maximum size of the
      reply Payload stream for all procedures in the ULP

   *  An rpcbind port assignment for the RPC Program and Version when
      operating on the particular transport

   Each RPC Program and Version tuple that operates on RPC-over-RDMA
   version 2 needs to have a ULB specification.

A.1. DDP-Eligibility

   A ULB designates specific XDR data items as eligible for DDP. As a
   sender constructs an RPC-over-RDMA message, it can remove
   DDP-eligible data items from the Payload stream so that the RDMA
   provider can place them directly in the receiver's memory. An XDR
   data item should be considered for DDP-eligibility if there is a
   clear benefit to moving the contents of the item directly from the
   sender's memory to the receiver's memory.

   Criteria for DDP-eligibility include:

   *  The XDR data item is frequently sent or received, and its size is
      often much larger than typical inline thresholds.

   *  If the XDR data item is a result, its maximum size must be
      predictable in advance by the Requester.

   *  Transport-level processing of the XDR data item is not needed.
      For example, the data item is an opaque byte array, which requires
      no XDR encoding and decoding of its content.

   *  The content of the XDR data item is sensitive to address
      alignment. For example, a data copy operation would be required
      on the receiver to enable the message to be parsed correctly, or
      to enable the data item to be accessed.

   *  The XDR data item itself does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible,
   a ULB may limit the use of chunks to particular upper-layer
   procedures. If more than one data item in a procedure is
   DDP-eligible, the ULB may limit the number of chunks that a Requester
   can provide for a particular upper-layer procedure.

   Senders never reduce data items that are not DDP-eligible. Such data
   items can, however, be part of a Special Format payload.

   The programming interface by which an upper-layer implementation
   indicates the DDP-eligibility of a data item to the RPC transport is
   not described by this specification. The only requirements are that
   the receiver can re-assemble the transmitted RPC-over-RDMA message
   into a valid XDR stream and that DDP-eligibility rules specified by
   the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR
   language. The only definitive specification of DDP-eligibility is a
   ULB.

   In general, a DDP-eligibility violation occurs when:

   *  A Requester reduces a non-DDP-eligible argument data item. The
      Responder reports the violation as described in Section 6.3.1.

   *  A Responder reduces a non-DDP-eligible result data item.
      The Requester terminates the pending RPC transaction and reports
      an appropriate permanent error to the RPC consumer.

   *  A Responder does not reduce a DDP-eligible result data item into
      an available Write chunk. The Requester terminates the pending
      RPC transaction and reports an appropriate permanent error to the
      RPC consumer.

A.2. Maximum Reply Size

   When expecting small and moderately-sized Replies, a Requester should
   rely on Message Continuation rather than provision a Reply chunk.
   For each ULP procedure where there is no clear Reply size maximum and
   the maximum can be substantial, the ULB should specify a dependable
   means for determining the maximum Reply size.

A.3. Reverse-Direction Operation

   The direction of operation does not preclude the need for
   DDP-eligibility statements.

   Reverse-direction operation occurs on an already-established
   connection. Specification of RPC binding parameters is usually not
   necessary in this case.

   Other considerations may apply when distinct RPC Programs share an
   RPC-over-RDMA transport connection concurrently.

A.4. Additional Considerations

   There may be other details provided in a ULB.

   *  A ULB may recommend inline threshold values or other
      transport-related parameters for RPC-over-RDMA version 2
      connections bearing that ULP.

   *  A ULP may provide a means to communicate transport-related
      parameters between peers.

   *  Multiple ULPs may share a single RPC-over-RDMA version 2
      connection when their ULBs allow the use of RPC-over-RDMA version
      2 and the rpcbind port assignments for those protocols permit
      connection sharing. In this case, the same transport parameters
      (such as inline threshold) apply to all ULPs using that
      connection.

   Each ULB needs to be designed to allow correct interoperation without
   regard to the transport parameters actually in use.
   Furthermore, implementations of ULPs must be designed to interoperate
   correctly regardless of the connection parameters in effect on a
   connection.

A.5. ULP Extensions

   An RPC Program and Version tuple may be extensible. For instance,
   the RPC version number may not reflect a ULP minor versioning scheme,
   or the ULP may allow the specification of additional features after
   the publication of the original RPC Program specification. ULBs are
   provided for interoperable RPC Programs and Versions by extending
   existing ULBs to reflect the changes made necessary by each addition
   to the existing XDR.

   [ cel: The final sentence is unclear, and may be inaccurate. I
   believe I copied this section directly from RFC 8166. Is there more
   to be said, now that we have some experience? ]

Appendix B. Extending RPC-over-RDMA Version 2

   This Appendix is not addressed to protocol implementers, but rather
   to authors of documents that extend the protocol specified in the
   current document.

   RPC-over-RDMA version 2 extensibility facilitates limited extensions
   to the base protocol presented in the current document so that new
   optional capabilities can be introduced without a protocol version
   change while maintaining robust interoperability with existing
   RPC-over-RDMA version 2 implementations. It allows extensions to be
   defined, including the definition of new protocol elements, without
   requiring modification or recompilation of the XDR for the base
   protocol.

   Standards Track documents may introduce extensions to the base
   RPC-over-RDMA version 2 protocol in two ways:

   *  They may introduce new OPTIONAL transport header types.
      Appendix B.2 covers such transport header types.

   *  They may define new OPTIONAL transport properties. Appendix B.3
      describes such transport properties.

   These documents may also add the following sorts of ancillary
   protocol elements to the protocol to support the addition of new
   transport properties and header types:

   *  They may create new error codes, as described in Appendix B.4.

   New capabilities can be proposed and developed independently of each
   other. Implementers can choose among them, making it straightforward
   to create and document experimental features and then bring them
   through the standards process.

B.1. Documentation Requirements

   As described earlier, a Standards Track document introduces a set of
   new protocol elements. Together these elements are considered an
   OPTIONAL feature. Each implementation is either aware of all the
   protocol elements introduced by that feature or is aware of none of
   them.

   Documents specifying extensions to RPC-over-RDMA version 2 should
   contain:

   *  An explanation of the purpose and use of each new protocol
      element.

   *  An XDR description including all of the new protocol elements, and
      a script to extract it.

   *  A discussion of interactions with other extensions. This
      discussion includes requirements for other OPTIONAL features to be
      present, or that a particular level of support for an OPTIONAL
      facility is required.

   Implementers combine the XDR descriptions of the new features they
   intend to use with the XDR description of the base protocol in the
   current document. This combination is necessary to create a valid
   XDR input file because extensions are free to use XDR types defined
   in the base protocol, and later extensions may use types defined by
   earlier extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol
   combined with that for any selected extensions should provide a
   human-readable and compilable definition of the extended protocol.

B.2. Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are defined in a manner similar to those
   in Sections 6.3.5 through 6.3.10. In particular, what is needed is:

   *  A description of the function and use of the new header type.

   *  A complete XDR description of the new header type.

   *  A description of how receivers report errors, including mechanisms
      for reporting errors outside the choices already available in the
      base protocol or other extensions.

   *  An indication of whether a Payload stream must be present, and a
      description of its contents and how receivers use such Payload
      streams to reconstruct RPC messages.

   *  As appropriate, a statement of whether a Responder may use Remote
      Invalidation when sending messages that contain the new header
      type.

   The OPTIONAL status of new transport header types makes the
   following additional documentation necessary:

   *  The document should discuss constraints on support for the new
      header types. For example, if support for one header type is
      implied or foreclosed by another one, this needs to be documented.

   *  The document should describe the preferred method by which a
      sender determines whether its peer supports a particular header
      type. It is always possible to send a test invocation of a
      particular header type to see if support is available. However,
      when more efficient means are available (e.g., the value of a
      transport property), this should be noted.

B.3. Adding New Transport Properties to the Protocol

   A Standards Track document defining a new transport property should
   include the following information paralleling that provided in this
   document for the transport properties defined herein:

   *  The rpcrdma2_propid value identifying the new property.

   *  The XDR typedef specifying the structure of its property value.

   *  A description of the new property.

   *  An explanation of how the receiver can use this information.

   *  The default value if a peer never receives the new property.

   There is no requirement that propid assignments occur in a continuous
   range of values. Implementations should not rely on all such values
   being small integers.

   Before the defining Standards Track document is published, the nfsv4
   Working Group should select a unique propid value, and ensure that:

   *  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   *  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 8.2.

   [ cel: There is no longer a section 8.2 or an experimental range
   of propid values. Should we request the creation of an IANA
   registry for propid values? ]

   When a Standards Track document proposes additional transport
   properties, reviewers should deal with possible security issues
   exposed by those new transport properties.

B.4. Adding New Error Codes to the Protocol

   The same Standards Track document that defines a new header type may
   introduce new error codes used to support it. A Standards Track
   document may similarly define new error codes that an existing header
   type can return.

   For error codes that do not require the return of additional
   information, a peer can use the existing RDMA2_ERROR header type to
   report the new error. The sender sets the new error code as the
   value of rdma_err with the result that the default switch arm of the
   rpcrdma2_error (i.e., void) is selected.

   For error codes that do require the return of related information
   together with the error, a new header type should be defined that
   returns the error together with the related information.
   The sender of a new header type needs to be prepared to accept header
   types necessary to report associated errors.

Appendix C. Differences from RPC-over-RDMA Version 1

   The primary goal of RPC-over-RDMA version 2 is to relieve constraints
   that have become evident in RPC-over-RDMA version 1 with deployment
   experience:

   *  RPC-over-RDMA version 1 has been challenging to update to address
      shortcomings or improve data transfer efficiency.

   *  The average size of NFSv4 COMPOUNDs is significantly greater than
      that of NFSv3 requests, requiring the use of Long messages for
      frequent operations.

   *  Reply size estimation is difficult more often than first expected.

   This section details specific changes in RPC-over-RDMA version 2 that
   address these constraints directly, in addition to other changes to
   make implementation easier.

C.1. Changes to the XDR Definition

   Several XDR structural changes enable within-version protocol
   extensibility.

   [RFC8166] defines the RPC-over-RDMA version 1 transport header as a
   single XDR object, with an RPC message potentially following it. In
   RPC-over-RDMA version 2, there are separate XDR definitions of the
   transport header prefix (see Section 6.4), which specifies the
   transport header type to be used, and the transport header itself
   (defined within one of the subsections of Section 6.3). This
   construction is similar to an RPC message, which consists of an RPC
   header (defined in [RFC5531]) followed by a message defined by an
   Upper-Layer Protocol.

   As a new version of the RPC-over-RDMA transport protocol,
   RPC-over-RDMA version 2 exists within the versioning rules defined in
   [RFC8166].
In particular, it maintains the first four words of the protocol header, as specified in Section 4.2 of [RFC8166], even though, as explained in Section 6.2.1 of the current document, the XDR definition of those words is structured differently.

Although each of the first four fields retains its semantic function, there are differences in interpretation:

* The first word of the header, the rdma_xid field, retains the format and function that it had in RPC-over-RDMA version 1. Because RPC-over-RDMA version 2 messages can convey non-RPC messages, a receiver should not use the contents of this field without consideration of the protocol version and header type.

* The second word of the header, the rdma_vers field, retains the format and function that it had in RPC-over-RDMA version 1. To clearly distinguish version 1 and version 2 messages, senders need to fill in the correct version (fixed after version negotiation). Receivers should check that the content of the rdma_vers field is correct before using the content of any other header field.

* The third word of the header, the rdma_credit field, retains the size and general purpose that it had in RPC-over-RDMA version 1. However, RPC-over-RDMA version 2 divides this field into two 16-bit subfields. See Section 4.2.1 for further details.

* The fourth word of the header, previously the union discriminator field rdma_proc, retains its format and general function even though the set of valid values has changed. Within RPC-over-RDMA version 2, this word is the rdma_htype field of the structure rdma_start. The value of this field is now an unsigned 32-bit integer rather than an enum type, to facilitate header type extension.

Beyond conforming to the restrictions specified in [RFC8166], RPC-over-RDMA version 2 attempts to limit the scope of the changes made to ensure interoperability.
Although it introduces the Call chunk and splits the two version 1 workhorse procedure types RDMA_MSG and RDMA_NOMSG into several variants, RPC-over-RDMA version 2 otherwise expresses chunks in the same format and utilizes them the same way.

C.2. Transport Properties

RPC-over-RDMA version 2 provides a mechanism for exchanging an implementation's operational properties. The purpose of this exchange is to help endpoints improve the efficiency of data transfer by exploiting the characteristics of both peers rather than falling back on the lowest common denominator default settings. A full discussion of transport properties appears in Section 5.

C.3. Credit Management Changes

RPC-over-RDMA transports employ credit-based flow control to ensure that a Requester does not emit more RDMA Sends than the Responder is prepared to receive.

Section 3.3.1 of [RFC8166] explains the operation of RPC-over-RDMA version 1 credit management in detail. In that design, each RDMA Send from a Requester contains an RPC Call with a credit request, and each RDMA Send from a Responder contains an RPC Reply with a credit grant. The credit grant implies that enough Receives have been posted on the Responder to handle the credit grant minus the number of pending RPC transactions (the number of remaining Receive buffers might be zero).

Each RPC Reply acts as an implicit ACK for a previous RPC Call from the Requester. Without an RPC Reply message, the Requester has no way to know that the Responder is ready for subsequent RPC Calls.

Because version 1 embeds credit management in each message, there is a strict one-to-one ratio between RDMA Send and RPC message. There are interesting use cases that might be enabled if this relationship were more flexible:

* RPC-over-RDMA operations that do not carry an RPC message, e.g., control plane operations.

* A single RDMA Send that conveys more than one RPC message, e.g., for interrupt mitigation.

* An RPC message that requires several sequential RDMA Sends, e.g., to reduce the use of explicit RDMA operations for moderate-sized RPC messages.

* An RPC transaction that requires multiple exchanges or an odd number of RPC-over-RDMA operations to complete.

RPC-over-RDMA version 2 provides a more sophisticated credit accounting mechanism to address these shortcomings. Section 4.2.1 explains the new mechanism in detail.

C.4. Inline Threshold Changes

An "inline threshold" value is the largest message size (in octets) that can be conveyed on an RDMA connection using only RDMA Send and Receive. Each connection has two inline threshold values: one for messages flowing from client-to-server (referred to as the "client-to-server inline threshold") and one for messages flowing from server-to-client (referred to as the "server-to-client inline threshold").

A connection's inline thresholds determine, among other things, when RDMA Read or Write operations are required because an RPC message cannot be conveyed via a single RDMA Send and Receive pair. When an RPC message does not contain DDP-eligible data items, a Requester can prepare a Special Format Call or Reply to convey the whole RPC message using RDMA Read or Write operations.

RDMA Read and Write operations require that data payloads reside in memory registered with the local RNIC. When an RPC completes, that memory is invalidated to fence it from the Responder. Memory registration and invalidation typically have a latency cost that is insignificant compared to data handling costs.

When a data payload is small, however, the cost of registering and invalidating memory where the payload resides becomes a significant part of total RPC latency.
Therefore, the most efficient operation of an RPC-over-RDMA transport occurs when the peers use explicit RDMA Read and Write operations for large payloads but avoid those operations for small payloads.

When the authors of [RFC8166] first conceived RPC-over-RDMA version 1, the average size of RPC messages that did not involve a significant data payload was under 500 bytes. A 1024-byte inline threshold adequately minimized the frequency of inefficient Long messages.

With NFS version 4 [RFC7530], the increased size of NFS COMPOUND operations resulted in RPC messages that are, on average, larger than those of previous versions of NFS. With a 1024-byte inline threshold, frequent operations such as GETATTR and LOOKUP require RDMA Read or Write operations, reducing the efficiency of data transport.

To reduce the frequency of Special Format messages, RPC-over-RDMA version 2 increases the default size of inline thresholds. This change also increases the maximum size of reverse-direction RPC messages.

C.5. Message Continuation Changes

In addition to a larger default inline threshold, RPC-over-RDMA version 2 introduces Message Continuation. Message Continuation is a mechanism that enables the transmission of a data payload using more than one RDMA Send. The purpose of Message Continuation is to provide relief in several essential cases:

* If a Requester finds that it is inefficient to convey a moderately-sized data payload using Read chunks, the Requester can use Message Continuation to send the RPC Call.

* If a Requester has provided insufficient Reply chunk space for a Responder to send an RPC Reply, the Responder can use Message Continuation to send the RPC Reply.

* If a sender has to convey a sizeable non-RPC data payload (e.g., a large transport property), the sender can use Message Continuation to avoid having to register memory.

C.6. Host Authentication Changes

For the general operation of NFS on open networks, we eventually intend to rely on RPC-on-TLS [I-D.ietf-nfsv4-rpc-tls] to provide cryptographic authentication of the two ends of each connection. In turn, this can improve the trustworthiness of AUTH_SYS-style user identities that flow on TCP, which are not cryptographically protected. We do not have a similar solution for RPC-over-RDMA, however.

Here, the RDMA transport layer already provides a strong guarantee of message integrity. On some network fabrics, IPsec or TLS can protect the privacy of in-transit data. However, this is not the case for all fabrics (e.g., InfiniBand [IBA]).

Thus, RPC-over-RDMA version 2 introduces a mechanism for authenticating connection peers (see Section 5.2.6). And like GSS channel binding, there is also a way to determine when the use of host authentication is unnecessary.

C.7. Support for Remote Invalidation

When an RDMA consumer uses FRWR or Memory Windows to register memory, that memory may be invalidated remotely [RFC5040]. These mechanisms are available when a Requester's RNIC supports MEM_MGT_EXTENSIONS.

For this discussion, there are two classes of STags. Dynamically-registered STags appear in a single RPC, then are invalidated. Persistently-registered STags survive longer than one RPC. They may persist for the life of an RPC-over-RDMA connection or even longer.

An RPC-over-RDMA Requester can provide more than one STag in a transport header. It may provide a combination of dynamically- and persistently-registered STags in one RPC message, or any combination of these in a series of RPCs on the same connection.
Only dynamically-registered STags using Memory Windows or FRWR may be invalidated remotely.

There is no transport-level mechanism by which a Responder can determine how a Requester-provided STag was registered, nor whether it is eligible to be invalidated remotely. A Requester that mixes persistently- and dynamically-registered STags in one RPC, or mixes them across RPCs on the same connection, must, therefore, indicate which STag the Responder may invalidate remotely via a mechanism provided in the Upper-Layer Protocol. RPC-over-RDMA version 2 provides such a mechanism.

A sender uses the RDMA Send With Invalidate operation to invalidate an STag on the remote peer. It is available only when both peers support MEM_MGT_EXTENSIONS (can send and process an IETH).

Existing RPC-over-RDMA transport protocol specifications [RFC8166] [RFC8167] do not forbid direct data placement in the reverse direction. However, there is currently no Upper-Layer Protocol that makes data items in reverse-direction operations eligible for direct data placement.

When chunks are present in a reverse-direction RPC request, Remote Invalidation enables the Responder to trigger invalidation of a Requester's STags as part of sending an RPC Reply, the same way as is done in the forward direction.

However, in the reverse direction, the server acts as the Requester, and the client is the Responder. The server's RNIC, therefore, must support receiving an IETH, and the server must have registered its STags with an appropriate registration mechanism.

C.8. Integration of Reverse-Direction Operation

Because [RFC5666] did not include specification of reverse-direction operation, [RFC8166] does not include it either. Reverse-direction operation in RPC-over-RDMA version 1 is specified by a separate standards track document [RFC8167].
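
As background for this section, the following non-normative sketch illustrates what determining message direction can look like for a version 1 receiver that supports reverse-direction operation: it reads the msg_type word that [RFC5531] places immediately after the XID in every RPC message. The function name rpc_direction and its rpc_payload argument are hypothetical, and the sketch assumes the caller has already located the XDR-encoded RPC message that follows the transport header.

```python
import struct

# msg_type values from the RPC message definition in [RFC5531].
CALL = 0
REPLY = 1


def rpc_direction(rpc_payload: bytes) -> str:
    """Classify an inline RPC message as "call" or "reply".

    rpc_payload is assumed to be the XDR-encoded RPC message that
    follows the transport header; locating it is out of scope here.
    """
    if len(rpc_payload) < 8:
        raise ValueError("payload too short to hold xid and msg_type")
    # XDR integers are 32-bit big-endian: word 0 is the XID and
    # word 1 is the msg_type discriminator.
    _xid, msg_type = struct.unpack_from(">II", rpc_payload, 0)
    if msg_type == CALL:
        return "call"
    if msg_type == REPLY:
        return "reply"
    raise ValueError(f"unknown msg_type {msg_type}")
```

A receiver built this way has to inspect the RPC message before it can interpret the transport header fields that accompany it.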

Reverse-direction operation in RPC-over-RDMA version 1 was constrained by the limited ability to extend that version of the protocol. The most awkward issue is that a receiver needs to peek at each ingress RPC message payload to determine whether it is a Call or Reply message. This is necessary because the meaning of several fields in the RPC-over-RDMA transport header is determined by the direction of the RPC message payload:

* The meaning of the value in the rdma_xid field is determined by the direction of the message because the XID spaces in the forward and reverse directions are distinct.

* The meaning of the value in the rdma_credit field is determined by the direction of the message because credits are granted separately for forward and reverse direction operation.

* The purpose of Write chunks and the meaning of their length fields are determined by the direction of the message because in Call messages, they are provisional, but in Reply messages, they represent returned results.

The current document remedies this awkwardness by integrating reverse-direction operation into RPC-over-RDMA version 2 so that it can make use of all facilities that are available in the forward direction, including body chunks, remote invalidation, and message continuation. To enable this integration, the direction of the RPC message payload is encoded in each RPC-over-RDMA version 2 transport header.

C.9. Error Reporting Changes

RPC-over-RDMA version 2 expands the repertoire of errors that connection peers may report to each other. The goals of this expansion are:

* To fill in details of peer recovery actions.

* To enable retrying certain conditions caused by mis-estimation of the maximum reply size.

* To minimize the likelihood of a Requester waiting forever for a Reply when there are communications problems that prevent the Responder from sending it.

C.10. Changes in Terminology

The RPC-over-RDMA version 2 specification makes the following changes in terminology. These changes do not result in changes in the behavior or operation of the protocol.

* The current document explicitly acknowledges the different semantics and purpose of Write chunks appearing in Call messages and those appearing in Reply messages.

* The current document introduces the term "payload format" to describe the selection of a mechanism for reducing and conveying an RPC message payload. It replaces the terms "short message" and "long message" with the terms "simple format" and "special format" because this selection is not based only on the size of the payload.

* The current document introduces the terms "data item chunk" and "body chunk" in order to distinguish the purpose and operation of these two categories of chunk.

* For improved readability, the current document replaces the terms "RDMA segment" and "plain segment" with the term "segment", and the term "RDMA read segment" with the term "Read segment".

* The current document refers specifically to the RDMAP, DDP, and MPA standards track protocols rather than using the nebulous term "iWARP".

Acknowledgments

The authors gratefully acknowledge the work of Brent Callaghan and Tom Talpey on the original RPC-over-RDMA version 1 specification [RFC5666]. The authors also wish to thank Bill Baker, Greg Marsden, and Matt Benjamin for their support of this work.

The XDR extraction conventions were first described by the authors of the NFS version 4.1 XDR specification [RFC5662]. Herbert van den Bergh suggested the replacement sed script used in this document.

Special thanks go to Transport Area Director Magnus Westerlund, NFSv4 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSv4 Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

Charles Lever (editor)
Oracle Corporation
United States of America

Email: chuck.lever@oracle.com

David Noveck
NetApp
1601 Trapelo Road
Waltham, MA 02451
United States of America

Phone: +1 781 572 8038
Email: davenoveck@gmail.com