Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: 6 July 2022                                              NetApp
                                                          2 January 2022

                    RPC-over-RDMA Version 2 Protocol
                draft-ietf-nfsv4-rpcrdma-version-two-06

Abstract

   This document specifies the second version of a transport protocol
   that conveys Remote Procedure Call (RPC) messages using Remote Direct
   Memory Access (RDMA). This version of the protocol is extensible.

Note

   This note is to be removed before publishing as an RFC.

   Discussion of this draft takes place on the NFSv4 working group
   mailing list (nfsv4@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group
   information can be found at
   https://datatracker.ietf.org/wg/nfsv4/about/.

   The source for this draft is maintained in GitHub. Suggested changes
   can be submitted as pull requests at
   https://github.com/chucklever/i-d-rpcrdma-version-two. Instructions
   are on that page as well.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 July 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1. Introduction
     1.1. Design Goals
     1.2. Motivation for a New Version
   2. Requirements Language
   3. Terminology
     3.1. Remote Procedure Calls
     3.2. Remote Direct Memory Access
   4. RPC-over-RDMA Framework
     4.1. Message Framing
     4.2. Reliable Message Delivery
     4.3. Initial Connection State
     4.4. Using Direct Data Placement
     4.5. Encoding Chunks
     4.6. Reverse-Direction Operation
     4.7. Call-Only Operation
   5. Transport Properties
     5.1. Transport Properties Model
     5.2. Current Transport Properties
   6. Transport Messages
     6.1. Transport Header Types
     6.2. Headers and Chunks
     6.3. Header Types
     6.4. Transport Header Prefix
     6.5. Remote Invalidation
     6.6. Payload Formats
   7. Error Handling
     7.1. Basic Transport Stream Parsing Errors
     7.2. XDR Errors
     7.3. Responder RDMA Operational Errors
     7.4. Other Operational Errors
     7.5. RDMA Transport Errors
   8. XDR Protocol Definition
     8.1. Code Component License
     8.2. Extraction of the XDR Definition
     8.3. XDR Definition for RPC-over-RDMA Version 2 Core Structures
     8.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types
     8.5. Use of the XDR Description
   9. RPC Bind Parameters
   10. Implementation Status
   11. Security Considerations
     11.1. Memory Protection
     11.2. RPC Message Security
     11.3. Transport Properties
     11.4. Host Authentication
   12. IANA Considerations
   13. References
     13.1. Normative References
     13.2. Informative References
   Appendix A. ULB Specifications
     A.1. DDP-Eligibility
     A.2. Maximum Reply Size
     A.3. Reverse-Direction Operation
     A.4. Additional Considerations
     A.5. ULP Extensions
   Appendix B. Extending RPC-over-RDMA Version 2
     B.1. Documentation Requirements
     B.2. Adding New Header Types to RPC-over-RDMA Version 2
     B.3. Adding New Transport properties to the Protocol
     B.4. Adding New Error Codes to the Protocol
   Appendix C. Differences from RPC-over-RDMA Version 1
     C.1. Changes to the XDR Definition
     C.2. Transport Properties
     C.3. Credit Management Changes
     C.4. Inline Threshold Changes
     C.5. Message Continuation Changes
     C.6. Host Authentication Changes
     C.7. Support for Remote Invalidation
     C.8. Integration of Reverse-Direction Operation
     C.9. Error Reporting Changes
     C.10. Changes in Terminology
   Acknowledgments
   Authors' Addresses

1. Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a
   technique for moving data efficiently between network nodes. By
   placing transferred data directly into destination buffers using
   Direct Memory Access, RDMA delivers the reciprocal benefits of faster
   data transfer and reduced host CPU overhead.

   Open Network Computing Remote Procedure Call (ONC RPC, often
   shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure
   Call protocol that runs over a variety of transports. Most RPC
   implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, a
   datagram encapsulates each RPC message. Within a TCP byte stream, a
   record marking protocol delineates RPC messages.

   An RDMA transport, too, conveys RPC messages in a fashion that must
   be fully defined if RPC implementations are to interoperate when
   using RDMA to transport RPC transactions. Although RDMA transports
   encapsulate messages like UDP, they deliver them reliably and in
   order, like TCP. Further, they implement a bulk data transfer
   service not provided by traditional network transports. Therefore,
   we treat RDMA as a novel transport type for RPC.

1.1. Design Goals

   The general mission of RPC-over-RDMA transports is to leverage
   network hardware capabilities to reduce host CPU needs related to the
   transport of RPC messages.
   In particular, this includes mitigating
   host interrupt rates and limiting the necessity to copy RPC payload
   bytes on receivers.

   These hardware capabilities benefit both RPC clients and servers. On
   balance, however, the RPC-over-RDMA protocol design approach has been
   to bolster clients more than servers, as the client is typically
   where applications are most hungry for CPU resources.

   Additionally, RPC-over-RDMA transports are designed to support RPC
   applications transparently. However, such transports can also
   provide mechanisms that enable further optimization of data transfer
   when RPC applications are structured to exploit direct data
   placement. In this context, the Network File System (NFS) family of
   protocols (as described in [RFC1094], [RFC1813], [RFC7530],
   [RFC7862], [RFC8881], and subsequent NFSv4 minor versions) are all
   potential beneficiaries of RPC-over-RDMA.

   A complete problem statement appears in [RFC5532].

1.2. Motivation for a New Version

   Storage administrators have broadly deployed the RPC-over-RDMA
   version 1 protocol specified in [RFC8166]. However, there are known
   shortcomings to this protocol:

   * The protocol's default size of Receive buffers forces the use of
     RDMA Read and Write transfers for small payloads, and limits the
     size of reverse-direction messages.

   * It is difficult to make optimizations or protocol fixes that
     require changes to on-the-wire behavior.

   * For some RPC procedures, the maximum reply size is difficult or
     impossible for an RPC client to estimate in advance.

   To address these issues in a way that preserves interoperation with
   existing RPC-over-RDMA version 1 deployments, the current document
   presents an updated version of the RPC-over-RDMA transport protocol.

   This version of RPC-over-RDMA is extensible, enabling the
   introduction of OPTIONAL extensions without impacting existing
   implementations.
   See Appendix C.1 for further discussion. It
   introduces a mechanism to exchange implementation properties to
   automatically provide further optimization of data transfer.

   This version also contains incremental changes that relieve
   performance constraints and enable recovery from unusual corner
   cases. These changes are outlined in Appendix C and include a larger
   default inline threshold, the ability to convey a single RPC message
   using multiple RDMA Send operations, support for authentication of
   connection peers, richer error reporting, improved credit-based flow
   control, and support for Remote Invalidation.

2. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3. Terminology

3.1. Remote Procedure Calls

   This section highlights critical elements of the RPC protocol
   [RFC5531] and the External Data Representation (XDR) [RFC4506] it
   uses. RPC-over-RDMA version 2 enables the transmission of RPC
   messages built using XDR and also uses XDR internally to describe its
   header format.

3.1.1. Upper-Layer Protocols

   RPCs are an abstraction used to implement the operations of an
   Upper-Layer Protocol (ULP). For RPC-over-RDMA, "ULP" refers to an RPC
   Program and Version tuple, which is a versioned set of procedure
   calls that comprise a single well-defined API. One example of a ULP
   is the Network File System Version 4.0 [RFC7530]. In the current
   document, the term "RPC consumer" refers to an implementation of a
   ULP running on an RPC client.

3.1.2. RPC Procedures

   Like a local procedure call, every RPC procedure has a set of
   "arguments" and a set of "results".
   A calling context invokes an RPC
   procedure, passing arguments to it, and the procedure subsequently
   returns a set of results. Unlike a local procedure call, an RPC
   procedure is executed remotely rather than in the local application's
   execution context.

3.1.3. RPC Transactions

   The RPC protocol as described in [RFC5531] is fundamentally a
   message-passing protocol between one or more clients, where RPC
   consumers are running, and a server, where a remote execution context
   is available to process RPC transactions on behalf of these
   consumers.

   ONC RPC transactions consist of two types of messages:

   * A CALL message, or "Call", requests work. An RPC Call message is
     designated by the value zero (0) in the message's msg_type field.

   * A REPLY message, or "Reply", reports the results of work requested
     by an RPC Call message. An RPC Reply message is designated by the
     value one (1) in the message's msg_type field.

   Section 9 of [RFC5531] introduces the RPC transaction identifier, or
   "XID" for short. Each connection endpoint interprets the value of an
   XID in the context of the message's msg_type field.

   * The sender of a Call message generates an arbitrary XID value for
     each RPC that is unique among outstanding Calls from that sender.

   * The sender of a Reply message copies the XID of the initiating
     Call to the Reply containing the results of that procedure.

   After receiving a Reply, a Requester then matches the XID value in
   that Reply with a Call it previously sent.

   The ratio of Call messages to Reply messages is typically but not
   always one-to-one.

   The most common operational paradigm is when a Requester sends a Call
   message to a Responder, who then sends a Reply message back to the
   Requester with the results of that procedure. One Call message
   elicits a single Reply message in response.
   A Responder never sends
   more than one Reply for each received Call message.

   A "retransmission" occurs when a Requester sends exactly the same
   Call message, with the same arguments and XID, more than once. A
   Requester can retransmit if it believes the network layer or
   Responder has dropped a Call message, or if the Responder's Reply has
   been likewise lost. To prevent unnecessary network traffic or the
   execution of non-idempotent procedures multiple times, Requesters
   avoid retransmitting needlessly.

   In rare cases, an RPC procedure may not require any results or even
   acknowledgement that the Responder has executed the procedure. In
   that case, the Requester sends a Call message but no Reply is
   returned. This document refers to that case as "Call-only".

3.1.4. Message Serialization

   RPC messages are always transmitted atomically. RPC peers may
   interleave messages, but the contents of individual messages cannot
   be broken up or interleaved without making the messages illegible.

   An RPC peer acting as a "Requester" serializes the procedure's
   arguments and conveys them to a "Responder" endpoint via an RPC Call
   message. A Call message contains an RPC protocol header with a
   unique XID, a header describing the requested upper-layer operation,
   and all arguments.

   An RPC peer acting as a "Responder" deserializes these arguments and
   processes the requested procedure. It then serializes the
   procedure's results into an RPC Reply message. An RPC Reply message
   contains an RPC protocol header with the same XID, a header
   describing the upper-layer reply, and all results.

   The Requester deserializes the results and allows the RPC consumer to
   proceed. At this point, the RPC transaction designated by the XID in
   the RPC Call message is complete, and the XID is retired.

3.1.5. RPC Transports

   The role of an "RPC transport" is to mediate the exchange of RPC
   messages between Requesters and Responders, bridging the gap between
   the RPC message abstraction and the native operations of a network
   transport (e.g., a socket).

   When an RPC transport type is connection-oriented, RPC client
   endpoints initiate transport connections, while RPC server endpoints
   wait passively to accept incoming connection requests. RPC messages
   may also be exchanged without a connection association. Because
   RPC-over-RDMA is a connection-oriented RPC transport, connectionless
   operation is not discussed further in the current document.

3.1.5.1. Transport Failure Recovery

   So that appropriate and timely recovery action can be taken, the
   transport implementation is responsible for notifying a Requester
   when an RPC Call or Reply could not be conveyed. Recovery can
   take the form of establishing a new connection, re-sending RPC Calls,
   or terminating RPC transactions pending on the Requester.

   For instance, a connection loss may occur after a Responder has
   received an RPC Call but before it can send the matching RPC Reply.
   Once the transport notifies the Requester of the connection loss, the
   Requester can re-send all pending RPC Calls on a fresh connection.

3.1.5.2. Forward Direction

   Traditionally, an RPC client acts as a Requester, while an RPC
   service acts as a Responder. The current document refers to this
   direction of RPC message passing as "forward-direction" operation.

3.1.5.3. Reverse-Direction

   The RPC specification [RFC5531] does not forbid performing RPC
   transactions in the other direction. An RPC service endpoint can act
   as a Requester, in which case an RPC client endpoint acts as a
   Responder. This direction of RPC message passing is known as
   "reverse-direction" operation.

   During reverse-direction operation, an RPC client is responsible for
   establishing transport connections, even though the RPC server
   originates RPC Calls.

   RPC clients and servers are usually optimized to perform and scale
   well when handling traffic in the forward direction. They might not
   be prepared to handle operation in the reverse direction. Not until
   NFS version 4.1 [RFC8881] has there been a strong need to handle
   reverse-direction operation.

3.1.5.4. Bi-directional Operation

   A pair of connected RPC endpoints may choose to use only
   forward-direction or only reverse-direction operation on a particular
   transport connection. Or, these endpoints may send Calls in both
   directions concurrently on the same transport connection.

   "Bi-directional operation" occurs when both transport endpoints act
   as a Requester and a Responder at the same time on a single
   connection.

   Bi-directionality is an extension of RPC transport connection
   sharing. Two RPC endpoints wish to exchange independent RPC messages
   over a shared connection but in opposite directions. These messages
   may or may not be related to the same workloads or RPC Programs.

   During bi-directional operation, forward- and reverse-direction XIDs
   are typically generated on distinct hosts by possibly different
   algorithms. There is no coordination between the generation of XIDs
   used in forward-direction and reverse-direction operation.

   Therefore, a forward-direction Requester MAY use the same XID value
   at the same time as a reverse-direction Requester on the same
   transport connection. Although such concurrent requests use the same
   XID value, they represent distinct RPC transactions.

3.1.6. External Data Representation

   One cannot assume that all Requesters and Responders represent data
   objects in the same way internally.
   RPC uses External Data
   Representation (XDR) to translate native data types and serialize
   arguments and results [RFC4506].

   XDR encodes data independently of the endianness or size of
   host-native data types, enabling unambiguous decoding of data by a
   receiver.

   XDR assumes only that the number of bits in a byte (octet) and their
   order are the same on both endpoints and the physical network. The
   smallest indivisible unit of XDR encoding is a group of four octets.
   XDR can also flatten lists, arrays, and other complex data types into
   a stream of bytes.

   We refer to a serialized stream of bytes that is the result of XDR
   encoding as an "XDR stream". A sender encodes native data into an
   XDR stream and then transmits that stream to a receiver. The
   receiver decodes incoming XDR byte streams into its native data
   representation format.

3.1.6.1. XDR Opaque Data

   Sometimes, a data item is to be transferred as-is, without encoding
   or decoding. We refer to the contents of such a data item as "opaque
   data". XDR encoding places the content of opaque data items directly
   into an XDR stream without altering it in any way. ULPs or
   applications perform any needed data translation in this case.
   Examples of opaque data items include the content of files or generic
   byte strings.

3.1.6.2. XDR Roundup

   The number of octets in a variable-length data item precedes that
   item in an XDR stream. If the size of an encoded data item is not a
   multiple of four octets, the sender appends octets containing zero
   after the end of the data item. These zero octets shift the next
   encoded data item in the XDR stream so that it always starts on a
   four-octet boundary. The addition of extra octets does not change
   the encoded size of the data item. Receivers do not expose the extra
   octets to ULPs.
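
   The padding rule above can be illustrated with a short sketch. It is
   not part of the protocol definition. The function names
   (xdr_pad_len, xdr_encode_opaque, xdr_decode_opaque) are hypothetical
   and chosen only for this example, and the sketch assumes the
   variable-length opaque encoding of [RFC4506], in which a four-octet
   big-endian length precedes the content.

```python
import struct


def xdr_pad_len(length: int) -> int:
    """Number of zero octets needed to reach the next four-octet boundary."""
    return (4 - length % 4) % 4


def xdr_encode_opaque(data: bytes) -> bytes:
    """Encode a variable-length opaque item: length, content, zero padding."""
    # The four-octet length counts only the content, never the padding.
    header = struct.pack(">I", len(data))
    return header + data + b"\x00" * xdr_pad_len(len(data))


def xdr_decode_opaque(stream: bytes, offset: int = 0):
    """Return (content, offset of the next item); padding is skipped."""
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    content = stream[start:start + length]
    # The receiver steps over the padding without exposing it to the caller.
    return content, start + length + xdr_pad_len(length)
```

   For a five-octet item, the encoder in this sketch emits the length
   (5), the five content octets, and three zero octets, so the next item
   starts on a four-octet boundary. The decoder returns only the five
   content octets.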

   We refer to this technique as "XDR roundup", and the extra octets as
   "XDR roundup padding".

3.2. Remote Direct Memory Access

   When a third party transfers large RPC payloads, RPC Requesters and
   Responders can become more efficient. An example of such a third
   party might be an intelligent network interface (data movement
   offload), which places data in the receiver's memory so that no
   additional adjustment of data alignment is necessary (direct data
   placement or "DDP"). RDMA transports enable both of these
   optimizations.

   In the current document, the standalone term "RDMA" refers to the
   physical mechanism an RDMA transport utilizes when moving data.

3.2.1. Direct Data Placement

   Typically, RPC implementations copy the contents of RPC messages into
   a buffer before sending them. An efficient RPC implementation sends
   bulk data without first copying it into a separate send buffer.

   However, socket-based RPC implementations are often unable to receive
   data directly into its final place in memory. Receivers often need
   to copy incoming data to finish an RPC operation, if only to adjust
   data alignment.

   Although it may not be efficient, before an RDMA transfer, a sender
   may copy data into an intermediate buffer. After an RDMA transfer, a
   receiver may copy that data again to its final destination. In this
   document, the term "DDP" refers to any optimized data transfer where
   a receiving host's CPU does not move transferred data to another
   location after arrival.

   RPC-over-RDMA version 2 enables the use of RDMA Read and Write
   operations to achieve both data movement offload and DDP. However,
   note that not all RDMA-based data transfer qualifies as DDP, and some
   mechanisms that do not employ explicit RDMA can place data directly.

3.2.2. RDMA Transport Operation

   RDMA transports require that RDMA consumers provision resources in
   advance to achieve good performance during receive operations. An
   RDMA consumer might provide Receive buffers in advance by posting an
   RDMA Receive Work Request for every expected RDMA Send from a remote
   peer. These buffers are provided before the remote peer posts RDMA
   Send Work Requests. Thus this is often referred to as "pre-posting"
   buffers.

   An RDMA Receive Work Request remains outstanding until the RDMA
   provider matches it to an inbound Send operation. The resources
   associated with that Receive must be retained in host memory, or
   "pinned", until the Receive completes.

   Given these tenets of operation, the RPC-over-RDMA version 2 protocol
   assumes each transport provides the following abstract operations. A
   more complete discussion of these operations appears in [RFC5040].

3.2.2.1. Memory Registration

   Memory registration assigns a steering tag to a region of memory,
   permitting the RDMA provider to perform data-transfer operations.
   The RPC-over-RDMA version 2 protocol assumes that a steering tag of
   no more than 32 bits and memory addresses of up to 64 bits in length
   identify each registered memory region.

3.2.2.2. RDMA Send

   The RDMA provider supports an RDMA Send operation, with completion
   signaled on the receiving peer after the RDMA provider has placed
   data in a pre-posted buffer. Sends complete at the receiver in the
   order they were posted at the sender. The size of the remote peer's
   pre-posted buffers limits the amount of data that can be transferred
   by a single RDMA Send operation.

3.2.2.3. RDMA Receive

   The RDMA provider supports an RDMA Receive operation to receive data
   conveyed by incoming RDMA Send operations.
   To reduce the amount of
   memory that must remain pinned awaiting incoming Sends, the amount of
   memory posted per Receive is limited. The RDMA consumer (in this
   case, the RPC-over-RDMA version 2 protocol) provides flow control to
   prevent overrunning receiver resources.

3.2.2.4. RDMA Write

   The RDMA provider supports an RDMA Write operation to place data
   directly into a remote memory region. The local host initiates an
   RDMA Write and the RDMA provider signals completion there. The
   remote RDMA provider does not signal completion on the remote peer.
   The local host provides the steering tag, the memory address, and the
   length of the remote peer's memory region.

   RDMA Writes are not ordered relative to one another, but are ordered
   relative to RDMA Sends. Thus, a subsequent RDMA Send completion
   signaled on the local peer guarantees that prior RDMA Write data has
   been successfully placed in the remote peer's memory.

3.2.2.5. RDMA Read

   The RDMA provider supports an RDMA Read operation to place remote
   source data directly into local memory. The local host initiates an
   RDMA Read and the RDMA provider signals completion there. The
   remote RDMA provider does not signal completion on the remote peer.
   The local host provides the steering tags, the memory addresses, and
   the lengths for the remote source and local destination memory
   regions.

   The RDMA consumer (in this case, the RPC-over-RDMA version 2
   protocol) signals Read completion to the remote peer as part of a
   subsequent RDMA Send message. The remote peer can then invalidate
   steering tags and subsequently free associated source memory regions.

4. RPC-over-RDMA Framework

   Before an RDMA data transfer can occur, an endpoint first exposes
   regions of its memory to a remote endpoint. The remote endpoint then
   initiates RDMA Read and Write operations against the exposed memory.
   A "transfer model" designates which endpoint exposes its memory and
   which is responsible for initiating the transfer of data.

   In RPC-over-RDMA version 2, only Requesters expose their memory to
   the Responder, and only Responders initiate RDMA Read and Write
   operations. Read access to memory regions enables the Responder to
   pull RPC arguments or whole RPC Calls from each Requester. The
   Responder pushes RPC results or whole RPC Replies to a Requester's
   memory regions to which it has write access.

4.1. Message Framing

   Each RPC-over-RDMA version 2 message consists of at most two XDR
   streams:

   * The "Transport stream" contains a header that describes and
     controls the transfer of the Payload stream in this RPC-over-RDMA
     message. Every RDMA Send on an RPC-over-RDMA version 2 connection
     MUST begin with a Transport stream.

   * The "Payload stream" contains part or all of a single RPC message.
     The sender MAY divide an RPC message at any convenient boundary
     but MUST send RPC message fragments in XDR stream order and MUST
     NOT interleave Payload streams from multiple RPC messages.

   The RPC-over-RDMA framing mechanism described in this section
   replaces all other RPC framing mechanisms. Connection peers use
   RPC-over-RDMA framing even when the underlying RDMA protocol runs on
   a transport type with well-defined RPC framing, such as TCP. However,
   a ULP can negotiate the use of RDMA, dynamically enabling the use of
   RPC-over-RDMA on a connection established on some other transport
   type. Because RPC framing delimits an entire RPC request or reply,
   the resulting shift in framing must occur between distinct RPC
   messages, and in concert with the underlying transport.

4.2. Reliable Message Delivery

   RPC-over-RDMA provides a reliable and in-order data transport service
   for RPC Calls and Replies.

RPC-over-RDMA transports MUST operate only on a reliable Queue Pair
(QP) such as the RDMA RC (Reliable Connected) QP type as defined in
Section 9.7.7 of [IBA].  The Marker PDU Aligned (MPA) protocol
[RFC5044], when deployed on a reliable transport such as TCP,
provides similar functionality.  Using a reliable QP type ensures
in-transit data integrity and proper recovery from packet loss in the
lower layers.

If any pre-posted Receive buffer on the connection is not large
enough to contain an incoming message, the receiving RDMA provider
cannot deliver that message to the upper-layer consumer.  Likewise,
if no pre-posted Receive buffer is available to accept an incoming
message, the receiving RDMA provider cannot pass that message to the
consumer.  Exceeding these limits results in a transition to a QP
error state, the loss of an in-flight message, and the potential loss
of the connection.

Therefore, senders need to respect peer receiver resource limits to
ensure that the transport service can deliver every message reliably.
Two operational parameters communicate these limits between
RPC-over-RDMA peers: credits and inline threshold.

4.2.1.  Flow Control

RPC-over-RDMA version 2 employs end-to-end credit-based flow control
on each connection to prevent senders from transmitting more messages
than a receiver is prepared to accept [CBFC].  Credit-based flow
control is relatively simple, providing automated management of
receive buffer allocation and robust operation in the face of bursty
traffic while enabling effective pipelining.  The RPC-over-RDMA
version 2 flow control mechanism relies on reliable and in-order
message delivery guarantees provided by the underlying RDMA transport
service.

An RPC-over-RDMA version 2 credit represents the capability to convey
exactly one RPC-over-RDMA version 2 message, regardless of its size,
via an RDMA Send/Receive pair.  Because an RPC-over-RDMA version 2
connection is full-duplex, each connection peer has its own set of
credits.  The two receivers manage their credit limits independently,
although they communicate these values by piggy-backing them on a
message in the opposite direction.

A peer tracks a few critical values for each connection.  It uses
these values to determine when it is safe to send a message on the
connection.

Sent message count:  The total number of RDMA Send operations it has
   posted.  The peer MUST set this value to zero (0) when a
   connection is first established.

Received message count:  The total number of RDMA Receive channel
   operations that have completed.  The peer MUST set this value to
   zero (0) when a connection is first established.

Credit limit:  The number of RDMA Receive operations that are
   currently posted.

Remote credits:  The value in the rdma_start.rdma_credit field in the
   most recently received message from its peer.  The peer MUST set
   this value to one (1) before it has received a message on a new
   connection.

When constructing a new RPC-over-RDMA header, each sender MUST set
the header's rdma_start.rdma_credit field to its "Sent message count"
plus its "Credit limit".  The sender MUST NOT post this message if
the sender's "Sent message count" is greater than the current
"Remote credits" value.  To handle counter wrapping, the sender uses
appropriate modulo arithmetic to perform this comparison.

Because the rdma_start.rdma_credit field is 32 bits wide, a
receiver's credit limit MUST be less than 2^31 - 1.
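
The text above leaves "appropriate modulo arithmetic" unspecified.  As
a rough illustration only, the sketch below shows one common way to
compare two 32-bit counters that may wrap, in the style of serial
number arithmetic.  The names `wrap_greater` and `may_post` are
hypothetical and do not come from this document.  The sketch assumes
the two counters never differ by 2^31 or more, which appears to be the
purpose of the bound on the credit limit.

```python
MASK32 = 0xFFFFFFFF     # counters are carried in a 32-bit field
HALF_RANGE = 1 << 31    # differences at or beyond this count as "behind"


def wrap_greater(a: int, b: int) -> bool:
    """Return True if 32-bit counter a is ahead of b, allowing for wrap."""
    diff = (a - b) & MASK32
    return 0 < diff < HALF_RANGE


def may_post(sent_message_count: int, remote_credits: int) -> bool:
    """Apply the rule that a message is not posted when the Sent message
    count is greater than Remote credits (hypothetical helper)."""
    return not wrap_greater(sent_message_count, remote_credits)
```

For instance, if the peer's advertised value has wrapped around to 3
while the local Sent message count is still 0xFFFFFFFF, `may_post`
returns True because the difference is interpreted modulo 2^32.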

Given the bandwidth-delay product of the connection, a receiver
generally chooses a credit limit that is large enough to maximize
throughput while not overwhelming memory resources on the local
system.

A receiver MAY adjust its credit limit to match the needs or policies
in effect on either peer.  For instance, a peer may reduce its credit
limit to accommodate the available resources in a Shared Receive
Queue.  Certain RDMA implementations may impose additional
flow-control restrictions, such as limits on RDMA Read operations in
progress at the Responder.  Accommodation of such checks is
considered the responsibility of each RPC-over-RDMA version 2
implementation.

4.2.1.1.  Asynchronous Credit Grants

Credit accounting information is usually piggy-backed on
payload-bearing messages.  However, on occasion, a receiver might
need to refresh its credit limit without sending an RPC payload.  A
receiving peer can send a message using a special header type when
the sender's credit limit approaches exhaustion during a stream of
unacknowledged messages.  See Section 6.3.2 for information about
this header type.

Unlike RPC-over-RDMA version 1, the credit limit on an RPC-over-RDMA
version 2 connection MAY be zero.  In that case, the sender waits
until the receiver sends it an asynchronous credit refresh.  To
prevent a sender from ever having to wait for a credit limit refresh,
a good receiver implementation provides a credit refresh before half
its credit limit is exceeded.

To prevent transport deadlock, receivers MUST always be in a position
to receive one asynchronous credit update message, in addition to
payload-bearing messages.  A receiver can do this by posting one
more RDMA Receive than the credit limit it advertises to its
connection peer.

4.2.2.  Inline Threshold

An "inline threshold" value is the largest message size (in octets)
that can be conveyed in one direction between peer implementations
using RDMA Send and Receive channel operations.  An inline threshold
value is less than or equal to the largest number of octets the
sender can post in a single RDMA Send operation.  It is also less
than or equal to the largest number of octets the receiver can
reliably accept via a single RDMA Receive operation.

Each connection has two inline threshold values.  There is one for
messages flowing from Requester-to-Responder (referred to as the
"call inline threshold"), and one for messages flowing from
Responder-to-Requester (referred to as the "reply inline threshold").

Peers can advertise their inline threshold values via RPC-over-RDMA
version 2 Transport Properties (see Section 5).  In the absence of an
exchange of Transport Properties, connection peers MUST assume both
inline thresholds are 4096 octets.

4.3.  Initial Connection State

Immediately upon connection establishment, both peers MUST allow only
one outstanding RPC-over-RDMA message on the connection at a time
until the transport protocol version is established and both peers
have received an initial credit limit.  Note that because
RPC-over-RDMA versions 1 and 2 each use a different flow control
mechanism, the meaning of the value in the rdma_start.rdma_credit
field depends on the value in the rdma_start.rdma_vers field.

The second word of each transport header conveys the transport
protocol version.  Immediately after the client establishes a
connection, it sends a single valid RPC-over-RDMA message with the
value two (2) in the rdma_start.rdma_vers field.  Because the server
might support only RPC-over-RDMA version 1, this initial message MUST
NOT be larger than the version 1 default inline threshold of 1024
octets.

4.3.1.  Server Supports RPC-over-RDMA Version 2

If the server supports RPC-over-RDMA version 2, it sends
RPC-over-RDMA messages back to the client with the value two (2) in
the rdma_start.rdma_vers field.  Both peers may assume the default
inline threshold value for RPC-over-RDMA version 2 connections (4096
octets).

4.3.2.  Server Does Not Support RPC-over-RDMA Version 2

If the server does not support RPC-over-RDMA version 2, it MUST send
an RPC-over-RDMA message to the client with an XID that matches the
client's first message, RDMA2_ERROR in the rdma_start.rdma_htype
field, and with the error code RDMA2_ERR_VERS.  This message also
reports the range of RPC-over-RDMA protocol versions that the server
supports.  To continue operation, the client selects a protocol
version in that range for subsequent messages on this connection.

If the connection is dropped immediately after an
RDMA2_ERROR/RDMA2_ERR_VERS message is received, the client should try
to avoid a version negotiation loop when re-establishing another
connection.  It can assume that the server does not support
RPC-over-RDMA version 2.  A client can assume the same situation
(i.e., no server support for RPC-over-RDMA version 2) if the initial
negotiation message is lost or dropped.  Once the version negotiation
exchange is complete, both peers may use the default inline threshold
value for the negotiated transport protocol version.

4.3.3.  Client Does Not Support RPC-over-RDMA Version 2

The server examines the RPC-over-RDMA protocol version used in the
first RPC-over-RDMA message it receives.  If it supports this
protocol version, it MUST use it in all subsequent messages it sends
on that connection.  The client MUST NOT change the protocol version
for the duration of the connection.

4.4.  Using Direct Data Placement

RPC-over-RDMA version 2 provides a mechanism for moving part of an
RPC message via a data transfer distinct from RDMA Send and Receive.
For example, a sender can remove one or more XDR data items from the
Payload stream.  These items are then conveyed via other mechanisms,
such as one or more RDMA Read or Write operations.

4.4.1.  Chunks and Segments

A Requester records the location information for each registered
memory region associated with an RPC payload in the transport header
of an RPC-over-RDMA message.  With this information, the Responder
uses RDMA Read and Write operations to retrieve arguments contained
in the specified region of the Requester's memory or place results in
that region.

A "segment" is a transport header data object that contains the
precise coordinates of a contiguous registered memory region.  Each
segment contains the following information:

Handle:  A Steering Tag (STag) or R_key generated by registering this
   memory with the RDMA provider.

Length:  The length of the segment's memory region, in octets.  The
   length of a segment MAY be aligned to a single octet.  An "empty
   segment" is defined as a segment with the value zero (0) in its
   length field.

Offset:  The offset or beginning memory address of the segment's
   memory region.

The meaning of the values contained in these fields is elaborated in
[RFC5040].

A "chunk" is simply a set of segments that have a related purpose.  A
Requester MAY divide a chunk into segments using any convenient
boundaries.  The length of a chunk is defined as the sum of the
lengths of the segments that comprise it.

4.4.2.  Reducing a Payload Stream

We refer to a data item that a sender removes from a Payload stream
to transmit separately as a "reduced" data item.
After a sender has finished removing XDR data items from a Payload
stream, we refer to it as a "reduced" Payload stream.  A set of
segments that describe memory regions containing a single reduced
data item is categorized as a "data item chunk."

Not all XDR data items benefit from Direct Data Placement.  For
example, small data items or data items that require XDR unmarshaling
by the receiver do not benefit from DDP.  Moreover, it is impractical
for receivers to prepare for every possible XDR data item in a
protocol to appear in a data item chunk.

Specifying which data items are DDP-eligible is done in separate
standards track documents known as "Upper Layer Bindings".  A ULB
identifies which XDR data items a peer MAY transfer using DDP.  We
refer to such data items as "DDP-eligible."  Senders MUST NOT reduce
any other XDR data items.  Detailed requirements for ULB
specifications appear in Appendix A of the current document.

4.4.3.  Moving Whole RPC Messages using Explicit RDMA

RPC-over-RDMA version 2 also enables the movement of a whole RPC
message via data transfer distinct from RDMA Send and Receive.  A
sender registers the memory containing a Payload stream without
regard to data item boundaries or DDP-eligibility.  The Payload
stream is then conveyed via other mechanisms, such as one or more
RDMA Read or Write operations.  A set of segments that describe
memory regions containing a Payload stream is categorized as a "body
chunk".

A sender may first reduce that Payload stream if it contains one or
more DDP-eligible data items.  The sender moves these data items
using data item chunks, and the reduced Payload stream using a body
chunk.

4.5.  Encoding Chunks

The RPC-over-RDMA version 2 transport protocol does not place a limit
on chunk size.  However, each ULP may cap the amount of data that can
be transferred by a single RPC transaction.  For example, NFS
For example, NFS 885 implementations typically have settings that restrict the payload 886 size of NFS READ and WRITE operations. The Responder can use such 887 limits to sanity check chunk sizes before using them in RDMA 888 operations. 890 4.5.1. Read Chunks 892 A "Read chunk" contains data that its receiver pulls from the sender. 893 Each Read chunk is a set of one or more "Read segments" encoded as a 894 list. A Read segment consists of a Position field followed by a 895 segment, as defined in Section 4.4.1. 897 Position: The byte offset in the unreduced Payload stream where the 898 receiver reinserts the data item conveyed in the chunk. The 899 sender MUST compute the Position value from the beginning of the 900 unreduced Payload stream, which begins at Position zero. All 901 segments in the same Read chunk share the same Position value, 902 even if one or more of the segments have a non-four-byte-aligned 903 length. The value in this field MUST be a multiple of four. 905 When constructing an RPC-over-RDMA message, the sender registers 906 memory regions containing data intended for RDMA Read operations. It 907 advertises the coordinates of these regions in Read chunks added to 908 the transport header of an RPC-over-RDMA message. 910 The receiver of this message then pulls the chunk's data from the 911 sender using RDMA Read operations. When receiving a Read chunk, the 912 receiver inserts the first Read segment in a Read chunk into the 913 Payload stream at the byte offset indicated by its Position field. 914 The receiver concatenates Read segments whose Position field value 915 matches this offset until there are no more Read segments at that 916 Position value. 918 4.5.1.1. The Read List 920 Each RPC-over-RDMA message carries a list of Read segments that make 921 up the set of Read chunks for that message. When no RDMA Read 922 operations are needed to complete the transmission of the message's 923 Payload stream, the message's Read list is empty. 

If a Responder receives a Read list whose segment position values do
not appear in monotonically increasing order, it MUST discard the
message without processing it and respond with an RDMA2_ERROR message
with the rdma_xid field set to the XID of the malformed message and
the rdma_err field set to RDMA2_ERR_BAD_XDR.

4.5.1.2.  The Call Chunk

The Call chunk is a Read chunk that acts as a body chunk containing
an RPC Call message.  A Requester can utilize a Call chunk at any
time.  However, using a Call chunk is less efficient than an RDMA
Send.

A Read chunk may act as either a data item chunk or a body chunk.
When the chunk's position is zero, it acts as a body chunk.
Otherwise, it is a data item chunk containing exactly one XDR data
item.

4.5.1.3.  Read Completion

A Responder acknowledges that it is finished with the Requester's
Read chunk memory regions when it sends the corresponding RPC Reply
message.  The Requester may then invalidate memory regions belonging
to Read chunks associated with the corresponding RPC Call message.

4.5.2.  Write Chunks

Each "Write chunk" consists of a counted array of zero or more
segments, as defined in Section 4.4.1.  The function of a Write chunk
depends on the direction of the containing RPC-over-RDMA message.  In
a Call message, a Write chunk advertises registered memory regions
into which the Responder may push data.  In a Reply message, a Write
chunk reports how much data has been pushed.

A Requester provisions Write chunks for an RPC transaction long
before the Responder has constructed a corresponding Reply message.
A Requester typically does not know the actual length of the result
data items or Reply to be returned, since the Reply does not yet
exist.  Thus, a Requester MUST provision Write chunks large enough to
accommodate the maximum possible size of each returned data item.
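
As a rough illustration of the two roles of a Write chunk, the sketch
below takes the segment lengths a Requester provisioned in a Call and
computes the segment lengths a Responder would report in the Reply
after filling the chunk in segment array order (see Section 4.5.2.1).
The function name `reported_segment_lengths` is hypothetical, and
raising `ValueError` when the data item does not fit is only a
placeholder; this sketch does not model the transport's actual error
reporting.

```python
def reported_segment_lengths(provisioned, data_length):
    """Return per-segment lengths after writing data_length octets.

    provisioned: segment lengths (octets) from the Requester's Write chunk.
    Segments are filled contiguously in array order; segments that
    receive no data are reported with length zero.
    """
    if data_length > sum(provisioned):
        raise ValueError("data item exceeds the provisioned Write chunk")
    reported = []
    remaining = data_length
    for capacity in provisioned:
        used = min(capacity, remaining)  # fill this segment as far as needed
        reported.append(used)
        remaining -= used
    return reported
```

With three provisioned segments of 4096 octets each and a 5000-octet
result, the reported lengths would be 4096, 904, and 0.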

An "empty Write chunk" is a Write chunk with a zero segment count.
By definition, the length of an empty Write chunk is zero.  An
"unused Write chunk" has a non-zero segment count, but all of its
segments are empty segments.

4.5.2.1.  The Write List

Each RPC-over-RDMA message carries a list of Write chunks.  When no
DDP-eligible data items are to appear in the Reply to an RPC
transaction, the Requester provides an empty Write list in the RPC
Call, and the Responder leaves the Write list empty in the matching
RPC Reply.  When a Write chunk appears in the Write list, it acts
only as a data item chunk.

For each Write chunk in the Write list, the Responder pushes one
DDP-eligible data item to the Requester.  It fills the chunk
contiguously and in segment array order until the Responder has
written that data item to the Requester in its entirety.  The
Responder MUST copy the segment count and all segments from the
Requester-provided Write chunk into the RPC Reply message's transport
header.  As it does so, the Responder updates each segment length
field to reflect the actual amount of data returned in that segment.

The Responder then sends the RPC Reply message via an RDMA Send
operation.

4.5.2.2.  The Reply Chunk

The Reply chunk is a single Write chunk that acts as a body chunk
that contains an RPC Reply message.  When a Requester estimates that
the Reply message can exceed the connection's ability to convey that
Reply using RDMA Send operations, it should provision a Reply chunk.

4.5.2.3.  Write Completion

A Responder acknowledges that it is finished updating the Requester's
Write chunk memory regions when it sends the corresponding RPC Reply
message.  The RDMA provider guarantees that the written data is at
rest before the next Receive operation, which typically contains the
corresponding RPC Reply, completes.  The Requester may then
The Requester may then 1006 invalidate memory regions belonging to Write chunks associated with 1007 the associated RPC Call message. 1009 4.5.2.4. Write Chunk Roundup 1011 When provisioning a Write chunk for a variable-length result data 1012 item, the Requester MUST NOT include additional space for XDR roundup 1013 padding. A Responder MUST NOT write XDR roundup padding into a Write 1014 chunk, even if the result is shorter than the available space in the 1015 chunk. 1017 4.5.3. Reducing Complex XDR Data Types 1019 XDR data items may appear in body chunks without regard to their DDP- 1020 eligibility. As body chunks contain a Payload stream, they MUST 1021 include all appropriate XDR roundup padding to maintain proper XDR 1022 alignment of their contents. 1024 However, a data item chunk MUST contain only one XDR data item, and 1025 the chunk MUST occupy a four-byte aligned length in the Payload 1026 stream so that subsequent data items remain properly aligned once the 1027 reduced data item is removed from the Payload stream. 1029 4.5.3.1. Variable-Length Data Items 1031 When a sender reduces a variable-length XDR data item, the length of 1032 the item MUST remain in the Payload stream. The sender MUST omit the 1033 item's XDR roundup padding from the Payload stream and the chunk. 1034 The chunk's total length MUST be the same as the encoded length of 1035 the data item. 1037 4.5.3.2. Counted Arrays 1039 When reducing a data item that is a counted array data type, the 1040 count of array elements MUST remain in the Payload stream. The 1041 sender MUST move the array elements into the chunk. For example, 1042 when encoding an opaque byte array as a chunk, the count of bytes 1043 stays in the Payload stream, and the sender places the bytes in the 1044 array in the chunk. 1046 Individual array elements appear in a chunk in their entirety. 
For example, when encoding an array of arrays as a chunk, the count
of items in the enclosing array stays in the Payload stream.  But
each enclosed array, including its item count, is transferred as part
of the chunk.

4.5.3.3.  Optional-Data

Similar to a counted array, when reducing an optional-data data type,
the discriminator field MUST remain in the Payload stream.  The
sender MUST place the data, when present, in the chunk.

4.5.3.4.  XDR Unions

A union data type MUST NOT be made DDP-eligible.  However, one or
more of its arms MAY be made DDP-eligible, subject to the other
requirements in this section.

4.6.  Reverse-Direction Operation

The terminology used in this section is introduced in
Section 3.1.5.3.

4.6.1.  Sending a Reverse-Direction RPC Call

An RPC-over-RDMA server endpoint constructs the transport header for
a reverse-direction RPC Call as follows:

*  The server generates a new XID value (see Section 3.1.3 for full
   requirements) and places it in the rdma_xid field of the transport
   header and the xid field of the RPC Call message.  The RPC Call
   header MUST start with the same XID value that is present in the
   transport header.

*  The rdma_vers field of each reverse-direction Call MUST contain
   the same value as forward-direction Calls on the same connection.

*  The server fills in the rdma_credit field with the credit values
   for the connection, as described in Section 4.2.1.

*  The server determines the Payload format for the RPC message and
   fills in the rdma_htype field as appropriate (see Sections 6.6 and
   4.6.4).  Section 4.6.4 also covers the disposition of the chunk
   lists.

4.6.2.  Sending a Reverse-Direction RPC Reply

An RPC-over-RDMA client endpoint constructs the transport header for
a reverse-direction RPC Reply as follows:

*  The client copies the XID value from the matching RPC Call and
   places it in the rdma_xid field of the transport header and the
   xid field of the RPC Reply message.  The RPC Reply header MUST
   start with the same XID value that is present in the transport
   header.

*  The rdma_vers field of each reverse-direction Reply MUST contain
   the same value as forward-direction Replies on the same
   connection.

*  The client fills in the rdma_credit field with the credit values
   for the connection, as described in Section 4.2.1.

*  The client determines the Payload format for the RPC message and
   fills in the rdma_htype field as appropriate (see Sections 6.6 and
   4.6.4).  Section 4.6.4 also covers the disposition of the chunk
   lists.

4.6.3.  When Reverse-Direction Operation is Not Supported

An RPC-over-RDMA transport endpoint does not have to support
reverse-direction operation.  There might be no mechanism in the
transport implementation to do so.  Or, the transport implementation
might support operation in the reverse direction, but the Upper-Layer
Protocol might not configure the transport to handle
reverse-direction traffic.

If an endpoint is unprepared to receive a reverse-direction message,
loss of the RDMA connection might result.  Thus, a denial of service
can occur if an RPC server continues to send reverse-direction
messages after a client that is not prepared to receive them
reconnects to that server.

Connection peers indicate their support for reverse-direction
operation as part of the exchange of Transport Properties just after
a connection is established (see Section 5.2.5).

When dealing with the possibility that the remote peer has no
transport level support for reverse-direction operation, the
Upper-Layer Protocol is responsible for informing peers when
reverse-direction operation is supported.  Otherwise, even a simple
reverse-direction RPC NULL procedure from a peer could result in a
lost connection.  Therefore, an Upper-Layer Protocol MUST NOT perform
reverse-direction RPC operations until the RPC client indicates
support for them.

4.6.4.  Using Chunks During Reverse-Direction Operation

Reverse-direction operations can use chunks for DDP-eligible data
items and Special payload formats the same way chunks are used in
forward-direction operation.  Connection peers indicate their support
for using chunks in the reverse direction as part of the exchange of
Transport Properties just after a connection is established (see
Section 5.2.5).

However, an implementation might support only Upper-Layer Protocols
that have no DDP-eligible data items.  Such Upper-Layer Protocols can
use only small messages, or they might have a native mechanism for
restricting the size of reverse-direction RPC messages, obviating the
need to handle chunks in the reverse direction.

When there is no Upper-Layer Protocol need for chunks in the reverse
direction, implementers MAY choose not to provide support for chunks
in the reverse direction, thus avoiding the complexity of
implementing support for RDMA Reads and Writes in the reverse
direction.  When an RPC-over-RDMA transport implementation does not
support chunks in the reverse direction, RPC endpoints use only the
Simple Payload format without data item chunks or the Continued
Payload format without data item chunks to send RPC messages in the
reverse direction.

If a reverse-direction Requester provides a non-empty chunk list to a
Responder that does not support chunks, the Responder MUST report its
lack of support using one of the error values defined in Section 7.3.

4.6.5.  Reverse-Direction Retransmission

In rare cases, an RPC server cannot complete an RPC transaction or
cannot send a Reply.  In these cases, the Requester may send the RPC
transaction again using the same RPC XID.

In the forward direction, an RPC client is the Requester.  The client
is always responsible for ensuring a transport connection is in place
before sending a dropped Call again.

With reverse-direction operation, an RPC server is the Requester.
Because an RPC server is not responsible for establishing transport
connections with clients, the Requester is unable to retransmit a
reverse-direction Call whenever there is no transport connection.  In
this case, the RPC server must wait for the RPC client to
re-establish a transport connection before it can retransmit
reverse-direction RPC Calls.

If the forward-direction Requester has no work to do, it can be some
time before the RPC client re-establishes a transport connection.  An
RPC server may need to abandon a pending reverse-direction RPC Call
to avoid waiting indefinitely for the client to re-establish a
transport connection.

Therefore, forward-direction Requesters SHOULD maintain a transport
connection as long as the RPC server might send reverse-direction
Calls.  For example, while an NFS version 4.1 client has open
delegated files or active pNFS layouts, it maintains one or more
transport connections to enable the NFS server to perform callback
operations.

4.7.  Call-Only Operation

There is no corresponding Reply to a Call-only procedure.
Thus there is no opportunity for the Responder to indicate it has
completed its use of Read or Call chunks that hold arguments or the
whole Call.  In addition, Write and Reply chunks are not necessary
because there are no results and no Reply message.  Therefore,
Requesters MUST NOT use chunks when sending Call-only RPC procedures.

5.  Transport Properties

RPC-over-RDMA version 2 enables connection endpoints to exchange
information about implementation properties.  Compatible endpoints
use this information to optimize data transfer.  Initially, only a
small set of transport properties is defined.  The protocol provides
header types to exchange transport properties (see Sections 6.3.3 and
6.3.4).

Both the set of transport properties and the operations used to
communicate them may be extended.  Within RPC-over-RDMA version 2,
such extensions are OPTIONAL.  A discussion of extending the set of
transport properties appears in Appendix B.3.

5.1.  Transport Properties Model

The current document specifies a basic set of receiver and sender
properties.  Such properties are specified using a code point that
identifies the particular transport property and a nominally opaque
array containing the XDR encoding of the property.

The following XDR types handle transport properties:

   typedef uint32 rpcrdma2_propid;

   struct rpcrdma2_propval {
           rpcrdma2_propid rdma_which;
           opaque          rdma_data<>;
   };

   typedef rpcrdma2_propval rpcrdma2_propset<>;

   typedef uint32 rpcrdma2_propsubset<>;

The rpcrdma2_propid type specifies a distinct transport property.
The property code points are defined as const values rather than
elements in an enum type to enable extension by concatenating XDR
definition files.

The rpcrdma2_propval type carries the value of a transport property.
The rdma_which field identifies the particular property, and the
rdma_data field contains the associated value of that property.  A
zero-length rdma_data field represents the default value of the
property specified by rdma_which.

Although the rdma_data field is opaque, receivers interpret its
contents using the XDR type associated with the property specified by
rdma_which.  When the contents of the rdma_data field do not conform
to that XDR type, the receiver MUST return the error
RDMA2_ERR_BAD_PROPVAL using the header type RDMA2_ERROR, as described
in Section 6.3.1.

For example, the receiver of a message containing a valid
rpcrdma2_propval returns this error if the length of rdma_data is
greater than the length of the transferred message.  Also, when the
receiver recognizes the rpcrdma2_propid contained in rdma_which, it
MUST report the error RDMA2_ERR_BAD_PROPVAL if either of the
following occurs:

*  The nominally opaque data within rdma_data is not valid when
   interpreted using the property-associated typedef.

*  The length of rdma_data is insufficient to contain the data
   represented by the property-associated typedef.

A receiver does not report an error if it does not recognize the
value contained in rdma_which.  In that case, the receiver does not
process that rpcrdma2_propval.  Processing continues with the next
rpcrdma2_propval, if any.

The rpcrdma2_propset type specifies a set of transport properties.
The protocol does not impose a particular ordering of the
rpcrdma2_propval items within it.

The rpcrdma2_propsubset type identifies a subset of the properties in
a rpcrdma2_propset.  Each bit in the mask denotes a particular
element in a previously specified rpcrdma2_propset.
If a particular 1289 rpcrdma2_propval is at position N in the array, then bit number N mod 1290 32 in word N div 32 specifies whether the defined subset includes 1291 that particular rpcrdma2_propval. Words beyond the last one 1292 specified are assumed to contain zero. 1294 5.2. Current Transport Properties 1296 Table 1 specifies a basic set of transport properties. The columns 1297 contain the following information: 1299 * The column labeled "Property" contains a name of the transport 1300 property described by the current row. 1302 * The column labeled "Code" specifies the code point that identifies 1303 this property. 1305 * The column labeled "XDR type" gives the XDR type of the data used 1306 to communicate the value of this property. This data type 1307 overlays the data portion of the nominally opaque rdma_data field. 1309 * The column labeled "Default" gives the default value for the 1310 property. 1312 * The column labeled "Section" indicates the section within the 1313 current document that explains the use of this property. 
1315 +===========================+======+==========+=========+=========+ 1316 | Property | Code | XDR type | Default | Section | 1317 +===========================+======+==========+=========+=========+ 1318 | Maximum Send Size | 1 | uint32 | 4096 | 5.2.1 | 1319 +---------------------------+------+----------+---------+---------+ 1320 | Receive Buffer Size | 2 | uint32 | 4096 | 5.2.2 | 1321 +---------------------------+------+----------+---------+---------+ 1322 | Maximum Segment Size | 3 | uint32 | 1048576 | 5.2.3 | 1323 +---------------------------+------+----------+---------+---------+ 1324 | Maximum Segment Count | 4 | uint32 | 16 | 5.2.4 | 1325 +---------------------------+------+----------+---------+---------+ 1326 | Reverse-Direction Support | 5 | uint32 | 0 | 5.2.5 | 1327 +---------------------------+------+----------+---------+---------+ 1328 | Host Auth Message | 6 | opaque<> | N/A | 5.2.6 | 1329 +---------------------------+------+----------+---------+---------+ 1331 Table 1 1333 5.2.1. Maximum Send Size 1335 The value of this property specifies the maximum size, in octets, of 1336 Send payloads. The endpoint receiving this value can size its 1337 Receive buffers based on the value of this property. 1339 1340 const uint32 RDMA2_PROPID_SBSIZ = 1; 1341 typedef uint32 rpcrdma2_prop_sbsiz; 1342 1344 5.2.2. Receive Buffer Size 1346 The value of this property specifies the minimum size, in octets, of 1347 pre-posted receive buffers. 1349 1350 const uint32 RDMA2_PROPID_RBSIZ = 2; 1351 typedef uint32 rpcrdma2_prop_rbsiz; 1352 1354 A sender can subsequently use this value to determine when a message 1355 to be sent fits in pre-posted receive buffers that the receiver has 1356 set up. In particular: 1358 * Requesters may use the value to determine when to use a Call chunk 1359 or Message Continuation when sending a Call. 
1361 * Requesters may use the value to determine when to provide a Reply 1362 chunk when sending a Call, based on the maximum possible size of 1363 the Reply. 1365 * Responders may use the value to determine when to use a Reply 1366 chunk provided by the Requester, given the actual size of a Reply. 1368 5.2.3. Maximum Segment Size 1370 The value of this property specifies the maximum size, in octets, of 1371 a segment this endpoint is prepared to send or receive. 1373 1374 const uint32 RDMA2_PROPID_RSSIZ = 3; 1375 typedef uint32 rpcrdma2_prop_rssiz; 1376 1378 5.2.4. Maximum Segment Count 1380 The value of this property specifies the maximum number of segments 1381 that can appear in a Requester's transport header. 1383 1384 const uint32 RDMA2_PROPID_RCSIZ = 4; 1385 typedef uint32 rpcrdma2_prop_rcsiz; 1386 1388 5.2.5. Reverse-Direction Support 1390 The value of this property specifies a client implementation's 1391 readiness to process messages that are part of reverse-direction RPC 1392 requests. 1394 1395 const uint32 RDMA2_RVRSDIR_NONE = 0; 1396 const uint32 RDMA2_RVRSDIR_SIMPLE = 1; 1397 const uint32 RDMA2_RVRSDIR_CONT = 2; 1398 const uint32 RDMA2_RVRSDIR_GENL = 3; 1400 const uint32 RDMA2_PROPID_BRS = 5; 1401 typedef uint32 rpcrdma2_prop_brs; 1402 1404 Multiple levels of support are distinguished: 1406 * The value RDMA2_RVRSDIR_NONE indicates that the sender does not 1407 support reverse-direction operation. 1409 * The value RDMA2_RVRSDIR_SIMPLE indicates that the sender supports 1410 using only Simple Format messages without data item chunks for 1411 reverse-direction messages. 1413 * The value RDMA2_RVRSDIR_CONT indicates that the sender supports 1414 using either Simple Format without data item chunks or Continued 1415 Format messages without data item chunks for reverse-direction 1416 messages. 1418 * The value RDMA2_RVRSDIR_GENL indicates that the sender supports 1419 reverse-direction messages in the same way as forward-direction 1420 messages.
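The four support levels listed above can be read as a simple admission rule for reverse-direction messages. The following is a minimal sketch, not part of the protocol definition: the helper name reverse_message_allowed and its parameters are hypothetical, the constant names follow the prose of this section, and treating an unknown level the same as "no support" is an assumption made here for safety.

```python
# Support levels, named as in the prose of Section 5.2.5.
RDMA2_RVRSDIR_NONE = 0
RDMA2_RVRSDIR_SIMPLE = 1
RDMA2_RVRSDIR_CONT = 2
RDMA2_RVRSDIR_GENL = 3


def reverse_message_allowed(level, continued, has_chunks):
    """Hypothetical helper: decide whether a reverse-direction message
    of the given shape may be sent to a peer that advertised `level`.

    continued  -- True if the message uses Continued Format
    has_chunks -- True if the message carries any chunks
    """
    if level == RDMA2_RVRSDIR_GENL:
        # Same treatment as forward-direction messages.
        return True
    if has_chunks:
        # SIMPLE and CONT both exclude data item chunks.
        return False
    if level == RDMA2_RVRSDIR_CONT:
        # Simple or Continued Format, as long as no chunks are present.
        return True
    if level == RDMA2_RVRSDIR_SIMPLE:
        # Only Simple Format without chunks.
        return not continued
    # NONE, or an unrecognized level (assumption: treat as no support).
    return False
```

A sender would consult such a check before choosing a payload format for a reverse-direction Call or Reply.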
1422 When a peer does not provide this property, the default is that the peer 1423 does not support reverse-direction operation. 1425 5.2.6. Host Authentication Message 1427 The value of this transport property enables the exchange of host 1428 authentication material. This property can accommodate 1429 authentication handshakes that require multiple challenge-response 1430 interactions and potentially large amounts of material. 1432 1433 const uint32 RDMA2_PROPID_HOSTAUTH = 6; 1434 typedef opaque rpcrdma2_prop_hostauth<>; 1435 1436 When this property is not present, the peer(s) remain 1437 unauthenticated. Local security policy on each peer determines 1438 whether the connection is permitted to continue. 1440 6. Transport Messages 1442 Each transport message consists of multiple sections. 1444 * A transport header prefix, as defined in Section 6.4. Among other 1445 things, this structure indicates the header type. 1447 * The transport header proper, as defined by one of the subsections 1448 below. See Section 6.1 for the mapping between header types and 1449 the corresponding header structure. 1451 * Potentially, all or part of an RPC message payload. 1453 This organization differs from that presented in the definition of 1454 RPC-over-RDMA version 1 [RFC8166], which defined the first and second 1455 of the items above as a single XDR data structure. The new 1456 organization is in keeping with RPC-over-RDMA version 2's 1457 extensibility model, which enables the definition of new header types 1458 without modifying the XDR definition of existing header types. 1460 6.1. Transport Header Types 1462 Table 2 lists the RPC-over-RDMA version 2 header types. The columns 1463 contain the following information: 1465 * The column labeled "Operation" names the particular operation. 1467 * The column labeled "Code" specifies the value of the header type 1468 for this operation.
1470 * The column labeled "XDR type" gives the XDR type of the data 1471 structure used to organize the information in this new header 1472 type. This data immediately follows the universal portion of the 1473 transport header present in every RPC-over-RDMA transport header. 1475 * The column labeled "Msg" indicates whether this operation is 1476 followed (or not) by an RPC message payload. 1478 * The column labeled "Section" refers to the section within the 1479 current document that explains the use of this header type. 1481 +==============+======+=============================+=====+=========+ 1482 | Operation | Code | XDR type | Msg | Section | 1483 +==============+======+=============================+=====+=========+ 1484 | Report | 4 | rpcrdma2_hdr_error | No | 6.3.1 | 1485 | Transport | | | | | 1486 | Error | | | | | 1487 +--------------+------+-----------------------------+-----+---------+ 1488 | Grant | 5 | void | No | 6.3.2 | 1489 | Credits | | | | | 1490 +--------------+------+-----------------------------+-----+---------+ 1491 | Specify | 6 | rpcrdma2_hdr_connprop | No | 6.3.3 | 1492 | Properties | | | | | 1493 | (Middle) | | | | | 1494 +--------------+------+-----------------------------+-----+---------+ 1495 | Specify | 7 | rpcrdma2_hdr_connprop | No | 6.3.4 | 1496 | Properties | | | | | 1497 | (Final) | | | | | 1498 +--------------+------+-----------------------------+-----+---------+ 1499 | Convey | 8 | rpcrdma2_hdr_call_external | No | 6.3.5 | 1500 | External | | | | | 1501 | RPC Call | | | | | 1502 | Message | | | | | 1503 +--------------+------+-----------------------------+-----+---------+ 1504 | Convey | 9 | rpcrdma2_hdr_call_middle | Yes | 6.3.6 | 1505 | Continued | | | | | 1506 | RPC Call | | | | | 1507 | Message | | | | | 1508 +--------------+------+-----------------------------+-----+---------+ 1509 | Convey | 10 | rpcrdma2_hdr_call_inline | Yes | 6.3.7 | 1510 | Inline RPC | | | | | 1511 | Call | | | | | 1512 | Message | | | | | 1513
+--------------+------+-----------------------------+-----+---------+ 1514 | Convey | 11 | rpcrdma2_hdr_reply_external | No | 6.3.8 | 1515 | External | | | | | 1516 | RPC Reply | | | | | 1517 | Message | | | | | 1518 +--------------+------+-----------------------------+-----+---------+ 1519 | Convey | 12 | rpcrdma2_hdr_reply_middle | Yes | 6.3.9 | 1520 | Continued | | | | | 1521 | RPC Reply | | | | | 1522 | Message | | | | | 1523 +--------------+------+-----------------------------+-----+---------+ 1524 | Convey | 13 | rpcrdma2_hdr_reply_inline | Yes | 6.3.10 | 1525 | Inline RPC | | | | | 1526 | Reply | | | | | 1527 | Message | | | | | 1528 +--------------+------+-----------------------------+-----+---------+ 1529 Table 2 1531 RPC-over-RDMA version 2 peers are REQUIRED to support all message 1532 header types in Table 2. RPC-over-RDMA version 2 implementations 1533 that receive an unrecognized header type MUST respond with an 1534 RDMA2_ERROR message with an rdma_err field containing 1535 RDMA2_ERR_INVAL_HTYPE and drop the incoming message without 1536 processing it further. 1538 6.2. Headers and Chunks 1540 Most RPC-over-RDMA version 2 data structures have antecedents in 1541 corresponding structures in RPC-over-RDMA version 1. As is typical 1542 for new versions of an existing protocol, the XDR data structures 1543 have new names, and there are a few small changes in content. In 1544 some cases, there have been structural re-organizations to enable 1545 protocol extensibility. 1547 6.2.1. Common Transport Header Prefix 1549 The rpcrdma_common structure defines the initial part of each RPC- 1550 over-RDMA transport header for RPC-over-RDMA version 2 and subsequent 1551 versions. 1553 1554 struct rpcrdma_common { 1555 uint32 rdma_xid; 1556 uint32 rdma_vers; 1557 uint32 rdma_credit; 1558 uint32 rdma_htype; 1559 }; 1560 1562 RPC-over-RDMA version 2's use of these first four words aligns with 1563 that of version 1 as required by Section 4.2 of [RFC8166]. 
However, 1564 there are crucial structural differences in the way that the 1565 XDR definition of RPC-over-RDMA version 2 describes these 1566 words: 1568 * The header type is represented as a uint32 rather than as an enum 1569 type. An enum would need to be modified to reflect additions to 1570 the set of header types made by later extensions. 1572 * The header type field is part of an XDR structure devoted to 1573 representing the transport header prefix, rather than being part 1574 of a discriminated union that includes the body of each transport 1575 header type. 1577 * There is now a prefix structure (see Section 6.4) of which the 1578 rpcrdma_common structure is the initial segment. This prefix is a 1579 newly defined XDR object within the protocol description, which 1580 constrains the universal portion of all header types to the four 1581 words in rpcrdma_common. 1583 These changes are part of a more considerable structural change in 1584 the XDR definition of RPC-over-RDMA version 2 that facilitates a 1585 cleaner treatment of protocol extension. The XDR appearing in 1586 Section 8 reflects these changes, which Appendix C.1 discusses in 1587 further detail. 1589 6.3. Header Types 1591 The header types defined and used in RPC-over-RDMA version 1 are not 1592 carried over into RPC-over-RDMA version 2, although there are easy 1593 equivalents to the version 1 procedures: 1595 * The RDMA2_ERROR header (defined in Section 6.3.1) has an XDR 1596 definition that differs from that in RPC-over-RDMA version 1, but 1597 its modifications are all compatible extensions. 1599 * Senders use RDMA2_CALL_INLINE or RDMA2_REPLY_INLINE (defined in 1600 Sections 6.3.7 and 6.3.10) in place of RDMA_MSG. There are minor 1601 differences in the on-the-wire format between the version 1 1602 procedure and the version 2 header types.
1604 * Senders use RDMA2_CALL_EXTERNAL or RDMA2_REPLY_EXTERNAL (defined 1605 in Sections 6.3.5 and 6.3.8) in place of RDMA_NOMSG. There are 1606 minor differences in the on-the-wire format between the version 1 1607 procedure and the version 2 header types. 1609 * RDMA2_CONNPROP_MIDDLE and RDMA2_CONNPROP_FINAL (defined in 1610 Sections 6.3.3 and 6.3.4) are new header types devoted to enabling 1611 connection peers to exchange information about their transport 1612 properties. 1614 6.3.1. RDMA2_ERROR: Report Transport Error 1616 RDMA2_ERROR reports a transport layer error on a previous 1617 transmission. 1619 1620 const rpcrdma2_proc RDMA2_ERROR = 4; 1622 struct rpcrdma2_err_vers { 1623 uint32 rdma_vers_low; 1624 uint32 rdma_vers_high; 1625 }; 1627 struct rpcrdma2_err_write { 1628 uint32 rdma_chunk_index; 1629 uint32 rdma_length_needed; 1630 }; 1632 union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) { 1633 case RDMA2_ERR_VERS: 1634 rpcrdma2_err_vers rdma_vrange; 1635 case RDMA2_ERR_READ_CHUNKS: 1636 uint32 rdma_max_chunks; 1637 case RDMA2_ERR_WRITE_CHUNKS: 1638 uint32 rdma_max_chunks; 1639 case RDMA2_ERR_SEGMENTS: 1640 uint32 rdma_max_segments; 1641 case RDMA2_ERR_WRITE_RESOURCE: 1642 rpcrdma2_err_write rdma_writeres; 1643 case RDMA2_ERR_REPLY_RESOURCE: 1644 uint32 rdma_length_needed; 1645 default: 1646 void; 1647 }; 1648 1650 See Section 7 for details on the use of this header type. 1652 6.3.2. RDMA2_GRANT: Grant Credits 1654 The RDMA2_GRANT header type enables a connection peer to update 1655 credit information without conveying a payload. 1657 1658 const rpcrdma2_proc RDMA2_GRANT = 5; 1659 1661 This message carries no payload except for a struct 1662 rpcrdma2_hdr_prefix. The rdma_xid field is unused. Senders MUST set 1663 the rdma_xid field to zero and receivers MUST ignore the value in 1664 this field. 1666 6.3.3. 
RDMA2_CONNPROP_MIDDLE: Exchange Transport Properties 1668 The RDMA2_CONNPROP_MIDDLE header type enables a connection peer to 1669 publish the properties of its implementation to its remote peer. 1671 1672 const rpcrdma2_proc RDMA2_CONNPROP_MIDDLE = 6; 1674 struct rpcrdma2_hdr_connprop { 1675 rpcrdma2_propset rdma_props; 1676 }; 1677 1679 A peer sends an RDMA2_CONNPROP_MIDDLE header type when it has one or 1680 more properties to send that do not fit within the default inline 1681 threshold for the RPC-over-RDMA version that is in effect. 1683 A peer may encounter properties that it does not recognize or 1684 support. In such cases, the receiver ignores unsupported properties 1685 without generating an error response. 1687 If a peer follows an RDMA2_CONNPROP_MIDDLE header type with 1688 anything other than another RDMA2_CONNPROP_MIDDLE message or an 1689 RDMA2_CONNPROP_FINAL message, the receiver MUST respond with an 1690 RDMA2_ERROR header type and set its rdma_err field to 1691 RDMA2_ERR_INVAL_CONT and drop the incoming message without processing 1692 it further. 1694 6.3.4. RDMA2_CONNPROP_FINAL: Exchange Transport Properties 1696 The RDMA2_CONNPROP_FINAL header type enables a connection peer to 1697 publish the properties of its implementation to its remote peer. 1699 1700 const rpcrdma2_proc RDMA2_CONNPROP_FINAL = 7; 1702 struct rpcrdma2_hdr_connprop { 1703 rpcrdma2_propset rdma_props; 1704 }; 1705 1707 Each peer sends an RDMA2_CONNPROP_FINAL header type as the final 1708 CONNPROP-type message after the client has established a connection. 1709 The size of this message is limited to the default inline threshold 1710 for the RPC-over-RDMA version that is in effect. 1712 A peer may encounter properties that it does not recognize or 1713 support. In such cases, the receiver ignores unsupported properties 1714 without generating an error response.
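The receiver-side rules for a property set (Section 5.1) together with the defaults in Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names apply_propset and BadPropval are hypothetical, the input is assumed to be a list of (rdma_which, rdma_data) pairs already decoded from an rpcrdma2_propset, only the uint32 properties (codes 1 through 5) are handled, and rejecting rdma_data that is longer than four octets is an assumption about what "not valid" means for a uint32.

```python
import struct

# Code point -> default value, taken from Table 1 (uint32 properties only).
UINT32_DEFAULTS = {1: 4096, 2: 4096, 3: 1048576, 4: 16, 5: 0}


class BadPropval(Exception):
    """Stands in for replying with RDMA2_ERR_BAD_PROPVAL."""


def apply_propset(propset):
    """Hypothetical helper: fold received properties into a settings map.

    propset -- iterable of (rdma_which, rdma_data) pairs, where
               rdma_data is a bytes object.
    """
    settings = dict(UINT32_DEFAULTS)
    for which, data in propset:
        if which not in UINT32_DEFAULTS:
            # Not handled by this sketch (includes unrecognized code
            # points): skip it and continue with the next item.
            continue
        if len(data) == 0:
            # A zero-length rdma_data selects the default value.
            settings[which] = UINT32_DEFAULTS[which]
            continue
        if len(data) != 4:
            # Too short for a uint32, or trailing octets (assumption).
            raise BadPropval(which)
        # XDR encodes uint32 as four octets in big-endian order.
        (settings[which],) = struct.unpack(">I", data)
    return settings
```

For example, apply_propset([(2, struct.pack(">I", 8192))]) yields a map in which the Receive Buffer Size is 8192 and every other entry keeps its Table 1 default.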
1716 If a peer sends a CONNPROP-type message on a connection after it has 1717 sent an RDMA2_CONNPROP_FINAL message, the receiver MUST respond with 1718 an RDMA2_ERROR header type and set its rdma_err field to 1719 RDMA2_ERR_INVAL_CONT and drop the incoming message without processing 1720 it further. 1722 6.3.5. RDMA2_CALL_EXTERNAL: Convey External RPC Call Message 1724 RDMA2_CALL_EXTERNAL conveys an RPC Call message payload using 1725 explicit RDMA operations. The Responder reads the Payload stream 1726 from a memory area specified by the Call chunk. The sender MUST set 1727 the rdma_xid field to the same value as the xid of the RPC Call 1728 message payload. 1730 1731 const rpcrdma2_proc RDMA2_CALL_EXTERNAL = 8; 1733 struct rpcrdma2_hdr_call_external { 1734 uint32 rdma_inv_handle; 1736 struct rpcrdma2_read_list *rdma_call; 1737 struct rpcrdma2_read_list *rdma_reads; 1738 struct rpcrdma2_write_list *rdma_provisional_writes; 1739 struct rpcrdma2_write_chunk *rdma_provisional_reply; 1740 }; 1741 1743 rdma_inv_handle: The rdma_inv_handle field contains a 32-bit RDMA 1744 handle that the Responder may use in a Send With Invalidate 1745 operation. See Section 6.5. 1747 rdma_call: The rdma_call field anchors a list of one or more Read 1748 segments that contain the RPC Call's Payload stream. 1750 rdma_reads: The rdma_reads field anchors a list of zero or more Read 1751 segments that contain data item chunks. 1753 rdma_provisional_writes: The rdma_provisional_writes field anchors a list of 1754 zero or more provisional Write chunks. 1756 rdma_provisional_reply: The rdma_provisional_reply field is a list containing 1757 zero or one provisional Reply chunk. 1759 6.3.6. RDMA2_CALL_MIDDLE: Convey Continued RPC Call Message 1761 RDMA2_CALL_MIDDLE conveys a beginning or middle portion of an RPC 1762 Call message immediately following the transport header in the send 1763 buffer. The sender MUST set the rdma_xid field to the same value as 1764 the xid of the RPC Call message payload.
The sender sets the 1765 rdma_remaining field to the number of bytes in the RPC Call message 1766 payload that remain to be sent. The rdma_rpc_first_word field 1767 marks the first word of the Payload stream. 1769 1770 const rpcrdma2_proc RDMA2_CALL_MIDDLE = 9; 1772 struct rpcrdma2_hdr_call_middle { 1773 uint32 rdma_remaining; 1775 /* The rpc message starts here and continues 1776 * through the end of the transmission. */ 1777 uint32 rdma_rpc_first_word; 1778 }; 1779 1781 If a peer follows an RDMA2_CALL_MIDDLE header type with 1782 anything other than an RDMA2_CALL_MIDDLE message or an 1783 RDMA2_CALL_INLINE message, the receiver MUST respond with an 1784 RDMA2_ERROR header type and set its rdma_err field to 1785 RDMA2_ERR_INVAL_CONT and drop the incoming message without processing 1786 it further. 1788 6.3.7. RDMA2_CALL_INLINE: Convey Inline RPC Call Message 1790 RDMA2_CALL_INLINE conveys the only or final portion of an RPC Call 1791 message. The rdma_rpc_first_word field marks the first word of 1792 this Payload stream. 1794 1795 const rpcrdma2_proc RDMA2_CALL_INLINE = 10; 1797 struct rpcrdma2_hdr_call_inline { 1798 uint32 rdma_inv_handle; 1800 struct rpcrdma2_read_list *rdma_reads; 1801 struct rpcrdma2_write_list *rdma_provisional_writes; 1802 struct rpcrdma2_write_chunk *rdma_provisional_reply; 1804 /* The rpc message starts here and continues 1805 * through the end of the transmission. */ 1806 uint32 rdma_rpc_first_word; 1807 }; 1808 1810 rdma_inv_handle: The rdma_inv_handle field contains a 32-bit RDMA 1811 handle that the Responder may use in a Send With Invalidate 1812 operation. See Section 6.5. 1814 rdma_reads: The rdma_reads field anchors a list of zero or more Read 1815 segments that contain only data item chunks. A Requester MUST NOT 1816 insert Position-zero Read chunks in this list. 1818 rdma_provisional_writes: The rdma_provisional_writes field anchors a list of 1819 zero or more provisional Write chunks.
1821 rdma_provisional_reply: The rdma_provisional_reply field is a list containing 1822 zero or one provisional Reply chunk. 1824 6.3.8. RDMA2_REPLY_EXTERNAL: Convey External RPC Reply Message 1826 RDMA2_REPLY_EXTERNAL conveys an RPC Reply message payload using 1827 explicit RDMA operations. In particular, it is referred to as a 1828 Special Format Reply when the Responder writes the RPC payload into a 1829 memory area specified by a Reply chunk. The sender MUST set the 1830 rdma_xid field to the same value as the xid of the RPC Reply message 1831 payload. 1833 1834 const rpcrdma2_proc RDMA2_REPLY_EXTERNAL = 11; 1836 struct rpcrdma2_hdr_reply_external { 1837 struct rpcrdma2_write_list *rdma_writes; 1838 struct rpcrdma2_write_chunk *rdma_reply; 1839 }; 1840 1841 rdma_writes: The rdma_writes field anchors a list of zero or more 1842 Write chunks that are either empty or contain reduced data items. 1844 rdma_reply: The rdma_reply field is a list that MUST contain exactly 1845 one Reply chunk. 1847 6.3.9. RDMA2_REPLY_MIDDLE: Convey Continued RPC Reply Message 1849 RDMA2_REPLY_MIDDLE conveys a beginning or middle portion of an RPC 1850 Reply message immediately following the transport header in the send 1851 buffer. The sender MUST set the rdma_xid field to the same value as 1852 the xid of the RPC Reply message payload. The sender sets the 1853 rdma_remaining field to the number of bytes in the RPC Reply message 1854 payload that remain to be sent. The rdma_rpc_first_word field 1855 marks the first word of the Payload stream. 1857 1858 const rpcrdma2_proc RDMA2_REPLY_MIDDLE = 12; 1860 struct rpcrdma2_hdr_reply_middle { 1861 uint32 rdma_remaining; 1863 /* The rpc message starts here and continues 1864 * through the end of the transmission.
*/ 1865 uint32 rdma_rpc_first_word; 1866 }; 1867 1869 If a peer follows an RDMA2_REPLY_MIDDLE header type with 1870 anything other than an RDMA2_REPLY_MIDDLE message or an 1871 RDMA2_REPLY_INLINE message, the receiver MUST respond with an 1872 RDMA2_ERROR header type and set its rdma_err field to 1873 RDMA2_ERR_INVAL_CONT and drop the incoming message without processing 1874 it further. 1876 6.3.10. RDMA2_REPLY_INLINE: Convey RPC Reply Message Inline 1878 RDMA2_REPLY_INLINE conveys the only or final portion of an RPC Reply 1879 message immediately following the transport header in the send 1880 buffer. If the Reply message payload has been reduced, the 1881 rdma_writes field carries the reduced data item chunks. 1883 1884 const rpcrdma2_proc RDMA2_REPLY_INLINE = 13; 1886 struct rpcrdma2_hdr_reply_inline { 1887 struct rpcrdma2_write_list *rdma_writes; 1889 /* The rpc message starts here and continues 1890 * through the end of the transmission. */ 1891 uint32 rdma_rpc_first_word; 1892 }; 1893 1895 rdma_writes: The rdma_writes field anchors a list of zero or more 1896 Write chunks that are either empty or contain reduced data items. 1898 6.4. Transport Header Prefix 1900 The following prefix structure appears at the start of each RPC-over- 1901 RDMA version 2 transport header. 1903 1904 struct rpcrdma2_hdr_prefix { 1905 struct rpcrdma_common rdma_start; 1906 }; 1907 1909 6.5. Remote Invalidation 1911 To solicit the use of Remote Invalidation, a Requester sets the value 1912 of the rdma_inv_handle field in an RPC Call's transport header to a 1913 non-zero value that matches one of the rdma_handle fields in that 1914 header. If the Responder is not permitted to invalidate any of the rdma_handle 1915 values in the header conveying the Call, the Requester sets the RPC 1916 Call's rdma_inv_handle field to the value zero.
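The Requester-side choice described above can be sketched as a small selection function. This is an illustration only: the name choose_inv_handle and both of its parameters are hypothetical, and picking the first eligible handle is an arbitrary policy chosen here, since the text only requires that a non-zero value match one of the rdma_handle fields in the header.

```python
def choose_inv_handle(segment_handles, invalidatable):
    """Hypothetical helper: pick the value for rdma_inv_handle.

    segment_handles -- rdma_handle values of the segments in the
                       transport header that conveys the Call
    invalidatable   -- set of handles the Requester permits the
                       Responder to invalidate remotely
    """
    for handle in segment_handles:
        # The chosen value must be non-zero and must match a handle
        # that actually appears in this header.
        if handle != 0 and handle in invalidatable:
            return handle
    # No handle in this header may be invalidated by the Responder.
    return 0
```

A Requester that never wants Remote Invalidation can pass an empty set, which always produces zero.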
1918 If the Responder chooses not to use remote invalidation for this 1919 particular RPC Reply, or the RPC Call's rdma_inv_handle field 1920 contains the value zero, the Responder simply uses RDMA Send to 1921 transmit the matching RPC reply. However, if the Responder chooses 1922 to use Remote Invalidation, it uses RDMA Send With Invalidate to 1923 transmit the RPC Reply. It MUST use the value in the corresponding 1924 Call's rdma_inv_handle field to construct the Send With Invalidate 1925 Work Request. 1927 A Responder never uses a Send With Invalidate Work Request when 1928 sending a control plane header type. This includes the RDMA2_ERROR 1929 header type, the RDMA2_GRANT header type, the RDMA2_CONNPROP_MIDDLE 1930 header type, and the RDMA2_CONNPROP_FINAL header type. 1932 6.6. Payload Formats 1934 RPC-over-RDMA version 2 provides several ways, known as "payload 1935 formats", to convey an RPC-over-RDMA message. A sender chooses the 1936 payload format for each message based on several factors: 1938 * The existence of DDP-eligible data items in the RPC message 1939 payload 1941 * The size of the RPC message payload 1943 * The direction of the RPC message (i.e., Call or Reply) 1945 * The available hardware resources 1947 * The arrangement of source and sink memory buffers 1949 The following subsections describe in detail how Requesters and 1950 Responders format RPC-over-RDMA message payloads. 1952 6.6.1. Simple Format 1954 All RPC messages conveyed via RPC-over-RDMA version 2 need at least 1955 one RDMA Send operation to convey. Thus, the most efficient way to 1956 send an RPC message that is smaller than the inline threshold is to 1957 append the Payload stream directly to the Transport stream and use an 1958 RDMA Send to convey both. When no chunks are present, senders 1959 construct Calls and Replies the same way, and no other operations are 1960 needed. 1962 6.6.1.1. 
Simple Format with Data Item Chunks 1964 If DDP-eligible data items are present in a Payload stream, a sender 1965 MAY reduce some or all of these items, removing them from the Payload 1966 stream. The sender then uses a separate mechanism to transfer the 1967 reduced data items. The Transport stream immediately followed by the 1968 reduced Payload stream is then transferred using one RDMA Send 1969 operation. 1971 When data item chunks are present, senders construct Calls 1972 differently than Replies. 1974 Simple Call 1975 After receiving the Transport and Payload streams of an RPC Call 1976 message with Read chunks, the Responder uses RDMA Read operations 1977 to move the reduced data items contained in the Read chunks. RPC- 1978 over-RDMA Calls can carry Write chunks for the Responder to use 1979 when sending the matching Reply. 1981 Simple Reply 1982 The Responder uses RDMA Write operations to move reduced data 1983 items contained in Write chunks. Afterward, it sends the 1984 Transport and Payload streams of the RPC Reply message using one 1985 RDMA Send. RPC-over-RDMA Replies always carry an empty Read chunk 1986 list. 1988 6.6.1.2. 
Simple Format Examples 1990 Requester Responder 1991 | RDMA Send (RDMA2_CALL_INLINE) | 1992 Call | ----------------------------------> | 1993 | | 1994 | | Processing 1995 | | 1996 | RDMA Send (RDMA2_REPLY_INLINE) | 1997 | <---------------------------------- | Reply 1999 Figure 1: A Simple Call without data item chunks and a Simple 2000 Reply without data item chunks 2002 Requester Responder 2003 | RDMA Send (RDMA2_CALL_INLINE) | 2004 Call | ----------------------------------> | 2005 | RDMA Read | 2006 | <---------------------------------- | 2007 | RDMA Response (arg data) | 2008 | ----------------------------------> | 2009 | | 2010 | | Processing 2011 | | 2012 | RDMA Send (RDMA2_REPLY_INLINE) | 2013 | <---------------------------------- | Reply 2015 Figure 2: A Simple Call with a Read chunk and a Simple Reply 2016 without data item chunks 2018 Requester Responder 2019 | RDMA Send (RDMA2_CALL_INLINE) | 2020 Call | ----------------------------------> | 2021 | | 2022 | | Processing 2023 | | 2024 | RDMA Write (result data) | 2025 | <---------------------------------- | 2026 | RDMA Send (RDMA2_REPLY_INLINE) | 2027 | <---------------------------------- | Reply 2029 Figure 3: A Simple Call without data item chunks and a Simple 2030 Reply with a Write chunk 2032 6.6.2. Continued Format 2034 For various reasons, a sender can choose to split a message payload 2035 over multiple RPC-over-RDMA messages. The Payload stream of each 2036 RPC-over-RDMA message contains a part of the RPC message. The 2037 receiver reconstructs the original RPC message by concatenating the 2038 Payload stream of each RPC-over-RDMA message in received order. A 2039 sender MAY split the Payload stream on any convenient boundary. 2041 6.6.2.1. Continued Format with Data Item Chunks 2043 If DDP-eligible data items are present in the Payload stream, a 2044 sender MAY reduce some or all of these items, removing them from the 2045 Payload stream. 
The sender then uses a separate mechanism to 2046 transfer the reduced data items. The Transport stream immediately 2047 followed by the reduced Payload stream is then transferred using one 2048 RDMA Send operation. 2050 As with Simple Format messages, when chunks are present, senders 2051 construct Calls differently than Replies. 2053 Continued Call 2054 After receiving the Transport and Payload streams of an RPC Call 2055 message with Read chunks, the Responder uses RDMA Read operations 2056 to move the reduced data items contained in Read chunks. RPC- 2057 over-RDMA Calls can carry Write chunks for the Responder to use 2058 when sending the matching Reply. 2060 Continued Reply 2061 The Responder uses RDMA Write operations to move reduced data 2062 items contained in Write chunks. Afterward, it sends the 2063 Transport and Payload streams of the RPC Reply message using 2064 multiple RDMA Sends. RPC-over-RDMA Replies always carry an empty 2065 Read chunk list. 2067 6.6.2.2. Continued Format Examples 2068 Requester Responder 2069 | RDMA Send (RDMA2_CALL_MIDDLE) | 2070 Call | ----------------------------------> | 2071 | RDMA Send (RDMA2_CALL_MIDDLE) | 2072 | ----------------------------------> | 2073 | RDMA Send (RDMA2_CALL_INLINE) | 2074 | ----------------------------------> | 2075 | | 2076 | | Processing 2077 | | 2078 | RDMA Send (RDMA2_REPLY_MIDDLE) | 2079 | <---------------------------------- | Reply 2080 | RDMA Send (RDMA2_REPLY_MIDDLE) | 2081 | <---------------------------------- | 2082 | RDMA Send (RDMA2_REPLY_INLINE) | 2083 | <---------------------------------- | 2085 Figure 4: A Continued Call without data item chunks and a 2086 Continued Reply without data item chunks 2088 Requester Responder 2089 | RDMA Send (RDMA2_CALL_MIDDLE) | 2090 Call | ----------------------------------> | 2091 | RDMA Send (RDMA2_CALL_MIDDLE) | 2092 | ----------------------------------> | 2093 | RDMA Send (RDMA2_CALL_INLINE) | 2094 | ----------------------------------> | 2095 | RDMA
Read | 2096 | <---------------------------------- | 2097 | RDMA Response (arg data) | 2098 | ----------------------------------> | 2099 | | 2100 | | Processing 2101 | | 2102 | RDMA Send (RDMA2_REPLY_INLINE) | 2103 | <---------------------------------- | Reply 2105 Figure 5: A Continued Call with a Read chunk and a Simple Reply 2106 without data item chunks 2108 Requester Responder 2109 | RDMA Send (RDMA2_CALL_INLINE) | 2110 Call | ----------------------------------> | 2111 | | 2112 | | Processing 2113 | | 2114 | RDMA Write (result data) | 2115 | <---------------------------------- | 2116 | RDMA Send (RDMA2_REPLY_MIDDLE) | 2117 | <---------------------------------- | Reply 2118 | RDMA Send (RDMA2_REPLY_MIDDLE) | 2119 | <---------------------------------- | 2120 | RDMA Send (RDMA2_REPLY_INLINE) | 2121 | <---------------------------------- | 2123 Figure 6: A Simple Call without data item chunks and a Continued 2124 Reply with a Write chunk 2126 6.6.3. Special Format 2128 Even after DDP-eligible data items have been removed, a Payload 2129 stream can sometimes be too large to send using only RDMA Send 2130 operations. In those cases, the sender can use RDMA Read or Write 2131 operations to convey the entire RPC message. We refer to this as a 2132 "Special Format" message. 2134 To transmit a Special Format message, the sender transmits only the 2135 Transport stream with an RDMA Send operation. The sender does not 2136 include the Payload stream in the send buffer. Instead, the 2137 Requester provides a body chunk that the Responder uses to move the 2138 Payload stream. 2140 Because chunks are always present in Special Format messages, the 2141 sender always handles Calls and Replies differently. 2143 Special Call 2144 The Requester provides a Read chunk that contains the RPC Call 2145 message's Payload stream. Every Read segment in this chunk MUST 2146 contain zero (0) in its Position field. This type of Read chunk 2147 is a body chunk known as a Call chunk. 

   Special Reply
      The Requester provisions a Reply chunk in advance. This body
      chunk is a Write chunk into which the Responder places the RPC
      Reply message's Payload stream. The Requester provisions the
      Reply chunk to accommodate the maximum expected reply size for
      that upper-layer operation.

   One purpose of a Special Format message is to handle large RPC
   messages. However, Requesters MAY use a Special Format message at
   any time to convey an RPC Call message.

   When it has alternatives, a Responder chooses which Format to use
   based on the chunks provided by the Requester. If a Requester
   provided a Write chunk and the Responder has a DDP-eligible result,
   it first reduces the reply Payload stream. If a Requester provided
   a Reply chunk and the reduced Payload stream is larger than the
   reply inline threshold, the Responder MUST use the
   Requester-provided Reply chunk for the reply.

6.6.3.1. Special Format Examples

      Requester                             Responder
          | RDMA Send (RDMA2_CALL_EXTERNAL)     |
     Call | ----------------------------------> |
          | RDMA Read                           |
          | <---------------------------------- |
          | RDMA Response (RPC call)            |
          | ----------------------------------> |
          |                                     |
          |                                     | Processing
          |                                     |
          | RDMA Send (RDMA2_REPLY_INLINE)      |
          | <---------------------------------- | Reply

      Figure 7: A Special Call and a Simple Reply without data item
                chunks

      Requester                             Responder
          | RDMA Send (RDMA2_CALL_INLINE)       |
     Call | ----------------------------------> |
          |                                     |
          |                                     | Processing
          |                                     |
          | RDMA Write (RPC reply)              |
          | <---------------------------------- |
          | RDMA Send (RDMA2_REPLY_EXTERNAL)    |
          | <---------------------------------- | Reply

      Figure 8: A Simple Call without data item chunks and a Special
                Reply

6.6.4. Choosing a Reply Payload Format

   A Requester provisions all necessary registered memory resources
   for both an RPC Call and its matching RPC Reply. A Requester
   constructs each RPC Call, thus it can compute the exact memory
   resources needed to send every Call. However, the Requester
   allocates memory resources to receive the corresponding Reply
   before the Responder has constructed it. Occasionally, it is
   challenging for the Requester to know in advance precisely what
   resources are needed to receive the Reply.

   In RPC-over-RDMA version 2, a Requester can provide a Reply chunk
   for any transaction. The Responder can use the provided Reply chunk
   or it can decide to use another means to convey the RPC Reply. If
   the combination of the provided Write chunk list and Reply chunk is
   not adequate to convey a Reply, the Responder SHOULD use Message
   Continuation to send that Reply. If even that is not possible, the
   Responder sends an RDMA2_ERROR message to the Requester, as
   described in Section 6.3.1:

   *  If the Write chunk list cannot accommodate the ULP's
      DDP-eligible data payload, the Responder sends an
      RDMA2_ERR_WRITE_RESOURCE error.

   *  If the Reply chunk cannot accommodate the parts of the Reply
      that are not DDP-eligible, the Responder sends an
      RDMA2_ERR_REPLY_RESOURCE error.

   When receiving such errors, the Requester can retry the ULP call
   using more substantial reply resources. In cases where retrying the
   ULP request is not possible (e.g., the request is non-idempotent),
   the Requester terminates the RPC transaction and presents an error
   to the RPC consumer.

7. Error Handling

   A receiver performs validity checks on each ingress RPC-over-RDMA
   message before it assembles that message's Payload stream and
   passes it to the RPC layer.
   For example, if an ingress RPC-over-RDMA message is not as long as
   the size of struct rpcrdma2_hdr_prefix (16 octets, per the XDR
   definition in Section 8.3), the receiver cannot trust the value of
   the rdma_xid field. In this case, the receiver MUST silently
   discard the ingress message without processing it further, and
   without a response to the sender.

   When a request (for instance, an RPC Call or a control plane
   operation) is made, typically an RPC consumer blocks while waiting
   for the response. Thus when an incoming message conveys a request
   and that request cannot be acted upon, the receiver of that request
   needs to report the problem to its sender in order to unblock
   waiters. Likewise, if, after processing a request, a sender is
   unable to transmit the response on an otherwise healthy connection,
   the sender needs to report that problem for the same reason.

   The RDMA2_ERROR header type is used for this purpose. To form an
   RDMA2_ERROR type header:

   *  The rdma_xid field MUST contain the same XID that was in the
      rdma_xid field in the ingress request.

   *  The rdma_vers field MUST contain the same version that was in
      the rdma_vers field in the ingress request.

   *  The sender sets the rdma_credit field to the credit values in
      effect for this connection.

   *  The rdma_htype field MUST contain the value RDMA2_ERROR.

   *  The rdma_err field contains a value that reflects the type of
      error that occurred, as described in the subsections below.

   When a peer receives an RDMA2_ERROR message type with an
   unrecognized or unsupported value in its rdma_err field, it MUST
   silently discard the message without processing it further.

7.1. Basic Transport Stream Parsing Errors

7.1.1. RDMA2_ERR_VERS

   When a Responder detects an RPC-over-RDMA header version that it
   does not support (the current document defines version 2), it MUST
   respond with an RDMA2_ERROR message type and set its rdma_err field
   to RDMA2_ERR_VERS. The Responder then fills in the
   rpcrdma2_err_vers structure with the RPC-over-RDMA versions it
   supports. The Responder MUST silently discard the ingress message
   without passing it to the RPC layer.

   When a Requester receives this error message, it uses the
   information in the rpcrdma2_err_vers structure to select an
   RPC-over-RDMA version that both peers support for subsequent
   operations on the connection. A Requester MUST NOT subsequently
   send a message that uses a version that the Responder has indicated
   it does not support. RDMA2_ERR_VERS indicates a permanent error.
   Receipt of this error completes the RPC transaction associated with
   the XID in the rdma_xid field.

7.1.2. RDMA2_ERR_VERS_MISMATCH

   When a Responder receives a message with a transport protocol
   version that does not match the protocol version that was used in
   previous successful exchanges on the same connection, it MUST
   respond with an RDMA2_ERROR message type and set its rdma_err field
   to RDMA2_ERR_VERS_MISMATCH. The Responder MUST silently discard the
   ingress message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message that uses a
   protocol version that the Responder has indicated it does not
   recognize on this connection. The Requester can recover by sending
   the message again using a corrected protocol version, or it can
   terminate the RPC transaction associated with the XID in the
   rdma_xid field with an error.

7.1.3. RDMA2_ERR_INVAL_HTYPE

   If a Responder recognizes the value in an ingress rdma_vers field,
   but it does not recognize the value in the rdma_htype field or does
   not support that header type, it MUST set the rdma_err field to
   RDMA2_ERR_INVAL_HTYPE. The Responder MUST silently discard the
   incoming message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message on the connection
   that uses an htype that the Responder has indicated it does not
   support. RDMA2_ERR_INVAL_HTYPE indicates a permanent error. Receipt
   of this error completes the RPC transaction associated with the XID
   in the rdma_xid field.

7.1.4. RDMA2_ERR_INVAL_CONT

   If a Responder detects a problem with an ingress RPC-over-RDMA
   message that is part of a Message Continuation sequence, the
   Responder MUST set the rdma_err field to RDMA2_ERR_INVAL_CONT. The
   Responder MUST silently discard all ingress messages with an
   rdma_xid field that matches the failing message without
   reassembling the payload.

   RDMA2_ERR_INVAL_CONT indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.2. XDR Errors

   A receiver might encounter an XDR parsing error that prevents it
   from processing an ingress Transport stream. Examples of such
   errors include:

   *  The value of the rdma_xid field does not match the value of the
      XID field in the accompanying RPC message.

   *  The receive buffer ends before the end of a data object
      contained in the Transport stream.

   Moreover, when a Responder receives a valid RPC-over-RDMA header
   but the Responder's ULP implementation cannot parse the RPC
   arguments in the RPC Call, the Responder returns an RPC Reply with
   status GARBAGE_ARGS, using an RDMA2_REPLY_INLINE message type.
   This type of parsing failure might be due to mismatches between
   chunk sizes or offsets and the contents of the Payload stream, for
   example. In this case, the error is permanent, but the Requester
   has no way to know how much processing the Responder has completed
   for this RPC transaction.

7.2.1. RDMA2_ERR_BAD_XDR

   If a Responder recognizes the value in the rdma_vers field, but it
   cannot otherwise parse the ingress Transport stream, it MUST set
   the rdma_err field to RDMA2_ERR_BAD_XDR. The Responder MUST
   silently discard the ingress message without passing it to the RPC
   layer.

   RDMA2_ERR_BAD_XDR indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.2.2. RDMA2_ERR_BAD_PROPVAL

   If a receiver recognizes the value in an ingress rdma_which field,
   but it cannot parse the accompanying propval, it MUST set the
   rdma_err field to RDMA2_ERR_BAD_PROPVAL (see Section 5.1). The
   receiver MUST silently discard the ingress message without applying
   any of its property settings.

7.3. Responder RDMA Operational Errors

   In RPC-over-RDMA version 2, the Responder initiates RDMA Read and
   Write operations that target the Requester's memory. Problems might
   arise as the Responder attempts to use Requester-provided resources
   for RDMA operations. For example:

   *  Usually, chunks can be validated only by using their contents to
      perform data transfers. If chunk contents are invalid (e.g., a
      memory region is no longer registered or a chunk length exceeds
      the end of the registered memory region), a Remote Access Error
      occurs.

   *  If a Requester's Receive buffer is too small, the Responder's
      Send operation completes with a Local Length Error.

   *  If the Requester-provided Reply chunk is too small to
      accommodate a large RPC Reply message, a Remote Access Error
      occurs. A Responder might detect this problem before attempting
      to write past the end of the Reply chunk.

   RDMA operational errors can be fatal to the connection. To avoid a
   retransmission loop and repeated connection loss that deadlocks the
   connection, once the Requester has re-established a connection, the
   Responder SHOULD send an RDMA2_ERROR response to indicate that no
   RPC-level reply is possible for that transaction.

7.3.1. RDMA2_ERR_READ_CHUNKS

   If a Requester presents more DDP-eligible arguments than a
   Responder is prepared to Read, the Responder MUST set the rdma_err
   field to RDMA2_ERR_READ_CHUNKS and set the rdma_max_chunks field to
   the maximum number of Read chunks the Responder can process. If the
   Responder implementation cannot handle any Read chunks for a
   request, it MUST set the rdma_max_chunks to zero in this response.
   The Responder MUST silently discard the ingress message without
   processing it further.

   The Requester can reconstruct the Call using Message Continuation
   or a Special Format payload and resend it. If the Requester chooses
   not to resend the Call, it MUST terminate this RPC transaction with
   an error.

7.3.2. RDMA2_ERR_WRITE_CHUNKS

   If a Requester has constructed an RPC Call with more DDP-eligible
   results than the Responder is prepared to Write, the Responder MUST
   set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS and set the
   rdma_max_chunks field to the maximum number of Write chunks the
   Responder can return. The Requester can reconstruct the Call with
   no Write chunks and a Reply chunk of appropriate size. If the
   Requester does not resend the Call, it MUST terminate this RPC
   transaction with an error.
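
   The two chunk-limit errors described so far can be illustrated from
   the Requester's point of view. The following Python sketch is not
   part of the protocol definition. It assumes the wire layout implied
   by the XDR in Sections 8.3 and 8.4 (four uint32 prefix fields
   followed by rdma_err and rdma_max_chunks, each a big-endian XDR
   unsigned integer); the function names parse_chunk_limit_error and
   recovery_options are hypothetical and exist only for this example.

```python
import struct

# Code points copied from the XDR definition in Section 8.4.
RDMA2_ERROR = 4
RDMA2_ERR_READ_CHUNKS = 6
RDMA2_ERR_WRITE_CHUNKS = 7


def parse_chunk_limit_error(buf):
    """Return (xid, err, max_chunks), or None when buf is not a
    version 2 RDMA2_ERROR header carrying a chunk-limit error."""
    # Assumed layout: rdma_xid, rdma_vers, rdma_credit, rdma_htype,
    # rdma_err, rdma_max_chunks; six uint32 values, 24 octets in all.
    if len(buf) < 24:
        return None
    xid, vers, _credit, htype, err, max_chunks = struct.unpack_from(">6I", buf)
    if vers != 2 or htype != RDMA2_ERROR:
        return None
    if err not in (RDMA2_ERR_READ_CHUNKS, RDMA2_ERR_WRITE_CHUNKS):
        return None
    return xid, err, max_chunks


def recovery_options(err):
    """Hypothetical helper: list what Sections 7.3.1 and 7.3.2 allow
    a Requester to do after receiving the given error."""
    if err == RDMA2_ERR_READ_CHUNKS:
        return ("resend using Message Continuation",
                "resend using a Special Format payload",
                "terminate the RPC transaction with an error")
    if err == RDMA2_ERR_WRITE_CHUNKS:
        return ("resend with no Write chunks and a Reply chunk",
                "terminate the RPC transaction with an error")
    raise ValueError("not a chunk-limit error")
```

   A real implementation would also honor the returned rdma_max_chunks
   value when it rebuilds the Call; the sketch only surfaces that
   value to the caller.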

   If the Responder implementation cannot handle any Write chunks for
   a request and cannot send the Reply using Message Continuation, it
   MUST return a response of RDMA2_ERR_REPLY_RESOURCE instead (see
   Section 7.3.5).

7.3.3. RDMA2_ERR_SEGMENTS

   If a Requester has constructed an RPC Call with a chunk that
   contains more segments than the Responder supports, the Responder
   MUST set the rdma_err field to RDMA2_ERR_SEGMENTS and set the
   rdma_max_segments field to the maximum number of segments the
   Responder can process. The Requester can reconstruct the Call and
   resend it. If the Requester does not resend the Call, it MUST
   terminate this RPC transaction with an error.

7.3.4. RDMA2_ERR_WRITE_RESOURCE

   If a Requester has provided a Write chunk that is not large enough
   to contain a DDP-eligible result, the Responder MUST set the
   rdma_err field to RDMA2_ERR_WRITE_RESOURCE. The Responder MUST set
   the rdma_chunk_index field to point to the first Write chunk in the
   transport header that is too short, or to zero to indicate that it
   was not possible to determine which chunk is too small. Indexing
   starts at one (1), which represents the first Write chunk. The
   Responder MUST set the rdma_length_needed to the number of bytes
   needed in that chunk to convey the result data item.

   The Requester can reconstruct the Call with more reply resources
   and resend it. If the Requester does not resend the Call (for
   instance, if the Responder set the index and length fields to
   zero), it MUST terminate this RPC transaction with an error.

7.3.5. RDMA2_ERR_REPLY_RESOURCE

   If a Responder cannot send an RPC Reply using Message Continuation
   and the Reply does not fit in the Reply chunk, the Responder MUST
   set the rdma_err field to RDMA2_ERR_REPLY_RESOURCE. The Responder
   MUST set the rdma_length_needed to the number of Reply chunk bytes
   needed to convey the reply. The Requester can reconstruct the Call
   with more reply resources and resend it. If the Requester does not
   resend the Call (for instance, if the Responder set the length
   field to zero), it MUST terminate this RPC transaction with an
   error.

7.4. Other Operational Errors

   While a Requester is constructing an RPC Call message, an
   unrecoverable problem might occur that prevents the Requester from
   posting further RDMA Work Requests on behalf of that message. As
   with other transports, if a Requester is unable to construct and
   transmit an RPC Call, the associated RPC transaction fails
   immediately.

   After a Requester has received a Reply, if it is unable to
   invalidate a memory region due to an unrecoverable problem, the
   Requester MUST close the connection to protect that memory from
   Responder access before the associated RPC transaction is complete.

   While a Responder is constructing an RPC Reply message or error
   message, an unrecoverable problem might occur that prevents the
   Responder from posting further RDMA Work Requests on behalf of that
   message. If a Responder is unable to construct and transmit an RPC
   Reply or RPC-over-RDMA error message, the Responder MUST close the
   connection to signal to the Requester that a reply was lost.

7.4.1. RDMA2_ERR_SYSTEM

   If some problem occurs on a Responder that does not fit into the
   above categories, the Responder MAY report it to the Requester by
   setting the rdma_err field to RDMA2_ERR_SYSTEM. The Responder MUST
   silently discard the message(s) associated with the failing
   transaction without further processing.

   RDMA2_ERR_SYSTEM is a permanent error. This error does not indicate
   how much of the transaction the Responder has processed, nor does
   it indicate a particular recovery action for the Requester.
   A Requester that receives this error MUST terminate the RPC
   transaction associated with the XID value in the RDMA2_ERROR
   message's rdma_xid field.

7.5. RDMA Transport Errors

   The RDMA connection and physical link provide some degree of error
   detection and retransmission. The Marker PDU Aligned Framing (MPA)
   protocol (as described in Section 7.1 of [RFC5044]) as well as the
   InfiniBand link layer [IBA] provide Cyclic Redundancy Check (CRC)
   protection of RDMA payloads. CRC-class protection is a general
   attribute of such transports.

   Additionally, the RPC layer itself can accept errors from the
   transport and recover via retransmission. RPC recovery can
   typically handle complete loss and re-establishment of a transport
   connection.

   The details of reporting and recovery from RDMA link-layer errors
   are described in specific link-layer APIs and operational
   specifications and are outside the scope of this protocol
   specification. See Section 11 for further discussion of RPC-level
   integrity schemes.

8. XDR Protocol Definition

   This section contains a description of the core features of the
   RPC-over-RDMA version 2 protocol expressed in the XDR language
   [RFC4506]. It organizes the description to make it simple to
   extract into a form that is ready to compile or combine with
   similar descriptions published later as extensions to RPC-over-RDMA
   version 2.

8.1. Code Component License

   Code Components extracted from the current document must include
   the following license text. When combining the extracted XDR code
   with other XDR code which has an identical license, only a single
   copy of the license text needs to be retained.

   <CODE BEGINS>
   /// /*
   ///  * Copyright (c) 2010, 2020 IETF Trust and the persons
   ///  * identified as authors of the code. All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///
   <CODE ENDS>

8.2. Extraction of the XDR Definition

   Implementers can apply the following sed script to the current
   document to produce a machine-readable XDR description of the base
   RPC-over-RDMA version 2 protocol.

   <CODE BEGINS>
   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'
   <CODE ENDS>

   That is, if this document is in a file called "spec.txt", then
   implementers can do the following to extract an XDR description
   file and store it in the file rpcrdma-v2.x.

   <CODE BEGINS>
   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
        < spec.txt > rpcrdma-v2.x
   <CODE ENDS>

   Although this file is a usable description of the base protocol,
   when extensions are to be supported, it may be desirable to divide
   the description into multiple files. The following script achieves
   that purpose:

   <CODE BEGINS>
   #!/usr/local/bin/perl
   open(IN,"rpcrdma-v2.x");
   open(OUT,">temp.x");
   # Read the extracted XDR one line at a time.
   while(<IN>)
   {
     # A marker such as "/* FILE ENDS: common.x; */" closes the
     # current output file; capture the name up to the semicolon.
     if (m/FILE ENDS: (.*?);/)
       {
         close(OUT);
         rename("temp.x", $1);
         open(OUT,">temp.x");
       }
       else
       {
         print OUT $_;
       }
   }
   close(IN);
   close(OUT);
   <CODE ENDS>

   Running the above script results in two files:

   *  The file common.x, containing the license plus the shared XDR
      definitions that need to be made available to both the base
      protocol and any subsequent extensions.

   *  The file baseops.x containing the XDR definitions for the base
      protocol defined in this document.

   Extensions to RPC-over-RDMA version 2, published as Standards Track
   documents, should have similarly structured XDR definitions. Once
   an implementer has extracted the XDR for all desired extensions and
   the base XDR definition contained in the current document, she can
   concatenate them to produce a consolidated XDR definition that
   reflects the set of extensions selected for her RPC-over-RDMA
   version 2 implementation.

   Alternatively, the XDR descriptions can be compiled separately.
   In that case, the combination of common.x and baseops.x defines the
   base transport. The combination of common.x and the XDR description
   of each extension produces a full XDR definition of that extension.

8.3. XDR Definition for RPC-over-RDMA Version 2 Core Structures

   <CODE BEGINS>
   /// /***************************************************************
   ///  *    Transport Header Prefixes
   ///  ***************************************************************/
   ///
   /// struct rpcrdma_common {
   ///         uint32 rdma_xid;
   ///         uint32 rdma_vers;
   ///         uint32 rdma_credit;
   ///         uint32 rdma_htype;
   /// };
   ///
   /// struct rpcrdma2_hdr_prefix {
   ///         struct rpcrdma_common rdma_start;
   /// };
   ///
   /// /***************************************************************
   ///  *    Chunks and Chunk Lists
   ///  ***************************************************************/
   ///
   /// struct rpcrdma2_segment {
   ///         uint32 rdma_handle;
   ///         uint32 rdma_length;
   ///         uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///         uint32 rdma_position;
   ///         struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///         struct rpcrdma2_read_segment rdma_entry;
   ///         struct rpcrdma2_read_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///         struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///         struct rpcrdma2_write_chunk rdma_entry;
   ///         struct rpcrdma2_write_list *rdma_next;
   /// };
   ///
   /// /***************************************************************
   ///  *    Transport Properties
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Types for transport properties model
   ///  */
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///         rpcrdma2_propid rdma_which;
   ///         opaque rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const RDMA2_PROPID_SBSIZ = 1;
   /// const RDMA2_PROPID_RBSIZ = 2;
   /// const RDMA2_PROPID_RSSIZ = 3;
   /// const RDMA2_PROPID_RCSIZ = 4;
   /// const RDMA2_PROPID_BRS = 5;
   /// const RDMA2_PROPID_HOSTAUTH = 6;
   ///
   /// /*
   ///  * Types specific to particular properties
   ///  */
   /// typedef uint32 rpcrdma2_prop_sbsiz;
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef uint32 rpcrdma2_prop_rssiz;
   /// typedef uint32 rpcrdma2_prop_rcsiz;
   /// typedef uint32 rpcrdma2_prop_brs;
   /// typedef opaque rpcrdma2_prop_hostauth<>;
   ///
   /// const RDMA2_RVRSDIR_NONE = 0;
   /// const RDMA2_RVRSDIR_SIMPLE = 1;
   /// const RDMA2_RVRSDIR_CONT = 2;
   /// const RDMA2_RVRSDIR_GENL = 3;
   ///
   /// /* FILE ENDS: common.x; */
   <CODE ENDS>

8.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types

   <CODE BEGINS>
   /// /***************************************************************
   ///  *    Descriptions of RPC-over-RDMA Header Types
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Header Type Codes: Control plane operations.
   ///  */
   /// const RDMA2_ERROR = 4;
   /// const RDMA2_GRANT = 5;
   /// const RDMA2_CONNPROP_MIDDLE = 6;
   /// const RDMA2_CONNPROP_FINAL = 7;
   ///
   /// /*
   ///  * Header Type Codes: Call messages.
   ///  */
   /// const RDMA2_CALL_EXTERNAL = 8;
   /// const RDMA2_CALL_MIDDLE = 9;
   /// const RDMA2_CALL_INLINE = 10;
   ///
   /// /*
   ///  * Header Type Codes: Reply messages.
   ///  */
   /// const RDMA2_REPLY_EXTERNAL = 11;
   /// const RDMA2_REPLY_MIDDLE = 12;
   /// const RDMA2_REPLY_INLINE = 13;
   ///
   /// /*
   ///  * Header Type to Report Errors.
   ///  */
   /// const RDMA2_ERR_VERS = 1;
   /// const RDMA2_ERR_BAD_XDR = 2;
   /// const RDMA2_ERR_BAD_PROPVAL = 3;
   /// const RDMA2_ERR_INVAL_HTYPE = 4;
   /// const RDMA2_ERR_INVAL_CONT = 5;
   /// const RDMA2_ERR_READ_CHUNKS = 6;
   /// const RDMA2_ERR_WRITE_CHUNKS = 7;
   /// const RDMA2_ERR_SEGMENTS = 8;
   /// const RDMA2_ERR_WRITE_RESOURCE = 9;
   /// const RDMA2_ERR_REPLY_RESOURCE = 10;
   /// const RDMA2_ERR_VERS_MISMATCH = 11;
   /// const RDMA2_ERR_SYSTEM = 100;
   ///
   /// struct rpcrdma2_err_vers {
   ///         uint32 rdma_vers_low;
   ///         uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///         uint32 rdma_chunk_index;
   ///         uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_hdr_error switch (rpcrdma2_errcode rdma_err) {
   ///         case RDMA2_ERR_VERS:
   ///           rpcrdma2_err_vers rdma_vrange;
   ///         case RDMA2_ERR_READ_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_WRITE_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_SEGMENTS:
   ///           uint32 rdma_max_segments;
   ///         case RDMA2_ERR_WRITE_RESOURCE:
   ///           rpcrdma2_err_write rdma_writeres;
   ///         case RDMA2_ERR_REPLY_RESOURCE:
   ///           uint32 rdma_length_needed;
   ///         default:
   ///           void;
   /// };
   ///
   /// /*
   ///  * Header Type to Exchange Transport Properties.
   ///  */
   /// struct rpcrdma2_hdr_connprop {
   ///         rpcrdma2_propset rdma_props;
   /// };
   ///
   /// /*
   ///  * Header Types to Convey RPC Messages.
   ///  */
   /// struct rpcrdma2_hdr_call_external {
   ///         uint32 rdma_inv_handle;
   ///
   ///         struct rpcrdma2_read_list *rdma_call;
   ///         struct rpcrdma2_read_list *rdma_reads;
   ///         struct rpcrdma2_write_list *rdma_provisional_writes;
   ///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
   /// };
   ///
   /// struct rpcrdma2_hdr_call_middle {
   ///         uint32 rdma_remaining;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_call_inline {
   ///         uint32 rdma_inv_handle;
   ///
   ///         struct rpcrdma2_read_list *rdma_reads;
   ///         struct rpcrdma2_write_list *rdma_provisional_writes;
   ///         struct rpcrdma2_write_chunk *rdma_provisional_reply;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_external {
   ///         struct rpcrdma2_write_list *rdma_writes;
   ///         struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_middle {
   ///         uint32 rdma_remaining;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_hdr_reply_inline {
   ///         struct rpcrdma2_write_list *rdma_writes;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// /* FILE ENDS: baseops.x; */
   <CODE ENDS>

8.5. Use of the XDR Description

   The files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.

   Although this XDR description can generate encoders and decoders
   for the Transport and Payload streams, there are elements of the
   operation of RPC-over-RDMA version 2 that cannot be expressed
   within the XDR language. Implementations that use the output of an
   automated XDR processor need to provide additional code to bridge
   these gaps.

   *  The Transport stream is not a single XDR object. Instead, the
      header prefix is one XDR data item, and the rest of the header
      is a separate XDR data item. Table 2 expresses the mapping
      between the header type in the header prefix and the XDR object
      representing the header type.

   *  The relationship between the Transport stream and the Payload
      stream is not specified using XDR. Comments within the XDR text
      make clear where transported messages, described by their own
      XDR definitions, need to appear. Such data is opaque to the
      transport.

   *  Continuation of RPC messages across transport message boundaries
      requires that message assembly facilities not specifiable within
      XDR are part of transport implementations.

   *  Transport properties are constant integer values. Table 1
      expresses the mapping between each property's code point and the
      XDR typedef that represents the structure of the property's
      value. XDR does not possess the facility to express that mapping
      in an extensible way.

   The role of XDR in RPC-over-RDMA specifications is more limited
   than for protocols where the totality of the protocol is
   expressible within XDR. XDR lacks the facility to represent the
   embedding of XDR-encoded payload material.
Also, the need to cleanly accommodate 2920 extensions has meant that those using rpcgen in their applications 2921 need to take an active role to provide the facilities that cannot be 2922 expressed within XDR. 2924 9. RPC Bind Parameters 2926 Before establishing a new connection, an RPC client obtains a 2927 transport address for the RPC server. The means used to obtain this 2928 address and to open an RDMA connection is dependent on the type of 2929 RDMA transport and is the responsibility of each RPC protocol binding 2930 and its local implementation. 2932 RPC services typically register with a portmap or rpcbind service 2933 [RFC1833], which associates an RPC Program number with a service 2934 address. This policy is no different with RDMA transports. However, 2935 a distinct service address (port number) is sometimes required for 2936 operation on RPC-over-RDMA. 2938 When mapped atop MPA [RFC5044], which uses IP port addressing due to 2939 its layering on TCP or SCTP, port mapping is trivial and consists 2940 merely of issuing the port in the connection process. The NFS/RDMA 2941 protocol service address has been assigned port 20049 by IANA for 2942 this deployment scenario [RFC8267]. 2944 When mapped atop InfiniBand [IBA], which uses a service endpoint 2945 naming scheme based on a Group Identifier (GID), a translation MUST 2946 be employed. One such translation is described in Annexes A3 2947 (Application Specific Identifiers), A4 (Sockets Direct Protocol 2948 (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate 2949 for translating IP port addressing to the InfiniBand network. 2950 Therefore, in this case, IP port addressing may be readily employed 2951 by the upper layer. 
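
   As an illustration of the IP port addressing discussed above, the
   following sketch builds the universal address string that a server
   might register with rpcbind for the NFS/RDMA port. It assumes the
   "h1.h2.h3.h4.p1.p2" universal address layout used for IPv4-based
   netids (see [RFC5665]); the function names and the example address
   192.0.2.1 are hypothetical and are not part of this specification.

```python
import ipaddress

NFS_RDMA_PORT = 20049  # IANA-assigned NFS/RDMA port cited in this section


def ipv4_uaddr(host: str, port: int) -> str:
    """Build an IPv4 universal address "h1.h2.h3.h4.p1.p2" (assumed layout)."""
    addr = ipaddress.IPv4Address(host)  # validates the dotted-quad address
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    # The port is split into its high-order and low-order octets.
    return f"{addr}.{port >> 8}.{port & 0xFF}"


def uaddr_port(uaddr: str) -> int:
    """Recover the port number from an IPv4 universal address."""
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("not an IPv4 universal address")
    return (int(parts[4]) << 8) | int(parts[5])
```

   For example, ipv4_uaddr("192.0.2.1", NFS_RDMA_PORT) yields a string
   whose last two fields carry the two octets of port 20049.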

   When a mapping standard or convention exists for IP ports on an RDMA
   interconnect, there are several possibilities for each upper layer to
   consider:

   *  One possibility is to have the server register its mapped IP port
      with the rpcbind service under the netid (or netids) defined in
      [RFC8166]. An RPC-over-RDMA-aware RPC client can then resolve its
      desired service to a mappable port and proceed to connect. This
      method is the most flexible and compatible approach for those
      upper layers that are defined to use the rpcbind service.

   *  A second possibility is to have the RPC server's portmapper
      register itself on the RDMA interconnect at a "well-known" service
      address (on UDP or TCP, this corresponds to port 111). An RPC
      client can connect to this service address and use the portmap
      protocol to obtain a service address in response to a program
      number (e.g., a TCP port number or an InfiniBand GID).

   *  Alternately, an RPC client can connect to the mapped well-known
      port for the service itself, if it is appropriately defined. By
      convention, the NFS/RDMA service, when operating atop an
      InfiniBand fabric, uses the same 20049 assignment as for MPA.

   Historically, different RPC protocols have taken different approaches
   to their port assignments. The current document leaves the choice of
   method to each RPC-over-RDMA-enabled ULB.

   [RFC8166] defines two new netid values to be used for registration of
   upper layers atop MPA and (when a suitable port translation service
   is available) InfiniBand. Additional RDMA-capable networks MAY
   define their own netids, or if they provide a port translation, they
   MAY share the ones defined in [RFC8166].

10. Implementation Status

   This section is to be removed before publishing as an RFC.

   This section records the status of known implementations of the
   protocol defined by this specification at the time of posting of this
   Internet-Draft, and is based on a proposal described in [RFC7942].
   The description of implementations in this section is intended to
   assist the IETF in its decision processes in progressing drafts to
   RFCs.

   Please note that the listing of any individual implementation here
   does not imply endorsement by the IETF. Furthermore, no effort has
   been spent to verify the information presented here that was supplied
   by IETF contributors. This is not intended as, and must not be
   construed to be, a catalog of available implementations or their
   features. Readers are advised to note that other implementations may
   exist.

   At this time, no known implementations of the protocol described in
   the current document exist.

11. Security Considerations

11.1. Memory Protection

   A primary consideration is the protection of the integrity and
   confidentiality of host memory by an RPC-over-RDMA transport. The
   use of an RPC-over-RDMA transport protocol MUST NOT introduce
   vulnerabilities to system memory contents nor memory owned by user
   processes. Any RDMA provider used for RPC transport MUST conform to
   the requirements of [RFC5042] to satisfy these protections.

11.1.1. Protection Domains

   The use of a Protection Domain to limit the exposure of memory
   regions to a single connection is critical. Any attempt by an
   endpoint not participating in that connection to reuse memory handles
   needs to result in immediate failure of that connection. Because ULP
   security mechanisms rely on this aspect of Reliable Connected
   behavior, implementations SHOULD cryptographically authenticate
   connection endpoints.

11.1.2. Handle (STag) Predictability

   Implementations should use unpredictable memory handles for any
   operation requiring exposed memory regions. Exposing a continuously
   registered memory region allows a remote host to read or write to
   that region even when an RPC involving that memory is not underway.
   Therefore, implementations should avoid the use of persistently
   registered memory.

11.1.3. Memory Protection

   Requesters should register memory regions for remote access only when
   they are about to be the target of an RPC transaction that involves
   an RDMA Read or Write.

   Requesters should invalidate memory regions as soon as related RPC
   operations are complete. Invalidation and DMA unmapping of memory
   regions should complete before the receiver checks message integrity,
   and before the RPC consumer can use or alter the contents of the
   exposed memory region.

   An RPC transaction on a Requester can terminate before a Reply
   arrives, for example, if the RPC consumer is signaled, or a
   segmentation fault occurs. When an RPC terminates abnormally, memory
   regions associated with that RPC should be invalidated before the
   Requester reuses those regions for other purposes.

11.1.4. Denial of Service

   A detailed discussion of denial-of-service exposures that can result
   from the use of an RDMA transport appears in Section 6.4 of
   [RFC5042].

   A Responder is not obliged to pull unreasonably large Read chunks. A
   Responder can use an RDMA2_ERROR response to terminate RPCs with
   unreadable Read chunks. If a Responder transmits more data than a
   Requester is prepared to receive in a Write or Reply chunk, the RDMA
   provider typically terminates the connection. For further
   discussion, see Section 6.3.1. Such repeated connection termination
   can deny service to other users sharing the connection from the
   errant Requester.
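
   The Responder-side policy in the paragraph above can be sketched as
   a simple size check made before any RDMA Read is issued. This is a
   minimal illustration, not a definitive implementation: the
   ReadSegment type, the MAX_READ_BYTES limit, and the function name are
   hypothetical, and the actual limit is a local policy choice.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical local policy: the largest Read payload this Responder will pull.
MAX_READ_BYTES = 1 << 20


@dataclass(frozen=True)
class ReadSegment:
    """Illustrative stand-in for one Read segment (handle, length, offset)."""
    handle: int
    length: int
    offset: int


def should_pull(segments: Iterable[ReadSegment],
                limit: int = MAX_READ_BYTES) -> bool:
    """Return True if the total Read payload is within the local limit.

    When this returns False, the Responder would answer with an error
    response instead of issuing RDMA Reads.
    """
    total = 0
    for seg in segments:
        total += seg.length
        if total > limit:
            return False  # stop early; no need to sum the rest
    return True
```

   Checking the advertised lengths first lets a Responder decline an
   oversized request without consuming fabric bandwidth or local buffers.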

   An RPC-over-RDMA transport implementation is not responsible for
   throttling the RPC request rate, other than to keep the number of
   concurrent RPC transactions within the per-connection credit limits
   (see Section 4.2.1). A sender can trigger a self-denial of service
   by exceeding the credit limit repeatedly.

   When an RPC transaction terminates due to a signal or premature exit
   of an application process, a Requester should invalidate the RPC's
   Write and Reply chunks. Invalidation prevents the subsequent arrival
   of the Responder's Reply from altering the memory regions associated
   with those chunks after the Requester has released that memory.

   On the Requester, a malfunctioning application or a malicious user
   can create a situation where RPCs initiate and abort continuously,
   resulting in Responder replies that terminate the underlying
   RPC-over-RDMA connection repeatedly. Such situations can deny service
   to other users sharing the connection from that Requester.

11.2. RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework
   [RFC7861]. RPCSEC_GSS implements message authentication
   (rpc_gss_svc_none), per-message integrity checking
   (rpc_gss_svc_integrity), and per-message confidentiality
   (rpc_gss_svc_privacy) in a layer above the RPC-over-RDMA transport.
   The integrity and privacy services require significant computation
   and movement of data on each endpoint host. Some performance
   benefits enabled by RDMA transports can be lost.

11.2.1. RPC-over-RDMA Protection at Other Layers

   For any RPC transport, utilizing RPCSEC_GSS integrity or privacy
   services has performance implications. Protection below the RPC
   implementation is often a better choice in performance-sensitive
   deployments, especially if it, too, can be offloaded. Certain
   implementations of IPsec can be co-located in RDMA hardware, for
   example, without change to RDMA consumers and with little loss of
   data movement efficiency. Such arrangements can also provide a
   higher degree of privacy by hiding endpoint identity or altering the
   frequency at which messages are exchanged, at a performance cost.

   Implementations MAY negotiate the use of protection in another layer
   through the use of an RPCSEC_GSS security flavor defined in [RFC7861]
   in conjunction with the Channel Binding mechanism [RFC5056] and IPsec
   Channel Connection Latching [RFC5660].

11.2.2. RPCSEC_GSS on RPC-over-RDMA Transports

   Not all RDMA devices and fabrics support the above protection
   mechanisms. Also, NFS clients, where multiple users can access NFS
   files, still require per-message authentication. In these cases,
   RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
   connections.

   RPCSEC_GSS extends the ONC RPC protocol without changing the format
   of RPC messages. By observing the conventions described in this
   section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected
   RPC messages interoperably.

   Senders MUST NOT reduce protocol elements of RPCSEC_GSS that appear
   in the Payload stream of an RPC-over-RDMA message. Such elements
   include control messages exchanged as part of establishing or
   destroying a security context, or data items that are part of
   RPCSEC_GSS authentication material.

11.2.2.1. RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to
   establish a Generic Security Service (GSS) context for NFS operation.
   Such clients use TCP and the standard NFS port (2049) for context
   establishment. Therefore, an NFS server MUST also provide a
   TCP-based NFS service on port 2049 to enable the use of RPCSEC_GSS
   with NFS/RDMA.

11.2.2.2. RPC-over-RDMA with RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the
   DDP-eligibility of data items in a ULP.

   However, RPCSEC_GSS authentication material appearing in an RPC
   message header can be larger than, say, an AUTH_SYS authenticator.
   In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester
   needs to accommodate a larger RPC credential when marshaling RPC
   Calls and needs to provide for a maximum size RPCSEC_GSS verifier
   when allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are larger on average as a
   result. ULP operations that fit in a Simple Format message when a
   simpler form of authentication is in use might need to be reduced or
   conveyed via a Special Format message when RPCSEC_GSS authentication
   is in use. It is therefore more likely that a Requester provisions
   both a Read list and a Reply chunk in the same RPC-over-RDMA
   Transport header to convey a Special Format Call and provision a
   receptacle for a Special Format Reply.

   In addition to this cost, the XDR encoding and decoding of each RPC
   message using RPCSEC_GSS authentication requires per-message host
   compute resources to construct the GSS verifier.

11.2.2.3. RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect the
   modification of RPC messages in flight. The RPCSEC_GSS privacy
   service prevents all but the intended recipient from viewing the
   cleartext content of RPC arguments and results. RPCSEC_GSS integrity
   and privacy services are end-to-end. They protect RPC arguments and
   results from application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC
   messages after they have been XDR encoded, and before they have been
   XDR decoded after receipt. Connection endpoints use intermediate
   buffers to prevent exposure of encrypted or unverified cleartext data
   to RPC consumers. After a sender has verified, encrypted, and
   wrapped a message, the transport layer MAY use RDMA data transfer
   between these intermediate buffers.

   The process of reducing a DDP-eligible data item removes the data
   item and its XDR padding from an encoded Payload stream. In a
   non-protected RPC-over-RDMA message, a reduced data item does not
   include XDR padding. After reduction, the Payload stream contains
   fewer octets than the whole XDR stream did beforehand. XDR padding
   octets are often zero bytes, but they don't have to be. Thus,
   reducing DDP-eligible items affects the result of message integrity
   verification and encryption.

   Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
   integrity or encryption services are in use. Effectively, no data
   item is DDP-eligible in this situation. Senders can use only Simple
   and Continued Formats without data item chunks, or Special Format.
   In this mode, an RPC-over-RDMA transport operates in the same manner
   as a transport that does not support DDP.

11.2.2.4. Protecting RPC-over-RDMA Transport Headers

   RPCSEC_GSS does not protect the RPC-over-RDMA Transport stream, just
   as it does not protect the header fields in an RPC message (e.g., the
   xid and mtype fields). XIDs, connection credit limits, and chunk
   lists (though not the content of the data items they refer to) are
   exposed to malicious behavior, which can redirect data that is
   transferred by the RPC-over-RDMA message, result in spurious
   retransmits, or trigger connection loss.

   In particular, if an attacker alters the information contained in the
   chunk lists of an RPC-over-RDMA Transport header, data contained in
   those chunks can be redirected to other registered memory regions on
   Requesters. An attacker might alter the arguments of RDMA Read and
   RDMA Write operations on the wire to gain a similar effect. If such
   alterations occur, the use of RPCSEC_GSS integrity or privacy
   services enables a Requester to detect unexpected material in a
   received RPC message.

   Encryption at other layers, as described in Section 11.2.1, protects
   the content of the Transport stream. RDMA transport implementations
   should conform to [RFC5042] to address attacks on RDMA protocols
   themselves.

11.3. Transport Properties

   Like other fields that appear in the Transport stream, transport
   properties are sent in the clear with no integrity protection, making
   them vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size, it could reduce connection performance or
   trigger loss of connection. Repeated connection loss can impact
   performance or even prevent a new connection from being established.
   The recourse is to deploy on a private network or use transport layer
   encryption.

11.4. Host Authentication

   [ cel: This subsection is unfinished. ]

   Wherein we use the relevant sections of [RFC3552] to analyze the
   addition of host authentication to this RPC-over-RDMA transport.

   The authors refer readers to Appendix C of [RFC8446] for information
   on how to design and test a secure authentication handshake
   implementation.

12. IANA Considerations

   The RPC-over-RDMA family of transports has been assigned RPC netids
   by [RFC8166]. A netid is an rpcbind [RFC1833] string used to
   identify the underlying protocol in order for RPC to select
   appropriate transport framing and the format of the service addresses
   and ports.

   The following netid registry strings are already defined for this
   purpose:

      NC_RDMA "rdma"
      NC_RDMA6 "rdma6"

   The "rdma" netid is to be used when IPv4 addressing is employed by
   the underlying transport, and "rdma6" when IPv6 addressing is
   employed. The netid assignment policy and registry are defined in
   [RFC5665]. The current document does not alter these netid
   assignments.

   These netids MAY be used for any RDMA network that satisfies the
   requirements of Section 3.2.2 and that is able to identify service
   endpoints using IP port addressing, possibly through use of a
   translation service as described in Section 9.

13. References

13.1. Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching",
              RFC 5660, DOI 10.17487/RFC5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call
              (RPC) Network Identifiers and Universal Address Formats",
              RFC 5665, DOI 10.17487/RFC5665, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", BCP 205,
              RFC 7942, DOI 10.17487/RFC7942, July 2016.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
              to RPC-over-RDMA Version 1", RFC 8267,
              DOI 10.17487/RFC8267, October 2017.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
              Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018.

13.2. Informative References

   [CBFC]     Kung, H.T., Blackwell, T., and A. Chapman, "Credit-Based
              Flow Control for ATM Networks: Credit Update Protocol,
              Adaptive Credit Allocation, and Statistical Multiplexing",
              Proc. ACM SIGCOMM '94 Symposium on Communications
              Architectures, Protocols and Applications, pp. 101-114,
              August 1994.

   [I-D.ietf-nfsv4-rpc-tls]
              Myklebust, T. and C. Lever, "Towards Remote Procedure Call
              Encryption By Default", Work in Progress, Internet-Draft,
              draft-ietf-nfsv4-rpc-tls-11, 23 November 2020.

   [IBA]      InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.
              Available from https://www.infinibandta.org/

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              DOI 10.17487/RFC3552, July 2003.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5044]  Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
              Carrier, "Marker PDU Aligned Framing for TCP
              Specification", RFC 5044, DOI 10.17487/RFC5044, October
              2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement",
              RFC 5532, DOI 10.17487/RFC5532, May 2009.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor
              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
              November 2016.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on
              RPC-over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
              June 2017.

   [RFC8881]  Noveck, D., Ed. and C. Lever, "Network File System (NFS)
              Version 4 Minor Version 1 Protocol", RFC 8881,
              DOI 10.17487/RFC8881, August 2020.

Appendix A. ULB Specifications

   Typically, an Upper-Layer Protocol (ULP) is defined without regard to
   a particular RPC transport. An Upper-Layer Binding (ULB)
   specification provides guidance that helps a ULP interoperate
   correctly and efficiently over a particular transport. For
   RPC-over-RDMA version 2, a ULB may provide:

   *  A taxonomy of XDR data items that are eligible for DDP

   *  Constraints on which upper-layer procedures a sender may reduce,
      and on how many chunks may appear in a single RPC message

   *  A method enabling a Requester to determine the maximum size of the
      reply Payload stream for all procedures in the ULP

   *  An rpcbind port assignment for the RPC Program and Version when
      operating on the particular transport

   Each RPC Program and Version tuple that operates on RPC-over-RDMA
   version 2 needs to have a ULB specification.

A.1. DDP-Eligibility

   A ULB designates specific XDR data items as eligible for DDP. As a
   sender constructs an RPC-over-RDMA message, it can remove
   DDP-eligible data items from the Payload stream so that the RDMA
   provider can place them directly in the receiver's memory. An XDR
   data item should be considered for DDP-eligibility if there is a
   clear benefit to moving the contents of the item directly from the
   sender's memory to the receiver's memory.

   Criteria for DDP-eligibility include:

   *  The XDR data item is frequently sent or received, and its size is
      often much larger than typical inline thresholds.

   *  If the XDR data item is a result, its maximum size must be
      predictable in advance by the Requester.

   *  Transport-level processing of the XDR data item is not needed.
      For example, the data item is an opaque byte array, which requires
      no XDR encoding and decoding of its content.

   *  The content of the XDR data item is sensitive to address
      alignment. For example, a data copy operation would be required
      on the receiver to enable the message to be parsed correctly, or
      to enable the data item to be accessed.

   *  The XDR data item itself does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible,
   a ULB may limit the use of chunks to particular upper-layer
   procedures. If more than one data item in a procedure is
   DDP-eligible, the ULB may limit the number of chunks that a Requester
   can provide for a particular upper-layer procedure.

   Senders never reduce data items that are not DDP-eligible. Such data
   items can, however, be part of a Special Format payload.

   The programming interface by which an upper-layer implementation
   indicates the DDP-eligibility of a data item to the RPC transport is
   not described by this specification. The only requirements are that
   the receiver can re-assemble the transmitted RPC-over-RDMA message
   into a valid XDR stream and that DDP-eligibility rules specified by
   the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR
   language. The only definitive specification of DDP-eligibility is a
   ULB.

   In general, a DDP-eligibility violation occurs when:

   *  A Requester reduces a non-DDP-eligible argument data item. The
      Responder reports the violation as described in Section 6.3.1.

   *  A Responder reduces a non-DDP-eligible result data item. The
      Requester terminates the pending RPC transaction and reports an
      appropriate permanent error to the RPC consumer.

   *  A Responder does not reduce a DDP-eligible result data item into
      an available Write chunk. The Requester terminates the pending
      RPC transaction and reports an appropriate permanent error to the
      RPC consumer.

A.2. Maximum Reply Size

   When expecting small and moderately-sized Replies, a Requester should
   rely on Message Continuation rather than provision a Reply chunk.
   For each ULP procedure where there is no clear Reply size maximum and
   the maximum can be substantial, the ULB should specify a dependable
   means for determining the maximum Reply size.

A.3. Reverse-Direction Operation

   The direction of operation does not preclude the need for
   DDP-eligibility statements.

   Reverse-direction operation occurs on an already-established
   connection. Specification of RPC binding parameters is usually not
   necessary in this case.

   Other considerations may apply when distinct RPC Programs share an
   RPC-over-RDMA transport connection concurrently.

A.4. Additional Considerations

   There may be other details provided in a ULB.

   *  A ULB may recommend inline threshold values or other
      transport-related parameters for RPC-over-RDMA version 2
      connections bearing that ULP.

   *  A ULP may provide a means to communicate transport-related
      parameters between peers.

   *  Multiple ULPs may share a single RPC-over-RDMA version 2
      connection when their ULBs allow the use of RPC-over-RDMA version
      2 and the rpcbind port assignments for those protocols permit
      connection sharing. In this case, the same transport parameters
      (such as inline threshold) apply to all ULPs using that
      connection.

   Each ULB needs to be designed to allow correct interoperation without
   regard to the transport parameters actually in use. Furthermore,
   implementations of ULPs must be designed to interoperate correctly
   regardless of the connection parameters in effect on a connection.

A.5. ULP Extensions

   An RPC Program and Version tuple may be extensible. For instance,
   the RPC version number may not reflect a ULP minor versioning scheme,
   or the ULP may allow the specification of additional features after
   the publication of the original RPC Program specification. ULBs are
   provided for interoperable RPC Programs and Versions by extending
   existing ULBs to reflect the changes made necessary by each addition
   to the existing XDR.

   [ cel: The final sentence is unclear, and may be inaccurate. I
   believe I copied this section directly from RFC 8166. Is there more
   to be said, now that we have some experience? ]

Appendix B. Extending RPC-over-RDMA Version 2

   This Appendix is not addressed to protocol implementers, but rather
   to authors of documents that extend the protocol specified in the
   current document.

   RPC-over-RDMA version 2 extensibility facilitates limited extensions
   to the base protocol presented in the current document so that new
   optional capabilities can be introduced without a protocol version
   change while maintaining robust interoperability with existing
   RPC-over-RDMA version 2 implementations. It allows extensions to be
   defined, including the definition of new protocol elements, without
   requiring modification or recompilation of the XDR for the base
   protocol.

   Standards Track documents may introduce extensions to the base
   RPC-over-RDMA version 2 protocol in two ways:

   *  They may introduce new OPTIONAL transport header types.
      Appendix B.2 covers such transport header types.

   *  They may define new OPTIONAL transport properties. Appendix B.3
      describes such transport properties.
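
   One way an implementation might accommodate such OPTIONAL elements
   without rebuilding its base decoder is a run-time registry keyed by
   header type. The sketch below is illustrative only: the registry,
   the numeric header type value used in the usage note, and the decoder
   signature are assumptions, and the sketch does not show how an
   unsupported header type is reported on the wire.

```python
from typing import Callable, Dict

# Maps a header type code to the function that decodes the rest of the header.
HeaderDecoder = Callable[[bytes], object]
_decoders: Dict[int, HeaderDecoder] = {}


class UnsupportedHeaderType(Exception):
    """Raised when no decoder is registered for a received header type."""


def register_header_type(htype: int, decoder: HeaderDecoder) -> None:
    """Called by the base protocol and by each extension the peer supports."""
    if htype in _decoders:
        raise ValueError(f"header type {htype} already registered")
    _decoders[htype] = decoder


def decode_header(htype: int, body: bytes) -> object:
    """Dispatch on the header type taken from the already-decoded prefix."""
    try:
        decoder = _decoders[htype]
    except KeyError:
        raise UnsupportedHeaderType(htype) from None
    return decoder(body)
```

   An extension would call register_header_type() once at start-up, for
   example with a hypothetical code point 1000, and the base dispatch
   code would remain unchanged.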

   These documents may also add the following sorts of ancillary
   protocol elements to the protocol to support the addition of new
   transport properties and header types:

   *  They may create new error codes, as described in Appendix B.4.

   New capabilities can be proposed and developed independently of each
   other. Implementers can choose among them, making it straightforward
   to create and document experimental features and then bring them
   through the standards process.

B.1. Documentation Requirements

   As described earlier, a Standards Track document introduces a set of
   new protocol elements. Together these elements are considered an
   OPTIONAL feature. Each implementation is either aware of all the
   protocol elements introduced by that feature or is aware of none of
   them.

   Documents specifying extensions to RPC-over-RDMA version 2 should
   contain:

   *  An explanation of the purpose and use of each new protocol
      element.

   *  An XDR description including all of the new protocol elements, and
      a script to extract it.

   *  A discussion of interactions with other extensions. This
      discussion includes requirements for other OPTIONAL features to be
      present, or that a particular level of support for an OPTIONAL
      facility is required.

   Implementers combine the XDR descriptions of the new features they
   intend to use with the XDR description of the base protocol in the
   current document. This combination is necessary to create a valid
   XDR input file because extensions are free to use XDR types defined
   in the base protocol, and later extensions may use types defined by
   earlier extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol
   combined with that for any selected extensions should provide a
   human-readable and compilable definition of the extended protocol.

B.2. Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are defined in a manner similar to
   Sections 6.3.5 through 6.3.10. In particular, what is needed is:

   *  A description of the function and use of the new header type.

   *  A complete XDR description of the new header type.

   *  A description of how receivers report errors, including mechanisms
      for reporting errors outside the choices already available in the
      base protocol or other extensions.

   *  An indication of whether a Payload stream must be present, and a
      description of its contents and how receivers use such Payload
      streams to reconstruct RPC messages.

   *  As appropriate, a statement of whether a Responder may use Remote
      Invalidation when sending messages that contain the new header
      type.

   Additional documentation is necessary because of the OPTIONAL status
   of new transport header types:

   *  The document should discuss constraints on support for the new
      header types. For example, if support for one header type is
      implied or foreclosed by another one, this needs to be documented.

   *  The document should describe the preferred method by which a
      sender determines whether its peer supports a particular header
      type. It is always possible to send a test invocation of a
      particular header type to see if support is available. However,
      when more efficient means are available (e.g., the value of a
      transport property), this should be noted.

B.3. Adding New Transport Properties to the Protocol

   A Standards Track document defining a new transport property should
   include the following information paralleling that provided in this
   document for the transport properties defined herein:

   *  The rpcrdma2_propid value identifying the new property.

   *  The XDR typedef specifying the structure of its property value.

   *  A description of the new property.
3688 * An explanation of how the receiver can use this information. 3690 * The default value if a peer never receives the new property. 3692 There is no requirement that propid assignments occur in a continuous 3693 range of values. Implementations should not rely on all such values 3694 being small integers. 3696 Before the defining Standards Track document is published, the nfsv4 3697 Working Group should select a unique propid value, and ensure that: 3699 * rpcrdma2_propid values specified in the document do not conflict 3700 with those currently assigned or in use by other pending working 3701 group documents defining transport properties. 3703 * rpcrdma2_propid values specified in the document do not conflict 3704 with the range reserved for experimental use, as defined in 3705 Section 8.2. 3707 [ cel: There is no longer a section 8.2 or an experimental range 3708 of propid values. Should we request the creation of an IANA 3709 registry for propid values? ]. 3711 When a Standards Track document proposes additional transport 3712 properties, reviewers should deal with possible security issues 3713 exposed by those new transport properties. 3715 B.4. Adding New Error Codes to the Protocol 3717 The same Standards Track document that defines a new header type may 3718 introduce new error codes used to support it. A Standards Track 3719 document may similarly define new error codes that an existing header 3720 type can return. 3722 For error codes that do not require the return of additional 3723 information, a peer can use the existing RDMA_ERR2 header type to 3724 report the new error. The sender sets the new error code as the 3725 value of rdma_err with the result that the default switch arm of the 3726 rpcrdma2_error (i.e., void) is selected. 3728 For error codes that do require the return of related information 3729 together with the error, a new header type should be defined that 3730 returns the error together with the related information. 
   The sender of a new header type needs to be prepared to accept header
   types necessary to report associated errors.

Appendix C. Differences from RPC-over-RDMA Version 1

   The primary goal of RPC-over-RDMA version 2 is to relieve constraints
   that have become evident in RPC-over-RDMA version 1 with deployment
   experience:

   *  RPC-over-RDMA version 1 has been challenging to update to address
      shortcomings or improve data transfer efficiency.

   *  The average size of NFSv4 COMPOUNDs is significantly greater than
      that of NFSv3 requests, requiring the use of Long messages for
      frequent operations.

   *  Reply size estimation is difficult more often than first expected.

   This section details specific changes in RPC-over-RDMA version 2 that
   address these constraints directly, in addition to other changes to
   make implementation easier.

C.1. Changes to the XDR Definition

   Several XDR structural changes enable within-version protocol
   extensibility.

   [RFC8166] defines the RPC-over-RDMA version 1 transport header as a
   single XDR object, with an RPC message potentially following it. In
   RPC-over-RDMA version 2, there are separate XDR definitions of the
   transport header prefix (see Section 6.4), which specifies the
   transport header type to be used, and the transport header itself
   (defined within one of the subsections of Section 6.3). This
   construction is similar to an RPC message, which consists of an RPC
   header (defined in [RFC5531]) followed by a message defined by an
   Upper-Layer Protocol.

   As a new version of the RPC-over-RDMA transport protocol,
   RPC-over-RDMA version 2 exists within the versioning rules defined in
   [RFC8166].
   In particular, it maintains the first four words of the
   protocol header, as specified in Section 4.2 of [RFC8166], even
   though, as explained in Section 6.2.1 of the current document, the
   XDR definition of those words is structured differently.

   Although each of the first four fields retains its semantic function,
   there are differences in interpretation:

   *  The first word of the header, the rdma_xid field, retains the
      format and function that it had in RPC-over-RDMA version 1.
      Because RPC-over-RDMA version 2 messages can convey non-RPC
      messages, a receiver should not use the contents of this field
      without consideration of the protocol version and header type.

   *  The second word of the header, the rdma_vers field, retains the
      format and function that it had in RPC-over-RDMA version 1. To
      clearly distinguish version 1 and version 2 messages, senders need
      to fill in the correct version (fixed after version negotiation).
      Receivers should check that the content of the rdma_vers field is
      correct before using the content of any other header field.

   *  The third word of the header, the rdma_credit field, retains the
      size and general purpose that it had in RPC-over-RDMA version 1.
      However, RPC-over-RDMA version 2 divides this field into two
      16-bit subfields. See Section 4.2.1 for further details.

   *  The fourth word of the header, previously the union discriminator
      field rdma_proc, retains its format and general function even
      though the set of valid values has changed. Within RPC-over-RDMA
      version 2, this word is the rdma_htype field of the structure
      rdma_start. The value of this field is now an unsigned 32-bit
      integer rather than an enum type, to facilitate header type
      extension.

   Beyond conforming to the restrictions specified in [RFC8166],
   RPC-over-RDMA version 2 attempts to limit the scope of the changes
   made to ensure interoperability.
   Although it introduces the Call chunk
   and splits the two version 1 workhorse procedure types RDMA_MSG and
   RDMA_NOMSG into several variants, RPC-over-RDMA version 2 otherwise
   expresses chunks in the same format and utilizes them the same way.

C.2. Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for exchanging an
   implementation's operational properties. The purpose of this
   exchange is to help endpoints improve the efficiency of data transfer
   by exploiting the characteristics of both peers rather than falling
   back on the lowest common denominator default settings. A full
   discussion of transport properties appears in Section 5.

C.3. Credit Management Changes

   RPC-over-RDMA transports employ credit-based flow control to ensure
   that a Requester does not emit more RDMA Sends than the Responder is
   prepared to receive.

   Section 3.3.1 of [RFC8166] explains the operation of RPC-over-RDMA
   version 1 credit management in detail. In that design, each RDMA
   Send from a Requester contains an RPC Call with a credit request, and
   each RDMA Send from a Responder contains an RPC Reply with a credit
   grant. The credit grant implies that enough Receives have been
   posted on the Responder to handle the credit grant minus the number
   of pending RPC transactions (the number of remaining Receive buffers
   might be zero).

   Each RPC Reply acts as an implicit ACK for a previous RPC Call from
   the Requester. Without an RPC Reply message, the Requester has no
   way to know that the Responder is ready for subsequent RPC Calls.

   Because version 1 embeds credit management in each message, there is
   a strict one-to-one ratio between RDMA Send and RPC message. There
   are interesting use cases that might be enabled if this relationship
   were more flexible:

   *  RPC-over-RDMA operations that do not carry an RPC message, e.g.,
      control plane operations.

   *  A single RDMA Send that conveys more than one RPC message, e.g.,
      for interrupt mitigation.

   *  An RPC message that requires several sequential RDMA Sends, e.g.,
      to reduce the use of explicit RDMA operations for moderate-sized
      RPC messages.

   *  An RPC transaction that requires multiple exchanges or an odd
      number of RPC-over-RDMA operations to complete.

   RPC-over-RDMA version 2 provides a more sophisticated credit
   accounting mechanism to address these shortcomings. Section 4.2.1
   explains the new mechanism in detail.

C.4. Inline Threshold Changes

   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed on an RDMA connection using only RDMA Send and
   Receive. Each connection has two inline threshold values: one for
   messages flowing from client-to-server (referred to as the
   "client-to-server inline threshold") and one for messages flowing
   from server-to-client (referred to as the "server-to-client inline
   threshold").

   A connection's inline thresholds determine, among other things, when
   RDMA Read or Write operations are required because an RPC message
   cannot be conveyed via a single RDMA Send and Receive pair. When an
   RPC message does not contain DDP-eligible data items, a Requester can
   prepare a Special Format Call or Reply to convey the whole RPC
   message using RDMA Read or Write operations.

   RDMA Read and Write operations require that data payloads reside in
   memory registered with the local RNIC. When an RPC completes, that
   memory is invalidated to fence it from the Responder. Memory
   registration and invalidation typically have a latency cost that is
   insignificant compared to data handling costs.

   When a data payload is small, however, the cost of registering and
   invalidating memory where the payload resides becomes a significant
   part of total RPC latency.
   Therefore the most efficient operation of
   an RPC-over-RDMA transport occurs when the peers use explicit RDMA
   Read and Write operations for large payloads but avoid those
   operations for small payloads.

   When the authors of [RFC5666] first conceived RPC-over-RDMA version
   1, the average size of RPC messages that did not involve a
   significant data payload was under 500 bytes. A 1024-byte inline
   threshold adequately minimized the frequency of inefficient Long
   messages.

   With NFS version 4 [RFC7530], the increased size of NFS COMPOUND
   operations resulted in RPC messages that are, on average, larger than
   those of previous versions of NFS. With a 1024-byte inline threshold,
   frequent operations such as GETATTR and LOOKUP require RDMA Read or
   Write operations, reducing the efficiency of data transport.

   To reduce the frequency of Special Format messages, RPC-over-RDMA
   version 2 increases the default size of inline thresholds. This
   change also increases the maximum size of reverse-direction RPC
   messages.

C.5. Message Continuation Changes

   In addition to a larger default inline threshold, RPC-over-RDMA
   version 2 introduces Message Continuation. Message Continuation is a
   mechanism that enables the transmission of a data payload using more
   than one RDMA Send. The purpose of Message Continuation is to
   provide relief in several essential cases:

   *  If a Requester finds that it is inefficient to convey a
      moderately-sized data payload using Read chunks, the Requester can
      use Message Continuation to send the RPC Call.

   *  If a Requester has provided insufficient Reply chunk space for a
      Responder to send an RPC Reply, the Responder can use Message
      Continuation to send the RPC Reply.

   *  If a sender has to convey a sizeable non-RPC data payload (e.g., a
      large transport property), the sender can use Message Continuation
      to avoid having to register memory.

C.6. Host Authentication Changes

   For the general operation of NFS on open networks, we eventually
   intend to rely on RPC-on-TLS [I-D.ietf-nfsv4-rpc-tls] to provide
   cryptographic authentication of the two ends of each connection. In
   turn, this can improve the trustworthiness of AUTH_SYS-style user
   identities that flow on TCP, which are not cryptographically
   protected. We do not have a similar solution for RPC-over-RDMA,
   however.

   Here, the RDMA transport layer already provides a strong guarantee of
   message integrity. On some network fabrics, IPsec or TLS can protect
   the privacy of in-transit data. However, this is not the case for
   all fabrics (e.g., InfiniBand [IBA]).

   Thus, RPC-over-RDMA version 2 introduces a mechanism for
   authenticating connection peers (see Section 5.2.6). And like GSS
   channel binding, there is also a way to determine when the use of
   host authentication is unnecessary.

C.7. Support for Remote Invalidation

   When an RDMA consumer uses FRWR or Memory Windows to register memory,
   that memory may be invalidated remotely [RFC5040]. These mechanisms
   are available when a Requester's RNIC supports MEM_MGT_EXTENSIONS.

   For this discussion, there are two classes of STags.
   Dynamically-registered STags appear in a single RPC, then are
   invalidated. Persistently-registered STags survive longer than one
   RPC. They may persist for the life of an RPC-over-RDMA connection or
   even longer.

   An RPC-over-RDMA Requester can provide more than one STag in a
   transport header. It may provide a combination of dynamically- and
   persistently-registered STags in one RPC message, or any combination
   of these in a series of RPCs on the same connection.
   Only
   dynamically-registered STags using Memory Windows or FRWR may be
   invalidated remotely.

   There is no transport-level mechanism by which a Responder can
   determine how a Requester-provided STag was registered, nor whether
   it is eligible to be invalidated remotely. A Requester that mixes
   persistently- and dynamically-registered STags in one RPC, or mixes
   them across RPCs on the same connection, must, therefore, indicate
   which STag the Responder may invalidate remotely via a mechanism
   provided in the Upper-Layer Protocol. RPC-over-RDMA version 2
   provides such a mechanism.

   A sender uses the RDMA Send With Invalidate operation to invalidate
   an STag on the remote peer. It is available only when both peers
   support MEM_MGT_EXTENSIONS (can send and process an IETH).

   Existing RPC-over-RDMA transport protocol specifications [RFC8166]
   [RFC8167] do not forbid direct data placement in the reverse
   direction. However, there is currently no Upper-Layer Protocol that
   makes data items in reverse-direction operations eligible for direct
   data placement.

   When chunks are present in a reverse-direction RPC request, Remote
   Invalidation enables the Responder to trigger invalidation of a
   Requester's STags as part of sending an RPC Reply, the same way as is
   done in the forward direction.

   However, in the reverse direction, the server acts as the Requester,
   and the client is the Responder. The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered its
   STags with an appropriate registration mechanism.

C.8. Integration of Reverse-Direction Operation

   Because [RFC5666] did not include specification of reverse-direction
   operation, [RFC8166] does not include it either. Reverse-direction
   operation in RPC-over-RDMA version 1 is specified by a separate
   standards track document [RFC8167].
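
   A version 1 receiver that supports reverse-direction operation
   [RFC8167] cannot learn the direction of a message from the transport
   header alone; it has to inspect the RPC message payload. As an
   illustration only, the following Python sketch shows that kind of
   inspection. It assumes the payload begins with an RPC message header
   as defined in [RFC5531], whose first two XDR words are the XID and
   the msg_type discriminator (CALL is 0, REPLY is 1). The function
   name peek_rpc_direction is hypothetical and is not part of any
   specification.

```python
import struct
from typing import Optional

# Values of the msg_type enum in the RPC message header (RFC 5531).
RPC_CALL = 0
RPC_REPLY = 1


def peek_rpc_direction(payload: bytes) -> Optional[str]:
    """Return "call" or "reply" for an XDR-encoded RPC message, else None.

    Hypothetical helper: an RPC message starts with two 32-bit
    big-endian words, the XID followed by the msg_type discriminator.
    """
    if len(payload) < 8:
        return None  # too short to hold the XID and msg_type words
    _xid, msg_type = struct.unpack_from(">II", payload, 0)
    if msg_type == RPC_CALL:
        return "call"
    if msg_type == RPC_REPLY:
        return "reply"
    return None  # second word is not a valid msg_type value


# Example: a fabricated 8-byte header with XID 0x1234 and msg_type CALL.
example_header = struct.pack(">II", 0x1234, RPC_CALL)
example_direction = peek_rpc_direction(example_header)
```

   A receiver written this way has to reach past the transport header
   into Upper-Layer data before it can interpret fields of the
   transport header itself, which is the layering problem that
   version 2 is intended to remove.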

   Reverse-direction operation in RPC-over-RDMA version 1 was
   constrained by the limited ability to extend that version of the
   protocol. The most awkward issue is that a receiver needs to peek at
   each ingress RPC message payload to determine whether it is a Call or
   Reply message. This is necessary because the meaning of several
   fields in the RPC-over-RDMA transport header is determined by the
   direction of the RPC message payload:

   *  The meaning of the value in the rdma_xid field is determined by
      the direction of the message because the XID spaces in the forward
      and reverse directions are distinct.

   *  The meaning of the value in the rdma_credit field is determined by
      the direction of the message because credits are granted
      separately for forward and reverse direction operation.

   *  The purpose of Write chunks and the meaning of their length fields
      is determined by the direction of the message because in Call
      messages, they are provisional, but in Reply messages, they
      represent returned results.

   The current document remedies this awkwardness by integrating
   reverse-direction operation into RPC-over-RDMA version 2 so that it
   can make use of all facilities that are available in the forward
   direction, including body chunks, remote invalidation, and message
   continuation. To enable this integration, the direction of the RPC
   message payload is encoded in each RPC-over-RDMA version 2 transport
   header.

C.9. Error Reporting Changes

   RPC-over-RDMA version 2 expands the repertoire of errors that
   connection peers may report to each other. The goals of this
   expansion are:

   *  To fill in details of peer recovery actions.

   *  To enable retrying certain conditions caused by mis-estimation of
      the maximum reply size.

   *  To minimize the likelihood of a Requester waiting forever for a
      Reply when there are communications problems that prevent the
      Responder from sending it.

C.10. Changes in Terminology

   The RPC-over-RDMA version 2 specification makes the following changes
   in terminology. These changes do not result in changes in the
   behavior or operation of the protocol.

   *  The current document explicitly acknowledges the different
      semantics and purpose of Write chunks appearing in Call messages
      and those appearing in Reply messages.

   *  The current document introduces the term "payload format" to
      describe the selection of a mechanism for reducing and conveying
      an RPC message payload. It replaces the terms "short message" and
      "long message" with the terms "simple format" and "special format"
      because this selection is not based only on the size of the
      payload.

   *  The current document introduces the terms "data item chunk" and
      "body chunk" in order to distinguish the purpose and operation of
      these two categories of chunk.

   *  For improved readability, the current document replaces the terms
      "RDMA segment" and "plain segment" with the term "segment", and
      the term "RDMA read segment" with the term "Read segment".

   *  The current document refers specifically to the RDMAP, DDP, and
      MPA standards track protocols rather than using the nebulous term
      "iWARP".

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA version 1 specification
   [RFC5666].

   We are deeply indebted to Jana Iyengar for contributing the
   RPC-over-RDMA version 2 flow control mechanism described in
   Section 4.2.1.

   The authors also wish to thank Bill Baker, Greg Marsden, and Matt
   Benjamin for their support of this work.

   The XDR extraction conventions were first described by the authors of
   the NFS version 4.1 XDR specification [RFC5662]. Herbert van den
   Bergh suggested the replacement sed script used in this document.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com

   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA 02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com