Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: 4 January 2021                                           NetApp
                                                             3 July 2020


                    RPC-over-RDMA Version 2 Protocol
                draft-ietf-nfsv4-rpcrdma-version-two-02

Abstract

   This document specifies the second version of a transport protocol
   that conveys Remote Procedure Call (RPC) messages using Remote Direct
   Memory Access (RDMA). This version of the protocol is extensible.

Note

   Discussion of this draft takes place on the NFSv4 working group
   mailing list (nfsv4@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group
   information can be found at
   https://datatracker.ietf.org/wg/nfsv4/about/.

   This note is to be removed before publishing as an RFC.

   The source for this draft is maintained in GitHub. Suggested changes
   can be submitted as pull requests at
   https://github.com/chucklever/i-d-rpcrdma-version-two. Instructions
   are on that page as well.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute working documents as
   Internet-Drafts. The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 January 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
     1.1.  Design Goals
     1.2.  Motivation for a New Version
   2.  Requirements Language
   3.  Terminology
     3.1.  Remote Procedure Calls
       3.1.1.  Upper-Layer Protocols
       3.1.2.  Requesters and Responders
       3.1.3.  RPC Transports
       3.1.4.  External Data Representation
     3.2.  Remote Direct Memory Access
       3.2.1.  Direct Data Placement
       3.2.2.  RDMA Transport Requirements
   4.  RPC-over-RDMA Framework
     4.1.  Message Framing
     4.2.  Managing Receiver Resources
       4.2.1.  Flow Control
       4.2.2.  Inline Threshold
       4.2.3.  Initial Connection State
     4.3.  XDR Encoding with Chunks
       4.3.1.  Reducing an XDR Stream
       4.3.2.  DDP-Eligibility
       4.3.3.  RDMA Segments
       4.3.4.  Chunks
       4.3.5.  Read Chunks
       4.3.6.  Write Chunks
     4.4.  Payload Formats
       4.4.1.  Simple Format
       4.4.2.  Continued Format
       4.4.3.  Special Format
     4.5.  Reverse-Direction Operation
       4.5.1.  Sending a Reverse-Direction RPC Call
       4.5.2.  Sending a Reverse-Direction RPC Reply
       4.5.3.  In the Absence of Support For Reverse-Direction Operation
       4.5.4.  Using Chunks During Reverse-Direction Operation
       4.5.5.  Reverse-Direction Retransmission
   5.  Transport Properties
     5.1.  Transport Properties Model
     5.2.  Current Transport Properties
       5.2.1.  Maximum Send Size
       5.2.2.  Receive Buffer Size
       5.2.3.  Maximum RDMA Segment Size
       5.2.4.  Maximum RDMA Segment Count
       5.2.5.  Reverse-Direction Support
       5.2.6.  Host Authentication Message
   6.  Transport Messages
     6.1.  Transport Header Types
     6.2.  Headers and Chunks
       6.2.1.  Common Transport Header Prefix
       6.2.2.  Transport Header Prefix
       6.2.3.  External Data Payloads
       6.2.4.  Remote Invalidation
     6.3.  Header Types
       6.3.1.  RDMA2_MSG: Convey RPC Message Inline
       6.3.2.  RDMA2_NOMSG: Convey External RPC Message
       6.3.3.  RDMA2_ERROR: Report Transport Error
       6.3.4.  RDMA2_CONNPROP: Exchange Transport Properties
       6.3.5.  RDMA2_GRANT: Grant Credits
     6.4.  Choosing a Reply Mechanism
   7.  Error Handling
     7.1.  Basic Transport Stream Parsing Errors
       7.1.1.  RDMA2_ERR_VERS
       7.1.2.  RDMA2_ERR_INVAL_HTYPE
       7.1.3.  RDMA2_ERR_INVAL_CONT
     7.2.  XDR Errors
       7.2.1.  RDMA2_ERR_BAD_XDR
       7.2.2.  RDMA2_ERR_BAD_PROPVAL
     7.3.  Responder RDMA Operational Errors
       7.3.1.  RDMA2_ERR_READ_CHUNKS
       7.3.2.  RDMA2_ERR_WRITE_CHUNKS
       7.3.3.  RDMA2_ERR_SEGMENTS
       7.3.4.  RDMA2_ERR_WRITE_RESOURCE
       7.3.5.  RDMA2_ERR_REPLY_RESOURCE
     7.4.  Other Operational Errors
       7.4.1.  RDMA2_ERR_SYSTEM
     7.5.  RDMA Transport Errors
   8.  XDR Protocol Definition
     8.1.  Code Component License
     8.2.  Extraction of the XDR Definition
     8.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures
     8.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types
     8.5.  Use of the XDR Description
   9.  RPC Bind Parameters
   10. Implementation Status
   11. Security Considerations
     11.1.  Memory Protection
       11.1.1.  Protection Domains
       11.1.2.  Handle (STag) Predictability
       11.1.3.  Memory Protection
       11.1.4.  Denial of Service
     11.2.  RPC Message Security
       11.2.1.  RPC-over-RDMA Protection at Other Layers
       11.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports
     11.3.  Transport Properties
     11.4.  Host Authentication
   12. IANA Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  ULB Specifications
     A.1.  DDP-Eligibility
     A.2.  Maximum Reply Size
     A.3.  Reverse-Direction Operation
     A.4.  Additional Considerations
     A.5.  ULP Extensions
   Appendix B.  Extending RPC-over-RDMA Version 2
     B.1.  Documentation Requirements
     B.2.  Adding New Header Types to RPC-over-RDMA Version 2
     B.3.  Adding New Header Flags to the Protocol
     B.4.  Adding New Transport properties to the Protocol
     B.5.  Adding New Error Codes to the Protocol
   Appendix C.  Differences from RPC-over-RDMA Version 1
     C.1.  Changes to the XDR Definition
     C.2.  Transport Properties
     C.3.  Credit Management Changes
     C.4.  Inline Threshold Changes
     C.5.  Message Continuation Changes
     C.6.  Host Authentication Changes
     C.7.  Support for Remote Invalidation
     C.8.  Error Reporting Changes
   Acknowledgments
   Authors' Addresses

1.  Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a
   technique for moving data efficiently between network nodes. By
   placing transferred data directly into destination buffers using
   Direct Memory Access, RDMA delivers the reciprocal benefits of faster
   data transfer and reduced host CPU overhead.

   Open Network Computing Remote Procedure Call (ONC RPC, often
   shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure
   Call protocol that runs over a variety of transports. Most RPC
   implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, a
   datagram encapsulates each RPC message. Within a TCP byte stream, a
   record marking protocol delineates RPC messages.

   An RDMA transport, too, conveys RPC messages in a fashion that must
   be fully defined if RPC implementations are to interoperate when
   using RDMA to transport RPC transactions. Although RDMA transports
   encapsulate messages like UDP, they deliver them reliably and in
   order, like TCP. Further, they implement a bulk data transfer
   service not provided by traditional network transports. Therefore,
   we treat RDMA as a novel transport type for RPC.

1.1.  Design Goals

   The general mission of RPC-over-RDMA transports is to leverage
   network hardware capabilities to reduce host CPU needs related to the
   transport of RPC messages.
   In particular, this includes mitigating host interrupt rates and
   limiting the necessity to copy RPC payload bytes on receivers.

   These hardware capabilities benefit both RPC clients and servers. On
   balance, however, the RPC-over-RDMA protocol design approach has been
   to bolster clients more than servers, as the client is typically
   where applications are most hungry for CPU resources.

   Additionally, RPC-over-RDMA transports are designed to support RPC
   applications transparently. However, such transports can also
   provide mechanisms that enable further optimization of data transfer
   when RPC applications are structured to exploit direct data
   placement. In this context, the Network File System (NFS) family of
   protocols (as described in [RFC1094], [RFC1813], [RFC7530],
   [RFC5661], [RFC7862], and subsequent NFSv4 minor versions) are all
   potential beneficiaries of RPC-over-RDMA.

   A complete problem statement appears in [RFC5532].

1.2.  Motivation for a New Version

   Storage administrators have broadly deployed the RPC-over-RDMA
   version 1 protocol specified in [RFC8166]. However, there are known
   shortcomings to this protocol:

   *  The protocol's default size of Receive buffers forces the use of
      RDMA Read and Write transfers for small payloads, and limits the
      size of reverse direction messages.

   *  It is difficult to make optimizations or protocol fixes that
      require changes to on-the-wire behavior.

   *  For some RPC procedures, the maximum reply size is difficult or
      impossible for an RPC client to estimate in advance.

   To address these issues in a way that preserves interoperation with
   existing RPC-over-RDMA version 1 deployments, we present a second
   version of the RPC-over-RDMA transport protocol in the current
   document.

   The version of RPC-over-RDMA presented here is extensible, enabling
   the introduction of OPTIONAL extensions without impacting existing
   implementations. See Appendix C.1 for further discussion. It
   introduces a mechanism to exchange implementation properties to
   automatically provide further optimization of data transfer.

   This version also contains incremental changes that relieve
   performance constraints and enable recovery from unusual corner
   cases. These changes are outlined in Appendix C and include a larger
   default inline threshold, the ability to convey a single RPC message
   using multiple RDMA Send operations, support for authentication of
   connection peers, richer error reporting, improved credit-based flow
   control, and support for Remote Invalidation.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Terminology

3.1.  Remote Procedure Calls

   This section highlights critical elements of the RPC protocol
   [RFC5531] and the External Data Representation (XDR) [RFC4506] it
   uses. RPC-over-RDMA version 2 enables the transmission of RPC
   messages built using XDR and also uses XDR internally to describe its
   header formats. The remainder of this document requires an
   understanding of RPC and its use of XDR.

3.1.1.  Upper-Layer Protocols

   RPCs are an abstraction used to implement the operations of an
   Upper-Layer Protocol (ULP). For RPC-over-RDMA, "ULP" refers to an
   RPC Program and Version tuple, which is a versioned set of procedure
   calls that comprise a single well-defined API. One example of a ULP
   is the Network File System Version 4.0 [RFC7530].
   In the current document, the term "RPC consumer" refers to an
   implementation of a ULP running on an RPC client.

3.1.2.  Requesters and Responders

   Like a local procedure call, every RPC procedure has a set of
   "arguments" and a set of "results". A calling context invokes a
   procedure, passing arguments to it, and the procedure subsequently
   returns a set of results. Unlike a local procedure call, the called
   procedure is executed remotely rather than in the local application's
   execution context.

   The RPC protocol as described in [RFC5531] is fundamentally a
   message-passing protocol between one or more clients, where RPC
   consumers are running, and a server, where a remote execution context
   is available to process RPC transactions on behalf of these
   consumers.

   ONC RPC transactions consist of two types of messages:

   *  A CALL message, or "Call", requests work. An RPC Call message is
      designated by the value zero (0) in the message's msg_type field.
      The sender places a unique 32-bit value in the message's XID field
      to match this RPC Call message to a corresponding RPC Reply
      message.

   *  A REPLY message, or "Reply", reports the results of work requested
      by an RPC Call message. An RPC Reply message is designated by the
      value one (1) in the message's msg_type field. The sender copies
      the value contained in an RPC Reply message's XID field from the
      RPC Call message whose results the sender is reporting.

   Each RPC client endpoint acts as a "Requester", which serializes the
   procedure's arguments and conveys them to a server endpoint via an
   RPC Call message. A Call message contains an RPC protocol header, a
   header describing the requested upper-layer operation, and all
   arguments.

   An RPC server endpoint acts as a "Responder", which deserializes the
   arguments and processes the requested operation.
   It then serializes the operation's results into an RPC Reply message.
   An RPC Reply message contains an RPC protocol header, a header
   describing the upper-layer reply, and all results.

   The Requester deserializes the results and allows the RPC consumer to
   proceed. At this point, the RPC transaction designated by the XID in
   the RPC Call message is complete, and the XID is retired.

   In summary, Requesters send RPC Call messages to Responders to
   initiate RPC transactions. Responders send RPC Reply messages to
   Requesters to complete the processing of an RPC transaction.

3.1.3.  RPC Transports

   The role of an "RPC transport" is to mediate the exchange of RPC
   messages between Requesters and Responders. An RPC transport bridges
   the gap between the RPC message abstraction and the native operations
   of a network transport (e.g., a socket).

   RPC-over-RDMA is a connection-oriented RPC transport. When a
   transport type is connection-oriented, clients initiate transport
   connections, while servers wait passively to accept incoming
   connection requests.

3.1.3.1.  Transport Failure Recovery

   So that appropriate and timely recovery action can be taken, the
   transport implementation is responsible for notifying a Requester
   when an RPC Call or Reply could not be conveyed. Recovery can take
   the form of establishing a new connection, re-sending RPC Calls, or
   terminating RPC transactions pending on the Requester.

   For instance, a connection loss may occur after a Responder has
   received an RPC Call but before it can send the matching RPC Reply.
   Once the transport notifies the Requester of the connection loss, the
   Requester can re-send all pending RPC Calls on a fresh connection.

3.1.3.2.  Forward Direction

   Traditionally, an RPC client acts as a Requester, while an RPC
   service acts as a Responder.
   The current document refers to this direction of RPC message passing
   as "forward-direction" operation.

3.1.3.3.  Reverse-Direction

   The RPC specification [RFC5531] does not forbid performing RPC
   transactions in the other direction. An RPC service endpoint can act
   as a Requester, in which case an RPC client endpoint acts as a
   Responder. This direction of RPC message passing is known as
   "reverse-direction" operation.

   During reverse-direction operation, an RPC client is responsible for
   establishing transport connections, even though the RPC server
   originates RPC Calls.

   RPC clients and servers are usually optimized to perform and scale
   well when handling traffic in the forward direction. They might not
   be prepared to handle operation in the reverse direction. Not until
   NFS version 4.1 [RFC5661] has there been a strong need to handle
   reverse-direction operation.

3.1.3.4.  Bi-directional Operation

   A pair of connected RPC endpoints may choose to use only
   forward-direction or only reverse-direction operation on a particular
   transport connection. Or, these endpoints may send Calls in both
   directions concurrently on the same transport connection.

   "Bi-directional operation" occurs when both transport endpoints act
   as a Requester and a Responder at the same time on a single
   connection.

   Bi-directionality is an extension of RPC transport connection
   sharing. Two RPC endpoints wish to exchange independent RPC messages
   over a shared connection but in opposite directions. These messages
   may or may not be related to the same workloads or RPC Programs.

3.1.3.5.  XID Values

   Section 9 of [RFC5531] introduces the RPC transaction identifier, or
   "XID" for short. A connection peer interprets the value of an XID in
   the context of the message's msg_type field.

   *  The XID of a Call is arbitrary but is unique among outstanding
      Calls from that Requester on that connection.

   *  The XID of a Reply always matches that of the initiating Call.

   After receiving a Reply, a Requester matches the XID value in that
   Reply with a Call it previously sent.

   During bi-directional operation, forward- and reverse-direction XIDs
   are typically generated on distinct hosts by possibly different
   algorithms. There is no coordination between the generation of XIDs
   used in forward-direction and reverse-direction operation.

   Therefore, a forward-direction Requester MAY use the same XID value
   at the same time as a reverse-direction Requester on the same
   transport connection. Although such concurrent requests use the same
   XID value, they represent distinct RPC transactions.

3.1.4.  External Data Representation

   One cannot assume that all Requesters and Responders represent data
   objects in the same way internally. RPC uses External Data
   Representation (XDR) to translate native data types and serialize
   arguments and results [RFC4506].

   XDR encodes data independently of the endianness or size of
   host-native data types, enabling unambiguous decoding of data by a
   receiver.

   XDR assumes only that the number of bits in a byte (octet) and their
   order are the same on both endpoints and the physical network. The
   smallest indivisible unit of XDR encoding is a group of four octets.
   XDR can also flatten lists, arrays, and other complex data types into
   a stream of bytes.

   We refer to a serialized stream of bytes that is the result of XDR
   encoding as an "XDR stream". A sender encodes native data into an
   XDR stream and then transmits that stream to a receiver. The
   receiver decodes incoming XDR byte streams into its native data
   representation format.

3.1.4.1.  XDR Opaque Data

   Sometimes, a data item is to be transferred as-is, without encoding
   or decoding. We refer to the contents of such a data item as "opaque
   data". XDR encoding places the content of opaque data items directly
   into an XDR stream without altering it in any way. ULPs or
   applications perform any needed data translation in this case.
   Examples of opaque data items include the content of files or generic
   byte strings.

3.1.4.2.  XDR Roundup

   The number of octets in a variable-length data item precedes that
   item in an XDR stream. If the size of an encoded data item is not a
   multiple of four octets, the sender appends octets containing zero
   after the end of the data item. These zero octets shift the next
   encoded data item in the XDR stream so that it always starts on a
   four-octet boundary. The addition of extra octets does not change
   the encoded size of the data item. Receivers do not expose the extra
   octets to ULPs.

   We refer to this technique as "XDR roundup", and the extra octets as
   "XDR roundup padding".

3.2.  Remote Direct Memory Access

   When a third party transfers large RPC payloads, RPC Requesters and
   Responders can become more efficient. An example of such a third
   party might be an intelligent network interface (data movement
   offload), which places data in the receiver's memory so that no
   additional adjustment of data alignment is necessary (direct data
   placement or "DDP"). RDMA transports enable both of these
   optimizations.

   In the current document, "RDMA" refers to the physical mechanism an
   RDMA transport utilizes when moving data.

3.2.1.  Direct Data Placement

   Typically, RPC implementations copy the contents of RPC messages into
   a buffer before sending them. An efficient RPC implementation sends
   bulk data without copying it into a separate send buffer first.
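
   As an aside on the XDR roundup rule described in Section 3.1.4.2,
   the following minimal Python sketch encodes and decodes one
   variable-length opaque item. The function names xdr_encode_opaque
   and xdr_decode_opaque are hypothetical and are not defined by this
   document. The sketch assumes the four-octet big-endian length prefix
   that [RFC4506] uses for variable-length opaque data.

```python
import struct


def xdr_roundup_padding(length: int) -> int:
    """Number of zero octets needed to reach the next four-octet boundary."""
    return (4 - length % 4) % 4


def xdr_encode_opaque(data: bytes) -> bytes:
    """Encode a variable-length opaque item (hypothetical helper).

    The length prefix counts only the data octets; the roundup
    padding that follows is not included in that count.
    """
    padding = b"\x00" * xdr_roundup_padding(len(data))
    return struct.pack(">I", len(data)) + data + padding


def xdr_decode_opaque(stream: bytes, offset: int = 0):
    """Decode one opaque item; return (data, offset of the next item)."""
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    data = stream[start:start + length]
    if len(data) != length:
        raise ValueError("XDR stream is shorter than the encoded length")
    # Skip the roundup padding so it is never exposed to the caller.
    next_offset = start + length + xdr_roundup_padding(length)
    return data, next_offset
```

   For a five-octet item, the encoder emits the length, the five data
   octets, and three zero octets, so the next item starts at offset 12.
   The decoder returns only the five data octets.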
518 However, socket-based RPC implementations are often unable to receive 519 data directly into its final place in memory. Receivers often need 520 to copy incoming data to finish an RPC operation sometimes, if only 521 to adjust data alignment. 523 Although it may not be efficient, before an RDMA transfer, a sender 524 may copy data into an intermediate buffer. After an RDMA transfer, a 525 receiver may copy that data again to its final destination. In this 526 document, the term "DDP" refers to any optimized data transfer where 527 a receiving host's CPU does not move transferred data to another 528 location after arrival. 530 RPC-over-RDMA version 2 enables the use of RDMA Read and Write 531 operations to achieve both data movement offload and DDP. However, 532 note that not all RDMA-based data transfer qualifies as DDP, and some 533 mechanisms that do not employ explicit RDMA can place data directly. 535 3.2.2. RDMA Transport Requirements 537 RDMA transports require that RDMA consumers provision resources in 538 advance to achieve good performance during receive operations. An 539 RDMA consumer might provide Receive buffers in advance by posting an 540 RDMA Receive Work Request for every expected RDMA Send from a remote 541 peer. These buffers are provided before the remote peer posts RDMA 542 Send Work Requests. Thus this is often referred to as "pre-posting" 543 buffers. 545 An RDMA Receive Work Request remains outstanding until the RDMA 546 provider matches it to an inbound Send operation. The resources 547 associated with that Receive must be retained in host memory, or 548 "pinned", until the Receive completes. 550 Given these tenets of operation, the RPC-over-RDMA version 2 protocol 551 assumes each transport provides the following abstract operations. A 552 more complete discussion of these operations appears in [RFC5040]. 554 3.2.2.1. 
Memory Registration 556 Memory registration assigns a steering tag to a region of memory, 557 permitting the RDMA provider to perform data-transfer operations. 558 The RPC-over-RDMA version 2 protocol assumes that a steering tag of 559 no more than 32 bits and memory addresses of up to 64 bits in length 560 identifies each registered memory region. 562 3.2.2.2. RDMA Send 564 The RDMA provider supports an RDMA Send operation, with completion 565 signaled on the receiving peer after RDMA provider has placed data in 566 a pre-posted buffer. Sends complete at the receiver in the order 567 they were posted at the sender. The size of the remote peer's pre- 568 posted buffers limits the amount of data that can be transferred by a 569 single RDMA Send operation. 571 3.2.2.3. RDMA Receive 573 The RDMA provider supports an RDMA Receive operation to receive data 574 conveyed by incoming RDMA Send operations. To reduce the amount of 575 memory that must remain pinned awaiting incoming Sends, the amount of 576 memory posted per Receive is limited. The RDMA consumer (in this 577 case, the RPC-over-RDMA version 2 protocol) provides flow control to 578 prevent overrunning receiver resources. 580 3.2.2.4. RDMA Write 582 The RDMA provider supports an RDMA Write operation to place data 583 directly into a remote memory region. The local host initiates an 584 RDMA Write and the RDMA provider signals completion there. The 585 remote RDMA provider does not signal completion on the remote peer. 586 The local host provides the steering tag, the memory address, and the 587 length of the remote peer's memory region. 589 RDMA Writes are not ordered relative to one another, but are ordered 590 relative to RDMA Sends. Thus, a subsequent RDMA Send completion 591 signaled on the local peer guarantees that prior RDMA Write data has 592 been successfully placed in the remote peer's memory. 594 3.2.2.5. 
RDMA Read 596 The RDMA provider supports an RDMA Read operation to place remote 597 source data directly into local memory. The local host initiates an 598 RDMA Read and the RDMA provider signals completion there. The 599 remote RDMA provider does not signal completion on the remote peer. 600 The local host provides the steering tags, the memory addresses, and 601 the lengths for the remote source and local destination memory 602 regions. 604 The RDMA consumer (in this case, the RPC-over-RDMA version 2 605 protocol) signals Read completion to the remote peer as part of a 606 subsequent RDMA Send message. The remote peer can then invalidate 607 steering tags and subsequently free associated source memory regions. 609 4. RPC-over-RDMA Framework 611 Before an RDMA data transfer can occur, an endpoint first exposes 612 regions of its memory to a remote endpoint. The remote endpoint then 613 initiates RDMA Read and Write operations against the exposed memory. 614 A "transfer model" designates which endpoint exposes its memory and 615 which is responsible for initiating the transfer of data. 617 In RPC-over-RDMA version 2, only Requesters expose their memory to 618 the Responder, and only Responders initiate RDMA Read and Write 619 operations. Read access to memory regions enables the Responder to 620 pull RPC arguments or whole RPC Calls from each Requester. The 621 Responder pushes RPC results or whole RPC Replies to a Requester's 622 memory regions to which it has write access. 624 4.1. Message Framing 626 Each RPC-over-RDMA version 2 message consists of at most two XDR 627 streams: 629 * The "Transport stream" contains a header that describes and 630 controls the transfer of the Payload stream in this RPC-over-RDMA 631 message. Every RDMA Send on an RPC-over-RDMA version 2 connection 632 MUST begin with a Transport stream. 634 * The "Payload stream" contains part or all of a single RPC message.
635 The sender MAY divide an RPC message at any convenient boundary 636 but MUST send RPC message fragments in XDR stream order and MUST 637 NOT interleave Payload streams from multiple RPC messages. The 638 RPC-over-RDMA version 2 message carrying the final part of an RPC 639 message is marked (see Section 6.2.2.2). 641 The RPC-over-RDMA framing mechanism described in this section 642 replaces all other RPC framing mechanisms. Connection peers use RPC- 643 over-RDMA framing even when the underlying RDMA protocol runs on a 644 transport type with well-defined RPC framing, such as TCP. However, 645 a ULP can negotiate the use of RDMA, dynamically enabling the use of 646 RPC-over-RDMA on a connection established on some other transport 647 type. Because RPC framing delimits an entire RPC request or reply, 648 the resulting shift in framing must occur between distinct RPC 649 messages, and in concert with the underlying transport. 651 4.2. Managing Receiver Resources 653 If any pre-posted Receive buffer on the connection is not large 654 enough to accept an incoming RDMA Send, the RDMA provider can 655 terminate the connection. Likewise, if a pre-posted Receive buffer 656 is not available to accept an incoming RDMA Send, the RDMA provider 657 can terminate the connection. Therefore, a sender needs to respect 658 the resource limits of its peer receiver to ensure the longevity of 659 each connection. Two operational parameters communicate these limits 660 between connection peers: flow control, and inline threshold. 662 4.2.1. Flow Control 664 RPC-over-RDMA requires reliable and in-order delivery of data 665 payloads. Therefore, RPC-over-RDMA transports MUST use the RDMA RC 666 (Reliable Connected) Queue Pair (QP) type. The use of an RC QP 667 ensures in-transit data integrity and proper recovery from packet 668 loss or misordering. 670 However, RPC-over-RDMA itself provides a flow control mechanism to 671 prevent a sender from overwhelming receiver resources. 
RPC-over-RDMA 672 transports employ end-to-end credit-based flow control for this 673 purpose [CBFC]. Credit-based flow control is relatively simple, 674 providing robust operation in the face of bursty traffic and 675 automated management of receive buffer allocation. 677 4.2.1.1. Granting Credits 679 An RPC-over-RDMA version 2 credit is the capability to receive one 680 RPC-over-RDMA version 2 message. This arrangement enables RPC-over- 681 RDMA version 2 to support asymmetrical operation, where a message in 682 one direction might trigger zero, one, or multiple messages in the 683 other direction in response. 685 To achieve this, each posted Receive buffer on both connection peers 686 receives one credit. Each Requester has a set of Receive credits, 687 and each Responder has a set of Receive credits. These credit values 688 are managed independently of one another. 690 Section 7 of [RFC8166] requires that the 32-bit field containing the 691 credit grant is the third word in the transport header. To conform 692 with that requirement, senders encode the two independent credit 693 values into a single 32-bit field in the fixed portion of the 694 transport header. At the receiver, the low-order two bytes are the 695 number of credits that are newly granted by the sender. The granted 696 credit value MUST NOT be zero since such a value would result in 697 deadlock. The high-order two bytes are the maximum number of credits 698 that can be outstanding at the sender. 700 A sender must avoid posting more RDMA Send messages than the 701 receiver's granted credit limit. If the sender exceeds the granted 702 value, the RDMA provider might signal an error, possibly terminating 703 the connection. 705 The granted credit values MAY be adjusted to match the needs or 706 policies in effect on either peer. For instance, a peer may reduce 707 its granted credit value to accommodate the available resources in a 708 Shared Receive Queue. 
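The credit field layout described earlier in this section can be sketched as follows. This is only an illustration, not part of the protocol definition: the function names pack_credits and unpack_credits are hypothetical, and the sketch assumes nothing beyond the layout stated above (newly granted credits in the low-order two bytes, maximum outstanding credits in the high-order two bytes, and a granted value that is never zero).

```python
from typing import Tuple


def pack_credits(granted: int, max_outstanding: int) -> int:
    """Combine two 16-bit credit values into one 32-bit field (hypothetical helper)."""
    if not 1 <= granted <= 0xFFFF:
        # A zero grant is not allowed because it would deadlock the transport.
        raise ValueError("granted credits must be in the range 1..65535")
    if not 0 <= max_outstanding <= 0xFFFF:
        raise ValueError("max_outstanding must fit in 16 bits")
    # High-order two bytes: maximum outstanding; low-order two bytes: grant.
    return (max_outstanding << 16) | granted


def unpack_credits(field: int) -> Tuple[int, int]:
    """Split a 32-bit credit field into (granted, max_outstanding)."""
    granted = field & 0xFFFF
    max_outstanding = (field >> 16) & 0xFFFF
    if granted == 0:
        raise ValueError("a granted credit value of zero is invalid")
    return granted, max_outstanding
```

For instance, a peer granting 32 credits while allowing at most 128 outstanding would place the value 0x00800020 in the field, and the receiver would recover the pair (32, 128) from it.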
710 Certain RDMA implementations may impose additional flow-control 711 restrictions, such as limits on RDMA Read operations in progress at 712 the Responder. Accommodation of such restrictions is considered the 713 responsibility of each RPC-over-RDMA version 2 implementation. 715 4.2.1.2. Asynchronous Credit Grants 717 A special header type enables one peer to refresh its credit grant to 718 the other peer without sending an RPC payload. A receiving peer can 719 use this when the sender's credit grant is exhausted in the midst of 720 a stream of continued messages. See Section 6.3.5 for information 721 about this header type. 723 Receivers MUST always be in a position to receive one such credit 724 grant update message, in addition to payload-bearing messages, to 725 prevent transport deadlock. One way a receiver can do this is to 726 post one more RDMA Receive than the credit value the receiver 727 granted. 729 4.2.2. Inline Threshold 731 An "inline threshold" value is the largest message size (in octets) 732 that can be conveyed in one direction between peer implementations 733 using RDMA Send and Receive channel operations. An inline threshold 734 value is less than the largest number of octets the sender can post 735 in a single RDMA Send operation. It is also less than the largest 736 number of octets the receiver can reliably accept via a single RDMA 737 Receive operation. 739 Each connection has two inline threshold values. There is one for 740 messages flowing from Requester-to-Responder, referred to as the 741 "call inline threshold", and one for messages flowing from Responder- 742 to-Requester, referred to as the "reply inline threshold." 744 Peers can advertise their inline threshold values via RPC-over-RDMA 745 version 2 Transport Properties (see Section 5). In the absence of an 746 exchange of Transport Properties, connection peers MUST assume both 747 inline thresholds are 4096 octets. 749 4.2.3. 
Initial Connection State 751 When an RPC-over-RDMA version 2 client establishes a connection to a 752 server, its first order of business is to determine the server's 753 highest supported protocol version. 755 Upon connection establishment, a client MUST send only a single RPC- 756 over-RDMA message until it receives a valid RPC-over-RDMA message 757 from the server that grants client credits. 759 The second word of each transport header conveys the transport 760 protocol version. In the interest of clarity, the current document 761 refers to that word as rdma_vers even though in the RPC-over-RDMA 762 version 2 XDR definition, it appears as rdma_start.rdma_vers. 764 Immediately after the client establishes a connection, it sends a 765 single valid RPC-over-RDMA message with the value two (2) in the 766 rdma_vers field. Because the server might support only RPC-over-RDMA 767 version 1, this initial message MUST NOT be larger than the version 1 768 default inline threshold of 1024 octets. 770 4.2.3.1. Server Does Support RPC-over-RDMA Version 2 772 If the server supports RPC-over-RDMA version 2, it sends RPC-over- 773 RDMA messages back to the client with the value two (2) in the 774 rdma_vers field. Both peers may assume the default inline threshold 775 value for RPC-over-RDMA version 2 connections (4096 octets). 777 4.2.3.2. Server Does Not Support RPC-over-RDMA Version 2 779 If the server does not support RPC-over-RDMA version 2, it MUST send 780 an RPC-over-RDMA message to the client with an XID that matches the 781 client's first message, RDMA2_ERROR in the rdma_start.rdma_htype 782 field, and with the error code RDMA2_ERR_VERS. This message also 783 reports the range of RPC-over-RDMA protocol versions that the server 784 supports. To continue operation, the client selects a protocol 785 version in that range for subsequent messages on this connection. 
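As a rough illustration of the fallback step above, the following sketch shows one way a client might pick a version from the range reported in an RDMA2_ERR_VERS response. The function name select_version and its parameters are hypothetical. Preferring the highest mutually supported version is an assumption made for this example; the text above requires only that the client select a version within the reported range.

```python
from typing import Iterable, Optional


def select_version(client_versions: Iterable[int],
                   server_low: int,
                   server_high: int) -> Optional[int]:
    """Pick a protocol version inside the server's reported range.

    Returns None when the client supports no version in that range.
    Choosing the highest candidate is an assumption of this sketch.
    """
    candidates = [v for v in client_versions
                  if server_low <= v <= server_high]
    return max(candidates) if candidates else None
```

A client that supports versions 1 and 2 and receives a reported range of 1 through 1 would continue with version 1. A result of None indicates that no common version exists, so the client cannot continue on that connection.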
787 If the connection is dropped immediately after an RDMA2_ERROR/ 788 RDMA2_ERR_VERS message is received, the client should try to avoid a 789 version negotiation loop when re-establishing another connection. It 790 can assume that the server does not support RPC-over-RDMA version 2. 791 A client can assume the same situation (i.e., no server support for 792 RPC-over-RDMA version 2) if the initial negotiation message is lost 793 or dropped. Once the version negotiation exchange is complete, both 794 peers may use the default inline threshold value for the negotiated 795 transport protocol version. 797 4.2.3.3. Client Does Not Support RPC-over-RDMA Version 2 799 The server examines the RPC-over-RDMA protocol version used in the 800 first RPC-over-RDMA message it receives. If it supports this 801 protocol version, it MUST use it in all subsequent messages it sends 802 on that connection. The client MUST NOT change the protocol version 803 for the duration of the connection. 805 4.3. XDR Encoding with Chunks 807 When a DDP capability is available, an RDMA provider can place the 808 contents of one or more XDR data items directly into a receiver's 809 memory. It can do this separately from the transfer of other parts 810 of the containing XDR stream. 812 4.3.1. Reducing an XDR Stream 814 RPC-over-RDMA version 2 provides a mechanism for moving part of an 815 RPC message via a data transfer distinct from an RDMA Send/Receive 816 pair. The sender removes one or more XDR data items from the Payload 817 stream. These items are conveyed via other mechanisms, such as one 818 or more RDMA Read or Write operations. As the receiver decodes an 819 incoming message, it skips over directly placed data items. 821 We refer to a data item that a sender removes from a Payload stream 822 to transmit separately as a "reduced" data item. After a sender has 823 finished removing XDR data items from a Payload stream, we refer to 824 it as a "reduced" Payload stream. 
The data object in a transport 825 header that describes memory regions containing reduced data items is 826 known as a "chunk." 828 4.3.2. DDP-Eligibility 830 Not all XDR data items benefit from Direct Data Placement. For 831 example, small data items or data items that require XDR unmarshaling 832 by the receiver do not benefit from DDP. Moreover, it is impractical 833 for receivers to prepare for every possible XDR data item in a 834 protocol to appear in a chunk. 836 Determining which data items are DDP-eligible is done in additional 837 specifications that describe how ULPs employ DDP. A "ULB 838 specification" identifies which XDR data items a peer MAY transfer 839 using DDP. Such data items are known as "DDP-eligible." Senders 840 MUST NOT reduce any other XDR data items. Detailed requirements for 841 ULB specifications appear in Appendix A of the current document. 843 4.3.3. RDMA Segments 845 When encoding a Payload stream that contains a DDP-eligible data 846 item, a sender may choose to reduce that data item. When it chooses 847 to do so, the sender does not place the item into the Payload stream. 848 Instead, the sender records in the transport header the location and 849 size of the memory region containing that data item. 851 The Requester provides location information for DDP-eligible data 852 items in both RPC Call and Reply messages. The Responder uses this 853 information to retrieve arguments contained in the specified region 854 of the Requester's memory or place results in that memory region. 856 An "RDMA segment", or "plain segment", is a transport header data 857 object that contains the precise coordinates of a contiguous memory 858 region. This region is conveyed separately from the Payload stream. 859 Each RDMA segment contains the following information: 861 Handle: A steering Tag (STag) or R_key generated by registering this 862 memory with the RDMA provider. 864 Length: The length of the RDMA segment's memory region, in octets. 
865 An "empty segment" is an RDMA segment with the value zero (0) in 866 its length field. 868 Offset: The offset or beginning memory address of the RDMA segment's 869 memory region. 871 See [RFC5040] for further discussion. 873 4.3.4. Chunks 875 In RPC-over-RDMA version 2, a "chunk" refers to a portion of an RPC 876 message that is moved independently of the Payload stream. The 877 sender removes chunk data from the Payload stream, transfers it via 878 separate operations, and then the receiver reinserts it into the 879 received Payload stream to reconstruct the complete RPC message. 881 Each chunk consists of RDMA segments. Each RDMA segment represents a 882 piece of a chunk that is contiguous in memory. A Requester MAY 883 divide a chunk into RDMA segments using any convenient boundaries. 884 The length of a chunk is precisely the sum of the lengths of the RDMA 885 segments that comprise it. 887 The RPC-over-RDMA version 2 transport protocol does not place a limit 888 on chunk size. However, each ULP may cap the amount of data that can 889 be transferred by a single RPC transaction. For example, NFS has 890 "rsize" and "wsize", which restrict the payload size of NFS READ and 891 WRITE operations. The Responder can use such limits to sanity check 892 chunk sizes before using them in RDMA operations. 894 4.3.4.1. Counted Arrays 896 If a chunk is to contain a counted array data type, the count of 897 array elements MUST remain in the Payload stream. The sender MUST 898 move the array elements into the chunk. For example, when encoding 899 an opaque byte array as a chunk, the count of bytes stays in the 900 Payload stream. The sender removes the bytes in the array from the 901 Payload stream and places them in the chunk. 903 Individual array elements appear in a chunk in their entirety. For 904 example, when encoding an array of arrays as a chunk, the count of 905 items in the enclosing array stays in the Payload stream. 
But each 906 enclosed array, including its item count, is transferred as part of 907 the chunk. 909 4.3.4.2. Optional-Data 911 If a chunk contains an optional-data data type, the "is present" 912 field MUST remain in the Payload stream. The sender MUST move the 913 data, if present, to the chunk. 915 4.3.4.3. XDR Unions 917 A union data type MUST NOT be made DDP-eligible. However, one or 918 more of its arms MAY be made DDP-eligible, subject to the other 919 requirements in this section. 921 4.3.4.4. Chunk Roundup 923 Except in special cases (covered in Section 4.4.3), a chunk MUST 924 contain only one XDR data item. This restriction makes it 925 straightforward to reduce variable-length data items without 926 affecting the XDR alignment of other data items in the Payload 927 stream. 929 When a sender reduces a variable-length XDR data item, data items 930 remaining in the Payload stream MUST remain on four-byte alignment. 931 Therefore, the sender always removes XDR roundup padding for that 932 data item from the Payload stream. 934 4.3.5. Read Chunks 936 A "Read chunk" represents an XDR data item that its receiver pulls 937 from the sender. A Read chunk is a list of one or more RDMA read 938 segments. Each RDMA read segment consists of a Position field 939 followed by an RDMA segment, as defined in Section 4.3.3. 941 Position: The byte offset in the unreduced Payload stream where the 942 receiver reinserts the data item conveyed in the chunk. The 943 sender MUST compute the Position value from the beginning of the 944 unreduced Payload stream, which begins at Position zero. All RDMA 945 read segments belonging to the same Read chunk have the same value 946 in their Position field. 948 While constructing an RPC-over-RDMA message, the sender registers 949 memory regions containing data items intended for RDMA Read 950 operations. It advertises the coordinates of these regions in Read 951 chunks added to the transport header of the RPC-over-RDMA message. 
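The relationship between RDMA read segments and Read chunks can be sketched as follows. This is a minimal illustration, not the protocol's XDR definition: the ReadSegment class and the read_chunk_lengths function are hypothetical names. It relies only on two statements from the text above: all read segments of one Read chunk carry the same Position value, and the length of a chunk is the sum of the lengths of its segments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReadSegment:
    """One RDMA read segment: a Position plus a plain RDMA segment."""
    position: int  # byte offset in the unreduced Payload stream
    handle: int    # steering tag (STag or R_key)
    length: int    # length of the memory region, in octets
    offset: int    # starting address of the memory region


def read_chunk_lengths(read_list: List[ReadSegment]) -> Dict[int, int]:
    """Map each Position value to the total length of its Read chunk."""
    totals: Dict[int, int] = {}
    for seg in read_list:
        # Segments sharing a Position belong to the same Read chunk.
        totals[seg.position] = totals.get(seg.position, 0) + seg.length
    return totals
```

With two segments at Position 136 (4096 and 1000 octets) and one segment at Position 5240 (512 octets), the Read list describes two Read chunks of 5096 and 512 octets, respectively.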
953 The receiver of this message then pulls the chunk data from the 954 sender using RDMA Read operations. The receiver inserts the first 955 RDMA segment in a Read chunk into the Payload stream at the byte 956 offset indicated by its Position field. The receiver concatenates 957 RDMA segments whose Position field value matches this offset until 958 there are no more RDMA segments at that Position value. 960 The Position field in an RDMA read segment indicates where the 961 containing Read chunk starts in the Payload stream. The value in 962 this field MUST be a multiple of four. All segments in the same Read 963 chunk share the same Position value, even if one or more of the RDMA 964 segments have a non-four-byte-aligned length. 966 4.3.5.1. The Read List 968 Each RPC-over-RDMA message carries a list of RDMA read segments that 969 make up the set of Read chunks for that message. When no RDMA Read 970 operations are needed to complete the transmission of the message's 971 Payload stream, the message's Read list is empty. This is typically 972 the case for RPC Replies. 974 If a Responder receives a Read list whose RDMA segment position 975 values do not appear in monotonically increasing order, it MUST 976 discard the message without processing it and respond with an 977 RDMA2_ERROR message with the rdma_xid field set to the XID of the 978 malformed message and the rdma_err field set to RDMA2_ERR_BAD_XDR. 979 If a Requester receives a Read list whose RDMA segment position 980 values do not appear in monotonically increasing order, it MUST 981 discard the message without processing it and terminate the RPC 982 transaction corresponding to the XID value in the rdma_xid field of the 983 malformed message. 985 4.3.5.2. Decoding Read Chunks 987 The Responder initiates an RDMA Read to pull a Read chunk's data 988 content into registered local memory whenever the XDR offset in the 989 Payload stream matches that of a Read chunk.
The Responder 990 acknowledges that it is finished with Read chunk source buffers when 991 it sends the corresponding RPC Reply message to the Requester. The 992 Requester may then release Read chunks advertised in the RPC-over- 993 RDMA Call. 995 4.3.5.3. Read Chunk Roundup 997 When reducing a variable-length argument data item, the Requester 998 MUST NOT include the data item's XDR roundup padding in the chunk 999 itself. The chunk's total length MUST be the same as the encoded 1000 length of the data item. 1002 4.3.6. Write Chunks 1004 While constructing an RPC Call message, a Requester prepares memory 1005 regions in which to receive DDP-eligible result data items. A "Write 1006 chunk" represents an XDR data item that a Responder is to push to a 1007 Requester. It consists of an array of zero or more plain segments. 1009 A Requester provisions Write chunks long before the Responder has 1010 prepared the reply message. A Requester often does not know the 1011 actual length of the result data items to be returned, since the 1012 result does not yet exist. Thus, it MUST provision Write chunks 1013 large enough to accommodate the maximum possible size of each 1014 returned data item. 1016 Note that the XDR position of DDP-eligible data items in the reply's 1017 Payload stream is not predictable when a Requester constructs an RPC 1018 Call message. Therefore, RDMA segments in a Write chunk do not have 1019 a Position field. 1021 For each Write chunk provided by a Requester, the Responder pushes 1022 one DDP-eligible data item to the Requester. It fills the chunk 1023 contiguously and in segment array order until the Responder has 1024 written that data item to the Requester in its entirety. The 1025 Responder MUST copy the segment count and all segments from the 1026 Requester-provided Write chunk into the RPC Reply message's transport 1027 header.
As it does so, the Responder updates each segment length 1028 field to reflect the actual amount of data returned in that segment. 1029 The Responder then sends the RPC Reply message via an RDMA Send 1030 operation. 1032 An "empty Write chunk" is a Write chunk with a zero segment count. 1033 By definition, the length of an empty Write chunk is zero. An 1034 "unused Write chunk" has a non-zero segment count, but all of its 1035 segments are empty segments. 1037 4.3.6.1. The Write List 1039 Each RPC-over-RDMA message carries a list of Write chunks. When no 1040 DDP-eligible data items are to appear in the Reply to an RPC 1041 transaction, the Requester provides an empty Write list in the RPC 1042 Call, and the Responder leaves the Write list empty in the matching 1043 RPC Reply. 1045 4.3.6.2. Decoding Write Chunks 1047 After receiving the RPC Reply message, the Requester reconstructs the 1048 transferred data by concatenating the contents of each segment in 1049 array order into the RPC Reply message's XDR stream at the known XDR 1050 position of the associated DDP-eligible result data item. 1052 4.3.6.3. Write Chunk Roundup 1054 When provisioning a Write chunk for a variable-length result data 1055 item, the Requester MUST NOT include additional space for XDR roundup 1056 padding. A Responder MUST NOT write XDR roundup padding into a Write 1057 chunk, even if the result is shorter than the available space in the 1058 chunk. Therefore, when returning a single variable-length result 1059 data item, a returned Write chunk's total length MUST be the same as 1060 the encoded length of the result data item. 1062 4.4. Payload Formats 1064 Unlike RPC-over-TCP and RPC-over-UDP transports, RPC-over-RDMA 1065 transports are aware of the XDR encoding of each RPC message payload. 1066 For efficiency, the transport can convey DDP-eligible XDR data items 1067 separately from the RPC message itself. 
Also, receivers are required 1068 to post adequate receive resources in advance of each RPC message. 1070 RPC-over-RDMA version 2 provides several ways to arrange conveyance 1071 of an RPC-over-RDMA message. A sender chooses the specific format 1072 for a message based on several factors: 1074 * The existence of DDP-eligible data items in the RPC message 1076 * The size of the RPC message 1078 * The direction of the RPC message (i.e., Call or Reply) 1080 * The available hardware resources 1082 * The arrangement of source and sink memory buffers 1084 The following subsections describe in detail how Requesters and 1085 Responders format RPC-over-RDMA message payloads. 1087 4.4.1. Simple Format 1089 Every RPC message conveyed via RPC-over-RDMA version 2 requires at 1090 least one RDMA Send operation. Thus, the most efficient 1091 way to send an RPC message that is smaller than the inline threshold 1092 is to append the Payload stream directly to the Transport stream. 1093 When no chunks are present, senders construct Calls and Replies the 1094 same way, and no other operations are needed. 1096 4.4.1.1. Simple Format with Chunks 1098 If DDP-eligible data items are present in a Payload stream, a sender 1099 MAY reduce some or all of these items, removing them from the Payload 1100 stream. The sender then uses a separate mechanism to transfer the 1101 reduced data items. The Transport stream with the reduced Payload 1102 stream immediately following it is then transferred using one RDMA 1103 Send operation. 1105 When chunks are present, senders construct Calls differently than 1106 Replies. 1108 Simple Call 1109 After receiving the Transport and Payload streams of an RPC Call 1110 message with Read chunks, the Responder uses RDMA Read operations 1111 to move the reduced data items contained in Read chunks. RPC- 1112 over-RDMA Calls can carry Write chunks for the Responder to use 1113 when sending the matching Reply.
1115 Simple Reply 1116 The Responder uses RDMA Write operations to move reduced data 1117 items contained in Write chunks. Afterward, it sends the 1118 Transport and Payload streams of the RPC Reply message using one 1119 RDMA Send. RPC-over-RDMA Replies always carry an empty Read chunk 1120 list. 1122 4.4.1.2. Simple Format Examples 1124 Requester Responder 1125 | RDMA Send (RDMA_MSG) | 1126 Call | ------------------------------> | 1127 | | 1128 | | Processing 1129 | | 1130 | RDMA Send (RDMA_MSG) | 1131 | <------------------------------ | Reply 1133 Figure 1: A Simple Call without chunks and a Simple Reply without 1134 chunks 1136 Requester Responder 1137 | RDMA Send (RDMA_MSG) | 1138 Call | ------------------------------> | 1139 | RDMA Read | 1140 | <------------------------------ | 1141 | RDMA Response (arg data) | 1142 | ------------------------------> | 1143 | | 1144 | | Processing 1145 | | 1146 | RDMA Send (RDMA_MSG) | 1147 | <------------------------------ | Reply 1149 Figure 2: A Simple Call with a Read chunk and a Simple Reply 1150 without chunks 1152 Requester Responder 1153 | RDMA Send (RDMA_MSG) | 1154 Call | ------------------------------> | 1155 | | 1156 | | Processing 1157 | | 1158 | RDMA Write (result data) | 1159 | <------------------------------ | 1160 | RDMA Send (RDMA_MSG) | 1161 | <------------------------------ | Reply 1163 Figure 3: A Simple Call without chunks and a Simple Reply with a 1164 Write chunk 1166 4.4.2. Continued Format 1168 For various reasons, a sender can choose to split a message payload 1169 over multiple RPC-over-RDMA messages. The Payload stream of each 1170 RPC-over-RDMA message contains a part of the RPC message. The 1171 receiver reconstructs the original RPC message by concatenating in 1172 sequence the Payload stream of each RPC-over-RDMA message. A sender 1173 MAY split an RPC message payload on any convenient boundary. 1175 4.4.2.1. 
Continued Format with Chunks 1177 If DDP-eligible data items are present in the Payload stream, a 1178 sender MAY reduce some or all of these items, removing them from the 1179 Payload stream. The sender then uses a separate mechanism to 1180 transfer the reduced data items. The Transport stream with the 1181 reduced Payload stream immediately following it is then transferred 1182 using one RDMA Send operation. 1184 As with Simple Format messages, when chunks are present, senders 1185 construct Calls differently than Replies. 1187 Continued Call 1188 After receiving the Transport and Payload streams of an RPC Call 1189 message with Read chunks, the Responder uses RDMA Read operations 1190 to move the reduced data items contained in Read chunks. RPC- 1191 over-RDMA Calls can carry Write chunks for the Responder to use 1192 when sending the matching Reply. 1194 Continued Reply 1195 The Responder uses RDMA Write operations to move reduced data 1196 items contained in Write chunks. Afterward, it sends the 1197 Transport and Payload streams of the RPC Reply message using 1198 multiple RDMA Sends. RPC-over-RDMA Replies always carry an empty 1199 Read chunk list. 1201 4.4.2.2. 
Continued Format Examples 1202 Requester Responder 1203 | RDMA Send (RDMA_MSG) | 1204 Call | ------------------------------> | 1205 | RDMA Send (RDMA_MSG) | 1206 | ------------------------------> | 1207 | RDMA Send (RDMA_MSG) | 1208 | ------------------------------> | 1209 | | 1210 | | 1211 | | Processing 1212 | | 1213 | RDMA Send (RDMA_MSG) | 1214 | <------------------------------ | Reply 1215 | RDMA Send (RDMA_MSG) | 1216 | <------------------------------ | 1217 | RDMA Send (RDMA_MSG) | 1218 | <------------------------------ | 1220 Figure 4: A Continued Call without chunks and a Continued Reply 1221 without chunks 1223 Requester Responder 1224 | RDMA Send (RDMA_MSG) | 1225 Call | ------------------------------> | 1226 | RDMA Send (RDMA_MSG) | 1227 | ------------------------------> | 1228 | RDMA Send (RDMA_MSG) | 1229 | ------------------------------> | 1230 | RDMA Read | 1231 | <------------------------------ | 1232 | RDMA Response (arg data) | 1233 | ------------------------------> | 1234 | | 1235 | | Processing 1236 | | 1237 | RDMA Send (RDMA_MSG) | 1238 | <------------------------------ | Reply 1240 Figure 5: A Continued Call with a Read chunk and a Simple Reply 1241 without chunks 1243 Requester Responder 1244 | RDMA Send (RDMA_MSG) | 1245 Call | ------------------------------> | 1246 | | 1247 | | Processing 1248 | | 1249 | RDMA Write (result data) | 1250 | <------------------------------ | 1251 | RDMA Send (RDMA_MSG) | 1252 | <------------------------------ | Reply 1253 | RDMA Send (RDMA_MSG) | 1254 | <------------------------------ | 1255 | RDMA Send (RDMA_MSG) | 1256 | <------------------------------ | 1258 Figure 6: A Simple Call without chunks and a Continued Reply with 1259 a Write chunk 1261 4.4.3. Special Format 1263 Sometimes, after DDP-eligible data items have been removed, a Payload 1264 stream is still too large to send using only RDMA Send operations. 
1265 In those cases, the sender can use RDMA Read or Write operations to 1266 convey the entire RPC message. We refer to this as a "Special 1267 Format" message. 1269 To transmit a Special Format message, the sender transmits only the 1270 Transport stream with an RDMA Send operation. The sender does not 1271 include the Payload stream in the send buffer. Instead, the 1272 Requester provides chunks that the Responder uses to move the Payload 1273 stream. 1275 Because chunks are always present in Special Format messages, the 1276 sender always handles Calls and Replies differently. 1278 Special Call 1279 The Requester provides a Read chunk that contains the RPC Call 1280 message's Payload stream. Every read segment in this chunk MUST 1281 contain zero (0) in its Position field. This type of Read chunk 1282 is known as a "Position Zero Read chunk." 1284 Special Reply 1285 The Requester provisions a single Write chunk in advance, known as 1286 a "Reply chunk", in which the Responder places the RPC Reply 1287 message's Payload stream. The Requester provisions the Reply 1288 chunk to accommodate the maximum expected reply size for that 1289 upper-layer operation. 1291 One purpose of a Special Format message is to handle large RPC 1292 messages. However, Requesters MAY use a Special Format message at 1293 any time to convey an RPC Call message. 1295 When it has alternatives, a Responder chooses which Format to use 1296 based on the chunks provided by the Requester. If a Requester 1297 provided a Write chunk and the Responder has a DDP-eligible result, 1298 it first reduces the reply Payload stream. If a Requester provided a 1299 Reply chunk and the reduced Payload stream is larger than the reply 1300 inline threshold, the Responder MUST use the Requester-provided Reply 1301 chunk for the reply. 1303 XDR data items may appear in these chunks without regard to their 1304 DDP-eligibility. 
   As these chunks contain a Payload stream, they MUST
   include appropriate XDR roundup padding to maintain proper XDR
   alignment of their contents.

   If a Responder receives an RPC-over-RDMA message that carries a
   Position Zero Read chunk and whose RDMA2_F_RESPONSE flag is clear but
   the message does not use the RDMA2_NOMSG header type, it MUST discard
   that message without processing it and respond with an RDMA2_ERROR
   message with the rdma_xid field set to the XID of the malformed
   message and the rdma_err field set to RDMA2_ERR_BAD_XDR.  When a
   Requester receives an RPC-over-RDMA message that carries a Reply
   chunk and whose RDMA2_F_RESPONSE flag is set but the message does not
   use the RDMA2_NOMSG header type, it MUST discard that message without
   processing it and terminate the RPC transaction corresponding to the
   XID value in the message's rdma_xid field.

4.4.3.1.  Special Format Examples

        Requester                             Responder
            |     RDMA Send (RDMA_NOMSG)          |
       Call |   ------------------------------>   |
            |     RDMA Read                       |
            |   <------------------------------   |
            |     RDMA Response (RPC call)        |
            |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |     RDMA Send (RDMA_MSG)            |
            |   <------------------------------   | Reply

        Figure 7: A Special Call and a Simple Reply without chunks

        Requester                             Responder
            |     RDMA Send (RDMA_MSG)            |
       Call |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |     RDMA Write (RPC reply)          |
            |   <------------------------------   |
            |     RDMA Send (RDMA_NOMSG)          |
            |   <------------------------------   | Reply

        Figure 8: A Simple Call without chunks and a Special Reply

4.5.  Reverse-Direction Operation

4.5.1.  Sending a Reverse-Direction RPC Call

   An RPC-over-RDMA server endpoint constructs the transport header for
   a reverse-direction RPC Call as follows:

   *  The server generates a new XID value (see Section 3.1.3.5 for full
      requirements) and places it in the rdma_xid field of the transport
      header and the xid field of the RPC Call message.  The RPC Call
      header MUST start with the same XID value that is present in the
      transport header.

   *  The rdma_vers field of each reverse-direction Call MUST contain
      the same value as forward-direction Calls on the same connection.

   *  The server fills in the rdma_credit field with the credit values
      for the connection, as described in Section 4.2.1.1.

   *  The server determines the Payload format for the RPC message and
      fills in the rdma_htype field as appropriate (see Sections 4.4 and
      4.5.4).  Section 4.5.4 also covers the disposition of the chunk
      lists.

   *  The server MUST clear the RDMA2_F_RESPONSE flag in the rdma_flags
      field.  It sets the RDMA2_F_MORE flag in the rdma_flags field as
      described in Section 6.2.2.2.

4.5.2.  Sending a Reverse-Direction RPC Reply

   An RPC-over-RDMA server endpoint constructs the transport header for
   a reverse-direction RPC Reply as follows:

   *  The server copies the XID value from the matching RPC Call and
      places it in the rdma_xid field of the transport header and the
      xid field of the RPC Reply message.  The RPC Reply header MUST
      start with the same XID value that is present in the transport
      header.

   *  The rdma_vers field of each reverse-direction Reply MUST contain
      the same value as forward-direction Replies on the same
      connection.

   *  The server fills in the rdma_credit field with the credit values
      for the connection, as described in Section 4.2.1.1.

   *  The server determines the Payload format for the RPC message and
      fills in the rdma_htype field as appropriate (see Sections 4.4 and
      4.5.4).  Section 4.5.4 also covers the disposition of the chunk
      lists.

   *  The server MUST set the RDMA2_F_RESPONSE flag in the rdma_flags
      field.  It sets the RDMA2_F_MORE flag in the rdma_flags field as
      described in Section 6.2.2.2.

4.5.3.  In the Absence of Support For Reverse-Direction Operation

   An RPC-over-RDMA transport endpoint does not have to support reverse-
   direction operation.  There might be no mechanism in the transport
   implementation to do so.  Or, the transport implementation might
   support operation in the reverse direction, but the Upper-Layer
   Protocol might not yet have configured the transport to handle
   reverse-direction traffic.

   If an endpoint is unprepared to receive a reverse-direction message,
   loss of the RDMA connection might result.  Thus a denial of service
   can occur if an RPC server continues to send reverse-direction
   messages after a client that is not prepared to receive them
   reconnects to an endpoint.

   Connection peers indicate their support for reverse-direction
   operation as part of the exchange of Transport Properties just after
   a connection is established (see Section 5.2.5).

   When dealing with the possibility that the remote peer has no
   transport level support for reverse-direction operation, the Upper-
   Layer Protocol is responsible for informing peers when reverse
   direction operation is supported.  Otherwise, even a simple reverse
   direction RPC NULL procedure from a peer could result in a lost
   connection.  Therefore, an Upper-Layer Protocol MUST NOT perform
   reverse-direction RPC operations until the RPC server indicates
   support for them.

4.5.4.  Using Chunks During Reverse-Direction Operation

   Reverse-direction operations can use chunks, as defined in
   Section 4.3.4, for DDP-eligible data items or in Special payload
   formats.  Reverse-direction chunks operate the same way as in
   forward-direction operation.  Connection peers indicate their support
   for reverse-direction chunks as part of the exchange of Transport
   Properties just after a connection is established (see
   Section 5.2.5).

   However, an implementation might support only Upper-Layer Protocols
   that have no DDP-eligible data items.  Such Upper-Layer Protocols can
   use only small messages, or they might have a native mechanism for
   restricting the size of reverse-direction RPC messages, obviating the
   need to handle chunks in the reverse direction.

   When there is no Upper-Layer Protocol need for chunks in the reverse
   direction, implementers MAY choose not to provide support for chunks
   in the reverse direction, thus avoiding the complexity of
   implementing support for RDMA Reads and Writes in the reverse
   direction.

   When an RPC-over-RDMA transport implementation does not support
   chunks in the reverse direction, RPC endpoints use only the Simple
   Payload format without chunks or the Continued Payload format without
   chunks to send RPC messages in the reverse direction.

   If a reverse-direction Requester provides a non-empty chunk list to a
   Responder that does not support chunks, the Responder MUST report its
   lack of support using one of the error values defined in Section 7.3.

4.5.5.  Reverse-Direction Retransmission

   In rare cases, an RPC server cannot complete an RPC transaction and
   cannot send a Reply.  In these cases, the Requester may send the RPC
   transaction again using the same RPC XID.  We refer to this as an
   "RPC retransmission" or a "replay."

   In the forward direction, the Requester is the RPC client.
   The client is always responsible for ensuring a transport connection
   is in place before sending a dropped Call again.

   With reverse-direction operation, the Requester is an RPC server.
   Because an RPC server is not responsible for establishing transport
   connections with clients, the Requester is unable to retransmit a
   reverse-direction Call whenever there is no transport connection.  In
   this case, the RPC server must wait for the RPC client to re-
   establish a transport connection before it can retransmit reverse-
   direction RPC Calls.

   If the forward-direction Requester has no work to do, it can be some
   time before the RPC client re-establishes a transport connection.  An
   RPC server may need to abandon a waiting reverse-direction RPC Call
   to avoid waiting indefinitely for the client to re-establish a
   transport connection.

   Therefore forward-direction Requesters SHOULD maintain a transport
   connection as long as the RPC server might send reverse-direction
   Calls.  For example, while an NFS version 4.1 client has open
   delegated files or active pNFS layouts, it maintains one or more
   transport connections to enable the NFS server to perform callback
   operations.

5.  Transport Properties

   RPC-over-RDMA version 2 enables connection endpoints to exchange
   information about implementation properties.  Compatible endpoints
   use this information to optimize data transfer.  Initially, only a
   small set of transport properties is defined.  The protocol provides
   a single message type to exchange transport properties (see
   Section 6.3.4).

   Both the set of transport properties and the operations used to
   communicate them may be extended.  Within RPC-over-RDMA version 2,
   such extensions are OPTIONAL.  A discussion of extending the set of
   transport properties appears in Appendix B.4.

5.1.  Transport Properties Model

   The current document specifies a basic set of receiver and sender
   properties.  Such properties are specified using a code point that
   identifies the particular transport property and a nominally opaque
   array containing the XDR encoding of the property.

   The following XDR types handle transport properties:

   <CODE BEGINS>
   typedef uint32 rpcrdma2_propid;

   struct rpcrdma2_propval {
           rpcrdma2_propid rdma_which;
           opaque          rdma_data<>;
   };

   typedef rpcrdma2_propval rpcrdma2_propset<>;

   typedef uint32 rpcrdma2_propsubset<>;
   <CODE ENDS>

   The rpcrdma2_propid type specifies a distinct transport property.
   The property code points are defined as const values rather than
   elements in an enum type to enable the extension by concatenating XDR
   definition files.

   The rpcrdma2_propval type carries the value of a transport property.
   The rdma_which field identifies the particular property, and the
   rdma_data field contains the associated value of that property.  A
   zero-length rdma_data field represents the default value of the
   property specified by rdma_which.

   Although the rdma_data field is opaque, receivers interpret its
   contents using the XDR type associated with the property specified by
   rdma_which.  When the contents of the rdma_data field do not conform
   to that XDR type, the receiver MUST return the error
   RDMA2_ERR_BAD_PROPVAL using the header type RDMA2_ERROR, as described
   in Section 6.3.3.

   For example, the receiver of a message containing a valid
   rpcrdma2_propval returns this error if the length of rdma_data is
   greater than the length of the transferred message.
   Also, when the
   receiver recognizes the rpcrdma2_propid contained in rdma_which, it
   MUST report the error RDMA2_ERR_BAD_PROPVAL if either of the
   following occurs:

   *  The nominally opaque data within rdma_data is not valid when
      interpreted using the property-associated typedef.

   *  The length of rdma_data is insufficient to contain the data
      represented by the property-associated typedef.

   A receiver does not report an error if it does not recognize the
   value contained in rdma_which.  In that case, the receiver does not
   process that rpcrdma2_propval.  Processing continues with the next
   rpcrdma2_propval, if any.

   The rpcrdma2_propset type specifies a set of transport properties.
   The protocol does not impose a particular ordering of the
   rpcrdma2_propval items within it.

   The rpcrdma2_propsubset type identifies a subset of the properties in
   a rpcrdma2_propset.  Each bit in the mask denotes a particular
   element in a previously specified rpcrdma2_propset.  If a particular
   rpcrdma2_propval is at position N in the array, then bit number N mod
   32 in word N div 32 specifies whether the defined subset includes
   that particular rpcrdma2_propval.  Words beyond the last one
   specified are assumed to contain zero.

5.2.  Current Transport Properties

   Table 1 specifies a basic set of transport properties.  The columns
   contain the following information:

   *  The column labeled "Property" contains a name of the transport
      property described by the current row.

   *  The column labeled "Code" specifies the code point that identifies
      this property.

   *  The column labeled "XDR type" gives the XDR type of the data used
      to communicate the value of this property.  This data type
      overlays the data portion of the nominally opaque rdma_data field.

   *  The column labeled "Default" gives the default value for the
      property.

   *  The column labeled "Section" indicates the section within the
      current document that explains the use of this property.

   +============================+======+==========+=========+=========+
   | Property                   | Code | XDR type | Default | Section |
   +============================+======+==========+=========+=========+
   | Maximum Send Size          | 1    | uint32   | 4096    | 5.2.1   |
   +----------------------------+------+----------+---------+---------+
   | Receive Buffer Size        | 2    | uint32   | 4096    | 5.2.2   |
   +----------------------------+------+----------+---------+---------+
   | Maximum RDMA Segment Size  | 3    | uint32   | 1048576 | 5.2.3   |
   +----------------------------+------+----------+---------+---------+
   | Maximum RDMA Segment Count | 4    | uint32   | 16      | 5.2.4   |
   +----------------------------+------+----------+---------+---------+
   | Reverse-Direction Support  | 5    | uint32   | 0       | 5.2.5   |
   +----------------------------+------+----------+---------+---------+
   | Host Auth Message          | 6    | opaque<> | N/A     | 5.2.6   |
   +----------------------------+------+----------+---------+---------+

                                  Table 1

5.2.1.  Maximum Send Size

   The value of this property specifies the maximum size, in octets, of
   Send payloads.  The endpoint receiving this value can size its
   Receive buffers based on the value of this property.

   <CODE BEGINS>
   const uint32 RDMA2_PROPID_SBSIZ = 1;
   typedef uint32 rpcrdma2_prop_sbsiz;
   <CODE ENDS>

5.2.2.  Receive Buffer Size

   The value of this property specifies the minimum size, in octets, of
   pre-posted receive buffers.

   <CODE BEGINS>
   const uint32 RDMA2_PROPID_RBSIZ = 2;
   typedef uint32 rpcrdma2_prop_rbsiz;
   <CODE ENDS>

   A sender can subsequently use this value to determine when a message
   to be sent fits in pre-posted receive buffers that the receiver has
   set up.
   In particular:

   *  Requesters may use the value to determine when to provide a
      Position Zero Read chunk or use Message Continuation when sending
      a Call.

   *  Requesters may use the value to determine when to provide a Reply
      chunk when sending a Call, based on the maximum possible size of
      the Reply.

   *  Responders may use the value to determine when to use a Reply
      chunk provided by the Requester, given the actual size of a Reply.

5.2.3.  Maximum RDMA Segment Size

   The value of this property specifies the maximum size, in octets, of
   an RDMA segment this endpoint is prepared to send or receive.

   <CODE BEGINS>
   const uint32 RDMA2_PROPID_RSSIZ = 3;
   typedef uint32 rpcrdma2_prop_rssiz;
   <CODE ENDS>

5.2.4.  Maximum RDMA Segment Count

   The value of this property specifies the maximum number of RDMA
   segments that can appear in a Requester's transport header.

   <CODE BEGINS>
   const uint32 RDMA2_PROPID_RCSIZ = 4;
   typedef uint32 rpcrdma2_prop_rcsiz;
   <CODE ENDS>

5.2.5.  Reverse-Direction Support

   The value of this property specifies a client implementation's
   readiness to process messages that are part of reverse direction RPC
   requests.

   <CODE BEGINS>
   const uint32 RDMA2_RVRSDIR_NONE = 0;
   const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
   const uint32 RDMA2_RVRSDIR_CONT = 2;
   const uint32 RDMA2_RVRSDIR_GENL = 3;

   const uint32 RDMA2_PROPID_BRS = 5;
   typedef uint32 rpcrdma2_prop_brs;
   <CODE ENDS>

   Multiple levels of support are distinguished:

   *  The value RDMA2_RVRSDIR_NONE indicates that the sender does not
      support reverse-direction operation.

   *  The value RDMA2_RVRSDIR_SIMPLE indicates that the sender supports
      using only Simple Format messages without chunks for reverse-
      direction messages.

   *  The value RDMA2_RVRSDIR_CONT indicates that the sender supports
      using either Simple Format without chunks or Continued Format
      messages without chunks for reverse-direction messages.

   *  The value RDMA2_RVRSDIR_GENL indicates that the sender supports
      reverse-direction messages in the same way as forward-direction
      messages.

   When a peer does not provide this property, the default is that the
   peer does not support reverse-direction operation.

5.2.6.  Host Authentication Message

   The value of this transport property enables the exchange of host
   authentication material.  This property can accommodate
   authentication handshakes that require multiple challenge-response
   interactions and potentially large amounts of material.

   <CODE BEGINS>
   const uint32 RDMA2_PROPID_HOSTAUTH = 6;
   typedef opaque rpcrdma2_prop_hostauth<>;
   <CODE ENDS>

   When this property is not present, the peer(s) remain
   unauthenticated.  Local security policy on each peer determines
   whether the connection is permitted to continue.

6.  Transport Messages

   Each transport message consists of multiple sections.

   *  A transport header prefix, as defined in Section 6.2.2.  Among
      other things, this structure indicates the header type.

   *  The transport header proper, as defined by one of the sub-sections
      below.  See Section 6.1 for the mapping between header types and
      the corresponding header structure.

   *  Potentially, all or part of an RPC message payload.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which defined the first and second
   of the items above as a single XDR data structure.  The new
   organization is in keeping with RPC-over-RDMA version 2's
   extensibility model, which enables the definition of new header types
   without modifying the XDR definition of existing header types.

6.1.  Transport Header Types

   Table 2 lists the RPC-over-RDMA version 2 header types.  The columns
   contain the following information:

   *  The column labeled "Operation" names the particular operation.

   *  The column labeled "Code" specifies the value of the header type
      for this operation.

   *  The column labeled "XDR type" gives the XDR type of the data
      structure used to organize the information in this new message
      type.  This data immediately follows the universal portion of the
      transport header present in every RPC-over-RDMA transport header.

   *  The column labeled "Msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   *  The column labeled "Section" refers to the section within the
      current document that explains the use of this header type.

   +====================+======+===================+=====+=========+
   | Operation          | Code | XDR type          | Msg | Section |
   +====================+======+===================+=====+=========+
   | Convey Appended    | 0    | rpcrdma2_msg      | Yes | 6.3.1   |
   | RPC Message        |      |                   |     |         |
   +--------------------+------+-------------------+-----+---------+
   | Convey External    | 1    | rpcrdma2_nomsg    | No  | 6.3.2   |
   | RPC Message        |      |                   |     |         |
   +--------------------+------+-------------------+-----+---------+
   | Report Transport   | 4    | rpcrdma2_err      | No  | 6.3.3   |
   | Error              |      |                   |     |         |
   +--------------------+------+-------------------+-----+---------+
   | Specify Properties | 5    | rpcrdma2_connprop | No  | 6.3.4   |
   | at Connection      |      |                   |     |         |
   +--------------------+------+-------------------+-----+---------+
   | Grant Credits      | 6    | void              | No  | 6.3.5   |
   +--------------------+------+-------------------+-----+---------+

                                Table 2

   Support for the operations in Table 2 is REQUIRED.  RPC-over-RDMA
   version 2 implementations that receive an unrecognized header type
   MUST respond with an RDMA2_ERROR message with an rdma_err field
   containing RDMA2_ERR_INVAL_HTYPE and drop the incoming message
   without processing it further.

6.2.  Headers and Chunks

   Most RPC-over-RDMA version 2 data structures have antecedents in
   corresponding structures in RPC-over-RDMA version 1.  As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names, and there are a few small changes in content.  In
   some cases, there have been structural re-organizations to enable
   protocol extensibility.

6.2.1.  Common Transport Header Prefix

   The rpcrdma_common structure defines the initial part of each RPC-
   over-RDMA transport header for RPC-over-RDMA version 2 and subsequent
   versions.

   <CODE BEGINS>
   struct rpcrdma_common {
           uint32         rdma_xid;
           uint32         rdma_vers;
           uint32         rdma_credit;
           uint32         rdma_htype;
   };
   <CODE ENDS>

   RPC-over-RDMA version 2's use of these first four words matches that
   of version 1 as required by [RFC8166].  However, there are crucial
   structural differences in the way that the respective XDR
   definitions describe these words:

   *  The header type is represented as a uint32 rather than as an enum
      type.  An enum would need to be modified to reflect additions to
      the set of header types made by later extensions.

   *  The header type field is part of an XDR structure devoted to
      representing the transport header prefix, rather than being part
      of a discriminated union that includes the body of each transport
      header type.

   *  There is now a prefix structure (see Section 6.2.2) of which the
      rpcrdma_common structure is the initial segment.  This prefix is a
      newly defined XDR object within the protocol description, which
      constrains the universal portion of all header types to the four
      words in rpcrdma_common.

   These changes are part of a more considerable structural change in
   the XDR definition of RPC-over-RDMA version 2 that facilitates a
   cleaner treatment of protocol extension.
   The XDR appearing in
   Section 8 reflects these changes, which Appendix C.1 discusses in
   further detail.

6.2.2.  Transport Header Prefix

   The following prefix structure appears at the start of each RPC-over-
   RDMA version 2 transport header.

   <CODE BEGINS>
   const RDMA2_F_RESPONSE = 0x00000001;
   const RDMA2_F_MORE     = 0x00000002;
   const RDMA2_F_TPMORE   = 0x00000004;

   struct rpcrdma2_hdr_prefix {
           struct rpcrdma_common rdma_start;
           uint32                rdma_flags;
   };
   <CODE ENDS>

   The rdma_flags field is new to RPC-over-RDMA version 2.  Currently,
   the flags defined within this word are the RDMA2_F_RESPONSE flag,
   the RDMA2_F_MORE flag, and the RDMA2_F_TPMORE flag.  The other flags
   are reserved for future use as described in Appendix B.3.  The sender
   MUST set reserved flags to zero, and the receiver MUST ignore
   reserved flags.

6.2.2.1.  RDMA2_F_RESPONSE Flag

   The RDMA2_F_RESPONSE flag qualifies the value contained in the
   transport header's rdma_xid field.  The RDMA2_F_RESPONSE flag enables
   a receiver to avoid performing an XID lookup on incoming reverse
   direction Call messages.  Therefore:

   *  When the rdma_htype field has the value RDMA2_MSG or RDMA2_NOMSG,
      the value of the RDMA2_F_RESPONSE flag MUST be the same as the
      value of the associated RPC message's msg_type field.

   *  When the header type is anything else and a whole or partial RPC
      message payload is present, the value of the RDMA2_F_RESPONSE flag
      MUST be the same as the value of the associated RPC message's
      msg_type field.

   *  When no RPC message payload is present, a Requester MUST set the
      value of RDMA2_F_RESPONSE to reflect how the receiver is to
      interpret the rdma_xid field.

   *  When the rdma_htype field has the value RDMA2_ERROR, the
      RDMA2_F_RESPONSE flag MUST be set.

6.2.2.2.  RDMA2_F_MORE Flag

   The RDMA2_F_MORE flag signifies that the RPC-over-RDMA message
   payload continues in the next message.
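
   As an informal illustration only, the flag tests described so far
   can be sketched in C.  The function names, the host-order arguments,
   and the RPC_CALL/RPC_REPLY macros are assumptions made for this
   sketch; they are not part of the protocol's XDR definition.  The
   msg_type values (CALL is 0, REPLY is 1) come from the ONC RPC
   message format, and the header type codes come from Table 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits from Section 6.2.2. */
#define RDMA2_F_RESPONSE 0x00000001u
#define RDMA2_F_MORE     0x00000002u
#define RDMA2_F_TPMORE   0x00000004u

/* Header type codes from Table 2. */
enum {
    RDMA2_MSG = 0,
    RDMA2_NOMSG = 1,
    RDMA2_ERROR = 4,
    RDMA2_CONNPROP = 5,
    RDMA2_GRANT = 6
};

/* ONC RPC msg_type values; the macro names are local to this sketch. */
#define RPC_CALL  0u
#define RPC_REPLY 1u

/*
 * Hypothetical helper: check the RDMA2_F_RESPONSE rules of
 * Section 6.2.2.1 for a decoded (host byte order) header prefix.
 * The caller supplies the msg_type of the associated RPC message.
 */
bool rdma2_response_flag_ok(uint32_t htype, uint32_t flags,
                            uint32_t rpc_msg_type)
{
    bool response = (flags & RDMA2_F_RESPONSE) != 0;

    switch (htype) {
    case RDMA2_MSG:
    case RDMA2_NOMSG:
        /* The flag must mirror the RPC message's msg_type. */
        return response == (rpc_msg_type == RPC_REPLY);
    case RDMA2_ERROR:
        /* The flag must always be set on error reports. */
        return response;
    default:
        /* Other header types are not checked in this sketch. */
        return true;
    }
}

/* Hypothetical helper: does the payload continue in the next message? */
bool rdma2_payload_continues(uint32_t flags)
{
    return (flags & RDMA2_F_MORE) != 0;
}
```

   A receiver built along these lines would call
   rdma2_response_flag_ok() after decoding the prefix and before
   performing any XID lookup.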

   When the sender asserts the RDMA2_F_MORE flag, the receiver is to
   concatenate the data payload of the next received message to the end
   of the data payload of the received message.  The sender clears the
   RDMA2_F_MORE flag in the final message in the sequence.

   All RPC-over-RDMA messages in such a sequence MUST have the same
   values in the rdma_xid and rdma_htype fields.  Otherwise, the
   receiver MUST drop the message without processing it further.  If the
   receiver is a Responder, it MUST also respond with an RDMA2_ERROR
   message with the rdma_err field set to RDMA2_ERR_INVAL_CONT.

   If a peer receives an RPC-over-RDMA message with the RDMA2_F_MORE
   flag set, and the rdma_htype field does not contain the values
   RDMA2_MSG or RDMA2_CONNPROP, the receiver MUST drop the message
   without processing it further.  If the receiver is a Responder, it
   MUST also respond with an RDMA2_ERROR message with the rdma_err field
   set to RDMA2_ERR_INVAL_CONT.

   The sender includes chunks only in the final message in a sequence,
   in which the RDMA2_F_MORE flag is clear.  If a peer receives an RPC-
   over-RDMA message with the RDMA2_F_MORE flag set, and its chunk lists
   are not empty, the receiver MUST drop the message without processing
   it further.  If the receiver is a Responder, it MUST also respond
   with an RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_CONT.

   There is no protocol-defined limit on the number of concatenated
   messages in a sequence.  If the sender exhausts the receiver's credit
   grant before sending the final message in the sequence, the sender
   waits for a further credit grant from the receiver before continuing
   to send messages.

   Credit exhaustion can occur at the receiver in the middle of a
   sequence of continued messages.
   The receiver can grant more credits
   by sending an RPC message payload or an out-of-band credit grant (see
   Section 4.2.1.2) to enable the sender to send the remaining messages
   in the sequence.

6.2.2.3.  RDMA2_F_TPMORE Flag

   The RDMA2_F_TPMORE flag indicates that the sender has additional
   Transport Properties to send in a subsequent RPC-over-RDMA message.
   If a peer receives any message type other than RDMA2_CONNPROP with
   the RDMA2_F_TPMORE flag set, it MUST respond with an RDMA2_ERROR
   message type whose rdma_err field contains RDMA2_ERR_INVAL_HTYPE, and
   then silently discard the ingress message without processing it.

   The RDMA2_F_TPMORE flag is clear in the final RDMA2_CONNPROP message
   type from this peer on this connection.  If a peer receives an
   RDMA2_CONNPROP message type after it has received an RDMA2_CONNPROP
   message type with a clear RDMA2_F_TPMORE flag, it MUST respond with
   an RDMA2_ERROR message type whose rdma_err field contains
   RDMA2_ERR_INVAL_HTYPE, and then silently discard the ingress message
   without processing it.

   After both connection peers have indicated they have finished sending
   their Transport Properties, they may begin passing RPC traffic.

6.2.3.  External Data Payloads

   The rpcrdma2_chunk_lists structure specifies where to find the parts
   of the message payload that are to be conveyed via explicit RDMA
   operations.

   <CODE BEGINS>
   struct rpcrdma2_chunk_lists {
           uint32                      rdma_inv_handle;
           struct rpcrdma2_read_list   *rdma_reads;
           struct rpcrdma2_write_list  *rdma_writes;
           struct rpcrdma2_write_chunk *rdma_reply;
   };
   <CODE ENDS>

   rdma_inv_handle:  The rdma_inv_handle field contains a 32-bit RDMA
      handle that the Responder should use in a Send With Invalidation
      operation.  More detail is provided in Section 6.2.4.

   rdma_reads:  The rdma_reads field anchors a list of zero or more RDMA
      read segments (see Section 4.3.5.1).

   rdma_writes:  The rdma_writes field anchors a list of zero or more
      Write chunks (see Section 4.3.6.1).

   rdma_reply:  The rdma_reply field is a list containing zero or one
      Reply chunk.  The use of this chunk is explained further in
      Section 4.4.3.

   The content of the chunk lists is determined by the message's Payload
   format, as explained in Section 4.4.

6.2.4.  Remote Invalidation

   To solicit the use of Remote Invalidation, a Requester sets the value
   of the rdma_inv_handle field in an RPC Call's transport header to a
   non-zero value that matches one of the rdma_handle fields in that
   header.  If the Responder may invalidate none of the rdma_handle
   values in the header conveying the Call, the Requester sets the RPC
   Call's rdma_inv_handle field to the value zero.

   If the Responder chooses not to use remote invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the Responder simply uses RDMA Send to
   transmit the matching RPC reply.  However, if the Responder chooses
   to use Remote Invalidation, it uses RDMA Send With Invalidate to
   transmit the RPC Reply.  It MUST use the value in the corresponding
   Call's rdma_inv_handle field to construct the Send With Invalidate
   Work Request.

   A Responder MUST NOT use a Send With Invalidate Work Request when
   sending an RDMA2_ERROR header type (see Section 6.3.3), an
   RDMA2_CONNPROP header type (see Section 6.3.4), or an asynchronous
   credit grant (see Section 4.2.1.2).

6.3.  Header Types

   The header types defined and used in RPC-over-RDMA version 1 are
   carried over into RPC-over-RDMA version 2, although there are some
   limited changes in the definitions of existing header types:

   *  To simplify interoperability with RPC-over-RDMA version 1, only
      the RDMA2_ERROR header (defined in Section 6.3.3) has an XDR
      definition that differs from that in RPC-over-RDMA version 1, and
      its modifications are all compatible extensions.

   *  RDMA2_MSG and RDMA2_NOMSG (defined in Sections 6.3.1 and 6.3.2)
      have XDR definitions that match the corresponding RPC-over-RDMA
      version 1 header types.  However, because of the changes to the
      header prefix, the version 1 and version 2 header types differ in
      on-the-wire format.

   *  RDMA2_CONNPROP (defined in Section 6.3.4) is an entirely new
      header type devoted to enabling connection peers to exchange
      information about their transport properties.

6.3.1.  RDMA2_MSG: Convey RPC Message Inline

   RDMA2_MSG conveys all or part of an RPC message immediately following
   the transport header in the send buffer.

   <CODE BEGINS>
   const rpcrdma2_proc RDMA2_MSG = 0;

   struct rpcrdma2_msg {
           struct rpcrdma2_chunk_lists rdma_chunks;

           /* The rpc message starts here and continues
            * through the end of the transmission. */
           uint32                      rdma_rpc_first_word;
   };
   <CODE ENDS>

6.3.2.  RDMA2_NOMSG: Convey External RPC Message

   RDMA2_NOMSG conveys an entire RPC message payload using explicit RDMA
   operations.  In particular, it is referred to as a Special Format
   Call when the Responder reads the RPC payload from a memory area
   specified by a Position Zero Read chunk.  It is referred to as a
   Special Format Reply when the Responder writes the RPC payload into a
   memory area specified by a Reply chunk.  In both cases, the sender
   sets the rdma_xid field to the same value as the xid of the RPC
   message payload.
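
   The distinction drawn above can be sketched in C as follows.  This
   is a rough illustration under stated assumptions: the read_segment
   structure, the nomsg_kind enum, and both function names are
   hypothetical, and the sketch treats a Read list as a Position Zero
   Read chunk only when every segment carries Position zero, following
   the requirement in Section 4.4.3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one decoded read segment. */
struct read_segment {
    uint32_t position; /* XDR Position field */
    uint32_t handle;
    uint32_t length;
    uint64_t offset;
};

/* Classification labels local to this sketch. */
enum nomsg_kind {
    NOMSG_SPECIAL_CALL,
    NOMSG_SPECIAL_REPLY,
    NOMSG_OTHER
};

/* True when the list is non-empty and every segment has Position 0. */
bool is_position_zero_read_chunk(const struct read_segment *segs,
                                 size_t count)
{
    if (count == 0)
        return false;
    for (size_t i = 0; i < count; i++) {
        if (segs[i].position != 0)
            return false;
    }
    return true;
}

/*
 * Classify an RDMA2_NOMSG message.  response_flag is the state of
 * RDMA2_F_RESPONSE; has_reply_chunk says whether a Reply chunk is
 * present in the chunk lists.
 */
enum nomsg_kind classify_nomsg(bool response_flag,
                               const struct read_segment *segs,
                               size_t nsegs, bool has_reply_chunk)
{
    if (!response_flag && is_position_zero_read_chunk(segs, nsegs))
        return NOMSG_SPECIAL_CALL;
    if (response_flag && has_reply_chunk)
        return NOMSG_SPECIAL_REPLY;
    return NOMSG_OTHER;
}
```

   Anything classified as NOMSG_OTHER would need further handling by
   the receiver, which this sketch does not attempt.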
   If all the chunk lists are empty, the message conveys a credit grant
   refresh. The header prefix of this message contains a credit grant
   refresh in the rdma_credit field. In this case, the sender MUST set
   the rdma_xid field to zero.

   const rpcrdma2_proc RDMA2_NOMSG = 1;

   struct rpcrdma2_nomsg {
           struct rpcrdma2_chunk_lists rdma_chunks;
   };

   In RPC-over-RDMA version 2, a sender should use Message Continuation
   as an alternative to using a Special Format message.

   When a Responder receives an RDMA2_NOMSG header type with a non-empty
   Read list that does not carry a Position Zero Read chunk, it MUST
   discard that message without processing it and send an RDMA2_ERROR
   message with the rdma_err field set to RDMA2_ERR_BAD_XDR. When a
   Requester receives an RDMA2_NOMSG header type with an empty Reply
   chunk, it discards that message and terminates the RPC transaction
   represented by the XID value in the rdma_xid field of the malformed
   message.

6.3.3. RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR reports a transport layer error on a previous
   transmission.

   const rpcrdma2_proc RDMA2_ERROR = 4;

   struct rpcrdma2_err_vers {
           uint32 rdma_vers_low;
           uint32 rdma_vers_high;
   };

   struct rpcrdma2_err_write {
           uint32 rdma_chunk_index;
           uint32 rdma_length_needed;
   };

   union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
           case RDMA2_ERR_VERS:
             rpcrdma2_err_vers rdma_vrange;
           case RDMA2_ERR_READ_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_WRITE_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_SEGMENTS:
             uint32 rdma_max_segments;
           case RDMA2_ERR_WRITE_RESOURCE:
             rpcrdma2_err_write rdma_writeres;
           case RDMA2_ERR_REPLY_RESOURCE:
             uint32 rdma_length_needed;
           default:
             void;
   };

   See Section 7 for details on the use of this header type.

6.3.4.
RDMA2_CONNPROP: Exchange Transport Properties

   The RDMA2_CONNPROP message type enables a connection peer to publish
   the properties of its implementation to its remote peer.

   const rpcrdma2_proc RDMA2_CONNPROP = 5;

   struct rpcrdma2_connprop {
           rpcrdma2_propset rdma_props;
   };

   Each peer sends an RDMA2_CONNPROP message type as the first message
   after the client has established a connection. The size of this
   initial message is limited to the default inline threshold for the
   RPC-over-RDMA version that is in effect. If a peer has more or
   larger Transport Properties than can fit in the initial
   RDMA2_CONNPROP message type, it sets the RDMA2_F_TPMORE flag. The
   final RDMA2_CONNPROP message type the peer sends on that connection
   must have a clear RDMA2_F_TPMORE flag.

   A peer may encounter properties that it does not recognize or
   support. In such cases, the receiver ignores unsupported properties
   without generating an error response.

6.3.5. RDMA2_GRANT: Grant Credits

   The RDMA2_GRANT message type enables a connection peer to grant
   additional credits to its remote peer without conveying a payload.

   const rpcrdma2_proc RDMA2_GRANT = 6;

   This message carries no payload except for a struct
   rpcrdma2_hdr_prefix. The rdma_flags field is unused. Senders MUST
   set the rdma_flags field to zero (all clear) and receivers MUST
   ignore the value of the bits in this field.

6.4. Choosing a Reply Mechanism

   A Requester provisions all necessary registered memory resources for
   both an RPC Call and its matching RPC Reply. A Requester constructs
   each RPC Call, thus it can compute the exact memory resources needed
   to send every Call. However, the Requester allocates memory
   resources to receive the corresponding Reply before the Responder has
   constructed it.
Occasionally, it is challenging for the Requester to 2165 know in advance precisely what resources are needed to receive the 2166 Reply. 2168 In RPC-over-RDMA version 2, a Requester can provide a Reply chunk for 2169 any transaction. The Responder can use the provided Reply chunk or 2170 it can decide to use another means to convey the RPC Reply. If the 2171 combination of the provided Write chunk list and Reply chunk is not 2172 adequate to convey a Reply, the Responder SHOULD use Message 2173 Continuation (see Section 6.2.2.2) to send that Reply. If even that 2174 is not possible, the Responder sends an RDMA2_ERROR message to the 2175 Requester, as described in Section 6.3.3: 2177 * If the Write chunk list cannot accommodate the ULP's DDP-eligible 2178 data payload, the Responder sends an RDMA2_ERR_WRITE_RESOURCE 2179 error. 2181 * If the Reply chunk cannot accommodate the parts of the Reply that 2182 are not DDP-eligible, the Responder sends an 2183 RDMA2_ERR_REPLY_RESOURCE error. 2185 When receiving such errors, the Requester can retry the ULP call 2186 using more substantial reply resources. In cases where retrying the 2187 ULP request is not possible (e.g., the request is non-idempotent), 2188 the Requester terminates the RPC transaction and presents an error to 2189 the RPC consumer. 2191 7. Error Handling 2193 A receiver performs validity checks on each ingress RPC-over-RDMA 2194 message before it passes that message to the RPC layer. For example, 2195 if an ingress RPC-over-RDMA message is not as long as the size of 2196 struct rpcrdma2_hdr_prefix (20 octets), the receiver cannot trust the 2197 value of the rdma_xid field. In this case, the receiver MUST 2198 silently discard the ingress message without processing it further, 2199 and without a response to the sender. 2201 When a request (for instance, an RPC Call or a control plane 2202 operation) is made, typically an RPC consumer blocks while waiting 2203 for the response. 
   Thus, when an incoming message conveys a request and that request
   cannot be acted upon, the receiver of that request needs to report
   the problem to its sender in order to unblock waiters. Likewise, if,
   after processing a request, a sender is unable to transmit the
   response on an otherwise healthy connection, the sender needs to
   report that problem for the same reason.

   The RDMA2_ERROR header type is used for this purpose. To form an
   RDMA2_ERROR type header:

   * The rdma_xid field MUST contain the same XID that was in the
     rdma_xid field in the ingress request.

   * The rdma_vers field MUST contain the same version that was in the
     rdma_vers field in the ingress request.

   * The sender sets the rdma_credit field to the credit values in
     effect for this connection.

   * The rdma_htype field MUST contain the value RDMA2_ERROR.

   * The rdma_err field contains a value that reflects the type of
     error that occurred, as described in the subsections below.

   When a peer receives an RDMA2_ERROR message type with an unrecognized
   or unsupported value in its rdma_err field, it MUST silently discard
   the message without processing it further.

7.1. Basic Transport Stream Parsing Errors

7.1.1. RDMA2_ERR_VERS

   When a Responder detects an RPC-over-RDMA header version that it does
   not support (the current document defines version 2), it MUST respond
   with an RDMA2_ERROR message type and set its rdma_err field to
   RDMA2_ERR_VERS. The Responder then fills in the rpcrdma2_err_vers
   structure with the RPC-over-RDMA versions it supports. The Responder
   MUST silently discard the ingress message without passing it to the
   RPC layer.

   When a Requester receives this error, it uses the information in the
   rpcrdma2_err_vers structure to select an RPC-over-RDMA version that
   both peers support.
   A Requester MUST NOT subsequently send a message that uses a version
   that the Responder has indicated it does not support. RDMA2_ERR_VERS
   indicates a permanent error. Receipt of this error completes the RPC
   transaction associated with the XID in the rdma_xid field.

7.1.2. RDMA2_ERR_INVAL_HTYPE

   If a Responder recognizes the value in an ingress rdma_vers field,
   but it does not recognize the value in the rdma_htype field or does
   not support that header type, it MUST set the rdma_err field to
   RDMA2_ERR_INVAL_HTYPE. The Responder MUST silently discard the
   incoming message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message that uses an htype
   that the Responder has indicated it does not support.
   RDMA2_ERR_INVAL_HTYPE indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.1.3. RDMA2_ERR_INVAL_CONT

   If a Responder detects a problem with an ingress RPC-over-RDMA
   message that is part of a Message Continuation sequence, the
   Responder MUST set the rdma_err field to RDMA2_ERR_INVAL_CONT.
   Section 6.2.2.2 discusses the types of problems to watch for. The
   Responder MUST silently discard all ingress messages with an rdma_xid
   field that matches the failing message without reassembling the
   payload.

   RDMA2_ERR_INVAL_CONT indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.2. XDR Errors

   A receiver might encounter an XDR parsing error that prevents it from
   processing an ingress Transport stream. Examples of such errors
   include:

   * An invalid value in the rdma_proc field.

   * The value of the rdma_xid field does not match the value of the
     XID field in the accompanying RPC message.
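   The order of the checks described in Sections 7 and 7.1, together
   with the XID comparison listed above, might be sketched as follows.
   This is a minimal illustration and not a normative algorithm. The
   function name and its return convention are hypothetical, the
   Responder is assumed to support only version 2, and reporting an XID
   mismatch as RDMA2_ERR_BAD_XDR is this sketch's reading of
   Section 7.2.1. The numeric error codes are those defined in
   Section 8.4.

```python
import struct

RDMA2_ERR_VERS = 1
RDMA2_ERR_BAD_XDR = 2
RDMA2_ERR_INVAL_HTYPE = 4

# RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR, RDMA2_CONNPROP, RDMA2_GRANT
KNOWN_HTYPES = {0, 1, 4, 5, 6}

# struct rpcrdma2_hdr_prefix: xid, vers, credit, htype, flags (20 octets)
PREFIX = struct.Struct(">5I")


def check_ingress(buf, rpc_xid=None):
    """Classify an ingress message as 'discard', 'error', or 'accept'.

    rpc_xid is the XID of the accompanying RPC message, if the caller
    has already located it; None skips the comparison.
    """
    if len(buf) < PREFIX.size:
        # rdma_xid cannot be trusted: drop silently, send no response.
        return ("discard", None)
    xid, vers, _credit, htype, _flags = PREFIX.unpack_from(buf)
    if vers != 2:
        return ("error", RDMA2_ERR_VERS)
    if htype not in KNOWN_HTYPES:
        return ("error", RDMA2_ERR_INVAL_HTYPE)
    if rpc_xid is not None and rpc_xid != xid:
        return ("error", RDMA2_ERR_BAD_XDR)
    return ("accept", None)
```

   A caller that receives ("error", code) would then build an
   RDMA2_ERROR header that echoes the ingress rdma_xid and rdma_vers
   values, as described at the start of Section 7.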
2290 When a Responder receives a valid RPC-over-RDMA header but the 2291 Responder's ULP implementation cannot parse the RPC arguments in the 2292 RPC Call, the Responder returns an RPC Reply with status 2293 GARBAGE_ARGS, using an RDMA2_MSG message type. This type of parsing 2294 failure might be due to mismatches between chunk sizes or offsets and 2295 the contents of the Payload stream, for example. In this case, the 2296 error is permanent, but the Requester has no way to know how much 2297 processing the Responder has completed for this RPC transaction. 2299 7.2.1. RDMA2_ERR_BAD_XDR 2301 If a Responder recognizes the values in the rdma_vers field, but it 2302 cannot otherwise parse the ingress Transport stream, it MUST set the 2303 rdma_err field to RDMA2_ERR_BAD_XDR. The Responder MUST silently 2304 discard the ingress message without passing it to the RPC layer. 2306 RDMA2_ERR_BAD_XDR indicates a permanent error. Receipt of this error 2307 completes the RPC transaction associated with XID in the rdma_xid 2308 field. 2310 7.2.2. RDMA2_ERR_BAD_PROPVAL 2312 If a receiver recognizes the value in an ingress rdma_which field, 2313 but it cannot parse the accompanying propval, it MUST set the 2314 rdma_err field to RDMA2_ERR_BAD_PROPVAL (see Section 5.1). The 2315 receiver MUST silently discard the ingress message without applying 2316 any of its property settings. 2318 7.3. Responder RDMA Operational Errors 2320 In RPC-over-RDMA version 2, the Responder initiates RDMA Read and 2321 Write operations that target the Requester's memory. Problems might 2322 arise as the Responder attempts to use Requester-provided resources 2323 for RDMA operations. For example: 2325 * Usually, chunks can be validated only by using their contents to 2326 perform data transfers. If chunk contents are invalid (e.g., a 2327 memory region is no longer registered or a chunk length exceeds 2328 the end of the registered memory region), a Remote Access Error 2329 occurs. 
2331 * If a Requester's Receive buffer is too small, the Responder's Send 2332 operation completes with a Local Length Error. 2334 * If the Requester-provided Reply chunk is too small to accommodate 2335 a large RPC Reply message, a Remote Access Error occurs. A 2336 Responder might detect this problem before attempting to write 2337 past the end of the Reply chunk. 2339 RDMA operational errors can be fatal to the connection. To avoid a 2340 retransmission loop and repeated connection loss that deadlocks the 2341 connection, once the Requester has re-established a connection, the 2342 Responder should send an RDMA2_ERROR response to indicate that no 2343 RPC-level reply is possible for that transaction. 2345 7.3.1. RDMA2_ERR_READ_CHUNKS 2347 If a Requester presents more DDP-eligible arguments than a Responder 2348 is prepared to Read, the Responder MUST set the rdma_err field to 2349 RDMA2_ERR_READ_CHUNKS and set the rdma_max_chunks field to the 2350 maximum number of Read chunks the Responder can process. If the 2351 Responder implementation cannot handle any Read chunks for a request, 2352 it MUST set the rdma_max_chunks to zero in this response. The 2353 Responder MUST silently discard the ingress message without 2354 processing it further. 2356 The Requester can reconstruct the Call using Message Continuation or 2357 a Special Format payload and resend it. If the Requester does not 2358 resend the Call, it MUST terminate this RPC transaction with an 2359 error. 2361 7.3.2. RDMA2_ERR_WRITE_CHUNKS 2363 If a Requester has constructed an RPC Call with more DDP-eligible 2364 results than the Responder is prepared to Write, the Responder MUST 2365 set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS and set the 2366 rdma_max_chunks field to the maximum number of Write chunks the 2367 Responder can return. The Requester can reconstruct the Call with no 2368 Write chunks and a Reply chunk of appropriate size. 
If the Requester 2369 does not resend the Call, it MUST terminate this RPC transaction with 2370 an error. 2372 If the Responder implementation cannot handle any Write chunks for a 2373 request and cannot send the Reply using Message Continuation, it MUST 2374 return a response of RDMA2_ERR_REPLY_RESOURCE instead (see below). 2376 7.3.3. RDMA2_ERR_SEGMENTS 2378 If a Requester has constructed an RPC Call with a chunk that contains 2379 more segments than the Responder supports, the Responder MUST set the 2380 rdma_err field to RDMA2_ERR_SEGMENTS and set the rdma_max_segments 2381 field to the maximum number of segments the Responder can process. 2382 The Requester can reconstruct the Call and resend it. If the 2383 Requester does not resend the Call, it MUST terminate this RPC 2384 transaction with an error. 2386 7.3.4. RDMA2_ERR_WRITE_RESOURCE 2388 If a Requester has provided a Write chunk that is not large enough to 2389 contain a DDP-eligible result, the Responder MUST set the rdma_err 2390 field to RDMA2_ERR_WRITE_RESOURCE. The Responder MUST set the 2391 rdma_chunk_index field to point to the first Write chunk in the 2392 transport header that is too short, or to zero to indicate that it 2393 was not possible to determine which chunk is too small. Indexing 2394 starts at one (1), which represents the first Write chunk. The 2395 Responder MUST set the rdma_length_needed to the number of bytes 2396 needed in that chunk to convey the result data item. 2398 The Requester can reconstruct the Call with more reply resources and 2399 resend it. If the Requester does not resend the Call (for instance, 2400 if the Responder set the index and length fields to zero), it MUST 2401 terminate this RPC transaction with an error. 2403 7.3.5. RDMA2_ERR_REPLY_RESOURCE 2405 If a Responder cannot send an RPC Reply using Message Continuation 2406 and the Reply does not fit in the Reply chunk, the Responder MUST set 2407 the rdma_err field to RDMA2_ERR_REPLY_RESOURCE. 
The Responder MUST 2408 set the rdma_length_needed to the number of Reply chunk bytes needed 2409 to convey the reply. The Requester can reconstruct the Call with 2410 more reply resources and resend it. If the Requester does not resend 2411 the Call (for instance, if the Responder set the length field to 2412 zero), it MUST terminate this RPC transaction with an error. 2414 7.4. Other Operational Errors 2416 While a Requester is constructing an RPC Call message, an 2417 unrecoverable problem might occur that prevents the Requester from 2418 posting further RDMA Work Requests on behalf of that message. As 2419 with other transports, if a Requester is unable to construct and 2420 transmit an RPC Call, the associated RPC transaction fails 2421 immediately. 2423 After a Requester has received a Reply, if it is unable to invalidate 2424 a memory region due to an unrecoverable problem, the Requester MUST 2425 close the connection to protect that memory from Responder access 2426 before the associated RPC transaction is complete. 2428 While a Responder is constructing an RPC Reply message or error 2429 message, an unrecoverable problem might occur that prevents the 2430 Responder from posting further RDMA Work Requests on behalf of that 2431 message. If a Responder is unable to construct and transmit an RPC 2432 Reply or RPC-over-RDMA error message, the Responder MUST close the 2433 connection to signal to the Requester that a reply was lost. 2435 7.4.1. RDMA2_ERR_SYSTEM 2437 If some problem occurs on a Responder that does not fit into the 2438 above categories, the Responder MAY report it to the Requester by 2439 setting the rdma_err field to RDMA2_ERR_SYSTEM. The Responder MUST 2440 silently discard the message(s) associated with the failing 2441 transaction without further processing. 2443 RDMA2_ERR_SYSTEM is a permanent error. 
This error does not indicate 2444 how much of the transaction the Responder has processed, nor does it 2445 indicate a particular recovery action for the Requester. A Requester 2446 that receives this error MUST terminate the RPC transaction 2447 associated with the XID value in the RDMA2_ERROR message's rdma_xid 2448 field. 2450 7.5. RDMA Transport Errors 2452 The RDMA connection and physical link provide some degree of error 2453 detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer 2454 (when used over TCP), the Stream Control Transmission Protocol 2455 (SCTP), as well as the InfiniBand link layer [IBA] all provide Cyclic 2456 Redundancy Check (CRC) protection of RDMA payloads. CRC-class 2457 protection is a general attribute of such transports. 2459 Additionally, the RPC layer itself can accept errors from the 2460 transport and recover via retransmission. RPC recovery can typically 2461 handle complete loss and re-establishment of a transport connection. 2463 The details of reporting and recovery from RDMA link-layer errors are 2464 described in specific link-layer APIs and operational specifications 2465 and are outside the scope of this protocol specification. See 2466 Section 11 for further discussion of RPC-level integrity schemes. 2468 8. XDR Protocol Definition 2470 This section contains a description of the core features of the RPC- 2471 over-RDMA version 2 protocol expressed in the XDR language [RFC4506]. 2472 It organizes the description to make it simple to extract into a form 2473 that is ready to compile or combine with similar descriptions 2474 published later as extensions to RPC-over-RDMA version 2. 2476 8.1. Code Component License 2478 Code Components extracted from the current document must include the 2479 following license text. When combining the extracted XDR code with 2480 other XDR code which has an identical license, only a single copy of 2481 the license text needs to be retained. 
2483 2484 /// /* 2485 /// * Copyright (c) 2010-2018 IETF Trust and the persons 2486 /// * identified as authors of the code. All rights reserved. 2487 /// * 2488 /// * The authors of the code are: 2489 /// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck. 2490 /// * 2491 /// * Redistribution and use in source and binary forms, with 2492 /// * or without modification, are permitted provided that the 2493 /// * following conditions are met: 2494 /// * 2495 /// * - Redistributions of source code must retain the above 2496 /// * copyright notice, this list of conditions and the 2497 /// * following disclaimer. 2498 /// * 2499 /// * - Redistributions in binary form must reproduce the above 2500 /// * copyright notice, this list of conditions and the 2501 /// * following disclaimer in the documentation and/or other 2502 /// * materials provided with the distribution. 2503 /// * 2504 /// * - Neither the name of Internet Society, IETF or IETF 2505 /// * Trust, nor the names of specific contributors, may be 2506 /// * used to endorse or promote products derived from this 2507 /// * software without specific prior written permission. 2508 /// * 2509 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 2510 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 2511 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 2512 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 2513 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO 2514 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 2515 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 2516 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 2517 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 2518 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 2519 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 2520 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 2521 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 2522 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 2523 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 2524 /// */ 2525 /// 2526 2528 8.2. Extraction of the XDR Definition 2530 Implementers can apply the following sed script to the current 2531 document to produce a machine-readable XDR description of the base 2532 RPC-over-RDMA version 2 protocol. 2534 2535 sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' 2536 2538 That is, if this document is in a file called "spec.txt", then 2539 implementers can do the following to extract an XDR description file 2540 and store it in the file rpcrdma-v2.x. 2542 file "rpcrdma-v2.x" 2543 sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \ 2544 < spec.txt > rpcrdma-v2.x 2545 2547 Although this file is a usable description of the base protocol, when 2548 extensions are to be supported, it may be desirable to divide the 2549 description into multiple files. 
   The following script achieves that
   purpose:

   #!/usr/local/bin/perl
   # Split rpcrdma-v2.x at each "FILE ENDS: <name>;" marker line.
   open(IN,"rpcrdma-v2.x");
   open(OUT,">temp.x");
   while(<IN>)
   {
     # Capture the file name only, without the trailing ";" or "*/".
     if (m/FILE ENDS: ([^;\s]+)/)
       {
         close(OUT);
         rename("temp.x", $1);
         open(OUT,">temp.x");
       }
       else
       {
         print OUT $_;
       }
   }
   close(IN);
   close(OUT);

   Running the above script results in two files:

   * The file common.x, containing the license plus the shared XDR
     definitions that need to be made available to both the base
     protocol and any subsequent extensions.

   * The file baseops.x, containing the XDR definitions for the base
     protocol defined in this document.

   Extensions to RPC-over-RDMA version 2, published as Standards Track
   documents, should have similarly structured XDR definitions. Once an
   implementer has extracted the XDR for all desired extensions and the
   base XDR definition contained in the current document, she can
   concatenate them to produce a consolidated XDR definition that
   reflects the set of extensions selected for her RPC-over-RDMA version
   2 implementation.

   Alternatively, the XDR descriptions can be compiled separately. In
   that case, the combination of common.x and baseops.x defines the base
   transport. The combination of common.x and the XDR description of
   each extension produces a full XDR definition of that extension.

8.3.
XDR Definition for RPC-over-RDMA Version 2 Core Structures

   /// /***************************************************************
   ///  * Transport Header Prefixes
   ///  ***************************************************************/
   ///
   /// struct rpcrdma_common {
   ///         uint32 rdma_xid;
   ///         uint32 rdma_vers;
   ///         uint32 rdma_credit;
   ///         uint32 rdma_htype;
   /// };
   ///
   /// const RDMA2_F_RESPONSE = 0x00000001;
   /// const RDMA2_F_MORE = 0x00000002;
   /// const RDMA2_F_TPMORE = 0x00000004;
   ///
   /// struct rpcrdma2_hdr_prefix {
   ///         struct rpcrdma_common rdma_start;
   ///         uint32 rdma_flags;
   /// };
   ///
   /// /***************************************************************
   ///  * Chunks and Chunk Lists
   ///  ***************************************************************/
   ///
   /// struct rpcrdma2_segment {
   ///         uint32 rdma_handle;
   ///         uint32 rdma_length;
   ///         uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///         uint32 rdma_position;
   ///         struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///         struct rpcrdma2_read_segment rdma_entry;
   ///         struct rpcrdma2_read_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///         struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///         struct rpcrdma2_write_chunk rdma_entry;
   ///         struct rpcrdma2_write_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_chunk_lists {
   ///         uint32 rdma_inv_handle;
   ///         struct rpcrdma2_read_list *rdma_reads;
   ///         struct rpcrdma2_write_list *rdma_writes;
   ///         struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// /***************************************************************
   ///  * Transport Properties
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Types for transport properties model
   ///  */
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///         rpcrdma2_propid rdma_which;
   ///         opaque rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const uint32 RDMA2_PROPID_SBSIZ = 1;
   /// const uint32 RDMA2_PROPID_RBSIZ = 2;
   /// const uint32 RDMA2_PROPID_RSSIZ = 3;
   /// const uint32 RDMA2_PROPID_RCSIZ = 4;
   /// const uint32 RDMA2_PROPID_BRS = 5;
   /// const uint32 RDMA2_PROPID_HOSTAUTH = 6;
   ///
   /// /*
   ///  * Types specific to particular properties
   ///  */
   /// typedef uint32 rpcrdma2_prop_sbsiz;
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef uint32 rpcrdma2_prop_rssiz;
   /// typedef uint32 rpcrdma2_prop_rcsiz;
   /// typedef uint32 rpcrdma2_prop_brs;
   /// typedef opaque rpcrdma2_prop_hostauth<>;
   ///
   /// const uint32 RDMA2_RVRSDIR_NONE = 0;
   /// const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
   /// const uint32 RDMA2_RVRSDIR_CONT = 2;
   /// const uint32 RDMA2_RVRSDIR_GENL = 3;
   ///
   /// /* FILE ENDS: common.x; */

8.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types

   /// /***************************************************************
   ///  * Descriptions of RPC-over-RDMA Header Types
   ///  ***************************************************************/
   ///
   /// /*
   ///  * Header Type Codes.
   ///  */
   /// const rpcrdma2_proc RDMA2_MSG = 0;
   /// const rpcrdma2_proc RDMA2_NOMSG = 1;
   /// const rpcrdma2_proc RDMA2_ERROR = 4;
   /// const rpcrdma2_proc RDMA2_CONNPROP = 5;
   /// const rpcrdma2_proc RDMA2_GRANT = 6;
   ///
   /// /*
   ///  * Header Types to Convey RPC Messages.
   ///  */
   /// struct rpcrdma2_msg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_nomsg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   /// };
   ///
   /// /*
   ///  * Header Type to Report Errors.
   ///  */
   /// const uint32 RDMA2_ERR_VERS = 1;
   /// const uint32 RDMA2_ERR_BAD_XDR = 2;
   /// const uint32 RDMA2_ERR_BAD_PROPVAL = 3;
   /// const uint32 RDMA2_ERR_INVAL_HTYPE = 4;
   /// const uint32 RDMA2_ERR_INVAL_CONT = 5;
   /// const uint32 RDMA2_ERR_READ_CHUNKS = 6;
   /// const uint32 RDMA2_ERR_WRITE_CHUNKS = 7;
   /// const uint32 RDMA2_ERR_SEGMENTS = 8;
   /// const uint32 RDMA2_ERR_WRITE_RESOURCE = 9;
   /// const uint32 RDMA2_ERR_REPLY_RESOURCE = 10;
   /// const uint32 RDMA2_ERR_SYSTEM = 100;
   ///
   /// struct rpcrdma2_err_vers {
   ///         uint32 rdma_vers_low;
   ///         uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///         uint32 rdma_chunk_index;
   ///         uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
   ///         case RDMA2_ERR_VERS:
   ///           rpcrdma2_err_vers rdma_vrange;
   ///         case RDMA2_ERR_READ_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_WRITE_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_SEGMENTS:
   ///           uint32 rdma_max_segments;
   ///         case RDMA2_ERR_WRITE_RESOURCE:
   ///           rpcrdma2_err_write rdma_writeres;
   ///         case RDMA2_ERR_REPLY_RESOURCE:
   ///           uint32 rdma_length_needed;
   ///         default:
   ///           void;
   /// };
   ///
   /// /*
   ///  * Header Type to Exchange Transport Properties.
   ///  */
   /// struct rpcrdma2_connprop {
   ///         rpcrdma2_propset rdma_props;
   /// };
   ///
   /// /* FILE ENDS: baseops.x; */

8.5. Use of the XDR Description

   The files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.

   Although this XDR description can generate encoders and decoders for
   the Transport and Payload streams, there are elements of the
   operation of RPC-over-RDMA version 2 that cannot be expressed within
   the XDR language. Implementations that use the output of an
   automated XDR processor need to provide additional code to bridge
   these gaps.

   * The Transport stream is not a single XDR object. Instead, the
     header prefix is one XDR data item, and the rest of the header is
     a separate XDR data item. Table 2 expresses the mapping between
     the header type in the header prefix and the XDR object
     representing the header type.

   * The relationship between the Transport stream and the Payload
     stream is not specified using XDR. Comments within the XDR text
     make clear where transported messages, described by their own XDR
     definitions, need to appear. Such data is opaque to the
     transport.

   * Continuation of RPC messages across transport message boundaries
     requires that transport implementations include message assembly
     facilities that cannot be specified within XDR.

   * Transport properties are constant integer values. Table 1
     expresses the mapping between each property's code point and the
     XDR typedef that represents the structure of the property's value.
     XDR does not possess the facility to express that mapping in an
     extensible way.
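   As one example of the additional code that the last item calls for,
   the following Python sketch decodes an rpcrdma2_propset and applies
   a hand-written mapping from property code point to value type. It
   is a sketch under stated assumptions rather than a reference
   decoder: the function name and the shape of the returned dictionary
   are hypothetical, and only the properties whose typedef in
   Section 8.3 is uint32 are decoded further. Unrecognized properties
   are skipped, following Section 6.3.4.

```python
import struct

# Code points whose value typedef in Section 8.3 is a plain uint32.
UINT32_PROPS = {1: "SBSIZ", 2: "RBSIZ", 3: "RSSIZ", 4: "RCSIZ", 5: "BRS"}
# Code points whose value is kept as undecoded bytes in this sketch.
RAW_PROPS = {6: "HOSTAUTH"}


def decode_propset(buf, offset=0):
    """Decode an XDR rpcrdma2_propset; return (properties, next offset)."""
    (count,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    props = {}
    for _ in range(count):
        # rpcrdma2_propval: rdma_which, then opaque rdma_data<>.
        propid, length = struct.unpack_from(">2I", buf, offset)
        offset += 8
        data = bytes(buf[offset:offset + length])
        if len(data) != length:
            raise ValueError("truncated propval")
        offset += (length + 3) & ~3      # XDR pads opaque data to 4 octets
        if propid in UINT32_PROPS:
            if length != 4:
                raise ValueError("malformed uint32 property value")
            props[UINT32_PROPS[propid]] = struct.unpack(">I", data)[0]
        elif propid in RAW_PROPS:
            props[RAW_PROPS[propid]] = data
        # Any other code point is unrecognized and is ignored.
    return props, offset
```

   An extension that defines a new property would add an entry to one
   of the two tables, which is the mapping that XDR itself cannot
   express.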

   The role of XDR in RPC-over-RDMA specifications is more limited than for protocols where the totality of the protocol is expressible within XDR. XDR lacks the facility to represent the embedding of XDR-encoded payload material. Also, the need to cleanly accommodate extensions has meant that those using rpcgen in their applications need to take an active role to provide the facilities that cannot be expressed within XDR.

9.  RPC Bind Parameters

   Before establishing a new connection, an RPC client obtains a transport address for the RPC server. The means used to obtain this address and to open an RDMA connection depends on the type of RDMA transport and is the responsibility of each RPC protocol binding and its local implementation.

   RPC services typically register with a portmap or rpcbind service [RFC1833], which associates an RPC Program number with a service address. This policy is no different with RDMA transports. However, a distinct service address (port number) is sometimes required for operation on RPC-over-RDMA.

   When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses IP port addressing due to its layering on TCP or SCTP, port mapping is trivial and consists merely of issuing the port in the connection process. The NFS/RDMA protocol service address has been assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP [RFC8267].

   When mapped atop InfiniBand [IBA], which uses a service endpoint naming scheme based on a Global Identifier (GID), a translation MUST be employed. One such translation is described in Annexes A3 (Application Specific Identifiers), A4 (Sockets Direct Protocol (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate for translating IP port addressing to the InfiniBand network. Therefore, in this case, IP port addressing may be readily employed by the upper layer.

   When a mapping standard or convention exists for IP ports on an RDMA interconnect, there are several possibilities for each upper layer to consider:

   *  One possibility is to have the server register its mapped IP port with the rpcbind service under the netid (or netids) defined in [RFC8166]. An RPC-over-RDMA-aware RPC client can then resolve its desired service to a mappable port and proceed to connect. This method is the most flexible and compatible approach for those upper layers that are defined to use the rpcbind service.

   *  A second possibility is to have the RPC server's portmapper register itself on the RDMA interconnect at a "well-known" service address (on UDP or TCP, this corresponds to port 111). An RPC client can connect to this service address and use the portmap protocol to obtain a service address in response to a program number (e.g., an iWARP port number or an InfiniBand GID).

   *  Alternatively, an RPC client can connect to the mapped well-known port for the service itself, if it is appropriately defined. By convention, the NFS/RDMA service, when operating atop such an InfiniBand fabric, uses the same 20049 assignment as for iWARP.

   Historically, different RPC protocols have taken different approaches to their port assignments. The current document leaves the specific method to each RPC-over-RDMA-enabled ULB.

   [RFC8166] defines two new netid values to be used for registration of upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port translation service is available) InfiniBand [IBA]. Additional RDMA-capable networks MAY define their own netids, or if they provide a port translation, they MAY share those defined in [RFC8166].

10.  Implementation Status

   This section is to be removed before publishing as an RFC.

   This section records the status of known implementations of the protocol defined by this specification at the time of posting of this Internet-Draft, and is based on a proposal described in [RFC7942]. The description of implementations in this section is intended to assist the IETF in its decision processes in progressing drafts to RFCs.

   Please note that the listing of any individual implementation here does not imply endorsement by the IETF. Furthermore, no effort has been spent to verify the information presented here that was supplied by IETF contributors. This is not intended as, and must not be construed to be, a catalog of available implementations or their features. Readers are advised to note that other implementations may exist.

   At this time, no known implementations of the protocol described in the current document exist.

11.  Security Considerations

11.1.  Memory Protection

   A primary consideration is the protection of the integrity and confidentiality of host memory by an RPC-over-RDMA transport. The use of an RPC-over-RDMA transport protocol MUST NOT introduce vulnerabilities to system memory contents nor memory owned by user processes. Any RDMA provider used for RPC transport MUST conform to the requirements of [RFC5042] to satisfy these protections.

11.1.1.  Protection Domains

   The use of a Protection Domain to limit the exposure of memory regions to a single connection is critical. Any attempt by an endpoint not participating in that connection to reuse memory handles needs to result in immediate failure of that connection. Because ULP security mechanisms rely on this aspect of Reliable Connected behavior, implementations SHOULD cryptographically authenticate connection endpoints.

11.1.2.  Handle (STag) Predictability

   Implementations should use unpredictable memory handles for any operation requiring exposed memory regions. Exposing a continuously registered memory region allows a remote host to read or write to that region even when an RPC involving that memory is not underway. Therefore, implementations should avoid the use of persistently registered memory.

11.1.3.  Memory Protection

   Requesters should register memory regions for remote access only when they are about to be the target of an RPC transaction that involves an RDMA Read or Write.

   Requesters should invalidate memory regions as soon as related RPC operations are complete. Invalidation and DMA unmapping of memory regions should complete before the receiver checks message integrity, and before the RPC consumer can use or alter the contents of the exposed memory region.

   An RPC transaction on a Requester can terminate before a Reply arrives, for example, if the RPC consumer is signaled, or a segmentation fault occurs. When an RPC terminates abnormally, memory regions associated with that RPC should be invalidated before the Requester reuses those regions for other purposes.

11.1.4.  Denial of Service

   A detailed discussion of denial-of-service exposures that can result from the use of an RDMA transport appears in Section 6.4 of [RFC5042].

   A Responder is not obliged to pull unreasonably large Read chunks. A Responder can use an RDMA2_ERROR response to terminate RPCs with unreadable Read chunks. If a Responder transmits more data than a Requester is prepared to receive in a Write or Reply chunk, the RDMA provider typically terminates the connection. For further discussion, see Section 6.3.3. Such repeated connection termination can deny service to other users sharing the connection from the errant Requester.
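
   The Read chunk policy above could be enforced with a check along the following lines. This is a minimal Python sketch, not a prescribed algorithm: the ReadSegment type, the read_chunk_acceptable helper, and the 1 MiB limit are all hypothetical, and the choice of limit is a local policy matter that this document does not define.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical local policy limit; not defined by the protocol.
MAX_READ_CHUNK_BYTES = 1 << 20  # 1 MiB


@dataclass
class ReadSegment:
    """Illustrative stand-in for one segment of a Read chunk."""
    handle: int
    length: int
    offset: int


def read_chunk_acceptable(segments: List[ReadSegment],
                          limit: int = MAX_READ_CHUNK_BYTES) -> bool:
    """Return False as soon as the summed segment lengths exceed the limit.

    The check runs before any RDMA Read is issued, so a Responder that
    gets False can answer with an error instead of pulling the data.
    """
    total = 0
    for seg in segments:
        total += seg.length
        if total > limit:
            return False
    return True
```

   A Responder that rejects a chunk this way would then report the failure with an RDMA2_ERROR response. Which error code it selects is outside the scope of this sketch.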

   An RPC-over-RDMA transport implementation is not responsible for throttling the RPC request rate, other than to keep the number of concurrent RPC transactions at or under the number of credits granted per connection (see Section 4.2.1). A sender can trigger a self-denial of service by exceeding the credit grant repeatedly.

   When an RPC transaction terminates due to a signal or premature exit of an application process, a Requester should invalidate the RPC's Write and Reply chunks. Invalidation prevents the subsequent arrival of the Responder's Reply from altering the memory regions associated with those chunks after the Requester has released that memory.

   On the Requester, a malfunctioning application or a malicious user can create a situation where RPCs initiate and abort continuously, resulting in Responder replies that terminate the underlying RPC-over-RDMA connection repeatedly. Such situations can deny service to other users sharing the connection from that Requester.

11.2.  RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework [RFC7861]. RPCSEC_GSS implements message authentication (rpc_gss_svc_none), per-message integrity checking (rpc_gss_svc_integrity), and per-message confidentiality (rpc_gss_svc_privacy) in a layer above the RPC-over-RDMA transport. The integrity and privacy services require significant computation and movement of data on each endpoint host. Some performance benefits enabled by RDMA transports can be lost.

11.2.1.  RPC-over-RDMA Protection at Other Layers

   For any RPC transport, utilizing RPCSEC_GSS integrity or privacy services has performance implications. Protection below the RPC implementation is often a better choice in performance-sensitive deployments, especially if it, too, can be offloaded. Certain implementations of IPsec can be co-located in RDMA hardware, for example, without change to RDMA consumers and with little loss of data movement efficiency. Such arrangements can also provide a higher degree of privacy by hiding endpoint identity or altering the frequency at which messages are exchanged, at a performance cost.

   Implementations MAY negotiate the use of protection in another layer through the use of an RPCSEC_GSS security flavor defined in [RFC7861] in conjunction with the Channel Binding mechanism [RFC5056] and IPsec Channel Connection Latching [RFC5660].

11.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports

   Not all RDMA devices and fabrics support the above protection mechanisms. Also, NFS clients, where multiple users can access NFS files, still require per-message authentication. In these cases, RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA connections.

   RPCSEC_GSS extends the ONC RPC protocol without changing the format of RPC messages. By observing the conventions described in this section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected RPC messages interoperably.

   Senders MUST NOT reduce protocol elements of RPCSEC_GSS that appear in the Payload stream of an RPC-over-RDMA message. Such elements include control messages exchanged as part of establishing or destroying a security context, or data items that are part of RPCSEC_GSS authentication material.

11.2.2.1.  RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to establish a Generic Security Service (GSS) context for NFS operation. Such clients use TCP and the standard NFS port (2049) for context establishment. Therefore, an NFS server MUST also provide a TCP-based NFS service on port 2049 to enable the use of RPCSEC_GSS with NFS/RDMA.

11.2.2.2.  RPC-over-RDMA with RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the DDP-eligibility of data items in a ULP.

   However, RPCSEC_GSS authentication material appearing in an RPC message header can be larger than, say, an AUTH_SYS authenticator. In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester needs to accommodate a larger RPC credential when marshaling RPC Calls and needs to provide for a maximum size RPCSEC_GSS verifier when allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are larger on average as a result. ULP operations that fit in a Simple Format message when a simpler form of authentication is in use might need to be reduced or conveyed via a Special Format message when RPCSEC_GSS authentication is in use. It is therefore more likely that a Requester provisions both a Read list and a Reply chunk in the same RPC-over-RDMA Transport header to convey a Special Format Call and provision a receptacle for a Special Format Reply.

   In addition to this cost, the XDR encoding and decoding of each RPC message using RPCSEC_GSS authentication requires per-message host compute resources to construct the GSS verifier.

11.2.2.3.  RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect the modification of RPC messages in flight. The RPCSEC_GSS privacy service prevents all but the intended recipient from viewing the cleartext content of RPC arguments and results. RPCSEC_GSS integrity and privacy services are end-to-end. They protect RPC arguments and results from application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC messages after they have been XDR encoded, and before they have been XDR decoded after receipt. Connection endpoints use intermediate buffers to prevent exposure of encrypted or unverified cleartext data to RPC consumers. After a sender has verified, encrypted, and wrapped a message, the transport layer MAY use RDMA data transfer between these intermediate buffers.

   The process of reducing a DDP-eligible data item removes the data item and its XDR padding from an encoded Payload stream. In a non-protected RPC-over-RDMA message, a reduced data item does not include XDR padding. After reduction, the Payload stream contains fewer octets than the whole XDR stream did beforehand. XDR padding octets are often zero bytes, but they don't have to be. Thus, reducing DDP-eligible items affects the result of message integrity verification and encryption.

   Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS integrity or encryption services are in use. Effectively, no data item is DDP-eligible in this situation. Senders can use only Simple and Continued Formats without chunks, or Special Format. In this mode, an RPC-over-RDMA transport operates in the same manner as a transport that does not support DDP.

11.2.2.4.  Protecting RPC-over-RDMA Transport Headers

   RPCSEC_GSS does not protect the RPC-over-RDMA Transport stream, just as it does not protect the header fields in an RPC message (e.g., the xid and mtype fields). XIDs, connection credit limits, and chunk lists (though not the content of the data items they refer to) are exposed to malicious behavior, which can redirect data that is transferred by the RPC-over-RDMA message, result in spurious retransmits, or trigger connection loss.

   In particular, if an attacker alters the information contained in the chunk lists of an RPC-over-RDMA Transport header, data contained in those chunks can be redirected to other registered memory regions on Requesters. An attacker might alter the arguments of RDMA Read and RDMA Write operations on the wire to gain a similar effect. If such alterations occur, the use of RPCSEC_GSS integrity or privacy services enables a Requester to detect unexpected material in a received RPC message.

   Encryption at other layers, as described in Section 11.2.1, protects the content of the Transport stream. RDMA transport implementations should conform to [RFC5042] to address attacks on RDMA protocols themselves.

11.3.  Transport Properties

   Like other fields that appear in the Transport stream, transport properties are sent in the clear with no integrity protection, making them vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the Receive buffer size, it could reduce connection performance or trigger loss of connection. Repeated connection loss can impact performance or even prevent a new connection from being established. The recourse is to deploy on a private network or use transport layer encryption.

11.4.  Host Authentication

   [ cel: This subsection is unfinished. ]

   Wherein we use the relevant sections of [RFC3552] to analyze the addition of host authentication to this RPC-over-RDMA transport.

   The authors refer readers to Appendix C of [RFC8446] for information on how to design and test a secure authentication handshake implementation.

12.  IANA Considerations

   The RPC-over-RDMA family of transports has been assigned RPC netids by [RFC8166]. A netid is an rpcbind [RFC1833] string used to identify the underlying protocol in order for RPC to select appropriate transport framing and the format of the service addresses and ports.

   The following netid registry strings are already defined for this purpose:

      NC_RDMA "rdma"
      NC_RDMA6 "rdma6"

   The "rdma" netid is to be used when IPv4 addressing is employed by the underlying transport, and "rdma6" when IPv6 addressing is employed. The netid assignment policy and registry are defined in [RFC5665]. The current document does not alter these netid assignments.

   These netids MAY be used for any RDMA network that satisfies the requirements of Section 3.2.2 and that is able to identify service endpoints using IP port addressing, possibly through use of a translation service as described in Section 9.

13.  References

13.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2", RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching", RFC 5660, DOI 10.17487/RFC5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call (RPC) Network Identifiers and Universal Address Formats", RFC 5665, DOI 10.17487/RFC5665, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC) Security Version 3", RFC 7861, DOI 10.17487/RFC7861, November 2016.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running Code: The Implementation Status Section", BCP 205, RFC 7942, DOI 10.17487/RFC7942, July 2016.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding to RPC-over-RDMA Version 1", RFC 8267, DOI 10.17487/RFC8267, October 2017.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018.

13.2.  Informative References

   [CBFC]  Kung, H.T., Blackwell, T., and A. Chapman, "Credit-Based Flow Control for ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and Statistical Multiplexing", Proc. ACM SIGCOMM '94 Symposium on Communications Architectures, Protocols and Applications, pp. 101-114, August 1994.

   [I-D.ietf-nfsv4-rpc-tls]  Myklebust, T. and C. Lever, "Towards Remote Procedure Call Encryption By Default", Work in Progress, Internet-Draft, draft-ietf-nfsv4-rpc-tls-07, 30 April 2020.

   [IBA]  InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1", Release 1.3, March 2015. Available from https://www.infinibandta.org/

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI 10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol specification", RFC 1094, DOI 10.17487/RFC1094, March 1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, DOI 10.17487/RFC1813, June 1995.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC Text on Security Considerations", BCP 72, RFC 3552, DOI 10.17487/RFC3552, July 2003.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, October 2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS) Remote Direct Memory Access (RDMA) Problem Statement", RFC 5532, DOI 10.17487/RFC5532, May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, March 2015.

   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862, November 2016.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on RPC-over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167, June 2017.

Appendix A.  ULB Specifications

   Typically, an Upper-Layer Protocol (ULP) is defined without regard to a particular RPC transport. An Upper-Layer Binding (ULB) specification provides guidance that helps a ULP interoperate correctly and efficiently over a particular transport. For RPC-over-RDMA version 2, a ULB may provide:

   *  A taxonomy of XDR data items that are eligible for DDP

   *  Constraints on which upper-layer procedures a sender may reduce, and on how many chunks may appear in a single RPC message

   *  A method enabling a Requester to determine the maximum size of the reply Payload stream for all procedures in the ULP

   *  An rpcbind port assignment for the RPC Program and Version when operating on the particular transport

   Each RPC Program and Version tuple that operates on RPC-over-RDMA version 2 needs to have a ULB specification.

A.1.  DDP-Eligibility

   A ULB designates specific XDR data items as eligible for DDP. As a sender constructs an RPC-over-RDMA message, it can remove DDP-eligible data items from the Payload stream so that the RDMA provider can place them directly in the receiver's memory. An XDR data item should be considered for DDP-eligibility if there is a clear benefit to moving the contents of the item directly from the sender's memory to the receiver's memory.

   Criteria for DDP-eligibility include:

   *  The XDR data item is frequently sent or received, and its size is often much larger than typical inline thresholds.

   *  If the XDR data item is a result, its maximum size must be predictable in advance by the Requester.

   *  Transport-level processing of the XDR data item is not needed. For example, the data item is an opaque byte array, which requires no XDR encoding and decoding of its content.

   *  The content of the XDR data item is sensitive to address alignment. For example, a data copy operation would be required on the receiver to enable the message to be parsed correctly, or to enable the data item to be accessed.

   *  The XDR data item itself does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible, a ULB may limit the use of chunks to particular upper-layer procedures. If more than one data item in a procedure is DDP-eligible, the ULB may limit the number of chunks that a Requester can provide for a particular upper-layer procedure.

   Senders never reduce data items that are not DDP-eligible. Such data items can, however, be part of a Special Format payload.

   The programming interface by which an upper-layer implementation indicates the DDP-eligibility of a data item to the RPC transport is not described by this specification. The only requirements are that the receiver can re-assemble the transmitted RPC-over-RDMA message into a valid XDR stream and that DDP-eligibility rules specified by the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR language. The only definitive specification of DDP-eligibility is a ULB.

   In general, a DDP-eligibility violation occurs when:

   *  A Requester reduces a non-DDP-eligible argument data item. The Responder reports the violation as described in Section 6.3.3.

   *  A Responder reduces a non-DDP-eligible result data item. The Requester terminates the pending RPC transaction and reports an appropriate permanent error to the RPC consumer.

   *  A Responder does not reduce a DDP-eligible result data item into an available Write chunk. The Requester terminates the pending RPC transaction and reports an appropriate permanent error to the RPC consumer.

A.2.  Maximum Reply Size

   When expecting small and moderately-sized Replies, a Requester should rely on Message Continuation rather than provision a Reply chunk. For each ULP procedure where there is no clear Reply size maximum and the maximum can be substantial, the ULB should specify a dependable means for determining the maximum Reply size.

A.3.  Reverse-Direction Operation

   The direction of operation does not preclude the need for DDP-eligibility statements.

   Reverse-direction operation occurs on an already-established connection. Specification of RPC binding parameters is usually not necessary in this case.

   Other considerations may apply when distinct RPC Programs share an RPC-over-RDMA transport connection concurrently.

A.4.  Additional Considerations

   There may be other details provided in a ULB.

   *  A ULB may recommend inline threshold values or other transport-related parameters for RPC-over-RDMA version 2 connections bearing that ULP.

   *  A ULP may provide a means to communicate transport-related parameters between peers.

   *  Multiple ULPs may share a single RPC-over-RDMA version 2 connection when their ULBs allow the use of RPC-over-RDMA version 2 and the rpcbind port assignments for those protocols permit connection sharing. In this case, the same transport parameters (such as inline threshold) apply to all ULPs using that connection.

   Each ULB needs to be designed to allow correct interoperation without regard to the transport parameters actually in use. Furthermore, implementations of ULPs must be designed to interoperate correctly regardless of the connection parameters in effect on a connection.

A.5.  ULP Extensions

   An RPC Program and Version tuple may be extensible. For instance, the RPC version number may not reflect a ULP minor versioning scheme, or the ULP may allow the specification of additional features after the publication of the original RPC Program specification. ULBs are provided for interoperable RPC Programs and Versions by extending existing ULBs to reflect the changes made necessary by each addition to the existing XDR.

   [ cel: The final sentence is unclear, and may be inaccurate. I believe I copied this section directly from RFC 8166. Is there more to be said, now that we have some experience? ]

Appendix B.  Extending RPC-over-RDMA Version 2

   This Appendix is not addressed to protocol implementers, but rather to authors of documents that extend the protocol specified in the current document.

   RPC-over-RDMA version 2 extensibility facilitates limited extensions to the base protocol presented in the current document so that new optional capabilities can be introduced without a protocol version change while maintaining robust interoperability with existing RPC-over-RDMA version 2 implementations. It allows extensions to be defined, including the definition of new protocol elements, without requiring modification or recompilation of the XDR for the base protocol.

   Standards Track documents may introduce extensions to the base RPC-over-RDMA version 2 protocol in two ways:

   *  They may introduce new OPTIONAL transport header types. Appendix B.2 covers such transport header types.

   *  They may define new OPTIONAL transport properties. Appendix B.4 describes such transport properties.

   These documents may also add the following sorts of ancillary protocol elements to the protocol to support the addition of new transport properties and header types:

   *  They may create new error codes, as described in Appendix B.5.

   *  They may define new flags to use within the rdma_flags field, as discussed in Appendix B.3.

   New capabilities can be proposed and developed independently of each other. Implementers can choose among them, making it straightforward to create and document experimental features and then bring them through the standards process.

B.1.  Documentation Requirements

   As described earlier, a Standards Track document introduces a set of new protocol elements. Together these elements are considered an OPTIONAL feature. Each implementation is either aware of all the protocol elements introduced by that feature or is aware of none of them.

   Documents specifying extensions to RPC-over-RDMA version 2 should contain:

   *  An explanation of the purpose and use of each new protocol element.

   *  An XDR description including all of the new protocol elements, and a script to extract it.

   *  A discussion of interactions with other extensions. This discussion includes requirements for other OPTIONAL features to be present, or that a particular level of support for an OPTIONAL facility is required.

   Implementers combine the XDR descriptions of the new features they intend to use with the XDR description of the base protocol in the current document. This combination is necessary to create a valid XDR input file because extensions are free to use XDR types defined in the base protocol, and later extensions may use types defined by earlier extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol combined with that for any selected extensions should provide a human-readable and compilable definition of the extended protocol.

B.2.  Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are defined in a manner similar to Sections 6.3.1 through 6.3.4. In particular, what is needed is:

   *  A description of the function and use of the new header type.

   *  A complete XDR description of the new header type.

   *  A description of how receivers report errors, including mechanisms for reporting errors outside the choices already available in the base protocol or other extensions.

   *  An indication of whether a Payload stream must be present, and a description of its contents and how receivers use such Payload streams to reconstruct RPC messages.

   *  As appropriate, a statement of whether a Responder may use Remote Invalidation when sending messages that contain the new header type.

   Additional documentation is needed because of the OPTIONAL status of new transport header types:

   *  The document should discuss constraints on support for the new header types. For example, if support for one header type is implied or foreclosed by another one, this needs to be documented.

   *  The document should describe the preferred method by which a sender determines whether its peer supports a particular header type. It is always possible to send a test invocation of a particular header type to see if support is available. However, when more efficient means are available (e.g., the value of a transport property), this should be noted.

B.3.  Adding New Header Flags to the Protocol

   New flag bits are to be defined similarly to Sections 6.2.2.1 and 6.2.2.2. Each new flag definition should include:

   *  An XDR description of the new flag.

   *  A description of the function and use of the new flag.

   *  An indication for which header types the flag value is meaningful, and for which header types it is an error to set the flag or to leave it unset.

   *  A means to determine whether peers are prepared to receive transport headers with the new flag set.
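
   One way an implementation might apply the last two items in the list above is a per-flag rule table that is consulted before a header is sent. The Python sketch below is illustrative only: the flag bit, the header type labels, and the check_flags helper are hypothetical and do not correspond to any flag or header type defined in this document.

```python
# All names and values here are hypothetical, chosen for illustration.
HYPOTHETICAL_FLAG = 0x00000004  # assumed bit position for a new flag

# For each flag: header types in which it may be set ("allowed") and
# header types in which leaving it unset is an error ("required").
FLAG_RULES = {
    HYPOTHETICAL_FLAG: {
        "allowed": {"hdr_type_a", "hdr_type_b"},
        "required": {"hdr_type_b"},
    },
}


def check_flags(header_type, flags, peer_supported_flags):
    """Return a list of problems; an empty list means the flags are fine."""
    problems = []
    for bit, rule in FLAG_RULES.items():
        is_set = bool(flags & bit)
        if is_set and not (peer_supported_flags & bit):
            problems.append(f"flag {bit:#010x} is not supported by the peer")
        if is_set and header_type not in rule["allowed"]:
            problems.append(f"flag {bit:#010x} must not be set in {header_type}")
        if not is_set and header_type in rule["required"]:
            problems.append(f"flag {bit:#010x} must be set in {header_type}")
    return problems
```

   How peer_supported_flags is learned (for example, from a transport property) is left to the document that defines the flag.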

   Because new flags are OPTIONAL, additional documentation is
   necessary:

   *  The document should discuss constraints on support for the new
      flags. For example, if support for one flag is implied or
      foreclosed by another one, this needs to be documented.

B.4.  Adding New Transport Properties to the Protocol

   A Standards Track document defining a new transport property should
   include the following information, paralleling that provided in this
   document for the transport properties defined herein:

   *  The rpcrdma2_propid value identifying the new property.

   *  The XDR typedef specifying the structure of its property value.

   *  A description of the new property.

   *  An explanation of how the receiver can use this information.

   *  The default value if a peer never receives the new property.

   There is no requirement that propid assignments occur in a contiguous
   range of values. Implementations should not rely on all such values
   being small integers.

   Before the defining Standards Track document is published, the nfsv4
   Working Group should select a unique propid value, and ensure that:

   *  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   *  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 8.2.

   [ cel: There is no section 8.2. ]

   [ cel: Should we request the creation of an IANA registry for
   propid values? ]

   When a Standards Track document proposes additional transport
   properties, reviewers should address possible security issues
   exposed by those new transport properties.

B.5.  Adding New Error Codes to the Protocol

   The same Standards Track document that defines a new header type may
   introduce new error codes used to support it. A Standards Track
   document may similarly define new error codes that an existing header
   type can return.

   For error codes that do not require the return of additional
   information, a peer can use the existing RDMA_ERR2 header type to
   report the new error. The sender sets the new error code as the
   value of rdma_err, with the result that the default switch arm of
   the rpcrdma2_error union (i.e., void) is selected.

   For error codes that do require the return of related information
   together with the error, a new header type should be defined that
   returns the error together with the related information. The sender
   of a new header type needs to be prepared to accept header types
   necessary to report associated errors.

Appendix C.  Differences from RPC-over-RDMA Version 1

   The goal of RPC-over-RDMA version 2 is to relieve certain constraints
   that have become evident in RPC-over-RDMA version 1 with deployment
   experience:

   *  RPC-over-RDMA version 1 has been challenging to update to address
      shortcomings or improve data transfer efficiency.

   *  The average size of NFSv4 COMPOUNDs is significantly greater than
      that of NFSv3 requests, requiring the use of Long messages for
      frequent operations.

   *  Reply size estimation is awkward more often than expected.

   This section details specific changes in RPC-over-RDMA version 2 that
   address these constraints directly.

C.1.  Changes to the XDR Definition

   Several XDR structural changes enable within-version protocol
   extensibility.

   [RFC8166] defines the RPC-over-RDMA version 1 transport header as a
   single XDR object, with an RPC message potentially following it.
   In RPC-over-RDMA version 2, there are separate XDR definitions of the
   transport header prefix (see Section 6.2.2), which specifies the
   transport header type to be used, and the transport header itself
   (defined within one of the subsections of Section 6.3). This
   construction is similar to an RPC message, which consists of an RPC
   header (defined in [RFC5531]) followed by a message defined by an
   Upper-Layer Protocol.

   As a new version of the RPC-over-RDMA transport protocol,
   RPC-over-RDMA version 2 exists within the versioning rules defined in
   [RFC8166]. In particular, it maintains the first four words of the
   protocol header, as specified in Section 4.2 of [RFC8166], even
   though, as explained in Section 6.2.1 of the current document, the
   XDR definition of those words is structured differently.

   Although each of the first four fields retains its semantic function,
   there are differences in interpretation:

   *  The first word of the header, the rdma_xid field, retains the
      format and function that it had in RPC-over-RDMA version 1.
      Because RPC-over-RDMA version 2 can convey non-RPC messages, a
      receiver should not use the contents of this field without
      consideration of the protocol version and header type.

   *  The second word of the header, the rdma_vers field, retains the
      format and function that it had in RPC-over-RDMA version 1. To
      clearly distinguish version 1 and version 2 messages, senders need
      to fill in the correct version (fixed after version negotiation).
      Receivers should check that the content of the rdma_vers field is
      correct before using the content of any other header field.

   *  The third word of the header, the rdma_credit field, retains the
      size and general purpose that it had in RPC-over-RDMA version 1.
      However, RPC-over-RDMA version 2 divides this field into two
      16-bit subfields.
      See Section 4.2.1 for further details.

   *  The fourth word of the header, previously the union discriminator
      field rdma_proc, retains its format and general function even
      though the set of valid values has changed. Within RPC-over-RDMA
      version 2, this word is the rdma_htype field of the structure
      rdma_start. The value of this field is now an unsigned 32-bit
      integer rather than an enum type, to facilitate header type
      extension.

   Beyond conforming to the restrictions specified in [RFC8166],
   RPC-over-RDMA version 2 tightly limits the scope of the changes made
   to ensure interoperability. Version 2 retains all existing transport
   header types used in version 1, as defined in [RFC8166]. It also
   expresses chunks in the same format and uses them the same way.

C.2.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for exchanging an
   implementation's operational properties. The purpose of this
   exchange is to help endpoints improve the efficiency of data transfer
   by exploiting the characteristics of both peers rather than falling
   back on the lowest common denominator default settings. A full
   discussion of transport properties appears in Section 5.

C.3.  Credit Management Changes

   RPC-over-RDMA transports employ credit-based flow control to ensure
   that a Requester does not emit more RDMA Sends than the Responder is
   prepared to receive.

   Section 3.3.1 of [RFC8166] explains the operation of RPC-over-RDMA
   version 1 credit management in detail. In that design, each RDMA
   Send from a Requester contains an RPC Call with a credit request, and
   each RDMA Send from a Responder contains an RPC Reply with a credit
   grant. The credit grant implies that enough Receives have been
   posted on the Responder to handle the credit grant minus the number
   of pending RPC transactions (the number of remaining Receive buffers
   might be zero).
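
   As a non-normative illustration, the version 1 accounting described
   above can be sketched in C from the Requester's point of view. The
   structure and function names are hypothetical and appear in neither
   [RFC8166] nor this document. The sketch assumes that each RPC Reply
   retires exactly one RPC Call and carries a new credit grant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state kept by a version 1 Requester. */
struct ex_credit_state {
    uint32_t granted;     /* most recent credit grant from the Responder */
    uint32_t outstanding; /* RPC Calls sent whose Replies are pending */
};

/* Another RDMA Send is allowed only while the number of pending
 * transactions is below the current credit grant. */
bool ex_can_send(const struct ex_credit_state *cs)
{
    return cs->outstanding < cs->granted;
}

/* Record that one RPC Call was sent; the caller must hold a credit. */
void ex_call_sent(struct ex_credit_state *cs)
{
    assert(ex_can_send(cs));
    cs->outstanding++;
}

/* Each RPC Reply retires one Call and carries a fresh credit grant. */
void ex_reply_received(struct ex_credit_state *cs, uint32_t new_grant)
{
    if (cs->outstanding > 0)
        cs->outstanding--;
    cs->granted = new_grant;
}

/* Receive buffers implied to remain on the Responder: the grant minus
 * the pending transactions. The result may be zero. */
uint32_t ex_remaining_receives(const struct ex_credit_state *cs)
{
    return cs->granted > cs->outstanding ?
           cs->granted - cs->outstanding : 0;
}
```

   With a grant of two, the Requester in this sketch may send two Calls
   and must then wait. Only an RPC Reply changes its state, which is
   the coupling between RDMA Sends and RPC messages that version 2
   relaxes.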

   Each RPC Reply acts as an implicit ACK for a previous RPC Call from
   the Requester. Without an RPC Reply message, the Requester has no
   way to know that the Responder is ready for subsequent RPC Calls.

   Because version 1 embeds credit management in each message, there is
   a strict one-to-one ratio between RDMA Sends and RPC messages. There
   are interesting use cases that might be enabled if this relationship
   were more flexible:

   *  RPC-over-RDMA operations that do not carry an RPC message, e.g.,
      control plane operations.

   *  A single RDMA Send that conveys more than one RPC message, e.g.,
      for interrupt mitigation.

   *  An RPC message that requires several sequential RDMA Sends, e.g.,
      to reduce the use of explicit RDMA operations for moderate-sized
      RPC messages.

   *  An RPC transaction that requires multiple exchanges or an odd
      number of RPC-over-RDMA operations to complete.

   RPC-over-RDMA version 2 provides a more sophisticated credit
   accounting mechanism to address these shortcomings. Section 4.2.1
   explains the new mechanism in detail.

C.4.  Inline Threshold Changes

   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed on an RDMA connection using only RDMA Send and
   Receive. Each connection has two inline threshold values: one for
   messages flowing from client-to-server (referred to as the
   "client-to-server inline threshold") and one for messages flowing
   from server-to-client (referred to as the "server-to-client inline
   threshold").

   A connection's inline thresholds determine, among other things, when
   RDMA Read or Write operations are required because an RPC message
   cannot be conveyed via a single RDMA Send and Receive pair.
   When such an RPC message does not contain DDP-eligible data items, a
   Requester can prepare a Special Format Call or Reply to convey the
   whole RPC message using RDMA Read or Write operations.

   RDMA Read and Write operations require that data payloads reside in
   memory registered with the local RNIC. When an RPC completes, that
   memory is invalidated to fence it from the Responder. Memory
   registration and invalidation typically have a latency cost that is
   insignificant compared to data handling costs.

   When a data payload is small, however, the cost of registering and
   invalidating memory where the payload resides becomes a significant
   part of total RPC latency. Therefore, the most efficient operation
   of an RPC-over-RDMA transport occurs when the peers use explicit RDMA
   Read and Write operations for large payloads but avoid those
   operations for small payloads.

   When the authors of [RFC8166] first conceived RPC-over-RDMA version
   1, the average size of RPC messages that did not involve a
   significant data payload was under 500 bytes. A 1024-byte inline
   threshold adequately minimized the frequency of inefficient Long
   messages.

   With NFS version 4 [RFC7530], the increased size of NFS COMPOUND
   operations resulted in RPC messages that are, on average, larger than
   those of previous versions of NFS. With a 1024-byte inline
   threshold, frequent operations such as GETATTR and LOOKUP require
   RDMA Read or Write operations, reducing the efficiency of data
   transport.

   To reduce the frequency of Special Format messages, RPC-over-RDMA
   version 2 increases the default size of inline thresholds. This
   change also increases the maximum size of reverse-direction RPC
   messages.

C.5.  Message Continuation Changes

   In addition to a larger default inline threshold, RPC-over-RDMA
   version 2 introduces Message Continuation.
   Message Continuation is a mechanism that enables the transmission of
   a data payload using more than one RDMA Send. The purpose of Message
   Continuation is to provide relief in several essential cases:

   *  If a Requester finds that it is inefficient to convey a
      moderately-sized data payload using Read chunks, the Requester can
      use Message Continuation to send the RPC Call.

   *  If a Requester has provided insufficient Reply chunk space for a
      Responder to send an RPC Reply, the Responder can use Message
      Continuation to send the RPC Reply.

   *  If a sender has to convey a sizeable non-RPC data payload (e.g., a
      large transport property), the sender can use Message Continuation
      to avoid having to register memory.

C.6.  Host Authentication Changes

   For the general operation of NFS on open networks, we eventually
   intend to rely on RPC-on-TLS [I-D.ietf-nfsv4-rpc-tls] to provide
   cryptographic authentication of the two ends of each connection. In
   turn, this can improve the trustworthiness of AUTH_SYS-style user
   identities that flow on TCP, which are not cryptographically
   protected. We do not have a similar solution for RPC-over-RDMA,
   however.

   Here, the RDMA transport layer already provides a strong guarantee of
   message integrity. On some network fabrics, IPsec or TLS can protect
   the privacy of in-transit data. However, this is not the case for
   all fabrics (e.g., InfiniBand [IBA]).

   Thus, RPC-over-RDMA version 2 introduces a mechanism for
   authenticating connection peers (see Section 5.2.6). As with GSS
   channel binding, there is also a way to determine when the use of
   host authentication is unnecessary.

C.7.  Support for Remote Invalidation

   When an RDMA consumer uses FRWR or Memory Windows to register memory,
   that memory may be invalidated remotely [RFC5040].
   These mechanisms are available when a Requester's RNIC supports
   MEM_MGT_EXTENSIONS.

   For this discussion, there are two classes of STags.
   Dynamically-registered STags appear in a single RPC, then are
   invalidated. Persistently-registered STags survive longer than one
   RPC. They may persist for the life of an RPC-over-RDMA connection or
   even longer.

   An RPC-over-RDMA Requester can provide more than one STag in a
   transport header. It may provide a combination of dynamically- and
   persistently-registered STags in one RPC message, or any combination
   of these in a series of RPCs on the same connection. Only
   dynamically-registered STags using Memory Windows or FRWR may be
   invalidated remotely.

   There is no transport-level mechanism by which a Responder can
   determine how a Requester-provided STag was registered, nor whether
   it is eligible to be invalidated remotely. A Requester that mixes
   persistently- and dynamically-registered STags in one RPC, or mixes
   them across RPCs on the same connection, must, therefore, indicate
   which STag the Responder may invalidate remotely via a mechanism
   provided in the Upper-Layer Protocol. RPC-over-RDMA version 2
   provides such a mechanism.

   A sender uses the RDMA Send With Invalidate operation to invalidate
   an STag on the remote peer. It is available only when both peers
   support MEM_MGT_EXTENSIONS (that is, can send and process an IETH).

   Existing RPC-over-RDMA transport protocol specifications [RFC8166]
   [RFC8167] do not forbid direct data placement in the reverse
   direction. However, there is currently no Upper-Layer Protocol that
   makes data items in reverse direction operations eligible for direct
   data placement.
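
   The eligibility rules given earlier in this section can be
   summarized in a short, non-normative C sketch of the checks a
   Requester might apply before indicating that an STag may be
   invalidated remotely. All type and function names are hypothetical.
   The sketch assumes the Requester tracks how and for how long each
   STag was registered, since a Responder cannot discover this itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration bookkeeping kept by a Requester. */
enum ex_reg_method { EX_REG_FRWR, EX_REG_MEMORY_WINDOW, EX_REG_OTHER };
enum ex_reg_lifetime { EX_REG_DYNAMIC, EX_REG_PERSISTENT };

struct ex_stag {
    uint32_t handle;               /* the STag value itself */
    enum ex_reg_method method;     /* how the memory was registered */
    enum ex_reg_lifetime lifetime; /* one RPC, or longer than one RPC */
};

/* Only dynamically-registered STags created with FRWR or Memory
 * Windows may be invalidated remotely. */
bool ex_stag_remote_inval_ok(const struct ex_stag *s)
{
    assert(s != NULL);
    return s->lifetime == EX_REG_DYNAMIC &&
           (s->method == EX_REG_FRWR ||
            s->method == EX_REG_MEMORY_WINDOW);
}

/* Send With Invalidate is usable only when both peers support
 * MEM_MGT_EXTENSIONS and the STag itself is eligible. */
bool ex_may_send_with_inval(bool local_mme, bool remote_mme,
                            const struct ex_stag *s)
{
    return local_mme && remote_mme && ex_stag_remote_inval_ok(s);
}
```

   In this sketch, a persistently-registered STag is never reported as
   eligible, whatever the capabilities of the two RNICs.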

   When chunks are present in a reverse direction RPC request, Remote
   Invalidation enables the Responder to trigger invalidation of a
   Requester's STags as part of sending an RPC Reply, the same way as is
   done in the forward direction.

   However, in the reverse direction, the server acts as the Requester,
   and the client is the Responder. The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered its
   STags with an appropriate registration mechanism.

C.8.  Error Reporting Changes

   RPC-over-RDMA version 2 expands the repertoire of errors that
   connection peers may report to each other. The goals of this
   expansion are:

   *  To fill in details of peer recovery actions.

   *  To enable retries after certain conditions caused by
      mis-estimation of the maximum reply size.

   *  To minimize the likelihood of a Requester waiting forever for a
      Reply when there are communications problems that prevent the
      Responder from sending it.

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC
   5666). The authors also wish to thank Bill Baker, Greg Marsden, and
   Matt Benjamin for their support of this work.

   The XDR extraction conventions were first described by the authors of
   the NFS version 4.1 XDR specification [RFC5662]. Herbert van den
   Bergh suggested the replacement sed script used in this document.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler, David Noveck, and Brian
   Pawlowski, and NFSV4 Working Group Secretary Thomas Haynes for their
   support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com

   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA 02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com