Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: July 20, 2020                                            NetApp
                                                        January 17, 2020


                    RPC-over-RDMA Version 2 Protocol
                draft-ietf-nfsv4-rpcrdma-version-two-01

Abstract

   This document specifies the second version of a transport protocol
   that conveys Remote Procedure Call (RPC) messages using Remote Direct
   Memory Access (RDMA). This version of the protocol is extensible.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 20, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Terminology
     3.1.  Remote Procedure Calls
       3.1.1.  Upper-Layer Protocols
       3.1.2.  Requesters and Responders
       3.1.3.  RPC Transports
       3.1.4.  External Data Representation
     3.2.  Remote Direct Memory Access
       3.2.1.  Direct Data Placement
       3.2.2.  RDMA Transport Requirements
   4.  RPC-over-RDMA Framework
     4.1.  Message Framing
     4.2.  Managing Receiver Resources
       4.2.1.  Flow Control
       4.2.2.  Inline Threshold
       4.2.3.  Initial Connection State
     4.3.  XDR Encoding with Chunks
       4.3.1.  Reducing an XDR Stream
       4.3.2.  DDP-Eligibility
       4.3.3.  RDMA Segments
       4.3.4.  Chunks
       4.3.5.  Read Chunks
       4.3.6.  Write Chunks
     4.4.  Payload Format
       4.4.1.  Simple Format
       4.4.2.  Continued Format
       4.4.3.  Special Format
     4.5.  Reverse-Direction Operation
       4.5.1.  Sending a Reverse-Direction RPC Call
       4.5.2.  Sending a Reverse-Direction RPC Reply
       4.5.3.  In the Absence of Support For Reverse-Direction
               Operation
       4.5.4.  Using Chunks During Reverse-Direction Operation
       4.5.5.  Reverse-Direction Retransmission
   5.  Transport Properties
     5.1.  Transport Properties Model
     5.2.  Current Transport Properties
       5.2.1.  Maximum Send Size
       5.2.2.  Receive Buffer Size
       5.2.3.  Maximum RDMA Segment Size
       5.2.4.  Maximum RDMA Segment Count
       5.2.5.  Reverse-Direction Support
       5.2.6.  Host Authentication Message
   6.  Transport Messages
     6.1.  Transport Header Types
     6.2.  Headers and Chunks
       6.2.1.  Common Transport Header Prefix
       6.2.2.  Transport Header Prefix
       6.2.3.  External Data Payloads
       6.2.4.  Remote Invalidation
     6.3.  Header Types
       6.3.1.  RDMA2_MSG: Convey RPC Message Inline
       6.3.2.  RDMA2_NOMSG: Convey External RPC Message
       6.3.3.  RDMA2_ERROR: Report Transport Error
       6.3.4.  RDMA2_CONNPROP: Exchange Transport Properties
     6.4.  Choosing a Reply Mechanism
   7.  Error Handling
     7.1.  Basic Transport Stream Parsing Errors
       7.1.1.  RDMA2_ERR_VERS
       7.1.2.  RDMA2_ERR_INVAL_HTYPE
       7.1.3.  RDMA2_ERR_INVAL_CONT
     7.2.  XDR Errors
       7.2.1.  RDMA2_ERR_BAD_XDR
       7.2.2.  RDMA2_ERR_BAD_PROPVAL
     7.3.  Responder RDMA Operational Errors
       7.3.1.  RDMA2_ERR_READ_CHUNKS
       7.3.2.  RDMA2_ERR_WRITE_CHUNKS
       7.3.3.  RDMA2_ERR_SEGMENTS
       7.3.4.  RDMA2_ERR_WRITE_RESOURCE
       7.3.5.  RDMA2_ERR_REPLY_RESOURCE
     7.4.  Other Operational Errors
       7.4.1.  RDMA2_ERR_SYSTEM
     7.5.  RDMA Transport Errors
   8.  XDR Protocol Definition
     8.1.  Code Component License
     8.2.  Extraction of the XDR Definition
     8.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures
     8.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types
     8.5.  Use of the XDR Description
   9.  RPC Bind Parameters
   10. Implementation Status
   11. Security Considerations
     11.1.  Memory Protection
       11.1.1.  Protection Domains
       11.1.2.  Handle (STag) Predictability
       11.1.3.  Memory Protection
       11.1.4.  Denial of Service
     11.2.  RPC Message Security
       11.2.1.  RPC-over-RDMA Protection at Other Layers
       11.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports
     11.3.  Transport Properties
     11.4.  Host Authentication
   12. IANA Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  ULB Specifications
     A.1.  DDP-Eligibility
     A.2.  Maximum Reply Size
     A.3.  Reverse-Direction Operation
     A.4.  Additional Considerations
     A.5.  ULP Extensions
   Appendix B.  Extending RPC-over-RDMA Version 2
     B.1.  Documentation Requirements
     B.2.  Adding New Header Types to RPC-over-RDMA Version 2
     B.3.  Adding New Header Flags to the Protocol
     B.4.  Adding New Transport properties to the Protocol
     B.5.  Adding New Error Codes to the Protocol
   Appendix C.  Differences from RPC-over-RDMA Version 1
     C.1.  Changes to the XDR Definition
     C.2.  Transport Properties
     C.3.  Credit Management Changes
     C.4.  Inline Threshold Changes
     C.5.  Message Continuation Changes
     C.6.  Host Authentication Changes
     C.7.  Support for Remote Invalidation
     C.8.  Error Reporting Changes
   Acknowledgments
   Authors' Addresses

1.  Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a
   technique for moving data efficiently between network nodes.
   By placing transferred data directly into destination buffers using
   Direct Memory Access, RDMA delivers the reciprocal benefits of faster
   data transfer and reduced host CPU overhead.

   Open Network Computing Remote Procedure Call (ONC RPC, often
   shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure
   Call protocol that runs over a variety of transports. Most RPC
   implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, a
   datagram encapsulates each RPC message. Within a TCP byte stream, a
   record marking protocol delineates RPC messages.

   An RDMA transport, too, conveys RPC messages in a fashion that must
   be fully defined if RPC implementations are to interoperate when
   using RDMA to transport RPC transactions. Although RDMA transports
   encapsulate messages like UDP, they deliver them reliably and in
   order, like TCP. Further, they implement a bulk data transfer
   service not provided by traditional network transports. Therefore,
   we treat RDMA as a novel transport type for RPC.

   The RPC-over-RDMA transport introduced in the current document can
   transparently support any RPC application. The current document
   describes mechanisms that enable further optimization of data
   transfer when RPC applications are structured to exploit direct data
   placement. In this context, the Network File System (NFS) protocols,
   as described in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and
   subsequent NFSv4 minor versions, are all potential beneficiaries of
   RPC-over-RDMA. A complete problem statement appears in [RFC5532].

   Storage administrators have broadly deployed the RPC-over-RDMA
   version 1 protocol specified in [RFC8166]. However, there are known
   shortcomings to this protocol:

   o  The protocol's default size of Receive buffers forces the use of
      RDMA Read and Write transfers for small payloads, and limits the
      size of reverse-direction messages.

   o  It is difficult to make optimizations or protocol fixes that
      require changes to on-the-wire behavior.

   o  For some RPC procedures, the maximum reply size is difficult or
      impossible for an RPC client to estimate in advance.

   To address these issues in a way that preserves interoperation with
   existing RPC-over-RDMA version 1 deployments, we present a second
   version of the RPC-over-RDMA transport protocol in the current
   document.

   The version of RPC-over-RDMA presented here is extensible, enabling
   the introduction of OPTIONAL extensions without impacting existing
   implementations. See Appendix C.1 for further discussion. It
   introduces a mechanism to exchange implementation properties to
   automatically provide further optimization of data transfer.

   This version also contains incremental changes that relieve
   performance constraints and enable recovery from unusual corner
   cases. These changes are outlined in Appendix C and include a larger
   default inline threshold, the ability to convey a single RPC message
   using multiple RDMA Send operations, support for authentication of
   connection peers, richer error reporting, improved credit-based flow
   control, and support for Remote Invalidation.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Terminology

3.1.  Remote Procedure Calls

   This section highlights critical elements of the RPC protocol
   [RFC5531] and the External Data Representation (XDR) [RFC4506] it
   uses. RPC-over-RDMA version 2 enables the transmission of RPC
   messages built using XDR and also uses XDR internally to describe its
   header formats.
   The remainder of this document requires an understanding of RPC and
   its use of XDR.

3.1.1.  Upper-Layer Protocols

   RPCs are an abstraction used to implement the operations of an
   Upper-Layer Protocol (ULP). For RPC-over-RDMA, "ULP" refers to an
   RPC Program and Version tuple, which is a versioned set of procedure
   calls that comprise a single well-defined API. One example of a ULP
   is the Network File System Version 4.0 [RFC7530]. In the current
   document, the term "RPC consumer" refers to an implementation of a
   ULP running on an RPC client.

3.1.2.  Requesters and Responders

   Like a local procedure call, every RPC procedure has a set of
   "arguments" and a set of "results". A calling context invokes a
   procedure, passing arguments to it, and the procedure subsequently
   returns a set of results. Unlike a local procedure call, the called
   procedure is executed remotely rather than in the local application's
   execution context.

   The RPC protocol as described in [RFC5531] is fundamentally a
   message-passing protocol between one or more clients, where RPC
   consumers are running, and a server, where a remote execution context
   is available to process RPC transactions on behalf of these
   consumers.

   ONC RPC transactions consist of two types of messages:

   o  A CALL message, or "Call", requests work. An RPC Call message is
      designated by the value zero (0) in the message's msg_type field.
      The sender places a unique 32-bit value in the message's XID field
      to match this RPC Call message to a corresponding RPC Reply
      message.

   o  A REPLY message, or "Reply", reports the results of work requested
      by an RPC Call message. An RPC Reply message is designated by the
      value one (1) in the message's msg_type field. The sender copies
      the value contained in an RPC Reply message's XID field from the
      RPC Call message whose results the sender is reporting.
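
   The two message types above can be illustrated with a minimal
   sketch. It assumes the [RFC5531] layout in which an RPC message
   begins with the XID followed by the msg_type field, each encoded as
   a big-endian 32-bit XDR word. The function names are hypothetical
   and are not defined by any specification; the rest of the RPC header
   is omitted.

```python
import struct

# msg_type values given in the text above.
CALL = 0
REPLY = 1


def encode_prefix(xid, msg_type):
    """Encode the first two XDR words of an RPC message: XID, msg_type."""
    return struct.pack(">II", xid, msg_type)


def decode_prefix(buf):
    """Return (xid, msg_type) from the start of an encoded RPC message."""
    return struct.unpack_from(">II", buf, 0)


def make_reply_prefix(call_buf):
    """Build the start of a Reply by copying the XID from a Call."""
    xid, msg_type = decode_prefix(call_buf)
    if msg_type != CALL:
        raise ValueError("buffer does not hold an RPC Call")
    return encode_prefix(xid, REPLY)
```

   A Responder in this sketch never chooses an XID of its own for a
   Reply; it only echoes the value the Requester placed in the Call.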

   Each RPC client endpoint acts as a "Requester", which serializes the
   procedure's arguments and conveys them to a server endpoint via an
   RPC Call message. A Call message contains an RPC protocol header, a
   header describing the requested upper-layer operation, and all
   arguments.

   An RPC server endpoint acts as a "Responder", which deserializes the
   arguments and processes the requested operation. It then serializes
   the operation's results into an RPC Reply message. An RPC Reply
   message contains an RPC protocol header, a header describing the
   upper-layer reply, and all results.

   The Requester deserializes the results and allows the RPC consumer to
   proceed. At this point, the RPC transaction designated by the XID in
   the RPC Call message is complete, and the XID is retired.

   In summary, Requesters send RPC Call messages to Responders to
   initiate RPC transactions. Responders send RPC Reply messages to
   Requesters to complete the processing of an RPC transaction.

3.1.3.  RPC Transports

   The role of an "RPC transport" is to mediate the exchange of RPC
   messages between Requesters and Responders. An RPC transport bridges
   the gap between the RPC message abstraction and the native operations
   of a network transport (e.g., a socket).

   RPC-over-RDMA is a connection-oriented RPC transport. When a
   transport type is connection-oriented, clients initiate transport
   connections, while servers wait passively to accept incoming
   connection requests.

3.1.3.1.  Forward Direction

   Traditionally, an RPC client acts as a Requester, while an RPC
   service acts as a Responder. The current document refers to this
   form of RPC message passing as "forward-direction" operation.

3.1.3.2.  Reverse-Direction

   The RPC specification [RFC5531] does not forbid passing RPC messages
   in the other direction. An RPC service endpoint can act as a
   Requester, in which case an RPC client endpoint acts as a Responder.
   This form of message passing is known as "reverse-direction"
   operation.

   During reverse-direction operation, an RPC client is responsible for
   establishing transport connections, even though the RPC server
   originates RPC Calls.

   RPC clients and servers are usually optimized to perform and scale
   well when handling traffic in the forward direction. They might not
   be prepared to handle operation in the reverse direction. Not until
   NFS version 4.1 [RFC5661] has there been a strong need to handle
   reverse-direction operation.

3.1.3.3.  Bi-directional Operation

   A pair of connected RPC endpoints may choose to use only
   forward-direction or only reverse-direction operation on a particular
   transport connection. Or, these endpoints may send Calls in both
   directions concurrently on the same transport connection.

   "Bi-directional operation" occurs when each transport endpoint acts
   as both a Requester and a Responder at the same time on a single
   connection.

   Bi-directionality is an extension of RPC transport connection
   sharing. Two RPC endpoints wish to exchange independent RPC messages
   over a shared connection but in opposite directions. These messages
   may or may not be related to the same workloads or RPC Programs.

3.1.3.4.  XID Values

   Section 9 of [RFC5531] introduces the RPC transaction identifier, or
   "XID" for short. A connection peer interprets the value of an XID in
   the context of the message's msg_type field.

   o  The XID of a Call is arbitrary but is unique among outstanding
      Calls from that Requester on that connection.

   o  The XID of a Reply always matches that of the initiating Call.

   After receiving a Reply, a Requester matches the XID value in that
   Reply with a Call it previously sent.
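
   One way a Requester might track its outstanding Calls is sketched
   below. This is an illustration only: the class name, its methods,
   and the sequential XID allocation policy are assumptions, since the
   text requires only that an XID be unique among the outstanding Calls
   from that Requester on that connection.

```python
import itertools


class PendingCalls:
    """Outstanding Calls sent by one Requester on one connection."""

    def __init__(self, first_xid=1):
        # Sequential allocation is an arbitrary choice for this sketch.
        self._counter = itertools.count(first_xid)
        self._outstanding = {}

    def send_call(self, context):
        """Pick an XID not used by any outstanding Call and record it."""
        # Loops forever only if all 2**32 XIDs are outstanding at once.
        while True:
            xid = next(self._counter) & 0xFFFFFFFF
            if xid not in self._outstanding:
                break
        self._outstanding[xid] = context
        return xid

    def receive_reply(self, xid):
        """Match a Reply to its Call and retire the XID.

        Returns the saved context, or None when no outstanding Call
        carries this XID.
        """
        return self._outstanding.pop(xid, None)
```

   The table holds only Calls that this endpoint sent. Incoming Calls
   from the peer are not looked up in it, which is consistent with
   interpreting an XID in the context of the msg_type field.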

   During bi-directional operation, forward- and reverse-direction XIDs
   are typically generated on distinct hosts by possibly different
   algorithms. There is no coordination between the generation of XIDs
   used in forward-direction and reverse-direction operation.

   Therefore, a forward-direction Requester MAY use the same XID value
   at the same time as a reverse-direction Requester on the same
   transport connection. Although such concurrent requests use the same
   XID value, they represent distinct RPC transactions.

3.1.4.  External Data Representation

   One cannot assume that all Requesters and Responders represent data
   objects in the same way internally. RPC uses External Data
   Representation (XDR) to translate native data types and serialize
   arguments and results [RFC4506].

   XDR encodes data independently of the endianness or size of
   host-native data types, enabling unambiguous decoding of data by a
   receiver.

   XDR assumes only that the number of bits in a byte (octet) and their
   order are the same on both endpoints and the physical network. The
   smallest indivisible unit of XDR encoding is a group of four octets.
   XDR can also flatten lists, arrays, and other complex data types into
   a stream of bytes.

   We refer to a serialized stream of bytes that is the result of XDR
   encoding as an "XDR stream". A sender encodes native data into an
   XDR stream and then transmits that stream to a receiver. The
   receiver decodes incoming XDR byte streams into its native data
   representation format.

3.1.4.1.  XDR Opaque Data

   Sometimes, a data item is to be transferred as-is, without encoding
   or decoding. We refer to the contents of such a data item as "opaque
   data". XDR encoding places the content of opaque data items directly
   into an XDR stream without altering it in any way. ULPs or
   applications perform any needed data translation in this case.
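
   As a rough illustration of these encoding rules, the sketch below
   encodes an unsigned integer and a variable-length opaque item. It
   assumes the big-endian, four-octet unit encoding of [RFC4506]; the
   function names are hypothetical. The zero padding it appends is the
   XDR roundup described in Section 3.1.4.2.

```python
import struct


def xdr_uint32(value):
    """Encode an unsigned integer as one big-endian four-octet unit."""
    return struct.pack(">I", value)


def xdr_opaque(data):
    """Encode variable-length opaque data.

    The octet count comes first, then the data unaltered, then zero
    octets up to the next four-octet boundary.
    """
    pad = (4 - len(data) % 4) % 4
    return xdr_uint32(len(data)) + data + b"\x00" * pad


def xdr_decode_opaque(buf, offset=0):
    """Decode an opaque item; return (data, offset of the next item)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    data = buf[start:start + length]
    next_offset = start + length + (4 - length % 4) % 4
    return data, next_offset
```

   The decoder returns only the original octets, so padding is never
   exposed to the caller, and the next item always starts on a
   four-octet boundary.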

   Examples of opaque data items include the content of files or generic
   byte strings.

3.1.4.2.  XDR Roundup

   The number of octets in a variable-length data item precedes that
   item in an XDR stream. If the size of an encoded data item is not a
   multiple of four octets, the sender appends octets containing zero
   after the end of the data item. These zero octets shift the next
   encoded data item in the XDR stream so that it always starts on a
   four-octet boundary. The addition of extra octets does not change
   the encoded size of the data item. Receivers do not expose the extra
   octets to ULPs.

   We refer to this technique as "XDR roundup", and the extra octets as
   "XDR roundup padding".

3.2.  Remote Direct Memory Access

   When a third party transfers large RPC payloads, RPC Requesters and
   Responders can become more efficient. An example of such a third
   party might be an intelligent network interface (data movement
   offload), which places data in the receiver's memory so that no
   additional adjustment of data alignment is necessary (direct data
   placement or "DDP"). RDMA transports enable both of these
   optimizations.

   In the current document, "RDMA" refers to the physical mechanism an
   RDMA transport utilizes when moving data.

3.2.1.  Direct Data Placement

   Typically, RPC implementations copy the contents of RPC messages into
   a buffer before sending them. An efficient RPC implementation sends
   bulk data without copying it into a separate send buffer first.

   However, socket-based RPC implementations are often unable to receive
   data directly into its final place in memory. Receivers often need
   to copy incoming data to finish an RPC operation, if only to adjust
   data alignment.

   Although it may not be efficient, before an RDMA transfer, a sender
   may copy data into an intermediate buffer. After an RDMA transfer, a
   receiver may copy that data again to its final destination. In this
   document, the term "DDP" refers to any optimized data transfer where
   a receiving host's CPU does not move transferred data to another
   location after arrival.

   RPC-over-RDMA version 2 enables the use of RDMA Read and Write
   operations to achieve both data movement offload and DDP. However,
   note that not all RDMA-based data transfer qualifies as DDP, and some
   mechanisms that do not employ explicit RDMA can place data directly.

3.2.2.  RDMA Transport Requirements

   RDMA transports require that RDMA consumers provision resources in
   advance to achieve good performance during receive operations. An
   RDMA consumer might provide Receive buffers in advance by posting an
   RDMA Receive Work Request for every expected RDMA Send from a remote
   peer. These buffers are provided before the remote peer posts RDMA
   Send Work Requests. Thus, this technique is often referred to as
   "pre-posting" buffers.

   An RDMA Receive Work Request remains outstanding until the RDMA
   provider matches it to an inbound Send operation. The resources
   associated with that Receive must be retained in host memory, or
   "pinned", until the Receive completes.

   Given these tenets of operation, the RPC-over-RDMA version 2 protocol
   assumes each transport provides the following abstract operations. A
   more complete discussion of these operations appears in [RFC5040].

3.2.2.1.  Memory Registration

   Memory registration assigns a steering tag to a region of memory,
   permitting the RDMA provider to perform data-transfer operations.
   The RPC-over-RDMA version 2 protocol assumes that a steering tag of
   no more than 32 bits and memory addresses of up to 64 bits in length
   identify each registered memory region.

3.2.2.2.  RDMA Send

   The RDMA provider supports an RDMA Send operation, with completion
   signaled on the receiving peer after the RDMA provider has placed
   data in a pre-posted buffer. Sends complete at the receiver in the
   order they were posted at the sender. The size of the remote peer's
   pre-posted buffers limits the amount of data that can be transferred
   by a single RDMA Send operation.

3.2.2.3.  RDMA Receive

   The RDMA provider supports an RDMA Receive operation to receive data
   conveyed by incoming RDMA Send operations. To reduce the amount of
   memory that must remain pinned awaiting incoming Sends, the amount of
   memory posted per Receive is limited. The RDMA consumer (in this
   case, the RPC-over-RDMA version 2 protocol) provides flow control to
   prevent overrunning receiver resources.

3.2.2.4.  RDMA Write

   The RDMA provider supports an RDMA Write operation to place data
   directly into a remote memory region. The local host initiates an
   RDMA Write and the RDMA provider signals completion there. The
   remote RDMA provider does not signal completion on the remote peer.
   The local host provides the steering tag, the memory address, and the
   length of the remote peer's memory region.

   RDMA Writes are not ordered relative to one another, but are ordered
   relative to RDMA Sends. Thus, a subsequent RDMA Send completion
   signaled on the local peer guarantees that prior RDMA Write data has
   been successfully placed in the remote peer's memory.

3.2.2.5.  RDMA Read

   The RDMA provider supports an RDMA Read operation to place remote
   source data directly into local memory. The local host initiates an
   RDMA Read and the RDMA provider signals completion there. The
   remote RDMA provider does not signal completion on the remote peer.
   The local host provides the steering tags, the memory addresses, and
   the lengths for the remote source and local destination memory
   regions.
559 The RDMA consumer (in this case, the RPC-over-RDMA version 2 560 protocol) signals Read completion to the remote peer as part of a 561 subsequent RDMA Send message. The remote peer can then invalidate 562 steering tags and subsequently free associated source memory regions. 564 4. RPC-over-RDMA Framework 566 Before an RDMA data transfer can occur, an endpoint first exposes 567 regions of its memory to a remote endpoint. The remote endpoint then 568 initiates RDMA Read and Write operations against the exposed memory. 569 A "transfer model" designates which endpoint exposes its memory and 570 which is responsible for initiating the transfer of data. 572 In RPC-over-RDMA version 2, only Requesters expose their memory to 573 the Responder, and only Responders initiate RDMA Read and Write 574 operations. Read access to memory regions enables the Responder to 575 pull RPC arguments or whole RPC Calls from each Requester. The 576 Responder pushes RPC results or whole RPC Replies to a Requester's 577 memory regions to which it has write access. 579 4.1. Message Framing 581 Each RPC-over-RDMA version 2 message consists of at most two XDR 582 streams: 584 o The "Transport stream" contains a header that describes and 585 controls the transfer of the Payload stream in this RPC-over-RDMA 586 message. Every RDMA Send on an RPC-over-RDMA version 2 connection 587 MUST begin with a Transport stream. 589 o The "Payload stream" contains part or all of a single RPC message. 590 The sender MAY divide an RPC message at any convenient boundary 591 but MUST send RPC message fragments in XDR stream order and MUST 592 NOT interleave Payload streams from multiple RPC messages. The 593 RPC-over-RDMA version 2 message carrying the final part of an RPC 594 message is marked (see Section 6.2.2.2). 596 The RPC-over-RDMA framing mechanism described in this section 597 replaces all other RPC framing mechanisms. 
   Connection peers use RPC-over-RDMA framing even when the underlying
   RDMA protocol runs on a transport type with well-defined RPC framing,
   such as TCP. However, a ULP can negotiate the use of RDMA,
   dynamically enabling the use of RPC-over-RDMA on a connection
   established on some other transport type. Because RPC framing
   delimits an entire RPC request or reply, the resulting shift in
   framing must occur between distinct RPC messages, and in concert with
   the underlying transport.

4.2. Managing Receiver Resources

   If any pre-posted Receive buffer on the connection is not large
   enough to accept an incoming RDMA Send, the RDMA provider can
   terminate the connection. Likewise, if a pre-posted Receive buffer
   is not available to accept an incoming RDMA Send, the RDMA provider
   can terminate the connection. Therefore, a sender needs to respect
   the resource limits of its peer receiver to ensure the longevity of
   each connection. Two operational parameters communicate these limits
   between connection peers: flow control and inline threshold.

4.2.1. Flow Control

   RPC-over-RDMA requires reliable and in-order delivery of data
   payloads. Therefore, RPC-over-RDMA transports MUST use the RDMA RC
   (Reliable Connected) Queue Pair (QP) type. The use of an RC QP
   ensures in-transit data integrity and proper recovery from packet
   loss or misordering.

   However, RPC-over-RDMA itself provides a flow control mechanism to
   prevent a sender from overwhelming receiver resources. RPC-over-RDMA
   transports employ end-to-end credit-based flow control for this
   purpose [CBFC]. Credit-based flow control is relatively simple,
   providing robust operation in the face of bursty traffic and
   automated management of receive buffer allocation.

4.2.1.1. Granting Credits

   An RPC-over-RDMA version 2 credit is the capability to receive one
   RPC-over-RDMA version 2 message.
   This arrangement enables RPC-over-RDMA version 2 to support
   asymmetrical operation, where a message in one direction might
   trigger zero, one, or multiple messages in the other direction in
   response.

   To achieve this, each posted Receive buffer on both connection peers
   receives one credit. Each Requester has a set of Receive credits,
   and each Responder has a set of Receive credits. These credit values
   are managed independently of one another.

   Section 7 of [RFC8166] requires that the 32-bit field containing the
   credit grant is the third word in the transport header. To conform
   with that requirement, senders encode the two independent credit
   values into a single 32-bit field in the fixed portion of the
   transport header. At the receiver, the low-order two bytes are the
   number of credits that are newly granted by the sender. The granted
   credit value MUST NOT be zero since such a value would result in
   deadlock. The high-order two bytes are the maximum number of credits
   that can be outstanding at the sender.

   A sender must avoid posting more RDMA Send messages than the
   receiver's granted credit limit. If the sender exceeds the granted
   value, the RDMA provider might signal an error, possibly terminating
   the connection.

   The granted credit values MAY be adjusted to match the needs or
   policies in effect on either peer. For instance, a peer may reduce
   its granted credit value to accommodate the available resources in a
   Shared Receive Queue.

   Certain RDMA implementations may impose additional flow-control
   restrictions, such as limits on RDMA Read operations in progress at
   the Responder. Accommodation of such restrictions is considered the
   responsibility of each RPC-over-RDMA version 2 implementation.

4.2.1.2. Asynchronous Credit Grants

   A special protocol convention enables one peer to refresh its credit
   grant to the other peer without sending a payload.
   Messages of this type can also act as a keep-alive ping. See
   Section 6.3.2 for information about this convention.

   Receivers MUST always be in a position to receive one such credit
   grant update message, in addition to payload-bearing messages, to
   prevent transport deadlock. One way a receiver can do this is to
   post one more RDMA Receive than the credit value the receiver
   granted.

4.2.2. Inline Threshold

   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed in one direction between peer implementations
   using RDMA Send and Receive channel operations. An inline threshold
   value is less than the largest number of octets the sender can post
   in a single RDMA Send operation. It is also less than the largest
   number of octets the receiver can reliably accept via a single RDMA
   Receive operation.

   Each connection has two inline threshold values. There is one for
   messages flowing from Requester-to-Responder, referred to as the
   "call inline threshold", and one for messages flowing from
   Responder-to-Requester, referred to as the "reply inline threshold."

   Peers can advertise their inline threshold values via RPC-over-RDMA
   version 2 Transport Properties (see Section 5). In the absence of an
   exchange of Transport Properties, connection peers MUST assume both
   inline thresholds are 4096 octets.

4.2.3. Initial Connection State

   When an RPC-over-RDMA version 2 client establishes a connection to a
   server, its first order of business is to determine the server's
   highest supported protocol version.

   Upon connection establishment, a client MUST send only a single
   RPC-over-RDMA message until it receives a valid RPC-over-RDMA message
   from the server that grants client credits.

   The second word of each transport header conveys the transport
   protocol version.
   In the interest of clarity, the current document refers to that word
   as rdma_vers even though in the RPC-over-RDMA version 2 XDR
   definition, it appears as rdma_start.rdma_vers.

   Immediately after the client establishes a connection, it sends a
   single valid RPC-over-RDMA message with the value two (2) in the
   rdma_vers field. Because the server might support only RPC-over-RDMA
   version 1, this initial message MUST NOT be larger than the version 1
   default inline threshold of 1024 octets.

4.2.3.1. Server Does Support RPC-over-RDMA Version 2

   If the server supports RPC-over-RDMA version 2, it sends
   RPC-over-RDMA messages back to the client with the value two (2) in
   the rdma_vers field. Both peers may assume the default inline
   threshold value for RPC-over-RDMA version 2 connections (4096
   octets).

4.2.3.2. Server Does Not Support RPC-over-RDMA Version 2

   If the server does not support RPC-over-RDMA version 2, it MUST send
   an RPC-over-RDMA message to the client with an XID that matches the
   client's first message, RDMA2_ERROR in the rdma_start.rdma_htype
   field, and with the error code RDMA2_ERR_VERS. This message also
   reports the range of RPC-over-RDMA protocol versions that the server
   supports. To continue operation, the client selects a protocol
   version in that range for subsequent messages on this connection.

   If the connection is dropped immediately after an RDMA2_ERROR/
   RDMA2_ERR_VERS message is received, the client should try to avoid a
   version negotiation loop when re-establishing another connection. It
   can assume that the server does not support RPC-over-RDMA version 2.
   A client can assume the same situation (i.e., no server support for
   RPC-over-RDMA version 2) if the initial negotiation message is lost
   or dropped.
   Once the version negotiation exchange is complete, both peers may
   use the default inline threshold value for the negotiated transport
   protocol version.

4.2.3.3. Client Does Not Support RPC-over-RDMA Version 2

   The server examines the RPC-over-RDMA protocol version used in the
   first RPC-over-RDMA message it receives. If it supports this
   protocol version, it MUST use it in all subsequent messages it sends
   on that connection. The client MUST NOT change the protocol version
   for the duration of the connection.

4.3. XDR Encoding with Chunks

   When a DDP capability is available, an RDMA provider can place the
   contents of one or more XDR data items directly into a receiver's
   memory. It can do this separately from the transfer of other parts
   of the containing XDR stream.

4.3.1. Reducing an XDR Stream

   RPC-over-RDMA version 2 provides a mechanism for moving part of an
   RPC message via a data transfer distinct from an RDMA Send/Receive
   pair. The sender removes one or more XDR data items from the Payload
   stream. These items are conveyed via other mechanisms, such as one
   or more RDMA Read or Write operations. As the receiver decodes an
   incoming message, it skips over directly placed data items.

   We refer to a data item that a sender removes from a Payload stream
   to transmit separately as a "reduced" data item. After a sender has
   finished removing XDR data items from a Payload stream, we refer to
   it as a "reduced" Payload stream. The data object in a transport
   header that describes memory regions containing reduced data items is
   known as a "chunk."

4.3.2. DDP-Eligibility

   Not all XDR data items benefit from Direct Data Placement. For
   example, small data items or data items that require XDR unmarshaling
   by the receiver do not benefit from DDP.
   Moreover, it is impractical for receivers to prepare for every
   possible XDR data item in a protocol to appear in a chunk.

   Determining which data items are DDP-eligible is done in additional
   specifications that describe how ULPs employ DDP. A "ULB
   specification" identifies which XDR data items a peer MAY transfer
   using DDP. Such data items are known as "DDP-eligible." Senders
   MUST NOT reduce any other XDR data items. Detailed requirements for
   ULB specifications appear in Appendix A of the current document.

4.3.3. RDMA Segments

   When encoding a Payload stream that contains a DDP-eligible data
   item, a sender may choose to reduce that data item. When it chooses
   to do so, the sender does not place the item into the Payload stream.
   Instead, the sender records in the transport header the location and
   size of the memory region containing that data item.

   The Requester provides location information for DDP-eligible data
   items in both RPC Call and Reply messages. The Responder uses this
   information to retrieve arguments contained in the specified region
   of the Requester's memory or place results in that memory region.

   An "RDMA segment", or "plain segment", is a transport header data
   object that contains the precise coordinates of a contiguous memory
   region. This region is conveyed separately from the Payload stream.
   Each RDMA segment contains the following information:

   Handle:  A Steering Tag (STag) or R_key generated by registering this
      memory with the RDMA provider.

   Length:  The length of the RDMA segment's memory region, in octets.
      An "empty segment" is an RDMA segment with the value zero (0) in
      its length field.

   Offset:  The offset or beginning memory address of the RDMA segment's
      memory region.

   See [RFC5040] for further discussion.

4.3.4. Chunks

   In RPC-over-RDMA version 2, a "chunk" refers to a portion of an RPC
   message that is moved independently of the Payload stream. The
   sender removes chunk data from the Payload stream, transfers it via
   separate operations, and then the receiver reinserts it into the
   received Payload stream to reconstruct the complete RPC message.

   Each chunk consists of RDMA segments. Each RDMA segment represents a
   piece of a chunk that is contiguous in memory. A Requester MAY
   divide a chunk into RDMA segments using any convenient boundaries.
   The length of a chunk is precisely the sum of the lengths of the RDMA
   segments that comprise it.

   The RPC-over-RDMA version 2 transport protocol does not place a limit
   on chunk size. However, each ULP may cap the amount of data that can
   be transferred by a single RPC transaction. For example, NFS has
   "rsize" and "wsize", which restrict the payload size of NFS READ and
   WRITE operations. The Responder can use such limits to sanity check
   chunk sizes before using them in RDMA operations.

4.3.4.1. Counted Arrays

   If a chunk is to contain a counted array data type, the count of
   array elements MUST remain in the Payload stream. The sender MUST
   move the array elements into the chunk. For example, when encoding
   an opaque byte array as a chunk, the count of bytes stays in the
   Payload stream. The sender removes the bytes in the array from the
   Payload stream and places them in the chunk.

   Individual array elements appear in a chunk in their entirety. For
   example, when encoding an array of arrays as a chunk, the count of
   items in the enclosing array stays in the Payload stream. But each
   enclosed array, including its item count, is transferred as part of
   the chunk.

4.3.4.2. Optional-Data

   If a chunk contains an optional-data data type, the "is present"
   field MUST remain in the Payload stream.
   The sender MUST move the data, if present, to the chunk.

4.3.4.3. XDR Unions

   A union data type MUST NOT be made DDP-eligible. However, one or
   more of its arms MAY be made DDP-eligible, subject to the other
   requirements in this section.

4.3.4.4. Chunk Roundup

   Except in special cases (covered in Section 4.4.3), a chunk MUST
   contain only one XDR data item. This restriction makes it
   straightforward to reduce variable-length data items without
   affecting the XDR alignment of other data items in the Payload
   stream.

   When a sender reduces a variable-length XDR data item, data items
   remaining in the Payload stream MUST remain on four-byte alignment.
   Therefore, the sender always removes XDR roundup padding for that
   data item from the Payload stream.

4.3.5. Read Chunks

   A "Read chunk" represents an XDR data item that the Responder pulls
   from the Requester. A Read chunk is a list of one or more RDMA read
   segments. Each RDMA read segment consists of a Position field
   followed by an RDMA segment, as defined in Section 4.3.3.

   Position:  The byte offset in the unreduced Payload stream where the
      receiver reinserts the data item conveyed in the chunk. The
      Requester MUST compute the Position value from the beginning of
      the unreduced Payload stream, which begins at Position zero. All
      RDMA read segments belonging to the same Read chunk have the same
      value in their Position field.

   While constructing an RPC Call message, a Requester registers memory
   regions containing data items intended for RDMA Read operations. It
   advertises the coordinates of these regions by adding Read chunks to
   the transport header of the RPC Call message.

   After receiving an RPC Call message sent via an RDMA Send operation,
   a Responder transfers the chunk data from the Requester using RDMA
   Read operations.
   The Responder inserts the first RDMA segment in a Read chunk into
   the Payload stream at the byte offset indicated by its Position
   field. The Responder concatenates RDMA segments whose Position field
   value matches this offset until there are no more RDMA segments at
   that Position value.

   The Position field in an RDMA read segment indicates where the
   containing Read chunk starts in the Payload stream. The value in
   this field MUST be a multiple of four. All segments in the same Read
   chunk share the same Position value, even if one or more of the RDMA
   segments have a non-four-byte-aligned length.

4.3.5.1. Decoding Read Chunks

   The Responder initiates an RDMA Read to pull a Read chunk's data
   content into registered local memory whenever the XDR offset in the
   Payload stream matches that of a Read chunk. The Responder
   acknowledges that it is finished with Read chunk source buffers when
   it sends the corresponding RPC Reply message to the Requester. The
   Requester may then release Read chunks advertised in the
   RPC-over-RDMA Call.

4.3.5.2. Read Chunk Roundup

   When reducing a variable-length argument data item, the Requester
   MUST NOT include the data item's XDR roundup padding in the chunk
   itself. The chunk's total length MUST be the same as the encoded
   length of the data item.

4.3.6. Write Chunks

   While constructing an RPC Call message, a Requester prepares memory
   regions in which to receive DDP-eligible result data items. A "Write
   chunk" represents an XDR data item that a Responder is to push to a
   Requester. It consists of an array of zero or more plain segments.

   A Requester provisions Write chunks long before the Responder has
   prepared the reply message. A Requester often does not know the
   actual length of the result data items to be returned, since the
   result does not yet exist.
   Thus, it MUST provision Write chunks large enough to accommodate the
   maximum possible size of each returned data item.

   Note that the XDR position of DDP-eligible data items in the reply's
   Payload stream is not predictable when a Requester constructs an RPC
   Call message. Therefore, RDMA segments in a Write chunk do not have
   a Position field.

   For each Write chunk provided by a Requester, the Responder pushes
   one DDP-eligible data item to the Requester. It fills the chunk
   contiguously and in segment array order until the Responder has
   written that data item to the Requester in its entirety. The
   Responder MUST copy the segment count and all segments from the
   Requester-provided Write chunk into the RPC Reply message's transport
   header. As it does so, the Responder updates each segment length
   field to reflect the actual amount of data returned in that segment.
   The Responder then sends the RPC Reply message via an RDMA Send
   operation.

   An "empty Write chunk" is a Write chunk with a zero segment count.
   By definition, the length of an empty Write chunk is zero. An
   "unused Write chunk" has a non-zero segment count, but all of its
   segments are empty segments.

4.3.6.1. Decoding Write Chunks

   After receiving the RPC Reply message, the Requester reconstructs the
   transferred data by concatenating the contents of each segment in
   array order into the RPC Reply message's XDR stream at the known XDR
   position of the associated DDP-eligible result data item.

4.3.6.2. Write Chunk Roundup

   When provisioning a Write chunk for a variable-length result data
   item, the Requester MUST NOT include additional space for XDR roundup
   padding. A Responder MUST NOT write XDR roundup padding into a Write
   chunk, even if the result is shorter than the available space in the
   chunk.
   Therefore, when returning a single variable-length result data item,
   a returned Write chunk's total length MUST be the same as the encoded
   length of the result data item.

4.4. Payload Format

   Unlike RPC-over-TCP and RPC-over-UDP transports, RPC-over-RDMA
   transports are aware of the XDR encoding of each RPC message payload.
   For efficiency, the transport can convey DDP-eligible XDR data items
   separately from the RPC message itself. Also, receivers are required
   to post adequate receive resources in advance of each RPC message.

   RPC-over-RDMA version 2 provides several ways to arrange conveyance
   of an RPC-over-RDMA message. A sender chooses the specific format
   for a message based on several factors:

   o  The existence of DDP-eligible data items in the RPC message

   o  The size of the RPC message

   o  The direction of the RPC message (i.e., Call or Reply)

   o  The available hardware resources

   o  The arrangement of source and sink memory buffers

   The following subsections describe in detail how Requesters and
   Responders format RPC-over-RDMA message payloads.

4.4.1. Simple Format

   All RPC messages conveyed via RPC-over-RDMA version 2 require at
   least one RDMA Send operation to convey. Thus, the most efficient
   way to send an RPC message that is smaller than the inline threshold
   is to append the Payload stream directly to the Transport stream.
   When no chunks are present, senders construct Calls and Replies the
   same way, and no other operations are needed.

4.4.1.1. Simple Format with Chunks

   If DDP-eligible data items are present in a Payload stream, a sender
   MAY reduce some or all of these items, removing them from the Payload
   stream. The sender then uses a separate mechanism to transfer the
   reduced data items. The Transport stream with the reduced Payload
   stream immediately following it is then transferred using one RDMA
   Send operation.

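   The reduction and roundup rules of Sections 4.3.4.4 and 4.3.6.2,
   together with the Simple Format size test, can be sketched as
   follows. This is a non-normative illustration, not part of the
   protocol. The function names, the 28-octet header length used in the
   checks, and the assumption that the inline threshold covers the
   Transport stream plus the Payload stream are all hypothetical; the
   4096-octet default comes from Section 4.2.2.

```python
import struct


def xdr_pad_len(length: int) -> int:
    """Return the number of XDR roundup octets for an item of this length."""
    return (4 - length % 4) % 4


def reduce_opaque(payload: bytes, data_offset: int, data_len: int):
    """Remove one variable-length opaque item from a Payload stream.

    data_offset is the offset of the first data octet; the 4-octet
    length word in front of it stays in the stream. The item's roundup
    padding is dropped from the stream and is not copied into the chunk.
    """
    end = data_offset + data_len
    chunk = payload[data_offset:end]
    reduced = payload[:data_offset] + payload[end + xdr_pad_len(data_len):]
    return reduced, chunk


def fits_simple_format(header_len: int, reduced_len: int,
                       inline_threshold: int = 4096) -> bool:
    """Assumption: the threshold covers Transport plus Payload stream."""
    return header_len + reduced_len <= inline_threshold


# Example stream: a 5-octet opaque (with 3 pad octets), then one integer.
payload = struct.pack(">I", 5) + b"hello" + b"\x00" * 3 + struct.pack(">I", 7)
reduced, chunk = reduce_opaque(payload, 4, 5)
```

   In this example, the chunk carries exactly the five data octets, and
   the reduced Payload stream keeps the length word and the trailing
   integer on four-byte alignment.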
   When chunks are present, senders construct Calls differently than
   Replies.

   Simple Call
      After receiving the Transport and Payload streams of an RPC Call
      message with Read chunks, the Responder uses RDMA Read operations
      to move the reduced data items contained in Read chunks.
      RPC-over-RDMA Calls can carry Write chunks for the Responder to
      use when sending the matching Reply.

   Simple Reply
      The Responder uses RDMA Write operations to move reduced data
      items contained in Write chunks. Afterward, it sends the
      Transport and Payload streams of the RPC Reply message using one
      RDMA Send. RPC-over-RDMA Replies always carry an empty Read chunk
      list.

4.4.1.2. Simple Format Examples

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply

   A Simple Call without chunks and a Simple Reply without chunks

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |              RDMA Read              |
            |   <------------------------------   |
            |      RDMA Response (arg data)       |
            |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply

   A Simple Call with a Read chunk and a Simple Reply without chunks

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |      RDMA Write (result data)       |
            |   <------------------------------   |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply

   A Simple Call without chunks and a Simple Reply with a Write chunk

4.4.2. Continued Format

   For various reasons, a sender can choose to split a message payload
   over multiple RPC-over-RDMA messages.
   The Payload stream of each RPC-over-RDMA message contains a part of
   the RPC message. The receiver reconstructs the original RPC message
   by concatenating in sequence the Payload stream of each RPC-over-RDMA
   message. A sender MAY split an RPC message payload on any convenient
   boundary.

4.4.2.1. Continued Format with Chunks

   If DDP-eligible data items are present in the Payload stream, a
   sender MAY reduce some or all of these items, removing them from the
   Payload stream. The sender then uses a separate mechanism to
   transfer the reduced data items. The Transport stream with the
   reduced Payload stream immediately following it is then transferred
   using one RDMA Send operation.

   As with Simple Format messages, when chunks are present, senders
   construct Calls differently than Replies.

   Continued Call
      After receiving the Transport and Payload streams of an RPC Call
      message with Read chunks, the Responder uses RDMA Read operations
      to move the reduced data items contained in Read chunks.
      RPC-over-RDMA Calls can carry Write chunks for the Responder to
      use when sending the matching Reply.

   Continued Reply
      The Responder uses RDMA Write operations to move reduced data
      items contained in Write chunks. Afterward, it sends the
      Transport and Payload streams of the RPC Reply message using
      multiple RDMA Sends. RPC-over-RDMA Replies always carry an empty
      Read chunk list.

4.4.2.2. Continued Format Examples

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |        RDMA Send (RDMA_MSG)         |
            |   ------------------------------>   |
            |        RDMA Send (RDMA_MSG)         |
            |   ------------------------------>   |
            |                                     |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   |

   A Continued Call without chunks and a Continued Reply without chunks

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |        RDMA Send (RDMA_MSG)         |
            |   ------------------------------>   |
            |        RDMA Send (RDMA_MSG)         |
            |   ------------------------------>   |
            |              RDMA Read              |
            |   <------------------------------   |
            |      RDMA Response (arg data)       |
            |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply

   A Continued Call with a Read chunk and a Simple Reply without chunks

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |      RDMA Write (result data)       |
            |   <------------------------------   |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   |

   A Simple Call without chunks and a Continued Reply with a Write chunk

4.4.3. Special Format

   Sometimes, after DDP-eligible data items have been removed, a Payload
   stream is still too large to send using only RDMA Send operations.
   In those cases, the sender can use RDMA Read or Write operations to
   convey the entire RPC message.
   We refer to this as a "Special Format" message.

   To transmit a Special Format message, the sender transmits only the
   Transport stream with an RDMA Send operation. The sender does not
   include the Payload stream in the send buffer. Instead, the
   Requester provides chunks that the Responder uses to move the Payload
   stream.

   Because chunks are always present in Special Format messages, the
   sender always handles Calls and Replies differently.

   Special Call
      The Requester provides a Read chunk that contains the RPC Call
      message's Payload stream. Every read segment in this chunk MUST
      contain zero (0) in its Position field. This type of Read chunk
      is known as a "Position Zero Read chunk."

   Special Reply
      The Requester provisions a single Write chunk in advance, known as
      a "Reply chunk", in which the Responder places the RPC Reply
      message's Payload stream. The Requester sizes the Reply chunk to
      accommodate the maximum expected reply size for that upper-layer
      operation.

   One purpose of a Special Format message is to handle large RPC
   messages. However, Requesters MAY use a Special Format message at
   any time to convey an RPC Call message.

   When it has alternatives, a Responder chooses which Format to use
   based on the chunks provided by the Requester. If a Requester
   provided a Write chunk and the Responder has a DDP-eligible result,
   it first reduces the reply Payload stream. If a Requester provided a
   Reply chunk and the reduced Payload stream is larger than the reply
   inline threshold, the Responder MUST use the Requester-provided Reply
   chunk for the reply.

   XDR data items may appear in these chunks without regard to their
   DDP-eligibility. As these chunks contain a Payload stream, they MUST
   include appropriate XDR roundup padding to maintain proper XDR
   alignment of their contents.

4.4.3.1. Special Format Examples

        Requester                             Responder
            |       RDMA Send (RDMA_NOMSG)        |
       Call |   ------------------------------>   |
            |              RDMA Read              |
            |   <------------------------------   |
            |      RDMA Response (RPC call)       |
            |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply

   A Special Call and a Simple Reply without chunks

        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     |
            |                                     | Processing
            |                                     |
            |       RDMA Write (RPC reply)        |
            |   <------------------------------   |
            |       RDMA Send (RDMA_NOMSG)        |
            |   <------------------------------   | Reply

   A Simple Call without chunks and a Special Reply

4.5. Reverse-Direction Operation

4.5.1. Sending a Reverse-Direction RPC Call

   An RPC-over-RDMA server endpoint constructs the transport header for
   a reverse-direction RPC Call as follows:

   o  The server generates a new XID value (see Section 3.1.3.4 for full
      requirements) and places it in the rdma_xid field of the transport
      header and the xid field of the RPC Call message. The RPC Call
      header MUST start with the same XID value that is present in the
      transport header.

   o  The rdma_vers field of each reverse-direction Call MUST contain
      the same value as forward-direction Calls on the same connection.

   o  The server fills in the rdma_credits field with the credit values
      for the connection, as described in Section 4.2.1.1.

   o  The server determines the Payload format for the RPC message and
      fills in the rdma_htype field as appropriate (see Sections 4.4 and
      4.5.4). Section 4.5.4 also covers the disposition of the chunk
      lists.

   o  The server MUST clear the RDMA2_F_RESPONSE flag in the rdma_flags
      field. It sets the RDMA2_F_MORE flag in the rdma_flags field as
      described in Section 6.2.2.2.

4.5.2. Sending a Reverse-Direction RPC Reply

   An RPC-over-RDMA client endpoint constructs the transport header for
   a reverse-direction RPC Reply as follows:

   o  The client copies the XID value from the matching RPC Call and
      places it in the rdma_xid field of the transport header and the
      xid field of the RPC Reply message. The RPC Reply header MUST
      start with the same XID value that is present in the transport
      header.

   o  The rdma_vers field of each reverse-direction Reply MUST contain
      the same value as forward-direction Replies on the same
      connection.

   o  The client fills in the rdma_credits field with the credit values
      for the connection, as described in Section 4.2.1.1.

   o  The client determines the Payload format for the RPC message and
      fills in the rdma_htype field as appropriate (see Sections 4.4 and
      4.5.4). Section 4.5.4 also covers the disposition of the chunk
      lists.

   o  The client MUST set the RDMA2_F_RESPONSE flag in the rdma_flags
      field. It sets the RDMA2_F_MORE flag in the rdma_flags field as
      described in Section 6.2.2.2.

4.5.3. In the Absence of Support For Reverse-Direction Operation

   An RPC-over-RDMA transport endpoint does not have to support
   reverse-direction operation. There might be no mechanism in the
   transport implementation to do so. Or, the transport implementation
   might support operation in the reverse direction, but the Upper-Layer
   Protocol might not yet have configured the transport to handle
   reverse-direction traffic.

   If an endpoint is unprepared to receive a reverse-direction message,
   loss of the RDMA connection might result. Thus a denial of service
   can occur if an RPC server continues to send reverse-direction
   messages after a client that is not prepared to receive them
   reconnects to an endpoint.

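   The header construction steps in Sections 4.5.1 and 4.5.2, together
   with the credit encoding of Section 4.2.1.1, can be sketched as
   follows. This is a non-normative illustration. The HeaderStart class
   and the helper names are hypothetical, and the two flag bit values
   are placeholders assumed for this sketch; the normative values are
   those in the XDR definition. The caller decides whether to set
   RDMA2_F_MORE according to Section 6.2.2.2.

```python
from dataclasses import dataclass

# Placeholder bit values, assumed for illustration only.
RDMA2_F_RESPONSE = 0x1
RDMA2_F_MORE = 0x2


def encode_credits(granted: int, max_outstanding: int) -> int:
    """Pack both credit values into the 32-bit rdma_credits word.

    Low-order two bytes: newly granted credits (MUST NOT be zero).
    High-order two bytes: maximum credits outstanding at the sender.
    """
    if not 1 <= granted <= 0xFFFF:
        raise ValueError("granted credit value must be in 1..65535")
    if not 0 <= max_outstanding <= 0xFFFF:
        raise ValueError("maximum credit value must be in 0..65535")
    return (max_outstanding << 16) | granted


def decode_credits(word: int):
    """Return (granted, max_outstanding) from an rdma_credits word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


@dataclass
class HeaderStart:
    """Hypothetical container for the fields named in Section 4.5."""
    rdma_xid: int
    rdma_vers: int
    rdma_credits: int
    rdma_flags: int


def reverse_direction_header(xid: int, vers: int, granted: int,
                             max_outstanding: int, is_reply: bool,
                             more: bool) -> HeaderStart:
    """Fill in the fields for a reverse-direction Call or Reply."""
    flags = 0
    if is_reply:
        flags |= RDMA2_F_RESPONSE  # set for Replies, clear for Calls
    if more:
        flags |= RDMA2_F_MORE      # per Section 6.2.2.2
    return HeaderStart(xid, vers, encode_credits(granted, max_outstanding),
                       flags)


call = reverse_direction_header(0x1234, 2, 32, 64, is_reply=False, more=False)
reply = reverse_direction_header(0x1234, 2, 32, 64, is_reply=True, more=False)
```

   A Reply reuses the XID of the matching Call, so the sketch passes the
   same xid value when building both headers.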

   Connection peers indicate their support for reverse-direction
   operation as part of the exchange of Transport Properties just after
   a connection is established (see Section 5.2.5).

   When dealing with the possibility that the remote peer has no
   transport-level support for reverse-direction operation, the
   Upper-Layer Protocol is responsible for informing peers when
   reverse-direction operation is supported.  Otherwise, even a simple
   reverse-direction RPC NULL procedure from a peer could result in a
   lost connection.  Therefore, an Upper-Layer Protocol MUST NOT perform
   reverse-direction RPC operations until the RPC server indicates
   support for them.

4.5.4.  Using Chunks During Reverse-Direction Operation

   Reverse-direction operations can use chunks, as defined in
   Section 4.3.4, for DDP-eligible data items or in Special payload
   formats.  Reverse-direction chunks operate the same way as in
   forward-direction operation.  Connection peers indicate their support
   for reverse-direction chunks as part of the exchange of Transport
   Properties just after a connection is established (see
   Section 5.2.5).

   However, an implementation might support only Upper-Layer Protocols
   that have no DDP-eligible data items.  Such Upper-Layer Protocols can
   use only small messages, or they might have a native mechanism for
   restricting the size of reverse-direction RPC messages, obviating the
   need to handle chunks in the reverse direction.

   When there is no Upper-Layer Protocol need for chunks in the reverse
   direction, implementers MAY choose not to provide support for chunks
   in the reverse direction, thus avoiding the complexity of
   implementing support for RDMA Reads and Writes in the reverse
   direction.
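
   As an informal illustration of the rules above, a sender might map
   the reverse-direction support level advertised by its peer (see
   Section 5.2.5) to the payload formats it permits itself to use.  The
   function name and the format labels below are hypothetical; only the
   four numeric levels come from this document.  Raising an exception
   for an unknown level is an assumption of the sketch, not a protocol
   requirement.

```python
# Reverse-direction support levels (values from Section 5.2.5).
RDMA2_RVRSDIR_NONE = 0
RDMA2_RVRSDIR_SIMPLE = 1
RDMA2_RVRSDIR_CONT = 2
RDMA2_RVRSDIR_GENL = 3


def reverse_direction_formats(peer_level=None):
    """Return the payload formats usable for reverse-direction messages.

    peer_level is the value the peer advertised, or None when the peer
    did not provide the property (the default is then "no support").
    The returned labels are illustrative names, not protocol elements.
    """
    if peer_level is None:
        peer_level = RDMA2_RVRSDIR_NONE
    if peer_level == RDMA2_RVRSDIR_NONE:
        return set()  # no reverse-direction traffic may be sent
    if peer_level == RDMA2_RVRSDIR_SIMPLE:
        return {"simple"}
    if peer_level == RDMA2_RVRSDIR_CONT:
        return {"simple", "continued"}
    if peer_level == RDMA2_RVRSDIR_GENL:
        # Same capabilities as the forward direction, including chunks.
        return {"simple", "continued", "chunks", "special"}
    # Assumption of this sketch: unknown levels are a caller error.
    raise ValueError("unknown reverse-direction support level")
```

   A caller would consult the returned set before building a
   reverse-direction Call and refrain from sending one when the set is
   empty.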

   When an RPC-over-RDMA transport implementation does not support
   chunks in the reverse direction, RPC endpoints use only the Simple
   Payload format without chunks or the Continued Payload format without
   chunks to send RPC messages in the reverse direction.

   If a reverse-direction Requester provides a non-empty chunk list to a
   Responder that does not support chunks, the Responder MUST report its
   lack of support using one of the error values defined in Section 7.3.

4.5.5.  Reverse-Direction Retransmission

   In rare cases, an RPC server cannot complete an RPC transaction and
   cannot send a Reply.  In these cases, the Requester may send the RPC
   transaction again using the same RPC XID.  We refer to this as an
   "RPC retransmission" or a "replay."

   In the forward direction, the Requester is the RPC client.  The
   client is always responsible for ensuring a transport connection is
   in place before sending a dropped Call again.

   With reverse-direction operation, the Requester is an RPC server.
   Because an RPC server is not responsible for establishing transport
   connections with clients, the Requester is unable to retransmit a
   reverse-direction Call whenever there is no transport connection.  In
   this case, the RPC server must wait for the RPC client to
   re-establish a transport connection before it can retransmit
   reverse-direction RPC Calls.

   If the forward-direction Requester has no work to do, it can be some
   time before the RPC client re-establishes a transport connection.  An
   RPC server may need to abandon a waiting reverse-direction RPC Call
   to avoid waiting indefinitely for the client to re-establish a
   transport connection.

   Therefore forward-direction Requesters SHOULD maintain a transport
   connection as long as the RPC server might send reverse-direction
   Calls.
   For example, while an NFS version 4.1 client has open delegated files
   or active pNFS layouts, it maintains one or more transport
   connections to enable the NFS server to perform callback operations.

5.  Transport Properties

   RPC-over-RDMA version 2 enables connection endpoints to exchange
   information about implementation properties.  Compatible endpoints
   use this information to optimize data transfer.  Initially, only a
   small set of transport properties is defined.  The protocol provides
   a single message type to exchange transport properties (see
   Section 6.3.4).

   Both the set of transport properties and the operations used to
   communicate them may be extended.  Within RPC-over-RDMA version 2,
   such extensions are OPTIONAL.  A discussion of extending the set of
   transport properties appears in Appendix B.4.

5.1.  Transport Properties Model

   The current document specifies a basic set of receiver and sender
   properties.  Such properties are specified using a code point that
   identifies the particular transport property and a nominally opaque
   array containing the XDR encoding of the property.

   The following XDR types handle transport properties:

      typedef uint32 rpcrdma2_propid;

      struct rpcrdma2_propval {
              rpcrdma2_propid rdma_which;
              opaque          rdma_data<>;
      };

      typedef rpcrdma2_propval rpcrdma2_propset<>;

      typedef uint32 rpcrdma2_propsubset<>;

   The rpcrdma2_propid type specifies a distinct transport property.
   The property code points are defined as const values rather than
   elements in an enum type to enable extension by concatenating XDR
   definition files.

   The rpcrdma2_propval type carries the value of a transport property.
   The rdma_which field identifies the particular property, and the
   rdma_data field contains the associated value of that property.
   A zero-length rdma_data field represents the default value of the
   property specified by rdma_which.

   Although the rdma_data field is opaque, receivers interpret its
   contents using the XDR type associated with the property specified by
   rdma_which.  When the contents of the rdma_data field do not conform
   to that XDR type, the receiver MUST return the error
   RDMA2_ERR_BAD_PROPVAL using the header type RDMA2_ERROR, as described
   in Section 6.3.3.

   For example, the receiver of a message containing a valid
   rpcrdma2_propval returns this error if the length of rdma_data is
   greater than the length of the transferred message.  Also, when the
   receiver recognizes the rpcrdma2_propid contained in rdma_which, it
   MUST report the error RDMA2_ERR_BAD_PROPVAL if either of the
   following occurs:

   o  The nominally opaque data within rdma_data is not valid when
      interpreted using the property-associated typedef.

   o  The length of rdma_data is insufficient to contain the data
      represented by the property-associated typedef.

   A receiver does not report an error if it does not recognize the
   value contained in rdma_which.  In that case, the receiver does not
   process that rpcrdma2_propval.  Processing continues with the next
   rpcrdma2_propval, if any.

   The rpcrdma2_propset type specifies a set of transport properties.
   The protocol does not impose a particular ordering of the
   rpcrdma2_propval items within it.

   The rpcrdma2_propsubset type identifies a subset of the properties in
   an rpcrdma2_propset.  Each bit in the mask denotes a particular
   element in a previously specified rpcrdma2_propset.  If a particular
   rpcrdma2_propval is at position N in the array, then bit number N mod
   32 in word N div 32 specifies whether the defined subset includes
   that particular rpcrdma2_propval.  Words beyond the last one
   specified are assumed to contain zero.
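
   The model described in this section can be sketched informally as
   follows.  The sketch decodes only the uint32-valued properties listed
   in Table 1 and treats every other code point, including the Host
   Authentication Message, as unrecognized.  The names decode_propset,
   in_subset, and BadPropval are hypothetical.  Two points are
   assumptions rather than statements of this document: rdma_data that
   is longer than the XDR type is rejected in the same way as data that
   is too short, and bit number 0 is taken to be the least significant
   bit of each word.

```python
import struct

# Defaults for the uint32-valued properties in Table 1 (Section 5.2).
DEFAULTS = {1: 4096, 2: 4096, 3: 1048576, 4: 16, 5: 0}


class BadPropval(Exception):
    """Raised where a receiver would report RDMA2_ERR_BAD_PROPVAL."""


def decode_propset(propset):
    """Decode an iterable of (rdma_which, rdma_data) pairs.

    Returns a dict mapping each recognized code point to its value.
    """
    values = {}
    for which, data in propset:
        if which not in DEFAULTS:
            continue  # unrecognized property: skip it, no error
        if len(data) == 0:
            values[which] = DEFAULTS[which]  # zero length: default
        elif len(data) == 4:
            # XDR unsigned integers are four octets, big-endian.
            values[which] = struct.unpack(">I", data)[0]
        else:
            raise BadPropval(which)  # does not match the XDR type
    return values


def in_subset(subset_words, n):
    """Return True if position n of a propset is in the subset."""
    word_index = n // 32
    if word_index >= len(subset_words):
        return False  # words beyond the last one count as zero
    return bool((subset_words[word_index] >> (n % 32)) & 1)
```

   For instance, position 40 of a property set is selected by bit 8 of
   the second word of the mask, and is not selected when the mask holds
   only one word.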

5.2.  Current Transport Properties

   Table 1 specifies a basic set of transport properties.  The columns
   contain the following information:

   o  The column labeled "Property" contains the name of the transport
      property described by the current row.

   o  The column labeled "Code" specifies the code point that identifies
      this property.

   o  The column labeled "XDR type" gives the XDR type of the data used
      to communicate the value of this property.  This data type
      overlays the data portion of the nominally opaque rdma_data field.

   o  The column labeled "Default" gives the default value for the
      property.

   o  The column labeled "Section" indicates the section within the
      current document that explains the use of this property.

   +----------------------------+------+----------+---------+---------+
   | Property                   | Code | XDR type | Default | Section |
   +----------------------------+------+----------+---------+---------+
   | Maximum Send Size          | 1    | uint32   | 4096    | 5.2.1   |
   | Receive Buffer Size        | 2    | uint32   | 4096    | 5.2.2   |
   | Maximum RDMA Segment Size  | 3    | uint32   | 1048576 | 5.2.3   |
   | Maximum RDMA Segment Count | 4    | uint32   | 16      | 5.2.4   |
   | Reverse-Direction Support  | 5    | uint32   | 0       | 5.2.5   |
   | Host Auth Message          | 6    | opaque<> | N/A     | 5.2.6   |
   +----------------------------+------+----------+---------+---------+

                                  Table 1

5.2.1.  Maximum Send Size

   The value of this property specifies the maximum size, in octets, of
   Send payloads.  The endpoint receiving this value can size its
   Receive buffers based on the value of this property.

      const uint32 RDMA2_PROPID_SBSIZ = 1;
      typedef uint32 rpcrdma2_prop_sbsiz;

5.2.2.  Receive Buffer Size

   The value of this property specifies the minimum size, in octets, of
   pre-posted receive buffers.

      const uint32 RDMA2_PROPID_RBSIZ = 2;
      typedef uint32 rpcrdma2_prop_rbsiz;

   A sender can subsequently use this value to determine when a message
   to be sent fits in pre-posted receive buffers that the receiver has
   set up.  In particular:

   o  Requesters may use the value to determine when to provide a
      Position Zero Read chunk or use Message Continuation when sending
      a Call.

   o  Requesters may use the value to determine when to provide a Reply
      chunk when sending a Call, based on the maximum possible size of
      the Reply.

   o  Responders may use the value to determine when to use a Reply
      chunk provided by the Requester, given the actual size of a Reply.

5.2.3.  Maximum RDMA Segment Size

   The value of this property specifies the maximum size, in octets, of
   an RDMA segment this endpoint is prepared to send or receive.

      const uint32 RDMA2_PROPID_RSSIZ = 3;
      typedef uint32 rpcrdma2_prop_rssiz;

5.2.4.  Maximum RDMA Segment Count

   The value of this property specifies the maximum number of RDMA
   segments that can appear in a Requester's transport header.

      const uint32 RDMA2_PROPID_RCSIZ = 4;
      typedef uint32 rpcrdma2_prop_rcsiz;

5.2.5.  Reverse-Direction Support

   The value of this property specifies a client implementation's
   readiness to process messages that are part of reverse-direction RPC
   requests.

      const uint32 RDMA2_RVRSDIR_NONE = 0;
      const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
      const uint32 RDMA2_RVRSDIR_CONT = 2;
      const uint32 RDMA2_RVRSDIR_GENL = 3;

      const uint32 RDMA2_PROPID_BRS = 5;
      typedef uint32 rpcrdma2_prop_brs;

   Multiple levels of support are distinguished:

   o  The value RDMA2_RVRSDIR_NONE indicates that the sender does not
      support reverse-direction operation.

   o  The value RDMA2_RVRSDIR_SIMPLE indicates that the sender supports
      using only Simple Format messages without chunks for
      reverse-direction messages.

   o  The value RDMA2_RVRSDIR_CONT indicates that the sender supports
      using either Simple Format without chunks or Continued Format
      messages without chunks for reverse-direction messages.

   o  The value RDMA2_RVRSDIR_GENL indicates that the sender supports
      reverse-direction messages in the same way as forward-direction
      messages.

   When a peer does not provide this property, the default is that the
   peer does not support reverse-direction operation.

5.2.6.  Host Authentication Message

   The value of this transport property enables the exchange of host
   authentication material.  This property can accommodate
   authentication handshakes that require multiple challenge-response
   interactions and potentially large amounts of material.

      const uint32 RDMA2_PROPID_HOSTAUTH = 6;
      typedef opaque rpcrdma2_prop_hostauth<>;

   When this property is not present, the peer(s) remain
   unauthenticated.  Local security policy on each peer determines
   whether the connection is permitted to continue.

6.  Transport Messages

   Each transport message consists of multiple sections.

   o  A transport header prefix, as defined in Section 6.2.2.  Among
      other things, this structure indicates the header type.

   o  The transport header proper, as defined by one of the sub-sections
      below.  See Section 6.1 for the mapping between header types and
      the corresponding header structure.

   o  Potentially, all or part of an RPC message payload.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which defined the first and second
   of the items above as a single XDR data structure.
   The new organization is in keeping with RPC-over-RDMA version 2's
   extensibility model, which enables the definition of new header types
   without modifying the XDR definition of existing header types.

6.1.  Transport Header Types

   Table 2 lists the RPC-over-RDMA version 2 header types.  The columns
   contain the following information:

   o  The column labeled "Operation" names the particular operation.

   o  The column labeled "Code" specifies the value of the header type
      for this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to organize the information in this new message
      type.  This data immediately follows the universal portion of the
      transport header, which is present in every RPC-over-RDMA
      transport header.

   o  The column labeled "Msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "Section" refers to the section within the
      current document that explains the use of this header type.

   +------------------------+------+-------------------+-----+---------+
   | Operation              | Code | XDR type          | Msg | Section |
   +------------------------+------+-------------------+-----+---------+
   | Convey Appended RPC    | 0    | rpcrdma2_msg      | Yes | 6.3.1   |
   | Message                |      |                   |     |         |
   | Convey External RPC    | 1    | rpcrdma2_nomsg    | No  | 6.3.2   |
   | Message                |      |                   |     |         |
   | Report Transport Error | 4    | rpcrdma2_err      | No  | 6.3.3   |
   | Specify Properties at  | 5    | rpcrdma2_connprop | No  | 6.3.4   |
   | Connection             |      |                   |     |         |
   +------------------------+------+-------------------+-----+---------+

                                  Table 2

   Support for the operations in Table 2 is REQUIRED.  RPC-over-RDMA
   version 2 implementations that receive an unrecognized header type
   MUST respond with an RDMA2_ERROR message with an rdma_err field
   containing RDMA2_ERR_INVAL_HTYPE and drop the incoming message
   without processing it further.
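
   As a non-normative sketch, a receiver could express Table 2 as a
   lookup table and reject any header type that is absent from it.  The
   function name classify_htype and the shape of its return value are
   hypothetical.  The error is represented by its symbolic name because
   this section does not give its numeric value.

```python
# Header type code points from Table 2.
RDMA2_MSG = 0
RDMA2_NOMSG = 1
RDMA2_ERROR = 4
RDMA2_CONNPROP = 5

# Code point -> (XDR type name, followed by an RPC message payload?).
HEADER_TYPES = {
    RDMA2_MSG: ("rpcrdma2_msg", True),
    RDMA2_NOMSG: ("rpcrdma2_nomsg", False),
    RDMA2_ERROR: ("rpcrdma2_err", False),
    RDMA2_CONNPROP: ("rpcrdma2_connprop", False),
}


def classify_htype(htype):
    """Classify the rdma_htype value of an ingress message.

    Returns ("ok", xdr_type, has_payload) for a known header type.
    Returns ("error", "RDMA2_ERR_INVAL_HTYPE", None) when the receiver
    should answer with RDMA2_ERROR and drop the message.
    """
    entry = HEADER_TYPES.get(htype)
    if entry is None:
        return ("error", "RDMA2_ERR_INVAL_HTYPE", None)
    xdr_type, has_payload = entry
    return ("ok", xdr_type, has_payload)
```

   Keeping the mapping in a table rather than in branching code mirrors
   the extensibility goal of this section: a later extension adds a row
   without touching the handling of existing header types.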

6.2.  Headers and Chunks

   Most RPC-over-RDMA version 2 data structures have antecedents in
   corresponding structures in RPC-over-RDMA version 1.  As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names, and there are a few small changes in content.  In
   some cases, there have been structural re-organizations to enable
   protocol extensibility.

6.2.1.  Common Transport Header Prefix

   The rpcrdma_common structure defines the initial part of each
   RPC-over-RDMA transport header for RPC-over-RDMA version 2 and
   subsequent versions.

      struct rpcrdma_common {
              uint32         rdma_xid;
              uint32         rdma_vers;
              uint32         rdma_credit;
              uint32         rdma_htype;
      };

   RPC-over-RDMA version 2's use of these first four words matches that
   of version 1, as required by [RFC8166].  However, there are crucial
   structural differences in the way that the respective XDR
   descriptions define these words:

   o  The header type is represented as a uint32 rather than as an enum
      type.  An enum would need to be modified to reflect additions to
      the set of header types made by later extensions.

   o  The header type field is part of an XDR structure devoted to
      representing the transport header prefix, rather than being part
      of a discriminated union that includes the body of each transport
      header type.

   o  There is now a prefix structure (see Section 6.2.2) of which the
      rpcrdma_common structure is the initial segment.  This prefix is a
      newly defined XDR object within the protocol description, which
      constrains the universal portion of all header types to the four
      words in rpcrdma_common.

   These changes are part of a more considerable structural change in
   the XDR definition of RPC-over-RDMA version 2 that facilitates a
   cleaner treatment of protocol extension.
   The XDR appearing in Section 8 reflects these changes, which
   Appendix C.1 discusses in further detail.

6.2.2.  Transport Header Prefix

   The following prefix structure appears at the start of each
   RPC-over-RDMA version 2 transport header.

      const RDMA2_F_RESPONSE = 0x00000001;
      const RDMA2_F_MORE     = 0x00000002;
      const RDMA2_F_TPMORE   = 0x00000004;

      struct rpcrdma2_hdr_prefix {
              struct rpcrdma_common rdma_start;
              uint32                rdma_flags;
      };

   The rdma_flags field is new to RPC-over-RDMA version 2.  Currently,
   the flags defined within this word are RDMA2_F_RESPONSE,
   RDMA2_F_MORE, and RDMA2_F_TPMORE.  The other flags are reserved for
   future use as described in Appendix B.3.  The sender MUST set
   reserved flags to zero, and the receiver MUST ignore reserved flags.

6.2.2.1.  RDMA2_F_RESPONSE Flag

   The RDMA2_F_RESPONSE flag qualifies the value contained in the
   transport header's rdma_xid field.  The RDMA2_F_RESPONSE flag enables
   a receiver to avoid performing an XID lookup on incoming
   reverse-direction Call messages.  Therefore:

   o  When the rdma_htype field has the value RDMA2_MSG or RDMA2_NOMSG,
      the value of the RDMA2_F_RESPONSE flag MUST be the same as the
      value of the associated RPC message's msg_type field.

   o  When the header type is anything else and a whole or partial RPC
      message payload is present, the value of the RDMA2_F_RESPONSE flag
      MUST be the same as the value of the associated RPC message's
      msg_type field.

   o  When no RPC message payload is present, a Requester MUST set the
      value of RDMA2_F_RESPONSE to reflect how the receiver is to
      interpret the rdma_xid field.

   o  When the rdma_htype field has the value RDMA2_ERROR, the
      RDMA2_F_RESPONSE flag MUST be set.

6.2.2.2.  RDMA2_F_MORE Flag

   The RDMA2_F_MORE flag signifies that the RPC-over-RDMA message
   payload continues in the next message.
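
   The receiver-side handling of this flag, whose rules are given in the
   remainder of this section, might be sketched as follows.  The sketch
   operates on header fields that have already been decoded.  The class
   names Reassembler and InvalidContinuation are hypothetical, and the
   exception merely stands in for dropping the message (a Responder
   would additionally send RDMA2_ERROR with rdma_err set to
   RDMA2_ERR_INVAL_CONT).  Credit accounting is deliberately left out.

```python
RDMA2_MSG = 0
RDMA2_CONNPROP = 5
RDMA2_F_MORE = 0x00000002


class InvalidContinuation(Exception):
    """Stands in for dropping a message that breaks the sequence."""


class Reassembler:
    """Collects the payloads of one continued-message sequence."""

    def __init__(self):
        self.xid = None
        self.htype = None
        self.parts = []

    def receive(self, xid, htype, flags, has_chunks, payload):
        """Feed one message.

        Returns the concatenated payload once the final message of the
        sequence arrives, or None while more messages are expected.
        """
        more = bool(flags & RDMA2_F_MORE)
        if more and htype not in (RDMA2_MSG, RDMA2_CONNPROP):
            raise InvalidContinuation("header type cannot be continued")
        if more and has_chunks:
            raise InvalidContinuation("chunks only in the final message")
        # Inside a sequence, rdma_xid and rdma_htype must not change.
        if self.parts and (xid, htype) != (self.xid, self.htype):
            raise InvalidContinuation("xid or htype changed")
        self.xid, self.htype = xid, htype
        self.parts.append(payload)
        if more:
            return None
        whole = b"".join(self.parts)
        self.parts = []  # ready for the next sequence
        return whole
```

   A message that arrives with the flag clear and no sequence in
   progress passes straight through, so the same code path serves both
   single and continued messages.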

   When the sender asserts the RDMA2_F_MORE flag, the receiver is to
   concatenate the data payload of the next received message to the end
   of the data payload of the received message.  The sender clears the
   RDMA2_F_MORE flag in the final message in the sequence.

   All RPC-over-RDMA messages in such a sequence MUST have the same
   values in the rdma_xid and rdma_htype fields.  Otherwise, the
   receiver MUST drop the message without processing it further.  If the
   receiver is a Responder, it MUST also respond with an RDMA2_ERROR
   message with the rdma_err field set to RDMA2_ERR_INVAL_CONT.

   If a peer receives an RPC-over-RDMA message with the RDMA2_F_MORE
   flag set, and the rdma_htype field does not contain the values
   RDMA2_MSG or RDMA2_CONNPROP, the receiver MUST drop the message
   without processing it further.  If the receiver is a Responder, it
   MUST also respond with an RDMA2_ERROR message with the rdma_err field
   set to RDMA2_ERR_INVAL_CONT.

   The sender includes chunks only in the final message in a sequence,
   in which the RDMA2_F_MORE flag is clear.  If a peer receives an
   RPC-over-RDMA message with the RDMA2_F_MORE flag set, and its chunk
   lists are not empty, the receiver MUST drop the message without
   processing it further.  If the receiver is a Responder, it MUST also
   respond with an RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_CONT.

   There is no protocol-defined limit on the number of concatenated
   messages in a sequence.  If the sender exhausts the receiver's credit
   grant before sending the final message in the sequence, the sender
   waits for a further credit grant from the receiver before continuing
   to send messages.

   Credit exhaustion can occur at the receiver in the middle of a
   sequence of continued messages.
   The receiver can grant more credits by sending an RPC message payload
   or an out-of-band credit grant (see Section 4.2.1.2) to enable the
   sender to send the remaining messages in the sequence.

6.2.2.3.  RDMA2_F_TPMORE Flag

   The RDMA2_F_TPMORE flag indicates that the sender has additional
   Transport Properties to send in a subsequent RPC-over-RDMA message.
   If a peer receives any message type other than RDMA2_CONNPROP with
   the RDMA2_F_TPMORE flag set, it MUST respond with an RDMA2_ERROR
   message type whose rdma_err field contains RDMA2_ERR_INVAL_HTYPE, and
   then silently discard the ingress message without processing it.

   The RDMA2_F_TPMORE flag is clear in the final RDMA2_CONNPROP message
   type from this peer on this connection.  If a peer receives an
   RDMA2_CONNPROP message type after it has received an RDMA2_CONNPROP
   message type with a clear RDMA2_F_TPMORE flag, it MUST respond with
   an RDMA2_ERROR message type whose rdma_err field contains
   RDMA2_ERR_INVAL_HTYPE, and then silently discard the ingress message
   without processing it.

   After both connection peers have indicated they have finished sending
   their Transport Properties, they may begin passing RPC traffic.

6.2.3.  External Data Payloads

   The rpcrdma2_chunk_lists structure specifies how explicit RDMA
   operations convey the message payload.

      struct rpcrdma2_chunk_lists {
              uint32                      rdma_inv_handle;
              struct rpcrdma2_read_list   *rdma_reads;
              struct rpcrdma2_write_list  *rdma_writes;
              struct rpcrdma2_write_chunk *rdma_reply;
      };

   The rdma_reads, rdma_writes, and rdma_reply fields provide,
   respectively, the chunks used to read a Special Format Call or pull
   directly placed data from the Requester; the chunks used to push
   directly placed response data into the Requester's memory; and the
   chunk used to push a long Reply into the Requester's memory.
   See Section 4.3 for further details on how a sender constructs
   chunks.

6.2.4.  Remote Invalidation

   A central addition relative to the corresponding RPC-over-RDMA
   version 1 rdma_header structures is the rdma_inv_handle field.  This
   field enables remote invalidation of one Requester memory
   registration by using the RDMA Send With Invalidate operation.

   To solicit the use of Remote Invalidation, a Requester sets the value
   of the rdma_inv_handle field in an RPC Call's transport header to a
   non-zero value that matches one of the rdma_handle fields in that
   header.  If the Responder is not permitted to invalidate any of the
   rdma_handle values in the header conveying the Call, the Requester
   sets the RPC Call's rdma_inv_handle field to the value zero.

   If the Responder chooses not to use Remote Invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the Responder simply uses RDMA Send to
   transmit the matching RPC Reply.  However, if the Responder chooses
   to use Remote Invalidation, it uses RDMA Send With Invalidate to
   transmit the RPC Reply.  It MUST use the value in the corresponding
   Call's rdma_inv_handle field to construct the Send With Invalidate
   Work Request.

6.3.  Header Types

   The header types defined and used in RPC-over-RDMA version 1 are
   carried over into RPC-over-RDMA version 2, although there are some
   limited changes in the definitions of existing header types:

   o  To simplify interoperability with RPC-over-RDMA version 1, only
      the RDMA2_ERROR header (defined in Section 6.3.3) has an XDR
      definition that differs from that in RPC-over-RDMA version 1, and
      its modifications are all compatible extensions.

   o  RDMA2_MSG and RDMA2_NOMSG (defined in Sections 6.3.1 and 6.3.2)
      have XDR definitions that match the corresponding RPC-over-RDMA
      version 1 header types.
      However, because of the changes to the header prefix, the version
      1 and version 2 header types differ in on-the-wire format.

   o  RDMA2_CONNPROP (defined in Section 6.3.4) is an entirely new
      header type devoted to enabling connection peers to exchange
      information about their transport properties.

6.3.1.  RDMA2_MSG: Convey RPC Message Inline

   RDMA2_MSG conveys all or part of an RPC message immediately following
   the transport header in the send buffer.

      const rpcrdma2_proc RDMA2_MSG = 0;

      struct rpcrdma2_msg {
              struct rpcrdma2_chunk_lists rdma_chunks;

              /* The rpc message starts here and continues
               * through the end of the transmission. */
              uint32                      rdma_rpc_first_word;
      };

6.3.2.  RDMA2_NOMSG: Convey External RPC Message

   RDMA2_NOMSG conveys an entire RPC message payload using explicit RDMA
   operations.  In particular, it is a Special Format Call when the
   Responder reads the RPC payload from a memory area specified by a
   Position Zero Read chunk.  It is a Special Format Reply when the
   Responder writes the RPC payload into a memory area specified by a
   Reply chunk.  In both cases, the sender sets the rdma_xid field to
   the same value as the xid of the RPC message payload.

   If all the chunk lists are empty, the message conveys a credit grant
   refresh.  The header prefix of this message contains a credit grant
   refresh in the rdma_credit field.  In this case, the sender MUST set
   the rdma_xid field to zero.

      const rpcrdma2_proc RDMA2_NOMSG = 1;

      struct rpcrdma2_nomsg {
              struct rpcrdma2_chunk_lists rdma_chunks;
      };

   In RPC-over-RDMA version 2, a sender should use Message Continuation
   as an alternative to using a Special Format message.

6.3.3.  RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR reports a transport layer error on a previous
   transmission.

      const rpcrdma2_proc RDMA2_ERROR = 4;

      struct rpcrdma2_err_vers {
              uint32 rdma_vers_low;
              uint32 rdma_vers_high;
      };

      struct rpcrdma2_err_write {
              uint32 rdma_chunk_index;
              uint32 rdma_length_needed;
      };

      union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
              case RDMA2_ERR_VERS:
                rpcrdma2_err_vers rdma_vrange;
              case RDMA2_ERR_READ_CHUNKS:
                uint32 rdma_max_chunks;
              case RDMA2_ERR_WRITE_CHUNKS:
                uint32 rdma_max_chunks;
              case RDMA2_ERR_SEGMENTS:
                uint32 rdma_max_segments;
              case RDMA2_ERR_WRITE_RESOURCE:
                rpcrdma2_err_write rdma_writeres;
              case RDMA2_ERR_REPLY_RESOURCE:
                uint32 rdma_length_needed;
              default:
                void;
      };

   See Section 7 for details on the use of this header type.

6.3.4.  RDMA2_CONNPROP: Exchange Transport Properties

   The RDMA2_CONNPROP message type enables a connection peer to publish
   the properties of its implementation to its remote peer.

      struct rpcrdma2_connprop {
              rpcrdma2_propset rdma_props;
      };

   Each peer sends an RDMA2_CONNPROP message type as the first message
   after the client has established a connection.  The size of this
   initial message is limited to the default inline threshold for the
   RPC-over-RDMA version that is in effect.  If a peer has more or
   larger Transport Properties than can fit in the initial
   RDMA2_CONNPROP message type, it sets the RDMA2_F_TPMORE flag.  The
   final RDMA2_CONNPROP message type the peer sends on that connection
   must have a clear RDMA2_F_TPMORE flag.

   A peer may encounter properties that it does not recognize or
   support.  In such cases, the receiver ignores unsupported properties
   without generating an error response.

6.4.  Choosing a Reply Mechanism

   A Requester provisions all necessary registered memory resources for
   both an RPC Call and its matching RPC Reply.
   A Requester constructs each RPC Call; thus, it can compute the exact
   memory resources needed to send every Call.  However, the Requester
   allocates memory resources to receive the corresponding Reply before
   the Responder has constructed it.  Occasionally, it is challenging
   for the Requester to know in advance precisely what resources are
   needed to receive the Reply.

   In RPC-over-RDMA version 2, a Requester can provide a Reply chunk for
   any transaction.  The Responder can use the provided Reply chunk or
   it can decide to use another means to convey the RPC Reply.  If the
   combination of the provided Write chunk list and Reply chunk is not
   adequate to convey a Reply, the Responder SHOULD use Message
   Continuation (see Section 6.2.2.2) to send that Reply.  If even that
   is not possible, the Responder sends an RDMA2_ERROR message to the
   Requester, as described in Section 6.3.3:

   o  If the Write chunk list cannot accommodate the ULP's DDP-eligible
      data payload, the Responder sends an RDMA2_ERR_WRITE_RESOURCE
      error.

   o  If the Reply chunk cannot accommodate the parts of the Reply that
      are not DDP-eligible, the Responder sends an
      RDMA2_ERR_REPLY_RESOURCE error.

   When receiving such errors, the Requester can retry the ULP call
   using more substantial reply resources.  In cases where retrying the
   ULP request is not possible (e.g., the request is non-idempotent),
   the Requester terminates the RPC transaction and presents an error to
   the RPC consumer.

7.  Error Handling

   A receiver performs validity checks on ingress RPC-over-RDMA messages
   before it passes the conveyed RPC message to the RPC layer.  For
   example, if an ingress message is not as long as the size of struct
   rpcrdma2_hdr_prefix (20 octets), the receiver cannot trust the value
   of the rdma_xid field.
   In this case, the receiver MUST silently discard the ingress message
   without processing it further, and without a response to the sender.

   However, in many other cases, the receiver needs to actively report a
   problem with the RPC-over-RDMA message to its sender.  The
   RDMA2_ERROR message type is used for this purpose.  To form an
   RDMA2_ERROR message type:

   o  The rdma_xid field MUST contain the same XID that was in the
      rdma_xid field in the ingress request.

   o  The rdma_vers field MUST contain the same version that was in the
      rdma_vers field in the ingress request.

   o  The sender sets the rdma_credit field to the credit values in
      effect for this connection.

   o  The rdma_htype field MUST contain the value RDMA2_ERROR.

   o  The rdma_err field contains a value that reflects the type of
      error that occurred, as described in the subsections below.

   When a peer receives an RDMA2_ERROR message type with an unrecognized
   or unsupported value in its rdma_err field, it MUST silently discard
   the message without processing it further.

7.1.  Basic Transport Stream Parsing Errors

7.1.1.  RDMA2_ERR_VERS

   When a Responder detects an RPC-over-RDMA header version that it does
   not support (the current document defines version 2), it MUST respond
   with an RDMA2_ERROR message type and set its rdma_err field to
   RDMA2_ERR_VERS.  The Responder then fills in the rpcrdma2_err_vers
   structure with the RPC-over-RDMA versions it supports.  The Responder
   MUST silently discard the ingress message without passing it to the
   RPC layer.

   When a Requester receives this error, it uses the information in the
   rpcrdma2_err_vers structure to select an RPC-over-RDMA version that
   both peers support.  A Requester MUST NOT subsequently send a message
   that uses a version that the Responder has indicated it does not
   support.  RDMA2_ERR_VERS indicates a permanent error.
   Receipt of
   this error completes the RPC transaction associated with the XID in
   the rdma_xid field.

7.1.2. RDMA2_ERR_INVAL_HTYPE

   If a Responder recognizes the value in an ingress rdma_vers field,
   but it does not recognize the value in the rdma_htype field or does
   not support that header type, it MUST set the rdma_err field to
   RDMA2_ERR_INVAL_HTYPE. The Responder MUST silently discard the
   incoming message without passing it to the RPC layer.

   A Requester MUST NOT subsequently send a message that uses an htype
   that the Responder has indicated it does not support.
   RDMA2_ERR_INVAL_HTYPE indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.1.3. RDMA2_ERR_INVAL_CONT

   If a Responder detects a problem with an ingress RPC-over-RDMA
   message that is part of a Message Continuation sequence, the
   Responder MUST set the rdma_err field to RDMA2_ERR_INVAL_CONT.
   Section 6.2.2.2 discusses the types of problems to watch for. The
   Responder MUST silently discard all ingress messages with an rdma_xid
   field that matches the failing message without reassembling the
   payload.

   RDMA2_ERR_INVAL_CONT indicates a permanent error. Receipt of this
   error completes the RPC transaction associated with the XID in the
   rdma_xid field.

7.2. XDR Errors

   A receiver might encounter an XDR parsing error that prevents it from
   processing an ingress Transport stream. Examples of such errors
   include:

   o  An invalid value in the rdma_proc field.

   o  An RDMA2_NOMSG message where the Read list, Write list, and Reply
      chunk are marked not present.

   o  The value of the rdma_xid field does not match the value of the
      XID field in the accompanying RPC message.
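
   The parsing checks described in Sections 7.1 and 7.2 can be sketched
   in C as follows. This is a non-normative illustration only. The
   structure ingress_summary and the function rdma2_classify are
   hypothetical names that do not appear in this specification, and the
   choice to report the last two examples above as RDMA2_ERR_BAD_XDR is
   an assumption drawn from Section 7.2.1. Numeric constants are taken
   from the XDR in Section 8.4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Header type and error codes, copied from the XDR in Section 8.4. */
enum { RDMA2_MSG = 0, RDMA2_NOMSG = 1, RDMA2_ERROR = 4, RDMA2_CONNPROP = 5 };
enum { RDMA2_ERR_VERS = 1, RDMA2_ERR_BAD_XDR = 2, RDMA2_ERR_INVAL_HTYPE = 4 };

/* Hypothetical, already-decoded summary of one ingress message. */
struct ingress_summary {
    uint32_t rdma_xid;
    uint32_t rdma_vers;
    uint32_t rdma_htype;
    bool     has_read_list;
    bool     has_write_list;
    bool     has_reply_chunk;
    uint32_t rpc_xid;        /* XID of the inline RPC message (RDMA2_MSG) */
};

/*
 * Returns 0 when no parsing error is found, otherwise the RDMA2_ERR_*
 * value to place in rdma_err (assumed mapping, see the text above).
 */
uint32_t rdma2_classify(const struct ingress_summary *m)
{
    if (m->rdma_vers != 2)
        return RDMA2_ERR_VERS;

    switch (m->rdma_htype) {
    case RDMA2_MSG:
        /* rdma_xid must match the XID of the conveyed RPC message. */
        if (m->rpc_xid != m->rdma_xid)
            return RDMA2_ERR_BAD_XDR;
        return 0;
    case RDMA2_NOMSG:
        /* With no inline payload, at least one chunk must be present. */
        if (!m->has_read_list && !m->has_write_list && !m->has_reply_chunk)
            return RDMA2_ERR_BAD_XDR;
        return 0;
    case RDMA2_ERROR:
    case RDMA2_CONNPROP:
        return 0;
    default:
        return RDMA2_ERR_INVAL_HTYPE;
    }
}
```

   A message shorter than struct rpcrdma2_hdr_prefix never reaches a
   function like this one, because Section 7 requires the receiver to
   discard it silently.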

   When a Responder receives a valid RPC-over-RDMA header but the
   Responder's ULP implementation cannot parse the RPC arguments in the
   RPC Call, the Responder returns an RPC Reply with status
   GARBAGE_ARGS, using an RDMA2_MSG message type. This type of parsing
   failure might be due to mismatches between chunk sizes or offsets and
   the contents of the Payload stream, for example. In this case, the
   error is permanent, but the Requester has no way to know how much
   processing the Responder has completed for this RPC transaction.

7.2.1. RDMA2_ERR_BAD_XDR

   If a Responder recognizes the value in the rdma_vers field, but it
   cannot otherwise parse the ingress Transport stream, it MUST set the
   rdma_err field to RDMA2_ERR_BAD_XDR. The Responder MUST silently
   discard the ingress message without passing it to the RPC layer.

   RDMA2_ERR_BAD_XDR indicates a permanent error. Receipt of this error
   completes the RPC transaction associated with the XID in the rdma_xid
   field.

7.2.2. RDMA2_ERR_BAD_PROPVAL

   If a receiver recognizes the value in an ingress rdma_which field,
   but it cannot parse the accompanying propval, it MUST set the
   rdma_err field to RDMA2_ERR_BAD_PROPVAL (see Section 5.1). The
   receiver MUST silently discard the ingress message without applying
   any of its property settings.

7.3. Responder RDMA Operational Errors

   In RPC-over-RDMA version 2, the Responder initiates RDMA Read and
   Write operations that target the Requester's memory. Problems might
   arise as the Responder attempts to use Requester-provided resources
   for RDMA operations. For example:

   o  Usually, chunks can be validated only by using their contents to
      perform data transfers. If chunk contents are invalid (e.g., a
      memory region is no longer registered or a chunk length exceeds
      the end of the registered memory region), a Remote Access Error
      occurs.

   o  If a Requester's Receive buffer is too small, the Responder's Send
      operation completes with a Local Length Error.

   o  If the Requester-provided Reply chunk is too small to accommodate
      a large RPC Reply message, a Remote Access Error occurs. A
      Responder might detect this problem before attempting to write
      past the end of the Reply chunk.

   RDMA operational errors can be fatal to the connection. To avoid a
   retransmission loop and repeated connection loss that deadlocks the
   connection, once the Requester has re-established a connection, the
   Responder should send an RDMA2_ERROR response to indicate that no
   RPC-level reply is possible for that transaction.

7.3.1. RDMA2_ERR_READ_CHUNKS

   If a Requester presents more DDP-eligible arguments than a Responder
   is prepared to Read, the Responder MUST set the rdma_err field to
   RDMA2_ERR_READ_CHUNKS and set the rdma_max_chunks field to the
   maximum number of Read chunks the Responder can process. If the
   Responder implementation cannot handle any Read chunks for a request,
   it MUST set the rdma_max_chunks field to zero in this response. The
   Responder MUST silently discard the ingress message without
   processing it further.

   The Requester can reconstruct the Call using Message Continuation or
   a Special Format payload and resend it. If the Requester does not
   resend the Call, it MUST terminate this RPC transaction with an
   error.

7.3.2. RDMA2_ERR_WRITE_CHUNKS

   If a Requester has constructed an RPC Call with more DDP-eligible
   results than the Responder is prepared to Write, the Responder MUST
   set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS and set the
   rdma_max_chunks field to the maximum number of Write chunks the
   Responder can return. The Requester can reconstruct the Call with no
   Write chunks and a Reply chunk of appropriate size.
   If the Requester
   does not resend the Call, it MUST terminate this RPC transaction with
   an error.

   If the Responder implementation cannot handle any Write chunks for a
   request and cannot send the Reply using Message Continuation, it MUST
   return a response of RDMA2_ERR_REPLY_RESOURCE instead (see
   Section 7.3.5).

7.3.3. RDMA2_ERR_SEGMENTS

   If a Requester has constructed an RPC Call with a chunk that contains
   more segments than the Responder supports, the Responder MUST set the
   rdma_err field to RDMA2_ERR_SEGMENTS and set the rdma_max_segments
   field to the maximum number of segments the Responder can process.
   The Requester can reconstruct the Call and resend it. If the
   Requester does not resend the Call, it MUST terminate this RPC
   transaction with an error.

7.3.4. RDMA2_ERR_WRITE_RESOURCE

   If a Requester has provided a Write chunk that is not large enough to
   contain a DDP-eligible result, the Responder MUST set the rdma_err
   field to RDMA2_ERR_WRITE_RESOURCE.

   The Responder MUST set the rdma_chunk_index field to point to the
   first Write chunk in the transport header that is too short, or to
   zero to indicate that it was not possible to determine which chunk is
   too small. Indexing starts at one (1), which represents the first
   Write chunk. The Responder MUST set the rdma_length_needed field to
   the number of bytes needed in that chunk to convey the result data
   item.

   The Requester can reconstruct the Call with more reply resources and
   resend it. If the Requester does not resend the Call (for instance,
   if the Responder set the index and length fields to zero), it MUST
   terminate this RPC transaction with an error.

7.3.5. RDMA2_ERR_REPLY_RESOURCE

   If a Responder cannot send an RPC Reply using Message Continuation
   and the Reply does not fit in the Reply chunk, the Responder MUST set
   the rdma_err field to RDMA2_ERR_REPLY_RESOURCE.
   The Responder MUST
   set the rdma_length_needed field to the number of Reply chunk bytes
   needed to convey the reply.

   The Requester can reconstruct the Call with more reply resources and
   resend it. If the Requester does not resend the Call (for instance,
   if the Responder set the length field to zero), it MUST terminate
   this RPC transaction with an error.

7.4. Other Operational Errors

   While a Requester is constructing an RPC Call message, an
   unrecoverable problem might occur that prevents the Requester from
   posting further RDMA Work Requests on behalf of that message. As
   with other transports, if a Requester is unable to construct and
   transmit an RPC Call, the associated RPC transaction fails
   immediately.

   After a Requester has received a Reply, if it is unable to invalidate
   a memory region due to an unrecoverable problem, the Requester MUST
   close the connection to protect that memory from Responder access
   before the associated RPC transaction is complete.

   While a Responder is constructing an RPC Reply message or error
   message, an unrecoverable problem might occur that prevents the
   Responder from posting further RDMA Work Requests on behalf of that
   message. If a Responder is unable to construct and transmit an RPC
   Reply or RPC-over-RDMA error message, the Responder MUST close the
   connection to signal to the Requester that a reply was lost.

7.4.1. RDMA2_ERR_SYSTEM

   If some problem occurs on a Responder that does not fit into the
   above categories, the Responder MAY report it to the Requester by
   setting the rdma_err field to RDMA2_ERR_SYSTEM. The Responder MUST
   silently discard the message(s) associated with the failing
   transaction without further processing.

   RDMA2_ERR_SYSTEM is a permanent error.
   This error does not indicate
   how much of the transaction the Responder has processed, nor does it
   indicate a particular recovery action for the Requester. A Requester
   that receives this error MUST terminate the RPC transaction
   associated with the XID value in the RDMA2_ERROR message's rdma_xid
   field.

7.5. RDMA Transport Errors

   The RDMA connection and physical link provide some degree of error
   detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
   (when used over TCP), the Stream Control Transmission Protocol
   (SCTP), as well as the InfiniBand link layer [IBA] all provide Cyclic
   Redundancy Check (CRC) protection of RDMA payloads. CRC-class
   protection is a general attribute of such transports.

   Additionally, the RPC layer itself can accept errors from the
   transport and recover via retransmission. RPC recovery can typically
   handle complete loss and re-establishment of a transport connection.

   The details of reporting and recovery from RDMA link-layer errors are
   described in specific link-layer APIs and operational specifications
   and are outside the scope of this protocol specification. See
   Section 11 for further discussion of RPC-level integrity schemes.

8. XDR Protocol Definition

   This section contains a description of the core features of the RPC-
   over-RDMA version 2 protocol expressed in the XDR language [RFC4506].
   It organizes the description to make it simple to extract into a form
   that is ready to compile or combine with similar descriptions
   published later as extensions to RPC-over-RDMA version 2.

8.1. Code Component License

   Code Components extracted from the current document must include the
   following license text. When combining the extracted XDR code with
   other XDR code which has an identical license, only a single copy of
   the license text needs to be retained.

   <CODE BEGINS>

   /// /*
   ///  * Copyright (c) 2010-2018 IETF Trust and the persons
   ///  * identified as authors of the code. All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///

   <CODE ENDS>

8.2. Extraction of the XDR Definition

   Implementers can apply the following sed script to the current
   document to produce a machine-readable XDR description of the base
   RPC-over-RDMA version 2 protocol.

   <CODE BEGINS>

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

   <CODE ENDS>

   That is, if this document is in a file called "spec.txt", then
   implementers can do the following to extract an XDR description file
   and store it in the file rpcrdma-v2.x.

   <CODE BEGINS>

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
        < spec.txt > rpcrdma-v2.x

   <CODE ENDS>

   Although this file is a usable description of the base protocol, when
   extensions are to be supported, it may be desirable to divide the
   description into multiple files. The following script achieves that
   purpose:

   <CODE BEGINS>

   #!/usr/local/bin/perl
   # Split rpcrdma-v2.x into the files named by "FILE ENDS" markers.
   open(IN,"rpcrdma-v2.x");
   open(OUT,">temp.x");
   while(<IN>)
   {
     # Capture only the file name, not the trailing "; */".
     if (m/FILE ENDS: ([^;\s]+);/)
     {
       close(OUT);
       rename("temp.x", $1);
       open(OUT,">temp.x");
     }
     else
     {
       print OUT $_;
     }
   }
   close(IN);
   close(OUT);

   <CODE ENDS>

   Running the above script results in two files:

   o  The file common.x, containing the license plus the shared XDR
      definitions that need to be made available to both the base
      protocol and any subsequent extensions.

   o  The file baseops.x, containing the XDR definitions for the base
      protocol defined in this document.

   Extensions to RPC-over-RDMA version 2, published as Standards Track
   documents, should have similarly structured XDR definitions. Once an
   implementer has extracted the XDR for all desired extensions and the
   base XDR definition contained in the current document, she can
   concatenate them to produce a consolidated XDR definition that
   reflects the set of extensions selected for her RPC-over-RDMA version
   2 implementation.

   Alternatively, the XDR descriptions can be compiled separately. In
   that case, the combination of common.x and baseops.x defines the base
   transport. The combination of common.x and the XDR description of
   each extension produces a full XDR definition of that extension.

8.3. XDR Definition for RPC-over-RDMA Version 2 Core Structures

   <CODE BEGINS>
   /// /*******************************************************************
   ///  * Transport Header Prefixes
   ///  ******************************************************************/
   ///
   /// struct rpcrdma_common {
   ///         uint32         rdma_xid;
   ///         uint32         rdma_vers;
   ///         uint32         rdma_credit;
   ///         uint32         rdma_htype;
   /// };
   ///
   /// const RDMA2_F_RESPONSE = 0x00000001;
   /// const RDMA2_F_MORE     = 0x00000002;
   /// const RDMA2_F_TPMORE   = 0x00000004;
   ///
   /// struct rpcrdma2_hdr_prefix {
   ///         struct rpcrdma_common       rdma_start;
   ///         uint32                      rdma_flags;
   /// };
   ///
   /// /*******************************************************************
   ///  * Chunks and Chunk Lists
   ///  ******************************************************************/
   ///
   /// struct rpcrdma2_segment {
   ///         uint32 rdma_handle;
   ///         uint32 rdma_length;
   ///         uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///         uint32                  rdma_position;
   ///         struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///         struct rpcrdma2_read_segment rdma_entry;
   ///         struct rpcrdma2_read_list    *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///         struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///         struct rpcrdma2_write_chunk rdma_entry;
   ///         struct rpcrdma2_write_list  *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_chunk_lists {
   ///         uint32                      rdma_inv_handle;
   ///         struct rpcrdma2_read_list   *rdma_reads;
   ///         struct rpcrdma2_write_list  *rdma_writes;
   ///         struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// /*******************************************************************
   ///  * Transport Properties
   ///  ******************************************************************/
   ///
   /// /*
   ///  * Types for transport properties model
   ///  */
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///         rpcrdma2_propid rdma_which;
   ///         opaque          rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const uint32 RDMA2_PROPID_SBSIZ = 1;
   /// const uint32 RDMA2_PROPID_RBSIZ = 2;
   /// const uint32 RDMA2_PROPID_RSSIZ = 3;
   /// const uint32 RDMA2_PROPID_RCSIZ = 4;
   /// const uint32 RDMA2_PROPID_BRS = 5;
   /// const uint32 RDMA2_PROPID_HOSTAUTH = 6;
   ///
   /// /*
   ///  * Types specific to particular properties
   ///  */
   /// typedef uint32 rpcrdma2_prop_sbsiz;
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef uint32 rpcrdma2_prop_rssiz;
   /// typedef uint32 rpcrdma2_prop_rcsiz;
   /// typedef uint32 rpcrdma2_prop_brs;
   /// typedef opaque rpcrdma2_prop_hostauth<>;
   ///
   /// const uint32 RDMA2_RVRSDIR_NONE = 0;
   /// const uint32 RDMA2_RVRSDIR_SIMPLE = 1;
   /// const uint32 RDMA2_RVRSDIR_CONT = 2;
   /// const uint32 RDMA2_RVRSDIR_GENL = 3;
   ///
   /// /* FILE ENDS: common.x; */
   <CODE ENDS>

8.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types

   <CODE BEGINS>
   /// /*******************************************************************
   ///  * Descriptions of RPC-over-RDMA Header Types
   ///  ******************************************************************/
   ///
   /// /*
   ///  * Header Type Codes.
   ///  */
   /// const rpcrdma2_proc RDMA2_MSG = 0;
   /// const rpcrdma2_proc RDMA2_NOMSG = 1;
   /// const rpcrdma2_proc RDMA2_ERROR = 4;
   /// const rpcrdma2_proc RDMA2_CONNPROP = 5;
   ///
   /// /*
   ///  * Header Types to Convey RPC Messages.
   ///  */
   /// struct rpcrdma2_msg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_nomsg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   /// };
   ///
   /// /*
   ///  * Header Type to Report Errors.
   ///  */
   /// const uint32 RDMA2_ERR_VERS = 1;
   /// const uint32 RDMA2_ERR_BAD_XDR = 2;
   /// const uint32 RDMA2_ERR_BAD_PROPVAL = 3;
   /// const uint32 RDMA2_ERR_INVAL_HTYPE = 4;
   /// const uint32 RDMA2_ERR_INVAL_CONT = 5;
   /// const uint32 RDMA2_ERR_READ_CHUNKS = 6;
   /// const uint32 RDMA2_ERR_WRITE_CHUNKS = 7;
   /// const uint32 RDMA2_ERR_SEGMENTS = 8;
   /// const uint32 RDMA2_ERR_WRITE_RESOURCE = 9;
   /// const uint32 RDMA2_ERR_REPLY_RESOURCE = 10;
   /// const uint32 RDMA2_ERR_SYSTEM = 100;
   ///
   /// struct rpcrdma2_err_vers {
   ///         uint32 rdma_vers_low;
   ///         uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///         uint32 rdma_chunk_index;
   ///         uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
   ///         case RDMA2_ERR_VERS:
   ///           rpcrdma2_err_vers rdma_vrange;
   ///         case RDMA2_ERR_READ_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_WRITE_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_SEGMENTS:
   ///           uint32 rdma_max_segments;
   ///         case RDMA2_ERR_WRITE_RESOURCE:
   ///           rpcrdma2_err_write rdma_writeres;
   ///         case RDMA2_ERR_REPLY_RESOURCE:
   ///           uint32 rdma_length_needed;
   ///         default:
   ///           void;
   /// };
   ///
   /// /*
   ///  * Header Type to Exchange Transport Properties.
   ///  */
   /// struct rpcrdma2_connprop {
   ///         rpcrdma2_propset rdma_props;
   /// };
   ///
   /// /* FILE ENDS: baseops.x; */
   <CODE ENDS>

8.5. Use of the XDR Description

   The files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.
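
   As a non-normative illustration of how an implementation might use
   this description without an XDR compiler, the following C sketch
   decodes the rpcrdma2_error union by hand. The names rdma2_error_info,
   xdr_u32, and rdma2_decode_error are hypothetical and are not defined
   by this specification. The sketch assumes the buffer starts at the
   rdma_err discriminant, that is, just past the header prefix.

```c
#include <stddef.h>
#include <stdint.h>

/* Error codes copied from the XDR in Section 8.4. */
enum {
    RDMA2_ERR_VERS = 1,
    RDMA2_ERR_READ_CHUNKS = 6,
    RDMA2_ERR_WRITE_CHUNKS = 7,
    RDMA2_ERR_SEGMENTS = 8,
    RDMA2_ERR_WRITE_RESOURCE = 9,
    RDMA2_ERR_REPLY_RESOURCE = 10
};

/* Hypothetical decoded form: arg[] holds the arm's uint32 fields in
 * the order they appear in the XDR definition. */
struct rdma2_error_info {
    uint32_t err;
    uint32_t arg[2];
    size_t   nargs;
};

/* Read one XDR (big-endian) 32-bit unsigned integer. */
static uint32_t xdr_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, or -1 if the buffer is too short. */
int rdma2_decode_error(const unsigned char *p, size_t len,
                       struct rdma2_error_info *out)
{
    size_t i;

    if (len < 4)
        return -1;
    out->err = xdr_u32(p);

    switch (out->err) {
    case RDMA2_ERR_VERS:             /* rdma_vers_low, rdma_vers_high */
    case RDMA2_ERR_WRITE_RESOURCE:   /* rdma_chunk_index, rdma_length_needed */
        out->nargs = 2;
        break;
    case RDMA2_ERR_READ_CHUNKS:      /* rdma_max_chunks */
    case RDMA2_ERR_WRITE_CHUNKS:     /* rdma_max_chunks */
    case RDMA2_ERR_SEGMENTS:         /* rdma_max_segments */
    case RDMA2_ERR_REPLY_RESOURCE:   /* rdma_length_needed */
        out->nargs = 1;
        break;
    default:                         /* void arm: nothing follows */
        out->nargs = 0;
        break;
    }

    if (len < 4 + 4 * out->nargs)
        return -1;
    for (i = 0; i < out->nargs; i++)
        out->arg[i] = xdr_u32(p + 4 + 4 * i);
    return 0;
}
```

   The default arm covers both the remaining defined codes and
   unrecognized ones. A caller still has to apply the rule in Section 7
   that a message with an unrecognized rdma_err value is silently
   discarded.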

   Although this XDR description can generate encoders and decoders for
   the Transport and Payload streams, there are elements of the
   operation of RPC-over-RDMA version 2 that cannot be expressed within
   the XDR language. Implementations that use the output of an
   automated XDR processor need to provide additional code to bridge
   these gaps.

   o  The Transport stream is not a single XDR object. Instead, the
      header prefix is one XDR data item, and the rest of the header is
      a separate XDR data item. Table 2 expresses the mapping between
      the header type in the header prefix and the XDR object
      representing the header type.

   o  The relationship between the Transport stream and the Payload
      stream is not specified using XDR. Comments within the XDR text
      make clear where transported messages, described by their own XDR
      definitions, need to appear. Such data is opaque to the
      transport.

   o  Continuation of RPC messages across transport message boundaries
      requires that message assembly facilities not specifiable within
      XDR are part of transport implementations.

   o  Transport properties are constant integer values. Table 1
      expresses the mapping between each property's code point and the
      XDR typedef that represents the structure of the property's value.
      XDR does not possess the facility to express that mapping in an
      extensible way.

   The role of XDR in RPC-over-RDMA specifications is more limited than
   for protocols where the totality of the protocol is expressible
   within XDR. XDR lacks the facility to represent the embedding of
   XDR-encoded payload material. Also, the need to cleanly accommodate
   extensions has meant that those using rpcgen in their applications
   need to take an active role to provide the facilities that cannot be
   expressed within XDR.

9. RPC Bind Parameters

   Before establishing a new connection, an RPC client obtains a
   transport address for the RPC server. The means used to obtain this
   address and to open an RDMA connection is dependent on the type of
   RDMA transport and is the responsibility of each RPC protocol binding
   and its local implementation.

   RPC services typically register with a portmap or rpcbind service
   [RFC1833], which associates an RPC Program number with a service
   address. This policy is no different with RDMA transports. However,
   a distinct service address (port number) is sometimes required for
   operation on RPC-over-RDMA.

   When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses
   IP port addressing due to its layering on TCP or SCTP, port mapping
   is trivial and consists merely of issuing the port in the connection
   process. The NFS/RDMA protocol service address has been assigned
   port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP [RFC8267].

   When mapped atop InfiniBand [IBA], which uses a service endpoint
   naming scheme based on a Group Identifier (GID), a translation MUST
   be employed. One such translation is described in Annexes A3
   (Application Specific Identifiers), A4 (Sockets Direct Protocol
   (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate
   for translating IP port addressing to the InfiniBand network.
   Therefore, in this case, IP port addressing may be readily employed
   by the upper layer.

   When a mapping standard or convention exists for IP ports on an RDMA
   interconnect, there are several possibilities for each upper layer to
   consider:

   o  One possibility is to have the server register its mapped IP port
      with the rpcbind service under the netid (or netids) defined in
      [RFC8166]. An RPC-over-RDMA-aware RPC client can then resolve its
      desired service to a mappable port and proceed to connect.
      This
      method is the most flexible and compatible approach for those
      upper layers that are defined to use the rpcbind service.

   o  A second possibility is to have the RPC server's portmapper
      register itself on the RDMA interconnect at a "well-known" service
      address (on UDP or TCP, this corresponds to port 111). An RPC
      client can connect to this service address and use the portmap
      protocol to obtain a service address in response to a program
      number (e.g., an iWARP port number or an InfiniBand GID).

   o  Alternatively, an RPC client can connect to the mapped well-known
      port for the service itself, if it is appropriately defined. By
      convention, the NFS/RDMA service, when operating atop such an
      InfiniBand fabric, uses the same 20049 assignment as for iWARP.

   Historically, different RPC protocols have taken different approaches
   to their port assignments. The current document leaves the specific
   method to each RPC-over-RDMA-enabled ULB.

   [RFC8166] defines two new netid values to be used for registration of
   upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port
   translation service is available) InfiniBand [IBA]. Additional RDMA-
   capable networks MAY define their own netids, or if they provide a
   port translation, they MAY share the one defined in [RFC8166].

10. Implementation Status

   This section records the status of known implementations of the
   protocol defined by this specification at the time of posting of this
   Internet-Draft, and is based on a proposal described in [RFC7942].
   The description of implementations in this section is intended to
   assist the IETF in its decision processes in progressing drafts to
   RFCs.

   Please note that the listing of any individual implementation here
   does not imply endorsement by the IETF.
   Furthermore, no effort has
   been spent to verify the information presented here that was supplied
   by IETF contributors. This is not intended as, and must not be
   construed to be, a catalog of available implementations or their
   features. Readers are advised to note that other implementations may
   exist.

   At this time, no known implementations of the protocol described in
   the current document exist.

11. Security Considerations

11.1. Memory Protection

   A primary consideration is the protection of the integrity and
   confidentiality of host memory by an RPC-over-RDMA transport. The
   use of an RPC-over-RDMA transport protocol MUST NOT introduce
   vulnerabilities to system memory contents nor memory owned by user
   processes. Any RDMA provider used for RPC transport MUST conform to
   the requirements of [RFC5042] to satisfy these protections.

11.1.1. Protection Domains

   The use of a Protection Domain to limit the exposure of memory
   regions to a single connection is critical. Any attempt by an
   endpoint not participating in that connection to reuse memory handles
   needs to result in immediate failure of that connection. Because ULP
   security mechanisms rely on this aspect of Reliable Connected
   behavior, implementations SHOULD cryptographically authenticate
   connection endpoints.

11.1.2. Handle (STag) Predictability

   Implementations should use unpredictable memory handles for any
   operation requiring exposed memory regions. Exposing a continuously
   registered memory region allows a remote host to read or write to
   that region even when an RPC involving that memory is not underway.
   Therefore, implementations should avoid the use of persistently
   registered memory.

11.1.3. Memory Protection

   Requesters should register memory regions for remote access only when
   they are about to be the target of an RPC transaction that involves
   an RDMA Read or Write.

   Requesters should invalidate memory regions as soon as related RPC
   operations are complete. Invalidation and DMA unmapping of memory
   regions should complete before the receiver checks message integrity,
   and before the RPC consumer can use or alter the contents of the
   exposed memory region.

   An RPC transaction on a Requester can terminate before a Reply
   arrives, for example, if the RPC consumer is signaled, or a
   segmentation fault occurs. When an RPC terminates abnormally, memory
   regions associated with that RPC should be invalidated before the
   Requester reuses those regions for other purposes.

11.1.4. Denial of Service

   A detailed discussion of denial-of-service exposures that can result
   from the use of an RDMA transport appears in Section 6.4 of
   [RFC5042].

   A Responder is not obliged to pull unreasonably large Read chunks. A
   Responder can use an RDMA2_ERROR response to terminate RPCs with
   unreadable Read chunks. If a Responder transmits more data than a
   Requester is prepared to receive in a Write or Reply chunk, the RDMA
   provider typically terminates the connection. For further
   discussion, see Section 6.3.3. Such repeated connection termination
   can deny service to other users sharing the connection from the
   errant Requester.

   An RPC-over-RDMA transport implementation is not responsible for
   throttling the RPC request rate, other than to keep the number of
   concurrent RPC transactions at or under the number of credits granted
   per connection (see Section 4.2.1). A sender can trigger a self-
   denial of service by exceeding the credit grant repeatedly.
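
   The credit limit mentioned above can be pictured with the following
   non-normative C sketch. The structure rdma2_credits and both
   functions are hypothetical names. The sketch also assumes that the
   peer's current grant becomes known when a transaction completes.
   Section 4.2.1 remains the authoritative description of credit
   accounting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection credit state (illustrative names). */
struct rdma2_credits {
    uint32_t granted;    /* most recent credit grant from the peer */
    uint32_t in_flight;  /* transactions sent but not yet completed */
};

/* Claim one credit before sending; false means the caller must wait. */
bool rdma2_credit_get(struct rdma2_credits *c)
{
    if (c->in_flight >= c->granted)
        return false;
    c->in_flight++;
    return true;
}

/*
 * Release one credit when a transaction completes and record the
 * grant learned at that time (an assumption of this sketch).
 */
void rdma2_credit_put(struct rdma2_credits *c, uint32_t new_grant)
{
    if (c->in_flight > 0)
        c->in_flight--;
    c->granted = new_grant;
}
```

   A sender that checks rdma2_credit_get() before every transmission
   never exceeds the grant, which avoids the self-denial of service
   described above.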

   When an RPC transaction terminates due to a signal or premature exit
   of an application process, a Requester should invalidate the RPC's
   Write and Reply chunks. Invalidation prevents the subsequent arrival
   of the Responder's Reply from altering the memory regions associated
   with those chunks after the Requester has released that memory.

   On the Requester, a malfunctioning application or a malicious user
   can create a situation where RPCs initiate and abort continuously,
   resulting in Responder replies that terminate the underlying RPC-
   over-RDMA connection repeatedly. Such situations can deny service to
   other users sharing the connection from that Requester.

11.2. RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework
   [RFC7861]. RPCSEC_GSS implements message authentication
   (rpc_gss_svc_none), per-message integrity checking
   (rpc_gss_svc_integrity), and per-message confidentiality
   (rpc_gss_svc_privacy) in a layer above the RPC-over-RDMA transport.
   The integrity and privacy services require significant computation
   and movement of data on each endpoint host. Some performance
   benefits enabled by RDMA transports can be lost.

11.2.1. RPC-over-RDMA Protection at Other Layers

   For any RPC transport, utilizing RPCSEC_GSS integrity or privacy
   services has performance implications. Protection below the RPC
   implementation is often a better choice in performance-sensitive
   deployments, especially if it, too, can be offloaded. Certain
   implementations of IPsec can be co-located in RDMA hardware, for
   example, without change to RDMA consumers and with little loss of
   data movement efficiency. Such arrangements can also provide a
   higher degree of privacy by hiding endpoint identity or altering the
   frequency at which messages are exchanged, at a performance cost.

   Implementations MAY negotiate the use of protection in another layer
   through the use of an RPCSEC_GSS security flavor defined in [RFC7861]
   in conjunction with the Channel Binding mechanism [RFC5056] and IPsec
   Channel Connection Latching [RFC5660].

11.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports

   Not all RDMA devices and fabrics support the above protection
   mechanisms. Also, NFS clients, where multiple users can access NFS
   files, still require per-message authentication. In these cases,
   RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
   connections.

   RPCSEC_GSS extends the ONC RPC protocol without changing the format
   of RPC messages. By observing the conventions described in this
   section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected
   RPC messages interoperably.

   Senders MUST NOT reduce protocol elements of RPCSEC_GSS that appear
   in the Payload stream of an RPC-over-RDMA message. Such elements
   include control messages exchanged as part of establishing or
   destroying a security context, or data items that are part of
   RPCSEC_GSS authentication material.

11.2.2.1.  RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to
   establish a Generic Security Service (GSS) context for NFS operation.
   Such clients use TCP and the standard NFS port (2049) for context
   establishment. Therefore, an NFS server MUST also provide a TCP-
   based NFS service on port 2049 to enable the use of RPCSEC_GSS with
   NFS/RDMA.

11.2.2.2.  RPC-over-RDMA with RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the DDP-
   eligibility of data items in a ULP.

   However, RPCSEC_GSS authentication material appearing in an RPC
   message header can be larger than, say, an AUTH_SYS authenticator.
   In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester
   needs to accommodate a larger RPC credential when marshaling RPC
   Calls and needs to provide for a maximum size RPCSEC_GSS verifier
   when allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are larger on average as a
   result. ULP operations that fit in a Simple Format message when a
   simpler form of authentication is in use might need to be reduced or
   conveyed via a Special Format message when RPCSEC_GSS authentication
   is in use. It is therefore more likely that a Requester provides
   both a Read list and a Reply chunk in the same RPC-over-RDMA
   Transport header to convey a Special Format Call and provision a
   receptacle for a Special Format Reply.

   In addition to this cost, the XDR encoding and decoding of each RPC
   message using RPCSEC_GSS authentication requires per-message host
   compute resources to construct the GSS verifier.

11.2.2.3.  RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect the
   modification of RPC messages in flight. The RPCSEC_GSS privacy
   service prevents all but the intended recipient from viewing the
   cleartext content of RPC arguments and results. RPCSEC_GSS integrity
   and privacy services are end-to-end. They protect RPC arguments and
   results from application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC
   messages after they have been XDR encoded, and before they have been
   XDR decoded after receipt. Connection endpoints use intermediate
   buffers to prevent exposure of encrypted or unverified cleartext data
   to RPC consumers. After a sender has verified, encrypted, and
   wrapped a message, the transport layer MAY use RDMA data transfer
   between these intermediate buffers.

   The process of reducing a DDP-eligible data item removes the data
   item and its XDR padding from an encoded Payload stream. In a non-
   protected RPC-over-RDMA message, a reduced data item does not include
   XDR padding. After reduction, the Payload stream contains fewer
   octets than the whole XDR stream did beforehand. XDR padding octets
   are often zero bytes, but they don't have to be. Thus, reducing DDP-
   eligible items affects the result of message integrity verification
   and encryption.

   Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
   integrity or encryption services are in use. Effectively, no data
   item is DDP-eligible in this situation. Senders can use only Simple
   and Continued Formats without chunks, or Special Format. In this
   mode, an RPC-over-RDMA transport operates in the same manner as a
   transport that does not support DDP.

11.2.2.4.  Protecting RPC-over-RDMA Transport Headers

   Like the header fields in an RPC message (e.g., the xid and mtype
   fields), RPCSEC_GSS does not protect the RPC-over-RDMA Transport
   stream. XIDs, connection credit limits, and chunk lists (though not
   the content of the data items they refer to) are exposed to malicious
   behavior, which can redirect data that is transferred by the RPC-
   over-RDMA message, result in spurious retransmits, or trigger
   connection loss.

   In particular, if an attacker alters the information contained in the
   chunk lists of an RPC-over-RDMA Transport header, data contained in
   those chunks can be redirected to other registered memory regions on
   Requesters. An attacker might alter the arguments of RDMA Read and
   RDMA Write operations on the wire to gain a similar effect. If such
   alterations occur, the use of RPCSEC_GSS integrity or privacy
   services enables a Requester to detect unexpected material in a
   received RPC message.
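
   As a side illustration of the point made in Section 11.2.2.3 about
   reduction and integrity checking, the following sketch encodes a
   5-octet opaque item with its XDR padding and compares a checksum of
   the whole stream with a checksum of the reduced stream. It rests on
   stated assumptions: the prefix words are arbitrary stand-ins for the
   preceding XDR items, the length word is assumed to remain in the
   Payload stream after reduction, and SHA-256 merely stands in for a
   GSS integrity checksum.

```python
import hashlib
import struct


def xdr_pad(n):
    """Number of padding octets XDR adds after n octets of opaque data."""
    return (4 - n % 4) % 4


def encode_opaque(data):
    """Variable-length opaque: length word, data, then zero padding."""
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad(len(data))


# Arbitrary stand-in for the XDR items that precede the data item.
prefix = struct.pack(">II", 1, 2)
item = b"hello"                      # 5 octets, so 3 pad octets follow

whole_stream = prefix + encode_opaque(item)

# Assumption: after reduction only the length word stays in the Payload
# stream; the data and its padding are no longer present there.
reduced_stream = prefix + struct.pack(">I", len(item))

print(len(whole_stream), len(reduced_stream))    # 20 12

# SHA-256 is only a stand-in here; it is not the RPCSEC_GSS checksum.
digest_whole = hashlib.sha256(whole_stream).hexdigest()
digest_reduced = hashlib.sha256(reduced_stream).hexdigest()
print(digest_whole != digest_reduced)            # True
```

   Because the two streams differ in length and content, a checksum
   computed by one peer over the whole stream cannot match one computed
   by the other peer over the reduced stream.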

   Encryption at other layers, as described in Section 11.2.1, protects
   the content of the Transport stream. RDMA transport implementations
   should conform to [RFC5042] to address attacks on RDMA protocols
   themselves.

11.3.  Transport Properties

   Like other fields that appear in the Transport stream, transport
   properties are sent in the clear with no integrity protection, making
   them vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size, it could reduce connection performance or
   trigger loss of connection. Repeated connection loss can impact
   performance or even prevent a new connection from being established.
   The recourse is to deploy on a private network or use transport layer
   encryption.

11.4.  Host Authentication

   [ cel: This subsection is unfinished. ]

   Wherein we use the relevant sections of [RFC3552] to analyze the
   addition of host authentication to this RPC-over-RDMA transport.

   The authors refer readers to Appendix C of [RFC8446] for information
   on how to design and test a secure authentication handshake
   implementation.

12.  IANA Considerations

   The RPC-over-RDMA family of transports has been assigned RPC netids
   by [RFC8166]. A netid is an rpcbind [RFC1833] string used to
   identify the underlying protocol in order for RPC to select
   appropriate transport framing and the format of the service addresses
   and ports.

   The following netid registry strings are already defined for this
   purpose:

      NC_RDMA  "rdma"
      NC_RDMA6 "rdma6"

   The "rdma" netid is to be used when IPv4 addressing is employed by
   the underlying transport, and "rdma6" when IPv6 addressing is
   employed. The netid assignment policy and registry are defined in
   [RFC5665]. The current document does not alter these netid
   assignments.
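
   The choice between the two netids can be sketched as follows. The
   helper name netid_for is hypothetical, and the addresses are
   documentation examples; only the "rdma" and "rdma6" strings come
   from the text above.

```python
import ipaddress

NC_RDMA = "rdma"
NC_RDMA6 = "rdma6"


def netid_for(address):
    """Pick the netid that matches the address family of the endpoint."""
    ip = ipaddress.ip_address(address)   # raises ValueError if malformed
    return NC_RDMA6 if ip.version == 6 else NC_RDMA


print(netid_for("192.0.2.10"))     # rdma
print(netid_for("2001:db8::1"))    # rdma6
```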

   These netids MAY be used for any RDMA network that satisfies the
   requirements of Section 3.2.2 and that is able to identify service
   endpoints using IP port addressing, possibly through use of a
   translation service as described in Section 9.

13.  References

13.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching",
              RFC 5660, DOI 10.17487/RFC5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call
              (RPC) Network Identifiers and Universal Address Formats",
              RFC 5665, DOI 10.17487/RFC5665, January 2010.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", BCP 205,
              RFC 7942, DOI 10.17487/RFC7942, July 2016.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
              to RPC-over-RDMA Version 1", RFC 8267,
              DOI 10.17487/RFC8267, October 2017.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
              Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018.

13.2.  Informative References

   [CBFC]     Kung, H., Blackwell, T., and A. Chapman, "Credit-Based
              Flow Control for ATM Networks: Credit Update Protocol,
              Adaptive Credit Allocation, and Statistical Multiplexing",
              Proc. ACM SIGCOMM '94 Symposium on Communications
              Architectures, Protocols and Applications, pp. 101-114,
              August 1994.

   [I-D.ietf-nfsv4-rpc-tls]
              Lever, C. and T. Myklebust, "Towards Remote Procedure Call
              Encryption By Default", draft-ietf-nfsv4-rpc-tls-05 (work
              in progress), January 2020.

   [IBA]      InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

              Available from https://www.infinibandta.org/

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              DOI 10.17487/RFC3552, July 2003.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement",
              RFC 5532, DOI 10.17487/RFC5532, May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on RPC-
              over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
              June 2017.

Appendix A.  ULB Specifications

   Typically, an Upper-Layer Protocol (ULP) is defined without regard to
   a particular RPC transport. An Upper-Layer Binding (ULB)
   specification provides guidance that helps a ULP interoperate
   correctly and efficiently over a particular transport. For RPC-over-
   RDMA version 2, a ULB may provide:

   o  A taxonomy of XDR data items that are eligible for DDP

   o  Constraints on which upper-layer procedures a sender may reduce,
      and on how many chunks may appear in a single RPC message

   o  A method enabling a Requester to determine the maximum size of the
      reply Payload stream for all procedures in the ULP

   o  An rpcbind port assignment for the RPC Program and Version when
      operating on the particular transport

   Each RPC Program and Version tuple that operates on RPC-over-RDMA
   version 2 needs to have a ULB specification.

A.1.  DDP-Eligibility

   A ULB designates specific XDR data items as eligible for DDP. As a
   sender constructs an RPC-over-RDMA message, it can remove DDP-
   eligible data items from the Payload stream so that the RDMA provider
   can place them directly in the receiver's memory. An XDR data item
   should be considered for DDP-eligibility if there is a clear benefit
   to moving the contents of the item directly from the sender's memory
   to the receiver's memory.

   Criteria for DDP-eligibility include:

   o  The XDR data item is frequently sent or received, and its size is
      often much larger than typical inline thresholds.

   o  If the XDR data item is a result, its maximum size must be
      predictable in advance by the Requester.

   o  Transport-level processing of the XDR data item is not needed.
      For example, the data item is an opaque byte array, which requires
      no XDR encoding and decoding of its content.

   o  The content of the XDR data item is sensitive to address
      alignment. For example, a data copy operation would be required
      on the receiver to enable the message to be parsed correctly, or
      to enable the data item to be accessed.

   o  The XDR data item itself does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible,
   a ULB may limit the use of chunks to particular upper-layer
   procedures. If more than one data item in a procedure is DDP-
   eligible, the ULB may limit the number of chunks that a Requester can
   provide for a particular upper-layer procedure.

   Senders never reduce data items that are not DDP-eligible. Such data
   items can, however, be part of a Special Format payload.

   The programming interface by which an upper-layer implementation
   indicates the DDP-eligibility of a data item to the RPC transport is
   not described by this specification. The only requirements are that
   the receiver can re-assemble the transmitted RPC-over-RDMA message
   into a valid XDR stream and that DDP-eligibility rules specified by
   the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR
   language. The only definitive specification of DDP-eligibility is a
   ULB.

   In general, a DDP-eligibility violation occurs when:

   o  A Requester reduces a non-DDP-eligible argument data item. The
      Responder reports the violation as described in Section 6.3.3.

   o  A Responder reduces a non-DDP-eligible result data item. The
      Requester terminates the pending RPC transaction and reports an
      appropriate permanent error to the RPC consumer.

   o  A Responder does not reduce a DDP-eligible result data item into
      an available Write chunk. The Requester terminates the pending
      RPC transaction and reports an appropriate permanent error to the
      RPC consumer.

A.2.  Maximum Reply Size

   When expecting small and moderately-sized Replies, a Requester should
   rely on Message Continuation rather than provision a Reply chunk.
   For each ULP procedure where there is no clear Reply size maximum and
   the maximum can be substantial, the ULB should specify a dependable
   means for determining the maximum Reply size.

A.3.  Reverse-Direction Operation

   The direction of operation does not preclude the need for DDP-
   eligibility statements.

   Reverse-direction operation occurs on an already-established
   connection. Specification of RPC binding parameters is usually not
   necessary in this case.

   Other considerations may apply when distinct RPC Programs share an
   RPC-over-RDMA transport connection concurrently.

A.4.  Additional Considerations

   There may be other details provided in a ULB.

   o  A ULB may recommend inline threshold values or other transport-
      related parameters for RPC-over-RDMA version 2 connections bearing
      that ULP.

   o  A ULP may provide a means to communicate transport-related
      parameters between peers.

   o  Multiple ULPs may share a single RPC-over-RDMA version 2
      connection when their ULBs allow the use of RPC-over-RDMA version
      2 and the rpcbind port assignments for those protocols permit
      connection sharing. In this case, the same transport parameters
      (such as inline threshold) apply to all ULPs using that
      connection.

   Each ULB needs to be designed to allow correct interoperation without
   regard to the transport parameters actually in use. Furthermore,
   implementations of ULPs must be designed to interoperate correctly
   regardless of the connection parameters in effect on a connection.

A.5.  ULP Extensions

   An RPC Program and Version tuple may be extensible. For instance,
   the RPC version number may not reflect a ULP minor versioning scheme,
   or the ULP may allow the specification of additional features after
   the publication of the original RPC Program specification.
   ULBs are provided for interoperable RPC Programs and Versions by
   extending existing ULBs to reflect the changes made necessary by each
   addition to the existing XDR.

   [ cel: The final sentence is unclear, and may be inaccurate. I
   believe I copied this section directly from RFC 8166. Is there more
   to be said, now that we have some experience? ]

Appendix B.  Extending RPC-over-RDMA Version 2

   This Appendix is not addressed to protocol implementers, but rather
   to authors of documents that intend to extend the protocol specified
   in the current document.

   RPC-over-RDMA version 2 extensibility facilitates limited extensions
   to the base protocol presented in the current document so that new
   optional capabilities can be introduced without a protocol version
   change, while maintaining robust interoperability with existing RPC-
   over-RDMA version 2 implementations. It allows extensions to be
   defined, including the definition of new protocol elements, without
   requiring modification or recompilation of the XDR for the base
   protocol.

   Standards Track documents may introduce extensions to the base RPC-
   over-RDMA version 2 protocol in two ways:

   o  They may introduce new OPTIONAL transport header types.
      Appendix B.2 covers such transport header types.

   o  They may define new OPTIONAL transport properties. Appendix B.4
      describes such transport properties.

   These documents may also add the following sorts of ancillary
   protocol elements to the protocol to support the addition of new
   transport properties and header types:

   o  They may create new error codes, as described in Appendix B.5.

   o  They may define new flags to use within the rdma_flags field, as
      discussed in Appendix B.3.

   New capabilities can be proposed and developed independently of each
   other.
   Implementers can choose among them, making it straightforward
   to create and document experimental features and then bring them
   through the standards process.

B.1.  Documentation Requirements

   As described earlier, a Standards Track document introduces a set of
   new protocol elements. Together these elements are considered an
   OPTIONAL feature. Each implementation is either aware of all the
   protocol elements introduced by that feature or is aware of none of
   them.

   Documents specifying extensions to RPC-over-RDMA version 2 should
   contain:

   o  An explanation of the purpose and use of each new protocol
      element.

   o  An XDR description including all of the new protocol elements, and
      a script to extract it.

   o  A discussion of interactions with other extensions. This
      discussion includes requirements for other OPTIONAL features to be
      present, or that a particular level of support for an OPTIONAL
      facility is required.

   Implementers combine the XDR descriptions of the new features they
   intend to use with the XDR description of the base protocol in the
   current document. This combination is necessary to create a valid
   XDR input file because extensions are free to use XDR types defined
   in the base protocol, and later extensions may use types defined by
   earlier extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol
   combined with that for any selected extensions should provide a
   human-readable and compilable definition of the extended protocol.

B.2.  Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are defined in a manner similar to
   Sections 6.3.1 through 6.3.4. In particular, what is needed is:

   o  A description of the function and use of the new header type.

   o  A complete XDR description of the new header type.

   o  A description of how receivers report errors, including mechanisms
      for reporting errors outside the choices already available in the
      base protocol or other extensions.

   o  An indication of whether a Payload stream must be present, and a
      description of its contents and how receivers use such Payload
      streams to reconstruct RPC messages.

   The OPTIONAL status of new transport header types makes additional
   documentation necessary:

   o  The document should discuss constraints on support for the new
      header types. For example, if support for one header type is
      implied or foreclosed by another one, this needs to be documented.

   o  The document should describe the preferred method by which a
      sender determines whether its peer supports a particular header
      type. It is always possible to send a test invocation of a
      particular header type to see if support is available. However,
      when more efficient means are available (e.g., the value of a
      transport property), this should be noted.

B.3.  Adding New Header Flags to the Protocol

   New flag bits are to be defined in a manner similar to Sections
   6.2.2.1 and 6.2.2.2. Each new flag definition should include:

   o  An XDR description of the new flag.

   o  A description of the function and use of the new flag.

   o  An indication of the header types for which the flag value is
      meaningful, and of those for which it is an error to set the flag
      or to leave it unset.

   o  A means to determine whether peers are prepared to receive
      transport headers with the new flag set.

   The OPTIONAL status of new flags makes additional documentation
   necessary:

   o  The document should discuss constraints on support for the new
      flags. For example, if support for one flag is implied or
      foreclosed by another one, this needs to be documented.

B.4.  Adding New Transport Properties to the Protocol

   A Standards Track document defining a new transport property should
   include the following information paralleling that provided in this
   document for the transport properties defined herein:

   o  The rpcrdma2_propid value identifying the new property.

   o  The XDR typedef specifying the structure of its property value.

   o  A description of the new property.

   o  An explanation of how the receiver can use this information.

   o  The default value if a peer never receives the new property.

   There is no requirement that propid assignments occur in a contiguous
   range of values. Implementations should not rely on all such values
   being small integers.

   Before the defining Standards Track document is published, the nfsv4
   Working Group should select a unique propid value, and ensure that:

   o  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   o  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 8.2.

   [ cel: There is no section 8.2. ]

   [ cel: Should we request the creation of an IANA registry for
   propid values? ]

   When a Standards Track document proposes additional transport
   properties, reviewers should deal with possible security issues
   exposed by those new transport properties.

B.5.  Adding New Error Codes to the Protocol

   The same Standards Track document that defines a new header type may
   introduce new error codes used to support it. A Standards Track
   document may similarly define new error codes that an existing header
   type can return.
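
   Returning briefly to the guidance in Appendix B.4 on sparse propid
   values and default values, one way an implementation might hold
   transport properties is a table keyed by propid rather than an array
   indexed by it. Everything in this sketch is hypothetical: the two
   property identifiers, their default values, and the choice to ignore
   unrecognized identifiers are assumptions, not protocol elements.

```python
# Hypothetical property identifiers and defaults, chosen only to show
# that identifiers can be sparse; none of these values come from the
# protocol.
PROP_EXAMPLE_SIZE = 1
PROP_EXAMPLE_FLAG = 0x40000007

DEFAULTS = {
    PROP_EXAMPLE_SIZE: 4096,
    PROP_EXAMPLE_FLAG: False,
}


def effective_properties(received):
    """Overlay properties received from the peer on the defaults.

    Identifiers this implementation does not recognize are skipped
    (an assumption of this sketch).
    """
    effective = dict(DEFAULTS)
    for propid, value in received.items():
        if propid in DEFAULTS:
            effective[propid] = value
    return effective


props = effective_properties({PROP_EXAMPLE_FLAG: True, 0x7FFFFFFF: 123})
print(props[PROP_EXAMPLE_SIZE], props[PROP_EXAMPLE_FLAG])   # 4096 True
```

   Keying by identifier means a large or non-adjacent propid costs
   nothing extra, and a property the peer never sent simply keeps its
   default value.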

   For error codes that do not require the return of additional
   information, a peer can use the existing RDMA2_ERROR header type to
   report the new error. The sender sets the new error code as the
   value of rdma_err with the result that the default switch arm of the
   rpcrdma2_error (i.e., void) is selected.

   For error codes that do require the return of related information
   together with the error, a new header type should be defined that
   returns the error together with the related information. The sender
   of a new header type needs to be prepared to accept header types
   necessary to report associated errors.

Appendix C.  Differences from RPC-over-RDMA Version 1

   The goal of RPC-over-RDMA version 2 is to relieve certain constraints
   that have become evident in RPC-over-RDMA version 1 with deployment
   experience:

   o  RPC-over-RDMA version 1 has been challenging to update to address
      shortcomings or improve data transfer efficiency.

   o  The average size of NFSv4 COMPOUNDs is significantly greater than
      that of NFSv3 requests, requiring the use of Long messages for
      frequent operations.

   o  Reply size estimation is awkward more often than expected.

   This section details specific changes in RPC-over-RDMA version 2 that
   address these constraints directly.

C.1.  Changes to the XDR Definition

   Several XDR structural changes enable within-version protocol
   extensibility.

   [RFC8166] defines the RPC-over-RDMA version 1 transport header as a
   single XDR object, with an RPC message potentially following it. In
   RPC-over-RDMA version 2, there are separate XDR definitions of the
   transport header prefix (see Section 6.2.2), which specifies the
   transport header type to be used, and the transport header itself
   (defined within one of the subsections of Section 6.3). This
   construction is similar to an RPC message, which consists of an RPC
   header (defined in [RFC5531]) followed by a message defined by an
   Upper-Layer Protocol.

   As a new version of the RPC-over-RDMA transport protocol, RPC-over-
   RDMA version 2 exists within the versioning rules defined in
   [RFC8166]. In particular, it maintains the first four words of the
   protocol header, as specified in Section 4.2 of [RFC8166], even
   though, as explained in Section 6.2.1 of the current document, the
   XDR definition of those words is structured differently.

   Although each of the first four fields retains its semantic function,
   there are differences in interpretation:

   o  The first word of the header, the rdma_xid field, retains the
      format and function that it had in RPC-over-RDMA version 1.
      Because RPC-over-RDMA version 2 messages can convey non-RPC
      messages, a receiver should not use the contents of this field
      without consideration of the protocol version and header type.

   o  The second word of the header, the rdma_vers field, retains the
      format and function that it had in RPC-over-RDMA version 1. To
      clearly distinguish version 1 and version 2 messages, senders need
      to fill in the correct version (fixed after version negotiation).
      Receivers should check that the content of the rdma_vers field is
      correct before using the content of any other header field.

   o  The third word of the header, the rdma_credit field, retains the
      size and general purpose that it had in RPC-over-RDMA version 1.
      However, RPC-over-RDMA version 2 divides this field into two
      16-bit subfields. See Section 4.2.1 for further details.

   o  The fourth word of the header, previously the union discriminator
      field rdma_proc, retains its format and general function even
      though the set of valid values has changed.
      Within RPC-over-RDMA version 2, this word is the rdma_htype field
      of the structure rdma_start. The value of this field is now an
      unsigned 32-bit integer rather than an enum type, to facilitate
      header type extension.

   Beyond conforming to the restrictions specified in [RFC8166], RPC-
   over-RDMA version 2 tightly limits the scope of the changes made to
   ensure interoperability. Version 2 retains all existing transport
   header types used in version 1, as defined in [RFC8166]. It also
   expresses chunks in the same format and uses them the same way.

C.2.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for exchanging an
   implementation's operational properties. The purpose of this
   exchange is to help endpoints improve the efficiency of data transfer
   by exploiting the characteristics of both peers rather than falling
   back on the lowest common denominator default settings. A full
   discussion of transport properties appears in Section 5.

C.3.  Credit Management Changes

   RPC-over-RDMA transports employ credit-based flow control to ensure
   that a Requester does not emit more RDMA Sends than the Responder is
   prepared to receive.

   Section 3.3.1 of [RFC8166] explains the operation of RPC-over-RDMA
   version 1 credit management in detail. In that design, each RDMA
   Send from a Requester contains an RPC Call with a credit request, and
   each RDMA Send from a Responder contains an RPC Reply with a credit
   grant. The credit grant implies that enough Receives have been
   posted on the Responder to handle the credit grant minus the number
   of pending RPC transactions (the number of remaining Receive buffers
   might be zero).

   Each RPC Reply acts as an implicit ACK for a previous RPC Call from
   the Requester. Without an RPC Reply message, the Requester has no
   way to know that the Responder is ready for subsequent RPC Calls.
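
   The version 1 arithmetic described above can be written out as a
   short sketch. The function name remaining_receives and the sample
   numbers are hypothetical; the calculation follows the "credit grant
   minus pending RPC transactions" wording of the preceding paragraphs.

```python
def remaining_receives(credit_grant, pending_rpcs):
    """Receive buffers left on the Responder under the version 1 model."""
    return max(credit_grant - pending_rpcs, 0)


grant = 4
pending = 4
before = remaining_receives(grant, pending)   # Requester has to wait
pending -= 1                                  # one RPC Reply arrives
after = remaining_receives(grant, pending)    # one more Call can go out
print(before, after)                          # 0 1
```

   The count only rises when a Reply arrives, which is the implicit ACK
   behavior that ties each RDMA Send to exactly one RPC message.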
   Because version 1 embeds credit management in each message, there is
   a strict one-to-one ratio between RDMA Send and RPC message.  There
   are interesting use cases that might be enabled if this relationship
   were more flexible:

   o  RPC-over-RDMA operations that do not carry an RPC message, e.g.,
      control plane operations.

   o  A single RDMA Send that conveys more than one RPC message, e.g.,
      for interrupt mitigation.

   o  An RPC message that requires several sequential RDMA Sends, e.g.,
      to reduce the use of explicit RDMA operations for moderate-sized
      RPC messages.

   o  An RPC transaction that requires multiple exchanges or an odd
      number of RPC-over-RDMA operations to complete.

   RPC-over-RDMA version 2 provides a more sophisticated credit
   accounting mechanism to address these shortcomings.  Section 4.2.1
   explains the new mechanism in detail.

C.4.  Inline Threshold Changes

   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed on an RDMA connection using only RDMA Send and
   Receive.  Each connection has two inline threshold values: one for
   messages flowing from client-to-server (referred to as the
   "client-to-server inline threshold") and one for messages flowing
   from server-to-client (referred to as the "server-to-client inline
   threshold").

   A connection's inline thresholds determine, among other things, when
   RDMA Read or Write operations are required because an RPC message
   cannot be conveyed via a single RDMA Send and Receive pair.  When an
   RPC message does not contain DDP-eligible data items, a Requester can
   prepare a Special Format Call or Reply to convey the whole RPC
   message using RDMA Read or Write operations.

   RDMA Read and Write operations require that data payloads reside in
   memory registered with the local RNIC.
   When an RPC completes, that memory is invalidated to fence it from
   the Responder.  Memory registration and invalidation typically have
   a latency cost that is insignificant compared to data handling
   costs.

   When a data payload is small, however, the cost of registering and
   invalidating memory where the payload resides becomes a significant
   part of total RPC latency.  Therefore, the most efficient operation
   of an RPC-over-RDMA transport occurs when the peers use explicit RDMA
   Read and Write operations for large payloads but avoid those
   operations for small payloads.

   When the authors of [RFC8166] first conceived RPC-over-RDMA version
   1, the average size of RPC messages that did not involve a
   significant data payload was under 500 bytes.  A 1024-byte inline
   threshold adequately minimized the frequency of inefficient Long
   messages.

   With NFS version 4 [RFC7530], the increased size of NFS COMPOUND
   operations resulted in RPC messages that are, on average, larger than
   those of previous versions of NFS.  With a 1024-byte inline
   threshold, frequent operations such as GETATTR and LOOKUP require
   RDMA Read or Write operations, reducing the efficiency of data
   transport.

   To reduce the frequency of Special Format messages, RPC-over-RDMA
   version 2 increases the default size of inline thresholds.  This
   change also increases the maximum size of reverse-direction RPC
   messages.

C.5.  Message Continuation Changes

   In addition to a larger default inline threshold, RPC-over-RDMA
   version 2 introduces Message Continuation.  Message Continuation is a
   mechanism that enables the transmission of a data payload using more
   than one RDMA Send.
   The purpose of Message Continuation is to provide relief in several
   essential cases:

   o  If a Requester finds that it is inefficient to convey a
      moderately-sized data payload using Read chunks, the Requester can
      use Message Continuation to send the RPC Call.

   o  If a Requester has provided insufficient Reply chunk space for a
      Responder to send an RPC Reply, the Responder can use Message
      Continuation to send the RPC Reply.

   o  If a sender has to convey a sizeable non-RPC data payload (e.g., a
      large transport property), the sender can use Message Continuation
      to avoid having to register memory.

C.6.  Host Authentication Changes

   For the general operation of NFS on open networks, we eventually
   intend to rely on RPC-on-TLS [I-D.ietf-nfsv4-rpc-tls] to provide
   cryptographic authentication of the two ends of each connection.  In
   turn, this can improve the trustworthiness of AUTH_SYS-style user
   identities that flow on TCP, which are not cryptographically
   protected.  We do not have a similar solution for RPC-over-RDMA,
   however.

   Here, the RDMA transport layer already provides a strong guarantee of
   message integrity.  On some network fabrics, IPsec or TLS can protect
   the privacy of in-transit data.  However, this is not the case for
   all fabrics (e.g., InfiniBand [IBA]).

   Thus, RPC-over-RDMA version 2 introduces a mechanism for
   authenticating connection peers (see Section 5.2.6).  And like GSS
   channel binding, there is also a way to determine when the use of
   host authentication is unnecessary.

C.7.  Support for Remote Invalidation

   When an RDMA consumer uses FRWR or Memory Windows to register memory,
   that memory may be invalidated remotely [RFC5040].  These mechanisms
   are available when a Requester's RNIC supports MEM_MGT_EXTENSIONS.

   For this discussion, there are two classes of STags.
   Dynamically-registered STags appear in a single RPC, then are
   invalidated.  Persistently-registered STags survive longer than one
   RPC.  They may persist for the life of an RPC-over-RDMA connection or
   even longer.

   An RPC-over-RDMA Requester can provide more than one STag in a
   transport header.  It may provide a combination of dynamically- and
   persistently-registered STags in one RPC message, or any combination
   of these in a series of RPCs on the same connection.  Only
   dynamically-registered STags using Memory Windows or FRWR may be
   invalidated remotely.

   There is no transport-level mechanism by which a Responder can
   determine how a Requester-provided STag was registered, nor whether
   it is eligible to be invalidated remotely.  A Requester that mixes
   persistently- and dynamically-registered STags in one RPC, or mixes
   them across RPCs on the same connection, must, therefore, indicate
   which STag the Responder may invalidate remotely via a mechanism
   provided in the Upper-Layer Protocol.  RPC-over-RDMA version 2
   provides such a mechanism.

   A sender uses the RDMA Send With Invalidate operation to invalidate
   an STag on the remote peer.  It is available only when both peers
   support MEM_MGT_EXTENSIONS (can send and process an IETH).

   Existing RPC-over-RDMA transport protocol specifications [RFC8166]
   [RFC8167] do not forbid direct data placement in the reverse
   direction.  However, there is currently no Upper-Layer Protocol that
   makes data items in reverse direction operations eligible for direct
   data placement.

   When chunks are present in a reverse direction RPC request, Remote
   Invalidation enables the Responder to trigger invalidation of a
   Requester's STags as part of sending an RPC Reply, the same way as is
   done in the forward direction.
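   The eligibility rules in this section can be summarized with a small
   sketch.  It is illustrative only and does not describe the wire
   mechanism that RPC-over-RDMA version 2 defines.  The names Peer,
   pick_send_operation, and invalidatable_stags are hypothetical, and
   choosing the first eligible STag is an arbitrary assumption made to
   keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    # True when the peer's RNIC supports MEM_MGT_EXTENSIONS.
    supports_mem_mgt_extensions: bool


def pick_send_operation(requester, responder, invalidatable_stags):
    """Choose the operation a Responder uses to send an RPC Reply.

    invalidatable_stags lists only the STags that the Requester has
    indicated are eligible for remote invalidation.
    """
    both_capable = (requester.supports_mem_mgt_extensions
                    and responder.supports_mem_mgt_extensions)
    if both_capable and invalidatable_stags:
        # A Send With Invalidate names one STag; this sketch assumes the
        # Requester invalidates any remaining STags locally.
        return ("SEND_WITH_INVALIDATE", invalidatable_stags[0])
    # Otherwise use a plain Send and leave invalidation to the Requester.
    return ("SEND", None)
```

   The sketch is written in terms of the Requester and Responder roles
   rather than client and server, so the same check applies to either
   direction of operation.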
   However, in the reverse direction, the server acts as the Requester,
   and the client is the Responder.  The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered its
   STags with an appropriate registration mechanism.

C.8.  Error Reporting Changes

   RPC-over-RDMA version 2 expands the repertoire of errors that
   connection peers may report to each other.  The goals of this
   expansion are:

   o  To fill in details of peer recovery actions.

   o  To enable retrying certain conditions caused by mis-estimation of
      the maximum reply size.

   o  To minimize the likelihood of a Requester waiting forever for a
      Reply when there are communications problems that prevent the
      Responder from sending it.

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC
   5666).  The authors also wish to thank Bill Baker, Greg Marsden, and
   Matt Benjamin for their support of this work.

   The XDR extraction conventions were first described by the authors of
   the NFS version 4.1 XDR specification [RFC5662].  Herbert van den
   Bergh suggested the replacement sed script used in this document.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com


   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA 02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com