idnits 2.17.1 draft-ietf-nfsv4-rpcrdma-version-two-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1712 has weird spacing: '...k_lists rdma_...' == Line 1743 has weird spacing: '...k_lists rdma_...' == Line 2259 has weird spacing: '...k_lists rdma_...' == Line 2267 has weird spacing: '...k_lists rdma_...' == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 17, 2019) is 1621 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC5280' is defined on line 2774, but no explicit reference was found in the text == Unused Reference: 'RFC6125' is defined on line 2793, but no explicit reference was found in the text ** Obsolete normative reference: RFC 6125 (Obsoleted by RFC 9525) -- Obsolete informational reference (is this intentional?): RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 5661 (Obsoleted by RFC 8881) Summary: 1 error (**), 0 flaws (~~), 8 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network File System Version 4 C. Lever, Ed. 3 Internet-Draft Oracle 4 Intended status: Standards Track D. Noveck 5 Expires: May 20, 2020 NetApp 6 November 17, 2019 8 RPC-over-RDMA Version 2 Protocol 9 draft-ietf-nfsv4-rpcrdma-version-two-00 11 Abstract 13 This document specifies the second version of a protocol that conveys 14 Remote Procedure Call (RPC) messages on transports capable of Remote 15 Direct Memory Access (RDMA). This version of the protocol is 16 extensible. 18 Status of This Memo 20 This Internet-Draft is submitted in full conformance with the 21 provisions of BCP 78 and BCP 79. 23 Internet-Drafts are working documents of the Internet Engineering 24 Task Force (IETF). Note that other groups may also distribute 25 working documents as Internet-Drafts. The list of current Internet- 26 Drafts is at https://datatracker.ietf.org/drafts/current/. 28 Internet-Drafts are draft documents valid for a maximum of six months 29 and may be updated, replaced, or obsoleted by other documents at any 30 time. 
It is inappropriate to use Internet-Drafts as reference 31 material or to cite them other than as "work in progress." 33 This Internet-Draft will expire on May 20, 2020. 35 Copyright Notice 37 Copyright (c) 2019 IETF Trust and the persons identified as the 38 document authors. All rights reserved. 40 This document is subject to BCP 78 and the IETF Trust's Legal 41 Provisions Relating to IETF Documents 42 (https://trustee.ietf.org/license-info) in effect on the date of 43 publication of this document. Please review these documents 44 carefully, as they describe your rights and restrictions with respect 45 to this document. Code Components extracted from this document must 46 include Simplified BSD License text as described in Section 4.e of 47 the Trust Legal Provisions and are provided without warranty as 48 described in the Simplified BSD License. 50 This document may contain material from IETF Documents or IETF 51 Contributions published or made publicly available before November 52 10, 2008. The person(s) controlling the copyright in some of this 53 material may not have granted the IETF Trust the right to allow 54 modifications of such material outside the IETF Standards Process. 55 Without obtaining an adequate license from the person(s) controlling 56 the copyright in such materials, this document may not be modified 57 outside the IETF Standards Process, and derivative works of it may 58 not be created outside the IETF Standards Process, except to format 59 it for publication as an RFC or to translate it into languages other 60 than English. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 65 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 5 66 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 67 3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 6 68 3.1.1. Upper-Layer Protocols . . . . . . . . . . . . . . . . 6 69 3.1.2. Requesters and Responders . . . . . . . . . . . . . 
. 6 70 3.1.3. RPC Transports . . . . . . . . . . . . . . . . . . . 7 71 3.1.4. External Data Representation . . . . . . . . . . . . 8 72 3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 9 73 3.2.1. Direct Data Placement . . . . . . . . . . . . . . . . 9 74 3.2.2. RDMA Transport Requirements . . . . . . . . . . . . . 9 75 4. RPC-over-RDMA Protocol Framework . . . . . . . . . . . . . . 11 76 4.1. Transfer Model . . . . . . . . . . . . . . . . . . . . . 11 77 4.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11 78 4.3. Managing Receiver Resources . . . . . . . . . . . . . . . 12 79 4.3.1. RPC-over-RDMA Version 2 Flow Control . . . . . . . . 12 80 4.3.2. Inline Threshold . . . . . . . . . . . . . . . . . . 14 81 4.3.3. Initial Connection State . . . . . . . . . . . . . . 14 82 4.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . 15 83 4.4.1. Reducing an XDR Stream . . . . . . . . . . . . . . . 15 84 4.4.2. DDP-Eligibility . . . . . . . . . . . . . . . . . . . 16 85 4.4.3. RDMA Segments . . . . . . . . . . . . . . . . . . . . 16 86 4.4.4. Chunks . . . . . . . . . . . . . . . . . . . . . . . 17 87 4.4.5. Read Chunks . . . . . . . . . . . . . . . . . . . . . 18 88 4.4.6. Write Chunks . . . . . . . . . . . . . . . . . . . . 19 89 4.5. Message Transfer Methods . . . . . . . . . . . . . . . . 20 90 4.5.1. Short Messages . . . . . . . . . . . . . . . . . . . 21 91 4.5.2. Continued Messages . . . . . . . . . . . . . . . . . 21 92 4.5.3. Chunked Messages . . . . . . . . . . . . . . . . . . 22 93 4.5.4. Long Messages . . . . . . . . . . . . . . . . . . . . 23 94 5. Transport Properties . . . . . . . . . . . . . . . . . . . . 24 95 5.1. Transport Properties Model . . . . . . . . . . . . . . . 25 96 5.2. Current Transport Properties . . . . . . . . . . . . . . 26 97 5.2.1. Maximum Send Size . . . . . . . . . . . . . . . . . . 27 98 5.2.2. Receive Buffer Size . . . . . . . . . . . . . . . . . 28 99 5.2.3. Maximum RDMA Segment Size . . . . . . . . 
. . . . . . 28 100 5.2.4. Maximum RDMA Segment Count . . . . . . . . . . . . . 28 101 5.2.5. Reverse Request Support . . . . . . . . . . . . . . . 29 102 5.2.6. Host Authentication Message . . . . . . . . . . . . . 30 103 6. RPC-over-RDMA Version 2 Transport Messages . . . . . . . . . 30 104 6.1. Overall Transport Message Structure . . . . . . . . . . . 30 105 6.2. Transport Header Types . . . . . . . . . . . . . . . . . 30 106 6.3. RPC-over-RDMA Version 2 Headers and Chunks . . . . . . . 31 107 6.3.1. Common Transport Header Prefix . . . . . . . . . . . 31 108 6.3.2. RPC-over-RDMA Version 2 Transport Header Prefix . . . 32 109 6.3.3. Describing External Data Payloads . . . . . . . . . . 35 110 6.4. Header Types Defined in RPC-over-RDMA version 2 . . . . . 36 111 6.4.1. RDMA2_MSG: Convey RPC Message Inline . . . . . . . . 36 112 6.4.2. RDMA2_NOMSG: Convey External RPC Message . . . . . . 37 113 6.4.3. RDMA2_ERROR: Report Transport Error . . . . . . . . . 38 114 6.4.4. RDMA2_CONNPROP: Advertise Transport Properties . . . 41 115 6.5. Choosing a Reply Mechanism . . . . . . . . . . . . . . . 42 116 7. XDR Protocol Definition . . . . . . . . . . . . . . . . . . . 42 117 7.1. Code Component License . . . . . . . . . . . . . . . . . 43 118 7.2. Extraction and Use of XDR Definitions . . . . . . . . . . 45 119 7.3. XDR Definition for RPC-over-RDMA Version 2 Core 120 Structures . . . . . . . . . . . . . . . . . . . . . . . 47 121 7.4. XDR Definition for RPC-over-RDMA Version 2 Base Header 122 Types . . . . . . . . . . . . . . . . . . . . . . . . . . 49 123 7.5. Use of the XDR Description Files . . . . . . . . . . . . 50 124 8. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 52 125 9. Implementation Status . . . . . . . . . . . . . . . . . . . . 53 126 10. Security Considerations . . . . . . . . . . . . . . . . . . . 54 127 10.1. Memory Protection . . . . . . . . . . . . . . . . . . . 54 128 10.1.1. Protection Domains . . . . . . . . . . . . . . . . . 
54 129 10.1.2. Handle (STag) Predictability . . . . . . . . . . . . 54 130 10.1.3. Memory Protection . . . . . . . . . . . . . . . . . 54 131 10.1.4. Denial of Service . . . . . . . . . . . . . . . . . 55 132 10.2. RPC Message Security . . . . . . . . . . . . . . . . . . 55 133 10.2.1. RPC-over-RDMA Protection at Lower Layers . . . . . . 56 134 10.2.2. RPCSEC_GSS on RPC-over-RDMA Transports . . . . . . . 56 135 10.3. Transport Properties . . . . . . . . . . . . . . . . . . 58 136 10.4. Host Authentication . . . . . . . . . . . . . . . . . . 59 137 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 59 138 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 59 139 12.1. Normative References . . . . . . . . . . . . . . . . . . 59 140 12.2. Informative References . . . . . . . . . . . . . . . . . 61 141 Appendix A. ULB Specifications . . . . . . . . . . . . . . . . . 63 142 A.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 63 143 A.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 64 144 A.3. Additional Considerations . . . . . . . . . . . . . . . . 65 145 A.4. ULP Extensions . . . . . . . . . . . . . . . . . . . . . 65 147 Appendix B. Extending the Version 2 Protocol . . . . . . . . . . 65 148 B.1. Adding New Header Types to RPC-over-RDMA Version 2 . . . 67 149 B.2. Adding New Header Flags to the Protocol . . . . . . . . . 68 150 B.3. Adding New Transport properties to the Protocol . . . . . 69 151 B.4. Adding New Error Codes to the Protocol . . . . . . . . . 70 152 Appendix C. Differences from the RPC-over-RDMA Version 1 153 Protocol . . . . . . . . . . . . . . . . . . . . . . 70 154 C.1. Relationship to the RPC-over-RDMA Version 1 XDR 155 Definition . . . . . . . . . . . . . . . . . . . . . . . 70 156 C.2. Transport Properties . . . . . . . . . . . . . . . . . . 72 157 C.3. Credit Management Changes . . . . . . . . . . . . . . . . 72 158 C.4. Inline Threshold Changes . . . . . . . . . . . . . . . . 73 159 C.5. 
Message Continuation Changes . . . . . . . . . . . . . . 74 160 C.6. Host Authentication Changes . . . . . . . . . . . . . . . 75 161 C.7. Support for Remote Invalidation . . . . . . . . . . . . . 75 162 C.8. Error Reporting Changes . . . . . . . . . . . . . . . . . 76 163 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 76 164 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 77 166 1. Introduction 168 Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a 169 technique for moving data efficiently between network nodes. By 170 directing data into destination buffers as it is sent on a network 171 and placing it using direct memory access implemented by hardware, 172 the complementary benefits of faster transfers and reduced host 173 overhead are obtained. 175 Open Network Computing Remote Procedure Call (ONC RPC, often 176 shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure 177 Call protocol that runs over a variety of transports. Most RPC 178 implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, 179 RPC messages are encapsulated inside datagrams, while on a TCP byte 180 stream, RPC messages are delineated by a record marking protocol. An 181 RDMA transport also conveys RPC messages in a specific fashion that 182 must be fully described if RPC implementations are to interoperate 183 when using RDMA to transport RPC transactions. 185 RDMA transports present semantics that differ from either UDP or TCP. 186 They retain message delineations like UDP but provide reliable and 187 sequenced data transfer like TCP. They also provide an offloaded 188 bulk transfer service not provided by UDP or TCP. RDMA transports 189 are therefore appropriately treated as a new transport type by RPC. 
191 Although the RDMA transport described herein can provide relatively 192 transparent support for any RPC application, this document also 193 describes mechanisms that enable further optimization of data 194 transfer, when RPC applications are structured to exploit awareness 195 of a transport's RDMA capability. In this context, the Network File 196 System (NFS) protocols, as described in [RFC1094], [RFC1813], 197 [RFC7530], [RFC5661], and subsequent NFSv4 minor versions, are all 198 potential beneficiaries of RDMA transports. A complete problem 199 statement is presented in [RFC5532]. 201 The RPC-over-RDMA version 1 protocol specified in [RFC8166] is 202 deployed and in use, although there are known shortcomings to this 203 protocol: 205 o The protocol's default size of Receive buffers forces the use of 206 RDMA Read and Write transfers for small payloads, and limits the 207 size of reverse direction messages. 209 o It is difficult to make optimizations or protocol fixes that 210 require changes to on-the-wire behavior. 212 o For some RPC procedures, the maximum reply size is difficult or 213 impossible for an RPC client to estimate in advance. 215 To address these issues in a way that enables interoperation with 216 existing RPC-over-RDMA version 1 deployments, a second version of the 217 RPC-over-RDMA transport protocol is presented in this document. 219 Version 2 of RPC-over-RDMA is extensible, enabling OPTIONAL 220 extensions to be added without impacting existing implementations. 221 To enable protocol extension, the XDR definition for RPC-over-RDMA 222 version 2 is organized differently than the definition for version 1. 223 These changes, which are discussed in Appendix C.1, do not alter the 224 on-the-wire format. 226 In addition, RPC-over-RDMA version 2 contains a set of incremental 227 changes that relieve certain performance constraints and enable 228 recovery from abnormal corner cases. 
These changes are outlined in 229 Appendix C and include a larger default inline threshold, the ability 230 to convey a single RPC message using multiple RDMA Send operations, 231 support for authentication of connection peers, richer error 232 reporting, an improved credit-based flow control mechanism, and 233 support for Remote Invalidation. 235 2. Requirements Language 237 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 238 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 239 "OPTIONAL" in this document are to be interpreted as described in BCP 240 14 [RFC2119] [RFC8174] when, and only when, they appear in all 241 capitals, as shown here. 243 3. Terminology 245 3.1. Remote Procedure Calls 247 This section highlights key elements of the RPC protocol [RFC5531] 248 and the External Data Representation (XDR) [RFC4506] used by it. 249 RPC-over-RDMA version 2 enables the transmission of RPC messages built 250 using XDR and also uses XDR internally to describe its own header 251 formats. An understanding of RPC and its use of XDR is assumed in 252 this document. 254 3.1.1. Upper-Layer Protocols 256 RPCs are an abstraction used to implement the operations of an Upper- 257 Layer Protocol (ULP). "ULP" refers to an RPC Program and Version 258 tuple, which is a versioned set of procedure calls that comprise a 259 single well-defined API. One example of a ULP is the Network File 260 System Version 4.0 [RFC7530]. 262 In this document, the term "RPC consumer" refers to an implementation 263 of a ULP running on an RPC client. 265 3.1.2. Requesters and Responders 267 Like a local procedure call, every RPC procedure has a set of 268 "arguments" and a set of "results". A calling context invokes a 269 procedure, passing arguments to it, and the procedure subsequently 270 returns a set of results. Unlike a local procedure call, the called 271 procedure is executed remotely rather than in the local application's 272 execution context. 
274 The RPC protocol as described in [RFC5531] is fundamentally a 275 message-passing protocol between one or more clients, where RPC 276 consumers are running, and a server, where a remote execution context 277 is available to process RPC transactions on behalf of those 278 consumers. 280 ONC RPC transactions are made up of two types of messages: 282 CALL 283 A CALL message, or "Call", requests that work be done. An RPC 284 Call message is designated by the value zero (0) in the message's 285 msg_type field. An arbitrary unique value is placed in the 286 message's XID field in order to match this RPC Call message to a 287 corresponding RPC Reply message. 289 REPLY 290 A REPLY message, or "Reply", reports the results of work requested 291 by an RPC Call message. An RPC Reply message is designated by the 292 value one (1) in the message's msg_type field. The value 293 contained in an RPC Reply message's XID field is copied from the 294 RPC Call message whose results are being reported. 296 Each RPC client endpoint acts as a "Requester". It serializes the 297 procedure's arguments and conveys them to a server endpoint via an 298 RPC Call message. This message contains an RPC protocol header, a 299 header describing the requested upper-layer operation, and all 300 arguments. 302 An RPC server endpoint acts as a "Responder". It deserializes the 303 arguments and processes the requested operation. It then serializes 304 the operation's results into another byte stream. This byte stream 305 is conveyed back to the Requester via an RPC Reply message. This 306 message contains an RPC protocol header, a header describing the 307 upper-layer reply, and all results. 309 The Requester deserializes the results and allows the RPC consumer to 310 proceed. At this point, the RPC transaction designated by the XID in 311 the RPC Call message is complete, and the XID is retired. 313 In summary, Requesters send RPC Call messages to Responders to 314 initiate RPC transactions. 
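The Call/Reply pairing described in Section 3.1.2 can be sketched as follows. This is a minimal illustration, not part of the protocol definition: the Requester class, its method names, and the dictionary-based message layout are hypothetical, and only the msg_type values (0 for Call, 1 for Reply) and the XID-copying rule come from the text above.

```python
import itertools

CALL = 0   # msg_type value that designates an RPC Call
REPLY = 1  # msg_type value that designates an RPC Reply


class Requester:
    """Tracks outstanding RPC transactions by XID (illustrative only)."""

    def __init__(self, first_xid=1):
        self._next_xid = itertools.count(first_xid)
        self.pending = {}  # XID -> procedure name awaiting a Reply

    def send_call(self, procedure):
        # Any value unique among outstanding Calls works as an XID;
        # a counter truncated to 32 bits is one simple choice.
        xid = next(self._next_xid) & 0xFFFFFFFF
        self.pending[xid] = procedure
        return {"xid": xid, "msg_type": CALL, "procedure": procedure}

    def handle_reply(self, message):
        if message["msg_type"] != REPLY:
            raise ValueError("not an RPC Reply")
        # The Reply's XID is copied from the Call whose results it reports.
        procedure = self.pending.pop(message["xid"], None)
        if procedure is None:
            raise KeyError("no outstanding Call with this XID")
        return procedure  # transaction complete; the XID is retired
```

Once handle_reply() returns, the XID no longer appears in the pending table, which mirrors the retirement of the XID at the end of an RPC transaction.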
Responders send RPC Reply messages to 315 Requesters to complete the processing on an RPC transaction. 317 3.1.3. RPC Transports 319 The role of an "RPC transport" is to mediate the exchange of RPC 320 messages between Requesters and Responders. An RPC transport bridges 321 the gap between the RPC message abstraction and the native operations 322 of a particular network transport. 324 RPC-over-RDMA is a connection-oriented RPC transport. When a 325 connection-oriented transport is used, clients initiate transport 326 connections, while servers wait passively to accept incoming 327 connection requests. 329 Most commonly, the client end of the connection acts in the role of 330 Requester, and the server end of the connection acts as a Responder. 331 However, RPC transactions can also be sent in the reverse direction. 332 In this case, the server end of the connection acts as a Requester 333 while the client end acts as a Responder. 335 3.1.4. External Data Representation 337 One cannot assume that all Requesters and Responders represent data 338 objects the same way internally. RPC uses External Data 339 Representation (XDR) to translate native data types and serialize 340 arguments and results [RFC4506]. 342 The XDR protocol encodes data independently of the endianness or size 343 of host-native data types, enabling unambiguous decoding of data by 344 the receiver. RPC Programs are specified by writing an XDR 345 definition of their procedures, argument data types, and result data 346 types. 348 XDR assumes only that the number of bits in a byte (octet) and their 349 order are the same on both endpoints and on the physical network. 350 The smallest indivisible unit of XDR encoding is a group of four 351 octets. XDR can also flatten lists, arrays, and other complex data 352 types so they can be conveyed as a stream of bytes. 354 A serialized stream of bytes that is the result of XDR encoding is 355 referred to as an "XDR stream". 
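As a rough illustration of the four-octet encoding unit, the sketch below encodes an unsigned 32-bit integer and a variable-length opaque item into an XDR stream. The function names xdr_uint32 and xdr_opaque are hypothetical helpers chosen for this example; the big-endian integer layout and the zero padding to a four-octet boundary follow [RFC4506].

```python
import struct


def xdr_uint32(value):
    """Encode an unsigned 32-bit integer as four big-endian octets."""
    return struct.pack(">I", value)


def xdr_opaque(data):
    """Encode variable-length opaque data: length, content, zero padding."""
    # Pad so that the next encoded item starts on a four-octet boundary.
    pad = (4 - len(data) % 4) % 4
    # The encoded length counts only the content, never the padding.
    return xdr_uint32(len(data)) + data + b"\x00" * pad
```

For a five-octet item, xdr_opaque() emits a four-octet length of 5, the five content octets, and three zero octets, for twelve octets in total.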
A sending endpoint encodes native 356 data into an XDR stream and then transmits that stream to a receiver. 357 A receiving endpoint decodes incoming XDR byte streams into its 358 native data representation format. 360 3.1.4.1. XDR Opaque Data 362 Sometimes, a data item is to be transferred as is: without encoding 363 or decoding. The contents of such a data item are referred to as 364 "opaque data". XDR encoding places the content of opaque data items 365 directly into an XDR stream without altering it in any way. ULPs or 366 applications perform any needed data translation in this case. 367 Examples of opaque data items include the content of files or generic 368 byte strings. 370 3.1.4.2. XDR Roundup 372 The number of octets in a variable-length data item precedes that 373 item in an XDR stream. If the size of an encoded data item is not a 374 multiple of four octets, octets containing zero are added after the 375 end of the item. This is the case so that the next encoded data item 376 in the XDR stream always starts on a four-octet boundary. The 377 encoded size of the item is not changed by the addition of the extra 378 octets. These extra octets are never exposed to ULPs. 380 This technique is referred to as "XDR roundup", and the extra octets 381 are referred to as "XDR roundup padding". 383 3.2. Remote Direct Memory Access 385 RPC Requesters and Responders can be made more efficient if large RPC 386 messages are transferred by a third party, such as intelligent 387 network-interface hardware (data movement offload), and placed in the 388 receiver's memory so that no additional adjustment of data alignment 389 has to be made (direct data placement or "DDP"). RDMA transports 390 enable both optimizations. 392 In the current document, "RDMA" refers to the physical mechanism an 393 RDMA transport utilizes when moving data. 395 3.2.1. Direct Data Placement 397 Typically, RPC implementations copy the contents of RPC messages into 398 a buffer before being sent. 
An efficient RPC implementation sends 399 bulk data without copying it into a separate send buffer first. 401 However, socket-based RPC implementations are often unable to receive 402 data directly into its final place in memory. Receivers often need 403 to copy incoming data to finish an RPC operation: sometimes, only to 404 adjust data alignment. 406 Although it may not be efficient, before an RDMA transfer, a sender 407 may copy data into an intermediate buffer. After an RDMA transfer, a 408 receiver may copy that data again to its final destination. In this 409 document, the term "DDP" refers to any optimized data transfer where 410 it is unnecessary for a receiving host's CPU to copy transferred data 411 to another location after it has been received. 413 RPC-over-RDMA version 2 enables the use of RDMA Read and Write 414 operations to achieve both data movement offload and DDP. However, 415 not all RDMA-based data transfer qualifies as DDP, and DDP can be 416 achieved using non-RDMA mechanisms. 418 3.2.2. RDMA Transport Requirements 420 To achieve good performance during receive operations, RDMA 421 transports require that RDMA consumers provision resources in advance 422 in order to receive incoming messages. 424 An RDMA consumer might provide Receive buffers in advance by posting 425 an RDMA Receive Work Request for every expected RDMA Send from a 426 remote peer. These buffers are provided before the remote peer posts 427 RDMA Send Work Requests. Thus this is often referred to as "pre- 428 posting" buffers. 430 An RDMA Receive Work Request remains outstanding until hardware 431 matches it to an inbound Send operation. The resources associated 432 with that Receive must be retained in host memory, or "pinned", until 433 the Receive completes. 435 Given these basic tenets of RDMA transport operation, the RPC-over- 436 RDMA version 2 protocol assumes each transport provides the following 437 abstract operations. 
A more complete discussion of these operations 438 can be found in [RFC5040]. 440 3.2.2.1. Memory Registration 442 Memory registration assigns a steering tag to a region of memory, 443 permitting the RDMA provider to perform data-transfer operations. 444 The RPC-over-RDMA version 2 protocol assumes that each registered 445 memory region is identified with a steering tag of no more than 32 446 bits and memory addresses of up to 64 bits in length. 448 3.2.2.2. RDMA Send 450 The RDMA provider supports an RDMA Send operation, with completion 451 signaled on the receiving peer after data has been placed in a pre- 452 posted buffer. Sends complete at the receiver in the order they were 453 issued at the sender. The amount of data transferred by a single 454 RDMA Send operation is limited by the size of the remote peer's pre- 455 posted buffers. 457 3.2.2.3. RDMA Receive 459 The RDMA provider supports an RDMA Receive operation to receive data 460 conveyed by incoming RDMA Send operations. To reduce the amount of 461 memory that must remain pinned awaiting incoming Sends, the amount of 462 pre-posted memory is limited. Flow control to prevent overrunning 463 receiver resources is provided by the RDMA consumer (in this case, 464 the RPC-over-RDMA version 2 protocol). 466 3.2.2.4. RDMA Write 468 The RDMA provider supports an RDMA Write operation to place data 469 directly into a remote memory region. The local host initiates an 470 RDMA Write, and completion is signaled there. No completion is 471 signaled on the remote peer. The local host provides a steering tag, 472 memory address, and the length of the remote peer's memory region. 474 RDMA Writes are not ordered with respect to one another, but are 475 ordered with respect to RDMA Sends. A subsequent RDMA Send 476 completion obtained at the write initiator guarantees that prior RDMA 477 Write data has been successfully placed in the remote peer's memory. 479 3.2.2.5. 
RDMA Read 481 The RDMA provider supports an RDMA Read operation to place peer 482 source data directly into the read initiator's memory. The local 483 host initiates an RDMA Read, and completion is signaled there. No 484 completion is signaled on the remote peer. The local host provides 485 steering tags, memory addresses, and a length for the remote source 486 and local destination memory region. 488 The local host signals Read completion to the remote peer as part of 489 a subsequent RDMA Send message. The remote peer can then invalidate 490 steering tags and subsequently free associated source memory regions. 492 4. RPC-over-RDMA Protocol Framework 494 4.1. Transfer Model 496 A "transfer model" designates which endpoint exposes its memory and 497 which is responsible for initiating the transfer of data. To enable 498 RDMA Read and Write operations, for example, an endpoint first 499 exposes regions of its memory to a remote endpoint, which initiates 500 these operations against the exposed memory. 502 In RPC-over-RDMA version 2, Requesters expose their memory to the 503 Responder, but the Responder does not expose its memory. The 504 Responder pulls RPC arguments or whole RPC calls from each Requester. 505 The Responder pushes RPC results or whole RPC replies to each 506 Requester. 508 4.2. Message Framing 510 Each RPC-over-RDMA version 2 message consists of at most two XDR 511 streams: 513 Transport Stream 514 The "Transport stream" contains a header that describes and 515 controls the transfer of the Payload stream in this RPC-over-RDMA 516 message. Every RDMA Send message on an RPC-over-RDMA version 2 517 connection MUST begin with a Transport stream. 519 RPC Payload Stream 520 The "Payload stream" contains part or all of a single RPC message. 521 The sender MAY divide an RPC message at any convenient boundary, 522 but MUST send RPC message fragments in XDR stream order and MUST 523 NOT interleave Payload streams from multiple RPC messages. 
The 524 RPC-over-RDMA version 2 message carrying the final part of an RPC 525 message is marked (see Section 6.3.2.2). 527 In its simplest form, an RPC-over-RDMA version 2 message conveying an 528 RPC message payload consists of a Transport stream followed 529 immediately by a Payload stream transmitted together via a single 530 RDMA Send. 532 RPC-over-RDMA framing replaces all other RPC framing (such as TCP 533 record marking) when used atop an RPC-over-RDMA association, even 534 when the underlying RDMA protocol may itself be layered atop a 535 transport with a defined RPC framing (such as TCP). 537 However, it is possible for RPC-over-RDMA to be dynamically enabled 538 on a connection in the course of negotiating the use of RDMA via a 539 ULP exchange. Because RPC framing delimits an entire RPC request or 540 reply, the resulting shift in framing must occur between distinct RPC 541 messages, and in concert with the underlying transport. 543 4.3. Managing Receiver Resources 545 The longevity of an RDMA connection mandates that sending endpoints 546 respect the resource limits of peer receivers. To ensure messages 547 can be sent and received reliably, there are two operational 548 parameters for each connection. It is critical to provide RDMA Send 549 flow control for an RDMA connection. If any pre-posted Receive 550 buffer on the connection is not large enough to accept an incoming 551 RDMA Send, or if a pre-posted Receive buffer is not available to 552 accept an incoming RDMA Send, the RDMA connection can be terminated. 554 4.3.1. RPC-over-RDMA Version 2 Flow Control 556 Because RPC-over-RDMA requires reliable and in-order delivery of data 557 payloads, RPC-over-RDMA transports MUST use the RDMA RC (Reliable 558 Connected) Queue Pair (QP) type, which ensures in-transit data 559 integrity and handles recovery from packet loss or misordering. 
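The Payload stream rules in Section 4.2 (fragments sent in XDR stream order, no interleaving, final fragment marked) suggest a very simple reassembly loop at the receiver. The sketch below is an assumption-laden illustration: split_payload, Reassembler, and the boolean is_final are hypothetical names, and is_final merely stands in for the marking defined in Section 6.3.2.2, whose wire encoding is not reproduced here.

```python
def split_payload(rpc_message, max_fragment):
    """Sender side: yield (fragment, is_final) pairs in stream order."""
    if max_fragment <= 0:
        raise ValueError("max_fragment must be positive")
    pieces = [rpc_message[i:i + max_fragment]
              for i in range(0, len(rpc_message), max_fragment)] or [b""]
    for index, piece in enumerate(pieces):
        yield piece, index == len(pieces) - 1


class Reassembler:
    """Receiver side: rebuild one RPC message from in-order fragments."""

    def __init__(self):
        self._parts = []

    def receive(self, fragment, is_final):
        # Fragments of different RPC messages are never interleaved,
        # so appending in arrival order is sufficient.
        self._parts.append(fragment)
        if not is_final:
            return None
        message = b"".join(self._parts)
        self._parts = []
        return message
```

Because interleaving is prohibited, a receiver in this sketch needs only one Reassembler per direction of a connection rather than one per XID.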
561 However, RPC-over-RDMA transports provide their own flow control 562 mechanism to prevent a sender from overwhelming receiver resources. 563 RPC-over-RDMA transports employ an end-to-end credit-based flow 564 control mechanism for this purpose [CBFC]. Credit-based flow control 565 was chosen because it is relatively simple, provides robust operation 566 in the face of bursty traffic, automated management of receive buffer 567 allocation, and excellent buffer utilization. 569 4.3.1.1. Granting Credits 571 An RPC-over-RDMA version 2 credit is the capability to receive one 572 RPC-over-RDMA version 2 message. This enables RPC-over-RDMA version 573 2 to support asymmetrical operation, where a message in one direction 574 might be matched by zero, one, or multiple messages in the other 575 direction. 577 To achieve this, credits are assigned to each connection peer's 578 posted Receive buffers. Each Requester has a set of Receive credits, 579 and each Responder has a set of Receive credits. These credit values 580 are managed independently of one another. 582 Section 7 of [RFC8166] requires that the 32-bit field containing the 583 credit grant is the third word in the transport header. To conform 584 with that requirement, the two independent credit values are encoded 585 into a single 32-bit field in the fixed portion of the transport 586 header. After the field is XDR decoded, the receiver takes the low- 587 order two bytes as the number of credits that are newly granted by 588 the sender, and the high-order two bytes as the maximum number of 589 credits that can be outstanding at the sender. 591 In this approach, then, there are requester credits, sent in messages 592 from the requester to the responder; and responder credits, sent in 593 messages from the responder to the requester. 595 A sender MUST NOT send RDMA messages in excess of the receiver's 596 granted credit limit. 
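The packing of the two credit values into one 32-bit field, as described in Section 4.3.1.1, can be sketched as follows. The function names are hypothetical; the bit layout (newly granted credits in the low-order two bytes, maximum outstanding credits in the high-order two bytes) and the rule that the granted value is never zero are taken from this section.

```python
def encode_credits(granted, max_outstanding):
    """Pack both credit values into the single 32-bit credit field."""
    for value in (granted, max_outstanding):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("each credit value must fit in 16 bits")
    if granted == 0:
        raise ValueError("a granted value of zero would cause deadlock")
    # High-order two bytes: maximum credits outstanding at the sender.
    # Low-order two bytes: credits newly granted by the sender.
    return (max_outstanding << 16) | granted


def decode_credits(field):
    """Return (granted, max_outstanding) from a decoded 32-bit field."""
    return field & 0xFFFF, (field >> 16) & 0xFFFF
```

For example, a grant of 32 credits with a maximum of 64 outstanding encodes to 0x00400020, and decoding that value yields the pair (32, 64).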
If the granted value is exceeded, the RDMA 597 layer may signal an error, possibly terminating the connection. The 598 granted value MUST NOT be zero, since such a value would result in 599 deadlock. 601 The granted credit values MAY be adjusted to match the needs or 602 policies in effect on either peer. For instance, a peer may reduce 603 its granted credit value to accommodate the available resources in a 604 Shared Receive Queue. 606 Certain RDMA implementations may impose additional flow-control 607 restrictions, such as limits on RDMA Read operations in progress at 608 the Responder. Accommodation of such restrictions is considered the 609 responsibility of each RPC-over-RDMA version 2 implementation. 611 4.3.1.2. Asynchronous Credit Grants 613 A protocol convention is provided to enable one peer to refresh its 614 credit grant to the other peer without sending a data payload. 615 Messages of this type can also act as a keep-alive ping. See 616 Section 6.4.2 for information about this convention. 618 To prevent transport deadlock, receivers MUST always be in a position 619 to receive one such credit grant update message, in addition to 620 payload-bearing messages. One way a receiver can do this is to post 621 one extra Receive more than the credit value it granted. 623 4.3.2. Inline Threshold 625 An "inline threshold" value is the largest message size (in octets) 626 that can be conveyed in one direction between peer implementations 627 using RDMA Send and Receive operations. The inline threshold value 628 is effectively the smaller of the largest number of bytes the sender 629 can post via a single RDMA Send operation and the largest number of 630 bytes the receiver can accept via a single RDMA Receive operation. 
631 Each connection has two inline threshold values: one for messages 632 flowing from Requester-to-Responder, referred to as the "call inline 633 threshold", and one for messages flowing from Responder-to-Requester, 634 referred to as the "reply inline threshold". Inline threshold values 635 can be advertised to peers via Transport Properties. 637 Receiver implementations MUST support inline thresholds of 4096 638 bytes. In the absence of an exchange of Transport Properties, 639 senders and receivers MUST assume both connection inline thresholds 640 are 4096 bytes. 642 4.3.3. Initial Connection State 644 When an RPC-over-RDMA version 2 client establishes a connection to a 645 server, its first order of business is to determine the server's 646 highest supported protocol version. 648 Upon connection establishment a client MUST NOT send more than a 649 single RPC-over-RDMA message at a time until it receives a valid non- 650 error RPC-over-RDMA message from the server that grants client 651 credits. 653 The second word of each transport header is used to convey the 654 transport protocol version. In the interest of simplicity, we refer 655 to that word as rdma_vers even though in the RPC-over-RDMA version 2 656 XDR definition it is described as rdma_start.rdma_vers. 658 First, the client sends a single valid RPC-over-RDMA message with the 659 value two (2) in the rdma_vers field. Because the server might 660 support only RPC-over-RDMA version 1, this initial message MUST NOT 661 be larger than the version 1 default inline threshold of 1024 bytes. 663 4.3.3.1. Server Does Support RPC-over-RDMA Version 2 665 If the server does support RPC-over-RDMA version 2, it sends RPC- 666 over-RDMA messages back to the client with the value two (2) in the 667 rdma_vers field. Both peers may use the default inline threshold 668 value for RPC-over-RDMA version 2 connections (4096 bytes). 670 4.3.3.2. 
Server Does Not Support RPC-over-RDMA Version 2 672 If the server does not support RPC-over-RDMA version 2, it MUST send 673 an RPC-over-RDMA message to the client with the same XID, with 674 RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error 675 code RDMA2_ERR_VERS. This message also reports a range of protocol 676 versions that the server supports. To continue operation, the client 677 selects a protocol version in the range of server-supported versions 678 for subsequent messages on this connection. 680 If the connection is lost immediately after an RDMA2_ERROR / 681 RDMA2_ERR_VERS message is received, a client can avoid a possible 682 version negotiation loop when re-establishing another connection by 683 assuming that particular server does not support RPC-over-RDMA 684 version 2. A client can assume the same situation (no server support 685 for RPC-over-RDMA version 2) if the initial negotiation message is 686 lost or dropped. Once the negotiation exchange is complete, both 687 peers may use the default inline threshold value for the transport 688 protocol version that has been selected. 690 4.3.3.3. Client Does Not Support RPC-over-RDMA Version 2 692 If the server supports the RPC-over-RDMA protocol version used in the 693 first RPC-over-RDMA message received from a client, it MUST use that 694 protocol version in all subsequent messages it sends on that 695 connection. The client MUST NOT change the protocol version for the 696 duration of the connection. 698 4.4. XDR Encoding with Chunks 700 When a DDP capability is available, the transport places the contents 701 of one or more XDR data items directly into the receiver's memory, 702 separately from the transfer of other parts of the containing XDR 703 stream. 705 4.4.1. Reducing an XDR Stream 707 RPC-over-RDMA version 2 provides a mechanism for moving part of an 708 RPC message via a data transfer distinct from an RDMA Send/Receive 709 pair. 
The sender removes one or more XDR data items from the Payload 710 stream. These items are conveyed via other mechanisms, such as one 711 or more RDMA Read or Write operations. As the receiver decodes an 712 incoming message, it skips over directly placed data items. 714 The portion of an XDR stream that is split out and moved separately 715 is referred to as a "chunk". In some contexts, data in an RPC-over- 716 RDMA header that describes these split out regions of memory may also 717 be referred to as a "chunk". 719 A Payload stream after chunks have been removed is referred to as a 720 "reduced" Payload stream. Likewise, a data item that has been 721 removed from a Payload stream to be transferred separately is 722 referred to as a "reduced" data item. 724 4.4.2. DDP-Eligibility 726 Not all XDR data items benefit from DDP. For example, small data 727 items or data items that require XDR unmarshaling by the receiver do 728 not benefit from DDP. In addition, it is impractical for receivers 729 to prepare for every possible XDR data item in a protocol to be 730 transferred in a chunk. 732 To maintain practical interoperability on an RPC-over-RDMA transport, 733 a determination must be made of which few XDR data items in each ULP 734 are allowed to use DDP. 736 This is done in additional specifications that describe how ULPs 737 employ DDP. A "ULB specification" identifies which specific 738 individual XDR data items in a ULP MAY be transferred via DDP. Such 739 data items are referred to as "DDP-eligible". All other XDR data 740 items MUST NOT be reduced. Detailed requirements for ULBs are 741 provided in Appendix A. 743 4.4.3. RDMA Segments 745 When encoding a Payload stream that contains a DDP-eligible data 746 item, a sender may choose to reduce that data item. When it chooses 747 to do so, the sender does not place the item into the Payload stream. 
748 Instead, the sender records in the RPC-over-RDMA Transport header the 749 location and size of the memory region containing that data item. 751 The Requester provides location information for DDP-eligible data 752 items in both RPC Call and Reply messages. The Responder uses this 753 information to retrieve arguments contained in the specified region 754 of the Requester's memory or place results in that memory region. 756 An "RDMA segment", or "plain segment", is an RPC-over-RDMA Transport 757 header data object that contains the precise coordinates of a 758 contiguous memory region that is to be conveyed separately from the 759 Payload stream. Plain segments contain the following information: 761 Handle 762 Steering tag (STag) or R_key generated by registering this memory 763 with the RDMA provider. 765 Length 766 The length of the RDMA segment's memory region, in octets. An 767 "empty segment" is an RDMA segment with the value zero (0) in its 768 length field. 770 Offset 771 The offset or beginning memory address of the RDMA segment's 772 memory region. 774 See [RFC5040] for further discussion. 776 4.4.4. Chunks 778 In RPC-over-RDMA version 2, a "chunk" refers to a portion of the 779 Payload stream that is moved independently of the RPC-over-RDMA 780 Transport header and Payload stream. Chunk data is removed from the 781 sender's Payload stream, transferred via separate operations, and 782 then reinserted into the receiver's Payload stream to form a complete 783 RPC message. 785 Each chunk is comprised of RDMA segments. Each RDMA segment 786 represents a single contiguous piece of that chunk. A Requester MAY 787 divide a chunk into RDMA segments using any boundaries that are 788 convenient. The length of a chunk is exactly the sum of the lengths 789 of the RDMA segments that comprise it. 791 The RPC-over-RDMA version 2 transport protocol does not place a limit 792 on chunk size. 
However, each ULP may cap the amount of data that can 793 be transferred by a single RPC transaction. For example, NFS has 794 "rsize" and "wsize", which restrict the payload size of NFS READ and 795 WRITE operations. The Responder can use such limits to sanity check 796 chunk sizes before using them in RDMA operations. 798 4.4.4.1. Counted Arrays 800 If a chunk contains a counted array data type, the count of array 801 elements MUST remain in the Payload stream, while the array elements 802 MUST be moved to the chunk. For example, when encoding an opaque 803 byte array as a chunk, the count of bytes stays in the Payload 804 stream, while the bytes in the array are removed from the Payload 805 stream and transferred within the chunk. 807 Individual array elements appear in a chunk in their entirety. For 808 example, when encoding an array of arrays as a chunk, the count of 809 items in the enclosing array stays in the Payload stream, but each 810 enclosed array, including its item count, is transferred as part of 811 the chunk. 813 4.4.4.2. Optional-Data 815 If a chunk contains an optional-data data type, the "is present" 816 field MUST remain in the Payload stream, while the data, if present, 817 MUST be moved to the chunk. 819 4.4.4.3. XDR Unions 821 A union data type MUST NOT be made DDP-eligible, but one or more of 822 its arms MAY be DDP-eligible, subject to the other requirements in 823 this section. 825 4.4.4.4. Chunk Roundup 827 Except in special cases (covered in Section 4.5.4), a chunk MUST 828 contain exactly one XDR data item. This makes it straightforward to 829 reduce variable-length data items without affecting the XDR alignment 830 of data items in the Payload stream. 832 When a variable-length XDR data item is reduced, the sender MUST 833 remove XDR roundup padding for that data item from the Payload stream 834 so that data items remaining in the Payload stream begin on four-byte 835 alignment. 837 4.4.5. 
Read Chunks 839 A "Read chunk" represents an XDR data item that is to be pulled from 840 the Requester to the Responder. A Read chunk is a list of one or 841 more RDMA read segments. Each RDMA read segment consists of a 842 Position field followed by a plain segment. 844 Position 845 The byte offset in the unreduced Payload stream where the receiver 846 reinserts the data item conveyed in a chunk. The Position value 847 MUST be computed from the beginning of the unreduced Payload 848 stream, which begins at Position zero. All RDMA read segments 849 belonging to the same Read chunk have the same value in their 850 Position field. 852 While constructing an RPC Call message, a Requester registers memory 853 regions that contain data to be transferred via RDMA Read operations. 854 It advertises the coordinates of these regions in the RPC-over-RDMA 855 Transport header of the RPC Call message. 857 After receiving an RPC Call message sent via an RDMA Send operation, 858 a Responder transfers the chunk data from the Requester using RDMA 859 Read operations. The Responder reconstructs the transferred chunk 860 data by concatenating the contents of each RDMA segment in list order 861 into the received Payload stream at the Position value recorded in 862 that RDMA segment. 864 Put another way, the Responder inserts the first RDMA segment in a 865 Read chunk into the Payload stream at the byte offset indicated by 866 its Position field. RDMA segments whose Position field value matches 867 this offset are concatenated afterwards, until there are no more RDMA 868 segments at that Position value. 870 The Position field in a read segment indicates where the containing 871 Read chunk starts in the Payload stream. The value in this field 872 MUST be a multiple of four. All segments in the same Read chunk 873 share the same Position value, even if one or more of the RDMA 874 segments have a non-four-byte-aligned length. 876 4.4.5.1.
Decoding Read Chunks 878 While decoding a received Payload stream, whenever the XDR offset in 879 the Payload stream matches that of a Read chunk, the Responder 880 initiates an RDMA Read to pull the chunk's data content into 881 registered local memory. 883 The Responder acknowledges its completion of use of Read chunk source 884 buffers when it sends an RPC Reply message to the Requester. The 885 Requester may then release Read chunks advertised in the request. 887 4.4.5.2. Read Chunk Roundup 889 When reducing a variable-length argument data item, the Requester 890 MUST NOT include the data item's XDR roundup padding in the chunk 891 itself. The chunk's total length MUST be the same as the encoded 892 length of the data item. 894 4.4.6. Write Chunks 896 While constructing an RPC Call message, a Requester prepares memory 897 regions in which to receive DDP-eligible result data items. A "Write 898 chunk" represents an XDR data item that is to be pushed from a 899 Responder to a Requester. It is made up of an array of zero or more 900 plain segments. 902 Write chunks are provisioned by a Requester long before the Responder 903 has prepared the reply Payload stream. A Requester often does not 904 know the actual length of the result data items to be returned, since 905 the result does not yet exist. Thus, it MUST register Write chunks 906 long enough to accommodate the maximum possible size of each returned 907 data item. 909 In addition, the XDR position of DDP-eligible data items in the 910 reply's Payload stream is not predictable when a Requester constructs 911 an RPC Call message. Therefore, RDMA segments in a Write chunk do 912 not have a Position field. 914 For each Write chunk provided by a Requester, the Responder pushes 915 one data item to the Requester, filling the chunk contiguously and in 916 segment array order until that data item has been completely written 917 to the Requester. 
The Responder MUST copy the segment count and all 918 segments from the Requester-provided Write chunk into the RPC Reply 919 message's Transport header. As it does so, the Responder updates 920 each segment length field to reflect the actual amount of data that 921 is being returned in that segment. The Responder then sends the RPC 922 Reply message via an RDMA Send operation. 924 An "empty Write chunk" is a Write chunk with a zero segment count. 925 By definition, the length of an empty Write chunk is zero. An 926 "unused Write chunk" has a non-zero segment count, but all of its 927 segments are empty segments. 929 4.4.6.1. Decoding Write Chunks 931 After receiving the RPC Reply message, the Requester reconstructs the 932 transferred data by concatenating the contents of each segment in 933 array order into the RPC Reply message's XDR stream at the known XDR 934 position of the associated DDP-eligible result data item. 936 4.4.6.2. Write Chunk Roundup 938 When provisioning a Write chunk for a variable-length result data 939 item, the Requester MUST NOT include additional space for XDR roundup 940 padding. A Responder MUST NOT write XDR roundup padding into a Write 941 chunk, even if the result is shorter than the available space in the 942 chunk. Therefore, when returning a single variable-length result 943 data item, a returned Write chunk's total length MUST be the same as 944 the encoded length of the result data item. 946 4.5. Message Transfer Methods 948 A receiver of RDMA Send operations is required to have previously 949 posted one or more adequately sized buffers. Memory savings are 950 achieved on both Requesters and Responders by posting small Receive 951 buffers. However, not all RPC messages are small. RPC-over-RDMA 952 version 2 provides several mechanisms that enable RPC message 953 payloads of any size to be conveyed efficiently. 955 4.5.1. Short Messages 957 RPC message payloads are often smaller than typical inline 958 thresholds. 
For example, an NFS version 3 GETATTR operation is only 959 56 octets: 20 octets of RPC header, a 32-octet file handle argument, 960 and 4 octets for its length. The reply to this common request is 961 about 100 octets. 963 Since all RPC messages conveyed via RPC-over-RDMA version 2 require 964 at least one RDMA Send operation, the most efficient way to send an 965 RPC message that is smaller than the inline threshold is to append 966 the Payload stream directly to the Transport stream. An RPC-over- 967 RDMA header with a small RPC Call or Reply message immediately 968 following is transferred using a single RDMA Send operation. No 969 other operations are needed. 971 An RPC-over-RDMA transaction using a Short Message:

973         Requester                             Responder
974             |        RDMA Send (RDMA_MSG)         |
975        Call |   ------------------------------>   |
976             |                                     |
977             |                                     | Processing
978             |                                     |
979             |        RDMA Send (RDMA_MSG)         |
980             |   <------------------------------   | Reply

982 4.5.2. Continued Messages 984 If an RPC message is larger than the inline threshold, the sender can 985 choose to split that message over multiple RPC-over-RDMA messages. 986 The Payload stream of each RPC-over-RDMA message contains a part of 987 the RPC message. The receiver reconstitutes the RPC message by 988 concatenating the Payload streams of the sequence of RPC-over-RDMA 989 messages together. 991 Though the purpose of a Continued Message is to handle large RPC 992 messages, senders MAY use a Continued Message at any time to convey 993 an RPC message, and MAY split the RPC message payload on any 994 convenient boundary.
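The splitting and reassembly described above can be sketched in Python as follows. This is a minimal sketch and not a definitive implementation: the function names are hypothetical, and the 64-octet allowance for the Transport stream in each Send is an assumption chosen only for illustration (the real header size depends on the header type and chunk lists).

```python
INLINE_THRESHOLD = 4096      # default inline threshold from the text
ASSUMED_HEADER_SIZE = 64     # hypothetical room reserved for the Transport stream


def split_payload(rpc_message, max_part):
    """Split an RPC message into Payload stream parts of at most max_part octets."""
    if max_part <= 0:
        raise ValueError("max_part must be positive")
    parts = [rpc_message[i:i + max_part]
             for i in range(0, len(rpc_message), max_part)]
    # An empty message still needs one (empty) part to carry it.
    return parts or [b""]


def reassemble(parts):
    """Concatenate Payload stream parts, in arrival order, into the RPC message."""
    return b"".join(parts)


message = bytes(range(256)) * 40          # a 10240-octet stand-in RPC message
parts = split_payload(message, INLINE_THRESHOLD - ASSUMED_HEADER_SIZE)
rebuilt = reassemble(parts)
```

The sketch splits on a fixed boundary for simplicity; the text permits any convenient boundary.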
996 An RPC-over-RDMA transaction using a Continued Message:

 998         Requester                             Responder
 999             |        RDMA Send (RDMA_MSG)         |
1000        Call |   ------------------------------>   |
1001             |        RDMA Send (RDMA_MSG)         |
1002             |   ------------------------------>   |
1003             |        RDMA Send (RDMA_MSG)         |
1004             |   ------------------------------>   |
1005             |                                     |
1006             |                                     |
1007             |                                     | Processing
1008             |                                     |
1009             |        RDMA Send (RDMA_MSG)         |
1010             |   <------------------------------   | Reply

1012 4.5.3. Chunked Messages 1014 If DDP-eligible data items are present in a Payload stream, a sender 1015 MAY reduce some or all of these items by removing them from the 1016 Payload stream. The sender then uses a separate mechanism to 1017 transfer the reduced data items. The Transport stream with the 1018 reduced Payload stream immediately following is then transferred 1019 using a single RDMA Send operation. 1021 After receiving the Transport and Payload streams of an RPC Call 1022 message accompanied by Read chunks, the Responder uses RDMA Read 1023 operations to move reduced data items in Read chunks. Before sending 1024 the Transport and Payload streams of an RPC Reply message containing 1025 Write chunks, the Responder uses RDMA Write operations to move 1026 reduced data items in Write and Reply chunks.
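To make reduction concrete, the following Python sketch reduces a single variable-length opaque data item and then reconstructs the unreduced Payload stream on the receiving side. It is a simplified model under stated assumptions: the function names are hypothetical, exactly one data item is reduced, and the RDMA Read or Write that actually moves the chunk is not modeled. The rules it applies (the count stays in the Payload stream, roundup padding is removed from the stream and is not carried in the chunk, and Position is a multiple of four measured in the unreduced stream) come from Sections 4.4.4 and 4.4.5.

```python
import struct


def xdr_pad(length):
    """Number of XDR roundup octets needed after `length` octets of data."""
    return (4 - length % 4) % 4


def reduce_opaque(prefix, data, suffix):
    """Return (reduced_stream, position, chunk) for one reduced opaque item.

    `prefix` and `suffix` are the already-encoded XDR octets that surround
    the opaque item in the unreduced Payload stream.
    """
    if len(prefix) % 4 != 0:
        raise ValueError("XDR streams are four-octet aligned")
    count = struct.pack(">I", len(data))   # the count stays in the stream
    position = len(prefix) + len(count)    # where the data would have begun
    reduced = prefix + count + suffix      # data and its padding are removed
    return reduced, position, data         # the chunk carries no padding


def reconstruct(reduced, position, chunk):
    """Reinsert the chunk, plus its roundup padding, at `position`."""
    padding = b"\x00" * xdr_pad(len(chunk))
    return reduced[:position] + chunk + padding + reduced[position:]


prefix = b"HEADHEAD"
suffix = struct.pack(">I", 7)
unreduced = prefix + struct.pack(">I", 5) + b"hello" + b"\x00\x00\x00" + suffix
reduced, position, chunk = reduce_opaque(prefix, b"hello", suffix)
restored = reconstruct(reduced, position, chunk)
```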
1028 An RPC-over-RDMA transaction with a Read chunk:

1030         Requester                             Responder
1031             |        RDMA Send (RDMA_MSG)         |
1032        Call |   ------------------------------>   |
1033             |        RDMA Read                    |
1034             |   <------------------------------   |
1035             |        RDMA Response (arg data)     |
1036             |   ------------------------------>   |
1037             |                                     |
1038             |                                     | Processing
1039             |                                     |
1040             |        RDMA Send (RDMA_MSG)         |
1041             |   <------------------------------   | Reply

1043 An RPC-over-RDMA transaction with a Write chunk:

1045         Requester                             Responder
1046             |        RDMA Send (RDMA_MSG)         |
1047        Call |   ------------------------------>   |
1048             |                                     |
1049             |                                     | Processing
1050             |                                     |
1051             |        RDMA Write (result data)     |
1052             |   <------------------------------   |
1053             |        RDMA Send (RDMA_MSG)         |
1054             |   <------------------------------   | Reply

1056 Chunking and Message Continuation can be combined. After reduction, 1057 the sender MAY split the reduced RPC message into multiple Payload 1058 streams and then send it via a Continued Message. 1060 4.5.4. Long Messages 1062 When a Payload stream is larger than the receiver's inline threshold, 1063 the Payload stream is reduced by removing DDP-eligible data items and 1064 placing them in chunks to be moved separately. If there are no DDP- 1065 eligible data items in the Payload stream, or the Payload stream is 1066 still too large after it has been reduced, the sender uses either 1067 Message Continuation or RDMA Read or Write operations to 1068 convey the entire RPC message. The latter mechanism is referred to 1069 as a "Long Message". 1071 To transmit a Long Message, the sender conveys only the Transport 1072 stream with an RDMA Send operation. The Payload stream is not 1073 included in the Send buffer in this instance. Instead, the Requester 1074 provides chunks that the Responder uses to move the Payload stream. 1076 Long Call 1077 To send a Long Call message, the Requester provides a special Read 1078 chunk that contains the RPC Call message's Payload stream.
Every 1079 RDMA read segment in this chunk MUST contain zero in its Position 1080 field. This type of chunk is known as a "Position Zero Read 1081 chunk". 1083 Long Reply 1084 To send a Long Reply, the Requester provides a single special 1085 Write chunk in advance, known as the "Reply chunk", that will 1086 contain the RPC Reply message's Payload stream. The Requester 1087 sizes the Reply chunk to accommodate the maximum expected reply 1088 size for that upper-layer operation. 1090 Though the purpose of a Long Message is to handle large RPC messages, 1091 Requesters MAY use a Long Message at any time to convey an RPC Call 1092 message. 1094 A Responder chooses which form of reply to use based on the chunks 1095 provided by the Requester. If Write chunks were provided and the 1096 Responder has a DDP-eligible result, it first reduces the reply 1097 Payload stream. If a Reply chunk was provided and the reduced 1098 Payload stream is larger than the reply inline threshold, the 1099 Responder MUST use the Requester-provided Reply chunk for the reply. 1101 XDR data items may appear in these special chunks without regard to 1102 their DDP-eligibility. As these chunks contain a Payload stream, 1103 such chunks MUST include appropriate XDR roundup padding to maintain 1104 proper XDR alignment of their contents. 
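The Responder's choice of reply form described above can be sketched as a small decision function. This is an illustrative sketch only: the function name, its parameters, and the returned labels are hypothetical, the size of the Transport stream is ignored, and falling back to Message Continuation when no Reply chunk was provided is an assumption drawn from the options listed at the start of this section rather than a rule stated for Responders.

```python
def choose_reply_form(reply_len, ddp_len, reply_inline_threshold,
                      have_write_chunk, have_reply_chunk):
    """Pick how a Responder conveys a reply Payload stream.

    reply_len: octets in the unreduced reply Payload stream.
    ddp_len:   octets of DDP-eligible result data within it (0 if none).
    """
    reduced_len = reply_len
    reduced = False
    # Reduce first when a Write chunk is available for a DDP-eligible result.
    if have_write_chunk and ddp_len > 0:
        reduced_len -= ddp_len
        reduced = True
    if reduced_len <= reply_inline_threshold:
        return "chunked" if reduced else "short"
    # Still too large: a provided Reply chunk MUST be used.
    if have_reply_chunk:
        return "long"
    # Assumption: without a Reply chunk, fall back to Message Continuation.
    return "continued"
```

A Requester that cannot bound the reply size below the reply inline threshold would therefore normally provide a Reply chunk in advance.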
1106 An RPC-over-RDMA transaction using a Long Call:

1108         Requester                             Responder
1109             |        RDMA Send (RDMA_NOMSG)       |
1110        Call |   ------------------------------>   |
1111             |        RDMA Read                    |
1112             |   <------------------------------   |
1113             |        RDMA Response (RPC call)     |
1114             |   ------------------------------>   |
1115             |                                     |
1116             |                                     | Processing
1117             |                                     |
1118             |        RDMA Send (RDMA_MSG)         |
1119             |   <------------------------------   | Reply

1121 An RPC-over-RDMA transaction using a Long Reply:

1123         Requester                             Responder
1124             |        RDMA Send (RDMA_MSG)         |
1125        Call |   ------------------------------>   |
1126             |                                     |
1127             |                                     | Processing
1128             |                                     |
1129             |        RDMA Write (RPC reply)       |
1130             |   <------------------------------   |
1131             |        RDMA Send (RDMA_NOMSG)       |
1132             |   <------------------------------   | Reply

1134 5. Transport Properties 1136 RPC-over-RDMA version 2 provides a mechanism for connection endpoints 1137 to communicate information about implementation properties, enabling 1138 compatible endpoints to optimize data transfer. Initially, only a 1139 small set of transport properties is defined, and a single operation 1140 is provided to exchange transport properties (see Section 6.4.4). 1142 Both the set of transport properties and the operations used to 1143 communicate them may be extended. Within RPC-over-RDMA version 2, all 1144 such extensions are OPTIONAL. For information about existing 1145 transport properties, see Sections 5.1 and 5.2. For discussion 1146 of extensions to the set of transport properties, see Appendix B.3. 1148 5.1. Transport Properties Model 1150 A basic set of receiver and sender properties is specified in this 1151 document. An extensible approach is used, allowing new properties to 1152 be defined in future Standards Track documents. 1154 Such properties are specified using: 1156 o A code point identifying the particular transport property being 1157 specified.
1159 o A nominally opaque array that contains the XDR encoding 1160 of the specific property indicated by the associated code point. 1162 The following XDR types are used by operations that deal with 1163 transport properties: 1165

1167 typedef uint32 rpcrdma2_propid;
1169 struct rpcrdma2_propval {
1170         rpcrdma2_propid rdma_which;
1171         opaque          rdma_data<>;
1172 };
1174 typedef rpcrdma2_propval rpcrdma2_propset<>;
1176 typedef uint32 rpcrdma2_propsubset<>;

1178 1180 An rpcrdma2_propid specifies a particular transport property. In 1181 order to facilitate XDR extension of the set of properties by 1182 concatenating XDR definition files, specific properties are defined 1183 as const values rather than as elements in an enum. 1185 An rpcrdma2_propval specifies a value of a particular transport 1186 property, with the particular property identified by rdma_which, while 1187 the associated value of that property is contained within rdma_data. 1189 An rdma_data field that is of zero length is interpreted as 1190 indicating the default value of the property indicated by rdma_which. 1192 While rdma_data is defined as opaque within the XDR, the contents are 1193 interpreted (except when of length zero) using the XDR typedef 1194 associated with the property specified by rdma_which. As a result, 1195 when rpcrdma2_propval does not conform to that typedef, the receiver 1196 is REQUIRED to return the error RDMA2_ERR_BAD_XDR using the header 1197 type RDMA2_ERROR as described in Section 6.4.3. For example, the 1198 receiver of a message containing an rpcrdma2_propval returns 1199 this error if the length of rdma_data is such that it extends beyond 1200 the bounds of the message being transferred.
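The bounds check in the example above can be sketched in Python. This is a minimal sketch under stated assumptions: the function names and the BadXdr exception are hypothetical stand-ins (a real receiver would report RDMA2_ERR_BAD_XDR in an RDMA2_ERROR header), and only the generic XDR layout of rpcrdma2_propval is modeled: a uint32 rdma_which, followed by a variable-length opaque with a uint32 length and roundup padding.

```python
import struct


class BadXdr(Exception):
    """Hypothetical stand-in for reporting RDMA2_ERR_BAD_XDR."""


def encode_propval(which, data):
    """XDR-encode one rpcrdma2_propval (rdma_which, then opaque rdma_data<>)."""
    padding = b"\x00" * ((4 - len(data) % 4) % 4)
    return struct.pack(">II", which, len(data)) + data + padding


def decode_propval(buf, offset=0):
    """Decode one rpcrdma2_propval; return (which, data, next_offset)."""
    if offset + 8 > len(buf):
        raise BadXdr("truncated rpcrdma2_propval header")
    which, length = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    end = start + length
    padded_end = end + (4 - length % 4) % 4
    # rdma_data must not extend beyond the bounds of the message.
    if padded_end > len(buf):
        raise BadXdr("rdma_data extends beyond the message")
    return which, buf[start:end], padded_end


# Property code 2 carries a uint32 value in this example.
wire = encode_propval(2, struct.pack(">I", 8192))
which, data, next_offset = decode_propval(wire)
```

A zero-length rdma_data decodes to an empty byte string, which the caller would then interpret as the property's default value.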
1202 In cases in which the rpcrdma2_propid specified by rdma_which is 1203 understood by the receiver, the receiver also MUST report the error 1204 RDMA2_ERR_BAD_XDR if either of the following occurs: 1206 o The nominally opaque data within rdma_data is not valid when 1207 interpreted using the property-associated typedef. 1209 o The length of rdma_data is insufficient to contain the data 1210 represented by the property-associated typedef. 1212 Note that no error is to be reported if rdma_which is unknown to the 1213 receiver. In that case, that rpcrdma2_propval is not processed and 1214 processing continues using the next rpcrdma2_propval, if any. 1216 An rpcrdma2_propset specifies a set of transport properties. No 1217 particular ordering of the rpcrdma2_propval items within it is 1218 imposed. 1220 An rpcrdma2_propsubset identifies a subset of the properties in a 1221 previously specified rpcrdma2_propset. Each bit in the mask denotes 1222 a particular element in a previously specified rpcrdma2_propset. If 1223 a particular rpcrdma2_propval is at position N in the array, then bit 1224 number N mod 32 in word N div 32 specifies whether that particular 1225 rpcrdma2_propval is included in the defined subset. Words beyond the 1226 last one specified are treated as containing zero. 1228 5.2. Current Transport Properties 1230 Although the set of transport properties may be extended, a basic set 1231 of transport properties is defined in Table 1. 1233 In that table, the columns contain the following information: 1235 o The column labeled "Property" identifies the transport property 1236 described by the current row. 1238 o The column labeled "Code" specifies the rpcrdma2_propid value used 1239 to identify this property. 1241 o The column labeled "XDR type" gives the XDR type of the data used 1242 to communicate the value of this property. This data type 1243 overlays the data portion of the nominally opaque field rdma_data 1244 in an rpcrdma2_propval.
1246 o The column labeled "Default" gives the default value for the 1247 property which is to be assumed by those who do not receive, or 1248 are unable to interpret, information about the actual value of the 1249 property. 1251 o The column labeled "Sec" indicates the section within this 1252 document that explains the semantics and use of this transport 1253 property.

1255 +----------------------------+------+----------+---------+---------+
1256 | Property                   | Code | XDR type | Default | Sec     |
1257 +----------------------------+------+----------+---------+---------+
1258 | Maximum Send Size          | 1    | uint32   | 4096    | 5.2.1   |
1259 | Receive Buffer Size        | 2    | uint32   | 4096    | 5.2.2   |
1260 | Maximum RDMA Segment Size  | 3    | uint32   | 1048576 | 5.2.3   |
1261 | Maximum RDMA Segment Count | 4    | uint32   | 16      | 5.2.4   |
1262 | Reverse Request Support    | 5    | uint32   | 1       | 5.2.5   |
1263 | Host Auth Message          | 6    | opaque<> | N/A     | 5.2.6   |
1264 +----------------------------+------+----------+---------+---------+

1266 Table 1 1268 5.2.1. Maximum Send Size 1270 The Maximum Send Size specifies the maximum size, in octets, of Send 1271 payloads. The endpoint sending this value ensures that it will not 1272 transmit a Send WR payload larger than this size, allowing the 1273 endpoint receiving this value to size its Receive buffers 1274 appropriately. 1276

1278 const uint32 RDMA2_PROPID_SBSIZ = 1;
1279 typedef uint32 rpcrdma2_prop_sbsiz;

1281 1283 5.2.2. Receive Buffer Size 1285 The Receive Buffer Size specifies the minimum size, in octets, of 1286 pre-posted receive buffers. It is the responsibility of the endpoint 1287 sending this value to ensure that its pre-posted receive buffers are 1288 at least the size specified, allowing the endpoint receiving this 1289 value to send messages that are of this size.
1291

1293 const uint32 RDMA2_PROPID_RBSIZ = 2;
1294 typedef uint32 rpcrdma2_prop_rbsiz;

1296 1298 A sender may use its knowledge of the receiver's buffer size to 1299 determine when the message to be sent will fit in the pre-posted 1300 receive buffers that the receiver has set up. In particular: 1302 o Requesters may use the value to determine when it is necessary to 1303 provide a Position Zero Read chunk or Message Continuation when 1304 sending a request. 1306 o Requesters may use the value to determine when it is necessary to 1307 provide a Reply chunk when sending a request, based on the maximum 1308 possible size of the reply. 1310 o Responders may use the value to determine when it is necessary, 1311 given the actual size of the reply, to use a Reply chunk 1312 provided by the requester. 1314 5.2.3. Maximum RDMA Segment Size 1316 The Maximum RDMA Segment Size specifies the maximum size, in octets, 1317 of an RDMA segment this endpoint is prepared to send or receive. 1319

1321 const uint32 RDMA2_PROPID_RSSIZ = 3;
1322 typedef uint32 rpcrdma2_prop_rssiz;

1324 1326 5.2.4. Maximum RDMA Segment Count 1328 The Maximum RDMA Segment Count specifies the maximum number of RDMA 1329 segments that can appear in a requester's transport header. 1331

1333 const uint32 RDMA2_PROPID_RCSIZ = 4;
1334 typedef uint32 rpcrdma2_prop_rcsiz;

1336 1338 5.2.5. Reverse Request Support 1340 The value of this property is used to indicate a client 1341 implementation's readiness to accept and process messages that are 1342 part of reverse direction RPC requests. 1344

1346 const uint32 RDMA2_RVREQSUP_NONE = 0;
1347 const uint32 RDMA2_RVREQSUP_INLINE = 1;
1348 const uint32 RDMA2_RVREQSUP_GENL = 2;
1350 const uint32 RDMA2_PROPID_BRS = 5;
1351 typedef uint32 rpcrdma2_prop_brs;

1353 1355 Multiple levels of support are distinguished: 1357 o The value RDMA2_RVREQSUP_NONE indicates that receipt of reverse 1358 direction requests and replies is not supported.

   o  The value RDMA2_RVREQSUP_INLINE indicates that receipt of reverse
      direction requests or replies is only supported using inline
      messages and that use of explicit RDMA operations for reverse
      direction messages is not supported.

   o  The value RDMA2_RVREQSUP_GENL indicates that receipt of reverse
      direction requests or replies is supported in the same ways that
      forward direction requests or replies typically are.

   When information about this property is not provided, the support
   level of servers can be inferred from the reverse direction requests
   that they issue, assuming that issuing a request implicitly indicates
   support for receiving the corresponding reply. On this basis,
   support for receiving inline replies can be assumed when requests
   without Read chunks, Write chunks, or Reply chunks are issued, while
   requests with any of these elements allow the client to assume that
   general support for reverse direction replies is present on the
   server.

5.2.6. Host Authentication Message

   The value of this transport property is used as part of an exchange
   of host authentication material. This property can accommodate
   authentication handshakes that require multiple challenge-response
   interactions, and potentially large amounts of material.

   const uint32 RDMA2_PROPID_HOSTAUTH = 6;
   typedef opaque rpcrdma2_prop_hostauth<>;

   When this property is not provided, the peer(s) remain
   unauthenticated. Local security policy on each peer determines
   whether the connection is permitted to continue.

6. RPC-over-RDMA Version 2 Transport Messages

6.1. Overall Transport Message Structure

   Each transport message consists of multiple sections:

   o  A transport header prefix, as defined in Section 6.3.2. Among
      other things, this structure indicates the header type.

   o  The transport header proper, as defined by one of the sub-sections
      below.
      See Section 6.2 for the mapping between header types and
      the corresponding header structure.

   o  Potentially, all or part of an RPC message payload being conveyed
      as an addendum to the transport header.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which presented the first and
   second of the items above as a single XDR item. The new organization
   is more in keeping with RPC-over-RDMA version 2's extensibility model
   in that new header types can be defined without modifying the
   existing set of header types.

6.2. Transport Header Types

   The new header types within RPC-over-RDMA version 2 are set forth in
   Table 2. In that table, the columns contain the following
   information:

   o  The column labeled "Operation" specifies the particular operation.

   o  The column labeled "Code" specifies the value of the header type
      for this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to describe the information in this new message
      type. This data immediately follows the universal portion of the
      transport header present in every RPC-over-RDMA transport header.

   o  The column labeled "Msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "Sec" indicates the section (within this
      document) that explains the semantics and use of this operation.

   +-------------------------+------+-------------------+-----+--------+
   | Operation               | Code | XDR type          | Msg | Sec    |
   +-------------------------+------+-------------------+-----+--------+
   | Convey Appended RPC     | 0    | rpcrdma2_msg      | Yes | 6.4.1  |
   | Message                 |      |                   |     |        |
   | Convey External RPC     | 1    | rpcrdma2_nomsg    | No  | 6.4.2  |
   | Message                 |      |                   |     |        |
   | Report Transport Error  | 4    | rpcrdma2_err      | No  | 6.4.3  |
   | Specify Properties at   | 5    | rpcrdma2_connprop | No  | 6.4.4  |
   | Connection              |      |                   |     |        |
   +-------------------------+------+-------------------+-----+--------+

                                  Table 2

   Support for the operations in Table 2 is REQUIRED. Support for
   additional operations will be OPTIONAL. RPC-over-RDMA version 2
   implementations that receive an OPTIONAL operation that is not
   supported MUST respond with an RDMA2_ERROR message with an error code
   of RDMA2_ERR_INVAL_HTYPE.

6.3. RPC-over-RDMA Version 2 Headers and Chunks

   Most RPC-over-RDMA version 2 data structures are derived from
   corresponding structures in RPC-over-RDMA version 1. As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names and there are a few small changes in content. In some
   cases, there have been structural re-organizations to enable
   protocol extensibility.

6.3.1. Common Transport Header Prefix

   The rpcrdma_common prefix describes the first part of each
   RPC-over-RDMA transport header for version 2 and subsequent versions.

   struct rpcrdma_common {
           uint32 rdma_xid;
           uint32 rdma_vers;
           uint32 rdma_credit;
           uint32 rdma_htype;
   };

   RPC-over-RDMA version 2's use of these first four words matches that
   of version 1 as required by [RFC8166].
   However, there are important
   structural differences in the way that these words are described by
   the respective XDR descriptions:

   o  The header type is represented as a uint32 rather than as an enum
      that would need to be modified to reflect additions to the set of
      header types made by later extensions.

   o  The header type field is part of an XDR structure devoted to
      representing the transport header prefix, rather than being part
      of a discriminated union that includes the body of each transport
      header type.

   o  There is now a prefix structure (see Section 6.3.2) of which the
      rpcrdma_common structure is the initial segment. This is a newly
      defined XDR object within the protocol description, in contrast
      with RPC-over-RDMA version 1, which limits the common portion of
      all header types to the four words in rpcrdma_common.

   These changes are part of a larger structural change in the XDR
   description of RPC-over-RDMA version 2 that enables a cleaner
   treatment of protocol extension. The XDR appearing in Section 7
   reflects these changes, which are discussed in further detail in
   Appendix C.1.

6.3.2. RPC-over-RDMA Version 2 Transport Header Prefix

   The following prefix structure appears at the start of any
   RPC-over-RDMA version 2 transport header.

   const RPCRDMA2_F_RESPONSE = 0x00000001;
   const RPCRDMA2_F_MORE = 0x00000002;

   struct rpcrdma2_hdr_prefix {
           struct rpcrdma_common rdma_start;
           uint32 rdma_flags;
   };

   The rdma_flags field is new to RPC-over-RDMA version 2. Currently,
   the only flags defined within this word are the RPCRDMA2_F_RESPONSE
   flag and the RPCRDMA2_F_MORE flag. The other bits are reserved for
   future use as described in Appendix B.2. The sender MUST set these
   reserved bits to zero.

6.3.2.1. RPCRDMA2_F_RESPONSE Flag

   The RPCRDMA2_F_RESPONSE flag qualifies the value contained in the
   transport header's rdma_start.rdma_xid field. The
   RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid
   performing an XID lookup on incoming reverse direction Call messages.

   In general, when a message carries an XID that was generated by the
   message's receiver (that is, the receiver is acting as a requester),
   the message's sender sets the RPCRDMA2_F_RESPONSE flag. Otherwise
   that flag is clear. For example:

   o  When the rdma_start.rdma_htype field has the value RDMA2_MSG or
      RDMA2_NOMSG, the value of the RPCRDMA2_F_RESPONSE flag MUST be the
      same as the value of the associated RPC message's msg_type field.

   o  When the header type is anything else and a whole or partial RPC
      message payload is present, the value of the RPCRDMA2_F_RESPONSE
      flag MUST be the same as the value of the associated RPC message's
      msg_type field.

   o  When no RPC message payload is present, a requester MUST set the
      value of RPCRDMA2_F_RESPONSE to reflect how the receiver is to
      interpret the rdma_start.rdma_xid field.

   o  When the rdma_start.rdma_htype field has the value RDMA2_ERROR,
      the RPCRDMA2_F_RESPONSE flag MUST be set.

6.3.2.2. RPCRDMA2_F_MORE Flag

   The RPCRDMA2_F_MORE flag signifies that the RPC-over-RDMA message
   payload continues in the next message. This is referred to as
   Message Continuation, or Send chaining.

   When the RPCRDMA2_F_MORE flag is asserted, the receiver is to
   concatenate the data payload of the next received message to the end
   of the data payload of the current received message. The sender
   clears the RPCRDMA2_F_MORE flag in the final message in the sequence.

   All RPC-over-RDMA messages in such a sequence MUST have the same
   values in the rdma_start.rdma_xid and rdma_start.rdma_htype fields.
   If this constraint is not met, the receiver MUST respond with an
   RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_FLAG.

   If a peer receives an RPC-over-RDMA message where the RPCRDMA2_F_MORE
   flag is set and the rdma_start.rdma_htype field does not contain
   RDMA2_MSG or RDMA2_CONNPROP, the receiver MUST respond with an
   RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_FLAG.

   [ dnoveck: Both the above and your error in the existing third
   paragraph raise issues since they could be sent by a responder. Will
   need to fix RDMA2_ERROR so that this can be done when appropriate. ]

   When the RPCRDMA2_F_MORE flag is set in an individual message, that
   message's chunk lists MUST be empty. Chunks for a chained message
   may be conveyed in the final message in the sequence, whose
   RPCRDMA2_F_MORE flag is clear.

   There is no protocol-defined limit on the number of concatenated
   messages in a sequence. If the sender exhausts the receiver's credit
   grant before the final message is sent, the sender MUST wait for a
   further credit grant from the receiver before continuing to send
   messages.

   Credit exhaustion can occur at the receiver in the middle of a
   sequence of continued messages. To enable the sender to continue
   sending the remaining messages in the sequence, the receiver can
   grant more credits by sending an RPC message payload or an
   out-of-band credit grant (see Section 4.3.1.2).

6.3.3. Describing External Data Payloads

   The rpcrdma2_chunk_lists structure specifies how an RPC message is
   conveyed using explicit RDMA operations.

   struct rpcrdma2_chunk_lists {
           uint32 rdma_inv_handle;
           struct rpcrdma2_read_list *rdma_reads;
           struct rpcrdma2_write_list *rdma_writes;
           struct rpcrdma2_write_chunk *rdma_reply;
   };

   For the most part this structure parallels its RPC-over-RDMA version
   1 equivalent. That is, the rdma_reads, rdma_writes, and rdma_reply
   fields provide, respectively, descriptions of the chunks used to read
   a Long message or directly placed data from the requester, to write
   directly placed response data into the requester's memory, and to
   write a Long reply into the requester's memory.

6.3.3.1. Chunks and Chunk Lists

   The chunks and chunk list structures follow the same rules as in
   Section 3.4 of [RFC8166], with these exceptions:

   o  In RPC-over-RDMA version 1, there were cases where XDR padding was
      allowed to appear in a reduced XDR data item. However, in
      RPC-over-RDMA version 2, requesters and responders MUST NOT
      include XDR padding in reduced Read and Write chunks, but chunks
      that make up Position Zero Read chunks and Reply chunks MUST
      include all XDR padding.

   o  A responder MUST use Message Continuation if the requester does
      not provide a Reply chunk and the actual size of the reply is
      larger than the connection's inline threshold. A responder MAY
      use Message Continuation even if the requester has provided
      adequate Reply resources. This makes it unnecessary for
      RPC-over-RDMA version 2 requesters to have perfect reply size
      estimation.

6.3.3.2. Remote Invalidation

   An important addition relative to the corresponding RPC-over-RDMA
   version 1 rdma_header structures is the rdma_inv_handle field. This
   field supports remote invalidation of requester memory registrations
   via the RDMA Send With Invalidate operation.
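
   To make the on-the-wire layout of rpcrdma2_chunk_lists concrete,
   including the position of the rdma_inv_handle field, the following
   Python sketch serializes the structure using the standard XDR rules
   for optional data and variable-length arrays [RFC4506]. It is an
   informal illustration rather than a normative encoder. The function
   names and the tuple-based argument layout are hypothetical; the
   field order follows the XDR definitions in Sections 6.3.3 and 7.3.

```python
import struct


def encode_segment(handle, length, offset):
    # rpcrdma2_segment: uint32 rdma_handle, uint32 rdma_length,
    # uint64 rdma_offset, all big-endian as XDR requires.
    return struct.pack(">IIQ", handle, length, offset)


def encode_write_chunk(segments):
    # rpcrdma2_write_chunk: a counted array of segments.
    out = struct.pack(">I", len(segments))
    for seg in segments:
        out += encode_segment(*seg)
    return out


def encode_chunk_lists(inv_handle, reads=(), writes=(), reply=None):
    """Serialize an rpcrdma2_chunk_lists structure (illustrative only).

    reads  -- iterable of (position, (handle, length, offset))
    writes -- iterable of Write chunks, each a list of segment tuples
    reply  -- a list of segment tuples, or None for no Reply chunk
    """
    out = struct.pack(">I", inv_handle)
    # Each optional-data list entry is preceded by a TRUE (1) word;
    # a FALSE (0) word terminates the list.
    for position, seg in reads:
        out += struct.pack(">II", 1, position) + encode_segment(*seg)
    out += struct.pack(">I", 0)
    for chunk in writes:
        out += struct.pack(">I", 1) + encode_write_chunk(chunk)
    out += struct.pack(">I", 0)
    if reply is None:
        out += struct.pack(">I", 0)
    else:
        out += struct.pack(">I", 1) + encode_write_chunk(reply)
    return out
```

   With no chunks and a zero rdma_inv_handle, this sketch produces four
   zero words: the handle followed by the three empty-list markers.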

   To request Remote Invalidation, a requester sets the value of the
   rdma_inv_handle field in an RPC Call's transport header to a non-zero
   value that matches one of the rdma_handle fields in that header. If
   none of the rdma_handle values in the header conveying the Call may
   be invalidated by the responder, the requester sets the RPC Call's
   rdma_inv_handle field to the value zero.

   If the responder chooses not to use remote invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the responder uses RDMA Send to transmit the
   matching RPC Reply.

   If a requester has provided a non-zero value in the RPC Call's
   rdma_inv_handle field and the responder chooses to use Remote
   Invalidation for the matching RPC Reply, the responder uses RDMA Send
   With Invalidate to transmit that RPC Reply, and uses the value in the
   corresponding Call's rdma_inv_handle field to construct the Send With
   Invalidate Work Request.

6.4. Header Types Defined in RPC-over-RDMA version 2

   The header types defined and used in RPC-over-RDMA version 1 are all
   carried over into RPC-over-RDMA version 2, although there may be
   limited changes in the definition of existing header types.

   In comparison with the header types of RPC-over-RDMA version 1, the
   changes can be summarized as follows:

   o  To simplify interoperability with RPC-over-RDMA version 1, only
      the RDMA2_ERROR header (defined in Section 6.4.3) has an XDR
      definition that differs from that in RPC-over-RDMA version 1, and
      its modifications are all compatible extensions.

   o  RDMA2_MSG and RDMA2_NOMSG (defined in Sections 6.4.1 and 6.4.2)
      have XDR definitions that match the corresponding
      RPC-over-RDMA version 1 header types.
      However, because of the
      changes to the header prefix, the version 1 and version 2 header
      types differ in on-the-wire format.

   o  RDMA2_CONNPROP (defined in Section 6.4.4) is a completely new
      header type devoted to enabling connection peers to exchange
      information about their transport properties.

6.4.1. RDMA2_MSG: Convey RPC Message Inline

   RDMA2_MSG is used to convey an RPC message that immediately follows
   the Transport Header in the Send buffer. This is either an RPC
   request that has no Position Zero Read chunk or an RPC reply that is
   not sent using a Reply chunk.

   const rpcrdma2_proc RDMA2_MSG = 0;

   struct rpcrdma2_msg {
           struct rpcrdma2_chunk_lists rdma_chunks;

           /* The rpc message starts here and continues
            * through the end of the transmission. */
           uint32 rdma_rpc_first_word;
   };

6.4.2. RDMA2_NOMSG: Convey External RPC Message

   RDMA2_NOMSG can convey an entire RPC message payload using explicit
   RDMA operations. When an RPC message payload is present, this
   message type is also known as a Long message. In particular, it is a
   Long call when the responder reads the RPC payload from a memory area
   specified by a Position Zero Read chunk; and it is a Long reply when
   the responder writes the RPC payload into a memory area specified by
   a Reply chunk. In both of these cases, the rdma_xid field is set to
   the same value as the xid of the RPC message payload.

   If all the chunk lists are empty (i.e., three 32-bit zeroes in the
   chunk list fields), the message conveys a credit grant refresh. The
   header prefix of this message contains a credit grant refresh in the
   rdma_credit field. In this case, the sender MUST set the rdma_xid
   field to zero.
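
   The credit grant refresh described above can be sketched in Python
   as follows. This is an informal illustration under stated
   assumptions, not a normative encoder: the function name is
   hypothetical, the value 2 for rdma_vers is assumed rather than taken
   from this section, and the rdma_flags value is left to the caller
   because its setting for messages without an RPC payload is governed
   by Section 6.3.2.1. The rdma_inv_handle field is shown as zero since
   no memory handles are conveyed.

```python
import struct

RPCRDMA_VERSION_2 = 2  # assumed value of rdma_vers for this protocol version
RDMA2_NOMSG = 1        # header type code from Table 2


def encode_credit_refresh(credits, flags=0):
    """Build an RDMA2_NOMSG message that only refreshes the credit grant.

    Hypothetical helper for illustration: `flags` defaults to zero as a
    placeholder, not as a statement of what the protocol requires.
    """
    # rpcrdma_common: rdma_xid (MUST be zero here), rdma_vers,
    # rdma_credit, rdma_htype.
    prefix = struct.pack(">IIII", 0, RPCRDMA_VERSION_2, credits, RDMA2_NOMSG)
    # rdma_flags completes the rpcrdma2_hdr_prefix structure.
    prefix += struct.pack(">I", flags)
    # rpcrdma2_chunk_lists: rdma_inv_handle followed by three empty
    # lists, each encoded as a single zero word.
    chunk_lists = struct.pack(">IIII", 0, 0, 0, 0)
    return prefix + chunk_lists
```

   The resulting message is nine 32-bit words long: five for the header
   prefix and four for the empty chunk lists.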

   const rpcrdma2_proc RDMA2_NOMSG = 1;

   struct rpcrdma2_nomsg {
           struct rpcrdma2_chunk_lists rdma_chunks;
   };

   In RPC-over-RDMA version 2, an alternative to using a Long message is
   to use Message Continuation.

6.4.3. RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR provides a way of reporting the occurrence of transport
   errors on a previous transmission. This header type MUST NOT be
   transmitted by a requester.

   const rpcrdma2_proc RDMA2_ERROR = 4;

   struct rpcrdma2_err_vers {
           uint32 rdma_vers_low;
           uint32 rdma_vers_high;
   };

   struct rpcrdma2_err_write {
           uint32 rdma_chunk_index;
           uint32 rdma_length_needed;
   };

   union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
           case RDMA2_ERR_VERS:
             rpcrdma2_err_vers rdma_vrange;
           case RDMA2_ERR_READ_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_WRITE_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_SEGMENTS:
             uint32 rdma_max_segments;
           case RDMA2_ERR_WRITE_RESOURCE:
             rpcrdma2_err_write rdma_writeres;
           case RDMA2_ERR_REPLY_RESOURCE:
             uint32 rdma_length_needed;
           default:
             void;
   };

   Error reporting is addressed in RPC-over-RDMA version 2 in a fashion
   similar to RPC-over-RDMA version 1. Several new error codes have
   been added, and error messages never flow from requester to
   responder. RPC-over-RDMA version 1 error reporting is described in
   Section 5 of [RFC8166].

   Unless otherwise specified, in all cases below, the responder copies
   the values of the rdma_start.rdma_xid and rdma_start.rdma_vers fields
   from the incoming transport header that generated the error to the
   transport header of the error response. The responder sets the
   rdma_start.rdma_htype field of the transport header prefix to
   RDMA2_ERROR, and the rdma_start.rdma_credit field is set to the
   credit grant value for this connection.
   The receiver of this header
   type MUST ignore the value of the rdma_start.rdma_credit field.

   RDMA2_ERR_VERS
      This is the equivalent of ERR_VERS in RPC-over-RDMA version 1.
      The error code value, semantics, and utilization are the same.

   RDMA2_ERR_INVAL_HTYPE
      If a responder recognizes the value in the rdma_start.rdma_vers
      field, but it does not recognize the value in the
      rdma_start.rdma_htype field or does not support that header type,
      it MUST set the rdma_err field to RDMA2_ERR_INVAL_HTYPE.

   RDMA2_ERR_INVAL_FLAG
      If a receiver recognizes the value in the rdma_start.rdma_htype
      field but does not recognize the combination of flags in the
      rdma_flags field, it MUST set the rdma_err field to
      RDMA2_ERR_INVAL_FLAG.

   RDMA2_ERR_BAD_XDR
      If a responder recognizes the values in the rdma_start.rdma_vers
      and rdma_start.rdma_htype fields, but the incoming RPC-over-RDMA
      transport header cannot be parsed, it MUST set the rdma_err field
      to RDMA2_ERR_BAD_XDR. This includes cases in which a nominally
      opaque property value field cannot be parsed using the XDR typedef
      associated with the transport property definition. The error code
      value of RDMA2_ERR_BAD_XDR is the same as the error code value of
      ERR_CHUNK in RPC-over-RDMA version 1. The responder MUST NOT
      process the request in any way except to send an error message.

   RDMA2_ERR_READ_CHUNKS
      If a requester presents more DDP-eligible arguments than the
      responder is prepared to Read, the responder MUST set the rdma_err
      field to RDMA2_ERR_READ_CHUNKS, and set the rdma_max_chunks field
      to the maximum number of Read chunks the responder can receive and
      process.

      If the responder implementation cannot handle any Read chunks for
      a request, it MUST set the rdma_max_chunks field to zero in this
      response. The requester SHOULD resend the request using a
      Position Zero Read chunk.
      If this was a request using a Position
      Zero Read chunk, the requester MUST terminate the transaction with
      an error.

   RDMA2_ERR_WRITE_CHUNKS
      If a requester has constructed an RPC Call message with more
      DDP-eligible results than the responder is prepared to Write, the
      responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS,
      and set the rdma_max_chunks field to the maximum number of Write
      chunks the responder can process and return.

      If the responder implementation cannot handle any Write chunks for
      a request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE
      (below). The requester SHOULD resend the request with no Write
      chunks and a Reply chunk of appropriate size.

   RDMA2_ERR_SEGMENTS
      If a requester has constructed an RPC Call message with a chunk
      that contains more segments than the responder supports, the
      responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS, and
      set the rdma_max_segments field to the maximum number of segments
      the responder can process.

   RDMA2_ERR_WRITE_RESOURCE
      If a requester has provided a Write chunk that is not large enough
      to fully convey a DDP-eligible result, the responder MUST set the
      rdma_err field to RDMA2_ERR_WRITE_RESOURCE.

      The responder MUST set the rdma_chunk_index field to point to the
      first Write chunk in the transport header that is too short, or to
      zero to indicate that it was not possible to determine which chunk
      is too small. Indexing starts at one (1), which represents the
      first Write chunk. The responder MUST set the rdma_length_needed
      field to the number of bytes needed in that chunk in order to
      convey the result data item.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      index and length fields to zero), or it MAY send the request again
      using the same XID and more reply resources.

   RDMA2_ERR_REPLY_RESOURCE
      If an RPC Reply's Payload stream does not fit inline and the
      requester has not provided a large enough Reply chunk to convey
      the stream, the responder MUST set the rdma_err field to
      RDMA2_ERR_REPLY_RESOURCE. The responder MUST set the
      rdma_length_needed field to the number of Reply chunk bytes needed
      to convey the reply.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      length field to zero), or it MAY send the request again using the
      same XID and larger reply resources.

   RDMA2_ERR_SYSTEM
      If some problem occurs on a responder that does not fit into the
      above categories, the responder MAY report it to the requester by
      setting the rdma_err field to RDMA2_ERR_SYSTEM.

      This is a permanent error: a requester that receives this error
      MUST terminate the RPC transaction associated with the XID value
      in the rdma_start.rdma_xid field.

6.4.4. RDMA2_CONNPROP: Advertise Transport Properties

   The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint,
   whether client or server, to indicate to its partner relevant
   transport properties that the partner might need to be aware of.

   The message definition for this operation is as follows:

   struct rpcrdma2_connprop {
           rpcrdma2_propset rdma_props;
   };

   All relevant transport properties that the sender is aware of should
   be included in rdma_props. Since support of each of the properties
   is OPTIONAL, the sender cannot assume that the receiver will
   necessarily take note of these properties. The sender should be
   prepared for cases in which the receiver continues to assume that the
   default value for a particular property is still in effect.

   Generally, a participant will send an RDMA2_CONNPROP message as the
   first message after a connection is established.
   Given that fact,
   the sender should make sure that the message can be received by peers
   who use the default Receive Buffer Size. The connection's initial
   receive buffer size is typically 1 KB, but it depends on the initial
   connection state of the RPC-over-RDMA version in use.

   Properties not included in rdma_props are to be treated by the peer
   endpoint as having the default value and are not allowed to change
   subsequently. The peer should not request changes in such
   properties.

   Those receiving an RDMA2_CONNPROP may encounter properties that they
   do not support or are unaware of. In such cases, these properties
   are simply ignored without any error response being generated.

6.5. Choosing a Reply Mechanism

   A requester provides any necessary registered memory resources for
   both an RPC Call message and its matching RPC Reply message. A
   requester forms each RPC Call itself, thus it can compute the exact
   memory resources needed to send every Call. However, the requester
   must allocate memory resources to receive the corresponding Reply
   before the responder has formed it. In some cases it is difficult
   for the requester to know in advance precisely what resources will be
   needed to receive the Reply.

   In RPC-over-RDMA version 2, a requester MAY provide a Reply chunk at
   any time. The responder MAY use the provided Reply chunk or decide
   to use another means to convey the RPC Reply. If the combination of
   the provided Write chunk list and Reply chunk is not adequate to
   convey a Reply, the responder SHOULD use Message Continuation (see
   Section 6.3.2.2) to send that Reply.

   If even that is not possible, the responder sends an RDMA2_ERROR
   message to the requester, as described in Section 6.4.3:

   o  The responder MUST send an RDMA2_ERR_WRITE_RESOURCE error if the
      Write chunk list cannot accommodate the ULP's DDP-eligible data
      payload.

   o  The responder MUST send an RDMA2_ERR_REPLY_RESOURCE error if the
      Reply chunk cannot accommodate the non-DDP-eligible parts of the
      Reply.

   When receiving such errors, the requester SHOULD retry the ULP call
   using larger reply resources. In cases where retrying the ULP
   request is not possible, the requester terminates the RPC request and
   presents an error to the RPC consumer.

7. XDR Protocol Definition

   This section contains a description of the core features of the
   RPC-over-RDMA version 2 protocol expressed in the XDR language
   [RFC4506].

   Because of the need to provide for protocol extensibility without
   modifying an existing XDR definition, this description has some
   important structural differences from the corresponding XDR
   description for RPC-over-RDMA version 1, which appears in [RFC8166].

   This description is divided into three parts:

   o  A code component license, which appears in Section 7.1.

   o  An XDR description of the structures that are generally available
      for use by transport header types including both those defined in
      this document and those that may be defined as extensions. This
      includes definitions of the chunk-related structures derived from
      RPC-over-RDMA version 1, the transport property model introduced
      in this document, and a definition of the transport header
      prefixes that precede the various transport header types. This
      appears in Section 7.3.

   o  An XDR description of the transport header types defined in this
      document, including those derived from RPC-over-RDMA version 1 and
      those introduced in RPC-over-RDMA version 2. This appears in
      Section 7.4.

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form.
   To enable the combination of this
   description with the descriptions of subsequent extensions to
   RPC-over-RDMA version 2, the extracted description can be combined
   with similar descriptions published later, or those descriptions can
   be compiled separately. Refer to Section 7.2 for details.

7.1. Code Component License

   Code components extracted from this document must include the
   following license text. When the extracted XDR code is combined with
   other complementary XDR code which itself has an identical license,
   only a single copy of the license text need be preserved.

   /// /*
   ///  * Copyright (c) 2010-2018 IETF Trust and the persons
   ///  * identified as authors of the code. All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///

7.2. Extraction and Use of XDR Definitions

   The reader can apply the following sed script to this document to
   produce a machine-readable XDR description of the RPC-over-RDMA
   version 2 protocol without any OPTIONAL extensions.

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

   That is, if this document is in a file called "spec.txt" then the
   reader can do the following to extract an XDR description file and
   store it in the file rpcrdma-v2.x.

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
        < spec.txt > rpcrdma-v2.x

   Although this file is a usable description of the base protocol, when
   extensions are to be supported, it may be desirable to divide it into
   multiple files.
The following script can be used for that purpose: 2091 2093 #!/usr/local/bin/perl 2094 open(IN,"rpcrdma-v2.x"); 2095 open(OUT,">temp.x"); 2096 while() 2097 { 2098 if (m/FILE ENDS: (.*)$/) 2099 { 2100 close(OUT); 2101 rename("temp.x", $1); 2102 open(OUT,">temp.x"); 2103 } 2104 else 2105 { 2106 print OUT $_; 2107 } 2108 } 2109 close(IN); 2110 close(OUT); 2112 2114 Running the above script will result in two files: 2116 o The file common.x, containing the license plus the common XDR 2117 definitions which need to be made available to both the base 2118 operations and any subsequent extensions. 2120 o The file baseops.x containing the XDR definitions for the base 2121 operations, defined in this document. 2123 Optional extensions to RPC-over-RDMA version 2, published as 2124 Standards Track documents, will have similar means of providing XDR 2125 that describes those extensions. Once XDR for all desired extensions 2126 is also extracted, it can be appended to the XDR description file 2127 extracted from this document to produce a consolidated XDR 2128 description file reflecting all extensions selected for an RPC-over- 2129 RDMA implementation. 2131 Alternatively, the XDR descriptions can be compiled separately. In 2132 this case the combination of common.x and baseops.x serves to define 2133 the base transport, while using as XDR descriptions for extensions, 2134 the XDR from the document defining that extension, together with the 2135 file common.x, obtained from this document. 2137 7.3. 
XDR Definition for RPC-over-RDMA Version 2 Core Structures 2139 2140 /// /******************************************************************* 2141 /// * Transport Header Prefixes 2142 /// ******************************************************************/ 2143 /// 2144 /// struct rpcrdma_common { 2145 /// uint32 rdma_xid; 2146 /// uint32 rdma_vers; 2147 /// uint32 rdma_credit; 2148 /// uint32 rdma_htype; 2149 /// }; 2150 /// 2151 /// const RPCRDMA2_F_RESPONSE = 0x00000001; 2152 /// const RPCRDMA2_F_MORE = 0x00000002; 2153 /// 2154 /// struct rpcrdma2_hdr_prefix { 2155 /// struct rpcrdma_common rdma_start; 2156 /// uint32 rdma_flags; 2157 /// }; 2158 /// 2159 /// /******************************************************************* 2160 /// * Chunks and Chunk Lists 2161 /// ******************************************************************/ 2162 /// 2163 /// struct rpcrdma2_segment { 2164 /// uint32 rdma_handle; 2165 /// uint32 rdma_length; 2166 /// uint64 rdma_offset; 2167 /// }; 2168 /// 2169 /// struct rpcrdma2_read_segment { 2170 /// uint32 rdma_position; 2171 /// struct rpcrdma2_segment rdma_target; 2172 /// }; 2173 /// 2174 /// struct rpcrdma2_read_list { 2175 /// struct rpcrdma2_read_segment rdma_entry; 2176 /// struct rpcrdma2_read_list *rdma_next; 2177 /// }; 2178 /// 2179 /// struct rpcrdma2_write_chunk { 2180 /// struct rpcrdma2_segment rdma_target<>; 2181 /// }; 2182 /// 2183 /// struct rpcrdma2_write_list { 2184 /// struct rpcrdma2_write_chunk rdma_entry; 2185 /// struct rpcrdma2_write_list *rdma_next; 2186 /// }; 2187 /// 2188 /// struct rpcrdma2_chunk_lists { 2189 /// uint32 rdma_inv_handle; 2190 /// struct rpcrdma2_read_list *rdma_reads; 2191 /// struct rpcrdma2_write_list *rdma_writes; 2192 /// struct rpcrdma2_write_chunk *rdma_reply; 2193 /// }; 2194 /// 2195 /// /******************************************************************* 2196 /// * Transport Properties 2197 /// ******************************************************************/ 2198 /// 2199 ///
/* 2200 /// * Types for transport properties model 2201 /// */ 2202 /// typedef uint32 rpcrdma2_propid; 2203 /// 2204 /// struct rpcrdma2_propval { 2205 /// rpcrdma2_propid rdma_which; 2206 /// opaque rdma_data<>; 2207 /// }; 2208 /// 2209 /// typedef rpcrdma2_propval rpcrdma2_propset<>; 2210 /// typedef uint32 rpcrdma2_propsubset<>; 2211 /// 2212 /// /* 2213 /// * Transport propid values for basic properties 2214 /// */ 2215 /// const uint32 RDMA2_PROPID_SBSIZ = 1; 2216 /// const uint32 RDMA2_PROPID_RBSIZ = 2; 2217 /// const uint32 RDMA2_PROPID_RSSIZ = 3; 2218 /// const uint32 RDMA2_PROPID_RCSIZ = 4; 2219 /// const uint32 RDMA2_PROPID_BRS = 5; 2220 /// const uint32 RDMA2_PROPID_HOSTAUTH = 6; 2221 /// 2222 /// /* 2223 /// * Types specific to particular properties 2224 /// */ 2225 /// typedef uint32 rpcrdma2_prop_sbsiz; 2226 /// typedef uint32 rpcrdma2_prop_rbsiz; 2227 /// typedef uint32 rpcrdma2_prop_rssiz; 2228 /// typedef uint32 rpcrdma2_prop_rcsiz; 2229 /// typedef uint32 rpcrdma2_prop_brs; 2230 /// typedef opaque rpcrdma2_prop_hostauth<>; 2231 /// 2232 /// const uint32 RDMA_RVREQSUP_NONE = 0; 2233 /// const uint32 RDMA_RVREQSUP_INLINE = 1; 2234 /// const uint32 RDMA_RVREQSUP_GENL = 2; 2235 /// 2236 /// /* FILE ENDS: common.x; */ 2238 2240 7.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types 2242 2243 /// /******************************************************************* 2244 /// * Descriptions of RPC-over-RDMA Header Types 2245 /// ******************************************************************/ 2246 /// 2247 /// /* 2248 /// * Header Type Codes. 2249 /// */ 2250 /// const rpcrdma2_proc RDMA2_MSG = 0; 2251 /// const rpcrdma2_proc RDMA2_NOMSG = 1; 2252 /// const rpcrdma2_proc RDMA2_ERROR = 4; 2253 /// const rpcrdma2_proc RDMA2_CONNPROP = 5; 2254 /// 2255 /// /* 2256 /// * Header Types to Convey RPC Messages.
2257 /// */ 2258 /// struct rpcrdma2_msg { 2259 /// struct rpcrdma2_chunk_lists rdma_chunks; 2260 /// 2261 /// /* The rpc message starts here and continues 2262 /// * through the end of the transmission. */ 2263 /// uint32 rdma_rpc_first_word; 2264 /// }; 2265 /// 2266 /// struct rpcrdma2_nomsg { 2267 /// struct rpcrdma2_chunk_lists rdma_chunks; 2268 /// }; 2269 /// 2270 /// /* 2271 /// * Header Type to Report Errors. 2272 /// */ 2273 /// const uint32 RDMA2_ERR_VERS = 1; 2274 /// const uint32 RDMA2_ERR_BAD_XDR = 2; 2275 /// const uint32 RDMA2_ERR_INVAL_HTYPE = 3; 2276 /// const uint32 RDMA2_ERR_INVAL_FLAG = 4; 2277 /// const uint32 RDMA2_ERR_READ_CHUNKS = 5; 2278 /// const uint32 RDMA2_ERR_WRITE_CHUNKS = 6; 2279 /// const uint32 RDMA2_ERR_SEGMENTS = 7; 2280 /// const uint32 RDMA2_ERR_WRITE_RESOURCE = 8; 2281 /// const uint32 RDMA2_ERR_REPLY_RESOURCE = 9; 2282 /// const uint32 RDMA2_ERR_SYSTEM = 10; 2283 /// 2284 /// struct rpcrdma2_err_vers { 2285 /// uint32 rdma_vers_low; 2286 /// uint32 rdma_vers_high; 2287 /// }; 2288 /// 2289 /// struct rpcrdma2_err_write { 2290 /// uint32 rdma_chunk_index; 2291 /// uint32 rdma_length_needed; 2292 /// }; 2293 /// 2294 /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) { 2295 /// case RDMA2_ERR_VERS: 2296 /// rpcrdma2_err_vers rdma_vrange; 2297 /// case RDMA2_ERR_READ_CHUNKS: 2298 /// uint32 rdma_max_chunks; 2299 /// case RDMA2_ERR_WRITE_CHUNKS: 2300 /// uint32 rdma_max_chunks; 2301 /// case RDMA2_ERR_SEGMENTS: 2302 /// uint32 rdma_max_segments; 2303 /// case RDMA2_ERR_WRITE_RESOURCE: 2304 /// rpcrdma2_err_write rdma_writeres; 2305 /// case RDMA2_ERR_REPLY_RESOURCE: 2306 /// uint32 rdma_length_needed; 2307 /// default: 2308 /// void; 2309 /// }; 2310 /// 2311 /// /* 2312 /// * Header Type to Exchange Transport Properties. 2313 /// */ 2314 /// struct rpcrdma2_connprop { 2315 /// rpcrdma2_propset rdma_props; 2316 /// }; 2317 /// 2318 /// /* FILE ENDS: baseops.x; */ 2320 2322 7.5. 
Use of the XDR Description Files 2324 The two files common.x and baseops.x, when combined with the XDR 2325 descriptions for extensions defined later, produce a human-readable 2326 and compilable description of the RPC-over-RDMA version 2 protocol 2327 with the included extensions. 2329 Although this XDR description can be useful in generating code to 2330 encode and decode the transport and payload streams, there are 2331 elements of the structure of RPC-over-RDMA version 2 which are not 2332 expressible within the XDR language as currently defined. This 2333 requires implementations that use the output of the XDR processor to 2334 provide additional code to bridge the gaps. 2336 o The values of transport properties are represented within XDR as 2337 opaque values. However, the actual structures of each of the 2338 properties are represented by XDR typedefs, with the selection of 2339 the appropriate typedef described by text in this document. The 2340 determination of the appropriate typedef is not specified by XDR, 2341 which does not possess the facilities necessary for that 2342 determination to be specified in an extensible way. 2344 This is similar to the way in which NFSv4 attributes are handled 2345 [RFC7530] [RFC5661]. As in that case, implementations that need 2346 to encode and decode these nominally opaque entities need to use 2347 the protocol description to determine the actual XDR 2348 representation that underlies the items described as opaque. 2350 o The transport stream is not represented as a single XDR object. 2351 Instead, the header prefix is described by one XDR object while 2352 the rest of the header is described as another XDR object with the 2353 mapping between the header type in the header prefix and the XDR 2354 object representing the header type represented by tables 2355 contained in this document, with additional mappings being 2356 specifiable by a later extension document.
2358 This situation is similar to that in which RPC message headers 2359 contain program and procedure numbers, so that the XDR for those 2360 requests and replies can be used to encode and decode the 2361 associated messages without requiring that all be present in a 2362 single XDR specification. As in that case, implementations need 2363 to use the header specification to select the appropriate XDR- 2364 generated code to be used in message processing. 2366 o The relationship between the transport stream and the payload 2367 stream is not specified in the XDR itself, although comments 2368 within the XDR text make clear where transported messages, 2369 described by their own XDR, need to appear. Such data by its 2370 nature is opaque to the transport, although its form differs from 2371 XDR opaque arrays. 2373 Potential extensions allowing continuation of RPC messages across 2374 transport message boundaries will require that message assembly 2375 facilities, not specifiable within XDR, also be part of transport 2376 implementations. 2378 To summarize, the role of XDR in this specification is more limited 2379 than for protocols which are themselves XDR programs, where the 2380 totality of the protocol is expressible within the XDR paradigm 2381 established for that purpose. This more limited role reflects the 2382 fact that XDR lacks facilities to represent the embedding of 2383 transported material within the transport framework. In addition, 2384 the need to cleanly accommodate extensions has meant that those using 2385 rpcgen in their applications need to take a more active role in 2386 providing the facilities that cannot be expressed within XDR. 2388 8. RPC Bind Parameters 2390 In setting up a new RDMA connection, the first action by an RPC 2391 client is to obtain a transport address for the RPC server.
The 2392 means used to obtain this address and to open an RDMA connection is 2393 dependent on the type of RDMA transport, and is the responsibility of 2394 each RPC protocol binding and its local implementation. 2396 RPC services normally register with a portmap or rpcbind service 2397 [RFC1833], which associates an RPC Program number with a service 2398 address. This policy is no different with RDMA transports. However, 2399 a different and distinct service address (port number) might 2400 sometimes be required for ULP operation with RPC-over-RDMA. 2402 When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses 2403 IP port addressing due to its layering on TCP and/or SCTP, port 2404 mapping is trivial and consists merely of issuing the port in the 2405 connection process. The NFS/RDMA protocol service address has been 2406 assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP 2407 [RFC8267]. 2409 When mapped atop InfiniBand [IBA], which uses a service endpoint 2410 naming scheme based on a Group Identifier (GID), a translation MUST 2411 be employed. One such translation is described in Annexes A3 2412 (Application Specific Identifiers), A4 (Sockets Direct Protocol 2413 (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate 2414 for translating IP port addressing to the InfiniBand network. 2415 Therefore, in this case, IP port addressing may be readily employed 2416 by the upper layer. 2418 When a mapping standard or convention exists for IP ports on an RDMA 2419 interconnect, there are several possibilities for each upper layer to 2420 consider: 2422 o One possibility is to have the server register its mapped IP port 2423 with the rpcbind service under the netid (or netids) defined in 2424 [RFC8166]. An RPC-over-RDMA-aware RPC client can then resolve its 2425 desired service to a mappable port and proceed to connect. 
This 2426 is the most flexible and compatible approach for those upper 2427 layers that are defined to use the rpcbind service. 2429 o A second possibility is to have the RPC server's portmapper 2430 register itself on the RDMA interconnect at a "well-known" service 2431 address (on UDP or TCP, this corresponds to port 111). An RPC 2432 client could connect to this service address and use the portmap 2433 protocol to obtain a service address in response to a program 2434 number; e.g., an iWARP port number or an InfiniBand GID. 2436 o Alternately, the RPC client could simply connect to the mapped 2437 well-known port for the service itself, if it is appropriately 2438 defined. By convention, the NFS/RDMA service, when operating atop 2439 such an InfiniBand fabric, uses the same 20049 assignment as for 2440 iWARP. 2442 Historically, different RPC protocols have taken different approaches 2443 to their port assignment. Therefore, the specific method is left to 2444 each RPC-over-RDMA-enabled ULB and is not addressed in this document. 2446 [RFC8166] defines two new netid values to be used for registration of 2447 upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port 2448 translation service is available) InfiniBand [IBA]. Additional RDMA- 2449 capable networks MAY define their own netids, or if they provide a 2450 port translation, they MAY share the one defined in [RFC8166]. 2452 9. Implementation Status 2454 This section records the status of known implementations of the 2455 protocol defined by this specification at the time of posting of this 2456 Internet-Draft, and is based on a proposal described in [RFC7942]. 2457 The description of implementations in this section is intended to 2458 assist the IETF in its decision processes in progressing drafts to 2459 RFCs. 2461 Please note that the listing of any individual implementation here 2462 does not imply endorsement by the IETF. 
Furthermore, no effort has 2463 been spent to verify the information presented here that was supplied 2464 by IETF contributors. This is not intended as, and must not be 2465 construed to be, a catalog of available implementations or their 2466 features. Readers are advised to note that other implementations may 2467 exist. 2469 At this time, no known implementations of the protocol described in 2470 this document exist. 2472 10. Security Considerations 2474 10.1. Memory Protection 2476 A primary consideration is the protection of the integrity and 2477 confidentiality of host memory by an RPC-over-RDMA transport. The 2478 use of an RPC-over-RDMA transport protocol MUST NOT introduce 2479 vulnerabilities to system memory contents nor to memory owned by user 2480 processes. 2482 It is REQUIRED that any RDMA provider used for RPC transport be 2483 conformant to the requirements of [RFC5042] in order to satisfy these 2484 protections. These protections are provided by the RDMA layer 2485 specifications, and in particular, their security models. 2487 10.1.1. Protection Domains 2489 The use of Protection Domains to limit the exposure of memory regions 2490 to a single connection is critical. Any attempt by an endpoint not 2491 participating in that connection to reuse memory handles needs to 2492 result in immediate failure of that connection. Because ULP security 2493 mechanisms rely on this aspect of Reliable connected behavior, strong 2494 authentication of remote endpoints is recommended. 2496 10.1.2. Handle (STag) Predictability 2498 Unpredictable memory handles should be used for any operation 2499 requiring advertised memory regions. Advertising a continuously 2500 registered memory region allows a remote host to read or write to 2501 that region even when an RPC involving that memory is not under way. 2502 Therefore, implementations should avoid advertising persistently 2503 registered memory. 2505 10.1.3. 
Memory Protection 2507 Requesters should register memory regions for remote access only when 2508 they are about to be the target of an RPC operation that involves an 2509 RDMA Read or Write. 2511 Registered memory regions should be invalidated as soon as related 2512 RPC operations are complete. Invalidation and DMA unmapping of 2513 memory regions should be complete before message integrity checking 2514 is done and before the RPC consumer is allowed to continue execution 2515 and use or alter the contents of a memory region. 2517 An RPC transaction on a Requester might be terminated before a reply 2518 arrives if the RPC consumer exits unexpectedly (for example, it is 2519 signaled or a segmentation fault occurs). When an RPC terminates 2520 abnormally, memory regions associated with that RPC should be 2521 invalidated appropriately before the regions are released to be 2522 reused for other purposes on the Requester. 2524 10.1.4. Denial of Service 2526 A detailed discussion of denial-of-service exposures that can result 2527 from the use of an RDMA transport is found in Section 6.4 of 2528 [RFC5042]. 2530 A Responder is not obliged to pull Read chunks that are unreasonably 2531 large. The Responder can use an RDMA2_ERROR response to terminate 2532 RPCs with unreadable Read chunks. If a Responder transmits more data 2533 than a Requester is prepared to receive in a Write or Reply chunk, 2534 the RDMA Network Interface Cards (RNICs) typically terminate the 2535 connection. For further discussion, see Section 6.4.3. Such 2536 repeated chunk errors can deny service to other users sharing the 2537 connection from the errant Requester. 2539 An RPC-over-RDMA transport implementation is not responsible for 2540 throttling the RPC request rate, other than to keep the number of 2541 concurrent RPC transactions at or under the number of credits granted 2542 per connection. This is explained in Section 4.3.1. 
A sender can 2543 trigger a self denial of service by exceeding the credit grant 2544 repeatedly. 2546 When an RPC has been canceled due to a signal or premature exit of an 2547 application process, a Requester typically invalidates the RPC's 2548 Write and Reply chunks. Invalidation prevents the subsequent arrival 2549 of the Responder's reply from altering the memory regions associated 2550 with those chunks after the memory has been reused. 2552 On the Requester, a malfunctioning application or a malicious user 2553 can create a situation where RPCs are continuously initiated and then 2554 aborted, resulting in Responder replies that terminate the underlying 2555 RPC-over-RDMA connection repeatedly. Such situations can deny 2556 service to other users sharing the connection from that Requester. 2558 10.2. RPC Message Security 2560 ONC RPC provides cryptographic security via the RPCSEC_GSS framework 2561 [RFC7861]. RPCSEC_GSS implements message authentication 2562 (rpc_gss_svc_none), per-message integrity checking 2563 (rpc_gss_svc_integrity), and per-message confidentiality 2564 (rpc_gss_svc_privacy) in the layer above the RPC-over-RDMA transport. 2565 The latter two services require significant computation and movement 2566 of data on each endpoint host. Some performance benefits enabled by 2567 RDMA transports can be lost. 2569 10.2.1. RPC-over-RDMA Protection at Lower Layers 2571 For any RPC transport, utilizing RPCSEC_GSS integrity or privacy 2572 services has performance implications. Protection below the RPC 2573 transport is often more appropriate in performance-sensitive 2574 deployments, especially if it, too, can be offloaded. Certain 2575 configurations of IPsec can be co-located in RDMA hardware, for 2576 example, without change to RDMA consumers and little loss of data 2577 movement efficiency. 
Such arrangements can also provide a higher 2578 degree of privacy by hiding endpoint identity or altering the 2579 frequency at which messages are exchanged, at a performance cost. 2581 The use of protection in a lower layer MAY be negotiated through the 2582 use of an RPCSEC_GSS security flavor defined in [RFC7861] in 2583 conjunction with the Channel Binding mechanism [RFC5056] and IPsec 2584 Channel Connection Latching [RFC5660]. Use of such mechanisms is 2585 REQUIRED where integrity or confidentiality is desired and where 2586 efficiency is required. 2588 10.2.2. RPCSEC_GSS on RPC-over-RDMA Transports 2590 Not all RDMA devices and fabrics support the above protection 2591 mechanisms. Also, per-message authentication is still required on 2592 NFS clients where multiple users access NFS files. In these cases, 2593 RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA 2594 connections. 2596 RPCSEC_GSS extends the ONC RPC protocol without changing the format 2597 of RPC messages. By observing the conventions described in this 2598 section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected 2599 RPC messages interoperably. 2601 As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that 2602 appear in the Payload stream of an RPC-over-RDMA message (such as 2603 control messages exchanged as part of establishing or destroying a 2604 security context or data items that are part of RPCSEC_GSS 2605 authentication material) MUST NOT be reduced. 2607 10.2.2.1. RPCSEC_GSS Context Negotiation 2609 Some NFS client implementations use a separate connection to 2610 establish a Generic Security Service (GSS) context for NFS operation. 2611 Such clients use TCP and the standard NFS port (2049) for context 2612 establishment. To enable the use of RPCSEC_GSS with NFS/RDMA, an NFS 2613 server MUST also provide a TCP-based NFS service on port 2049. 2615 10.2.2.2. 
RPC-over-RDMA with RPCSEC_GSS Authentication 2617 The RPCSEC_GSS authentication service has no impact on the DDP- 2618 eligibility of data items in a ULP. 2620 However, RPCSEC_GSS authentication material appearing in an RPC 2621 message header can be larger than, say, an AUTH_SYS authenticator. 2622 In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester 2623 needs to accommodate a larger RPC credential when marshaling RPC Call 2624 messages and needs to provide for a maximum size RPCSEC_GSS verifier 2625 when allocating reply buffers and Reply chunks. 2627 RPC messages, and thus Payload streams, are made larger as a result. 2628 ULP operations that fit in a Short Message when a simpler form of 2629 authentication is in use might need to be reduced or conveyed via a 2630 Long Message when RPCSEC_GSS authentication is in use. It is more 2631 likely that a Requester provides both a Read list and a Reply chunk 2632 in the same RPC-over-RDMA Transport header to convey a Long Call and 2633 provision a receptacle for a Long Reply. 2635 In addition to this cost, the XDR encoding and decoding of each RPC 2636 message using RPCSEC_GSS authentication requires host compute 2637 resources to construct the GSS verifier. 2639 10.2.2.3. RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy 2641 The RPCSEC_GSS integrity service enables endpoints to detect 2642 modification of RPC messages in flight. The RPCSEC_GSS privacy 2643 service prevents all but the intended recipient from viewing the 2644 cleartext content of RPC arguments and results. RPCSEC_GSS integrity 2645 and privacy services are end-to-end. They protect RPC arguments and 2646 results from application to server endpoint, and back. 2648 The RPCSEC_GSS integrity and encryption services operate on whole RPC 2649 messages after they have been XDR encoded for transmit, and before 2650 they have been XDR decoded after receipt. 
Both sender and receiver 2651 endpoints use intermediate buffers to prevent exposure of encrypted 2652 data or unverified cleartext data to RPC consumers. After 2653 verification, encryption, and message wrapping have been performed, 2654 the transport layer MAY use RDMA data transfer between these 2655 intermediate buffers. 2657 The process of reducing a DDP-eligible data item removes the data 2658 item and its XDR padding from the encoded Payload stream. XDR 2659 padding of a reduced data item is not transferred in a normal RPC- 2660 over-RDMA message. After reduction, the Payload stream contains 2661 fewer octets than the whole XDR stream did beforehand. XDR padding 2662 octets are often zero bytes, but they don't have to be. Thus, 2663 reducing DDP-eligible items affects the result of message integrity 2664 verification or encryption. 2666 Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS 2667 integrity or encryption services are in use. Effectively, no data 2668 item is DDP-eligible in this situation, and Chunked Messages cannot 2669 be used. In this mode, an RPC-over-RDMA transport operates in the 2670 same manner as a transport that does not support DDP. 2672 When an RPCSEC_GSS integrity or privacy service is in use, a 2673 Requester provides both a Read list and a Reply chunk in the same 2674 RPC-over-RDMA header to convey a Long Call and provision a receptacle 2675 for a Long Reply. 2677 10.2.2.4. Protecting RPC-over-RDMA Transport Headers 2679 Like the base fields in an ONC RPC message (XID, call direction, and 2680 so on), the contents of an RPC-over-RDMA message's Transport stream 2681 are not protected by RPCSEC_GSS. This exposes XIDs, connection 2682 credit limits, and chunk lists (but not the content of the data items 2683 they refer to) to malicious behavior, which could redirect data that 2684 is transferred by the RPC-over-RDMA message, result in spurious 2685 retransmits, or trigger connection loss.
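As an illustration of this exposure, the following Python sketch decodes the four fixed fields of struct rpcrdma_common (Section 7.3) from the first 16 octets of a transport header. It assumes only that each uint32 is XDR-encoded as four big-endian octets [RFC4506]. The names RpcRdmaCommon and decode_common are hypothetical and are not part of this specification. No key material is involved, so an on-path observer could perform the same decoding.

```python
import struct
from typing import NamedTuple


class RpcRdmaCommon(NamedTuple):
    """Hypothetical in-memory view of struct rpcrdma_common."""
    rdma_xid: int
    rdma_vers: int
    rdma_credit: int
    rdma_htype: int


def decode_common(buf: bytes) -> RpcRdmaCommon:
    """Decode the four leading uint32 fields of a transport header.

    Assumption: XDR uint32 values are 4 octets, big-endian.
    """
    if len(buf) < 16:
        raise ValueError("transport header shorter than 16 octets")
    return RpcRdmaCommon(*struct.unpack(">IIII", buf[:16]))


if __name__ == "__main__":
    # Illustrative header: XID 0x1234, version 2, 32 credits, htype 0.
    wire = struct.pack(">IIII", 0x1234, 2, 32, 0)
    print(decode_common(wire))
```

Because the fields are plain XDR, the same few lines would let an attacker who can modify traffic rewrite the credit value or header type before re-sending the message.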
2687 In particular, if an attacker alters the information contained in the 2688 chunk lists of an RPC-over-RDMA Transport header, data contained in 2689 those chunks can be redirected to other registered memory regions on 2690 Requesters. An attacker might alter the arguments of RDMA Read and 2691 RDMA Write operations on the wire to similar effect. If such 2692 alterations occur, the use of RPCSEC_GSS integrity or privacy 2693 services enables a Requester to detect unexpected material in a 2694 received RPC message. 2696 Encryption at lower layers, as described in Section 10.2.1, protects 2697 the content of the Transport stream. To address attacks on RDMA 2698 protocols themselves, RDMA transport implementations should conform 2699 to [RFC5042]. 2701 10.3. Transport Properties 2703 Like other fields that appear in each RPC-over-RDMA header, property 2704 information is sent in the clear on the fabric with no integrity 2705 protection, making it vulnerable to man-in-the-middle attacks. 2707 For example, if a man-in-the-middle were to change the value of the 2708 Receive buffer size or the Requester Remote Invalidation boolean, it 2709 could reduce connection performance or trigger loss of connection. 2710 Repeated connection loss can impact performance or even prevent a new 2711 connection from being established. The recourse is to deploy on a 2712 private network or use link-layer encryption. 2714 10.4. Host Authentication 2716 This section uses the relevant sections of [RFC3552] to analyze the 2717 addition of host authentication to this RPC-over-RDMA transport. 2719 The authors refer readers to Appendix C of [RFC8446] for information 2720 on how to design and test a secure authentication handshake 2721 implementation. 2723 11. IANA Considerations 2725 The RPC-over-RDMA family of transports has been assigned RPC netids 2726 by [RFC8166].
A netid is an rpcbind [RFC1833] string used to 2727 identify the underlying protocol in order for RPC to select 2728 appropriate transport framing and the format of the service addresses 2729 and ports. 2731 The following netid registry strings are already defined for this 2732 purpose: 2734 NC_RDMA "rdma" 2735 NC_RDMA6 "rdma6" 2737 The "rdma" netid is to be used when IPv4 addressing is employed by 2738 the underlying transport, and "rdma6" when IPv6 addressing is 2739 employed. The netid assignment policy and registry are defined in 2740 [RFC5665]. The current document does not alter these netid 2741 assignments. 2743 These netids MAY be used for any RDMA network that satisfies the 2744 requirements of Section 3.2.2 and that is able to identify service 2745 endpoints using IP port addressing, possibly through use of a 2746 translation service as described in Section 8. 2748 12. References 2750 12.1. Normative References 2752 [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 2753 RFC 1833, DOI 10.17487/RFC1833, August 1995, 2754 <https://www.rfc-editor.org/info/rfc1833>. 2756 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2757 Requirement Levels", BCP 14, RFC 2119, 2758 DOI 10.17487/RFC2119, March 1997, 2759 <https://www.rfc-editor.org/info/rfc2119>. 2761 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 2762 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2763 2006, <https://www.rfc-editor.org/info/rfc4506>. 2765 [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement 2766 Protocol (DDP) / Remote Direct Memory Access Protocol 2767 (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2768 2007, <https://www.rfc-editor.org/info/rfc5042>. 2770 [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure 2771 Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007, 2772 <https://www.rfc-editor.org/info/rfc5056>. 2774 [RFC5280] Cooper, D., Santesson, S., Farrell, S., Boeyen, S., 2775 Housley, R., and W. Polk, "Internet X.509 Public Key 2776 Infrastructure Certificate and Certificate Revocation List 2777 (CRL) Profile", RFC 5280, DOI 10.17487/RFC5280, May 2008, 2778 <https://www.rfc-editor.org/info/rfc5280>.
2780 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 2781 Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, 2782 May 2009, <https://www.rfc-editor.org/info/rfc5531>. 2784 [RFC5660] Williams, N., "IPsec Channels: Connection Latching", 2785 RFC 5660, DOI 10.17487/RFC5660, October 2009, 2786 <https://www.rfc-editor.org/info/rfc5660>. 2788 [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call 2789 (RPC) Network Identifiers and Universal Address Formats", 2790 RFC 5665, DOI 10.17487/RFC5665, January 2010, 2791 <https://www.rfc-editor.org/info/rfc5665>. 2793 [RFC6125] Saint-Andre, P. and J. Hodges, "Representation and 2794 Verification of Domain-Based Application Service Identity 2795 within Internet Public Key Infrastructure Using X.509 2796 (PKIX) Certificates in the Context of Transport Layer 2797 Security (TLS)", RFC 6125, DOI 10.17487/RFC6125, March 2798 2011, <https://www.rfc-editor.org/info/rfc6125>. 2800 [RFC7861] Adamson, A. and N. Williams, "Remote Procedure Call (RPC) 2801 Security Version 3", RFC 7861, DOI 10.17487/RFC7861, 2802 November 2016, <https://www.rfc-editor.org/info/rfc7861>. 2804 [RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running 2805 Code: The Implementation Status Section", BCP 205, 2806 RFC 7942, DOI 10.17487/RFC7942, July 2016, 2807 <https://www.rfc-editor.org/info/rfc7942>. 2809 [RFC8166] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct 2810 Memory Access Transport for Remote Procedure Call Version 2811 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, 2812 <https://www.rfc-editor.org/info/rfc8166>. 2814 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2815 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2816 May 2017, <https://www.rfc-editor.org/info/rfc8174>. 2818 [RFC8267] Lever, C., "Network File System (NFS) Upper-Layer Binding 2819 to RPC-over-RDMA Version 1", RFC 8267, 2820 DOI 10.17487/RFC8267, October 2017, 2821 <https://www.rfc-editor.org/info/rfc8267>. 2823 [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol 2824 Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, 2825 <https://www.rfc-editor.org/info/rfc8446>. 2827 12.2. Informative References 2829 [CBFC] Kung, H., Blackwell, T., and A.
Chapman, "Credit-Based 2830 Flow Control for ATM Networks: Credit Update Protocol, 2831 Adaptive Credit Allocation, and Statistical Multiplexing", 2832 Proc. ACM SIGCOMM '94 Symposium on Communications 2833 Architectures, Protocols and Applications, pp. 101-114, 2834 August 1994. 2836 [IBA] InfiniBand Trade Association, "InfiniBand Architecture 2837 Specification Volume 1", Release 1.3, March 2015. 2839 Available from https://www.infinibandta.org/ 2841 [RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, 2842 DOI 10.17487/RFC0768, August 1980, 2843 <https://www.rfc-editor.org/info/rfc768>. 2845 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2846 RFC 793, DOI 10.17487/RFC0793, September 1981, 2847 <https://www.rfc-editor.org/info/rfc793>. 2849 [RFC1094] Nowicki, B., "NFS: Network File System Protocol 2850 specification", RFC 1094, DOI 10.17487/RFC1094, March 2851 1989, <https://www.rfc-editor.org/info/rfc1094>. 2853 [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS 2854 Version 3 Protocol Specification", RFC 1813, 2855 DOI 10.17487/RFC1813, June 1995, 2856 <https://www.rfc-editor.org/info/rfc1813>. 2858 [RFC3552] Rescorla, E. and B. Korver, "Guidelines for Writing RFC 2859 Text on Security Considerations", BCP 72, RFC 3552, 2860 DOI 10.17487/RFC3552, July 2003, 2861 <https://www.rfc-editor.org/info/rfc3552>. 2863 [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 2864 Garcia, "A Remote Direct Memory Access Protocol 2865 Specification", RFC 5040, DOI 10.17487/RFC5040, October 2866 2007, <https://www.rfc-editor.org/info/rfc5040>. 2868 [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct 2869 Data Placement over Reliable Transports", RFC 5041, 2870 DOI 10.17487/RFC5041, October 2007, 2871 <https://www.rfc-editor.org/info/rfc5041>. 2873 [RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS) 2874 Remote Direct Memory Access (RDMA) Problem Statement", 2875 RFC 5532, DOI 10.17487/RFC5532, May 2009, 2876 <https://www.rfc-editor.org/info/rfc5532>. 2878 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 2879 "Network File System (NFS) Version 4 Minor Version 1 2880 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 2881 <https://www.rfc-editor.org/info/rfc5661>. 2883 [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D.
Noveck, Ed., 2884 "Network File System (NFS) Version 4 Minor Version 1 2885 External Data Representation Standard (XDR) Description", 2886 RFC 5662, DOI 10.17487/RFC5662, January 2010. 2889 [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System 2890 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, 2891 March 2015. 2893 [RFC8167] Lever, C., "Bidirectional Remote Procedure Call on RPC- 2894 over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167, 2895 June 2017. 2897 [RFC8178] Noveck, D., "Rules for NFSv4 Extensions and Minor 2898 Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017. 2901 Appendix A. ULB Specifications 2903 An Upper-Layer Protocol (ULP) is typically defined independently of 2904 any particular RPC transport. An Upper-Layer Binding (ULB) 2905 specification provides guidance that helps the ULP interoperate 2906 correctly and efficiently over a particular transport. For RPC-over- 2907 RDMA version 2, a ULB may provide: 2909 o A taxonomy of XDR data items that are eligible for DDP 2911 o Constraints on which upper-layer procedures may be reduced and on 2912 how many chunks may appear in a single RPC request 2914 o A method for determining the maximum size of the reply Payload 2915 stream for all procedures in the ULP 2917 o An rpcbind port assignment for operation of the RPC Program and 2918 Version on an RPC-over-RDMA transport 2920 Each RPC Program and Version tuple that utilizes RPC-over-RDMA 2921 version 2 needs to have a ULB specification. 2923 A.1. DDP-Eligibility 2925 A ULB designates some XDR data items as eligible for DDP. As an 2926 RPC-over-RDMA message is formed, DDP-eligible data items can be 2927 removed from the Payload stream and placed directly in the receiver's 2928 memory. An XDR data item should be considered for DDP-eligibility if 2929 there is a clear benefit to moving the contents of the item directly 2930 from the sender's memory to the receiver's memory.
2932 Criteria for DDP-eligibility include: 2934 o The XDR data item is frequently sent or received, and its size is 2935 often much larger than typical inline thresholds. 2937 o If the XDR data item is a result, its maximum size must be 2938 predictable in advance by the requester. 2940 o Transport-level processing of the XDR data item is not needed. 2941 For example, the data item is an opaque byte array, which requires 2942 no XDR encoding and decoding of its content. 2944 o The content of the XDR data item is sensitive to address 2945 alignment. For example, a data copy operation would be required 2946 on the receiver to enable the message to be parsed correctly, or 2947 to enable the data item to be accessed. 2949 o The XDR data item does not contain DDP-eligible data items. 2951 In addition to defining the set of data items that are DDP-eligible, 2952 a ULB may also limit the use of chunks to particular upper-layer 2953 procedures. If more than one data item in a procedure is DDP- 2954 eligible, the ULB may also limit the number of chunks that a 2955 requester can provide for a particular upper-layer procedure. 2957 Senders MUST NOT reduce data items that are not DDP-eligible. Such 2958 data items MAY, however, be moved as part of a Position Zero Read 2959 chunk or a Reply chunk. 2961 The programming interface by which an upper-layer implementation 2962 indicates the DDP-eligibility of a data item to the RPC transport is 2963 not described by this specification. The only requirements are that 2964 the receiver can re-assemble the transmitted RPC-over-RDMA message 2965 into a valid XDR stream, and that DDP-eligibility rules specified by 2966 the ULB are respected. 2968 There is no provision to express DDP-eligibility within the XDR 2969 language. The only definitive specification of DDP-eligibility is a 2970 ULB. 2972 In general, a DDP-eligibility violation occurs when: 2974 o A requester reduces a non-DDP-eligible argument data item. 
The 2975 responder MUST NOT process this RPC Call message and MUST report 2976 the violation as described in Section 6.4.3. 2978 o A responder reduces a non-DDP-eligible result data item. The 2979 requester MUST terminate the pending RPC transaction and report an 2980 appropriate permanent error to the RPC consumer. 2982 o A responder does not reduce a DDP-eligible result data item into 2983 an available Write chunk. The requester MUST terminate the 2984 pending RPC transaction and report an appropriate permanent error 2985 to the RPC consumer. 2987 A.2. Maximum Reply Size 2989 When expecting small and moderately-sized Replies, a requester should 2990 typically rely on Message Continuation rather than provisioning a 2991 Reply chunk. For each ULP procedure where there is no clear Reply 2992 size maximum and the maximum can be large, the ULB should specify a 2993 dependable means for determining the maximum Reply size. 2995 A.3. Additional Considerations 2997 There may be other details provided in a ULB. 2999 o A ULB may recommend inline threshold values or other transport- 3000 related parameters for RPC-over-RDMA version 2 connections bearing 3001 that ULP. 3003 o A ULP may provide a means to communicate these transport-related 3004 parameters between peers. Note that RPC-over-RDMA version 2 does 3005 not specify any mechanism for changing any transport-related 3006 parameter after a connection has been established and the initial 3007 transport properties have been exchanged. 3009 o Multiple ULPs may share a single RPC-over-RDMA version 2 3010 connection when their ULBs allow the use of RPC-over-RDMA version 3011 2 and the rpcbind port assignments for the Protocols allow 3012 connection sharing. In this case, the same transport parameters 3013 (such as inline threshold) apply to all Protocols using that 3014 connection. 3016 Each ULB needs to be designed to allow correct interoperation without 3017 regard to the transport parameters actually in use.
Furthermore, 3018 implementations of ULPs must be designed to interoperate correctly 3019 regardless of the connection parameters in effect on a connection. 3021 A.4. ULP Extensions 3023 An RPC Program and Version tuple may be extensible. For instance, 3024 there may be a minor versioning scheme that is not reflected in the 3025 RPC version number, or the ULP may allow additional features to be 3026 specified after the original RPC Program specification was ratified. 3027 ULBs are provided for interoperable RPC Programs and Versions by 3028 extending existing ULBs to reflect the changes made necessary by each 3029 addition to the existing XDR. 3031 Appendix B. Extending the Version 2 Protocol 3033 This Appendix is not addressed to protocol implementers, but rather 3034 to authors of documents that intend to extend the protocol described 3035 earlier in this document. 3037 Subsequent RPC-over-RDMA versions are free to change the protocol in 3038 any way they choose as long as they leave unchanged those fields 3039 identified as "fixed for all versions" in Section 4.2.1 of [RFC8166]. 3041 Such changes might involve deletion or major re-organization of 3042 existing transport headers. However, the need for interoperability 3043 between adjacent versions will often limit the scope of changes that 3044 can be made in a single version. 3046 In some cases it may prove desirable to transition to a new version 3047 by using the extension features described for use with RPC-over-RDMA 3048 version 2, by continuing the same basic extension model but allowing 3049 header types and properties that were OPTIONAL in one version to 3050 become REQUIRED in the subsequent version. 3052 RPC-over-RDMA version 2 is designed to be extensible in a way that 3053 enables the addition of OPTIONAL features that may subsequently be 3054 converted to REQUIRED status in a future protocol version. 
The 3055 protocol may be extended by Standards Track documents in a way 3056 analogous to that provided for Network File System Version 4 as 3057 described in [RFC8178]. 3059 This form of extensibility enables limited extensions to the base 3060 RPC-over-RDMA version 2 protocol presented in this document so that 3061 new optional capabilities can be introduced without a protocol 3062 version change, while maintaining robust interoperability with 3063 existing RPC-over-RDMA version 2 implementations. The design allows 3064 extensions to be defined, including the definition of new protocol 3065 elements, without requiring modification or recompilation of the 3066 existing XDR. 3068 A Standards Track document introduces each set of such protocol 3069 elements. Together these elements are considered an OPTIONAL 3070 feature. Each implementation is either aware of all the protocol 3071 elements introduced by that feature or is aware of none of them. 3073 Documents describing extensions to RPC-over-RDMA version 2 should 3074 contain: 3076 o An explanation of the purpose and use of each new protocol element 3077 added. 3079 o An XDR description including all of the new protocol elements, and 3080 a script to extract it. 3082 o A description of interactions with existing extensions. 3084 This includes possible requirements of other OPTIONAL features to 3085 be present for new protocol elements to work, or that a particular 3086 level of support for an OPTIONAL facility is required for the new 3087 extension to work. 3089 Implementers combine the XDR descriptions of the new features they 3090 intend to use with the XDR description of the base protocol in this 3091 document. This may be necessary to create a valid XDR input file 3092 because extensions are free to use XDR types defined in the base 3093 protocol, and later extensions may use types defined by earlier 3094 extensions. 
3096 The XDR description for the RPC-over-RDMA version 2 base protocol 3097 combined with that for any selected extensions should provide an 3098 adequate human-readable description of the extended protocol. 3100 The base protocol specified in this document may be extended within 3101 RPC-over-RDMA version 2 in two ways: 3103 o New OPTIONAL transport header types may be introduced by later 3104 Standards Track documents. Such transport header types will be 3105 documented as described in Appendix B.1. 3107 o New OPTIONAL transport properties may be defined in later 3108 Standards Track documents. Such transport properties will be 3109 documented as described in Appendix B.3. 3111 The following sorts of ancillary protocol elements may be added to 3112 the protocol to support the addition of new transport properties and 3113 header types. 3115 o New error codes may be created as described in Appendix B.4. 3117 o New flags to use within the rdma_flags field may be created as 3118 described in Appendix B.2. 3120 New capabilities can be proposed and developed independently of each 3121 other, and implementers can choose among them. This makes it 3122 straightforward to create and document experimental features and then 3123 bring them through the standards process. 3125 B.1. Adding New Header Types to RPC-over-RDMA Version 2 3127 New transport header types are to be defined in a manner similar to the 3128 way existing ones are described in Sections 6.4.1 through 6.4.4. 3129 Specifically, what is needed is: 3131 o A description of the function and use of the new header type. 3133 o A complete XDR description of the new header type including a 3134 description of the use of all fields within the header. 3136 o A description of how errors are reported, including the definition 3137 of a mechanism for reporting errors when the error is outside the 3138 choices already available in the base protocol or in 3139 other existing extensions.
3141 o An indication of whether a Payload stream must be present, and a 3142 description of its contents and how such Payload streams are used 3143 to construct RPC messages for processing. 3145 In addition, further documentation is needed because of the 3146 OPTIONAL status of new transport header types. 3148 o Information about constraints on support for the new header types 3149 should be provided. For example, if support for one header type 3150 is implied or foreclosed by another one, this needs to be 3151 documented. 3153 o A preferred method by which a sender should determine whether the 3154 peer supports a particular header type needs to be provided. 3155 While it is always possible for a sender to send a test invocation of a 3156 particular header type to see if support is available, when more 3157 efficient means are available (e.g., the value of a transport 3158 property), this should be noted. 3160 B.2. Adding New Header Flags to the Protocol 3162 New flag bits are to be defined in a manner similar to the way existing 3163 ones are described in Sections 6.3.2.1 and 6.3.2.2. Each new flag 3164 definition should include: 3166 o An XDR description of the new flag. 3168 o A description of the function and use of the new flag. 3170 o An indication of the header types for which the flag value is meaningful 3171 and of those for which it is an error to set the flag or to 3172 leave it unset. 3174 o A means to determine whether receivers are prepared to receive 3175 transport headers with the new flag set. 3177 In addition, further documentation is needed because of the 3178 OPTIONAL status of new header flags. 3180 o Information about constraints on support for the new flags should 3181 be provided. For example, if support for one flag is implied or 3182 foreclosed by another one, this needs to be documented. 3184 B.3.
Adding New Transport properties to the Protocol 3186 The set of transport properties is designed to be extensible. As a 3187 result, once new properties are defined in standards track documents, 3188 the operations defined in this document may reference these new 3189 transport properties, as well as the ones described in this document. 3191 A standards track document defining a new transport property should 3192 include the following information paralleling that provided in this 3193 document for the transport properties defined herein. 3195 o The rpcrdma2_propid value used to identify this property. 3197 o The XDR typedef specifying the form in which the property value is 3198 communicated. 3200 o A description of the transport property that is communicated by 3201 the sender of RDMA2_CONNPROP. 3203 o An explanation of how this knowledge could be used by the peer 3204 receiving this information. 3206 The definition of transport property structures is such as to make it 3207 easy to assign unique values. There is no requirement that a 3208 continuous set of values be used and implementations should not rely 3209 on all such values being small integers. A unique value should be 3210 selected when the defining document is first published as an internet 3211 draft. When the document becomes a standards track document, the 3212 working group should ensure that: 3214 o rpcrdma2_propid values specified in the document do not conflict 3215 with those currently assigned or in use by other pending working 3216 group documents defining transport properties. 3218 o rpcrdma2_propid values specified in the document do not conflict 3219 with the range reserved for experimental use, as defined in 3220 Section 8.2. 3222 Documents defining new properties fall into a number of categories. 3224 o Those defining new properties and explaining (only) how they 3225 affect use of existing message types. 
3227 o Those defining new OPTIONAL message types and new properties 3228 applicable to the operation of those new message types. 3230 o Those defining new OPTIONAL message types and new properties 3231 applicable both to new and existing message types. 3233 When additional transport properties are proposed, the review of the 3234 associated standards track document should deal with possible 3235 security issues raised by those new transport properties. 3237 B.4. Adding New Error Codes to the Protocol 3239 New error codes to be returned when using new header types may be 3240 introduced in the same Standards Track document that defines the new 3241 header type. Cases in which a new error code is to be returned by an 3242 existing header type can be accommodated by defining the new error 3243 code in the same Standards Track document that defines the new 3244 transport property. 3246 For error codes that do not require that additional error information 3247 be returned with them, the existing RDMA_ERR2 header can be used to 3248 report the new error. The new error code is set as the value of 3249 rdma_err with the result that the default switch arm of the 3250 rpcrdma2_error (i.e. void) is selected. 3252 For error codes that do require the return of additional error- 3253 related information together with the error, a new header type should 3254 be defined for the purpose of returning the error together with 3255 needed additional information. It should be documented just like any 3256 other new header type. 3258 When a new header type is sent, the sender needs to be prepared to 3259 accept header types necessary to report associated errors. 3261 Appendix C. Differences from the RPC-over-RDMA Version 1 Protocol 3263 This section describes the substantive changes made in RPC-over-RDMA 3264 version 2. 3266 C.1. 
Relationship to the RPC-over-RDMA Version 1 XDR Definition 3268 There are a number of structural XDR changes whose goal is to enable 3269 within-version protocol extensibility. 3271 The RPC-over-RDMA version 1 transport header is defined as a single 3272 XDR object, with an RPC message proper potentially following it. In 3273 RPC-over-RDMA version 2, as described in Section 6.1, there are 3274 separate XDR definitions of the transport header prefix (see 3275 Section 6.3.2), which specifies the transport header type to be used, 3276 and of the specific transport header (defined within one of the 3277 subsections of Section 6). This is similar to the way that an RPC 3278 message consists of an RPC header (defined in [RFC5531]) and an RPC 3279 request or reply, defined by the Upper-Layer Protocol being conveyed. 3281 As a new version of the RPC-over-RDMA transport protocol, RPC-over- 3282 RDMA version 2 exists within the versioning rules defined in 3283 [RFC8166]. In particular, it maintains the first four words of the 3284 protocol header as sent and received, as specified in Section 4.2 of 3285 [RFC8166], even though, as explained in Section 6.3.1 of this 3286 document, the XDR definition of those words is structured 3287 differently. 3289 Although each of the first four words retains its semantic function, 3290 there are important differences of field interpretation, besides the 3291 fact that the words have different names and different roles within the 3292 XDR construct of which they are parts. 3294 o The first word of the header, previously the rdma_xid field, 3295 retains the format and function that it had in RPC-over-RDMA 3296 version 1. Within RPC-over-RDMA version 2, this word is the 3297 rdma_xid field of the structure rdma_start.
However, to 3298 accommodate the use of request-response pairing of non-RPC 3299 messages and the potential use of message continuation, it cannot 3300 be assumed that it will always have the same value it would have 3301 had in RPC-over-RDMA version 1. As a result, the contents of this 3302 field should not be used without consideration of the associated 3303 protocol version identification. 3305 o The second word of the header, previously the rdma_vers field, 3306 retains the format and function that it had in RPC-over-RDMA 3307 version 1. Within RPC-over-RDMA version 2, this word is the 3308 rdma_vers field of the structure rdma_start. To clearly 3309 distinguish version 1 and version 2 messages, senders MUST fill in 3310 the correct version (fixed after version negotiation) and 3311 receivers MUST check that the content of the rdma_vers field is correct 3312 before referencing any other header field. 3314 o The third word of the header, previously the rdma_credit field, 3315 retains the size and general purpose that it had in RPC-over-RDMA 3316 version 1. Within RPC-over-RDMA version 2, this word is the 3317 rdma_credit field of the structure rdma_start. 3319 o The fourth word of the header, previously the union discriminator 3320 field rdma_proc, retains its format and general function even 3321 though the set of valid values has changed. The value of this 3322 field is now considered an unsigned 32-bit integer rather than an 3323 enum. Within RPC-over-RDMA version 2, this word is the rdma_htype 3324 field of the structure rdma_start. 3326 Beyond conforming to the restrictions specified in [RFC8166], RPC- 3327 over-RDMA version 2 tightly limits the scope of the changes made in 3328 order to ensure interoperability. It makes no major structural 3329 changes to the protocol, and all existing transport header types used 3330 in version 1 (as defined in [RFC8166]) are retained in version 2.
3331 Chunks are expressed using the same on-the-wire format and are used 3332 in the same way in both versions. 3334 C.2. Transport Properties 3336 RPC-over-RDMA version 2 provides a mechanism for exchanging the 3337 transport's operational properties. This mechanism allows connection 3338 endpoints to communicate the properties of their implementation at 3339 connection setup. The mechanism could be expanded to enable an 3340 endpoint to request changes in properties of the other endpoint and 3341 to notify peer endpoints of changes to properties that occur during 3342 operation. Transport properties are described in Section 5. 3344 C.3. Credit Management Changes 3346 RPC-over-RDMA transports employ credit-based flow control to ensure 3347 that a requester does not emit more RDMA Sends than the responder is 3348 prepared to receive. Section 3.3.1 of [RFC8166] explains the purpose 3349 and operation of RPC-over-RDMA version 1 credit management in detail. 3351 In the RPC-over-RDMA version 1 design, each RDMA Send from a 3352 requester contains an RPC Call with a credit request, and each RDMA 3353 Send from a responder contains an RPC Reply with a credit grant. The 3354 credit grant implies that enough Receives have been posted on the 3355 responder to handle the credit grant minus the number of pending RPC 3356 transactions (the number of remaining Receive buffers might be zero). 3358 In other words, each RPC Reply acts as an implicit ACK for a previous 3359 RPC Call from the requester, indicating that the responder has posted 3360 a Receive to replace the Receive consumed by the requester's RDMA 3361 Send. Without an RPC Reply message, the requester has no way to know 3362 that the responder is properly prepared for subsequent RPC Calls. 
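The version 1 arrangement described above can be sketched from the requester's point of view. This is an illustration only, not part of the protocol definition: the class name V1RequesterCredits, its method names, and the starting grant of one credit are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class V1RequesterCredits:
    """Requester-side model of version 1 credit accounting (hypothetical)."""

    granted: int = 1      # assumed starting value until a Reply carries a grant
    outstanding: int = 0  # RPC Calls sent for which no RPC Reply has arrived

    def can_send(self) -> bool:
        # Every unanswered Call is assumed to still occupy a posted Receive.
        return self.outstanding < self.granted

    def send_call(self) -> None:
        if not self.can_send():
            raise RuntimeError("no credits available")
        self.outstanding += 1

    def receive_reply(self, credit_grant: int) -> None:
        # The Reply is the implicit ACK: the responder has replaced the
        # Receive consumed by the matching Call and states its current grant.
        self.outstanding -= 1
        self.granted = credit_grant


credits = V1RequesterCredits(granted=2)
credits.send_call()
credits.send_call()
blocked = not credits.can_send()       # both credits are in use
credits.receive_reply(credit_grant=2)
unblocked = credits.can_send()         # the Reply released one credit
```

In this model the only event that releases a credit is receive_reply(), so a Call that never receives a Reply holds its credit for the life of the connection.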
3364 Aside from being a bit of a layering violation, there are basic (but 3365 rare) cases where this arrangement is inadequate: 3367 o When a requester retransmits an RPC Call on the same connection as 3368 an earlier RPC Call for the same transaction. 3370 o When a requester transmits an RPC operation that requires no 3371 reply. 3373 o When more than one RPC-over-RDMA message is needed to complete the 3374 transaction (e.g., RDMA_DONE). 3376 Typically, the connection must be replaced in these cases. This 3377 resets the credit accounting mechanism but has an undesirable impact 3378 on other ongoing RPC transactions on that connection. 3380 Because credit management accompanies each RPC message, there is a 3381 strict one-to-one ratio between RDMA Send and RPC message. There are 3382 interesting use cases that might be enabled if this relationship were 3383 more flexible: 3385 o RPC-over-RDMA operations which do not carry an RPC message; e.g., 3386 control plane operations. 3388 o A single RDMA Send that conveys more than one RPC message for the 3389 purpose of interrupt mitigation. 3391 o An RPC message that is conveyed via several sequential RDMA Sends 3392 to reduce the use of explicit RDMA operations for moderate-sized 3393 RPC messages. 3395 o An RPC transaction that needs multiple exchanges or an odd number 3396 of RPC-over-RDMA operations to complete. 3398 Bi-directional RPC operation also introduces an ambiguity. If the 3399 RPC-over-RDMA message does not carry an RPC message, then it is not 3400 possible to determine whether the sender is a requester or a 3401 responder, and thus whether the rdma_credit field contains a credit 3402 request or a credit grant. 3404 A more sophisticated credit accounting mechanism is provided in RPC- 3405 over-RDMA version 2 in an attempt to address some of these 3406 shortcomings. This new mechanism is detailed in Section 4.3.1. 3408 C.4. 
Inline Threshold Changes 3410 The term "inline threshold" is defined in Section 3.3.2 of [RFC8166]. 3411 An "inline threshold" value is the largest message size (in octets) 3412 that can be conveyed on an RDMA connection using only RDMA Send and 3413 Receive. Each connection has two inline threshold values: one for 3414 messages flowing from client-to-server (referred to as the "client- 3415 to-server inline threshold") and one for messages flowing from 3416 server-to-client (referred to as the "server-to-client inline 3417 threshold"). Note that [RFC8166] uses somewhat different 3418 terminology. This is because it was written with only forward- 3419 direction RPC transactions in mind. 3421 A connection's inline thresholds determine when RDMA Read or Write 3422 operations are required because the RPC message to be sent cannot be 3423 conveyed via a single RDMA Send and Receive pair. When an RPC 3424 message does not contain DDP-eligible data items, a requester can 3425 prepare a Long Call or Reply to convey the whole RPC message using 3426 RDMA Read or Write operations. 3428 RDMA Read and Write operations require that each data payload resides 3429 in a region of memory that is registered with the RNIC. When an RPC 3430 is complete, that region is invalidated, fencing it from the 3431 responder. Memory registration and invalidation typically have a 3432 latency cost that is insignificant compared to data handling costs. 3433 When a data payload is small, however, the cost of registering and 3434 invalidating the memory where the payload resides becomes a 3435 relatively significant part of total RPC latency. Therefore the most 3436 efficient operation of RPC-over-RDMA occurs when explicit RDMA Read 3437 and Write operations are used for large payloads, and are avoided for 3438 small payloads. 3440 When RPC-over-RDMA version 1 was conceived, the typical size of RPC 3441 messages that did not involve a significant data payload was under 3442 500 bytes. 
A 1024-byte inline threshold adequately minimized the 3443 frequency of inefficient Long messages. 3445 With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND 3446 operations resulted in RPC messages that are on average larger and 3447 more complex than those of previous versions of NFS. With 1024-byte inline 3448 thresholds, RDMA Read or Write operations are needed for frequent 3449 operations that do not bear a data payload, such as GETATTR and 3450 LOOKUP, reducing the efficiency of the transport. 3452 To reduce the need to use Long messages, RPC-over-RDMA version 2 3453 increases the default size of inline thresholds. This also increases 3454 the maximum size of reverse-direction RPC messages. 3456 C.5. Message Continuation Changes 3458 In addition to a larger default inline threshold, RPC-over-RDMA 3459 version 2 introduces Message Continuation. Message Continuation is a 3460 mechanism that enables the transmission of a data payload using more 3461 than one RDMA Send. The purpose of Message Continuation is to 3462 provide relief in several important cases: 3464 o If a requester finds that it is inefficient to convey a 3465 moderately-sized data payload using Read chunks, the requester can 3466 use Message Continuation to send the RPC Call. 3468 o If a requester has provided insufficient Reply chunk space for a 3469 responder to send an RPC Reply, the responder can use Message 3470 Continuation to send the RPC Reply. 3472 o If a sender has to convey a large non-RPC data payload (e.g., a 3473 large transport property), the sender can use Message Continuation 3474 to avoid using registered memory. 3476 C.6. Host Authentication Changes 3478 For general operation of NFS on open networks, we eventually intend 3479 to rely on RPC-on-TLS [citation needed] to provide cryptographic 3480 authentication of the two ends of each connection.
In turn, this 3481 will improve the trustworthiness of AUTH_SYS-style user identities 3482 that flow on TCP, which are not cryptographic. We do not have a 3483 similar solution for RPC-over-RDMA, however. 3485 Here, the RDMA transport layer already provides a strong guarantee of 3486 message integrity. On some network fabrics, IPsec can be used to 3487 protect the privacy of in-transit data, or TLS itself could be used 3488 for transporting raw RDMA operations. However, this is not the case 3489 for all fabrics (e.g., InfiniBand [IBA]). 3491 Thus, it is sensible to add a mechanism in the RPC-over-RDMA 3492 transport itself for authenticating the connection peers. This 3493 mechanism is described in Section 5.2.6. And like GSS channel 3494 binding, there should also be a way to determine when the use of host 3495 authentication is superfluous and can be avoided. 3497 C.7. Support for Remote Invalidation 3499 An STag that is registered using the FRWR mechanism in a privileged 3500 execution context or is registered via a Memory Window in an 3501 unprivileged context may be invalidated remotely [RFC5040]. These 3502 mechanisms are available when a requester's RNIC supports 3503 MEM_MGT_EXTENSIONS. 3505 For the purposes of this discussion, there are two classes of STags. 3506 Dynamically-registered STags are used in a single RPC, then 3507 invalidated. Persistently-registered STags live longer than one RPC. 3508 They may persist for the life of an RPC-over-RDMA connection, or 3509 longer. 3511 An RPC-over-RDMA requester may provide more than one STag in one 3512 transport header. It may provide a combination of dynamically- and 3513 persistently-registered STags in one RPC message, or any combination 3514 of these in a series of RPCs on the same connection. Only 3515 dynamically-registered STags using Memory Windows or FRWR (i.e., 3516 registered via MEM_MGT_EXTENSIONS) may be invalidated remotely. 
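The eligibility rule stated above can be written as a requester-side predicate. This is a minimal sketch under stated assumptions: the type names Registration, Lifetime, and STag and the function may_invalidate_remotely are hypothetical and do not appear in this specification or in any RDMA verbs interface.

```python
from dataclasses import dataclass
from enum import Enum


class Registration(Enum):
    FRWR = "frwr"
    MEMORY_WINDOW = "memory_window"
    OTHER = "other"  # any mechanism outside MEM_MGT_EXTENSIONS (assumed label)


class Lifetime(Enum):
    DYNAMIC = "dynamic"        # used in a single RPC, then invalidated
    PERSISTENT = "persistent"  # lives longer than one RPC


@dataclass(frozen=True)
class STag:
    handle: int
    registration: Registration
    lifetime: Lifetime


def may_invalidate_remotely(stag: STag) -> bool:
    # Only dynamically-registered STags that were registered with FRWR or a
    # Memory Window are candidates for remote invalidation.
    uses_mem_mgt_ext = stag.registration in (
        Registration.FRWR,
        Registration.MEMORY_WINDOW,
    )
    return uses_mem_mgt_ext and stag.lifetime is Lifetime.DYNAMIC


stags = [
    STag(0x10, Registration.FRWR, Lifetime.DYNAMIC),
    STag(0x20, Registration.FRWR, Lifetime.PERSISTENT),
    STag(0x30, Registration.MEMORY_WINDOW, Lifetime.DYNAMIC),
    STag(0x40, Registration.OTHER, Lifetime.DYNAMIC),
]
eligible = [s.handle for s in stags if may_invalidate_remotely(s)]
```

Only the requester holds the registration details needed to evaluate this predicate, so the check is modeled on the requester side.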
3518 There is no transport-level mechanism by which a responder can 3519 determine how a requester-provided STag was registered, nor whether 3520 it is eligible to be invalidated remotely. A requester that mixes 3521 persistently- and dynamically-registered STags in one RPC, or mixes 3522 them across RPCs on the same connection, must therefore indicate 3523 which handles may be invalidated via a mechanism provided in the 3524 Upper-Layer Protocol. RPC-over-RDMA version 2 provides such a 3525 mechanism. 3527 The RDMA Send With Invalidate operation is used to invalidate an STag 3528 on a remote system. It is available only when a responder's RNIC 3529 supports MEM_MGT_EXTENSIONS, and must be used only when a 3530 requester's RNIC supports MEM_MGT_EXTENSIONS (that is, can receive and 3531 recognize an IETH). 3533 Existing RPC-over-RDMA transport protocol specifications [RFC8166] 3534 [RFC8167] do not forbid direct data placement in the reverse 3535 direction, even though there is currently no Upper-Layer Protocol 3536 that makes data items in reverse direction operations eligible for 3537 direct data placement. 3539 When chunks are present in a reverse direction RPC request, Remote 3540 Invalidation allows the responder to trigger invalidation of a 3541 requester's STags as part of sending a reply, the same way as is done 3542 in the forward direction. 3544 However, in the reverse direction, the server acts as the requester, 3545 and the client is the responder. The server's RNIC, therefore, must 3546 support receiving an IETH, and the server must have registered the 3547 STags with an appropriate registration mechanism. 3549 C.8. Error Reporting Changes 3551 RPC-over-RDMA version 2 expands the repertoire of errors that may be 3552 reported by connection endpoints. This change, which is structured 3553 to enable extensibility, allows a peer to report overruns of specific 3554 resources and to avoid requester retries when an error is permanent.
3556 Acknowledgments 3558 The authors gratefully acknowledge the work of Brent Callaghan and 3559 Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC 3560 5666). The authors also wish to thank Bill Baker, Greg Marsden, and 3561 Matt Benjamin for their support of this work. 3563 The XDR extraction conventions were first described by the authors of 3564 the NFS version 4.1 XDR specification [RFC5662]. Herbert van den 3565 Bergh suggested the replacement sed script used in this document. 3567 Special thanks go to Transport Area Director Magnus Westerlund, NFSV4 3568 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 3569 Working Group Secretary Thomas Haynes for their support. 3571 Authors' Addresses 3573 Charles Lever (editor) 3574 Oracle Corporation 3575 United States of America 3577 Email: chuck.lever@oracle.com 3579 David Noveck 3580 NetApp 3581 1601 Trapelo Road 3582 Waltham, MA 02451 3583 United States of America 3585 Phone: +1 781 572 8038 3586 Email: davenoveck@gmail.com