Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Obsoletes: 5666 (if approved)                                 W. Simpson
Intended status: Standards Track                              DayDreamer
Expires: September 5, 2016                                            T.
                                                                  Talpey
                                                               Microsoft
                                                           March 4, 2016

    Remote Direct Memory Access Transport for Remote Procedure Call
                     draft-ietf-nfsv4-rfc5666bis-04

Abstract

   This document specifies a protocol for conveying Remote Procedure
   Call (RPC) messages on physical transports capable of Remote Direct
   Memory Access (RDMA). It requires no revision to application RPC
   protocols or the RPC protocol itself. This document obsoletes RFC
   5666.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 5, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Requirements Language
     1.2. Remote Procedure Calls On RDMA Transports
   2. Changes Since RFC 5666
     2.1. Changes To The Specification
     2.2. Changes To The XDR Definition
     2.3. Changes To The Protocol
   3. Terminology
     3.1. Remote Procedure Calls
     3.2. Remote Direct Memory Access
   4. RPC-Over-RDMA Protocol Framework
     4.1. Transfer Models
     4.2. Message Framing
     4.3. Managing Receiver Resources
     4.4. XDR Encoding With Chunks
     4.5. Message Size
   5. RPC-Over-RDMA In Operation
     5.1. XDR Protocol Definition
     5.2. Fixed Header Fields
     5.3. Chunk Lists
     5.4. Memory Registration
     5.5. Error Handling
     5.6. Protocol Elements No Longer Supported
     5.7. XDR Examples
   6. RPC Bind Parameters
   7. Bi-Directional RPC-Over-RDMA
     7.1. RPC Direction
     7.2. Backward Direction Flow Control
     7.3. Conventions For Backward Operation
     7.4. Backward Direction Upper Layer Binding
   8. Upper Layer Binding Specifications
     8.1. DDP-Eligibility
     8.2. Maximum Reply Size
     8.3. Additional Considerations
     8.4. Upper Layer Protocol Extensions
   9. Protocol Extensibility
     9.1. Changes To RPC-Over-RDMA Header XDR
     9.2. Feature Statuses With RPC-Over-RDMA Versions
     9.3. RPC-Over-RDMA Version Numbering
     9.4. RPC-Over-RDMA Version One Extension Practices
   10. Security Considerations
     10.1. Memory Protection
     10.2. RPC Message Security
   11. IANA Considerations
   12. Acknowledgments
   13. References
     13.1. Normative References
     13.2. Informative References
   Authors' Addresses

1. Introduction

   This document obsoletes RFC 5666. However, the protocol specified by
   this document is based on existing interoperating implementations of
   the RPC-over-RDMA Version One protocol.

   The new specification clarifies text that is subject to multiple
   interpretations, and removes support for unimplemented RPC-over-RDMA
   Version One protocol elements. It makes the role of Upper Layer
   Bindings an explicit part of the protocol specification.

   In addition, this document introduces conventions that enable
   bi-directional RPC-over-RDMA operation, enabling operation of NFSv4.1
   [RFC5661] on RDMA transports, and that enable the use of RPCSEC_GSS
   [RFC5403] on RDMA transports.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2. Remote Procedure Calls On RDMA Transports

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a
   technique for moving data efficiently between end nodes. By
   directing data into destination buffers as it is sent on a network,
   and placing it via direct memory access by hardware, the benefits of
   faster transfers and reduced host overhead are obtained.

   Open Network Computing Remote Procedure Call (ONC RPC, or simply,
   RPC) [RFC5531] is a remote procedure call protocol that runs over a
   variety of transports. Most RPC implementations today use UDP
   [RFC0768] or TCP [RFC0793]. On UDP, RPC messages are encapsulated
   inside datagrams, while on a TCP byte stream, RPC messages are
   delineated by a record marking protocol. An RDMA transport also
   conveys RPC messages in a specific fashion that must be fully
   described if RPC implementations are to interoperate.

   RDMA transports present semantics different from either UDP or TCP.
   They retain message delineations like UDP, but provide a reliable and
   sequenced data transfer like TCP. They also provide an offloaded
   bulk transfer service not provided by UDP or TCP. RDMA transports
   are therefore appropriately viewed as a new transport type by RPC.

   In this context, the Network File System (NFS) protocols as described
   in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and future NFSv4 minor
   versions are obvious beneficiaries of RDMA transports. A complete
   problem statement is discussed in [RFC5532], and NFSv4-related issues
   are discussed in [RFC5661]. Many other RPC-based protocols can also
   benefit.

   Although the RDMA transport described here can provide relatively
   transparent support for any RPC application, this document also
   describes mechanisms that can optimize data transfer further, given
   more active participation by RPC applications.

2. Changes Since RFC 5666

2.1. Changes To The Specification

   The following alterations have been made to the RPC-over-RDMA Version
   One specification. The section numbers below refer to [RFC5666].

   o  Section 2 has been expanded to introduce and explain key RPC, XDR,
      and RDMA terminology. These terms are now used consistently
      throughout the specification.

   o  Section 3 has been re-organized and split into sub-sections to
      help readers locate specific requirements and definitions.

   o  Sections 4 and 5 have been combined to improve the organization of
      this information.

   o  The specification of the optional Connection Configuration
      Protocol has been removed from the specification.

   o  A section consolidating requirements for Upper Layer Bindings has
      been added.

   o  A section discussing RPC-over-RDMA protocol extensibility has been
      added.

   o  A section specifying conventions for bi-directional RPC operation
      on RPC-over-RDMA Version One has been added.

   o  The "Security Considerations" section has been expanded to include
      a discussion of how RPC-over-RDMA security depends on features of
      the underlying RDMA transport. A subsection specifying
      conventions for using RPCSEC_GSS with RPC-over-RDMA Version One
      has been added.

2.2. Changes To The XDR Definition

   The XDR changes described in this section do not alter the
   over-the-wire message format described in [RFC5666]. Changes made to
   the XDR which do alter the over-the-wire message format (i.e., to
   make it match actual interoperating implementations) are discussed in
   Section 2.3.

   These alterations make it easier to extend the RPC-over-RDMA
   protocol.
   They also better organize the definition, making the
   protocol elements more consonant with actual protocol function. The
   specific changes are:

   o  The XDR description has been given an extraction script using a
      sentinel sequence, matching the approach used in [RFC5662].

   o  XDR data types which need to be the same in all RPC-over-RDMA
      versions have been moved to a separate section and given names
      that are not version-specific.

   o  To allow extensions without modification to the existing XDR, the
      header types previously defined as members of the enum
      rpcrdma1_proc have been defined as constants, the union
      rpcrdma1_body was deleted, and RDMA_ERR_CHUNK has been renamed as
      RDMA_ERR_BADHEADER.

2.3. Changes To The Protocol

   Although the protocol described herein interoperates with existing
   implementations of [RFC5666], the following changes have been made
   relative to the protocol described in that document:

   o  Support for the Read-Read transfer model has been removed.
      Read-Read is a slower transfer model than Read-Write, thus
      implementers have chosen not to support it. Removal simplifies
      explanatory text, and support for the RDMA_DONE procedure is no
      longer necessary.

   o  The specification of RDMA_MSGP in [RFC5666] and current
      implementations of it are incomplete. Even if completed, benefit
      for protocols such as NFSv4.0 [RFC7530] is doubtful. Therefore
      the RDMA_MSGP message type is no longer supported.

   o  Technical errors with regard to handling RPC-over-RDMA header
      errors have been corrected.

   o  Specific requirements related to handling XDR round-up and complex
      XDR data types have been added.

   o  Explicit guidance is provided for sizing Write chunks, managing
      multiple chunks in the Write list, and handling unused Write
      chunks.

   o  Clear guidance about Send and Receive buffer size has been added.
      This enables better decisions about when to provide and use the
      Reply chunk.

   The protocol version number has not been changed because the protocol
   specified in this document fully interoperates with implementations
   of the RPC-over-RDMA Version One protocol specified in [RFC5666].

3. Terminology

3.1. Remote Procedure Calls

   This section introduces key elements of the Remote Procedure Call
   [RFC5531] and External Data Representation [RFC4506] protocols, upon
   which RPC-over-RDMA Version One is constructed.

3.1.1. Upper Layer Protocols

   Remote Procedure Calls are an abstraction used to implement the
   operations of an "Upper Layer Protocol," or ULP. The term Upper
   Layer Protocol refers to an RPC Program and Version tuple, which is a
   versioned set of procedure calls that comprise a single well-defined
   API. One example of an Upper Layer Protocol is the Network File
   System Version 4.0 [RFC7530].

3.1.2. Requesters And Responders

   Like a local procedure call, every Remote Procedure Call (RPC) has a
   set of "arguments" and a set of "results". A calling context is not
   allowed to proceed until the procedure's results are available to it.
   Unlike a local procedure call, the called procedure is executed
   remotely rather than in the local application's context.

   The RPC protocol as described in [RFC5531] is fundamentally a
   message-passing protocol between one server and one or more clients.
   ONC RPC transactions are made up of two types of messages:

   CALL Message
      A CALL message, or "Call", requests that work be done. A Call is
      designated by the value zero (0) in the message's msg_type field.
      An arbitrary unique value is placed in the message's xid field in
      order to match this CALL message to a corresponding REPLY message.

   REPLY Message
      A REPLY message, or "Reply", reports the results of work requested
      by a Call.
      A Reply is designated by the value one (1) in the
      message's msg_type field. The value contained in the message's
      xid field is copied from the Call whose results are being
      reported.

   The RPC client endpoint, or "requester", serializes an RPC Call's
   arguments and conveys them to a server endpoint via an RPC Call
   message. This message contains an RPC protocol header, a header
   describing the requested upper layer operation, and all arguments.

   The RPC server endpoint, or "responder", deserializes the arguments
   and processes the requested operation. It then serializes the
   operation's results into another byte stream. This byte stream is
   conveyed back to the requester via an RPC Reply message. This
   message contains an RPC protocol header, a header describing the
   upper layer reply, and all results.

   The requester deserializes the results and allows the original caller
   to proceed. At this point the RPC transaction designated by the xid
   in the Call message is complete, and the xid is retired.

3.1.3. RPC Transports

   The role of an "RPC transport" is to mediate the exchange of RPC
   messages between requesters and responders. An RPC transport bridges
   the gap between the RPC message abstraction and the native operations
   of a particular network transport.

   RPC-over-RDMA is a connection-oriented RPC transport. When a
   connection-oriented transport is used, requesters initiate transport
   connections, while responders wait passively for incoming connection
   requests.

3.1.4. External Data Representation

   One cannot assume that all requesters and responders internally
   represent data objects the same way. RPC uses eXternal Data
   Representation, or XDR, to translate data types and serialize
   arguments and results [RFC4506].
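
   As a small illustration of such serialization, the sketch below
   encodes the xid and msg_type fields from Section 3.1.2 as two XDR
   unsigned integers and uses the xid to match a Reply to its Call.
   [RFC4506] encodes unsigned integers as four octets, most significant
   octet first. The names encode_msg_prefix, decode_msg_prefix, and
   PendingCalls are hypothetical and chosen only for this example; a
   real RPC message carries further header fields that are omitted
   here.

```python
import struct

CALL, REPLY = 0, 1  # msg_type values given in Section 3.1.2


def encode_msg_prefix(xid, msg_type):
    """Serialize xid and msg_type as two 32-bit XDR unsigned integers."""
    return struct.pack(">II", xid, msg_type)


def decode_msg_prefix(stream):
    """Return (xid, msg_type) read from the start of an XDR stream."""
    return struct.unpack_from(">II", stream, 0)


class PendingCalls:
    """Hypothetical requester-side table of outstanding Calls, keyed by xid."""

    def __init__(self):
        self._pending = {}

    def register(self, xid, context):
        # The xid must be unique among outstanding Calls.
        if xid in self._pending:
            raise ValueError("xid already in use")
        self._pending[xid] = context

    def complete(self, reply_stream):
        xid, msg_type = decode_msg_prefix(reply_stream)
        if msg_type != REPLY:
            raise ValueError("not a Reply")
        # Removing the entry retires the xid.
        return self._pending.pop(xid)
```

   A requester would call register() when sending a Call and complete()
   when a Reply arrives; a Reply carrying an unknown xid raises KeyError
   in this sketch.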

   The XDR protocol encodes data independent of the endianness or size
   of host-native data types, allowing unambiguous decoding of data on
   the receiving end. RPC Programs are specified by writing an XDR
   definition of their procedures, argument data types, and result data
   types.

   XDR assumes that the number of bits in a byte (octet) and their order
   are the same on both endpoints and on the physical network. The
   smallest indivisible unit of XDR encoding is a group of four octets
   in big-endian order. XDR also flattens lists, arrays, and other
   complex data types so they can be conveyed as a stream of bytes.

   A serialized stream of bytes that is the result of XDR encoding is
   referred to as an "XDR stream." A sending endpoint encodes native
   data into an XDR stream and then transmits that stream to a receiver.
   A receiving endpoint decodes incoming XDR byte streams into its
   native data representation format.

3.1.4.1. XDR Opaque Data

   Sometimes a data item must be transferred as-is, without encoding or
   decoding. Such a data item is referred to as "opaque data." XDR
   encoding places opaque data items directly into an XDR stream without
   altering their content in any way. Upper Layer Protocols or
   applications perform any needed data translation in this case.
   Examples of opaque data items include the content of files, or
   generic byte strings.

3.1.4.2. XDR Round-up

   The number of octets in a variable-size opaque data item precedes
   that item in an XDR stream. If the size of an encoded data item is
   not a multiple of four octets, octets containing zero are added to
   the end of the item as it is encoded so that the next encoded data
   item starts on a four-octet boundary. The encoded size of the item
   is not changed by the addition of the extra octets, and the zero
   bytes are not exposed to the Upper Layer.
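
   The padding rule above can be sketched as follows. This is a minimal
   illustration, not a complete XDR library. It assumes the length of a
   variable-size opaque item is carried as a 32-bit unsigned integer,
   most significant octet first, as in [RFC4506]. The function names
   xdr_pad_len, encode_opaque, and decode_opaque are hypothetical.

```python
import struct


def xdr_pad_len(length):
    """Number of zero octets needed to reach the next four-octet boundary."""
    return (4 - length % 4) % 4


def encode_opaque(data):
    """Encode a variable-size opaque item: length word, content, padding."""
    # The length word records the unpadded size of the item.
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad_len(len(data))


def decode_opaque(stream, offset=0):
    """Return (data, next_offset); padding is skipped and never returned."""
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    data = stream[start:start + length]
    next_offset = start + length + xdr_pad_len(length)
    return data, next_offset
```

   For a five-octet item, the encoder emits a length word of 5, the five
   content octets, and three zero octets, so the next item begins at
   offset 12. The decoder hands only the five content octets to its
   caller.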

   This technique is referred to as "XDR round-up," and the extra octets
   are referred to as "XDR padding".

3.2. Remote Direct Memory Access

   RPC requesters and responders can be made more efficient if large RPC
   messages are transferred by a third party such as intelligent network
   interface hardware (data movement offload), and placed in the
   receiver's memory so that no additional adjustment of data alignment
   has to be made (direct data placement). Remote Direct Memory Access
   enables both optimizations.

3.2.1. Direct Data Placement

   Typically, RPC implementations copy the contents of RPC messages into
   a buffer before sending them. An efficient RPC implementation sends
   bulk data without copying it into a separate send buffer first.

   However, socket-based RPC implementations are often unable to receive
   data directly into its final place in memory. Receivers often need
   to copy incoming data to finish an RPC operation; sometimes, only to
   adjust data alignment.

   In this document, "RDMA" refers to the physical mechanism an RDMA
   transport utilizes when moving data. Although it may not be
   efficient, a sender may copy data into an intermediate buffer before
   an RDMA transfer. After an RDMA transfer, a receiver may copy that
   data again to its final destination.

   This document uses the term "direct data placement" (or DDP) to refer
   specifically to an optimized data transfer where it is unnecessary
   for a receiving host's CPU to copy transferred data to another
   location after it has been received. Not all RDMA-based data
   transfer qualifies as Direct Data Placement, and DDP can be achieved
   using non-RDMA mechanisms.

3.2.2. RDMA Transport Requirements

   The RPC-over-RDMA Version One protocol assumes the physical transport
   provides the following abstract operations. A more complete
   discussion of these operations is found in [RFC5040].

   Registered Memory
      Registered memory is a segment of memory that is assigned a
      steering tag that temporarily permits access by the RDMA provider
      to perform data transfer operations. The RPC-over-RDMA Version
      One protocol assumes that each segment of registered memory MUST
      be identified with a steering tag of no more than 32 bits and
      memory addresses of up to 64 bits in length.

   RDMA Send
      The RDMA provider supports an RDMA Send operation, with completion
      signaled on the receiving peer after data has been placed in a
      pre-posted memory segment. Sends complete at the receiver in the
      order they were issued at the sender. The amount of data
      transferred by an RDMA Send operation is limited by the size of
      the remote pre-posted memory segment.

   RDMA Receive
      The RDMA provider supports an RDMA Receive operation to receive
      data conveyed by incoming RDMA Send operations. To reduce the
      amount of memory that must remain pinned awaiting incoming Sends,
      the amount of pre-posted memory is limited. Flow-control to
      prevent overrunning receiver resources is provided by the RDMA
      consumer (in this case, the RPC-over-RDMA Version One protocol).

   RDMA Write
      The RDMA provider supports an RDMA Write operation to directly
      place data in remote memory. The local host initiates an RDMA
      Write, and completion is signaled there. No completion is
      signaled on the remote. The local host provides a steering tag,
      memory address, and length of the remote's memory segment.

      RDMA Writes are not necessarily ordered with respect to one
      another, but are ordered with respect to RDMA Sends. A subsequent
      RDMA Send completion obtained at the write initiator guarantees
      that prior RDMA Write data has been successfully placed in the
      remote peer's memory.

   RDMA Read
      The RDMA provider supports an RDMA Read operation to directly
      place peer source data in the read initiator's memory.
      The local
      host initiates an RDMA Read, and completion is signaled there; no
      completion is signaled on the remote. The local host provides
      steering tags, memory addresses, and a length for the remote
      source and local destination memory segments.

      The remote peer receives no notification of RDMA Read completion.
      The local host signals completion as part of a subsequent RDMA
      Send message so that the remote peer can release steering tags and
      subsequently free associated source memory segments.

   The RPC-over-RDMA Version One protocol is designed to be carried over
   RDMA transports that support the above abstract operations. This
   protocol conveys to the RPC peer information sufficient for that RPC
   peer to direct an RDMA layer to perform transfers containing RPC data
   and to communicate their result(s). For example, it is readily
   carried over RDMA transports such as Internet Wide Area RDMA Protocol
   (iWARP) [RFC5040] [RFC5041].

4. RPC-Over-RDMA Protocol Framework

4.1. Transfer Models

   A "transfer model" designates which endpoint is responsible for
   performing RDMA Read and Write operations. To enable these
   operations, the peer endpoint first exposes segments of its memory to
   the endpoint performing the RDMA Read and Write operations.

   Read-Read
      Requesters expose their memory to the responder, and the responder
      exposes its memory to requesters. The responder employs RDMA Read
      operations to pull RPC arguments or whole RPC calls from the
      requester. Requesters employ RDMA Read operations to pull RPC
      results or whole RPC replies from the responder.

   Write-Write
      Requesters expose their memory to the responder, and the responder
      exposes its memory to requesters. Requesters employ RDMA Write
      operations to push RPC arguments or whole RPC calls to the
      responder. The responder employs RDMA Write operations to push
      RPC results or whole RPC replies to the requester.

   Read-Write
      Requesters expose their memory to the responder, but the responder
      does not expose its memory. The responder employs RDMA Read
      operations to pull RPC arguments or whole RPC calls from the
      requester. The responder employs RDMA Write operations to push
      RPC results or whole RPC replies to the requester.

   Write-Read
      The responder exposes its memory to requesters, but requesters do
      not expose their memory. Requesters employ RDMA Write operations
      to push RPC arguments or whole RPC calls to the responder.
      Requesters employ RDMA Read operations to pull RPC results or
      whole RPC replies from the responder.

   [RFC5666] specifies the use of both the Read-Read and the Read-Write
   Transfer Model. All current RPC-over-RDMA Version One
   implementations use only the Read-Write Transfer Model. Therefore
   the use of the Read-Read Transfer Model by RPC-over-RDMA Version One
   implementations is no longer supported. Other Transfer Models may be
   used by a future version of RPC-over-RDMA.

4.2. Message Framing

   On an RPC-over-RDMA transport, each RPC message is encapsulated by an
   RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR
   streams.

   RPC Payload Stream
      The "Payload stream" contains the encapsulated RPC message being
      transferred by this RPC-over-RDMA message. This stream always
      begins with the XID field of the encapsulated RPC message.

   Transport Header Stream
      The "Transport stream" contains a header that describes and
      controls the transfer of the Payload stream in this RPC-over-RDMA
      message. This header is analogous to the record marking used for
      RPC over TCP but is more extensive, since RDMA transports support
      several modes of data transfer.

   In its simplest form, an RPC-over-RDMA message consists of a
   Transport stream followed immediately by a Payload stream conveyed
   together in a single RDMA Send.
   To transmit large RPC messages, a
   combination of one RDMA Send operation and one or more RDMA Read or
   Write operations is employed.

   RPC-over-RDMA framing replaces all other RPC framing (such as TCP
   record marking) when used atop an RPC-over-RDMA association, even
   when the underlying RDMA protocol may itself be layered atop a
   transport with a defined RPC framing (such as TCP).

   It is however possible for RPC-over-RDMA to be dynamically enabled in
   the course of negotiating the use of RDMA via an Upper Layer Protocol
   exchange. Because RPC framing delimits an entire RPC request or
   reply, the resulting shift in framing must occur between distinct RPC
   messages, and in concert with the underlying transport.

4.3. Managing Receiver Resources

   It is critical to provide RDMA Send flow control for an RDMA
   connection. If no pre-posted receive buffer is large enough to
   accept an incoming RDMA Send, the RDMA Send operation fails. If a
   pre-posted receive buffer is not available to accept an incoming RDMA
   Send, the RDMA Send operation can fail. Repeated occurrences of such
   errors can be fatal to the connection. This is a departure from
   conventional TCP/IP networking where buffers are allocated
   dynamically as part of receiving messages.

   The longevity of an RDMA connection requires that sending endpoints
   respect the resource limits of peer receivers. To ensure messages
   can be sent and received reliably, there are two operational
   parameters for each connection.

4.3.1. Credit Limit

   The number of pre-posted RDMA Receive operations is sometimes
   referred to as a peer's "credit limit." Flow control for RDMA Send
   operations directed to the responder is implemented as a simple
   request/grant protocol in the RPC-over-RDMA header associated with
   each RPC message. Section 5.2.3 has further detail.
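
   The request/grant exchange detailed in the remainder of this section
   can be sketched from the requester's point of view as follows. This
   is an illustration only, not a normative algorithm. The class and
   method names are hypothetical. It assumes the requester starts with
   a single credit (Section 4.3.3), computes its sending window as the
   lower of the requested and granted values minus the number of
   requests in flight, and treats a grant of zero as a protocol error.

```python
class RequesterCredits:
    """Hypothetical requester-side credit accounting for one connection."""

    def __init__(self, requested):
        self.requested = requested  # value advertised in each Call
        self.granted = 1            # assumed until the first Reply arrives
        self.in_flight = 0          # Calls sent but not yet answered

    def available(self):
        """Calls that may be sent now without exceeding the limit."""
        return min(self.requested, self.granted) - self.in_flight

    def try_send(self):
        """Account for one Call if a credit is available."""
        if self.available() <= 0:
            return False
        self.in_flight += 1
        return True

    def on_reply(self, granted):
        """Record the grant carried in a Reply and retire one Call."""
        if granted == 0:
            raise ValueError("granted credit value must not be zero")
        self.in_flight -= 1
        self.granted = granted
```

   With a requested value of 32, the first Call is sent under the single
   initial credit and a second Call is held back. Once a Reply grants 8
   credits, up to 8 Calls may be outstanding at once.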

   o  The RPC-over-RDMA header for RPC Call messages contains a
      requested credit value for the responder. This is the maximum
      number of RPC replies the requester can handle at once,
      independent of how many RPCs are in flight at that moment. The
      requester MAY dynamically adjust the requested credit value to
      match its expected needs.

   o  The RPC-over-RDMA header for RPC Reply messages provides the
      granted result. This is the maximum number of RPC calls the
      responder can handle at once, without regard to how many RPCs are
      in flight at that moment. The granted value MUST NOT be zero,
      since such a value would result in deadlock. The responder MAY
      dynamically adjust the granted credit value to match its needs or
      policies (e.g. to accommodate the available resources in a shared
      receive queue).

   The requester MUST NOT send unacknowledged requests in excess of this
   granted responder credit limit. If the limit is exceeded, the RDMA
   layer may signal an error, possibly terminating the connection. If
   an RDMA layer error does not occur, the responder MAY handle excess
   requests or return an RPC layer error to the requester.

   While RPC calls complete in any order, the current flow control limit
   at the responder is known to the requester from the Send ordering
   properties. It is always the lower of the requested and granted
   credit values, minus the number of requests in flight. Advertised
   credit values are not altered when individual RPCs are started or
   completed.

   On occasion a requester or responder may need to adjust the amount of
   resources available to a connection. When this happens, the
   responder needs to ensure that a credit increase is effected (i.e.
   RDMA Receives are posted) before the next reply is sent.

   Certain RDMA implementations may impose additional flow control
   restrictions, such as limits on RDMA Read operations in progress at
   the responder.
   Accommodation of such restrictions is considered the
   responsibility of each RPC-over-RDMA Version One implementation.

4.3.2. Inline Threshold

   A receiver's "inline threshold" value is the largest message size (in
   octets) that the receiver can accept via an RDMA Receive operation.
   Each connection has two inline threshold values, one for each peer
   receiver.

   Unlike credit limits, inline threshold values are not advertised to
   peers via the RPC-over-RDMA Version One protocol, and there is no
   provision for the inline threshold value to change during the
   lifetime of an RPC-over-RDMA Version One connection.

4.3.3. Initial Connection State

   When a connection is first established, peers might not know how many
   receive buffers the other has, nor how large these buffers are.

   As a basis for an initial exchange of RPC requests, each
   RPC-over-RDMA Version One connection provides the ability to exchange
   at least one RPC message at a time that is 1024 bytes in size. A
   responder MAY exceed this basic level of configuration, but a
   requester MUST NOT assume more than one credit is available, and MUST
   receive a valid reply from the responder carrying the actual number
   of available credits, prior to sending its next request.

   Receiver implementations MUST support an inline threshold of 1024
   bytes, but MAY support larger inline threshold values. A mechanism
   for discovering a peer's inline threshold value before a connection
   is established may be used to optimize the use of RDMA Send
   operations. In the absence of such a mechanism, senders MUST assume
   a receiver's inline threshold is 1024 bytes.

4.4. XDR Encoding With Chunks

   When a direct data placement capability is available, during XDR
   encoding it can be determined that an XDR data item is large enough
   that it might be more efficient if the transport placed the content
   of the data item directly in the receiver's memory.

4.4.1. Reducing An XDR Stream

   RPC-over-RDMA Version One provides a mechanism for moving part of an
   RPC message via a data transfer separate from an RDMA Send/Receive.
   The sender removes one or more XDR data items from the Payload
   stream. They are conveyed via one or more RDMA Read or Write
   operations. The receiver inserts the data items into the Payload
   stream before passing it to the Upper Layer.

   A contiguous piece of a Payload stream that is split out and moved
   via separate RDMA operations is known as a "chunk." A Payload stream
   after chunks have been removed is referred to as a "reduced" Payload
   stream.

4.4.2. DDP-Eligibility

   Only an XDR data item that might benefit from Direct Data Placement
   may be reduced. The eligibility of particular XDR data items to be
   reduced is not specified by this document.

   To maintain interoperability on an RPC-over-RDMA transport, a
   determination must be made of which XDR data items in each Upper
   Layer Protocol are allowed to use Direct Data Placement. Therefore
   an additional specification is needed that describes how an Upper
   Layer Protocol enables Direct Data Placement. The set of
   requirements for an Upper Layer Protocol to use an RPC-over-RDMA
   transport is known as an "Upper Layer Binding specification," or ULB.

   An Upper Layer Binding specification states which specific individual
   XDR data items in an Upper Layer Protocol MAY be transferred via
   Direct Data Placement. This document will refer to XDR data items
   that are permitted to be reduced as "DDP-eligible". All other XDR
   data items MUST NOT be reduced.
RPC-over-RDMA Version One uses RDMA Read and Write operations to
transfer DDP-eligible data that has been reduced.

Detailed requirements for Upper Layer Bindings are discussed in full
in Section 8.

4.4.3.  RDMA Segments

When encoding a Payload stream that contains a DDP-eligible data
item, a sender may choose to reduce that data item.  It does not
place the item into the Payload stream.  Instead, the sender records
in the RPC-over-RDMA header the actual address and size of the memory
region containing that data item.

The requester provides location information for DDP-eligible data
items in both RPC Calls and Replies.  The responder uses this
information to initiate RDMA Read and Write operations to retrieve or
update the specified region of the requester's memory.

An "RDMA segment", or a "plain segment", is an RPC-over-RDMA header
data object that contains the precise co-ordinates of a contiguous
memory region that is to be conveyed via one or more RDMA Read or
RDMA Write operations.  The following fields are contained in each
segment.

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Handle                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Length                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             Offset                            +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Handle
   Steering tag (STag) or handle obtained when the segment's memory
   is registered for RDMA.  Also known as an R_key, this value is
   generated by registering this memory with the RDMA provider.

Length
   The length of the memory segment, in octets.

Offset
   The offset or beginning memory address of the segment.

See [RFC5040] for further discussion of the meaning of these fields.

4.4.4.
Chunks 736 In RPC-over-RDMA Version One, a "chunk" refers to a portion of the 737 Payload stream that is moved via RDMA Read or Write operations. 738 Chunk data is removed from the sender's Payload stream, transferred 739 by separate RDMA operations, and then re-inserted into the receiver's 740 Payload stream. 742 Each chunk consists of one or more RDMA segments. Each segment 743 represents a single contiguous piece of that chunk. Segments MAY 744 divide a chunk on any boundary that is convenient to the requester. 746 Except in special cases, a chunk contains exactly one XDR data item. 747 This makes it straightforward to remove chunks from an XDR stream 748 without affecting XDR alignment. Not every RPC-over-RDMA message has 749 chunks associated with it. 751 4.4.4.1. Counted Arrays 753 If a chunk contains a counted array data type, the count of array 754 elements MUST remain in the Payload stream, while the array elements 755 MUST be moved to the chunk. For example, when encoding an opaque 756 byte array as a chunk, the count of bytes stays in the Payload 757 stream, while the bytes in the array are removed from the Payload 758 stream and transferred within the chunk. 760 Any byte count left in the Payload stream MUST match the sum of the 761 lengths of the segments making up the chunk. If they do not agree, 762 an RPC protocol encoding error results. 764 Individual array elements appear in a chunk in their entirety. For 765 example, when encoding an array of arrays as a chunk, the count of 766 items in the enclosing array stays in the Payload stream, but each 767 enclosed array, including its item count, is transferred as part of 768 the chunk. 770 4.4.4.2. Optional-data 772 If a chunk contains an optional-data data type, the "is present" 773 field MUST remain in the Payload stream, while the data, if present, 774 MUST be moved to the chunk. 776 4.4.4.3. 
XDR Unions

A union data type should never be made DDP-eligible, but one or more
of its arms may be DDP-eligible.

4.4.5.  Read Chunks

A "Read chunk" represents an XDR data item that is to be pulled from
the requester to the responder using RDMA Read operations.

A Read chunk is a list of one or more RDMA segments.  Each RDMA
segment in a Read chunk is a plain segment which has an additional
Position field.

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Position                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Handle                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Length                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             Offset                            +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Position
   The byte offset in the Payload stream where the receiver
   re-inserts the data item conveyed in a chunk.  The Position value
   MUST be computed from the beginning of the Payload stream, which
   begins at Position zero.  All RDMA segments belonging to the same
   Read chunk have the same value in their Position field.

While constructing an RPC-over-RDMA Call message, a requester
registers memory segments containing data in Read chunks.  It
advertises these chunks in the RPC-over-RDMA header of the RPC Call.

After receiving an RPC Call sent via an RDMA Send operation, a
responder transfers the chunk data from the requester using RDMA Read
operations.  The responder reconstructs the transferred chunk data by
concatenating the contents of each segment, in list order, into the
received Payload stream at the Position value recorded in the
segment.

Put another way, a receiver inserts the first segment in a Read chunk
into the Payload stream at the byte offset indicated by its Position
field.
Segments whose Position field value matches this offset are
concatenated afterwards, until there are no more segments at that
Position value.  The next XDR data item in the Payload stream
follows.

4.4.5.1.  Read Chunk Round-up

XDR requires each encoded data item to start on four-byte alignment.
When an odd-length data item is encoded, its length is encoded
literally, while the data is padded so the next data item in the XDR
stream can start on a four-byte boundary.  Receivers ignore the
content of the pad bytes.

After an XDR data item has been reduced, all data items remaining in
the Payload stream must continue to adhere to these padding
requirements.  Thus when an XDR data item is moved from the Payload
stream into a Read chunk, the requester MUST remove XDR padding for
that data item from the Payload stream as well.

The length of a Read chunk is the sum of the lengths of the read
segments that comprise it.  If this sum is not a multiple of four,
the requester MAY choose to send a Read chunk without any XDR
padding.  If the requester provides no actual round-up in a Read
chunk, the responder MUST be prepared to provide appropriate round-up
in the reconstructed call XDR stream.

The Position field in a read segment indicates where the containing
Read chunk starts in the Payload stream.  The value in this field
MUST be a multiple of four.  Moreover, all segments in the same Read
chunk share the same Position value, even if one or more of the
segments have a non-four-byte aligned length.

4.4.5.2.  Decoding Read Chunks

While decoding a received Payload stream, whenever the XDR offset in
the Payload stream matches that of a Read chunk, the transport
initiates an RDMA Read to pull the chunk's data content into
registered memory on the responder.
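
The reassembly and round-up rules above can be sketched as follows.
This is an illustrative Python model only, not part of the protocol.
The names xdr_pad and reconstruct_payload are hypothetical, segment
contents are modeled as byte strings that have already been pulled via
RDMA Read, and Position is assumed to count from the start of the
unreduced Payload stream (the text above says only "the beginning of
the Payload stream").  Pad bytes are supplied by the receiver only when
the chunk length is not a multiple of four.

```python
from typing import Dict, List


def xdr_pad(length: int) -> int:
    """Number of pad bytes needed to round length up to a multiple of four."""
    return (4 - length % 4) % 4


def reconstruct_payload(reduced: bytes,
                        read_chunks: Dict[int, List[bytes]]) -> bytes:
    """Re-insert Read chunk data into a reduced Payload stream.

    read_chunks maps a Position value to the contents of that chunk's
    segments, in list order.  Position is ASSUMED to be an offset into
    the unreduced stream.
    """
    out = bytearray()
    consumed = 0  # bytes of the reduced stream copied so far
    for position in sorted(read_chunks):
        if position % 4 != 0:
            raise ValueError("Position must be a multiple of four")
        gap = position - len(out)  # inline bytes that precede this chunk
        if gap < 0 or consumed + gap > len(reduced):
            raise ValueError("Position does not fit the Payload stream")
        out += reduced[consumed:consumed + gap]
        consumed += gap
        data = b"".join(read_chunks[position])  # concatenate in list order
        out += data
        out += b"\x00" * xdr_pad(len(data))     # receiver-side round-up
    out += reduced[consumed:]
    return bytes(out)
```

In this model, a five-byte opaque item split across two segments is
re-inserted after its length word, followed by three pad bytes, so the
next data item again starts on a four-byte boundary.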

The responder acknowledges that it has finished using Read chunk
source buffers when it sends an RPC Reply to the requester.  The
requester may then release Read chunks advertised in the request.

4.4.6.  Write Chunks

A "Write chunk" represents an XDR data item that is to be pushed from
a responder to a requester using RDMA Write operations.

A Write chunk is an array of one or more plain RDMA segments.  Write
chunks are provided by a requester long before the responder has
prepared the reply Payload stream.  Therefore RDMA segments in a
Write chunk do not have a Position field.

While constructing an RPC Call message, a requester also prepares
memory regions to catch DDP-eligible reply data items.  A requester
does not know the actual length of the result data item to be
returned, thus it MUST register a Write chunk long enough to
accommodate the maximum possible size of the returned data item.

A responder copies the requester-provided Write chunk segments into
the RPC-over-RDMA header that it returns with the reply.  The
responder MUST NOT change the number of segments in the Write chunk.

The responder fills the segments in array order until the data item
has been completely written.  The responder updates the segment
length fields to reflect the actual amount of data that is being
returned in each segment.  If a Write chunk segment is not filled by
the responder, the updated length of the segment SHOULD be zero.

The responder then sends the RPC Reply via an RDMA Send operation.
After receiving the RPC Reply, the requester reconstructs the
transferred data by concatenating the contents of each segment, in
array order, into the RPC Reply XDR stream.

4.4.6.1.  Write Chunk Round-up

XDR requires each encoded data item to start on four-byte alignment.
899 When an odd-length data item is encoded, its length is encoded 900 literally, while the data is padded so the next data item in the XDR 901 stream can start on a four-byte boundary. Receivers ignore the 902 content of the pad bytes. 904 After a data item is reduced, data items remaining in the Payload 905 stream must continue to adhere to these padding requirements. Thus 906 when an XDR data item is moved from a reply Payload stream into a 907 Write chunk, the responder MUST remove XDR padding for that data item 908 from the reply Payload stream as well. 910 A requester SHOULD NOT provide extra length in a Write chunk to 911 accommodate XDR pad bytes. A responder MUST NOT write XDR pad bytes 912 for a Write chunk. 914 4.4.6.2. Unused Write Chunks 916 There are occasions when a requester provides a Write chunk but the 917 responder does not use it. 919 For example, an Upper Layer Protocol may define a union result where 920 some arms of the union contain a DDP-eligible data item while other 921 arms do not. The requester is required to provide a Write chunk in 922 this case, but if the responder returns a result that uses an arm of 923 the union that has no DDP-eligible data item, the Write chunk remains 924 unused. 926 When forming an RPC-over-RDMA Reply message with an unused Write 927 chunk, the responder MUST set the length of all segments in the chunk 928 to zero. 930 Unused write chunks, or unused bytes in write chunk segments, are not 931 returned as results. Their memory is returned to the Upper Layer as 932 part of RPC completion. However, the RPC layer MUST NOT assume that 933 the buffers have not been modified. 935 4.5. Message Size 937 A receiver of RDMA Send operations is required by RDMA to have 938 previously posted one or more adequately sized buffers. Memory 939 savings can be achieved on both requesters and responders by leaving 940 the inline threshold small. However, not all RPC messages are small. 942 4.5.1. 
Short Messages 944 RPC messages are frequently smaller than typical inline thresholds. 945 For example, the NFS version 3 GETATTR request is only 56 bytes: 20 946 bytes of RPC header, plus a 32-byte file handle argument and 4 bytes 947 for its length. The reply to this common request is about 100 bytes. 949 Since all RPC messages conveyed via RPC-over-RDMA require an RDMA 950 Send operation, the most efficient way to send an RPC message that is 951 smaller than the receiver's inline threshold is to append the Payload 952 stream directly to the Transport stream. An RPC-over-RDMA header 953 with a small RPC Call or Reply message immediately following is 954 transferred using a single RDMA Send operation. No RDMA Read or 955 Write operations are needed. 957 4.5.2. Chunked Messages 959 If DDP-eligible data items are present in a Payload stream, a sender 960 MAY reduce the Payload stream to enable the use of RDMA Read or Write 961 operations to move the reduced data items. The Transport stream with 962 the reduced Payload stream immediately following is transferred using 963 a single RDMA Send operation. 965 After receiving the Transport and Payload streams of a Chunked RPC- 966 over-RDMA Call message, the responder uses RDMA Read operations to 967 move reduced data items in Read chunks. Before sending the Transport 968 and Payload streams of a Chunked RPC-over-RDMA Reply message, the 969 responder uses RDMA Write operations to move reduced data items in 970 Write and Reply chunks. 972 4.5.3. Long Messages 974 When a Payload stream is larger than the receiver's inline threshold, 975 the Payload stream is reduced by removing DDP-eligible data items and 976 placing them in chunks to be moved separately. If there are no DDP- 977 eligible data items in the Payload stream, or the Payload stream is 978 still too large after it has been reduced, the RDMA transport MUST 979 use RDMA Read or Write operations to convey the Payload stream 980 itself. 
This mechanism is referred to as a "Long Message." 982 To transmit a Long Message, the sender conveys only the Transport 983 stream with an RDMA Send operation. The Payload stream is not 984 included in the Send buffer in this instance. Instead, the requester 985 provides chunks that the responder uses to move the Payload stream. 987 Long RPC Call 988 To send a Long RPC-over-RDMA Call message, the requester provides 989 a special Read chunk that contains the RPC Call's Payload stream. 991 Every segment in this Read chunk MUST contain zero in its Position 992 field. Thus this chunk is known as a "Position Zero Read chunk." 994 Long RPC Reply 995 To send a Long RPC-over-RDMA Reply message, the requester provides 996 a single special Write chunk in advance, known as the "Reply 997 chunk", that will contain the RPC Reply's Payload stream. The 998 requester sizes the Reply chunk to accommodate the maximum 999 expected reply size for that Upper Layer operation. 1001 Though the purpose of a Long Message is to handle large RPC messages, 1002 requesters MAY use a Long Message at any time to convey an RPC Call. 1004 A responder chooses which form of reply to use based on the chunks 1005 provided by the requester. If Write chunks were provided and the 1006 responder has a DDP-eligible result, it first reduces the reply 1007 Payload stream. If a Reply chunk was provided and the reduced 1008 Payload is larger than the requester's inline threshold, the 1009 responder MUST use the provided Reply chunk for the reply. 1011 Because these special chunks contain a whole RPC message, any XDR 1012 data item MAY appear in one of these special chunks without regard to 1013 its DDP-eligibility. DDP-eligible data items MAY be removed from 1014 these special chunks and conveyed via normal chunks, but non-eligible 1015 data items MUST NOT appear in normal chunks. 1017 5. 
RPC-Over-RDMA In Operation 1019 Every RPC-over-RDMA Version One message has a header that includes a 1020 copy of the message's transaction ID, data for managing RDMA flow 1021 control credits, and lists of RDMA segments used for RDMA Read and 1022 Write operations. All RPC-over-RDMA header content is contained in 1023 the Transport stream, and thus MUST be XDR encoded. 1025 RPC message layout is unchanged from that described in [RFC5531] 1026 except for the possible reduction of data items that are moved by 1027 RDMA Read or Write operations. 1029 The RPC-over-RDMA protocol passes RPC messages without regard to 1030 their type (CALL or REPLY) or direction (forwards or backwards). 1031 Both endpoints of a connection MAY send any RPC-over-RDMA message 1032 header type at any time (subject to credit limits). 1034 5.1. XDR Protocol Definition 1036 This section contains a description of the core features of the RPC- 1037 over-RDMA Version One protocol, expressed in the XDR language 1038 [RFC4506]. 1040 This description is provided in a way that makes it simple to extract 1041 into ready-to-compile form. The reader can apply the following shell 1042 script to this document to produce a machine-readable XDR description 1043 of the RPC-over-RDMA Version One protocol without any OPTIONAL 1044 extensions. 1046 1048 #!/bin/sh 1049 grep '^ *///' | sed 's?^ /// ??' | sed 's?^ *///$??' 1051 1053 That is, if the above script is stored in a file called "extract.sh" 1054 and this document is in a file called "spec.txt" then the reader can 1055 do the following to extract an XDR description file: 1057 1059 sh extract.sh < spec.txt > rpcrdma_corev1.x 1061 1063 As described in Section 9.4, extensions to RPC-over-RDMA Version One, 1064 published as Proposed Standards, will have similar means of providing 1065 an XDR description appropriate to those extensions. 
Once XDR for 1066 extensions is also extracted, it can be appended to the XDR 1067 description file extracted from this document to produce a 1068 consolidated XDR description file reflecting all extensions selected 1069 for an RPC-over-RDMA implementation. 1071 RPC-over-RDMA is not a stand-alone RPC Program. To enable protocol 1072 extension, there is no single XDR entity which describes the format 1073 of RPC-over-RDMA headers. Instead, implementers need to follow the 1074 instructions in Section 5.1.4 to appropriately encode and decode 1075 protocol messages. 1077 5.1.1. Code Component License 1079 Code components extracted from this document must include the 1080 following license text. When the extracted XDR code is combined with 1081 other complementary XDR code which itself has an identical license, 1082 only a single copy of the license text need be preserved. 1084 1086 /// /* 1087 /// * Copyright (c) 2010, 2015 IETF Trust and the persons 1088 /// * identified as authors of the code. All rights reserved. 1089 /// * 1090 /// * The authors of the code are: 1091 /// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck. 1092 /// * 1093 /// * Redistribution and use in source and binary forms, with 1094 /// * or without modification, are permitted provided that the 1095 /// * following conditions are met: 1096 /// * 1097 /// * - Redistributions of source code must retain the above 1098 /// * copyright notice, this list of conditions and the 1099 /// * following disclaimer. 1100 /// * 1101 /// * - Redistributions in binary form must reproduce the above 1102 /// * copyright notice, this list of conditions and the 1103 /// * following disclaimer in the documentation and/or other 1104 /// * materials provided with the distribution. 
1105 /// * 1106 /// * - Neither the name of Internet Society, IETF or IETF 1107 /// * Trust, nor the names of specific contributors, may be 1108 /// * used to endorse or promote products derived from this 1109 /// * software without specific prior written permission. 1110 /// * 1111 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 1112 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 1113 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 1114 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 1115 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 1116 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 1117 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 1118 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 1119 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 1120 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 1121 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 1122 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 1123 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 1124 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 1125 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 1126 /// */ 1128 1130 5.1.2. XDR Applying To All Versions Of RPC-Over-RDMA 1132 XDR data items defined in this section describe elements of the RPC- 1133 over-RDMA protocol that are not subject to change in subsequent 1134 versions. A full discussion of the extensibility model is in 1135 Section 9. 

/// typedef uint32 rpcrdma_htype;
///
/// struct rpcrdma_prefix {
///         uint32         rdma_xid;
///         uint32         rdma_version;
///         uint32         rdma_credits;
///         rpcrdma_htype  rdma_htype;
/// };
///
/// /*
///  * Mandatory RPC-over-RDMA message header types
///  */
/// const RDMA_MSG = 0;
/// const RDMA_NOMSG = 1;
/// const RDMA_ERROR = 4;
///
/// struct rpcrdma_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };

5.1.3.  XDR Applying To Version One Of RPC-Over-RDMA

XDR data items defined in this section are subject to change in
subsequent RPC-over-RDMA versions.

Even though the names of structures and unions begin "rpcrdma1_",
these are not restricted to use in RPC-over-RDMA Version One.
Structure definitions may be carried over unchanged to subsequent
versions, but unions are subject to extension according to the rules
for compatible XDR extension as discussed in Section 9.  Comments
identify items that cannot be changed in subsequent versions.

/// /*
///  * Version One reserved message types
///  */
/// const RDMA_MSGP = 2;
/// const RDMA_DONE = 3;
///
/// struct rpcrdma1_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma1_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma1_segment rdma_target;
/// };
///
/// struct rpcrdma1_read_list {
///         struct rpcrdma1_read_segment rdma_entry;
///         struct rpcrdma1_read_list    *rdma_next;
/// };
///
/// struct rpcrdma1_write_chunk {
///         struct rpcrdma1_segment rdma_target<>;
/// };
///
/// struct rpcrdma1_write_list {
///         struct rpcrdma1_write_chunk rdma_entry;
///         struct rpcrdma1_write_list  *rdma_next;
/// };
///
/// struct rpcrdma1_chunks {
///         struct rpcrdma1_read_list   *rdma_reads;
///         struct rpcrdma1_write_list  *rdma_writes;
///         struct rpcrdma1_write_chunk *rdma_reply;
/// };
///
/// struct rpcrdma1_padded {
///         uint32          rdma_align;
///         uint32          rdma_thresh;
///         rpcrdma1_chunks rdma_chunks;
/// };
///
/// enum rpcrdma1_errcode {
///         RDMA_ERR_VERS = 1,
///         RDMA_ERR_BADHEADER = 2
/// };
///
/// union rpcrdma1_error switch (rpcrdma1_errcode rdma_err) {
///         case RDMA_ERR_VERS:
///           rpcrdma_err_vers rdma_vrange;  /* Immutable */
///         case RDMA_ERR_BADHEADER:
///           void;
/// };

5.1.4.  Use Of XDR Descriptions

Though it is described by XDR, RPC-over-RDMA is not an RPC Program.
Certain functions normally provided by RPC need to be addressed by
the RPC-over-RDMA definition itself.
In particular, the following functions normally provided by RPC need
to be provided for as part of the RPC-over-RDMA XDR description:

o  Negotiation of RPC-over-RDMA protocol version

o  Identifying RPC-over-RDMA header types that are followed by a
   Payload stream

In [RFC5666] the XDR description did not take account of the natural
layering between the part of RPC-over-RDMA functionality that
performed RPC-layer like functions described above and that which
implemented individual transport functions.  As a result:

o  The four 32-bit words which must be the same in all versions of
   RPC-over-RDMA are split, with three of those words in struct
   rpcrdma1_header and the remaining word part of union
   rpcrdma1_body, together with each of the message bodies.

o  It is impossible, within the resulting structure, to add a new
   message type without modifying the existing XDR description.

The XDR description introduced in this document reorganizes the XDR
in line with this natural layering, while maintaining over-the-wire
equivalence.  As a result, the 32-bit big-endian field starting
twelve bytes into the header is no longer the discriminator field of
union rpcrdma1_body.  Instead it is the last 32-bit word within
struct rpcrdma_prefix, which defines the common (i.e., for all
RPC-over-RDMA versions) header prefix.  It retains its role of
indicating the message type and deciding which particular header body
is to follow.

As a result there is no longer a single XDR item that encompasses the
entire RPC-over-RDMA header.  Instead, each RPC-over-RDMA message
consists of up to three items, and those using XDR encode and decode
must be aware that they proceed in sequence as follows:

1.  A struct rpcrdma_prefix

2.  Depending on the rdma_htype field of the prefix, the appropriate
    header body for that message type as given by Table 1.
    In cases in which there is an undefined header type, this is to
    be treated as an XDR encode/decode error.

3.  If allowed for that header type as defined in Table 1, an XDR
    stream for the RPC message being transported

+--------------+------------------------+-------------------+
| Message Type | Body                   | Payload stream?   |
+--------------+------------------------+-------------------+
| RDMA_MSG     | struct rpcrdma1_chunks | Yes               |
+--------------+------------------------+-------------------+
| RDMA_NOMSG   | struct rpcrdma1_chunks | No                |
+--------------+------------------------+-------------------+
| RDMA_ERROR   | union rpcrdma1_error   | No                |
+--------------+------------------------+-------------------+

Table 1.  Header Type Characteristics

5.2.  Fixed Header Fields

The RPC-over-RDMA header begins with four fixed 32-bit fields that
control the RDMA interaction.  These four fields, which must remain
with the same meanings and in the same positions in all subsequent
versions of the RPC-over-RDMA protocol, are described below.

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              XID                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Version Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Credit Value                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Procedure Number                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

5.2.1.  Transaction ID (XID)

The XID generated for the RPC Call and Reply.  Having the XID at a
fixed location in the header makes it easy for the receiver to
establish context as soon as each RPC-over-RDMA message arrives.
This XID MUST be the same as the XID in the RPC message.
The 1318 receiver MAY perform its processing based solely on the XID in the 1319 RPC-over-RDMA header, and thereby ignore the XID in the RPC message, 1320 if it so chooses. 1322 5.2.2. Version Number 1324 For RPC-over-RDMA Version One, this field MUST contain the value one 1325 (1). Rules regarding changes to this transport protocol version 1326 number can be found in Section 9.3. 1328 5.2.3. Credit Value 1330 When sent in an RPC Call message, the requested credit value is 1331 provided. When sent in an RPC Reply message, the granted credit 1332 value is returned. RPC Calls SHOULD NOT be sent in excess of the 1333 currently granted limit. Further discussion of how the credit value 1334 is determined can be found in Section 4.3. 1336 5.2.4. Procedure number 1338 o RDMA_MSG = 0 indicates that chunk lists and a Payload stream 1339 follow. The format of the chunk lists is discussed below. 1341 o RDMA_NOMSG = 1 indicates that after the chunk lists there is no 1342 Payload stream. In this case, the chunk lists provide information 1343 to allow the responder to transfer the Payload stream using RDMA 1344 Read or Write operations. 1346 o RDMA_MSGP = 2 is reserved. 1348 o RDMA_DONE = 3 is reserved. 1350 o RDMA_ERROR = 4 is used to signal an encoding error in the RPC- 1351 over-RDMA header. 1353 An RDMA_MSG procedure conveys the Transport stream and the Payload 1354 stream via an RDMA Send operation. The Transport stream contains the 1355 four fixed fields, followed by the Read and Write lists and the Reply 1356 chunk, though any or all three MAY be marked as not present. The 1357 Payload stream then follows, beginning with its XID field. If a Read 1358 or Write chunk list is present, a portion of the Payload stream has 1359 been excised and is conveyed separately via RDMA Read or Write 1360 operations. 1362 An RDMA_NOMSG procedure conveys the Transport stream via an RDMA Send 1363 operation. 
The Transport stream contains the four fixed fields, 1364 followed by the Read and Write chunk lists and the Reply chunk. 1365 Though any of these MAY be marked as not present, one MUST be present 1366 and MUST hold the Payload stream for this RPC-over-RDMA message. If 1367 a Read or Write chunk list is present, a portion of the Payload 1368 stream has been excised and is conveyed separately via RDMA Read or 1369 Write operations. 1371 An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send 1372 operation. The Transport stream contains the four fixed fields, 1373 followed by formatted error information. No Payload stream is 1374 conveyed in this type of RPC-over-RDMA message. 1376 A gather operation on each RDMA Send operation can be used to combine 1377 the Transport and Payload streams, which might have been constructed 1378 in separate buffers. However, the total length of the gathered send 1379 buffers MUST NOT exceed the peer receiver's inline threshold. 1381 5.3. Chunk Lists 1383 The chunk lists in an RPC-over-RDMA Version One header are three XDR 1384 optional-data fields that follow the fixed header fields in RDMA_MSG 1385 and RDMA_NOMSG procedures. Read Section 4.19 of [RFC4506] carefully 1386 to understand how optional-data fields work. Examples of XDR encoded 1387 chunk lists are provided in Section 5.7 as an aid to understanding. 1389 5.3.1. Read List 1391 Each RDMA_MSG or RDMA_NOMSG procedure has one "Read list." The Read 1392 list is a list of zero or more Read segments, provided by the 1393 requester, that are grouped by their Position fields into Read 1394 chunks. Each Read chunk advertises the location of argument data the 1395 responder is to retrieve via RDMA Read operations. The requester has 1396 removed the data in these chunks from the call's Payload stream. 1398 Via a Position Zero Read Chunk, a requester may provide an RPC Call 1399 message as a chunk in the Read list. 
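
The grouping of Read segments into Read chunks can be sketched as
follows.  This is a hedged Python illustration, not a definitive
implementation.  The ReadSegment class and the helper names are
hypothetical; the fields mirror struct rpcrdma1_read_segment, and
grouping is done over adjacent list entries that share a Position
value, which is how the reassembly text in Section 4.4.5 reads.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class ReadSegment:
    """One Read list entry: a plain segment plus a Position field."""
    position: int
    handle: int
    length: int
    offset: int


def group_read_chunks(
        read_list: List[ReadSegment]) -> List[Tuple[int, List[ReadSegment]]]:
    """Group adjacent Read segments sharing a Position into Read chunks."""
    return [(position, list(segments))
            for position, segments in groupby(read_list,
                                              key=lambda s: s.position)]


def chunk_length(segments: List[ReadSegment]) -> int:
    """A Read chunk's length is the sum of its segment lengths."""
    return sum(s.length for s in segments)


def is_position_zero_chunk(position: int) -> bool:
    """A chunk at Position zero carries the whole RPC Call (Long Call)."""
    return position == 0
```

A Read list holding two segments at Position 8 and one segment at a
later Position therefore describes two Read chunks, and an empty Read
list yields no chunks at all.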
1401 If the RPC Call has no argument data that is DDP-eligible and the 1402 Position Zero Read Chunk is not being used, the requester leaves the 1403 Read list empty. 1405 5.3.2. Write List 1407 Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list." The 1408 Write list is a list of zero or more Write chunks, provided by the 1409 requester. Each Write chunk is an array of RDMA segments, thus the 1410 Write list is a list of counted arrays. Each Write chunk advertises 1411 receptacles for DDP-eligible data to be pushed by the responder via 1412 RDMA Write operations. If the RPC Reply has no possible DDP-eligible 1413 result data items, the requester leaves the Write list empty. 1415 *** This section needs to specify when a requester must provide Write 1416 chunks, and how many chunks must be provided. *** 1418 When a Write list is provided for the results of an RPC Call, the 1419 responder MUST provide data corresponding to DDP-eligible XDR data 1420 items via RDMA Write operations to the memory referenced in the Write 1421 list. The responder removes the data in these chunks from the 1422 reply's Payload stream. 1424 When multiple Write chunks are present, the responder fills in each 1425 Write chunk with a DDP-eligible result until either there are no more 1426 results or no more Write chunks. The requester may not be able to 1427 predict which DDP-eligible data item goes in which chunk. Thus the 1428 requester is responsible for allocating and registering Write chunks 1429 large enough to accommodate the largest XDR data item that might be 1430 associated with each chunk in the list. 1432 The RPC Reply conveys the size of result data items by returning each 1433 Write chunk to the requester with the segment lengths rewritten to 1434 match the actual data transferred. Decoding the reply therefore 1435 performs no local data copying but merely returns the length obtained 1436 from the reply. 
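As a rough sketch of the responder behavior described above, the following Python functions pair Write chunks with DDP-eligible results in order and rewrite the segment lengths to match the data placed. The function names are hypothetical. The sketch assumes that a result fills the segments of a chunk in array order and that unused segments are returned with a length of zero; neither detail is stated in this section. The treatment of Write chunks left over after the last result is not covered.

```python
from typing import List


def fill_write_chunk(segment_lengths: List[int], result_size: int) -> List[int]:
    """Return the rewritten segment lengths after placing result_size bytes.

    segment_lengths holds the lengths advertised by the requester for
    one Write chunk. Filling in array order is an assumption.
    """
    if result_size > sum(segment_lengths):
        raise ValueError("Write chunk is too small for this result")
    remaining = result_size
    rewritten = []
    for capacity in segment_lengths:
        used = min(capacity, remaining)
        rewritten.append(used)  # assumption: unused segments get length zero
        remaining -= used
    return rewritten


def fill_write_list(write_list: List[List[int]], result_sizes: List[int]) -> List[List[int]]:
    """Fill Write chunks with results until either list is exhausted."""
    # zip() stops at the shorter list: no more results or no more chunks.
    return [fill_write_chunk(chunk, size)
            for chunk, size in zip(write_list, result_sizes)]
```

The sum of the rewritten lengths in one returned chunk equals the size of the result it carries, which is how the requester learns the size of each result data item.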
1438 Each decoded result consumes one entry in the Write list, which in 1439 turn consists of an array of RDMA segments. The length of a Write 1440 chunk is therefore the sum of all returned lengths in all segments 1441 comprising the corresponding list entry. As each Write chunk is 1442 decoded, the entire Write list entry is consumed. 1444 A requester constructs the Write list for an RPC transaction before 1445 the responder has formulated its reply. When there is only one DDP- 1446 eligible result data item, the requester inserts only a single Write 1447 chunk in the Write list. If the responder populates that chunk with 1448 data, the requester knows with certainty which result data item is 1449 contained in it. 1451 However, Upper Layer Protocol procedures may allow replies where more 1452 than one result data item is DDP-eligible. For example, an NFSv4 1453 COMPOUND procedure is composed of individual NFSv4 operations, more 1454 than one of which may have a reply containing a DDP-eligible result. 1456 As stated above, when multiple Write chunks are present, the 1457 responder reduces DDP-eligible results until either there are no more 1458 results or no more Write chunks. Then, as the requester decodes the 1459 reply Payload stream, it is clear from the contents of the reply 1460 which Write chunk contains which data item. 1462 5.3.3. Reply Chunk 1464 Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk." The 1465 Reply chunk is a Write chunk, provided by the requester. The Reply 1466 chunk is a single counted array of RDMA segments. 1468 A requester MUST provide a Reply chunk whenever the maximum possible 1469 size of the reply is larger than its own inline threshold. The Reply 1470 chunk MUST be large enough to contain a Payload stream (RPC message) 1471 of this maximum size. If the actual reply Payload stream is smaller 1472 than the requester's inline threshold, the responder MAY return it as 1473 a Short message rather than using the Reply chunk.
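The Reply chunk rules above can be summarized in a small Python sketch. The function names and the returned strings are hypothetical. The sketch ignores the size of the RPC-over-RDMA header and any reduction via the Write list, and it treats a reply exactly equal to the inline threshold as fitting inline, which is an assumption. Because the responder MAY (not MUST) send a Short message, always preferring it here is a design choice of the sketch only.

```python
from typing import Optional


def requester_needs_reply_chunk(max_reply_size: int, inline_threshold: int) -> bool:
    """A Reply chunk is required when the largest possible reply cannot be sent inline."""
    return max_reply_size > inline_threshold


def choose_reply_method(actual_reply_size: int,
                        inline_threshold: int,
                        reply_chunk_size: Optional[int]) -> str:
    """Pick how the responder returns the reply Payload stream.

    reply_chunk_size is None when the requester provided no Reply chunk.
    """
    if actual_reply_size <= inline_threshold:
        return "short"        # fits in a Short message
    if reply_chunk_size is not None and actual_reply_size <= reply_chunk_size:
        return "reply_chunk"  # conveyed via RDMA Write into the Reply chunk
    return "error"            # no adequate resource was provided
```

A requester would call requester_needs_reply_chunk() while building the Call, before the size of the actual reply is known.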
1475 5.4. Memory Registration 1477 RDMA requires that data is transferred between only registered memory 1478 segments at the source and destination. All protocol headers as well 1479 as separately transferred data chunks must reside in registered 1480 memory. 1482 Since the cost of registering and de-registering memory can be a 1483 significant proportion of the RDMA transaction cost, it is important 1484 to minimize registration activity. For memory that is targeted by 1485 RDMA Send and Receive operations, a local-only registration is 1486 sufficient and can be left in place during the life of a connection 1487 without any risk of data exposure. 1489 5.4.1. Registration Longevity 1491 Data transferred via RDMA Read and Write can reside in a memory 1492 allocation not in the control of the RPC-over-RDMA transport. These 1493 memory allocations can persist outside the bounds of an RPC 1494 transaction. They are registered and invalidated as needed, as part 1495 of each RPC transaction. 1497 The requester endpoint must ensure that memory segments associated 1498 with each RPC transaction are properly fenced from responders before 1499 allowing Upper Layer access to the data contained in them. Moreover, 1500 the requester must not access these memory segments while the 1501 responder has access to them. 1503 This includes segments that are associated with canceled RPCs. A 1504 responder cannot know that the requester is no longer waiting for a 1505 reply, and might proceed to read or even update memory that the 1506 requester might have released for other use. 1508 5.4.2. Communicating DDP-Eligibility 1510 The interface by which an Upper Layer Protocol implementation 1511 communicates the eligibility of a data item locally to its local RPC- 1512 over-RDMA endpoint is not described by this specification. 1514 Depending on the implementation and constraints imposed by Upper 1515 Layer Bindings, it is possible to implement reduction transparently 1516 to upper layers. 
Such implementations may lead to inefficiencies, 1517 either because they require the RPC layer to perform expensive 1518 registration and de-registration of memory "on the fly", or they may 1519 require using RDMA chunks in reply messages, along with the resulting 1520 additional handshaking with the RPC-over-RDMA peer. 1522 However, these issues are internal and generally confined to the 1523 local interface between RPC and its upper layers, one in which 1524 implementations are free to innovate. The only requirement is that 1525 the resulting RPC-over-RDMA protocol sent to the peer is valid for 1526 the upper layer. 1528 5.4.3. Registration Strategies 1530 The choice of which memory registration strategies to employ is left 1531 to requester and responder implementers. To support the widest array 1532 of RDMA implementations, as well as the most general steering tag 1533 scheme, an Offset field is included in each segment. 1535 While zero-based offset schemes are available in many RDMA 1536 implementations, their use by RPC requires individual registration of 1537 each segment. For such implementations, this can be a significant 1538 overhead. By providing an offset in each chunk, many pre- 1539 registration or region-based registrations can be readily supported. 1540 By using a single, universal chunk representation, the RPC-over-RDMA 1541 protocol implementation is simplified to its most general form. 1543 5.5. Error Handling 1545 A receiver performs basic validity checks on the RPC-over-RDMA header 1546 and chunk contents before it passes the RPC message to the RPC 1547 consumer. If errors are detected in an RPC-over-RDMA header, an 1548 RDMA_ERROR procedure MUST be generated. Because the transport layer 1549 may not be aware of the direction of a problematic RPC message, an 1550 RDMA_ERROR procedure MAY be generated by either a requester or a 1551 responder. 
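The basic validity checks on the four fixed header fields might be sketched in Python as follows. The function name check_fixed_fields and the use of strings in place of numeric error codes are assumptions made for illustration; the error names RDMA_ERR_VERS and RDMA_ERR_BADHEADER and the handling of RDMA_MSGP and RDMA_DONE follow Sections 5.5.1, 5.5.2, and 5.6. The sketch does not examine the chunk lists.

```python
import struct

# Procedure numbers from Section 5.2.4.
RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR = range(5)


def check_fixed_fields(header: bytes):
    """Return (xid, error_name); error_name is None when the checks pass.

    The four fixed fields are XDR unsigned 32-bit integers, in order:
    XID, version, credit value, procedure number.
    """
    if len(header) < 16:
        # Too short to decode; a local condition, not a protocol error code.
        raise ValueError("truncated RPC-over-RDMA header")
    xid, vers, _credit, proc = struct.unpack(">IIII", header[:16])
    if vers != 1:
        return xid, "RDMA_ERR_VERS"
    if proc not in (RDMA_MSG, RDMA_NOMSG, RDMA_ERROR):
        # Covers unknown values and the unsupported RDMA_MSGP and RDMA_DONE.
        return xid, "RDMA_ERR_BADHEADER"
    return xid, None
```

The returned XID is the value a receiver would copy into the rdma_xid field of an RDMA_ERROR procedure.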
1553 To form an RDMA_ERROR procedure: the rdma_xid field MUST contain the 1554 same XID that was in the rdma_xid field in the failing request; the 1555 rdma_vers field MUST contain the same version that was in the 1556 rdma_vers field in the failing request; the rdma_proc field MUST 1557 contain the value RDMA_ERROR; and the rdma_err field contains a value 1558 that reflects the type of error that occurred, as described below. 1560 An RDMA_ERROR procedure indicates a permanent error. Receipt of this 1561 procedure completes the RPC transaction associated with the XID in the 1562 rdma_xid field. A receiver MUST silently discard an RDMA_ERROR 1563 procedure that cannot be decoded. 1565 5.5.1. Header Version Mismatch 1567 When a receiver detects an RPC-over-RDMA header version that it does 1568 not support (currently this document defines only Version One), it 1569 MUST reply with an RDMA_ERROR procedure and set the rdma_err value to 1570 RDMA_ERR_VERS, also providing the low and high inclusive version 1571 numbers it does, in fact, support. 1573 5.5.2. XDR Errors 1575 A receiver might encounter an XDR parsing error that prevents it from 1576 processing the incoming Transport stream. Examples of such errors 1577 include an invalid value in the rdma_proc field, an RDMA_NOMSG 1578 message that has no chunk lists, or a mismatch between the contents 1579 of the rdma_xid field and the contents of the XID field in the 1580 accompanying RPC message. If the rdma_vers field contains a 1581 recognized value, but an XDR parsing error occurs, the responder MUST 1582 reply with an RDMA_ERROR procedure and set the rdma_err value to 1583 RDMA_ERR_BADHEADER. 1585 When a responder receives a valid RPC-over-RDMA header but the 1586 responder's Upper Layer Protocol implementation cannot parse the RPC 1587 arguments in the RPC Call message, the responder SHOULD return an 1588 RPC_GARBAGEARGS reply, using an RDMA_MSG procedure.
This type of 1589 parsing failure might be due to mismatches between chunk sizes or 1590 offsets and the contents of the Payload stream, for example. A 1591 responder MAY also report the presence of a non-DDP-eligible data 1592 item in a Read or Write chunk using RPC_GARBAGEARGS. 1594 5.5.3. Responder RDMA Operational Errors 1596 In RPC-over-RDMA Version One, it is the responder which drives RDMA 1597 Read and Write operations that target the requester's memory. 1598 Problems might arise as the responder attempts to use requester- 1599 provided resources for RDMA operations. For example: 1601 o Chunks can be validated only by using their contents to form RDMA 1602 Read or Write operations. If chunk contents are invalid (say, a 1603 segment is no longer registered, or a chunk length is too long), a 1604 Remote Access error occurs. 1606 o If a requester's receive buffer is too small, the responder's Send 1607 operation completes with a Local Length Error. 1609 o If the requester-provided Reply chunk is too small to accommodate 1610 a large RPC Reply, a Remote Access error occurs. A responder can 1611 detect this problem before attempting to write past the end of the 1612 Reply chunk. 1614 RDMA operational errors are typically fatal to the connection. To 1615 avoid a retransmission loop and repeated connection loss that 1616 deadlocks the connection, once the requester has re-established a 1617 connection, the responder should send an RDMA_ERROR reply with an 1618 rdma_err value of RDMA_ERR_BADHEADER to indicate that no RPC-level 1619 reply is possible for that XID. 1621 5.5.4. Other Operational Errors 1623 While a requester is constructing a Call message, an unrecoverable 1624 problem might occur that prevents the requester from posting further 1625 RDMA Work Requests on behalf of that message. As with other 1626 transports, if a requester is unable to construct and transmit a Call 1627 message, the associated RPC transaction fails immediately. 
1629 After a requester has received a reply, if it is unable to invalidate 1630 a memory region due to an unrecoverable problem, the requester MUST 1631 close the connection to fence that memory from the responder before 1632 the associated RPC transaction is complete. 1634 While a responder is constructing a Reply message or error message, 1635 an unrecoverable problem might occur that prevents the responder from 1636 posting further RDMA Work Requests on behalf of that message. If a 1637 responder is unable to construct and transmit a Reply or error 1638 message, the responder MUST close the connection to signal to the 1639 requester that a reply was lost. 1641 5.5.5. RDMA Transport Errors 1643 The RDMA connection and physical link provide some degree of error 1644 detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer 1645 (when used over TCP), Stream Control Transmission Protocol (SCTP), as 1646 well as the InfiniBand link layer all provide Cyclic Redundancy Check 1647 (CRC) protection of the RDMA payload, and CRC-class protection is a 1648 general attribute of such transports. 1650 Additionally, the RPC layer itself can accept errors from the link 1651 level and recover via retransmission. RPC recovery can handle 1652 complete loss and re-establishment of the link. 1654 The details of reporting and recovery from RDMA link layer errors are 1655 outside the scope of this protocol specification. See Section 10 for 1656 further discussion of the use of RPC-level integrity schemes to 1657 detect errors. 1659 5.6. Protocol Elements No Longer Supported 1661 The following protocol elements are no longer supported in RPC-over- 1662 RDMA Version One. Related enum values and structure definitions 1663 remain in the RPC-over-RDMA Version One protocol for backwards 1664 compatibility. 1666 5.6.1. RDMA_MSGP 1668 The specification of RDMA_MSGP in Section 3.9 of [RFC5666] is 1669 incomplete. 
To fully specify RDMA_MSGP would require: 1671 o Updating the definition of DDP-eligibility to include data items 1672 that may be transferred, with padding, via RDMA_MSGP procedures 1674 o Adding full operational descriptions of the alignment and 1675 threshold fields 1677 o Discussing how alignment preferences are communicated between two 1678 peers without using CCP 1680 o Describing the treatment of RDMA_MSGP procedures that convey Read 1681 or Write chunks 1683 The RDMA_MSGP message type is beneficial only when the padded data 1684 payload is at the end of an RPC message's argument or result list. 1685 This is not typical for NFSv4 COMPOUND RPCs, which often include a 1686 GETATTR operation as the final element of the compound operation 1687 array. 1689 Without a full specification of RDMA_MSGP, there has been no fully 1690 implemented prototype of it. Without a complete prototype of 1691 RDMA_MSGP support, it is difficult to assess whether this protocol 1692 element has benefit, or can even be made to work interoperably. 1694 Therefore, senders MUST NOT send RDMA_MSGP procedures. When 1695 receiving an RDMA_MSGP procedure, receivers SHOULD reply with an 1696 RDMA_ERROR procedure, setting the rdma_err field to 1697 RDMA_ERR_BADHEADER. 1699 5.6.2. RDMA_DONE 1701 Because no implementation of RPC-over-RDMA Version One uses the Read- 1702 Read transfer model, there is never a need to send an RDMA_DONE 1703 procedure. 1705 Therefore, senders MUST NOT send RDMA_DONE messages. When receiving 1706 an RDMA_DONE procedure, receivers SHOULD reply with an RDMA_ERROR 1707 procedure, setting the rdma_err field to RDMA_ERR_BADHEADER. 1709 5.7. XDR Examples 1711 RPC-over-RDMA chunk lists are complex data types. In this section, 1712 illustrations are provided to help readers grasp how chunk lists are 1713 represented inside an RPC-over-RDMA header. 
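The chunk list layouts illustrated in this section can also be expressed as a short Python sketch that uses the standard struct module. The function names are hypothetical and are chosen only for this illustration. Segments are passed as (handle, length, offset) tuples and Read segments as (position, handle, length, offset) tuples.

```python
import struct


def encode_segment(handle: int, length: int, offset: int) -> bytes:
    """One RDMA segment: 32-bit handle, 32-bit length, 64-bit offset."""
    return struct.pack(">IIQ", handle, length, offset)


def encode_read_list(read_segments) -> bytes:
    """Optional-data list: a one and a Read segment per entry, then a zero."""
    out = b""
    for position, handle, length, offset in read_segments:
        out += struct.pack(">II", 1, position) + encode_segment(handle, length, offset)
    return out + struct.pack(">I", 0)


def encode_write_chunk(segments) -> bytes:
    """Counted array: a segment count followed by that many segments."""
    body = b"".join(encode_segment(*seg) for seg in segments)
    return struct.pack(">I", len(segments)) + body


def encode_write_list(chunks) -> bytes:
    """Optional-data list of counted arrays, terminated by a zero."""
    out = b"".join(struct.pack(">I", 1) + encode_write_chunk(c) for c in chunks)
    return out + struct.pack(">I", 0)


def encode_reply_chunk(segments) -> bytes:
    """A zero when absent (None), otherwise a one followed by a Write chunk."""
    if segments is None:
        return struct.pack(">I", 0)
    return struct.pack(">I", 1) + encode_write_chunk(segments)
```

With no chunks at all, concatenating encode_read_list([]), encode_write_list([]), and encode_reply_chunk(None) produces three 32-bit zero words.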
1715 An RDMA segment is the simplest component, being made up of a 32-bit 1716 handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once 1717 flattened into an XDR stream, RDMA segments appear as 1719 HLOO 1721 A Read segment has an additional 32-bit position field. Read 1722 segments appear as 1724 PHLOO 1726 A Read chunk is a list of Read segments. Each segment is preceded by 1727 a 32-bit word containing a one if there is a segment, or a zero if 1728 there are no more segments (optional-data). In XDR form, this would 1729 look like 1730 1 PHLOO 1 PHLOO 1 PHLOO 0 1732 where P would hold the same value for each segment belonging to the 1733 same Read chunk. 1735 The Read List is also a list of Read segments. In XDR form, this 1736 would look like a Read chunk, except that the P values could vary 1737 across the list. An empty Read List is encoded as a single 32-bit 1738 zero. 1740 One Write chunk is a counted array of segments. In XDR form, the 1741 count would appear as the first 32-bit word, followed by an HLOO for 1742 each element of the array. For instance, a Write chunk with three 1743 elements would look like 1745 3 HLOO HLOO HLOO 1747 The Write List is a list of counted arrays. In XDR form, this is a 1748 combination of optional-data and counted arrays. To represent a 1749 Write List containing a Write chunk with three segments and a Write 1750 chunk with two segments, XDR would encode 1752 1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0 1754 An empty Write List is encoded as a single 32-bit zero. 1756 The Reply chunk is a Write chunk. Since it is an optional-data 1757 field, however, there is a 32-bit field in front of it that contains 1758 a one if the Reply chunk is present, or a zero if it is not. After 1759 encoding, a Reply chunk with two segments would look like 1761 1 2 HLOO HLOO 1763 Frequently a requester does not provide any chunks.
In that case, 1764 after the four fixed fields in the RPC-over-RDMA header, there are 1765 simply three 32-bit fields that contain zero. 1767 6. RPC Bind Parameters 1769 In setting up a new RDMA connection, the first action by a requester 1770 is to obtain a transport address for the responder. The mechanism 1771 used to obtain this address and to open an RDMA connection 1772 depends on the type of RDMA transport and is the responsibility of 1773 each RPC protocol binding and its local implementation. 1775 RPC services normally register with a portmap or rpcbind [RFC1833] 1776 service, which associates an RPC Program number with a service 1777 address. (In the case of UDP or TCP, the service address for NFS is 1778 normally port 2049.) This policy is no different with RDMA 1779 transports, although it may require the allocation of port numbers 1780 appropriate to each Upper Layer Protocol that uses the RPC framing 1781 defined here. 1783 When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses 1784 IP port addressing due to its layering on TCP and/or SCTP, port 1785 mapping is trivial and consists merely of issuing the port in the 1786 connection process. The NFS/RDMA protocol service address has been 1787 assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP. 1789 When mapped atop InfiniBand [IB], which uses a Group Identifier 1790 (GID)-based service endpoint naming scheme, a translation MUST be 1791 employed. One such translation is defined in the InfiniBand Port 1792 Addressing Annex [IBPORT], which is appropriate for translating IP 1793 port addressing to the InfiniBand network. Therefore, in this case, 1794 IP port addressing may be readily employed by the upper layer.
1796 When a mapping standard or convention exists for IP ports on an RDMA 1797 interconnect, there are several possibilities for each upper layer to 1798 consider: 1800 o One possibility is to have the responder register its mapped IP port 1801 with the rpcbind service, under the netid (or netids) defined 1802 here. An RPC-over-RDMA-aware requester can then resolve its 1803 desired service to a mappable port, and proceed to connect. This 1804 is the most flexible and compatible approach, for those upper 1805 layers that are defined to use the rpcbind service. 1807 o A second possibility is to have the responder's portmapper 1808 register itself on the RDMA interconnect at a "well known" service 1809 address (on UDP or TCP, this corresponds to port 111). A 1810 requester could connect to this service address and use the 1811 portmap protocol to obtain a service address in response to a 1812 program number, e.g., an iWARP port number, or an InfiniBand GID. 1814 o Alternatively, the requester could simply connect to the mapped 1815 well-known port for the service itself, if it is appropriately 1816 defined. By convention, the NFS/RDMA service, when operating atop 1817 such an InfiniBand fabric, will use the same 20049 assignment as 1818 for iWARP. 1820 Historically, different RPC protocols have taken different approaches 1821 to their port assignment; therefore, the specific method is left to 1822 each RPC-over-RDMA-enabled Upper Layer binding, and is not addressed 1823 here. 1825 In Section 11, this specification defines two new "netid" values, to 1826 be used for registration of upper layers atop iWARP [RFC5040] 1827 [RFC5041] and (when a suitable port translation service is available) 1828 InfiniBand [IB]. Additional RDMA-capable networks MAY define their 1829 own netids, or if they provide a port translation, MAY share the one 1830 defined here. 1832 7. Bi-Directional RPC-Over-RDMA 1834 7.1. RPC Direction 1836 7.1.1.
Forward Direction 1838 A traditional ONC RPC client is always a requester. A traditional 1839 ONC RPC service is always a responder. This traditional form of ONC 1840 RPC message passing is referred to as operation in the "forward 1841 direction." 1843 During forward direction operation, the ONC RPC client is responsible 1844 for establishing transport connections. 1846 7.1.2. Backward Direction 1848 The ONC RPC standard does not forbid passing messages in the other 1849 direction. An ONC RPC service endpoint can act as a requester, in 1850 which case an ONC RPC client endpoint acts as a responder. This form 1851 of message passing is referred to as operation in the "backward 1852 direction." 1854 During backward direction operation, the ONC RPC client is 1855 responsible for establishing transport connections, even though ONC 1856 RPC Calls come from the ONC RPC server. 1858 7.1.3. Bi-direction 1860 A pair of endpoints may choose to use only forward or only backward 1861 direction operations on a particular transport. Or, the endpoints 1862 may send operations in both directions concurrently on the same 1863 transport. 1865 Bi-directional operation occurs when both transport endpoints act as 1866 a requester and a responder at the same time. As above, the ONC RPC 1867 client is responsible for establishing transport connections. 1869 7.1.4. XIDs with Bi-direction 1871 During bi-directional operation, the forward and backward directions 1872 use independent xid spaces. 1874 In other words, a forward direction requester MAY use the same xid 1875 value at the same time as a backward direction requester on the same 1876 transport connection, but such concurrent requests represent distinct 1877 ONC RPC transactions. 1879 7.2. Backward Direction Flow Control 1881 7.2.1. Backward RPC-over-RDMA Credits 1883 Credits work the same way in the backward direction as they do in the 1884 forward direction. 
However, forward direction credits and backward 1885 direction credits are accounted separately. 1887 In other words, the forward direction credit value is the same 1888 whether or not there are backward direction resources associated with 1889 an RPC-over-RDMA transport connection. The backward direction credit 1890 value MAY be different than the forward direction credit value. The 1891 rdma_credit field in a backward direction RPC-over-RDMA message MUST 1892 NOT contain the value zero. 1894 A backward direction requester (an RPC-over-RDMA service endpoint) 1895 requests credits from the responder (an RPC-over-RDMA client 1896 endpoint). The responder reports how many credits it can grant. 1897 This is the number of backward direction Calls the responder is 1898 prepared to handle at once. 1900 When an RPC-over-RDMA server endpoint is operating correctly, it 1901 sends no more outstanding requests at a time than the client 1902 endpoint's advertised backward direction credit value. 1904 7.2.2. Receive Buffer Management 1906 An RPC-over-RDMA transport endpoint must pre-post receive buffers 1907 before it can receive and process incoming RPC-over-RDMA messages. 1908 If a sender transmits a message for a receiver which has no posted 1909 receive buffer, the RDMA provider MAY drop the RDMA connection. 1911 7.2.2.1. Client Receive Buffers 1913 Typically an RPC-over-RDMA caller posts only as many receive buffers 1914 as there are outstanding RPC Calls. A client endpoint without 1915 backward direction support might therefore at times have no pre- 1916 posted receive buffers. 1918 To receive incoming backward direction Calls, an RPC-over-RDMA client 1919 endpoint must pre-post enough additional receive buffers to match its 1920 advertised backward direction credit value. Each outstanding forward 1921 direction RPC requires an additional receive buffer above this 1922 minimum. 
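The receive buffer arithmetic for a client endpoint can be written out as a minimal Python sketch. The function name client_receive_buffers is hypothetical and is used only for illustration.

```python
def client_receive_buffers(backward_credits: int, outstanding_forward_calls: int) -> int:
    """Number of receive buffers a client endpoint needs posted.

    backward_credits is the advertised backward direction credit value
    (zero for a client without backward direction support). Each
    outstanding forward direction Call needs one more buffer for its Reply.
    """
    if backward_credits < 0 or outstanding_forward_calls < 0:
        raise ValueError("counts must not be negative")
    return backward_credits + outstanding_forward_calls
```

For a client without backward direction support and no outstanding Calls, the result is zero, matching the observation that such a client might at times have no pre-posted receive buffers.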
1924 When an RDMA transport connection is lost, all active receive buffers 1925 are flushed and are no longer available to receive incoming messages. 1926 When a fresh transport connection is established, a client endpoint 1927 must re-post a receive buffer to handle the Reply for each 1928 retransmitted forward direction Call, and a full set of receive 1929 buffers to handle backward direction Calls. 1931 7.2.2.2. Server Receive Buffers 1933 A forward direction RPC-over-RDMA service endpoint posts as many 1934 receive buffers as it expects incoming forward direction Calls. That 1935 is, it posts no fewer buffers than the number of RPC-over-RDMA 1936 credits it advertises in the rdma_credit field of forward direction 1937 RPC replies. 1939 To receive incoming backward direction replies, an RPC-over-RDMA 1940 server endpoint must pre-post a receive buffer for each backward 1941 direction Call it sends. 1943 When the existing transport connection is lost, all active receive 1944 buffers are flushed and are no longer available to receive incoming 1945 messages. When a fresh transport connection is established, a server 1946 endpoint must re-post a receive buffer to handle the Reply for each 1947 retransmitted backward direction Call, and a full set of receive 1948 buffers for receiving forward direction Calls. 1950 7.3. Conventions For Backward Operation 1952 7.3.1. In the Absence of Backward Direction Support 1954 An RPC-over-RDMA transport endpoint might not support backward 1955 direction operation. There might be no mechanism in the transport 1956 implementation to do so, or the Upper Layer Protocol consumer might 1957 not yet have configured the transport to handle backward direction 1958 traffic. 1960 A loss of the RDMA connection may result if the receiver is not 1961 prepared to receive an incoming message.
Thus a denial-of-service 1962 could result if a sender continues to send backchannel messages after 1963 every transport reconnect to an endpoint that is not prepared to 1964 receive them. 1966 For RPC-over-RDMA Version One transports, the Upper Layer Protocol is 1967 responsible for informing its peer when it has established a backward 1968 direction capability. Otherwise even a simple backward direction 1969 NULL probe from a peer would result in a lost connection. 1971 An Upper Layer Protocol consumer MUST NOT perform backward direction 1972 ONC RPC operations unless the peer consumer has indicated it is 1973 prepared to handle them. A description of Upper Layer Protocol 1974 mechanisms used for this indication is outside the scope of this 1975 document. 1977 7.3.2. Backward Direction Retransmission 1979 In rare cases, an ONC RPC transaction cannot be completed within a 1980 certain time. This can be because the transport connection was lost, 1981 the Call or Reply message was dropped, or because the Upper Layer 1982 consumer delayed or dropped the ONC RPC request. Typically, the 1983 requester sends the transaction again, reusing the same RPC XID. 1984 This is known as an "RPC retransmission". 1986 In the forward direction, the Caller is the ONC RPC client. The 1987 client is always responsible for establishing a transport connection 1988 before sending again. 1990 In the backward direction, the Caller is the ONC RPC server. Because 1991 an ONC RPC server does not establish transport connections with 1992 clients, it cannot send a retransmission if there is no transport 1993 connection. It must wait for the ONC RPC client to re-establish the 1994 transport connection before it can retransmit ONC RPC transactions in 1995 the backward direction. 1997 If an ONC RPC client has no work to do, it may be some time before it 1998 re-establishes a transport connection. 
Backward direction Callers 1999 must be prepared to wait indefinitely for a connection to be 2000 established before a pending backward direction ONC RPC Call can be 2001 retransmitted. 2003 7.3.3. Backward Direction Message Size 2005 RPC-over-RDMA backward direction messages are transmitted and 2006 received using the same buffers as messages in the forward direction. 2007 Therefore they are constrained to be no larger than receive buffers 2008 posted for forward messages. 2010 It is expected that the Upper Layer Protocol consumer establishes an 2011 appropriate payload size limit for backward direction operations, 2012 either by advertising that size limit to its peers, or by convention. 2013 If that is done, backward direction messages do not exceed the size 2014 of receive buffers at either endpoint. 2016 If a sender transmits a backward direction message that is larger 2017 than the receiver is prepared for, the RDMA provider drops the 2018 message and the RDMA connection. 2020 7.3.4. Sending A Backward Direction Call 2022 To form a backward direction RPC-over-RDMA Call message on an RPC- 2023 over-RDMA Version One transport, an ONC RPC service endpoint 2024 constructs an RPC-over-RDMA header containing a fresh RPC XID in the 2025 rdma_xid field. 2027 The rdma_vers field MUST contain the value one. The number of 2028 requested credits is placed in the rdma_credit field. 2030 The rdma_proc field in the RPC-over-RDMA header MUST contain the 2031 value RDMA_MSG. All three chunk lists MUST be empty. 2033 The ONC RPC Call header MUST follow immediately, starting with the 2034 same XID value that is present in the RPC-over-RDMA header. The Call 2035 header's msg_type field MUST contain the value CALL. 2037 7.3.5.
Sending A Backward Direction Reply 2039 To form a backward direction RPC-over-RDMA Reply message on an RPC- 2040 over-RDMA Version One transport, an ONC RPC client endpoint 2041 constructs an RPC-over-RDMA header containing a copy of the matching 2042 ONC RPC Call's RPC XID in the rdma_xid field. 2044 The rdma_vers field MUST contain the value one. The number of 2045 granted credits is placed in the rdma_credit field. 2047 The rdma_proc field in the RPC-over-RDMA header MUST contain the 2048 value RDMA_MSG. All three chunk lists MUST be empty. 2050 The ONC RPC Reply header MUST follow immediately, starting with the 2051 same XID value that is present in the RPC-over-RDMA header. The 2052 Reply header's msg_type field MUST contain the value REPLY. 2054 7.4. Backward Direction Upper Layer Binding 2056 RPC programs that operate on RPC-over-RDMA Version One only in the 2057 backward direction do not require an Upper Layer Binding 2058 specification. Because RPC-over-RDMA Version One operation in the 2059 backward direction does not allow reduction, there can be no DDP- 2060 eligible data items in such a program. Backward direction operation 2061 occurs on an already-established connection, thus there is no need to 2062 specify RPC bind parameters. 2064 8. Upper Layer Binding Specifications 2066 An Upper Layer Protocol is typically defined independently of any 2067 particular RPC transport. An Upper Layer Binding specification (ULB) 2068 provides guidance that helps the Upper Layer Protocol interoperate 2069 correctly and efficiently over a particular transport. 
For RPC-over- 2070 RDMA Version One, a ULB provides: 2072 o A taxonomy of XDR data items that are eligible for Direct Data 2073 Placement 2075 o A method for determining the maximum size of the reply Payload 2076 stream for all procedures in the Upper Layer Protocol 2078 o An rpcbind port assignment for operation of the RPC Program and 2079 Version on an RPC-over-RDMA transport 2081 Each RPC Program and Version tuple that utilizes RPC-over-RDMA 2082 Version One needs to have an Upper Layer Binding specification. 2083 Requesters MUST NOT send RPC-over-RDMA messages for Upper Layer 2084 Protocols that do not have an Upper Layer Binding. Responders MUST 2085 NOT reply to RPC-over-RDMA messages for Upper Layer Protocols that do 2086 not have an Upper Layer Binding. 2088 8.1. DDP-Eligibility 2090 An Upper Layer Binding designates some XDR data items as eligible for 2091 Direct Data Placement. As an RPC-over-RDMA message is formed, DDP- 2092 eligible data items can be removed from the Payload stream and placed 2093 directly in the receiver's memory (reduced). 2095 An XDR data item should be considered for DDP-eligibility if there is 2096 a clear benefit to moving the contents of the item directly from the 2097 sender's memory to the receiver's memory. Criteria for DDP- 2098 eligibility include: 2100 o The XDR data item is frequently sent or received, and its size is 2101 often much larger than typical inline thresholds. 2103 o Transport-level processing of the XDR data item is not needed. 2104 For example, the data item is an opaque byte array, which requires 2105 no XDR encoding and decoding of its content. 2107 o The content of the XDR data item is sensitive to address 2108 alignment. For example, pullup would be required on the receiver 2109 before the content of the item can be used. 2111 o The XDR data item does not contain DDP-eligible data items. 2113 Senders MUST NOT reduce data items that are not DDP-eligible.
   Such data items MAY, however, be moved as part of a Position Zero
   Read Chunk or a Reply chunk.

   The interface by which an Upper Layer implementation indicates the
   DDP-eligibility of a data item to the RPC transport is not described
   by this specification. The only requirements are that the receiver
   can re-assemble the transmitted RPC-over-RDMA message into a valid
   XDR stream, and that DDP-eligibility rules specified by the Upper
   Layer Binding are respected.

   There is no provision to express DDP-eligibility within the XDR
   language. The only definitive specification of DDP-eligibility is
   the Upper Layer Binding itself.

8.1.1.  DDP-Eligibility Violation

   A DDP-eligibility violation occurs when a requester forms a Call
   message with a non-DDP-eligible data item in a Read chunk. A
   violation also occurs when a responder forms a Reply message without
   reducing a DDP-eligible data item when there is a Write list provided
   by the requester.

   In the first case, a responder MUST NOT process the Call message.

   In the second case, as a requester parses a Reply message, it must
   assume that the responder has correctly reduced a DDP-eligible result
   data item. If the responder has not done so, it is likely that the
   requester cannot finish parsing the Payload stream and that an XDR
   error would result.

   Both types of violations MUST be reported as described in
   Section 5.5.2.

8.2.  Maximum Reply Size

   A requester provides resources for both a Call message and its
   matching Reply message. A requester forms the Call message itself,
   thus can compute the exact resources needed for it.

   A requester must allocate resources for the Reply message (an RPC-
   over-RDMA credit, a Receive buffer, and possibly a Write list and
   Reply chunk) before the responder has formed the actual reply.
   To accommodate all possible replies for the procedure in the Call
   message, a requester must allocate reply resources based on the
   maximum possible size of the expected Reply message.

   If there are procedures in the Upper Layer Protocol for which there
   is no clear reply size maximum, the Upper Layer Binding needs to
   specify a dependable means for determining the maximum.

8.3.  Additional Considerations

   There may be other details provided in an Upper Layer Binding.

   o  An Upper Layer Binding may recommend an inline threshold value or
      other transport-related parameters for RPC-over-RDMA Version One
      connections bearing that Upper Layer Protocol.

   o  An Upper Layer Protocol may provide a means to communicate these
      transport-related parameters between peers. Note that RPC-over-
      RDMA Version One does not specify any mechanism for changing any
      transport-related parameter after a connection has been
      established.

   o  Multiple Upper Layer Protocols may share a single RPC-over-RDMA
      Version One connection when their Upper Layer Bindings allow the
      use of RPC-over-RDMA Version One and the rpcbind port assignments
      for the Protocols allow connection sharing. In this case, the
      same transport parameters (such as inline threshold) apply to all
      Protocols using that connection.

   Given the above, Upper Layer Bindings and Upper Layer Protocols must
   be designed to interoperate correctly no matter what connection
   parameters are in effect on a connection.

8.4.  Upper Layer Protocol Extensions

   An RPC Program and Version tuple may be extensible. For instance,
   there may be a minor versioning scheme that is not reflected in the
   RPC version number. Or, the Upper Layer Protocol may allow
   additional features to be specified after the original RPC program
   specification was ratified.

   Upper Layer Bindings are provided for interoperable RPC Programs and
   Versions by extending existing Upper Layer Bindings to reflect the
   changes made necessary by each addition to the existing XDR.

9.  Protocol Extensibility

   The RPC-over-RDMA header format is specified using XDR, unlike the
   message header format of RPC on TCP. Defining the header using XDR
   allows minor issues with the transport protocol to be addressed and
   optional features to be introduced by making extensions to the RPC-
   over-RDMA header XDR. Such changes can be made without a change to
   the protocol version number.

   When more invasive changes to the protocol are to be made, a protocol
   version number change is required. In either case, any changes to
   the RPC-over-RDMA protocol can only be effected by publication of a
   Standards Track document with appropriate review by the nfsv4 Working
   Group and the IESG.

   Although it is possible to make XDR changes which are not limited to
   the use of compatible extensions in new RPC-over-RDMA versions, such
   changes should only be done when absolutely necessary, as they limit
   interoperability with existing implementations. It is appropriate
   for the nfsv4 Working Group to consider alternatives carefully before
   using this approach.

   Unlike the rest of this document, which defines the base of RPC-over-
   RDMA Version One, Section 9 (except for Section 9.4) applies to all
   versions of RPC-over-RDMA. New versions of RPC-over-RDMA may be
   published as separate protocols without updating this document, but
   any change to the extensibility model defined here requires
   publication of a Standards Track document updating this document.

9.1.  Changes To RPC-Over-RDMA Header XDR

   The first four fields in the RPC-over-RDMA header (now in struct
   rpcrdma_prefix) must remain aligned at the same fixed offsets for all
   versions of the RPC-over-RDMA protocol. The version number must be
   in a fixed place in order to enable version mismatch detection. For
   version mismatches to be reported in a fashion that all future
   version implementations can reliably decode, the rdma_which field
   must be in a fixed place, the value of RDMA_ERR_VERS must always
   remain the same, and the field placement of the RDMA_ERR_VERS arm of
   the rpcrdma1_error union (now in struct rpcrdma_err_vers) must always
   remain the same.

   Given these constraints, one way to extend RPC-over-RDMA is to add
   new values to the rdma_proc enumerated type and new components (arms)
   to the rpcrdma1_body union. New argument and result types may be
   introduced for each new procedure defined this way. These extensions
   would be specified by new Internet Drafts with appropriate Working
   Group and IESG review to ensure continued interoperation with
   existing implementations.

   XDR extensions may introduce only optional features to an existing
   RPC-over-RDMA protocol version. To detect when an optional rdma_proc
   value is supported by a receiver, it is desirable to have a specific
   value of the rdma_err field, say, RDMA_ERR_PROC, that indicates when
   the receiver does not recognize an rdma_proc value.

   In RPC-over-RDMA Version One, a receiver can indicate that it does
   not recognize an rdma_proc enum value only by returning an RDMA_ERROR
   procedure with the rdma_err field set to RDMA_ERR_BADHEADER (see
   Section 5.5.2). This is indistinguishable from a situation where the
   receiver does indeed support the procedure, but the XDR is malformed.

   To resolve this problem, an RPC-over-RDMA Version One sender uses the
   following convention.
   If the first time the sender uses an optional rdma_proc value the
   receiver returns an RDMA_ERROR procedure with RDMA_ERR_BADHEADER in
   the rdma_err field, the sender simply marks that feature as
   unsupported and does not send it again on the current connection
   instance. Subsequent to an initial successful result, receiving
   RDMA_ERR_BADHEADER retains its more relaxed meaning of "generic XDR
   parsing error."

   To ensure backwards compatibility when such an extension mechanism is
   in place, the value of RDMA_ERR_BADHEADER must remain the same for
   all versions of the RPC-over-RDMA protocol.

   Most changes to the RPC-over-RDMA XDR will take the form of a
   compatible extension to the existing XDR. Changes which do not
   update the version number (see Section 9.3) must take this form.

   For an XDR description B to be a compatible extension of an XDR
   description A, the following must be the case:

   o  All input recognized as valid by description A must be recognized
      as valid by description B

   o  Any input recognized as valid by both descriptions must be
      interpreted as having the same structure according to both
      descriptions

   o  Any input recognized as valid by description B but not by
      description A can be recognized as part of an unsupported or
      unknown extension using description A

   The following changes can be made compatibly:

   o  Addition of a new message header type and associated header body

   o  Addition of new enum values and associated arms to unions that do
      not include a default case

   o  Addition of previously undefined flag bits to flag words that are
      included in existing header bodies

   Each such addition is referred to as a "protocol element." A set of
   protocol elements defined together such that all must be supported or
   not supported by a receiver is called a "feature."
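
   The sender-side convention described earlier in this section, under
   which a first RDMA_ERR_BADHEADER response to an optional rdma_proc
   value marks that value as unsupported for the current connection
   instance, might be tracked as in the following minimal Python sketch.
   The class and method names are hypothetical and are not part of this
   specification; rdma_proc values are treated as opaque integers.

```python
from enum import Enum


class Support(Enum):
    UNKNOWN = "unknown"          # optional rdma_proc value not yet tried
    SUPPORTED = "supported"      # at least one use has succeeded
    UNSUPPORTED = "unsupported"  # first use drew RDMA_ERR_BADHEADER


class ConnectionFeatures:
    """Hypothetical per-connection record of optional rdma_proc support.

    One instance is assumed to live exactly as long as one connection
    instance; a new connection starts again with every value UNKNOWN.
    """

    def __init__(self):
        self._state = {}

    def status(self, proc):
        return self._state.get(proc, Support.UNKNOWN)

    def may_send(self, proc):
        # An rdma_proc value marked unsupported is not sent again
        # on this connection instance.
        return self.status(proc) is not Support.UNSUPPORTED

    def on_success(self, proc):
        # Any successful result establishes that the value is supported.
        self._state[proc] = Support.SUPPORTED

    def on_badheader(self, proc):
        """Interpret RDMA_ERR_BADHEADER received in response to proc."""
        if self.status(proc) is Support.SUPPORTED:
            # After an initial success, the error keeps its generic
            # "XDR parsing error" meaning.
            return "xdr-error"
        # Before any success, treat the error as "feature not supported".
        self._state[proc] = Support.UNSUPPORTED
        return "unsupported"
```

   In this sketch, a sender would consult may_send() before using an
   optional rdma_proc value and call on_success() or on_badheader() when
   the matching response arrives.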

   Because of the simplicity of the existing protocol and deficiencies
   in the existing error reporting structure, some of the above
   techniques are not realizable within RPC-over-RDMA Version One. For
   a discussion of protocol extension practices within RPC-over-RDMA
   Version One, including XDR extension, see Section 9.4.

9.2.  Feature Statuses With RPC-Over-RDMA Versions

   Within a given RPC-over-RDMA version, every known feature is either
   OPTIONAL, REQUIRED, or "not allowed".

   o  REQUIRED features MUST be supported by all receivers. Senders can
      depend on them being supported.

   o  OPTIONAL features MAY be supported by particular receivers.
      Senders need to be prepared for the absence of support.

   o  "Not allowed" features are typically those that were formerly
      OPTIONAL or REQUIRED, but for which support has been removed.

   All features defined in this document are REQUIRED in RPC-over-RDMA
   Version One. OPTIONAL features may be added to Version One as
   specified in Section 9.4.

   The terms "OPTIONAL" and "REQUIRED" are used as specified in
   [RFC2119] as indicated in Section 1.1. These status values are
   assigned by those writing additional specifications (e.g., new RPC-
   over-RDMA versions or extensions to existing RPC-over-RDMA versions).
   Their choice in this regard is their guidance to implementers. As
   used in this document, these terms are only directed to implementers
   of RPC-over-RDMA Version One.

   The status of features may change between RPC-over-RDMA protocol
   versions.

9.3.  RPC-Over-RDMA Version Numbering

   RPC-over-RDMA version numbering enables both endpoints to agree on a
   set of interoperable behaviors and determine which OPTIONAL features
   are available.

   An expected pattern of protocol development is to introduce OPTIONAL
   features within a given version using XDR extension.
   Such features often need a significant period of optional general use
   to ensure they are capable of being implemented broadly. This is
   especially true for infrastructural features that others will build
   upon. When it is appropriate for OPTIONAL features to become
   REQUIRED, that would be an occasion to create a new RPC-over-RDMA
   protocol version.

   The value of the RPC-over-RDMA header's version field has to be
   updated when the protocol is altered in a way that prevents
   interoperability with current implementations. A version change is
   needed whenever:

   o  The RPC-over-RDMA header XDR definition is changed to add a
      REQUIRED protocol element, or an existing OPTIONAL feature is made
      REQUIRED

   o  A REQUIRED feature is made OPTIONAL

   o  A REQUIRED or OPTIONAL feature is converted to be "not allowed"

   o  An XDR change is made that is not a compatible extension as
      defined in Section 9.1

   o  The use of a previously not used abstract RDMA operation is
      specified as REQUIRED

   o  The use of an existing REQUIRED abstract RDMA operation is removed

   When a version number change is to be made, the nfsv4 Working Group
   creates a Standards Track document that does one of the following:

   1.  Documents the whole protocol as amended

   2.  Documents changes relative to the previous minor version

   3.  Documents extensions made since the previous minor versions by
       normatively referencing the documents defining those extensions

   4.  Documents all REQUIRED functionality, and includes OPTIONAL
       features by normatively referencing the documents defining those
       extensions

   The Working Group retains all these options, but the last is
   typically preferred. When an XDR change that is not a compatible
   extension is made, the first is most desirable.
   In any case, if there are features whose status has been changed to
   "not allowed", the document needs to explain that change and how it
   is intended that existing implementations address the feature
   removal.

9.4.  RPC-Over-RDMA Version One Extension Practices

   This subsection applies primarily to RPC-over-RDMA Version One but
   remains in effect unless modified by documents defining future RPC-
   over-RDMA versions. Such documents need not update this document.

9.4.1.  Documentation Requirements

   RPC-over-RDMA Version One may be extended by defining a new message
   header type and XDR description of the corresponding header body.

   A set of such new protocol elements may be introduced by a Standards
   Track document and are together considered an OPTIONAL feature.
   nfsv4 Working Group and IESG review, together with appropriate
   testing of prototype implementations, should ensure continued
   interoperation with existing implementations.

   Documents describing extensions to RPC-over-RDMA Version One should
   contain:

   o  An explanation of the purpose and use of each new protocol element

   o  An XDR description and a script to extract it

   o  A receiver response that a sender can use to determine that
      support is in fact present

   o  A description of interactions with existing features (e.g., any
      requirement that another OPTIONAL or REQUIRED feature needs to be
      present and supported for the new feature to work)

   Implementers concatenate the XDR description of the new feature with
   the XDR description of the base protocol, extracted from this
   document, to produce a combined XDR description for the RPC-over-RDMA
   Version One protocol with the specified extension.

9.4.2.  Detecting Support For Message Header Types

   A sender determines whether a receiver supports an OPTIONAL message
   header type by issuing a simple test request using that message
   header type. The receiver sends an affirmative response that
   indicates the message header type is supported. The response message
   header type may itself be an extension. The sender ties together the
   message and response using the rdma_xid field.

   The receiver indicates that it does not recognize a particular
   rdma_which value by returning an RDMA_ERROR message type with the
   rdma_err field set to RDMA_ERR_BADHEADER and with the rdma_xid field
   set to a value that matches the test message.

   This is indistinguishable from a situation where the receiver does
   support the procedure but the test message is malformed. However, if
   the sender always tests for receiver support using a simple instance
   of the message header type to be tested, such an error at this point
   indicates the sender and receiver have no prospect of using the new
   protocol element interoperably. A lack of support for this feature
   can be reasonably assumed.

   A sender should issue OPTIONAL message header types one-at-a-time
   until it receives indication of the receiver's support status of that
   message header type.

10.  Security Considerations

10.1.  Memory Protection

   A primary consideration is the protection of the integrity and
   privacy of local memory by an RPC-over-RDMA transport. The use of
   RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory
   contents, nor to memory owned by user processes.

   It is REQUIRED that any RDMA provider used for RPC transport be
   conformant to the requirements of [RFC5042] in order to satisfy these
   protections. These protections are provided by the RDMA layer
   specifications, and in particular, their security models.

10.1.1.  Protection Domains

   The use of Protection Domains to limit the exposure of memory
   segments to a single connection is critical. Any attempt by an
   endpoint not participating in that connection to re-use memory
   handles needs to result in immediate failure of that connection.
   Because Upper Layer Protocol security mechanisms rely on this aspect
   of Reliable Connection behavior, strong authentication of remote
   endpoints is recommended.

10.1.2.  Handle Predictability

   Unpredictable memory handles should be used for any operation
   requiring advertised memory segments. Advertising a continuously
   registered memory region allows a remote host to read or write to
   that region even when an RPC involving that memory is not under way.
   Therefore implementations should avoid advertising persistently
   registered memory.

10.1.3.  Memory Fencing

   Advertised memory segments should be invalidated as soon as related
   RPC operations are complete. Invalidation and DMA unmapping of
   segments should be complete before the Upper Layer is allowed to
   continue execution and use or alter the contents of a memory region.

10.2.  RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework
   [RFC2203]. RPCSEC_GSS implements message authentication, per-message
   integrity checking, and per-message confidentiality. However,
   integrity and privacy services require significant movement of data
   on each endpoint host. Some performance benefits enabled by RDMA
   transports can be lost. Note that some performance loss is expected
   when RPCSEC_GSS integrity or privacy is in use on any RPC transport.

10.2.1.  RPC-Over-RDMA Link-Level Protection

   Link-level protection is a more appropriate security mechanism for
   RDMA transports.
   Certain configurations of IPsec can be co-located in RDMA hardware,
   for example, without any change to RDMA consumers or loss of data
   movement efficiency.

   The use of link-level protection MAY be negotiated through the use of
   the RPCSEC_GSS security flavor defined in [RFC5403] in conjunction
   with the Channel Binding mechanism [RFC5056] and IPsec Channel
   Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED
   where integrity and/or privacy is desired and where efficiency is
   required.

10.2.2.  RPCSEC_GSS On RPC-Over-RDMA Transports

   RPCSEC_GSS [RFC5403] extends the ONC RPC protocol [RFC5531] without
   changing the format of RPC messages. By observing the conventions
   described in this section, an RPC-over-RDMA implementation can
   support RPCSEC_GSS in a way that interoperates successfully with
   other implementations.

   As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that
   appear in the Payload stream of an RPC-over-RDMA message (such as
   control messages exchanged as part of establishing or destroying a
   security context, or data items that are part of RPCSEC_GSS
   authentication material) MUST NOT be reduced.

10.2.2.1.  RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to
   establish a GSS context for NFS operation. These clients use TCP and
   the standard NFS port (2049) for context establishment, but there is
   no guarantee that an NFS/RDMA server provides a TCP-based NFS server
   on port 2049.

10.2.2.2.  RPC-Over-RDMA With RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the DDP-
   eligibility of data items in an Upper Layer Protocol.

   However, RPCSEC_GSS authentication material appearing in an RPC
   message header is often larger than material associated with, say,
   the AUTH_SYS security flavor.
   In particular, when an RPCSEC_GSS pseudoflavor is in use, a requester
   needs to accommodate a larger RPC credential when marshaling Call
   messages, and to provide for a maximum size RPCSEC_GSS verifier when
   allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are made larger as a result.
   Upper Layer Protocol operations that fit in a Short Message when a
   simpler form of authentication is in use might need to be reduced or
   conveyed via a Long Message when RPCSEC_GSS authentication is in use.
   This can impact efficiency when RPCSEC_GSS authentication is in use.

   Because average RPC message size is larger when RPCSEC_GSS
   authentication is in use, it is more likely that a requester will
   provide both a Read list and a Reply chunk in the same RPC-over-RDMA
   header to convey a Long call and provision a receptacle for a Long
   reply.

10.2.2.3.  RPC-Over-RDMA With RPCSEC_GSS Integrity Or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect
   modification of RPC messages in flight. The RPCSEC_GSS privacy
   service prevents all but the intended recipient from viewing the
   cleartext content of RPC messages. RPCSEC_GSS integrity and privacy
   are end-to-end; that is, they protect RPC arguments and results from
   application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC
   messages after they have been XDR encoded for transmit, and before
   they have been XDR decoded after receipt. Both the sender and the
   receiver endpoints use intermediate buffers to prevent exposure of
   encrypted data or unverified cleartext data to RPC consumers. After
   verification, encryption, and message wrapping have been performed,
   the transport layer can use RDMA data transfer between these
   intermediate buffers.
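
   The effect of larger RPCSEC_GSS messages on the choice between a
   Short Message and a Long Message can be illustrated with a rough
   Python sketch. It assumes that no data item is reduced, that the
   RPC-over-RDMA header carries three empty chunk lists (28 octets: four
   32-bit fixed fields plus one 32-bit terminator per list), and an
   inline threshold of 1024 octets chosen purely for illustration. The
   function and constant names are hypothetical.

```python
# Assumed sizes, for illustration only.
MIN_HEADER_LEN = 7 * 4           # four fixed fields + three empty chunk lists
EXAMPLE_INLINE_THRESHOLD = 1024  # assumed inline threshold, in octets


def call_message_kind(rpc_message_len,
                      inline_threshold=EXAMPLE_INLINE_THRESHOLD,
                      header_len=MIN_HEADER_LEN):
    """Classify an unreduced RPC message as "short" or "long".

    rpc_message_len is the length in octets of the XDR-encoded RPC
    message, including any RPCSEC_GSS credential, verifier, or wrapping.
    """
    if rpc_message_len < 0:
        raise ValueError("message length cannot be negative")
    # A Short Message carries the header and the whole Payload stream
    # inline, so both must fit within the inline threshold.
    if header_len + rpc_message_len <= inline_threshold:
        return "short"
    # Otherwise the whole message is conveyed as a Long Message.
    return "long"


# A 900-octet call fits inline; the same call with 200 additional
# octets of authentication material does not.
print(call_message_kind(900))        # short
print(call_message_kind(900 + 200))  # long
```

   The same comparison applies to the whole wrapped message held in an
   intermediate buffer when integrity or privacy services are in use.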

   The process of reducing a DDP-eligible data item removes the data
   item and its XDR padding from the encoded XDR stream. XDR padding of
   a reduced data item is not transferred in an RPC-over-RDMA message.
   After reduction, the Payload stream contains fewer octets than the
   whole XDR stream did beforehand. XDR padding octets are often zero
   bytes, but they don't have to be. Thus reducing DDP-eligible items
   affects the result of message integrity verification or encryption.

   Therefore a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
   integrity or encryption services are in use. Effectively, no data
   item is DDP-eligible in this situation, and Chunked Messages cannot
   be used. In this mode, an RPC-over-RDMA transport operates in the
   same manner as a transport that does not support direct data
   placement.

   When RPCSEC_GSS integrity or privacy is in use, a requester provides
   both a Read list and a Reply chunk in the same RPC-over-RDMA header
   to convey a Long call and provision a receptacle for a Long reply.

10.2.2.4.  RPC-Over-RDMA Header Exposure

   Like the base fields in an ONC RPC message (XID, call direction, and
   so on), the contents of an RPC-over-RDMA message's Transport stream
   are not protected by RPCSEC_GSS. This exposes XIDs, connection
   credit limits, and chunk lists (but not the content of the data items
   they refer to) to malicious behavior, which could redirect data that
   is transferred by the RPC-over-RDMA message, result in spurious
   retransmits, or trigger connection loss.

   Encryption at the link layer, as described in Section 10.2.1,
   protects the content of the Transport stream.

11.  IANA Considerations

   Three assignments are specified by this document.
   These are unchanged from [RFC5666]:

   o  A set of RPC "netids" for resolving RPC-over-RDMA services

   o  Optional service port assignments for Upper Layer Bindings

   o  An RPC program number assignment for the configuration protocol

   These assignments have been established, as below.

   The new RPC transport has been assigned an RPC "netid", which is an
   rpcbind [RFC1833] string used to describe the underlying protocol in
   order for RPC to select the appropriate transport framing, as well as
   the format of the service addresses and ports.

   The following "Netid" registry strings are defined for this purpose:

      NC_RDMA   "rdma"
      NC_RDMA6  "rdma6"

   These netids MAY be used for any RDMA network satisfying the
   requirements of Section 3.2.2, and able to identify service endpoints
   using IP port addressing, possibly through use of a translation
   service as described above in Section 6. The "rdma" netid is to be
   used when IPv4 addressing is employed by the underlying transport,
   and "rdma6" for IPv6 addressing.

   The netid assignment policy and registry are defined in [RFC5665].

   As a new RPC transport, this protocol has no effect on RPC Program
   numbers or existing registered port numbers. However, new port
   numbers MAY be registered for use by RPC-over-RDMA-enabled services,
   as appropriate to the new networks over which the services will
   operate.

   For example, the NFS/RDMA service defined in [RFC5667] has been
   assigned the port 20049, in the IANA registry:

      nfsrdma  20049/tcp   Network File System (NFS) over RDMA
      nfsrdma  20049/udp   Network File System (NFS) over RDMA
      nfsrdma  20049/sctp  Network File System (NFS) over RDMA

   The RPC program number assignment policy and registry are defined in
   [RFC5531].

12.  Acknowledgments

   The editor gratefully acknowledges the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA Version One specification
   [RFC5666].

   Dave Noveck provided excellent review, constructive suggestions, and
   consistent navigational guidance throughout the process of drafting
   this document. Dave also contributed much of the organization and
   content of Section 9 and helped the authors understand the
   complexities of XDR extensibility.

   The comments and contributions of Karen Deitke, Dai Ngo, Chunli
   Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with
   great thanks. The editor also wishes to thank Bill Baker for his
   support of this work.

   The extract.sh shell script and formatting conventions were first
   described by the authors of the NFSv4.1 XDR specification [RFC5662].

   Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
   nfsv4 Working Group Secretary Thomas Haynes for their support.

13.  References

13.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/
              RFC2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, DOI 10.17487/RFC2203, September
              1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007.

   [RFC5403]  Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI
              10.17487/RFC5403, February 2009.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching", RFC
              5660, DOI 10.17487/RFC5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call
              (RPC) Network Identifiers and Universal Address Formats",
              RFC 5665, DOI 10.17487/RFC5665, January 2010.

13.2.  Informative References

   [IB]       InfiniBand Trade Association, "InfiniBand Architecture
              Specifications".

   [IBPORT]   InfiniBand Trade Association, "IP Addressing Annex".

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI
              10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
              793, DOI 10.17487/RFC0793, September 1981.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813, DOI 10.17487/
              RFC1813, June 1995.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041, DOI
              10.17487/RFC5041, October 2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement", RFC
              5532, DOI 10.17487/RFC5532, May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D.
              Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666, DOI
              10.17487/RFC5666, January 2010.

   [RFC5667]  Talpey, T. and B. Callaghan, "Network File System (NFS)
              Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
              January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   USA

   Phone: +1 734 274 2396
   Email: chuck.lever@oracle.com


   William Allen Simpson
   DayDreamer
   1384 Fontaine
   Madison Heights, MI 48071
   USA

   Email: william.allen.simpson@gmail.com


   Tom Talpey
   Microsoft Corp.
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 425 704-9945
   Email: ttalpey@microsoft.com