idnits 2.17.1 draft-ietf-nfsv4-rfc5666bis-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (December 14, 2015) is 3055 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 5666 (Obsoleted by RFC 8166) -- Obsolete informational reference (is this intentional?): RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 5661 (Obsoleted by RFC 8881) -- Obsolete informational reference (is this intentional?): RFC 5667 (Obsoleted by RFC 8267) Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network File System Version 4 C. Lever, Ed. 3 Internet-Draft Oracle 4 Obsoletes: 5666 (if approved) W. Simpson 5 Intended status: Standards Track DayDreamer 6 Expires: June 16, 2016 T. 
Talpey 7 Microsoft 8 December 14, 2015 10 Remote Direct Memory Access Transport for Remote Procedure Call 11 draft-ietf-nfsv4-rfc5666bis-01 13 Abstract 15 This document specifies a protocol for conveying Remote Procedure 16 Call (RPC) messages on physical transports capable of Remote Direct 17 Memory Access (RDMA). The RDMA transport binding enables efficient 18 bulk-data transport over high-speed networks with minimal change to 19 RPC applications. It requires no revision to application RPC 20 protocols or the RPC protocol itself. This document obsoletes RFC 21 5666. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at http://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on June 16, 2016. 40 Copyright Notice 42 Copyright (c) 2015 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 
55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 58 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 59 1.2. RPC Over RDMA Transports . . . . . . . . . . . . . . . . 3 60 2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4 61 2.1. Changes To The Specification . . . . . . . . . . . . . . 4 62 2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 5 63 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 64 3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5 65 3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8 66 4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10 67 4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10 68 4.2. RPC Message Framing . . . . . . . . . . . . . . . . . . . 11 69 4.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . 11 70 4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 13 71 4.5. Data Exchange . . . . . . . . . . . . . . . . . . . . . . 19 72 4.6. Message Size . . . . . . . . . . . . . . . . . . . . . . 21 73 5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 23 74 5.1. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 23 75 5.2. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 24 76 5.3. Forming Messages . . . . . . . . . . . . . . . . . . . . 26 77 5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 29 78 5.5. Handling Errors . . . . . . . . . . . . . . . . . . . . . 30 79 5.6. XDR Language Description . . . . . . . . . . . . . . . . 31 80 5.7. Deprecated Protocol Elements . . . . . . . . . . . . . . 34 81 6. Upper Layer Binding Specifications . . . . . . . . . . . . . 34 82 6.1. Determining DDP-Eligibility . . . . . . . . . . . . . . . 35 83 6.2. Write List Ordering . . . . . . . . . . . . . . . . . . . 36 84 6.3. DDP-Eligibility Violation . . . . . . . . . . . . . . . . 36 85 6.4. Other Binding Information . . . . . . . . . . . . 
. . . . 37 86 7. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 37 87 8. Bi-Directional RPC-Over-RDMA . . . . . . . . . . . . . . . . 38 88 8.1. RPC Direction . . . . . . . . . . . . . . . . . . . . . . 39 89 8.2. Backward Direction Flow Control . . . . . . . . . . . . . 40 90 8.3. Conventions For Backward Operation . . . . . . . . . . . 41 91 8.4. Backward Direction Upper Layer Binding . . . . . . . . . 43 92 9. Transport Protocol Extensibility . . . . . . . . . . . . . . 44 93 9.1. Bumping The RPC-over-RDMA Version . . . . . . . . . . . . 44 94 10. Security Considerations . . . . . . . . . . . . . . . . . . . 45 95 10.1. Memory Protection . . . . . . . . . . . . . . . . . . . 45 96 10.2. Using GSS With RPC-Over-RDMA . . . . . . . . . . . . . . 45 98 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46 99 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 47 100 13. Appendices . . . . . . . . . . . . . . . . . . . . . . . . . 47 101 13.1. Appendix 1: XDR Examples . . . . . . . . . . . . . . . . 47 102 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 49 103 14.1. Normative References . . . . . . . . . . . . . . . . . . 49 104 14.2. Informative References . . . . . . . . . . . . . . . . . 50 105 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 51 107 1. Introduction 109 This document obsoletes RFC 5666; however, the protocol specified by 110 this document is based on existing interoperating implementations of 111 the RPC-over-RDMA Version One protocol. The new specification 112 clarifies text that is subject to multiple interpretations and 113 eliminates support for unimplemented RPC-over-RDMA Version One 114 protocol elements. In addition, it introduces conventions that 115 enable bi-directional RPC-over-RDMA operation. 117 1.1. 
Requirements Language 119 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 120 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 121 document are to be interpreted as described in [RFC2119]. 123 1.2. RPC Over RDMA Transports 125 Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a 126 technique for moving data efficiently between end nodes. By 127 directing data into destination buffers as it is sent on a network, 128 and placing it via direct memory access by hardware, the benefits of 129 faster transfers and reduced host overhead are obtained. 131 Open Network Computing Remote Procedure Call (ONC RPC, or simply, 132 RPC) [RFC5531] is a remote procedure call protocol that runs over a 133 variety of transports. Most RPC implementations today use UDP or 134 TCP. On UDP, RPC messages are encapsulated inside datagrams, while 135 on a TCP byte stream, RPC messages are delineated by a record marking 136 protocol. An RDMA transport also conveys RPC messages in a specific 137 fashion that must be fully described if RPC implementations are to 138 interoperate. 140 RDMA transports present semantics different from either UDP or TCP. 141 They retain message delineations like UDP, but provide a reliable and 142 sequenced data transfer like TCP. They also provide an efficient 143 bulk-transfer service not provided by UDP or TCP. RDMA transports 144 are therefore appropriately viewed as a new transport type by RPC. 146 RDMA as a transport can enhance the performance of RPC protocols that 147 move large quantities of data, since RDMA hardware excels at moving 148 data efficiently between host memory and a high-speed network with 149 little host CPU involvement. In this context, the Network File 150 System (NFS) protocols as described in [RFC1094], [RFC1813], 151 [RFC7530], [RFC5661], and future NFSv4 minor versions are obvious 152 beneficiaries of RDMA transports.
A complete problem statement is 153 discussed in [RFC5532], and NFSv4-related issues are discussed in 154 [RFC5661]. Many other RPC-based protocols can also benefit. 156 Although the RDMA transport described here can provide relatively 157 transparent support for any RPC application, this document also 158 describes mechanisms that can optimize data transfer further, given 159 more active participation by RPC applications. 161 2. Changes Since RFC 5666 163 2.1. Changes To The Specification 165 The following alterations have been made to the RPC-over-RDMA Version 166 One specification: 168 o Section 2 has been expanded to introduce and explain key RPC, XDR, 169 and RDMA terminology. These terms are now used consistently 170 throughout the specification. This change was necessary because 171 implementers familiar with RDMA are often not familiar with the 172 mechanics of RPC, and vice versa. 174 o Section 3 has been re-organized and split into sub-sections to 175 facilitate locating specific requirements and definitions. 177 o Sections 4 and 5 have been combined for clarity and to improve the 178 organization of this information. 180 o The XDR definition of RPC-over-RDMA Version One has been updated 181 (without on-the-wire changes) to align with the terms and concepts 182 introduced in this specification. 184 o The specification of the optional Connection Configuration 185 Protocol has been removed, as there are no 186 known implementations of the protocol. 188 o Sections discussing requirements for Upper Layer Bindings have 189 been added. 191 o A section discussing RPC-over-RDMA protocol extensibility has been 192 added. 194 2.2. Changes To The Protocol 196 While the protocol described herein interoperates with existing 197 implementations of [RFC5666], the following changes have been made 198 relative to the protocol described in that document: 200 o Support for the Read-Read transfer model has been removed.
Read- 201 Read is a slower transfer model than Read-Write, thus implementers 202 have chosen not to support it. 204 o Support for sending the RDMA_MSGP message type has been 205 deprecated. This document instructs senders not to use it, but 206 receivers must continue to recognize it. 208 RDMA_MSGP has no benefit for RPC programs that place bulk payload 209 items at positions other than at the end of their argument or 210 result lists, as is common with NFSv4 COMPOUND RPCs [RFC7530]. 211 Similarly it is not beneficial when a connection's inline 212 threshold is significantly smaller than the system page size, as 213 is typical for RPC-over-RDMA Version One implementations. 215 o Specific requirements related to handling XDR round-up and 216 abstract data types have been added. 218 o Clear guidance about Send and Receive buffer size has been added. 219 This enables better decisions about when to provide and use the 220 Reply chunk. 222 o A section specifying bi-directional RPC operation on RPC-over-RDMA 223 has been added. This enables the NFSv4.1 backchannel [RFC5661] on 224 RPC-over-RDMA Version One transports when both endpoints support 225 the new functionality. 227 The protocol version number has not been changed because the protocol 228 specified in this document fully interoperates with implementations 229 of the RPC-over-RDMA Version One protocol specified in [RFC5666]. 231 3. Terminology 233 3.1. Remote Procedure Calls 235 This section introduces key elements of the Remote Procedure Call 236 [RFC5531] and External Data Representation [RFC4506] protocols upon 237 which RPC-over-RDMA Version One is constructed. 239 3.1.1. Upper Layer Protocols 241 Remote Procedure Calls are an abstraction used to implement the 242 operations of an "upper layer protocol," sometimes referred to as a 243 ULP. One example of such a protocol is the Network File System 244 Version 4.0 [RFC7530]. 246 3.1.2. 
Requesters And Responders 248 Like a local procedure call, every Remote Procedure Call has a set of 249 "arguments" and a set of "results". A calling context is not allowed 250 to proceed until the procedure's results are available to it. Unlike 251 a local procedure call, the called procedure is executed remotely 252 rather than in the local application's context. 254 The RPC protocol as described in [RFC5531] is fundamentally a 255 message-passing protocol between one server and one or more clients. 256 ONC RPC transactions are made up of two types of messages: 258 CALL Message 259 A CALL message, or "Call", requests that work be done. A Call is 260 designated by the value CALL in the message's msg_type field. An 261 arbitrary unique value is placed in the message's xid field. 263 REPLY Message 264 A REPLY message, or "Reply", reports the results of work requested 265 by a Call. A Reply is designated by the value REPLY in the 266 message's msg_type field. The value contained in the message's 267 xid field is copied from the Call whose results are being 268 reported. 270 An RPC client endpoint, or "requester", serializes an RPC call's 271 arguments and conveys them to a server endpoint via an RPC call 272 message. This message contains an RPC protocol header, a header 273 describing the requested upper layer operation, and all arguments. 275 The server endpoint, or "responder", deserializes the arguments and 276 processes the requested operation. It then serializes the 277 operation's results into another byte stream. This byte stream is 278 conveyed back to the requester via an RPC reply message. This 279 message contains an RPC protocol header, a header describing the 280 upper layer reply, and all results. The requester deserializes the 281 results and allows the original caller to proceed. 283 RPC-over-RDMA is a connection-oriented RPC transport. 
When a 284 connection-oriented transport is used, ONC RPC client endpoints are 285 responsible for initiating transport connections, while ONC RPC 286 service endpoints wait passively for incoming connection requests. 288 3.1.3. External Data Representation 290 In a heterogeneous environment, one cannot assume that all requesters 291 and responders represent data the same way. RPC uses eXternal Data 292 Representation, or XDR, to translate data types and serialize 293 arguments and results. The XDR protocol encodes data independent of 294 the endianness or size of host-native data types, allowing 295 unambiguous decoding of data on the receiving end. RPC programs are 296 specified by writing an XDR definition of their procedures, argument 297 data types, and result data types. 299 XDR assumes that the number of bits in a byte (octet) and their order 300 are the same on both endpoints and on the physical network. The 301 smallest indivisible unit of XDR encoding is a group of four octets 302 in big-endian order. XDR also flattens lists, arrays, and other 303 abstract data types so they can be conveyed as a simple stream of 304 bytes. 306 A serialized stream of bytes that is the result of XDR encoding is 307 referred to as an "XDR stream." A sending endpoint encodes native 308 data into an XDR stream and then transmits that stream to a receiver. 309 A receiving endpoint decodes incoming XDR byte streams into its 310 native data representation format. 312 The function of an RPC transport is to convey RPC messages, each 313 encoded as a separate XDR stream, from one endpoint to another. 315 3.1.3.1. XDR Opaque Data 317 Sometimes a data item must be transferred as-is, without encoding or 318 decoding. Such a data item is referred to as "opaque data." XDR 319 encoding places opaque data items directly into an XDR stream without 320 altering their content in any way. Upper Layer Protocols or 321 applications perform any needed data translation in this case.
322 Examples of opaque data items include the contents of files and 323 generic byte strings. 325 3.1.3.2. XDR Round-up 327 The number of octets in a variable-size data item precedes that item 328 in the encoding stream. If the size of an encoded data item is not a 329 multiple of four octets, octets containing zero are added to the end 330 of the item so that the next encoded data item starts on a four-octet 331 boundary. The encoded size of the item is not changed by the 332 addition of the extra octets. 334 This technique is referred to as "XDR round-up," and the extra octets 335 are referred to as "XDR padding". The content of XDR pad octets is 336 ignored by receivers. 338 3.2. Remote Direct Memory Access 340 RPC requesters and responders can be made more efficient if large RPC 341 messages are transferred by a third party such as intelligent network 342 interface hardware (data movement offload), and placed in the 343 receiver's memory so that no additional adjustment of data alignment 344 has to be made (direct data placement). Remote Direct Memory Access 345 enables both optimizations. 347 3.2.1. Direct Data Placement 349 Very often, RPC implementations copy the contents of RPC messages 350 into a buffer before sending them. An efficient RPC implementation 351 sends bulk data without copying it into a separate send buffer first. 353 However, socket-based RPC implementations are often unable to receive 354 data directly into their final place in memory. Receivers often need 355 to copy incoming data to finish an RPC operation; sometimes, only to 356 adjust data alignment. 358 In this document, "RDMA" refers to the physical mechanism an RDMA 359 transport utilizes when moving data. Although it may not be 360 efficient, a sender may copy data into an 361 intermediate buffer before an RDMA transfer. After an RDMA transfer, 362 a receiver may copy that data again to its final destination.
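The XDR round-up rule in Section 3.1.3.2 can be sketched in Python. This is an illustrative, non-normative sketch only; the function names are invented for this example:

```python
import struct

def xdr_pad_length(size):
    """Number of zero octets appended so the next encoded item
    starts on a four-octet boundary (Section 3.1.3.2)."""
    return (4 - size % 4) % 4

def xdr_encode_opaque(data):
    """Encode a variable-length opaque item: a four-octet count,
    the data octets, then XDR padding.  The encoded count is not
    changed by the addition of the pad octets."""
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad_length(len(data))

encoded = xdr_encode_opaque(b"abcde")
assert len(encoded) == 4 + 5 + 3  # count + five data octets + three pad octets
```

Note that the count field reflects only the data octets; receivers ignore the content of the pad octets.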
364 This document uses the term "direct data placement" (or DDP) to refer 365 specifically to an optimized data transfer where it is unnecessary 366 for a receiving host's CPU to copy transferred data to another 367 location after it has been received. Not all RDMA-based data 368 transfer qualifies as Direct Data Placement, and DDP can be achieved 369 using non-RDMA mechanisms. 371 3.2.2. RDMA Transport Requirements 373 The RPC-over-RDMA Version One protocol assumes the physical transport 374 provides the following abstract operations. A more complete 375 discussion of these operations is found in [RFC5040]. 377 Registered Memory 378 Registered memory is a segment of memory that is assigned a 379 steering tag that temporarily permits access by the RDMA provider 380 to perform data transfer operations. The RPC-over-RDMA Version 381 One protocol assumes that each segment of registered memory MUST 382 be identified with a steering tag of no more than 32 bits and 383 memory addresses of up to 64 bits in length. 385 RDMA Send 386 The RDMA provider supports an RDMA Send operation, with completion 387 signaled on the receiving peer after data has been placed in a 388 pre-posted memory segment. Sends complete at the receiver in the 389 order they were issued at the sender. The amount of data 390 transferred by an RDMA Send operation is limited by the size of 391 the remote pre-posted memory segment. 393 RDMA Receive 394 The RDMA provider supports an RDMA Receive operation to receive 395 data conveyed by incoming RDMA Send operations. To reduce the 396 amount of memory that must remain pinned awaiting incoming Sends, 397 the amount of pre-posted memory is limited. Flow-control to 398 prevent overrunning receiver resources is provided by the RDMA 399 consumer (in this case, the RPC-over-RDMA Version One protocol). 401 RDMA Write 402 The RDMA provider supports an RDMA Write operation to directly 403 place data in remote memory. 
The local host initiates an RDMA 404 Write, and completion is signaled there; no completion is signaled 405 on the remote. The local host provides a steering tag, memory 406 address, and length of the remote's memory segment. 408 RDMA Writes are not necessarily ordered with respect to one 409 another, but are ordered with respect to RDMA Sends. A subsequent 410 RDMA Send completion obtained at the write initiator guarantees 411 that prior RDMA Write data has been successfully placed in the 412 remote peer's memory. 414 RDMA Read 415 The RDMA provider supports an RDMA Read operation to directly 416 place peer source data in the read initiator's memory. The local 417 host initiates an RDMA Read, and completion is signaled there; no 418 completion is signaled on the remote. The local host provides 419 steering tags, memory addresses, and a length for the remote 420 source and local destination memory segments. 422 The remote peer receives no notification of RDMA Read completion. 423 The local host signals completion as part of an RDMA Send message 424 so that the remote peer can release steering tags and subsequently 425 free associated source memory segments. 427 The RPC-over-RDMA Version One protocol is designed to be carried over 428 RDMA transports that support the above abstract operations. This 429 protocol conveys to the RPC peer information sufficient for that RPC 430 peer to direct an RDMA layer to perform transfers containing RPC data 431 and to communicate their result(s). For example, it is readily 432 carried over RDMA transports such as Internet Wide Area RDMA Protocol 433 (iWARP) [RFC5040] [RFC5041]. 435 4. RPC-Over-RDMA Protocol Framework 437 4.1. Transfer Models 439 A "transfer model" designates which endpoint is responsible for 440 performing RDMA Read and Write operations. To enable these 441 operations, the peer endpoint first exposes segments of its memory to 442 the endpoint performing the RDMA Read and Write operations. 
444 Read-Read 445 Requesters expose their memory to the responder, and the responder 446 exposes its memory to requesters. The responder employs RDMA Read 447 operations to convey RPC arguments or whole RPC calls. Requesters 448 employ RDMA Read operations to convey RPC results or whole RPC 449 replies. 451 Write-Write 452 Requesters expose their memory to the responder, and the responder 453 exposes its memory to requesters. Requesters employ RDMA Write 454 operations to convey RPC arguments or whole RPC calls. The 455 responder employs RDMA Write operations to convey RPC results or 456 whole RPC replies. 458 Read-Write 459 Requesters expose their memory to the responder, but the responder 460 does not expose its memory. The responder employs RDMA Read 461 operations to convey RPC arguments or whole RPC calls. The 462 responder employs RDMA Write operations to convey RPC results or 463 whole RPC replies. 465 Write-Read 466 The responder exposes its memory to requesters, but requesters do 467 not expose their memory. Requesters employ RDMA Write operations 468 to convey RPC arguments or whole RPC calls. Requesters employ 469 RDMA Read operations to convey RPC results or whole RPC replies. 471 [RFC5666] specifies the use of both the Read-Read and the Read-Write 472 Transfer Model. All current RPC-over-RDMA Version One 473 implementations use the Read-Write Transfer Model. Use of the Read- 474 Read Transfer Model by RPC-over-RDMA Version One implementations is 475 no longer supported. Other Transfer Models may be used by a future 476 version of RPC-over-RDMA. 478 4.2. RPC Message Framing 480 During transmission, the XDR stream containing an RPC message is 481 preceded by an RPC-over-RDMA header. This header is analogous to the 482 record marking used for RPC over TCP but is more extensive, since 483 RDMA transports support several modes of data transfer.
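For illustration, the four fixed 32-bit fields that begin every RPC-over-RDMA header (XID, version, credit value, and message type; see Section 5.1) might be packed as follows. This is a non-normative sketch; the RDMA_MSG constant and field order are assumptions taken from the Version One XDR definition summarized later in this document:

```python
import struct

# rdma_proc value for a message conveyed with RDMA Send (an
# assumption here; see the XDR definition in Section 5.6 for
# the authoritative values).
RDMA_MSG = 0

def encode_fixed_header(xid, credit, proc=RDMA_MSG, version=1):
    """Pack the four fixed 32-bit header fields that begin every
    RPC-over-RDMA transmission: the XID copied from the RPC
    message, the RPC-over-RDMA version, a credit value, and the
    message type.  XDR encodes each field big-endian."""
    return struct.pack(">IIII", xid, version, credit, proc)

hdr = encode_fixed_header(xid=0x12345678, credit=32)
assert len(hdr) == 16  # four 4-octet fields
```

Chunk lists, when present, follow these fixed fields in the header's XDR stream.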
485 All transfers of an RPC message begin with an RDMA Send that 486 transfers an RPC-over-RDMA header and part or all of the accompanying 487 RPC message. Because the size of what may be transmitted via RDMA 488 Send is limited by the size of the receiver's pre-posted buffers, the 489 RPC-over-RDMA transport provides a number of methods to reduce the 490 amount transferred via RDMA Send. Parts of RPC messages not 491 transferred via RDMA Send are transferred using RDMA Read or RDMA 492 Write operations. 494 RPC-over-RDMA framing replaces all other RPC framing (such as TCP 495 record marking) when used atop an RPC-over-RDMA association, even 496 when the underlying RDMA protocol may itself be layered atop a 497 transport with a defined RPC framing (such as TCP). 499 It is however possible for RPC-over-RDMA to be dynamically enabled in 500 the course of negotiating the use of RDMA via an Upper Layer Protocol 501 exchange. Because RPC framing delimits an entire RPC request or 502 reply, the resulting shift in framing must occur between distinct RPC 503 messages, and in concert with the underlying transport. 505 4.3. Flow Control 507 It is critical to provide RDMA Send flow control for an RDMA 508 connection. RDMA receive operations can fail if a pre-posted receive 509 buffer is not available to accept an incoming RDMA Send, and repeated 510 occurrences of such errors can be fatal to the connection. This is a 511 departure from conventional TCP/IP networking where buffers are 512 allocated dynamically as part of receiving messages. 514 Flow control for RDMA Send operations directed to the responder is 515 implemented as a simple request/grant protocol in the RPC-over-RDMA 516 header associated with each RPC message (Section 5.1.3 has details). 518 o The RPC-over-RDMA header for RPC call messages contains a 519 requested credit value for the responder. 
This is the maximum 520 number of RPC replies the requester can handle at once, 521 independent of how many RPCs are in flight at that moment. The 522 requester MAY dynamically adjust the requested credit value to 523 match its expected needs. 525 o The RPC-over-RDMA header for RPC reply messages provides the 526 granted result. This is the maximum number of RPC calls the 527 responder can handle at once, without regard to how many RPCs are 528 in flight at that moment. The granted value MUST NOT be zero, 529 since such a value would result in deadlock. The responder MAY 530 dynamically adjust the granted credit value to match its needs or 531 policies (e.g. to accommodate the available resources in a shared 532 receive queue). 534 The requester MUST NOT send unacknowledged requests in excess of this 535 granted responder credit limit. If the limit is exceeded, the RDMA 536 layer may signal an error, possibly terminating the connection. Even 537 if an RDMA layer error does not occur, the responder MAY handle 538 excess requests or return an RPC layer error to the requester. 540 While RPC calls complete in any order, the current flow control limit 541 at the responder is known to the requester from the Send ordering 542 properties. It is always the lower of the requested and granted 543 credit values, minus the number of requests in flight. Advertised 544 credit values are not altered because individual RPCs are started or 545 completed. 547 On occasion a requester or responder may need to adjust the amount of 548 resources available to a connection. When this happens, the 549 responder needs to ensure that a credit increase is effected (i.e. 550 receives are posted) before the next reply is sent. 552 Certain RDMA implementations may impose additional flow control 553 restrictions, such as limits on RDMA Read operations in progress at 554 the responder. 
Accommodation of such restrictions is considered the 555 responsibility of each RPC-over-RDMA Version One implementation. 557 4.3.1. Initial Connection State 559 There are two operational parameters for each connection: 561 Credit Limit 562 As described above, the total number of responder receive buffers 563 is a connection's credit limit. The credit limit is advertised in 564 the RPC-over-RDMA header in each RPC message, and can change 565 during the lifetime of a connection. 567 Inline Threshold 568 The size of the receiver's smallest posted receive buffer is the 569 largest size message that a sender can convey with an RDMA Send 570 operation, and is known as a connection's "inline threshold." 571 Unlike the connection's credit limit, the inline threshold value 572 is not advertised to peers via the RPC-over-RDMA Version One 573 protocol, and there is no provision for the inline threshold value 574 to change during the lifetime of an RPC-over-RDMA Version One 575 connection. Connection peers MAY have different inline 576 thresholds. 578 The longevity of a transport connection requires that sending 579 endpoints respect the resource limits of peer receivers. However, 580 when a connection is first established, peers cannot know how many 581 receive buffers the other has, nor how large the buffers are. 583 As a basis for an initial exchange of RPC requests, each RPC-over- 584 RDMA Version One connection provides the ability to exchange at least 585 one RPC message at a time that is 1024 bytes in size. A responder 586 MAY exceed this basic level of configuration, but a requester MUST 587 NOT assume more than one credit is available, and MUST receive a 588 valid reply from the responder carrying the actual number of 589 available credits, prior to sending its next request. 591 Implementations MUST support an inline threshold of 1024 bytes, but 592 MAY support larger inline thresholds. 
In the absence of a mechanism 593 for discovering a peer's inline threshold, senders MUST assume a 594 receiver's inline threshold is 1024 bytes. 596 4.4. XDR Encoding With Chunks 598 On traditional RPC transports, XDR data items in an RPC message are 599 encoded as a contiguous sequence of bytes for network transmission. 600 However, in the case of an RDMA transport, during XDR encoding it can 601 be determined that (for instance) an opaque byte array is large 602 enough to be moved via an RDMA Read or RDMA Write operation. 604 RPC-over-RDMA Version One provides a mechanism for moving part of an RPC 605 message via a separate RDMA data transfer. A contiguous piece of an 606 XDR stream that is split out and moved via a separate RDMA operation 607 is known as a "chunk." The sender removes the chunk data from 608 the XDR stream conveyed via RDMA Send, and the receiver inserts it 609 before handing the reconstructed stream to the Upper Layer. 611 4.4.1. DDP-Eligibility 613 Only an XDR data item that might benefit from Direct Data Placement 614 should be moved to a chunk. The eligibility of specific XDR data 615 items to be moved as a chunk, as opposed to being left in the XDR 616 stream, is not specified by this document. A determination must be 617 made for each Upper Layer Protocol as to which items in its XDR definition 618 are allowed to use Direct Data Placement. Therefore an additional 619 specification is needed that describes how an Upper Layer Protocol 620 enables Direct Data Placement. The set of requirements for a ULP to 621 use an RDMA transport is known as an "Upper Layer Binding" 622 specification, or ULB. 624 An Upper Layer Binding states which specific individual XDR data 625 items in an Upper Layer Protocol MAY be transferred via Direct Data 626 Placement. This document will refer to such XDR data items as "DDP- 627 eligible". All other XDR data items MUST NOT be placed in a chunk.
628 RPC-over-RDMA Version One uses RDMA Read and Write operations to 629 transfer DDP-eligible data that has been placed in chunks. 631 The details and requirements for Upper Layer Bindings are discussed 632 in full in Section 6. 634 4.4.2. RDMA Segments 636 When encoding an RPC message that contains a DDP-eligible data item, 637 the RPC-over-RDMA transport does not place the item into the RPC 638 message's XDR stream. Instead, it records in the RPC-over-RDMA 639 header the address and size of the memory region containing the data 640 item. The requester sends this information for DDP-eligible data 641 items in both RPC calls and replies. The responder uses this 642 information to initiate RDMA Read and Write operations on the memory 643 regions. 645 An "RDMA segment", or just "segment", is an RPC-over-RDMA header data 646 object that contains the precise coordinates of a contiguous memory 647 region that is to be conveyed via one or more RDMA Read or RDMA Write 648 operations. The following fields are contained in a segment: 650 Handle 651 Steering tag or handle obtained when the segment's memory is 652 registered for RDMA. Sometimes known as an R_key. 654 Length 655 The length of the segment in bytes. 657 Offset 658 The offset or beginning memory address of the segment. 660 See [RFC5040] for further discussion of the meaning of these fields. 662 4.4.3. Chunks 664 A "chunk" refers to a portion of XDR stream data that is moved via 665 RDMA Read or Write operations. Chunk data is removed from the 666 sender's XDR stream, transferred by separate RDMA operations, and 667 then re-inserted into the receiver's XDR stream. 669 Each chunk consists of one or more RDMA segments. Each segment 670 represents a single contiguous piece of that chunk. 672 Except in special cases, a chunk contains exactly one XDR data item. 673 This makes it straightforward to remove chunks from an XDR stream 674 without affecting XDR alignment.
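The segment fields above map naturally onto a small record type, and a chunk is then simply a sequence of such records. The sketch below is illustrative, not normative:

```python
from dataclasses import dataclass

@dataclass
class RdmaSegment:
    handle: int   # steering tag (R_key) obtained from memory registration
    length: int   # length of the memory region, in bytes
    offset: int   # beginning memory address of the region

def chunk_length(chunk):
    # A chunk is one or more segments, each representing a single
    # contiguous piece of the chunk; its total length is the sum of
    # the segment lengths.
    return sum(seg.length for seg in chunk)
```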
676 +----------------+ +----------------+------------------ 677 | RPC-over-RDMA | | | 678 | header w/ | | RPC Header | Non-chunk args/results 679 | segments | | | 680 +----------------+ +----------------+------------------ 681 | 682 +-> Chunk A 683 +-> Chunk B 684 +-> Chunk C 685 . . . 687 Block diagram of an RPC-over-RDMA message 689 Not every message has chunks associated with it. The structure of 690 the RPC-over-RDMA header is covered in Section 5. 692 4.4.3.1. Counted Arrays 694 If a chunk is to move a counted array data type, the count of array 695 elements MUST remain in the XDR stream, while the array elements MUST 696 be moved to the chunk. For example, when encoding an opaque byte 697 array as a chunk, the count of bytes stays in the XDR stream, while 698 the bytes in the array are removed from the XDR stream and 699 transferred via the chunk. Any byte count left in the XDR stream 700 MUST match the sum of the lengths of the segments making up the 701 chunk. If they do not agree, an RPC protocol encoding error results. 703 Individual array elements appear in the chunk in their entirety. For 704 example, when encoding an array of arrays as a chunk, the count of 705 items in the enclosing array stays in the XDR stream, but each 706 enclosed array, including its item count, is transferred as part of 707 the chunk. 709 4.4.3.2. Optional-data And Unions 711 If a chunk is to move an optional-data data type, the "is present" 712 field MUST remain in the XDR stream, while the data, if present, MUST 713 be moved to the chunk. 715 A union data type should never be made DDP-eligible, but one or more 716 of its arms may be DDP-eligible. 718 4.4.4. Read Chunks 720 A "Read chunk" represents an XDR data item that is to be pulled from 721 the requester to the responder using RDMA Read operations. 723 A Read chunk is a list of one or more RDMA segments. Each segment in 724 a Read chunk has an additional Position field. 
726 Position 727 For data that is to be encoded, the byte offset in the RPC message 728 XDR stream where the receiver re-inserts the chunk data. The byte 729 offset MUST be computed from the beginning of the RPC message, not 730 the beginning of the RPC-over-RDMA header. All segments belonging 731 to the same Read chunk have the same value in their Position 732 field. 734 While constructing the RPC call, the requester registers memory 735 segments containing data in Read chunks. It advertises these chunks 736 in the RPC-over-RDMA header of the RPC call. 738 After receiving the RPC call via an RDMA Send operation, the 739 responder transfers the chunk data from the requester using RDMA Read 740 operations. The responder reconstructs the transferred chunk data by 741 concatenating the contents of each segment, in list order, into the 742 RPC call XDR stream. The first segment begins at the XDR position in 743 the Position field, and subsequent segments are concatenated 744 afterwards until there are no more segments left at that XDR 745 Position. 747 4.4.4.1. Read Chunk Round-up 749 XDR requires each encoded data item to start on four-byte alignment. 750 When an odd-length data item is marshaled, its length is encoded 751 literally, while the data is padded so the next data item can start 752 on a four-byte boundary in the XDR stream. Receivers ignore the 753 content of the pad bytes. 755 Data items remaining in the XDR stream must all adhere to the above 756 padding requirements. When a Read chunk is removed from an XDR 757 stream, the requester MUST remove any needed XDR padding for that 758 chunk as well. Alignment of the items remaining in the stream is 759 unaffected. 761 The length of a Read chunk is the sum of the lengths of the segments 762 that comprise it. If this sum is not a multiple of four, the 763 requester MAY choose to send a Read chunk without any XDR padding. 
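The round-up rule above reduces to a small computation. The helpers below are an illustrative sketch, not part of the protocol:

```python
def xdr_pad_length(nbytes):
    # Pad bytes needed after a data item of nbytes so the next item
    # starts on a four-byte boundary (receivers ignore pad content).
    return (4 - nbytes % 4) % 4

def responder_must_round_up(segment_lengths):
    # If a Read chunk's segment lengths do not sum to a multiple of
    # four, the requester may omit the XDR padding, and the responder
    # must supply round-up in its reconstructed XDR stream.
    return sum(segment_lengths) % 4 != 0
```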
764 The responder MUST be prepared to provide appropriate round-up in its 765 reconstructed XDR stream if the requester provides no actual round-up 766 in a Read chunk. 768 The Position field in read segments indicates where the containing 769 Read chunk starts in the RPC message XDR stream. The value in this 770 field MUST be a multiple of four. Moreover, all segments in the same 771 Read chunk share the same Position value, even if one or more of the 772 segments have a non-four-byte aligned length. 774 4.4.4.2. Decoding Read Chunks 776 XDR decoding moves data from an XDR stream into a data structure 777 provided by an RPC application. Where elements of the destination 778 data structure are buffers or strings, the RPC application can either 779 pre-allocate storage to receive the data, or leave the string or 780 buffer fields null and allow the XDR decode stage of RPC processing 781 to automatically allocate storage of sufficient size. 783 When decoding a message from an RDMA transport, the receiver first 784 decodes the chunk lists from the RPC-over-RDMA header, then proceeds 785 to decode the body of the RPC message. Whenever the XDR offset in 786 the decode stream matches that of a Read chunk, the transport 787 initiates an RDMA Read to bring over the chunk data into locally 788 registered memory for the destination buffer. 790 When processing an RPC request, the responder acknowledges its 791 completion of use of the source buffers by simply replying to the 792 requester. The requester may then free all source buffers advertised 793 by the request. 795 4.4.5. Write Chunks 797 A "Write chunk" represents an XDR data item that is to be pushed from 798 the responder to the requester using RDMA Write operations. 800 A Write chunk is an array of one or more RDMA segments. Segments in 801 a Write chunk do not have a Position field because Write chunks are 802 provided by a requester long before the responder prepares the reply 803 XDR stream. 
805 While constructing the RPC call, the requester also sets up memory 806 segments to catch DDP-eligible reply data. The requester provides as 807 many segments as needed to accommodate the largest possible size of 808 the data item in each Write chunk. 810 The responder transfers the chunk data to the requester using RDMA 811 Write operations. The responder copies the requester-provided Write chunk 812 segments into the RPC-over-RDMA header to be sent with the reply. 813 The responder updates the segment length fields to reflect the actual 814 amount of data that is being returned in the chunk. The updated 815 length of a Write chunk segment MAY be zero if the segment was not 816 filled by the responder. However, the responder MUST NOT change the 817 number of segments in the Write chunk. 819 The responder then sends the RPC reply via an RDMA Send operation. 820 After receiving the RPC reply, the requester reconstructs the 821 transferred data by concatenating the contents of each segment, in 822 array order, into the RPC reply XDR stream. 824 4.4.5.1. Unused Write Chunks 826 There are occasions when a requester provides a Write chunk but the 827 responder does not use it. For example, an Upper Layer Protocol may 828 have a union result where some arms of the union contain a DDP- 829 eligible data item, and other arms do not. To return an unused Write 830 chunk, the responder MUST set the length of all segments in the chunk 831 to zero. 833 Unused write chunks, or unused bytes in write chunk segments, are not 834 returned as results and their memory is returned to the Upper Layer 835 as part of RPC completion. However, the RPC layer MUST NOT assume 836 that the buffers have not been modified. 838 4.4.5.2. Write Chunk Round-up 840 XDR requires each encoded data item to start on four-byte alignment.
841 When an odd-length data item is marshaled, its length is encoded 842 literally, while the data is padded so the next data item can start 843 on a four-byte boundary in the XDR stream. Receivers ignore the 844 content of the pad bytes. 846 Data items remaining in the XDR stream must all adhere to the above 847 padding requirements. When a Write chunk is removed from an XDR 848 stream, the requester MUST remove any needed XDR padding for that 849 chunk as well. Alignment of the items remaining in the stream is 850 unaffected. 852 The length of a Write chunk is the sum of the lengths of the segments 853 that comprise it. If this sum is not a multiple of four, the 854 responder MAY choose not to write XDR padding. The requester does 855 not know the actual length of a Write chunk when it is prepared, but 856 it SHOULD provide enough segments to accommodate any needed XDR 857 padding. The requester MUST be prepared to provide appropriate 858 round-up in its reconstructed XDR stream if the responder provides no 859 actual round-up in a Write chunk. 861 4.5. Data Exchange 863 In summary, there are three mechanisms for moving data between 864 requester and responder. 866 Inline 867 Data is moved between requester and responder via an RDMA Send 868 operation. 870 RDMA Read 871 Data is moved between requester and responder via an RDMA Read 872 operation. Address and offset are obtained from a Read chunk in 873 the requester's RPC call message. 875 RDMA Write 876 Data is moved from responder to requester via an RDMA Write 877 operation. Address and offset are obtained from a Write chunk in 878 the requester's RPC call message. 880 Many combinations are possible. For instance, an RPC call may 881 contain some inline data along with Read or Write chunks. The reply 882 to that call may have chunks that the responder RDMA Writes back to 883 the requester. The following diagrams illustrate RPC calls that use 884 these methods to move RPC message data. 
886 Requester Responder 887 | RPC Call | 888 Send | ------------------------------> | 889 | | 890 | RPC Reply | 891 | <------------------------------ | Send 893 An RPC with no chunks in the call or reply messages 895 Requester Responder 896 | RPC Call + Write chunks | 897 Send | ------------------------------> | 898 | | 899 | Chunk 1 | 900 | <------------------------------ | Write 901 | : | 902 | Chunk n | 903 | <------------------------------ | Write 904 | | 905 | RPC Reply | 906 | <------------------------------ | Send 908 An RPC with write chunks in the call message 910 In the presence of write chunks, RDMA ordering guarantees that all 911 data in the RDMA Write operations has been placed in memory prior to 912 the requester's RPC reply processing. 914 Requester Responder 915 | RPC Call + Read chunks | 916 Send | ------------------------------> | 917 | | 918 | Chunk 1 | 919 | +------------------------------ | Read 920 | v-----------------------------> | 921 | : | 922 | Chunk n | 923 | +------------------------------ | Read 924 | v-----------------------------> | 925 | | 926 | RPC Reply | 927 | <------------------------------ | Send 929 An RPC with read chunks in the call message 931 Requester Responder 932 | RPC Call + Read and Write chunks | 933 Send | ------------------------------> | 934 | | 935 | Read chunk 1 | 936 | +------------------------------ | Read 937 | v-----------------------------> | 938 | : | 939 | Read chunk n | 940 | +------------------------------ | Read 941 | v-----------------------------> | 942 | | 943 | Write chunk 1 | 944 | <------------------------------ | Write 945 | : | 946 | Write chunk n | 947 | <------------------------------ | Write 948 | | 949 | RPC Reply | 950 | <------------------------------ | Send 952 An RPC with read and write chunks in the call message 954 4.6. Message Size 956 The receiver of RDMA Send operations is required by RDMA to have 957 previously posted one or more adequately sized buffers (see 958 Section 4.3.1). 
Memory savings can be achieved on both requesters 959 and responders by leaving the inline threshold small. 961 4.6.1. Short Messages 963 RPC messages are frequently smaller than the connection's inline 964 threshold. 966 For example, the NFS version 3 GETATTR request is only 56 bytes: 20 967 bytes of RPC header, plus a 32-byte file handle argument and 4 bytes 968 for its length. The reply to this common request is about 100 bytes. 970 Since all RPC messages conveyed via RPC-over-RDMA require an RDMA 971 Send operation, the most efficient way to send an RPC message that is 972 smaller than the connection's inline threshold is to append its XDR 973 stream directly to the buffer carrying the RPC-over-RDMA header. An 974 RPC-over-RDMA header with a small RPC call or reply message 975 immediately following is transferred using a single RDMA Send 976 operation. No RDMA Read or Write operations are needed. 978 4.6.2. Chunked Messages 980 If DDP-eligible data items are present in an RPC message, a sender 981 MAY remove them from the RPC message, and use RDMA Read or Write 982 operations to move that data. The RPC-over-RDMA header with the 983 shortened RPC call or reply message immediately following is 984 transferred using a single RDMA Send operation. Removed DDP-eligible 985 data items are conveyed using RDMA Read or Write operations using 986 additional information provided in the RPC-over-RDMA header. 988 4.6.3. Long Messages 990 When an RPC message is larger than the connection's inline threshold, 991 DDP-eligible data items are removed from the message and placed in 992 chunks and moved separately. If there are no DDP-eligible data items 993 in the message, or the message is still too large after DDP-eligible 994 items have been removed, the RDMA transport MUST use RDMA Read or 995 Write operations to convey the RPC message body itself. This 996 mechanism is referred to as a "Long Message." 
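The choice among Short, Chunked, and Long Messages described in Sections 4.6.1 through 4.6.3 amounts to a size check against the connection's inline threshold. The function and parameter names in this sketch are illustrative:

```python
def plan_message(total_len, inline_threshold, ddp_eligible_len):
    # total_len: size of the whole RPC message (with RPC-over-RDMA
    # header) in bytes; ddp_eligible_len: bytes that DDP-eligible data
    # items would remove from the message if moved into chunks.
    if total_len <= inline_threshold:
        return "short"    # entire message fits in one RDMA Send
    if total_len - ddp_eligible_len <= inline_threshold:
        return "chunked"  # shortened message is Sent; chunks move the rest
    return "long"         # the RPC message body itself moves as a chunk
```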
998 To send an RPC message as a Long Message, the sender conveys only the 999 RPC-over-RDMA header with an RDMA Send operation. The RPC message 1000 itself is not included in the Send buffer. Instead, the requester 1001 provides chunks that the responder uses to move the whole RPC 1002 message. 1004 Long RPC call 1005 To handle an RPC request using a Long Message, the requester 1006 provides a special Read chunk that contains the RPC call's XDR 1007 stream. Every segment in this Read chunk MUST contain zero in its 1008 Position field. This chunk is known as a "Position Zero Read 1009 chunk." 1011 Long RPC reply 1012 To handle an RPC reply using a Long Message, the requester 1013 provides a single special Write chunk, known as the "Reply chunk", 1014 that contains the RPC reply's XDR stream. The requester sizes the 1015 Reply chunk to accommodate the largest possible expected reply for 1016 that Upper Layer operation. 1018 Though the purpose of a Long Message is to handle large RPC messages, 1019 requesters MAY use a Long Message at any time to convey an RPC call. 1020 Responders SHOULD use a Long Message whenever a Reply chunk has been 1021 provided by a requester. Both types of special chunk MAY be present 1022 in the same RPC message. 1024 Because these special chunks contain a whole RPC message, any XDR 1025 data item MAY appear in one of these special chunks without regard to 1026 its DDP-eligibility. DDP-eligible data items MAY be removed from 1027 these special chunks and conveyed via normal chunks, but non-eligible 1028 data items MUST NOT appear in normal chunks. 1030 5. RPC-Over-RDMA In Operation 1032 An RPC-over-RDMA Version One header precedes all RPC messages 1033 conveyed across an RDMA transport. This header includes a copy of 1034 the message's transaction ID, data for RDMA flow control credits, and 1035 lists of memory addresses used for RDMA Read and Write operations. 1036 All RPC-over-RDMA header content MUST be XDR encoded. 
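Since XDR encodes unsigned integers as big-endian 32-bit values, the header's fixed fields (transaction ID, version, credit value, and message type, detailed in Section 5.1) can be sketched as follows; this is an illustration, not a reference encoder:

```python
import struct

# Message type values defined by RPC-over-RDMA Version One.
RDMA_MSG, RDMA_NOMSG, RDMA_ERROR = 0, 1, 4

def encode_fixed_header(xid, credits, msg_type):
    # Each field is XDR-encoded as a big-endian unsigned 32-bit
    # integer; the version field is always 1 for this protocol. The
    # chunk lists (and any RPC message) follow these 16 bytes.
    return struct.pack(">IIII", xid, 1, credits, msg_type)
```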
1038 RPC message layout is unchanged from that described in [RFC5531] 1039 except for the possible removal of data items that are moved by RDMA 1040 Read or Write operations. If an RPC message (along with its RPC- 1041 over-RDMA header) is larger than the connection's inline threshold 1042 even after any large chunks are removed, then the RPC message MAY be 1043 moved separately as a chunk, leaving just the RPC-over-RDMA header in 1044 the RDMA Send. 1046 5.1. Fixed Header Fields 1048 The RPC-over-RDMA header begins with four fixed 32-bit fields that 1049 MUST be present and that control the RDMA interaction including RDMA- 1050 specific flow control. These four fields are: 1052 5.1.1. Transaction ID (XID) 1054 The XID generated for the RPC call and reply. Having the XID at a 1055 fixed location in the header makes it easy for the receiver to 1056 establish context as soon as the message arrives. This XID MUST be 1057 the same as the XID in the RPC header. The receiver MAY perform its 1058 processing based solely on the XID in the RPC-over-RDMA header, and 1059 thereby ignore the XID in the RPC header, if it so chooses. 1061 5.1.2. Version number 1063 For RPC-over-RDMA Version One, this field MUST contain the value 1 1064 (one). Further discussion of protocol extensibility can be found in 1065 Section 9. 1067 5.1.3. Flow control credit value 1069 When sent in an RPC call message, the requested credit value is 1070 provided. When sent in an RPC reply message, the granted credit 1071 value is returned. RPC calls SHOULD NOT be sent in excess of the 1072 currently granted limit. Further discussion of flow control can be 1073 found in Section 4.3. 1075 5.1.4. Message type 1077 o RDMA_MSG = 0 indicates that chunk lists and an RPC message follow. 1078 The format of the chunk lists is discussed below. 1080 o RDMA_NOMSG = 1 indicates that after the chunk lists there is no 1081 RPC message. 
In this case, the chunk lists provide information to 1082 allow the responder to transfer the RPC message using RDMA Read or 1083 Write operations. 1085 o RDMA_MSGP = 2 is reserved, and no longer used. 1087 o RDMA_DONE = 3 is reserved, and no longer used. 1089 o RDMA_ERROR = 4 is used to signal a responder-detected error in 1090 RDMA chunk encoding. 1092 For a message of type RDMA_MSG, the four fixed fields are followed by 1093 the Read and Write lists and the Reply chunk (though any or all three 1094 MAY be marked as not present), then an RPC message, beginning with 1095 its XID field. The Send buffer holds two separate XDR streams: the 1096 first XDR stream contains the RPC-over-RDMA header, and the second 1097 XDR stream contains the RPC message. 1099 For a message of type RDMA_NOMSG, the four fixed fields are followed 1100 by the Read and Write chunk lists and the Reply chunk (though any or 1101 all three MAY be marked as not present). The Send buffer holds one 1102 XDR stream which contains the RPC-over-RDMA header. 1104 For a message of type RDMA_ERROR, the four fixed fields are followed 1105 by formatted error information. 1107 The above content (the fixed fields, the chunk lists, and the RPC 1108 message, when present) MUST be conveyed via a single RDMA Send 1109 operation. A gather operation on the Send can be used to marshal the 1110 separate RPC-over-RDMA header, the chunk lists, and the RPC message 1111 itself. However, the total length of the gathered send buffers 1112 cannot exceed the peer's inline threshold. 1114 5.2. Chunk Lists 1116 The chunk lists in an RPC-over-RDMA Version One header are three XDR 1117 optional-data fields that MUST follow the fixed header fields in 1118 RDMA_MSG and RDMA_NOMSG type messages. Read Section 4.19 of 1119 [RFC4506] carefully to understand how optional-data fields work. 1120 Examples of XDR encoded chunk lists are provided in Section 13.1 to 1121 aid understanding. 1123 5.2.1. 
Read List 1125 Each RPC-over-RDMA Version One header has one "Read list." The Read 1126 list is a list of zero or more Read segments, provided by the 1127 requester, that are grouped by their Position fields into Read 1128 chunks. Each Read chunk advertises the locations of data the 1129 responder is to pull via RDMA Read operations. The requester SHOULD 1130 sort the chunks in the Read list in Position order. 1132 Via a Position Zero Read Chunk, a requester may provide part or all 1133 of an entire RPC call message as the first chunk in this list. 1135 The Read list MAY be empty if the RPC call has no argument data that 1136 is DDP-eligible and the Position Zero Read Chunk is not being used. 1138 5.2.2. Write List 1140 Each RPC-over-RDMA Version One header has one "Write list." The 1141 Write list is a list of zero or more Write chunks, provided by the 1142 requester. Each Write chunk is an array of RDMA segments, thus the 1143 Write list is a list of counted arrays. Each Write chunk advertises 1144 receptacles for DDP-eligible data to be pushed by the responder. 1146 When a Write list is provided for the results of the RPC call, the 1147 responder MUST provide any corresponding data via RDMA Write to the 1148 memory referenced in the chunk's segments. The Write list MAY be 1149 empty if the RPC operation has no DDP-eligible result data. 1151 When multiple Write chunks are present, the responder fills in each 1152 Write chunk with a DDP-eligible result until either there are no more 1153 results or no more Write chunks. An Upper Layer Binding MUST 1154 determine how Write list entries are mapped to procedure arguments 1155 for each Upper Layer procedure. For details, see Section 6. 1157 The RPC reply conveys the size of result data by returning the Write 1158 list to the requester with the lengths rewritten to match the actual 1159 transfer. Decoding the reply therefore performs no local data 1160 transfer but merely returns the length obtained from the reply. 
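The length-rewriting rules above can be sketched as follows (illustrative names, not a normative algorithm): the responder fills the chunk's segments in array order without changing their count, and the requester recovers the result size by summing the returned lengths.

```python
def fill_write_chunk(segment_caps, result_len):
    # Rewrite the per-segment lengths to the number of bytes actually
    # written; the number of segments never changes, and segments the
    # responder did not fill get length zero.
    filled, remaining = [], result_len
    for cap in segment_caps:
        used = min(cap, remaining)
        filled.append(used)
        remaining -= used
    if remaining:
        raise ValueError("Write chunk too small for the result")
    return filled

def decoded_result_length(filled_lengths):
    # Decoding the reply performs no local data transfer; the
    # requester merely sums the lengths returned in the segments.
    return sum(filled_lengths)
```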
1162 Each decoded result consumes one entry in the Write list, which in 1163 turn consists of an array of RDMA segments. The length of a Write 1164 chunk is therefore the sum of all returned lengths in all segments 1165 comprising the corresponding list entry. As each Write chunk is 1166 decoded, the entire entry is consumed. 1168 5.2.3. Reply Chunk 1170 Each RPC-over-RDMA Version One header has one "Reply Chunk." The 1171 Reply Chunk is a Write chunk, provided by the requester. The Reply 1172 Chunk is a single counted array of RDMA segments. A responder MAY 1173 convey part or all of an entire RPC reply message in this chunk. 1175 A requester provides the Reply chunk whenever it predicts the 1176 responder's reply might not fit in an RDMA Send operation. A 1177 requester MAY choose to provide the Reply chunk even when the 1178 responder can return only a small reply. 1180 5.3. Forming Messages 1182 5.3.1. Short Messages 1184 A Short Message without chunks is contained entirely within a single 1185 RDMA Send Operation. Since the RPC call message immediately follows 1186 the RPC-over-RDMA header in the send buffer, the requester MUST set 1187 the message type to RDMA_MSG. 1189 <------------------ RPC-over-RDMA header ---------------> 1190 +--------+---------+---------+------------+-------------+ +---------- 1191 | | | | | NULL | | Whole 1192 | XID | Version | Credits | RDMA_MSG | Chunk Lists | | RPC 1193 | | | | | | | Message 1194 +--------+---------+---------+------------+-------------+ +---------- 1196 5.3.2. Chunked Messages 1198 A Chunked Message is similar to a Short Message, but the RPC message 1199 does not contain the chunk data. Since the RPC call message 1200 immediately follows the RPC-over-RDMA header in the send buffer, the 1201 requester MUST set the message type to RDMA_MSG. 
1203 <------------------ RPC-over-RDMA header ---------------> 1204 +--------+---------+---------+------------+-------------+ +---------- 1205 | | | | | | | Modified 1206 | XID | Version | Credits | RDMA_MSG | Chunk Lists | | RPC 1207 | | | | | | | Message 1208 +--------+---------+---------+------------+-------------+ +---------- 1209 | 1210 | +---------- 1211 | | 1212 +->| Chunks 1213 | 1214 +---------- 1216 5.3.3. Long Call Messages 1218 To send a Long Call Message, the requester registers the memory 1219 containing the RPC call message and adds a chunk to the Read List at 1220 Position Zero. Since the RPC call message does not follow the RPC- 1221 over-RDMA header in the send buffer, the requester MUST set the 1222 message type to RDMA_NOMSG. 1224 <------------------ RPC-over-RDMA header ---------------> 1225 +--------+---------+---------+------------+-------------+ 1226 | | | | | | 1227 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1228 | | | | | | 1229 +--------+---------+---------+------------+-------------+ 1230 | 1231 | +---------- 1232 | | RPC Call 1233 +->| 1234 | Message 1235 +---------- 1237 If a responder gets an RPC-over-RDMA header with a message type of 1238 RDMA_NOMSG and finds an initial Read list entry with a zero XDR 1239 position, it allocates a registered buffer and issues an RDMA Read of 1240 the RPC message into it. The responder then proceeds to XDR decode 1241 the RPC message as if it had received it with the Send data. Further 1242 decoding may issue additional RDMA Reads to bring over additional 1243 chunks. 1245 Requester Responder 1246 | RPC-over-RDMA Header | 1247 Send | ------------------------------> | 1248 | | 1249 | Long RPC Call Msg | 1250 | <------------------------------ | Read 1251 | ------------------------------> | 1252 | | 1253 | RPC-over-RDMA Reply | 1254 | <------------------------------ | Send 1256 A long call RPC with request supplied via RDMA Read 1258 5.3.4.
Long Reply Messages 1260 To send a Long Reply Message, the requester MAY register a large 1261 buffer into which the responder can write an RPC reply. This buffer 1262 is passed to the responder in the RPC call message as the Reply 1263 chunk. 1265 If the responder's reply message is too long to return with an RDMA 1266 Send operation, even after Write chunks are removed, then the 1267 responder performs an RDMA Write of the RPC reply message into the 1268 buffer indicated by the Reply chunk. Since the RPC reply message 1269 does not follow the RPC-over-RDMA header in the send buffer, the 1270 responder MUST set the message type to RDMA_NOMSG. 1272 <------------------ RPC-over-RDMA header ---------------> 1273 +--------+---------+---------+------------+-------------+ 1274 | | | | | | 1275 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1276 | | | | | | 1277 +--------+---------+---------+------------+-------------+ 1278 | 1279 | +---------- 1280 | | RPC Reply 1281 +->| 1282 | Message 1283 +---------- 1285 Requester Responder 1286 | RPC Call with Reply chunk | 1287 Send | ------------------------------> | 1288 | | 1289 | Long RPC Reply Msg | 1290 | <------------------------------ | Write 1291 | RPC-over-RDMA Header | 1292 | <------------------------------ | Send 1294 An RPC with long reply returned via RDMA Write 1296 The use of RDMA Write to return long replies requires that the 1297 requester anticipates a long reply and has some knowledge of its size 1298 so that an adequately sized buffer can be allocated. Typically the 1299 Upper Layer Protocol can limit the size of RPC replies appropriately. 1301 It is possible for a single RPC procedure to employ both a long call 1302 for its arguments and a long reply for its results. However, such an 1303 operation is atypical, as few upper layers define such exchanges. 1305 5.4. Memory Registration 1307 RDMA requires that data is transferred only between registered memory 1308 segments at the source and destination.
All protocol headers as well 1309 as separately transferred data chunks use registered memory. 1311 Since the cost of registering and de-registering memory can be a 1312 significant proportion of the RDMA transaction cost, it is important 1313 to minimize registration activity. This is easily achieved within 1314 RPC-controlled memory by allocating chunk list data and RPC headers 1315 in a reusable way from pre-registered pools. 1317 5.4.1. Registration Longevity 1319 Data chunks transferred via RDMA Read and Write MAY reside in memory 1320 that persists outside the bounds of the RPC transaction. Hence, the 1321 default behavior of an RPC-over-RDMA transport is to register and 1322 invalidate these chunks on every RPC transaction. 1324 The requester endpoint must ensure that these memory segments are 1325 properly fenced from the responder before allowing Upper Layer access 1326 to the data contained in them. The data in such segments must be at 1327 rest while a responder has access to that memory. 1329 This includes segments that are associated with canceled RPCs. A 1330 responder cannot know that the requester is no longer waiting for a 1331 reply, and might proceed to read or even update memory that the 1332 requester has released for other use. 1334 5.4.2. Communicating DDP-Eligibility 1336 The interface by which an Upper Layer Protocol implementation 1337 communicates the eligibility of a data item to its local RPC- 1338 over-RDMA endpoint is out of scope for this specification. 1340 Depending on the implementation and constraints imposed by Upper 1341 Layer Bindings, it is possible to implement an RPC chunking facility 1342 that is transparent to upper layers.
Such implementations may lead 1343 to inefficiencies, either because they require the RPC layer to 1344 perform expensive registration and de-registration of memory "on the 1345 fly", or they may require using RDMA chunks in reply messages, along 1346 with the resulting additional handshaking with the RPC-over-RDMA 1347 peer. 1349 However, these issues are internal and generally confined to the 1350 local interface between RPC and its upper layers, one in which 1351 implementations are free to innovate. The only requirement is that 1352 the resulting RPC-over-RDMA protocol sent to the peer is valid for 1353 the upper layer. 1355 5.4.3. Registration Strategies 1357 The choice of which memory registration strategies to employ is left 1358 to requester and responder implementers. To support the widest array 1359 of RDMA implementations, as well as the most general steering tag 1360 scheme, an Offset field is included in each segment. 1362 While zero-based offset schemes are available in many RDMA 1363 implementations, their use by RPC requires individual registration of 1364 each segment. For such implementations, this can be a significant 1365 overhead. By providing an offset in each chunk, many pre- 1366 registration or region-based registrations can be readily supported. 1367 By using a single, universal chunk representation, the RPC-over-RDMA 1368 protocol implementation is simplified to its most general form. 1370 5.5. Handling Errors 1372 When a peer receives an RPC-over-RDMA message, it MUST perform basic 1373 validity checks on the header and chunk contents. If such errors are 1374 detected in the request, an RDMA_ERROR reply MUST be generated. 1376 Two types of errors are defined, version mismatch and invalid chunk 1377 format. 
1379 o When a responder detects an RPC-over-RDMA header version that it 1380 does not support (currently this document defines only Version 1381 One), it replies with an error code of ERR_VERS, and provides the 1382 low and high inclusive version numbers it does, in fact, support. 1383 The version number in this reply MUST be a value otherwise valid 1384 at the receiver. 1386 o When a responder detects other decoding errors in the header or 1387 chunks, one of the following errors MUST be returned: either an 1388 RPC decode error such as RPC_GARBAGEARGS, or the RPC-over-RDMA 1389 error code ERR_CHUNK. 1391 When a requester cannot parse a responder's reply, the requester 1392 SHOULD drop the RPC request and return an error to the application to 1393 prevent retransmission of an operation that can never complete. 1395 A requester might not provide a responder with enough resources to 1396 reply. For example, if a requester's receive buffer is too small, 1397 the responder's Send operation completes with a Local Length Error, 1398 and the connection is dropped. Or, if the requester's Reply chunk is 1399 too small to accommodate the whole RPC reply, the responder can tell 1400 as it is constructing the reply. The responder SHOULD send a reply 1401 with RDMA_ERROR to signal to the requester that no RPC-level reply is 1402 possible, and the XID should not be retried. 1404 It is assumed that the link itself will provide some degree of error 1405 detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer 1406 (when used over TCP), Stream Control Transmission Protocol (SCTP), as 1407 well as the InfiniBand link layer all provide Cyclic Redundancy Check 1408 (CRC) protection of the RDMA payload, and CRC-class protection is a 1409 general attribute of such transports. 1411 Additionally, the RPC layer itself can accept errors from the link 1412 level and recover via retransmission. RPC recovery can handle 1413 complete loss and re-establishment of the link.
The details of 1414 reporting and recovery from RDMA link layer errors are outside the 1415 scope of this protocol specification. 1417 See Section 10 for further discussion of the use of RPC-level 1418 integrity schemes to detect errors and related efficiency issues. 1420 5.6. XDR Language Description 1422 Code components extracted from this document must include the 1423 following license boilerplate. 1425 1426 /* 1427 * Copyright (c) 2010, 2015 IETF Trust and the persons 1428 * identified as authors of the code. All rights reserved. 1429 * 1430 * The authors of the code are: 1431 * B. Callaghan, T. Talpey, and C. Lever. 1432 * 1433 * Redistribution and use in source and binary forms, with 1434 * or without modification, are permitted provided that the 1435 * following conditions are met: 1436 * 1437 * - Redistributions of source code must retain the above 1438 * copyright notice, this list of conditions and the 1439 * following disclaimer. 1440 * 1441 * - Redistributions in binary form must reproduce the above 1442 * copyright notice, this list of conditions and the 1443 * following disclaimer in the documentation and/or other 1444 * materials provided with the distribution. 1445 * 1446 * - Neither the name of Internet Society, IETF or IETF 1447 * Trust, nor the names of specific contributors, may be 1448 * used to endorse or promote products derived from this 1449 * software without specific prior written permission. 1450 * 1451 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 1452 * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 1453 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 1454 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 1455 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO 1456 * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 1457 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 1458 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 1459 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 1460 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 1461 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 1462 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 1463 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 1464 * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 1465 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 1466 */ 1468 struct rpcrdma1_segment { 1469 uint32 rdma_handle; 1470 uint32 rdma_length; 1471 uint64 rdma_offset; 1472 }; 1473 struct rpcrdma1_read_segment { 1474 uint32 rdma_position; 1475 struct rpcrdma1_segment rdma_target; 1476 }; 1478 struct rpcrdma1_read_list { 1479 struct rpcrdma1_read_segment rdma_entry; 1480 struct rpcrdma1_read_list *rdma_next; 1481 }; 1483 struct rpcrdma1_write_chunk { 1484 struct rpcrdma1_segment rdma_target<>; 1485 }; 1487 struct rpcrdma1_write_list { 1488 struct rpcrdma1_write_chunk rdma_entry; 1489 struct rpcrdma1_write_list *rdma_next; 1490 }; 1492 struct rpcrdma1_msg { 1493 uint32 rdma_xid; 1494 uint32 rdma_vers; 1495 uint32 rdma_credit; 1496 rpcrdma1_body rdma_body; 1497 }; 1499 enum rpcrdma1_proc { 1500 RDMA_MSG = 0, 1501 RDMA_NOMSG = 1, 1502 RDMA_MSGP = 2, /* Reserved */ 1503 RDMA_DONE = 3, /* Reserved */ 1504 RDMA_ERROR = 4 1505 }; 1507 struct rpcrdma1_chunks { 1508 struct rpcrdma1_read_list *rdma_reads; 1509 struct rpcrdma1_write_list *rdma_writes; 1510 struct rpcrdma1_write_chunk *rdma_reply; 1511 }; 1513 enum rpcrdma1_errcode { 1514 RDMA_ERR_VERS = 1, 1515 RDMA_ERR_CHUNK = 2 1516 }; 1518 union rpcrdma1_error switch (rpcrdma1_errcode err) { 1519 case RDMA_ERR_VERS: 1520 uint32 rdma_vers_low; 1521 uint32 rdma_vers_high; 1522 case RDMA_ERR_CHUNK: 1523 void; 1524 }; 1526 union rpcrdma1_body switch (rpcrdma1_proc proc) { 1527 case RDMA_MSG: 1528 case
RDMA_NOMSG: 1529 rpcrdma1_chunks rdma_chunks; 1530 case RDMA_MSGP: 1531 uint32 rdma_align; 1532 uint32 rdma_thresh; 1533 rpcrdma1_chunks rdma_achunks; 1534 case RDMA_DONE: 1535 void; 1536 case RDMA_ERROR: 1537 rpcrdma1_error rdma_error; 1538 }; 1540 1542 5.7. Deprecated Protocol Elements 1544 5.7.1. RDMA_MSGP 1546 Implementers of RPC-over-RDMA Version One have neglected to make use 1547 of the RDMA_MSGP message type. Therefore RDMA_MSGP is deprecated. 1549 Senders SHOULD NOT send RDMA_MSGP type messages. Receivers MUST 1550 treat received RDMA_MSGP type messages as they would treat RDMA_MSG 1551 type messages. The additional alignment information is an 1552 optimization hint that may be ignored. 1554 5.7.2. RDMA_DONE 1556 Because implementations of RPC-over-RDMA Version One do not use the 1557 Read-Read transfer model, there should never be any need to send an 1558 RDMA_DONE type message. Therefore RDMA_DONE is deprecated. 1560 Receivers MUST drop RDMA_DONE type messages without additional 1561 processing. 1563 6. Upper Layer Binding Specifications 1565 Each RPC program and version tuple that operates on an RDMA transport 1566 MUST have an Upper Layer Binding specification. A ULB may be part of 1567 another protocol specification, or it may be a stand-alone document, 1568 similar to [RFC5667]. 1570 A ULB specification MUST provide four important pieces of 1571 information: 1573 o Which XDR data items in the RPC program are eligible for Direct 1574 Data Placement 1576 o How a responder utilizes chunks provided in a Write list 1578 o How DDP-eligibility violations are reported to peers 1580 o An rpcbind port assignment for operation of the RPC program on 1581 RDMA transports 1583 6.1. Determining DDP-Eligibility 1585 A DDP-eligible XDR data item is one that MAY be moved in a chunk. 1586 All other XDR data items MUST NOT be moved in a chunk that is part of 1587 a Short or Chunked Message, nor as a separate chunk in a Long 1588 Message. 
1590 Only an XDR data item that might benefit from Direct Data Placement 1591 should be transferred in a chunk. An Upper Layer Binding 1592 specification should consider an XDR data item for DDP-eligibility if 1593 the data item can be larger than a Send buffer, and meets at least 1594 one of the following criteria: 1596 o Is sensitive to page alignment (e.g., it would require pullup on 1597 the receiver before it can be used) 1599 o Is not translated or marshaled when it is XDR encoded (e.g., an 1600 opaque array) 1602 o Is not immediately used by applications (e.g., is part of data 1603 backup or replication) 1605 The Upper Layer Protocol implementation or the RDMA transport 1606 implementation decides when to move a DDP-eligible data item into a 1607 chunk instead of leaving the item in the RPC message's XDR stream. 1608 The interface by which an Upper Layer implementation communicates the 1609 chunk eligibility of a data item to the local RPC transport is out 1610 of scope for this specification. The only requirement is that the 1611 resulting RPC-over-RDMA protocol sent to the peer is valid for the 1612 Upper Layer. 1614 The XDR language definition of DDP-eligible data items is not 1615 decorated in any way. 1617 It is the responsibility of the protocol's Upper Layer Binding to 1618 specify DDP-eligibility rules so that if a DDP-eligible XDR data item 1619 is embedded within another, only one of these two objects is to be 1620 represented by a chunk. This ensures that the mapping from XDR 1621 position to the XDR object represented is unambiguous. 1623 6.2. Write List Ordering 1625 A requester constructs the Write list for an RPC transaction before 1626 the responder has formulated the transaction's reply. 1628 When there is only one result data item that is DDP-eligible, the 1629 requester appends only a single Write chunk to that Write list. If 1630 the responder populates that chunk with data, there is no ambiguity 1631 about which result is contained in it.
1633 However, an Upper Layer Binding MAY permit a reply where more than 1634 one result data item is DDP-eligible. For example, an NFSv4 COMPOUND 1635 reply is composed of individual NFSv4 operations, more than one of 1636 which can contain a DDP-eligible result. 1638 A requester provides multiple Write chunks when it expects the RPC 1639 reply to contain more than one data item that is DDP-eligible. 1640 Ambiguities can arise when replies contain XDR unions or arrays of 1641 complex data types that give a responder a choice about whether a 1642 DDP-eligible data item is included. 1644 Requester and responder must agree beforehand which data items appear 1645 in which Write chunk. Therefore an Upper Layer Binding MUST 1646 determine how Write list entries are mapped to procedure results 1647 for each Upper Layer procedure. 1649 6.3. DDP-Eligibility Violation 1651 If the Upper Layer on a receiver is not aware of the presence and 1652 operation of an RPC-over-RDMA transport under it, it could be 1653 challenging to discover when a sender has violated an Upper Layer 1654 Binding rule. 1656 If a violation does occur, RFC 5666 does not define an unambiguous 1657 mechanism for reporting the violation. The violation of Binding 1658 rules is an Upper Layer Protocol issue, but it is likely that there 1659 is nothing the Upper Layer can do but reply with the equivalent of 1660 BAD XDR. 1662 When an erroneously-constructed reply reaches a requester, there is 1663 no recourse but to drop the reply, and perhaps the transport 1664 connection as well. 1666 Policing DDP-eligibility must be done in co-operation with the Upper 1667 Layer Protocol by its receive endpoint implementation. 1669 It is the Upper Layer Binding's responsibility to specify how a 1670 responder must reply if a requester violates a DDP-eligibility rule. 1671 The Binding specification should provide similar guidance for 1672 requesters about handling invalid RPC-over-RDMA replies. 1674 6.4.
Other Binding Information 1676 An Upper Layer Binding may recommend inline threshold values for RPC- 1677 over-RDMA Version One connections bearing that Upper Layer Protocol. 1678 However, note that RPC-over-RDMA connections can be shared by more 1679 than one Upper Layer Protocol, and that an implementation may use the 1680 same inline threshold for all connections and Protocols that flow 1681 between two peers. 1683 If an Upper Layer Protocol specifies a method for exchanging inline 1684 threshold information, the sender can find out the receiver's 1685 threshold value only subsequent to establishing an RPC-over-RDMA 1686 connection. The new threshold value can take effect only when a new 1687 connection is established. 1689 7. RPC Bind Parameters 1691 In setting up a new RDMA connection, the first action by a requester 1692 is to obtain a transport address for the responder. The mechanism 1693 used to obtain this address, and to open an RDMA connection is 1694 dependent on the type of RDMA transport, and is the responsibility of 1695 each RPC protocol binding and its local implementation. 1697 RPC services normally register with a portmap or rpcbind [RFC1833] 1698 service, which associates an RPC program number with a service 1699 address. (In the case of UDP or TCP, the service address for NFS is 1700 normally port 2049.) This policy is no different with RDMA 1701 transports, although it may require the allocation of port numbers 1702 appropriate to each Upper Layer Protocol that uses the RPC framing 1703 defined here. 1705 When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses 1706 IP port addressing due to its layering on TCP and/or SCTP, port 1707 mapping is trivial and consists merely of issuing the port in the 1708 connection process. The NFS/RDMA protocol service address has been 1709 assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP. 
1711 When mapped atop InfiniBand [IB], which uses a Group Identifier 1712 (GID)-based service endpoint naming scheme, a translation MUST be 1713 employed. One such translation is defined in the InfiniBand Port 1714 Addressing Annex [IBPORT], which is appropriate for translating IP 1715 port addressing to the InfiniBand network. Therefore, in this case, 1716 IP port addressing may be readily employed by the upper layer. 1718 When a mapping standard or convention exists for IP ports on an RDMA 1719 interconnect, there are several possibilities for each upper layer to 1720 consider: 1722 o One possibility is to have the responder register its mapped IP 1723 port with the rpcbind service, under the netid (or netids) defined 1724 here. An RPC-over-RDMA-aware requester can then resolve its 1725 desired service to a mappable port, and proceed to connect. This 1726 is the most flexible and compatible approach, for those upper 1727 layers that are defined to use the rpcbind service. 1729 o A second possibility is to have the responder's portmapper 1730 register itself on the RDMA interconnect at a "well known" service 1731 address. (On UDP or TCP, this corresponds to port 111.) A 1732 requester could connect to this service address and use the 1733 portmap protocol to obtain a service address in response to a 1734 program number, e.g., an iWARP port number, or an InfiniBand GID. 1736 o Alternatively, the requester could simply connect to the mapped 1737 well-known port for the service itself, if it is appropriately 1738 defined. By convention, the NFS/RDMA service, when operating atop 1739 such an InfiniBand fabric, will use the same 20049 assignment as 1740 for iWARP. 1742 Historically, different RPC protocols have taken different approaches 1743 to their port assignment; therefore, the specific method is left to 1744 each RPC-over-RDMA-enabled Upper Layer binding, and not addressed 1745 here.
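When a responder does register its port with rpcbind, the service address is expressed as a "universal address" (uaddr). For the netids defined by this specification, the uaddr takes the same form as for TCP and UDP per the conventions of [RFC5665]: the IP address followed by the two octets of the port, dot-separated. A minimal sketch (the helper name is hypothetical, not part of any specification):

```python
def rdma_uaddr(ip: str, port: int) -> str:
    """Format an rpcbind universal address (uaddr): the IP address
    followed by the port's high and low octets, dot-separated."""
    return "%s.%u.%u" % (ip, port >> 8, port & 0xFF)

# The IANA-assigned NFS/RDMA port 20049 encodes as octets 78 and 81:
# rdma_uaddr("192.0.2.1", 20049) == "192.0.2.1.78.81"
```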
1747 In Section 11, "IANA Considerations", this specification defines two 1748 new "netid" values, to be used for registration of upper layers atop 1749 iWARP [RFC5040] [RFC5041] and (when a suitable port translation 1750 service is available) InfiniBand [IB]. Additional RDMA-capable 1751 networks MAY define their own netids, or if they provide a port 1752 translation, MAY share the ones defined here. 1754 8. Bi-Directional RPC-Over-RDMA 1755 8.1. RPC Direction 1757 8.1.1. Forward Direction 1759 A traditional ONC RPC client is always a requester. A traditional 1760 ONC RPC service is always a responder. This traditional form of ONC 1761 RPC message passing is referred to as operation in the "forward 1762 direction." 1764 During forward direction operation, the ONC RPC client is responsible 1765 for establishing transport connections. 1767 8.1.2. Backward Direction 1769 The ONC RPC standard does not forbid passing messages in the other 1770 direction. An ONC RPC service endpoint can act as a requester, in 1771 which case an ONC RPC client endpoint acts as a responder. This form 1772 of message passing is referred to as operation in the "backward 1773 direction." 1775 During backward direction operation, the ONC RPC client is 1776 responsible for establishing transport connections, even though ONC 1777 RPC Calls come from the ONC RPC server. 1779 8.1.3. Bi-direction 1781 A pair of endpoints may choose to use only forward or only backward 1782 direction operations on a particular transport. Or, the endpoints 1783 may send operations in both directions concurrently on the same 1784 transport. 1786 Bi-directional operation occurs when both transport endpoints act as 1787 a requester and a responder at the same time. As above, the ONC RPC 1788 client is responsible for establishing transport connections. 1790 8.1.4. XIDs with Bi-direction 1792 During bi-directional operation, the forward and backward directions 1793 use independent xid spaces.
1795 In other words, a forward direction requester MAY use the same xid 1796 value at the same time as a backward direction requester on the same 1797 transport connection, but such concurrent requests represent 1798 distinct ONC RPC transactions. 1800 8.2. Backward Direction Flow Control 1802 8.2.1. Backward RPC-over-RDMA Credits 1804 Credits work the same way in the backward direction as they do in the 1805 forward direction. However, forward direction credits and backward 1806 direction credits are accounted separately. 1808 In other words, the forward direction credit value is the same 1809 whether or not there are backward direction resources associated with 1810 an RPC-over-RDMA transport connection. The backward direction credit 1811 value MAY be different than the forward direction credit value. The 1812 rdma_credit field in a backward direction RPC-over-RDMA message MUST 1813 NOT contain the value zero. 1815 A backward direction requester (an RPC-over-RDMA service endpoint) 1816 requests credits from the Responder (an RPC-over-RDMA client 1817 endpoint). The Responder reports how many credits it can grant. 1818 This is the number of backward direction Calls the Responder is 1819 prepared to handle at once. 1821 When an RPC-over-RDMA server endpoint is operating correctly, it 1822 sends no more outstanding requests at a time than the client 1823 endpoint's advertised backward direction credit value. 1825 8.2.2. Receive Buffer Management 1827 An RPC-over-RDMA transport endpoint must pre-post receive buffers 1828 before it can receive and process incoming RPC-over-RDMA messages. 1829 If a sender transmits a message for a receiver that has no posted 1830 receive buffer, the RDMA provider is allowed to drop the RDMA 1831 connection. 1833 8.2.2.1. Client Receive Buffers 1835 Typically an RPC-over-RDMA caller posts only as many receive buffers 1836 as there are outstanding RPC Calls.
A client endpoint without 1837 backward direction support might therefore at times have no pre- 1838 posted receive buffers. 1840 To receive incoming backward direction Calls, an RPC-over-RDMA client 1841 endpoint must pre-post enough additional receive buffers to match its 1842 advertised backward direction credit value. Each outstanding forward 1843 direction RPC requires an additional receive buffer above this 1844 minimum. 1846 When an RDMA transport connection is lost, all active receive buffers 1847 are flushed and are no longer available to receive incoming messages. 1849 When a fresh transport connection is established, a client endpoint 1850 must re-post a receive buffer to handle the Reply for each 1851 retransmitted forward direction Call, and a full set of receive 1852 buffers to handle backward direction Calls. 1854 8.2.2.2. Server Receive Buffers 1856 A forward direction RPC-over-RDMA service endpoint posts as many 1857 receive buffers as it expects to receive incoming forward direction 1858 Calls. That is, it posts no fewer buffers than the number of 1859 RPC-over-RDMA credits it advertises in the rdma_credit field of 1860 forward direction RPC replies. 1862 To receive incoming backward direction replies, an RPC-over-RDMA 1863 server endpoint must pre-post a receive buffer for each backward 1864 direction Call it sends. 1866 When the existing transport connection is lost, all active receive 1867 buffers are flushed and are no longer available to receive incoming 1868 messages. When a fresh transport connection is established, a server 1869 endpoint must re-post a receive buffer to handle the Reply for each 1870 retransmitted backward direction Call, and a full set of receive 1871 buffers for receiving forward direction Calls. 1873 8.3. Conventions For Backward Operation 1875 8.3.1. In the Absence of Backward Direction Support 1877 An RPC-over-RDMA transport endpoint might not support backward 1878 direction operation.
There might be no mechanism in the transport 1879 implementation to do so, or the Upper Layer Protocol consumer might 1880 not yet have configured the transport to handle backward direction 1881 traffic. 1883 A loss of the RDMA connection may result if the receiver is not 1884 prepared to receive an incoming message. Thus a denial-of-service 1885 could result if a sender continues to send backchannel messages after 1886 every transport reconnect to an endpoint that is not prepared to 1887 receive them. 1889 For RPC-over-RDMA Version One transports, the Upper Layer Protocol is 1890 responsible for informing its peer when it has established a backward 1891 direction capability. Otherwise even a simple backward direction 1892 NULL probe from a peer would result in a lost connection. 1894 An Upper Layer Protocol consumer MUST NOT perform backward direction 1895 ONC RPC operations unless the peer consumer has indicated it is 1896 prepared to handle them. A description of Upper Layer Protocol 1897 mechanisms used for this indication is outside the scope of this 1898 document. 1900 8.3.2. Backward Direction Retransmission 1902 In rare cases, an ONC RPC transaction cannot be completed within a 1903 certain time. This can be because the transport connection was lost, 1904 the Call or Reply message was dropped, or because the Upper Layer 1905 consumer delayed or dropped the ONC RPC request. Typically, the 1906 requester sends the transaction again, reusing the same RPC XID. 1907 This is known as an "RPC retransmission". 1909 In the forward direction, the Caller is the ONC RPC client. The 1910 client is always responsible for establishing a transport connection 1911 before sending again. 1913 In the backward direction, the Caller is the ONC RPC server. Because 1914 an ONC RPC server does not establish transport connections with 1915 clients, it cannot send a retransmission if there is no transport 1916 connection. 
It must wait for the ONC RPC client to re-establish the 1917 transport connection before it can retransmit ONC RPC transactions in 1918 the backward direction. 1920 If an ONC RPC client has no work to do, it may be some time before it 1921 re-establishes a transport connection. Backward direction Callers 1922 must be prepared to wait indefinitely for a connection to be 1923 established before a pending backward direction ONC RPC Call can be 1924 retransmitted. 1926 8.3.3. Backward Direction Message Size 1928 RPC-over-RDMA backward direction messages are transmitted and 1929 received using the same buffers as messages in the forward direction. 1930 Therefore they are constrained to be no larger than receive buffers 1931 posted for forward messages. Typical implementations have chosen to 1932 use 1024-byte buffers. 1934 It is expected that the Upper Layer Protocol consumer establishes an 1935 appropriate payload size limit for backward direction operations, 1936 either by advertising that size limit to its peers, or by convention. 1937 If that is done, backward direction messages do not exceed the size 1938 of receive buffers at either endpoint. 1940 If a sender transmits a backward direction message that is larger 1941 than the receiver is prepared for, the RDMA provider drops the 1942 message and the RDMA connection. 1944 If a sender transmits an RDMA message that is too small to convey a 1945 complete and valid RPC-over-RDMA and RPC message in either direction, 1946 the receiver MUST NOT use any value in the fields that were 1947 transmitted. Namely, the rdma_credit field MUST be ignored, and the 1948 message dropped. 1950 8.3.4. Sending A Backward Direction Call 1952 To form a backward direction RPC-over-RDMA Call message on an RPC- 1953 over-RDMA Version One transport, an ONC RPC service endpoint 1954 constructs an RPC-over-RDMA header containing a fresh RPC XID in the 1955 rdma_xid field. 1957 The rdma_vers field MUST contain the value one.
The number of 1958 requested credits is placed in the rdma_credit field. 1960 The rdma_proc field in the RPC-over-RDMA header MUST contain the 1961 value RDMA_MSG. All three chunk lists MUST be empty. 1963 The ONC RPC Call header MUST follow immediately, starting with the 1964 same XID value that is present in the RPC-over-RDMA header. The Call 1965 header's msg_type field MUST contain the value CALL. 1967 8.3.5. Sending A Backward Direction Reply 1969 To form a backward direction RPC-over-RDMA Reply message on an RPC- 1970 over-RDMA Version One transport, an ONC RPC client endpoint 1971 constructs an RPC-over-RDMA header containing a copy of the matching 1972 ONC RPC Call's RPC XID in the rdma_xid field. 1974 The rdma_vers field MUST contain the value one. The number of 1975 granted credits is placed in the rdma_credit field. 1977 The rdma_proc field in the RPC-over-RDMA header MUST contain the 1978 value RDMA_MSG. All three chunk lists MUST be empty. 1980 The ONC RPC Reply header MUST follow immediately, starting with the 1981 same XID value that is present in the RPC-over-RDMA header. The 1982 Reply header's msg_type field MUST contain the value REPLY. 1984 8.4. Backward Direction Upper Layer Binding 1986 RPC programs that operate on RPC-over-RDMA Version One only in the 1987 backward direction do not require an Upper Layer Binding 1988 specification. Because RPC-over-RDMA Version One operation in the 1989 backward direction does not allow chunking, there can be no DDP- 1990 eligible data items in such a program. Backward direction operation 1991 occurs on an already-established connection, thus there is no need to 1992 specify RPC bind parameters. 1994 9. Transport Protocol Extensibility 1996 RPC programs are defined solely by their XDR definitions. They are 1997 independent of the transport mechanism used to convey base RPC 1998 messages. Protocols defined by XDR often have significant 1999 extensibility restrictions placed on them.
2001 Not all extensibility restrictions on RPC-based Upper Layer Protocols 2002 may be appropriate for an RPC transport protocol. TCP [RFC0793], for 2003 example, is an RPC transport protocol that has been extended many 2004 times independently of the RPC and XDR standards. 2006 However, RPC-over-RDMA might be considered an extension of the RPC 2007 protocol rather than a separate transport, for the following reasons: 2009 o The mechanisms that TCP uses to move data are opaque to the RPC 2010 implementation and RPC programs using it. Upper Layer Protocols 2011 are often aware that RPC-over-RDMA is present, as they identify 2012 data items that can be moved via direct data placement. 2014 o RPC-over-RDMA is used only for moving RPC messages, and never 2015 for generic data transfer. 2017 o RPC-over-RDMA relies on a more sophisticated set of base transport 2018 operations than traditional socket-based transports. 2019 Interoperability depends on RPC-over-RDMA implementations using 2020 these operations in a predictable way. 2022 o The RPC-over-RDMA header is specified using XDR, unlike other RPC 2023 transport protocols. 2025 9.1. Bumping The RPC-over-RDMA Version 2027 Placeholder section. 2029 Because the version number is encoded as part of the RPC-over-RDMA 2030 header and the RDMA_ERROR message type is used to indicate errors, 2031 these first four fields and the start of the chunk lists MUST always 2032 remain aligned at the same fixed offsets for all versions of the RPC- 2033 over-RDMA header. 2035 The value of the RPC-over-RDMA header's version field MUST be changed: 2036 o Whenever the on-the-wire format of the RPC-over-RDMA header is 2037 changed in a way that prevents interoperability with current 2038 implementations 2040 o Whenever the set of abstract RDMA operations that may be used is 2041 changed 2043 o Whenever the set of allowable transfer models is altered 2045 10. Security Considerations 2047 10.1.
Memory Protection 2049 A primary consideration is the protection of the integrity and 2050 privacy of local memory by an RPC-over-RDMA transport. The use of 2051 RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory 2052 contents, nor to memory owned by user processes. 2054 It is REQUIRED that any RDMA provider used for RPC transport be 2055 conformant to the requirements of [RFC5042] in order to satisfy these 2056 protections. These protections are provided by the RDMA layer 2057 specifications, and specifically their security models. 2059 o The use of Protection Domains to limit the exposure of memory 2060 segments to a single connection is critical. Any attempt by a 2061 host not participating in that connection to re-use handles will 2062 result in a connection failure. Because Upper Layer Protocol 2063 security mechanisms rely on this aspect of Reliable Connection 2064 behavior, strong authentication of the remote is recommended. 2066 o Unpredictable memory handles should be used for any operation 2067 requiring advertised memory segments. Advertising a continuously 2068 registered memory region allows a remote host to read or write to 2069 that region even when an RPC involving that memory is not under 2070 way. Therefore advertising persistently registered memory should 2071 be avoided. 2073 o Advertised memory segments should be invalidated as soon as 2074 related RPC operations are complete. Invalidation and DMA 2075 unmapping of segments should be complete before an RPC application 2076 is allowed to continue execution and use the contents of a memory 2077 region. 2079 10.2. Using GSS With RPC-Over-RDMA 2081 ONC RPC provides its own security via the RPCSEC_GSS framework 2082 [RFC2203]. RPCSEC_GSS can provide message authentication, integrity 2083 checking, and privacy. This security mechanism is unaffected by the 2084 RDMA transport. 
However, there is much data movement associated with 2085 computation and verification of integrity, or encryption/decryption, 2086 so certain performance advantages may be lost. 2088 For efficiency, a more appropriate security mechanism for RDMA links 2089 may be link-level protection, such as certain configurations of 2090 IPsec, which may be co-located in the RDMA hardware. The use of 2091 link-level protection MAY be negotiated through the use of the 2092 RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the 2093 Channel Binding mechanism [RFC5056] and IPsec Channel Connection 2094 Latching [RFC5660]. Use of such mechanisms is REQUIRED where 2095 integrity and/or privacy is desired, and where efficiency is 2096 required. 2098 Once delivered securely by the RDMA provider, any RDMA-exposed memory 2099 will contain only RPC payloads in the chunk lists, transferred under 2100 the protection of RPCSEC_GSS integrity and privacy. By these means, 2101 the data will be protected end-to-end, as required by the RPC layer 2102 security model. 2104 11. IANA Considerations 2106 Three new assignments are specified by this document: 2108 - A new set of RPC "netids" for resolving RPC-over-RDMA services 2110 - Optional service port assignments for Upper Layer Bindings 2112 - An RPC program number assignment for the configuration protocol 2114 These assignments have been established, as below. 2116 The new RPC transport has been assigned an RPC "netid", which is an 2117 rpcbind [RFC1833] string used to describe the underlying protocol in 2118 order for RPC to select the appropriate transport framing, as well as 2119 the format of the service addresses and ports. 
2121 The following "Netid" registry strings are defined for this purpose: 2123 NC_RDMA "rdma" 2124 NC_RDMA6 "rdma6" 2126 These netids MAY be used for any RDMA network satisfying the 2127 requirements of Section 2, and able to identify service endpoints 2128 using IP port addressing, possibly through use of a translation 2129 service as described above in Section 10, "RPC Binding". The "rdma" 2130 netid is to be used when IPv4 addressing is employed by the 2131 underlying transport, and "rdma6" for IPv6 addressing. 2133 The netid assignment policy and registry are defined in [RFC5665]. 2135 As a new RPC transport, this protocol has no effect on RPC program 2136 numbers or existing registered port numbers. However, new port 2137 numbers MAY be registered for use by RPC-over-RDMA-enabled services, 2138 as appropriate to the new networks over which the services will 2139 operate. 2141 For example, the NFS/RDMA service defined in [RFC5667] has been 2142 assigned the port 20049, in the IANA registry: 2144 nfsrdma 20049/tcp Network File System (NFS) over RDMA 2145 nfsrdma 20049/udp Network File System (NFS) over RDMA 2146 nfsrdma 20049/sctp Network File System (NFS) over RDMA 2148 The RPC program number assignment policy and registry are defined in 2149 [RFC5531]. 2151 12. Acknowledgments 2153 The editor gratefully acknowledges the work of Brent Callaghan and 2154 Tom Talpey on the original RPC-over-RDMA Version One specification 2155 [RFC5666]. 2157 The comments and contributions of Karen Deitke, Dai Ngo, Chunli 2158 Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with 2159 many and great thanks. The editor also wishes to thank Dave Noveck 2160 and Bill Baker for their unwavering support of this work. 2162 Special thanks go to nfsv4 Working Group Chair Spencer Shepler and 2163 nfsv4 Working Group Secretary Thomas Haynes for their support. 2165 13. Appendices 2167 13.1. Appendix 1: XDR Examples 2169 RPC-over-RDMA chunk lists are complex data types. 
In this appendix, 2170 illustrations are provided to help readers grasp how chunk lists are 2171 represented inside an RPC-over-RDMA header. 2173 An RDMA segment is the simplest component, being made up of a 32-bit 2174 handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once 2175 flattened into an XDR stream, RDMA segments appear as 2176 HLOO 2178 A Read segment has an additional 32-bit position field. Read 2179 segments appear as 2181 PHLOO 2183 A Read chunk is a list of Read segments. Each segment is preceded by 2184 a 32-bit word containing a one if there is a segment, or a zero if 2185 there are no more segments (optional-data). In XDR form, this would 2186 look like 2188 1 PHLOO 1 PHLOO 1 PHLOO 0 2190 where P would hold the same value for each segment belonging to the 2191 same Read chunk. 2193 The Read List is also a list of Read segments. In XDR form, this 2194 would look a lot like a Read chunk, except that the P values could 2195 vary across the list. An empty Read List is encoded as a single 2196 32-bit zero. 2198 One Write chunk is a counted array of segments. In XDR form, the 2199 count would appear as the first 32-bit word, followed by an HLOO for 2200 each element of the array. For instance, a Write chunk with three 2201 elements would look like 2203 3 HLOO HLOO HLOO 2205 The Write List is a list of counted arrays. In XDR form, this is a 2206 combination of optional-data and counted arrays. To represent a 2207 Write List containing a Write chunk with three segments and a Write 2208 chunk with two segments, XDR would encode 2210 1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0 2212 An empty Write List is encoded as a single 32-bit zero. 2214 The Reply chunk is the same as a Write chunk. Since it is an 2215 optional-data field, however, there is a 32-bit field in front of it 2216 that contains a one if the Reply chunk is present, or a zero if it is 2217 not.
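The flattened forms described in this appendix can also be illustrated with a short Python sketch. The sketch is illustrative only: the helper names are inventions of this appendix, not part of the protocol, and each XDR word is packed big-endian with the standard struct module.

```python
import struct

def xdr_u32(v):
    # XDR unsigned int: one big-endian 32-bit word
    return struct.pack(">I", v)

def rdma_segment(handle, length, offset):
    # H L OO: 32-bit handle, 32-bit length, 64 bits of offset
    return struct.pack(">IIQ", handle, length, offset)

def read_segment(position, handle, length, offset):
    # P H L OO: an RDMA segment preceded by a 32-bit position
    return xdr_u32(position) + rdma_segment(handle, length, offset)

def read_list(segments):
    # XDR optional-data: "1 PHLOO" per entry, terminated by a zero word
    out = b""
    for (p, h, l, o) in segments:
        out += xdr_u32(1) + read_segment(p, h, l, o)
    return out + xdr_u32(0)

def write_chunk(segments):
    # Counted array: a 32-bit count, then one HLOO per element
    out = xdr_u32(len(segments))
    for (h, l, o) in segments:
        out += rdma_segment(h, l, o)
    return out

def write_list(chunks):
    # Optional-data list of counted arrays, terminated by a zero word
    out = b""
    for chunk in chunks:
        out += xdr_u32(1) + write_chunk(chunk)
    return out + xdr_u32(0)

def reply_chunk(segments):
    # Optional-data: "1" plus a counted array if present, else "0"
    if segments is None:
        return xdr_u32(0)
    return xdr_u32(1) + write_chunk(segments)

# A requester offering no chunks encodes, after the four fixed header
# fields, three 32-bit words containing zero:
no_chunks = read_list([]) + write_list([]) + reply_chunk(None)
assert no_chunks == b"\x00" * 12
```

For example, write_list([[s1, s2, s3], [s4, s5]]) yields the "1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0" form shown above.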
After encoding, a Reply chunk with 2 segments would look like 2219 1 2 HLOO HLOO 2221 Frequently a requester does not provide any chunks. In that case, 2222 after the four fixed fields in the RPC-over-RDMA header, there are 2223 simply three 32-bit fields that contain zero. 2225 14. References 2227 14.1. Normative References 2229 [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 2230 RFC 1833, DOI 10.17487/RFC1833, August 1995, 2231 . 2233 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2234 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ 2235 RFC2119, March 1997, 2236 . 2238 [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 2239 Specification", RFC 2203, DOI 10.17487/RFC2203, September 2240 1997, . 2242 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 2243 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2244 2006, . 2246 [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement 2247 Protocol (DDP) / Remote Direct Memory Access Protocol 2248 (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2249 2007, . 2251 [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure 2252 Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007, 2253 . 2255 [RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI 2256 10.17487/RFC5403, February 2009, 2257 . 2259 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 2260 Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, 2261 May 2009, . 2263 [RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC 2264 5660, DOI 10.17487/RFC5660, October 2009, 2265 . 2267 [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call 2268 (RPC) Network Identifiers and Universal Address Formats", 2269 RFC 5665, DOI 10.17487/RFC5665, January 2010, 2270 . 2272 [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access 2273 Transport for Remote Procedure Call", RFC 5666, DOI 2274 10.17487/RFC5666, January 2010, 2275 . 
2277 14.2. Informative References 2279 [IB] InfiniBand Trade Association, "InfiniBand Architecture 2280 Specifications", . 2282 [IBPORT] InfiniBand Trade Association, "IP Addressing Annex", 2283 . 2285 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 2286 793, DOI 10.17487/RFC0793, September 1981, 2287 . 2289 [RFC1094] Nowicki, B., "NFS: Network File System Protocol 2290 specification", RFC 1094, DOI 10.17487/RFC1094, March 2291 1989, . 2293 [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS 2294 Version 3 Protocol Specification", RFC 1813, DOI 10.17487/ 2295 RFC1813, June 1995, 2296 . 2298 [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 2299 Garcia, "A Remote Direct Memory Access Protocol 2300 Specification", RFC 5040, DOI 10.17487/RFC5040, October 2301 2007, . 2303 [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct 2304 Data Placement over Reliable Transports", RFC 5041, DOI 2305 10.17487/RFC5041, October 2007, 2306 . 2308 [RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS) 2309 Remote Direct Memory Access (RDMA) Problem Statement", RFC 2310 5532, DOI 10.17487/RFC5532, May 2009, 2311 . 2313 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 2314 "Network File System (NFS) Version 4 Minor Version 1 2315 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 2316 . 2318 [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) 2319 Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, 2320 January 2010, . 2322 [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System 2323 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, 2324 March 2015, . 
2326 Authors' Addresses 2328 Charles Lever (editor) 2329 Oracle Corporation 2330 1015 Granger Avenue 2331 Ann Arbor, MI 48104 2332 USA 2334 Phone: +1 734 274 2396 2335 Email: chuck.lever@oracle.com 2337 William Allen Simpson 2338 DayDreamer 2339 1384 Fontaine 2340 Madison Heights, MI 48071 2341 USA 2343 Email: william.allen.simpson@gmail.com 2345 Tom Talpey 2346 Microsoft Corp. 2347 One Microsoft Way 2348 Redmond, WA 98052 2349 USA 2351 Phone: +1 425 704-9945 2352 Email: ttalpey@microsoft.com