2   NFSv4 Working Group                                       Tom Talpey
3   Internet-Draft                                                NetApp
4   Intended status: Standards Track                     Brent Callaghan
5   Expires: June 3, 2009                                          Apple
6                                                       December 3, 2008

8     Remote Direct Memory Access Transport for Remote Procedure Call
9                        draft-ietf-nfsv4-rpcrdma-09

11  Status of this Memo

13  This Internet-Draft is submitted to IETF in full conformance with
14  the provisions of BCP 78 and BCP 79.

16  Internet-Drafts are working documents of the Internet Engineering
17  Task Force (IETF), its areas, and its working groups.  Note that
18  other groups may also distribute working documents as Internet-
19  Drafts.

21  Internet-Drafts are draft documents valid for a maximum of six
22  months and may be updated, replaced, or obsoleted by other
23  documents at any time.  It is inappropriate to use Internet-Drafts
24  as reference material or to cite them other than as "work in
25  progress."

27  The list of current Internet-Drafts can be accessed at
28  http://www.ietf.org/ietf/1id-abstracts.txt.
30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on June 3, 2009. 35 Abstract 37 A protocol is described providing Remote Direct Memory Access 38 (RDMA) as a new transport for Remote Procedure Call (RPC). The 39 RDMA transport binding conveys the benefits of efficient, bulk data 40 transport over high speed networks, while providing for minimal 41 change to RPC applications and with no required revision of the 42 application RPC protocol, or the RPC protocol itself. 44 Table of Contents 46 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 47 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 48 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 49 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 50 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 51 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 52 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 53 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10 54 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 55 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 56 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 57 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 58 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 59 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 60 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 61 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 62 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 63 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 64 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 65 6. Connection Configuration Protocol . . . . . . . . . . . . 25 66 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 67 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 68 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 69 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 70 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 71 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 72 11. Security Considerations . . . . . . . . . . . . . . . . . 30 73 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 31 74 13. Acknowledgments . . . . . . . . . . . . . . . . . . . . . 33 75 14. Normative References . . . . . . . . . . . . . . . . . . 33 76 15. Informative References . . . . . . . . . . . . . . . . . 34 77 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 35 78 Intellectual Property Statement . . . . . . . . . . . . . . . 35 79 Disclaimer of Validity . . . . . . . . . . . . . . . . . . . . 36 80 Copyright Statement . . . . . . . . . . . . . . . . . . . . . 36 82 Requirements Language 84 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 85 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 86 this document are to be interpreted as described in [RFC2119]. 88 1. Introduction 90 Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a 91 technique for efficient movement of data between end nodes, which 92 becomes increasingly compelling over high speed transports. 
By 93 directing data into destination buffers as it is sent on a network, 94 and placing it via direct memory access by hardware, the double 95 benefit of faster transfers and reduced host overhead is obtained. 97 Open Network Computing Remote Procedure Call (ONC RPC, or simply, 98 RPC) [RFC1831bis] is a remote procedure call protocol that has been 99 run over a variety of transports. Most RPC implementations today 100 use UDP or TCP. RPC messages are defined in terms of an eXternal 101 Data Representation (XDR) [RFC4506] which provides a canonical data 102 representation across a variety of host architectures. An XDR data 103 stream is conveyed differently on each type of transport. On UDP, 104 RPC messages are encapsulated inside datagrams, while on a TCP byte 105 stream, RPC messages are delineated by a record marking protocol. 106 An RDMA transport also conveys RPC messages in a unique fashion 107 that must be fully described if client and server implementations 108 are to interoperate. 110 RDMA transports present new semantics unlike the behaviors of 111 either UDP or TCP alone. They retain message delineations like UDP 112 while also providing a reliable, sequenced data transfer like TCP. 113 And, they provide the new efficient, bulk transfer service of RDMA. 114 RDMA transports are therefore naturally viewed as a new transport 115 type by RPC. 117 RDMA as a transport will benefit the performance of RPC protocols 118 that move large "chunks" of data, since RDMA hardware excels at 119 moving data efficiently between host memory and a high speed 120 network with little or no host CPU involvement. In this context, 121 the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] 122 [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem 123 statement is discussed in [NFSRDMAPS], and related NFSv4 issues are 124 discussed in [NFSv4.1]. Many other RPC-based protocols will also 125 benefit. 127 Although the RDMA transport described here provides relatively 128 transparent support for any RPC application, the proposal goes 129 further in describing mechanisms that can optimize the use of RDMA 130 with more active participation by the RPC application. 132 2. Abstract RDMA Requirements 134 An RPC transport is responsible for conveying an RPC message from a 135 sender to a receiver. An RPC message is either an RPC call from a 136 client to a server, or an RPC reply from the server back to the 137 client. An RPC message contains an RPC call header followed by 138 arguments if the message is an RPC call, or an RPC reply header 139 followed by results if the message is an RPC reply. The call 140 header contains a transaction ID (XID) followed by the program and 141 procedure number as well as a security credential. An RPC reply 142 header begins with an XID that matches that of the RPC call 143 message, followed by a security verifier and results. All data in 144 an RPC message is XDR encoded. For a complete description of the 145 RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. 147 This protocol assumes the following abstract model for RDMA 148 transports. These terms, common in the RDMA lexicon, are used in 149 this document. A more complete glossary of RDMA terms can be found 150 in [RFC5040]. 152 o Registered Memory 153 All data moved via tagged RDMA operations is resident in 154 registered memory at its destination. 
This protocol assumes 155 that each segment of registered memory MUST be identified with 156 a steering tag of no more than 32 bits and memory addresses of 157 up to 64 bits in length. 159 o RDMA Send 160 The RDMA provider supports an RDMA Send operation with 161 completion signalled at the receiver when data is placed in a 162 pre-posted buffer. The amount of transferred data is limited 163 only by the size of the receiver's buffer. Sends complete at 164 the receiver in the order they were issued at the sender. 166 o RDMA Write 167 The RDMA provider supports an RDMA Write operation to directly 168 place data in the receiver's buffer. An RDMA Write is 169 initiated by the sender and completion is signalled at the 170 sender. No completion is signalled at the receiver. The 171 sender uses a steering tag, memory address and length of the 172 remote destination buffer. RDMA Writes are not necessarily 173 ordered with respect to one another, but are ordered with 174 respect to RDMA Sends; a subsequent RDMA Send completion 175 obtained at the receiver guarantees that prior RDMA Write data 176 has been successfully placed in the receiver's memory. 178 o RDMA Read 179 The RDMA provider supports an RDMA Read operation to directly 180 place peer source data in the requester's buffer. An RDMA 181 Read is initiated by the receiver and completion is signalled 182 at the receiver. The receiver provides steering tags, memory 183 addresses and a length for the remote source and local 184 destination buffers. Since the peer at the data source 185 receives no notification of RDMA Read completion, there is an 186 assumption that on receiving the data the receiver will signal 187 completion with an RDMA Send message, so that the peer can 188 free the source buffers and the associated steering tags. 190 This protocol is designed to be carried over all RDMA transports 191 meeting the stated requirements. This protocol conveys to the RPC 192 peer, information sufficient for that RPC peer to direct an RDMA 193 layer to perform transfers containing RPC data, and to communicate 194 their result(s). For example, it is readily carried over RDMA 195 transports such as iWARP [RFC5040, RFC5041] or Infiniband [IB]. 197 3. Protocol Outline 199 An RPC message can be conveyed in identical fashion, whether it is 200 a call or reply message. In each case, the transmission of the 201 message proper is preceded by transmission of a transport-specific 202 header for use by RPC over RDMA transports. This header is 203 analogous to the record marking used for RPC over TCP, but is more 204 extensive, since RDMA transports support several modes of data 205 transfer and it is important to allow the upper layer protocol to 206 specify the most efficient mode for each of the segments in a 207 message. Multiple segments of a message may thereby be transferred 208 in different ways to different remote memory destinations. 210 All transfers of a call or reply begin with an RDMA Send which 211 transfers at least the RPC over RDMA header, usually with the call 212 or reply message appended, or at least some part thereof. Because 213 the size of what may be transmitted via RDMA Send is limited by the 214 size of the receiver's pre-posted buffer, the RPC over RDMA 215 transport provides a number of methods to reduce the amount 216 transferred by means of the RDMA Send, when necessary, by 217 transferring various parts of the message using RDMA Read and RDMA 218 Write. 
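The abstract model above can be restated as a brief C sketch. The type and function names are hypothetical and imply no particular RDMA provider API; the sketch only records that a registered segment is described by a steering tag of at most 32 bits, an address of up to 64 bits and a length, and that Send, Write and Read differ in where completion is signalled.

   /* Illustrative only: hypothetical names, not a provider API. */
   #include <stdint.h>
   #include <stddef.h>

   /* A segment of registered memory as assumed by this protocol. */
   struct rdma_segment {
       uint32_t stag;      /* steering tag, no more than 32 bits */
       uint64_t addr;      /* memory address, up to 64 bits      */
       uint32_t length;    /* length of the segment in bytes     */
   };

   /* Send: data lands in a pre-posted buffer at the peer; completion
    * is signalled at the receiver.  Sends complete at the receiver in
    * the order they were issued at the sender. */
   int rdma_send(void *conn, const void *buf, size_t len);

   /* Write: the sender places data directly in the peer's registered
    * memory; completion is signalled only at the sender. */
   int rdma_write(void *conn, const struct rdma_segment *dst,
                  const void *src, size_t len);

   /* Read: the receiver pulls data from the peer's registered memory;
    * completion is signalled only at the receiver, so a later Send is
    * needed to tell the peer that its source buffers may be freed. */
   int rdma_read(void *conn, const struct rdma_segment *src,
                 void *dst, size_t len);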
220 RPC over RDMA framing replaces all other RPC framing (such as TCP 221 record marking) when used atop an RPC/RDMA association, even though 222 the underlying RDMA protocol may itself be layered atop a protocol 223 with a defined RPC framing (such as TCP). It is however possible 224 for RPC/RDMA to be dynamically enabled, in the course of 225 negotiating the use of RDMA via an upper layer exchange. Because 226 RPC framing delimits an entire RPC request or reply, the resulting 227 shift in framing must occur between distinct RPC messages, and in 228 concert with the transport. 230 3.1. Short Messages 232 Many RPC messages are quite short. For example, the NFS version 3 233 GETATTR request, is only 56 bytes: 20 bytes of RPC header, plus a 234 32 byte file handle argument and 4 bytes of length. The reply to 235 this common request is about 100 bytes. 237 There is no benefit in transferring such small messages with an 238 RDMA Read or Write operation. The overhead in transferring 239 steering tags and memory addresses is justified only by large 240 transfers. The critical message size that justifies RDMA transfer 241 will vary depending on the RDMA implementation and network, but is 242 typically of the order of a few kilobytes. It is appropriate to 243 transfer a short message with an RDMA Send to a pre-posted buffer. 244 The RPC over RDMA header with the short message (call or reply) 245 immediately following is transferred using a single RDMA Send 246 operation. 248 Short RPC messages over an RDMA transport: 250 RPC Client RPC Server 251 | RPC Call | 252 Send | ------------------------------> | 253 | | 254 | RPC Reply | 255 | <------------------------------ | Send 257 3.2. Data Chunks 259 Some protocols, like NFS, have RPC procedures that can transfer 260 very large "chunks" of data in the RPC call or reply and would 261 cause the maximum send size to be exceeded if one tried to transfer 262 them as part of the RDMA Send. These large chunks typically range 263 from a kilobyte to a megabyte or more. An RDMA transport can 264 transfer large chunks of data more efficiently via the direct 265 placement of an RDMA Read or RDMA Write operation. Using direct 266 placement instead of inline transfer not only avoids expensive data 267 copies, but provides correct data alignment at the destination. 269 3.3. Flow Control 271 It is critical to provide RDMA Send flow control for an RDMA 272 connection. RDMA receive operations will fail if a pre-posted 273 receive buffer is not available to accept an incoming RDMA Send, 274 and repeated occurrences of such errors can be fatal to the 275 connection. This is a departure from conventional TCP/IP 276 networking where buffers are allocated dynamically on an as-needed 277 basis, and where pre-posting is not required. 279 It is not practical to provide for fixed credit limits at the RPC 280 server. Fixed limits scale poorly, since posted buffers are 281 dedicated to the associated connection until consumed by receive 282 operations. Additionally for protocol correctness, the RPC server 283 must always be able to reply to client requests, whether or not new 284 buffers have been posted to accept future receives. (Note that the 285 RPC server may in fact be a client at some other layer. For 286 example, NFSv4 callbacks are processed by the NFSv4 client, acting 287 as an RPC server. The credit discussions apply equally in either 288 case.) 
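A practical consequence of the above is that each peer must keep enough receive buffers posted for the Sends it is prepared to accept. A minimal C sketch of such pre-posting, using a hypothetical provider call, might look as follows.

   /* Hypothetical pre-posting of receive buffers; an incoming RDMA
    * Send fails if no posted buffer is available to accept it. */
   #include <stdlib.h>
   #include <stddef.h>

   /* Hypothetical provider call that posts one receive buffer. */
   extern int post_receive(void *conn, void *buf, size_t len);

   /* Post one receive buffer per message the peer may send; posted
    * buffers remain dedicated to this connection until consumed by
    * incoming Sends, which is why fixed limits scale poorly. */
   int post_receives(void *conn, unsigned int count, size_t bufsize)
   {
       for (unsigned int i = 0; i < count; i++) {
           void *buf = malloc(bufsize);
           if (buf == NULL || post_receive(conn, buf, bufsize) != 0)
               return -1;
       }
       return 0;
   }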
290 Flow control for RDMA Send operations is implemented as a simple 291 request/grant protocol in the RPC over RDMA header associated with 292 each RPC message. The RPC over RDMA header for RPC call messages 293 contains a requested credit value for the RPC server, which MAY be 294 dynamically adjusted by the caller to match its expected needs. 295 The RPC over RDMA header for the RPC reply messages provides the 296 granted result, which MAY have any value except it MUST NOT be zero 297 when no in-progress operations are present at the server, since 298 such a value would result in deadlock. The value MAY be adjusted 299 up or down at each opportunity to match the server's needs or 300 policies. 302 The RPC client MUST NOT send unacknowledged requests in excess of 303 this granted RPC server credit limit. If the limit is exceeded, 304 the RDMA layer may signal an error, possibly terminating the 305 connection. Even if an error does not occur, it is OPTIONAL that 306 the server handle the excess request(s), and it MAY return an RPC 307 error to the client. Also note that the never-zero requirement 308 implies that an RPC server MUST always provide at least one credit 309 to each connected RPC client from which no requests are 310 outstanding. The client would deadlock otherwise, unable to send 311 another request. 313 While RPC calls complete in any order, the current flow control 314 limit at the RPC server is known to the RPC client from the Send 315 ordering properties. It is always the most recent server-granted 316 credit value minus the number of requests in flight. 318 Certain RDMA implementations may impose additional flow control 319 restrictions, such as limits on RDMA Read operations in progress at 320 the responder. Because these operations are outside the scope of 321 this protocol, they are not addressed and SHOULD be provided for by 322 other layers. For example, a simple upper layer RPC consumer might 323 perform single-issue RDMA Read requests, while a more 324 sophisticated, multithreaded RPC consumer might implement its own 325 FIFO queue of such operations. For further discussion of possible 326 protocol implementations capable of negotiating these values, see 327 section 6 "Connection Configuration Protocol" of this draft, or 328 [NFSv4.1]. 330 3.4. XDR Encoding with Chunks 332 The data comprising an RPC call or reply message is marshaled or 333 serialized into a contiguous stream by an XDR routine. XDR data 334 types such as integers, strings, arrays and linked lists are 335 commonly implemented over two very simple functions that encode 336 either an XDR data unit (32 bits) or an array of bytes. 338 Normally, the separate data items in an RPC call or reply are 339 encoded as a contiguous sequence of bytes for network transmission 340 over UDP or TCP. However, in the case of an RDMA transport, local 341 routines such as XDR encode can determine that (for instance) an 342 opaque byte array is large enough to be more efficiently moved via 343 an RDMA data transfer operation like RDMA Read or RDMA Write. 345 Semantically speaking, the protocol has no restriction regarding 346 data types which may or may not be represented by a read or write 347 chunk. In practice however, efficiency considerations lead to the 348 conclusion that certain data types are not generally "chunkable". 349 Typically, only those opaque and aggregate data types that may 350 attain substantial size are considered to be eligible. With 351 today's hardware this size may be a kilobyte or more. 
However any 352 object MAY be chosen for chunking in any given message. 354 The eligibility of XDR data items to be candidates for being moved 355 as data chunks (as opposed to being marshaled inline) is not 356 specified by the RPC over RDMA protocol. Chunk eligibility 357 criteria MUST be determined by each upper layer in order to provide 358 for an interoperable specification. One such example with 359 rationale, for the NFS protocol family, is provided in [NFSDDP]. 361 The interface by which an upper layer implementation communicates 362 the eligibility of a data item locally to RPC for chunking is out 363 of scope for this specification. In many implementations, it is 364 possible to implement a transparent RPC chunking facility. 365 However, such implementations may lead to inefficiencies, either 366 because they require the RPC layer to perform expensive 367 registration and deregistration of memory "on the fly", or they may 368 require using RDMA chunks in reply messages, along with the 369 resulting additional handshaking with the RPC over RDMA peer. 370 However, these issues are internal and generally confined to the 371 local interface between RPC and its upper layers, one in which 372 implementations are free to innovate. The only requirement is that 373 the resulting RPC RDMA protocol sent to the peer is valid for the 374 upper layer. See for example [NFSDDP]. 376 When sending any message (request or reply) that contains an 377 eligible large data chunk, the XDR encoding routine avoids moving 378 the data into the XDR stream. Instead, it does not encode the data 379 portion, but records the address and size of each chunk in a 380 separate "read chunk list" encoded within RPC RDMA transport- 381 specific headers. Such chunks will be transferred via RDMA Read 382 operations initiated by the receiver. 384 When the read chunks are to be moved via RDMA, the memory for each 385 chunk is registered. This registration may take place within XDR 386 itself, providing for full transparency to upper layers, or it may 387 be performed by any other specific local implementation. 389 Additionally, when making an RPC call that can result in bulk data 390 transferred in the reply, write chunks MAY be provided to accept 391 the data directly via RDMA Write. These write chunks will 392 therefore be pre-filled by the RPC server prior to responding, and 393 XDR decode of the data at the client will not be required. These 394 chunks undergo a similar registration and advertisement via "write 395 chunk lists" built as a part of XDR encoding. 397 Some RPC client implementations are not able to determine where an 398 RPC call's results reside during the "encode" phase. This makes it 399 difficult or impossible for the RPC client layer to encode the 400 write chunk list at the time of building the request. In this 401 case, it is difficult for the RPC implementation to provide 402 transparency to the RPC consumer, which may require recoding to 403 provide result information at this earlier stage. 405 Therefore if the RPC client does not make a write chunk list 406 available to receive the result, then the RPC server MAY return 407 data inline in the reply, or if the upper layer specification 408 permits, it MAY be returned via a read chunk list. 
It is NOT 409 RECOMMENDED that upper layer RPC client protocol specifications 410 omit write chunk lists for eligible replies, due to the lower 411 performance of the additional handshaking to perform data transfer, 412 and the requirement that the RPC server must expose (and preserve) 413 the reply data for a period of time. In the absence of a server- 414 provided read chunk list in the reply, if the encoded reply 415 overflows the posted receive buffer, the RPC will fail with an RDMA 416 transport error. 418 When any data within a message is provided via either read or write 419 chunks, the chunk itself refers only to the data portion of the XDR 420 stream element. In particular, for counted fields (e.g., a "<>" 421 encoding) the byte count which is encoded as part of the field 422 remains in the XDR stream, and is also encoded in the chunk list. 423 The data portion is however elided from the encoded XDR stream, and 424 is transferred as part of chunk list processing. This is important 425 to maintain upper layer implementation compatibility - both the 426 count and the data must be transferred as part of the logical XDR 427 stream. While the chunk list processing results in the data being 428 available to the upper layer peer for XDR decoding, the length 429 present in the chunk list entries is not. Any byte count in the 430 XDR stream MUST match the sum of the byte counts present in the 431 corresponding read or write chunk list. If they do not agree, an 432 RPC protocol encoding error results. 434 The following items are contained in a chunk list entry. 436 Handle 437 Steering tag or handle obtained when the chunk memory is 438 registered for RDMA. 440 Length 441 The length of the chunk in bytes. 443 Offset 444 The offset or beginning memory address of the chunk. In order 445 to support the widest array of RDMA implementations, as well 446 as the most general steering tag scheme, this field is 447 unconditionally included in each chunk list entry. 449 While zero-based offset schemes are available in many RDMA 450 implementations, their use by RPC requires individual 451 registration of each read or write chunk. On many such 452 implementations this can be a significant overhead. By 453 providing an offset in each chunk, many pre-registration or 454 region-based registrations can be readily supported, and by 455 using a single, universal chunk representation, the RPC RDMA 456 protocol implementation is simplified to its most general 457 form. 459 Position 460 For data which is to be encoded, the position in the XDR 461 stream where the chunk would normally reside. Note that the 462 chunk therefore inserts its data into the XDR stream at this 463 position, but its transfer is no longer "inline". Also note 464 therefore that all chunks belonging to a single RPC argument 465 or result will have the same position. For data which is to 466 be decoded, no position is used. 468 When XDR marshaling is complete, the chunk list is XDR encoded, 469 then sent to the receiver prepended to the RPC message. Any source 470 data for a read chunk, or the destination of a write chunk, remain 471 behind in the sender's registered memory and their actual payload 472 is not marshaled into the request or reply. 474 +----------------+----------------+------------- 475 | RPC over RDMA | | 476 | header w/ | RPC Header | Non-chunk args/results 477 | chunks | | 478 +----------------+----------------+------------- 480 Read chunk lists and write chunk lists are structured somewhat 481 differently. 
This is due to the different usage - read chunks are 482 decoded and indexed by their argument's or result's position in the 483 XDR data stream; their size is always known. Write chunks on the 484 other hand are used only for results, and have neither a 485 preassigned offset in the XDR stream, nor a size until the results 486 are produced, since the buffers may be only partially filled, or 487 may not be used for results at all. Their presence in the XDR 488 stream is therefore not known until the reply is processed. The 489 mapping of Write chunks onto designated NFS procedures and their 490 results is described in [NFSDDP]. 492 Therefore, read chunks are encoded into a read chunk list as a 493 single array, with each entry tagged by its (known) size and its 494 argument's or result's position in the XDR stream. Write chunks 495 are encoded as a list of arrays of RDMA buffers, with each list 496 element (an array) providing buffers for a separate result. 497 Individual write chunk list elements MAY thereby result in being 498 partially or fully filled, or in fact not being filled at all. 499 Unused write chunks, or unused bytes in write chunk buffer lists, 500 are not returned as results, and their memory is returned to the 501 upper layer as part of RPC completion. However, the RPC layer MUST 502 NOT assume that the buffers have not been modified. 504 3.5. XDR Decoding with Read Chunks 506 The XDR decode process moves data from an XDR stream into a data 507 structure provided by the RPC client or server application. Where 508 elements of the destination data structure are buffers or strings, 509 the RPC application can either pre-allocate storage to receive the 510 data, or leave the string or buffer fields null and allow the XDR 511 decode stage of RPC processing to automatically allocate storage of 512 sufficient size. 514 When decoding a message from an RDMA transport, the receiver first 515 XDR decodes the chunk lists from the RPC over RDMA header, then 516 proceeds to decode the body of the RPC message (arguments or 517 results). Whenever the XDR offset in the decode stream matches 518 that of a chunk in the read chunk list, the XDR routine initiates 519 an RDMA Read to bring over the chunk data into locally registered 520 memory for the destination buffer. 522 When processing an RPC request, the RPC receiver (RPC server) 523 acknowledges its completion of use of the source buffers by simply 524 replying to the RPC sender (client), and the peer may then free all 525 source buffers advertised by the request. 527 When processing an RPC reply, after completing such a transfer the 528 RPC receiver (client) MUST issue an RDMA_DONE message (described in 529 Section 3.8) to notify the peer (server) that the source buffers 530 can be freed. 532 The read chunk list is constructed and used entirely within the 533 RPC/XDR layer. Other than specifying the minimum chunk size, the 534 management of the read chunk list is automatic and transparent to 535 an RPC application. 537 3.6. XDR Decoding with Write Chunks 539 When a "write chunk list" is provided for the results of the RPC 540 call, the RPC server MUST provide any corresponding data via RDMA 541 Write to the memory referenced in the chunk list entries. The RPC 542 reply conveys this by returning the write chunk list to the client 543 with the lengths rewritten to match the actual transfer. The XDR 544 "decode" of the reply therefore performs no local data transfer but 545 merely returns the length obtained from the reply. 
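To illustrate, a client-side sketch of this step follows; the structure and function names are hypothetical, and the segment fields mirror the chunk list entry items of Section 3.4, whose formal XDR definition appears in Section 4.3.

   /* Hypothetical client-side "decode" of a result returned in a
    * write chunk: no data movement occurs here, only recovery of the
    * length the server wrote back into the chunk list entry. */
   #include <stdint.h>
   #include <stddef.h>

   struct chunk_segment {
       uint32_t handle;    /* steering tag                  */
       uint32_t length;    /* bytes actually written        */
       uint64_t offset;    /* address or offset of the data */
   };

   struct write_chunk {
       size_t                nsegments;
       struct chunk_segment *segments;
   };

   /* The decoded result length is the sum of the rewritten segment
    * lengths in the write chunk entry used for this result. */
   size_t decoded_result_length(const struct write_chunk *wc)
   {
       size_t total = 0;
       for (size_t i = 0; i < wc->nsegments; i++)
           total += wc->segments[i].length;
       return total;
   }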
547 Each decoded result consumes one entry in the write chunk list, 548 which in turn consists of an array of RDMA segments. The length is 549 therefore the sum of all returned lengths in all segments 550 comprising the corresponding list entry. As each list entry is 551 "decoded", the entire entry is consumed. 553 The write chunk list is constructed and used by the RPC 554 application. The RPC/XDR layer simply conveys the list between 555 client and server and initiates the RDMA Writes back to the client. 556 The mapping of write chunk list entries to procedure arguments MUST 557 be determined for each protocol. An example of a mapping is 558 described in [NFSDDP]. 560 3.7. XDR Roundup and Chunks 562 The XDR protocol requires 4-byte alignment of each new encoded 563 element in any XDR stream. This requirement is for efficiency and 564 ease of decode/unmarshaling at the receiver - if the XDR stream 565 buffer begins on a native machine boundary, then the XDR elements 566 will lie on similarly predictable offsets in memory. 568 Within XDR, when non-4-byte encodes (such as an odd-length string 569 or bulk data) are marshaled, their length is encoded literally, 570 while their data is padded to begin the next element at a 4-byte 571 boundary in the XDR stream. For TCP or RDMA inline encoding, this 572 minimal overhead is required because the transport-specific framing 573 relies on the fact that the relative offset of the elements in the 574 XDR stream from the start of the message determines the XDR 575 position during decode. 577 On the other hand, RPC/RDMA Read chunks carry the XDR position of 578 each chunked element and length of the Chunk segment, and can be 579 placed by the receiver exactly where they belong in the receiver's 580 memory without regard to the alignment of their position in the XDR 581 stream. Since any rounded-up data is not actually part of the 582 upper layer's message, the receiver will not reference it, and 583 there is no reason to set it to any particular value in the 584 receiver's memory. 586 When roundup is present at the end of a sequence of chunks, the 587 length of the sequence will terminate it at a non-4-byte XDR 588 position. When the receiver proceeds to decode the remaining part 589 of the XDR stream, it inspects the XDR position indicated by the 590 next chunk. Because this position will not match (else roundup 591 would not have occurred), the receiver decoding will fall back to 592 inspecting the remaining inline portion. If in turn, no data 593 remains to be decoded from the inline portion, then the receiver 594 MUST conclude that roundup is present, and therefore advances the 595 XDR decode position to that indicated by the next chunk (if any). 596 In this way, roundup is passed without ever actually transferring 597 additional XDR bytes. 599 Some protocol operations over RPC/RDMA, for instance NFS writes of 600 data encountered at the end of a file or in direct i/o situations, 601 commonly yield these roundups within RDMA Read Chunks. Because any 602 roundup bytes are not actually present in the data buffers being 603 written, memory for these bytes would come from noncontiguous 604 buffers, either as an additional memory registration segment, or as 605 an additional Chunk. The overhead of these operations can be 606 significant to both the sender to marshal them, and even higher to 607 the receiver which to transfer them. Senders SHOULD therefore 608 avoid encoding individual RDMA Read Chunks for roundup whenever 609 possible. 
It is acceptable, but not necessary, to include roundup 610 data in an existing RDMA Read Chunk, but only if it is already 611 present in the XDR stream to carry upper layer data. 613 Note that there is no exposure of additional data at the sender due 614 to eliding roundup data from the XDR stream, since any additional 615 sender buffers are never exposed to the peer. The data is 616 literally not there to be transferred. 618 For RDMA Write Chunks, a simpler encoding method applies. Again, 619 roundup bytes are not transferred, instead the chunk length sent to 620 the receiver in the reply is simply increased to include any 621 roundup. Because of the requirement that the RDMA Write chunks are 622 filled sequentially without gaps, this situation can only occur on 623 the final chunk receiving data. Therefore there is no opportunity 624 for roundup data to insert misalignment or positional gaps into the 625 XDR stream. 627 3.8. RPC Call and Reply 629 The RDMA transport for RPC provides three methods of moving data 630 between RPC client and server: 632 Inline 633 Data are moved between RPC client and server within an RDMA 634 Send. 636 RDMA Read 637 Data are moved between RPC client and server via an RDMA Read 638 operation via steering tag, address and offset obtained from a 639 read chunk list. 641 RDMA Write 642 Result data is moved from RPC server to client via an RDMA 643 Write operation via steering tag, address and offset obtained 644 from a write chunk list or reply chunk in the client's RPC 645 call message. 647 These methods of data movement may occur in combinations within a 648 single RPC. For instance, an RPC call may contain some inline data 649 along with some large chunks to be transferred via RDMA Read to the 650 server. The reply to that call may have some result chunks that 651 the server RDMA Writes back to the client. The following protocol 652 interactions illustrate RPC calls that use these methods to move 653 RPC message data: 655 An RPC with write chunks in the call message: 657 RPC Client RPC Server 658 | RPC Call + Write Chunk list | 659 Send | ------------------------------> | 660 | | 661 | Chunk 1 | 662 | <------------------------------ | Write 663 | : | 664 | Chunk n | 665 | <------------------------------ | Write 666 | | 667 | RPC Reply | 668 | <------------------------------ | Send 670 In the presence of write chunks, RDMA ordering provides the 671 guarantee that all data in the RDMA Write operations has been 672 placed in memory prior to the client's RPC reply processing. 
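As a small worked example of the roundup rule, the C fragment below computes the number of pad bytes XDR would normally require after an odd-length item; these are exactly the bytes a sender can avoid transferring when the item is moved as a chunk. The helper name is illustrative only.

   #include <stddef.h>

   /* XDR requires each element to begin on a 4-byte boundary, so an
    * item of length len is normally followed by this many pad bytes. */
   static size_t xdr_pad_bytes(size_t len)
   {
       return (4 - (len & 3)) & 3;
   }

   /* Example: a 5-byte opaque item.  Encoded inline it occupies a
    * 4-byte length, 5 data bytes and xdr_pad_bytes(5) == 3 pad bytes.
    * Moved as a read chunk, only the 5 data bytes are transferred;
    * the receiver, finding nothing further to decode inline and a
    * non-matching position on the next chunk, simply advances its
    * XDR position past the roundup as described above. */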
674 An RPC with read chunks in the call message: 676 RPC Client RPC Server 677 | RPC Call + Read Chunk list | 678 Send | ------------------------------> | 679 | | 680 | Chunk 1 | 681 | +------------------------------ | Read 682 | v-----------------------------> | 683 | : | 684 | Chunk n | 685 | +------------------------------ | Read 686 | v-----------------------------> | 687 | | 688 | RPC Reply | 689 | <------------------------------ | Send 691 An RPC with read chunks in the reply message: 693 RPC Client RPC Server 694 | RPC Call | 695 Send | ------------------------------> | 696 | | 697 | RPC Reply + Read Chunk list | 698 | <------------------------------ | Send 699 | | 700 | Chunk 1 | 701 Read | ------------------------------+ | 702 | <-----------------------------v | 703 | : | 704 | Chunk n | 705 Read | ------------------------------+ | 706 | <-----------------------------v | 707 | | 708 | Done | 709 Send | ------------------------------> | 711 The final Done message allows the RPC client to signal the server 712 that it has received the chunks, so the server can de-register and 713 free the memory holding the chunks. A Done completion is not 714 necessary for an RPC call, since the RPC reply Send is itself a 715 receive completion notification. In the event that the client 716 fails to return the Done message within some timeout period, the 717 server MAY conclude that a protocol violation has occurred and 718 close the RPC connection, or it MAY proceed with a de-register and 719 free its chunk buffers. This may result in a fatal RDMA error if 720 the client later attempts to perform an RDMA Read operation, which 721 amounts to the same thing. 723 The use of read chunks in RPC reply messages is much less efficient 724 than providing write chunks in the originating RPC calls, due to 725 the additional message exchanges, the need for the RPC server to 726 advertise buffers to the peer, the necessity of the server 727 maintaining a timer for the purpose of recovery from misbehaving 728 clients, and the need for additional memory registration. Their 729 use is NOT RECOMMENDED by upper layers where efficiency is a 730 primary concern. [NFSDDP] However, they MAY be employed by upper 731 layer protocol bindings which are primarily concerned with 732 transparency, since they can frequently be implemented completely 733 within the RPC lower layers. 735 It is important to note that the Done message consumes a credit at 736 the RPC server. The RPC server SHOULD provide sufficient credits 737 to the client to allow the Done message to be sent without deadlock 738 (driving the outstanding credit count to zero). The RPC client 739 MUST account for its required Done messages to the server in its 740 accounting of available credits, and the server SHOULD replenish 741 any credit consumed by its use of such exchanges at its earliest 742 opportunity. 744 Finally, it is possible to conceive of RPC exchanges that involve 745 any or all combinations of write chunks in the RPC call, read 746 chunks in the RPC call, and read chunks in the RPC reply. Support 747 for such exchanges is straightforward from a protocol perspective, 748 but in practice such exchanges would be quite rare, limited to 749 upper layer protocol exchanges which transferred bulk data in both 750 the call and corresponding reply. 752 3.9. Padding 754 Alignment of specific opaque data enables certain scatter/gather 755 optimizations. 
Padding leverages the useful property that RDMA 756 transfers preserve alignment of data, even when they are placed 757 into pre-posted receive buffers by Sends. 759 Many servers can make good use of such padding. Padding allows the 760 chaining of RDMA receive buffers such that any data transferred by 761 RDMA on behalf of RPC requests will be placed into appropriately 762 aligned buffers on the system that receives the transfer. In this 763 way, the need for servers to perform RDMA Read to satisfy all but 764 the largest client writes is obviated. 766 The effect of padding is demonstrated below showing prior bytes on 767 an XDR stream (XXX) followed by an opaque field consisting of four 768 length bytes (LLLL) followed by data bytes (DDDD). The receiver of 769 the RDMA Send has posted two chained receive buffers. Without 770 padding, the opaque data is split across the two buffers. With the 771 addition of padding bytes ("ppp" in the figure below) prior to the 772 first data byte, the data can be forced to align correctly in the 773 second buffer. 775 Buffer 1 Buffer 2 776 Unpadded -------------- -------------- 778 XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD 780 Padded 782 XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD 784 Padding is implemented completely within the RDMA transport 785 encoding, flagged with a specific message type. Where padding is 786 applied, two values are passed to the peer: an "rdma_align" which 787 is the padding value used, and "rdma_thresh", which is the opaque 788 data size at or above which padding is applied. For instance, if 789 the server is using chained 4 KB receive buffers, then up to (4 KB 790 - 1) padding bytes could be used to achieve alignment of the data. 791 The XDR routine at the peer MUST consult these values when decoding 792 opaque values. Where the decoded length exceeds the rdma_thresh, 793 the XDR decode MUST skip over the appropriate padding as indicated 794 by rdma_align and the current XDR stream position. 796 4. RPC RDMA Message Layout 798 RPC call and reply messages are conveyed across an RDMA transport 799 with a prepended RPC over RDMA header. The RPC over RDMA header 800 includes data for RDMA flow control credits, padding parameters and 801 lists of addresses that provide direct data placement via RDMA Read 802 and Write operations. The layout of the RPC message itself is 803 unchanged from that described in [RFC1831bis] except for the 804 possible exclusion of large data chunks that will be moved by RDMA 805 Read or Write operations. If the RPC message (along with the RPC 806 over RDMA header) is too long for the posted receive buffer (even 807 after any large chunks are removed), then the entire RPC message 808 MAY be moved separately as a chunk, leaving just the RPC over RDMA 809 header in the RDMA Send. 811 4.1. RPC over RDMA Header 813 The RPC over RDMA header begins with four 32-bit fields that are 814 always present and which control the RDMA interaction including 815 RDMA-specific flow control. These are then followed by a number of 816 items such as chunk lists and padding which MAY or MUST NOT be 817 present depending on the type of transmission. The four fields 818 which are always present are: 820 1. Transaction ID (XID). 821 The XID generated for the RPC call and reply. Having the XID 822 at the beginning of the message makes it easy to establish the 823 message context. This XID MUST be the same as the XID in the 824 RPC header. 
The receiver MAY perform its processing based
825  solely on the XID in the RPC over RDMA header, and thereby
826  ignore the XID in the RPC header, if it so chooses.

828  2. Version number.
829     This version of the RPC RDMA message protocol is 1.  The
830     version number MUST be increased by one whenever the format of
831     the RPC RDMA messages is changed.

833  3. Flow control credit value.
834     When sent in an RPC call message, the requested value is
835     provided.  When sent in an RPC reply message, the granted
836     value is returned.  RPC calls SHOULD NOT be sent in excess of
837     the currently granted limit.

839  4. Message type.

841     o  RDMA_MSG = 0 indicates that chunk lists and RPC message
842        follow.

844     o  RDMA_NOMSG = 1 indicates that after the chunk lists there
845        is no RPC message.  In this case, the chunk lists provide
846        information to allow the message proper to be transferred
847        using RDMA Read or Write and thus is not appended to the
848        RPC over RDMA header.

850     o  RDMA_MSGP = 2 indicates that a chunk list and RPC message
851        with some padding follow.

853     o  RDMA_DONE = 3 indicates that the message signals the
854        completion of a chunk transfer via RDMA Read.

856     o  RDMA_ERROR = 4 is used to signal any detected error(s) in
857        the RPC RDMA chunk encoding.

859  Because the version number is encoded as part of this header, and
860  the RDMA_ERROR message type is used to indicate errors, these first
861  four fields and the start of the following message body MUST always
862  remain aligned at these fixed offsets for all versions of the RPC
863  over RDMA header.

865  For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
866  chunk lists follow.  If the Read chunk list is null (a 32 bit word
867  of zeros), then there are no chunks to be transferred separately
868  and the RPC message follows in its entirety.  If non-null, then
869  it's the beginning of an XDR encoded sequence of Read chunk list
870  entries.  If the Write chunk list is non-null, then an XDR encoded
871  sequence of Write chunk entries follows.

873  If the message type is RDMA_MSGP, then two additional fields that
874  specify the padding alignment and threshold are inserted prior to
875  the Read and Write chunk lists.

877  A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
878  the RPC call or RPC reply message body, beginning with the XID.
879  The XID in the RDMA_MSG or RDMA_MSGP header MUST match this.

881  +--------+---------+---------+-----------+-------------+----------
882  |        |         |         |  Message  |    NULLs    | RPC Call
883  |  XID   | Version | Credits |   Type    |     or      |   or
884  |        |         |         |           | Chunk Lists | Reply Msg
885  +--------+---------+---------+-----------+-------------+----------

887  Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
888  RPC message follows.  As an implementation hint: a gather operation
889  on the Send of the RDMA RPC message can be used to marshal the
890  initial header, the chunk list, and the RPC message itself.

892  4.2. RPC over RDMA header errors

894  When a peer receives an RPC RDMA message, it MUST perform the
895  following basic validity checks on the header and chunk contents.
896  If such errors are detected in the request, an RDMA_ERROR reply
897  MUST be generated.

899  Two types of errors are defined: version mismatch and invalid chunk
900  format.
When the peer detects an RPC over RDMA header version 901 which it does not support (currently this draft defines only 902 version 1), it replies with an error code of ERR_VERS, and provides 903 the low and high inclusive version numbers it does, in fact, 904 support. The version number in this reply MUST be any value 905 otherwise valid at the receiver. When other decoding errors are 906 detected in the header or chunks, either an RPC decode error MAY be 907 returned, or the ROC/RDMA error code ERR_CHUNK MUST be returned. 909 4.3. XDR Language Description 911 Here is the message layout in XDR language. 913 struct xdr_rdma_segment { 914 uint32 handle; /* Registered memory handle */ 915 uint32 length; /* Length of the chunk in bytes */ 916 uint64 offset; /* Chunk virtual address or offset */ 917 }; 919 struct xdr_read_chunk { 920 uint32 position; /* Position in XDR stream */ 921 struct xdr_rdma_segment target; 922 }; 924 struct xdr_read_list { 925 struct xdr_read_chunk entry; 926 struct xdr_read_list *next; 927 }; 929 struct xdr_write_chunk { 930 struct xdr_rdma_segment target<>; 931 }; 933 struct xdr_write_list { 934 struct xdr_write_chunk entry; 935 struct xdr_write_list *next; 936 }; 938 struct rdma_msg { 939 uint32 rdma_xid; /* Mirrors the RPC header xid */ 940 uint32 rdma_vers; /* Version of this protocol */ 941 uint32 rdma_credit; /* Buffers requested/granted */ 942 rdma_body rdma_body; 943 }; 945 enum rdma_proc { 946 RDMA_MSG=0, /* An RPC call or reply msg */ 947 RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */ 948 RDMA_MSGP=2, /* An RPC call or reply msg with padding */ 949 RDMA_DONE=3, /* Client signals reply completion */ 950 RDMA_ERROR=4 /* An RPC RDMA encoding error */ 951 }; 952 union rdma_body switch (rdma_proc proc) { 953 case RDMA_MSG: 954 rpc_rdma_header rdma_msg; 955 case RDMA_NOMSG: 956 rpc_rdma_header_nomsg rdma_nomsg; 957 case RDMA_MSGP: 958 rpc_rdma_header_padded rdma_msgp; 959 case RDMA_DONE: 960 void; 961 case RDMA_ERROR: 962 rpc_rdma_error rdma_error; 963 }; 965 struct rpc_rdma_header { 966 struct xdr_read_list *rdma_reads; 967 struct xdr_write_list *rdma_writes; 968 struct xdr_write_chunk *rdma_reply; 969 /* rpc body follows */ 970 }; 972 struct rpc_rdma_header_nomsg { 973 struct xdr_read_list *rdma_reads; 974 struct xdr_write_list *rdma_writes; 975 struct xdr_write_chunk *rdma_reply; 976 }; 978 struct rpc_rdma_header_padded { 979 uint32 rdma_align; /* Padding alignment */ 980 uint32 rdma_thresh; /* Padding threshold */ 981 struct xdr_read_list *rdma_reads; 982 struct xdr_write_list *rdma_writes; 983 struct xdr_write_chunk *rdma_reply; 984 /* rpc body follows */ 985 }; 986 enum rpc_rdma_errcode { 987 ERR_VERS = 1, 988 ERR_CHUNK = 2 989 }; 991 union rpc_rdma_error switch (rpc_rdma_errcode err) { 992 case ERR_VERS: 993 uint32 rdma_vers_low; 994 uint32 rdma_vers_high; 995 case ERR_CHUNK: 996 void; 997 default: 998 uint32 rdma_extra[8]; 999 }; 1001 5. Long Messages 1003 The receiver of RDMA Send messages is required by RDMA to have 1004 previously posted one or more adequately sized buffers. The RPC 1005 client can inform the server of the maximum size of its RDMA Send 1006 messages via the Connection Configuration Protocol described later 1007 in this document. 1009 Since RPC messages are frequently small, memory savings can be 1010 achieved by posting small buffers. Even large messages like NFS 1011 READ or WRITE will be quite small once the chunks are removed from 1012 the message. 
However, there may be large messages that would 1013 demand a very large buffer be posted, where the contents of the 1014 buffer may not be a chunkable XDR element. A good example is an 1015 NFS READDIR reply which may contain a large number of small 1016 filename strings. Also, the NFS version 4 protocol [RFC3530] 1017 features COMPOUND request and reply messages of unbounded length. 1019 Ideally, each upper layer will negotiate these limits. However, it 1020 is frequently necessary to provide a transparent solution. 1022 5.1. Message as an RDMA Read Chunk 1024 One relatively simple method is to have the client identify any RPC 1025 message that exceeds the RPC server's posted buffer size and move 1026 it separately as a chunk, i.e., reference it as the first entry in 1027 the read chunk list with an XDR position of zero. 1029 Normal Message 1031 +--------+---------+---------+------------+-------------+---------- 1032 | | | | | | RPC Call 1033 | XID | Version | Credits | RDMA_MSG | Chunk Lists | or 1034 | | | | | | Reply Msg 1035 +--------+---------+---------+------------+-------------+---------- 1037 Long Message 1039 +--------+---------+---------+------------+-------------+ 1040 | | | | | | 1041 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1042 | | | | | | 1043 +--------+---------+---------+------------+-------------+ 1044 | 1045 | +---------- 1046 | | Long RPC Call 1047 +->| or 1048 | Reply Message 1049 +---------- 1051 If the receiver gets an RPC over RDMA header with a message type of 1052 RDMA_NOMSG and finds an initial read chunk list entry with a zero 1053 XDR position, it allocates a registered buffer and issues an RDMA 1054 Read of the long RPC message into it. The receiver then proceeds 1055 to XDR decode the RPC message as if it had received it inline with 1056 the Send data. Further decoding may issue additional RDMA Reads to 1057 bring over additional chunks. 1059 Although the handling of long messages requires one extra network 1060 turnaround, in practice these messages will be rare if the posted 1061 receive buffers are correctly sized, and of course they will be 1062 non-existent for RDMA-aware upper layers. 1064 A long call RPC with request supplied via RDMA Read 1066 RPC Client RPC Server 1067 | RDMA over RPC Header | 1068 Send | ------------------------------> | 1069 | | 1070 | Long RPC Call Msg | 1071 | +------------------------------ | Read 1072 | v-----------------------------> | 1073 | | 1074 | RDMA over RPC Reply | 1075 | <------------------------------ | Send 1077 An RPC with long reply returned via RDMA Read 1079 RPC Client RPC Server 1080 | RPC Call | 1081 Send | ------------------------------> | 1082 | | 1083 | RDMA over RPC Header | 1084 | <------------------------------ | Send 1085 | | 1086 | Long RPC Reply Msg | 1087 Read | ------------------------------+ | 1088 | <-----------------------------v | 1089 | | 1090 | Done | 1091 Send | ------------------------------> | 1093 It is possible for a single RPC procedure to employ both a long 1094 call for its arguments, and a long reply for its results. However, 1095 such an operation is atypical, as few upper layers define such 1096 exchanges. 1098 5.2. RDMA Write of Long Replies (Reply Chunks) 1100 A superior method of handling long RPC replies is to have the RPC 1101 client post a large buffer into which the server can write a large 1102 RPC reply. 
This has the advantage that an RDMA Write may be
1103  slightly faster in network latency than an RDMA Read, and does not
1104  require the server to wait for the completion as it must for RDMA
1105  Read.  Additionally, for a reply it removes the need for an
1106  RDMA_DONE message if the large reply is returned as a Read chunk.

1108  This protocol supports direct return of a large reply via the
1109  inclusion of an OPTIONAL rdma_reply write chunk after the read
1110  chunk list and the write chunk list.  The client allocates a buffer
1111  sized to receive a large reply and enters its steering tag, address
1112  and length in the rdma_reply write chunk.  If the reply message is
1113  too long to return inline with an RDMA Send (exceeds the size of
1114  the client's posted receive buffer), even with read chunks removed,
1115  then the RPC server performs an RDMA Write of the RPC reply message
1116  into the buffer indicated by the rdma_reply chunk.  If the client
1117  doesn't provide an rdma_reply chunk, or if it's too small, then if
1118  the upper layer specification permits, the message MAY be returned
1119  as a Read chunk.

1121  An RPC with long reply returned via RDMA Write

1123      RPC Client                           RPC Server
1124          |     RPC Call with rdma_reply       |
1125     Send | ------------------------------>    |
1126          |                                    |
1127          |       Long RPC Reply Msg           |
1128          | <------------------------------    | Write
1129          |                                    |
1130          |       RPC over RDMA Header         |
1131          | <------------------------------    | Send

1133  The use of RDMA Write to return long replies requires that the
1134  client application anticipate a long reply and have some knowledge
1135  of its size so that an adequately sized buffer can be allocated.
1136  This is certainly true of NFS READDIR replies, where the client
1137  already provides an upper bound on the size of the encoded
1138  directory fragment to be returned by the server.

1140  The use of these "reply chunks" is highly efficient and convenient
1141  for both RPC client and server.  Their use is encouraged for
1142  eligible RPC operations such as NFS READDIR, which would otherwise
1143  require extensive chunk management within the results or use of
1144  RDMA Read and a Done message.  [NFSDDP]

1146  6. Connection Configuration Protocol

1148  RDMA Send operations require the receiver to post one or more
1149  buffers at the RDMA connection endpoint, each large enough to
1150  receive the largest Send message.  Buffers are consumed as Send
1151  messages are received.  If a buffer is too small, or if there are
1152  no buffers posted, the RDMA transport MAY return an error and break
1153  the RDMA connection.  The receiver MUST post sufficient, adequately
1154  sized buffers to avoid buffer overrun or capacity errors.

1156  The protocol described above includes only a mechanism for managing
1157  the number of such receive buffers, and no explicit features to
1158  allow the RPC client and server to provision or control buffer
1159  sizing, nor any other session parameters.

1161  In the past, this type of connection management has not been
1162  necessary for RPC.  RPC over UDP or TCP does not have a protocol to
1163  negotiate the link.  The server can get a rough idea of the maximum
1164  size of messages from the server protocol code.  However, a
1165  protocol to negotiate transport features on a more dynamic basis is
1166  desirable.

1168  The Connection Configuration Protocol allows the client to pass its
1169  connection requirements to the server, and allows the server to
1170  inform the client of its connection limits.
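Before turning to that protocol, the server's choice among the reply methods described in Sections 3.8, 5.1 and 5.2 can be summarized in a short C sketch. The names and values are hypothetical; the ordering simply restates the rules above: reply inline when the message fits the client's posted receive buffer, prefer a client-provided rdma_reply chunk for long replies, fall back to a read chunk only where the upper layer permits it, and otherwise fail with a transport error.

   /* Hypothetical server-side reply path selection. */
   #include <stdbool.h>
   #include <stddef.h>

   enum reply_method {
       REPLY_INLINE,       /* RDMA Send into the posted receive buffer */
       REPLY_WRITE_CHUNK,  /* RDMA Write into the rdma_reply chunk     */
       REPLY_READ_CHUNK,   /* expose as a read chunk; needs RDMA_DONE  */
       REPLY_FAIL          /* reply cannot be delivered                */
   };

   enum reply_method
   choose_reply_method(size_t reply_len,          /* encoded reply size */
                       size_t client_recv_size,   /* posted buffer size */
                       size_t reply_chunk_size,   /* 0 if not provided  */
                       bool   upper_layer_allows_read_chunk)
   {
       if (reply_len <= client_recv_size)
           return REPLY_INLINE;
       if (reply_chunk_size >= reply_len)
           return REPLY_WRITE_CHUNK;
       if (upper_layer_allows_read_chunk)
           return REPLY_READ_CHUNK;
       return REPLY_FAIL;    /* reply overflows the posted buffer */
   }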
1146   6.  Connection Configuration Protocol

1148   RDMA Send operations require the receiver to post one or more
1149   buffers at the RDMA connection endpoint, each large enough to
1150   receive the largest Send message.  Buffers are consumed as Send
1151   messages are received.  If a buffer is too small, or if there are
1152   no buffers posted, the RDMA transport MAY return an error and break
1153   the RDMA connection.  The receiver MUST post sufficient, adequately
1154   sized buffers to avoid buffer overrun or capacity errors.

1156   The protocol described above includes only a mechanism for managing
1157   the number of such receive buffers, and no explicit features to
1158   allow the RPC client and server to provision or control buffer
1159   sizing, nor any other session parameters.

1161   In the past, this type of connection management has not been
1162   necessary for RPC.  RPC over UDP or TCP does not have a protocol to
1163   negotiate the link.  The server can get a rough idea of the maximum
1164   size of messages from the server protocol code.  However, a
1165   protocol to negotiate transport features on a more dynamic basis is
1166   desirable.

1168   The Connection Configuration Protocol allows the client to pass its
1169   connection requirements to the server, and allows the server to
1170   inform the client of its connection limits.

1172   Use of the Connection Configuration Protocol by an upper layer is
1173   OPTIONAL.

1175   6.1.  Initial Connection State

1177   This protocol MAY be used for connection setup prior to the use of
1178   another RPC protocol that uses the RDMA transport.  It operates in-
1179   band, i.e., it uses the connection itself to negotiate the
1180   connection parameters.  To provide a basis for connection
1181   negotiation, the connection is assumed to provide a basic level of
1182   interoperability: the ability to exchange at least one RPC message
1183   at a time that is at least 1 KB in size.  The server MAY exceed
1184   this basic level of configuration, but the client MUST NOT assume
1185   more than one, and MUST receive a valid reply from the server
1186   carrying the actual number of available receive messages, prior to
1187   sending its next request.

1189   6.2.  Protocol Description

1191   Version 1 of the Connection Configuration Protocol consists of a
1192   single procedure that allows the client to inform the server of its
1193   connection requirements and the server to return connection
1194   information to the client.

1196   The maxcall_sendsize argument is the maximum size of an RPC call
1197   message that the client MAY send inline in an RDMA Send message to
1198   the server.  The server MAY return a maxcall_sendsize value that is
1199   smaller or larger than the client's request.  The client MUST NOT
1200   send an inline call message larger than what the server will
1201   accept.  The maxcall_sendsize limits only the size of inline RPC
1202   calls.  It does not limit the size of long RPC messages transferred
1203   as an initial chunk in the Read chunk list.

1205   The maxreply_sendsize is the maximum size of an inline RPC message
1206   that the client will accept from the server.

1208   The maxrdmaread is the maximum number of RDMA Reads which may be
1209   active at the peer.  This number correlates to the incoming RDMA
1210   Read count ("IRD") configured into each originating endpoint by the
1211   client or server.  If more than this number of RDMA Read
1212   operations by the connected peer are issued simultaneously,
1213   connection loss or suboptimal flow control may result; therefore,
1214   the value SHOULD be observed at all times.  The peers' values need
1215   not be equal.  If zero, the peer MUST NOT issue requests that
1216   require RDMA Read to satisfy, as no transfer will be possible.

1218   The align value is the alignment recommended by the server for
1219   opaque data values such as strings and counted byte arrays.  The
1220   client MAY use this value to compute the number of prepended pad
1221   bytes when XDR encoding opaque values in the RPC call message.

1223      typedef unsigned int uint32;

1225      struct config_rdma_req {
1226              uint32  maxcall_sendsize;
1227                      /* max size of inline RPC call */
1228              uint32  maxreply_sendsize;
1229                      /* max size of inline RPC reply */
1230              uint32  maxrdmaread;
1231                      /* max active RDMA Reads at client */
1232      };

1234      struct config_rdma_reply {
1235              uint32  maxcall_sendsize;
1236                      /* max call size accepted by server */
1237              uint32  align;
1238                      /* server's receive buffer alignment */
1239              uint32  maxrdmaread;
1240                      /* max active RDMA Reads at server */
1241      };

1243      program CONFIG_RDMA_PROG {
1244              version VERS1 {
1245                      /*
1246                       * Config call/reply
1247                       */
1248                      config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
1249              } = 1;
1250      } = 100417;
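   The pad computation implied by the align value is a one-line
   calculation.  The following non-normative C sketch shows one way a
   client might compute the number of prepended pad bytes for an opaque
   value; the function name and the example figures are illustrative
   only.

      /*
       * Sketch: pad bytes to prepend so that opaque data beginning at
       * xdr_offset within the Send buffer starts on an align-byte
       * boundary in the server's receive buffer (see section 3.9,
       * Padding).  Illustrative only.
       */
      #include <stdint.h>

      static uint32_t rpcrdma_pad_bytes(uint32_t xdr_offset, uint32_t align)
      {
              if (align <= 1)
                      return 0;            /* no alignment preference */
              uint32_t misalign = xdr_offset % align;
              return misalign ? align - misalign : 0;
      }

      /*
       * Example: with align = 4096 and opaque data that would otherwise
       * begin at XDR offset 148, the client prepends 3948 pad bytes so
       * that the data lands page-aligned at the server.
       */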
1252   7.  Memory Registration Overhead

1254   RDMA requires that all data be transferred between registered
1255   memory regions at the source and destination.  All protocol headers
1256   as well as separately transferred data chunks use registered
1257   memory.  Since the cost of registering and de-registering memory
1258   can be a large proportion of the RDMA transaction cost, it is
1259   important to minimize registration activity.  This is easily
1260   achieved within RPC-controlled memory by allocating chunk list data
1261   and RPC headers in a reusable way from pre-registered pools.

1263   The data chunks transferred via RDMA MAY occupy memory that
1264   persists outside the bounds of the RPC transaction.  Hence, the
1265   default behavior of an RPC over RDMA transport is to register and
1266   de-register these chunks on every transaction.  However, this is
1267   not a limitation of the protocol, only of the existing local RPC
1268   API.  The API is easily extended through such functions as
1269   rpc_control(3) to change the default behavior so that the
1270   application can assume responsibility for controlling memory
1271   registration through an RPC-provided registered memory allocator.

1273   8.  Errors and Error Recovery

1275   RPC RDMA protocol errors are described in section 4.  RPC errors
1276   and RPC error recovery are not affected by the protocol, and
1277   proceed as for any RPC error condition.  RDMA transport error
1278   reporting and recovery are outside the scope of this protocol.

1280   It is assumed that the link itself will provide some degree of
1281   error detection and retransmission.  iWARP's MPA layer (when used
1282   over TCP), SCTP, and the Infiniband link layer all provide CRC
1283   protection of the RDMA payload, and CRC-class protection is a
1284   general attribute of such transports.  Additionally, the RPC layer
1285   itself can accept errors from the link level and recover via
1286   retransmission.  RPC recovery can handle complete loss and re-
1287   establishment of the link.

1289   See section 11 for further discussion of the use of RPC-level
1290   integrity schemes to detect errors, and related efficiency issues.

1292   9.  Node Addressing

1294   In setting up a new RDMA connection, the first action by an RPC
1295   client will be to obtain a transport address for the server.  The
1296   mechanism used to obtain this address, and to open an RDMA
1297   connection, is dependent on the type of RDMA transport, and is the
1298   responsibility of each RPC protocol binding and its local
1299   implementation.

1301   10.  RPC Binding

1303   RPC services normally register with a portmap or rpcbind [RFC1833]
1304   service, which associates an RPC program number with a service
1305   address.  (In the case of UDP or TCP, the service address for NFS
1306   is normally port 2049.)  This policy is no different with RDMA
1307   interconnects, although it may require the allocation of port
1308   numbers appropriate to each upper layer binding that uses the RPC
1309   framing defined here.

1311   When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
1312   IP port addressing due to its layering on TCP and/or SCTP, port
1313   mapping is trivial and consists merely of issuing the port in the
1314   connection process.  The NFS/RDMA protocol service address has been
1315   assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
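   As an illustration of issuing the port in the connection process,
   the non-normative C sketch below uses the librdmacm connection
   manager to address the NFS/RDMA service at its IANA-assigned port
   20049.  Protection domain selection, queue pair creation, and the
   subsequent rdma_connect() call are elided; nothing in the sketch is
   specific to this protocol, and the routine name is illustrative.

      /*
       * Sketch: resolving the NFS/RDMA service at port 20049 with the
       * RDMA connection manager (librdmacm).  Queue pair setup and
       * rdma_connect() are left to the caller.
       */
      #include <rdma/rdma_cma.h>
      #include <string.h>

      struct rdma_cm_id *resolve_nfs_rdma(const char *server)
      {
              struct rdma_addrinfo hints, *res = NULL;
              struct rdma_cm_id *id = NULL;

              memset(&hints, 0, sizeof(hints));
              hints.ai_port_space = RDMA_PS_TCP;   /* IP port addressing */

              /* The well-known port alone names the service. */
              if (rdma_getaddrinfo(server, "20049", &hints, &res))
                      return NULL;

              if (rdma_create_ep(&id, res, NULL, NULL))
                      id = NULL;
              rdma_freeaddrinfo(res);
              return id;  /* caller creates a QP on 'id', then connects */
      }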
1317   When mapped atop Infiniband [IB], which uses a GID-based service
1318   endpoint naming scheme, a translation MUST be employed.  One such
1319   translation is defined in the Infiniband Port Addressing Annex
1320   [IBPORT], which is appropriate for translating IP port addressing
1321   to the Infiniband network.  Therefore, in this case, IP port
1322   addressing may be readily employed by the upper layer.

1324   When a mapping standard or convention exists for IP ports on an
1325   RDMA interconnect, there are several possibilities for each upper
1326   layer to consider:

1328      One possibility is to have an upper layer server register its
1329      mapped IP port with the rpcbind service, under the netid (or
1330      netids) defined here.  An RPC/RDMA-aware client can then
1331      resolve its desired service to a mappable port, and proceed to
1332      connect.  This is the most flexible and compatible approach
1333      for those upper layers that are defined to use the rpcbind
1334      service.

1336      A second possibility is to have the server's portmapper
1337      register itself on the RDMA interconnect at a "well known"
1338      service address.  (On UDP or TCP, this corresponds to port
1339      111.)  A client could connect to this service address and use
1340      the portmap protocol to obtain a service address in response
1341      to a program number, e.g., an iWARP port number or an
1342      Infiniband GID.

1344      Alternatively, the client could simply connect to the mapped
1345      well-known port for the service itself, if it is appropriately
1346      defined.  By convention, the NFS/RDMA service, when operating
1347      atop such an Infiniband fabric, will use the same 20049
1348      assignment as for iWARP.

1350   Historically, different RPC protocols have taken different
1351   approaches to their port assignment; therefore, the specific method
1352   is left to each RPC/RDMA-enabled upper layer binding, and not
1353   addressed here.

1355   This specification defines two new "netid" values, to be used for
1356   registration of upper layers atop iWARP [RFC5040, RFC5041] and
1357   (when a suitable port translation service is available) Infiniband
1358   [IB], in section 12, "IANA Considerations."  Additional RDMA-capable
1359   networks MAY define their own netids, or, if they provide a port
1360   translation, MAY share the ones defined here.

1362   11.  Security Considerations

1364   RPC provides its own security via the RPCSEC_GSS framework
1365   [RFC2203].  RPCSEC_GSS can provide message authentication,
1366   integrity checking, and privacy.  This security mechanism will be
1367   unaffected by the RDMA transport.  The data integrity and privacy
1368   features alter the body of the message, presenting it as a single
1369   chunk.  For large messages, the chunk may be large enough to
1370   qualify for RDMA Read transfer.  However, there is much data
1371   movement associated with computation and verification of integrity,
1372   or encryption/decryption, so certain performance advantages may be
1373   lost.

1375   For efficiency, a more appropriate security mechanism for RDMA
1376   links may be link-level protection, such as certain configurations
1377   of IPsec, which may be co-located in the RDMA hardware.  The use of
1378   link-level protection MAY be negotiated through the use of the new
1379   RPCSEC_GSS mechanism defined in [RPCSECGSSV2] in conjunction with
1380   the Channel Binding mechanism [RFC5056] and IPsec Channel
1381   Connection Latching [BTNSLATCH].  Use of such mechanisms is
1382   REQUIRED where integrity and/or privacy is desired, and where
1383   efficiency is required.

1385   An additional consideration is the protection of the integrity and
1386   privacy of local memory by the RDMA transport itself.  The use of
1387   RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
1388   contents, or to memory owned by user processes.  These protections
1389   are provided by the RDMA layer specifications, and specifically
1390   their security models.  It is REQUIRED that any RDMA provider used
1391   for RPC transport be conformant to the requirements of [RFC5042] in
1392   order to satisfy these protections.

1394   Once delivered securely by the RDMA provider, any RDMA-exposed
1395   addresses will contain only RPC payloads in the chunk lists,
1396   transferred under the protection of RPCSEC_GSS integrity and
1397   privacy.  By these means, the data will be protected end-to-end, as
1398   required by the RPC layer security model.

1400   Where upper layer protocols choose to supply results to the
1401   requester via Read chunks, a server resource deficit can arise if
1402   the client does not promptly acknowledge their retrieval via the
1403   RDMA_DONE message.  This can potentially lead to a denial of
1404   service situation, with a single client unfairly (and
1405   unnecessarily) consuming server RDMA resources.  Servers for such
1406   upper layer protocols MUST protect against this situation,
1407   originating from one or many clients.  For example, a time-based
1408   window of buffer availability may be offered; if the client fails
1409   to obtain the data within the window, it simply retries using
1410   ordinary RPC retry semantics.  A more severe method would be
1411   for the server to simply close the client's RDMA connection,
1412   freeing the RDMA resources so that the server can reclaim them.

1414   A fairer and more useful method is provided by the protocol itself.
1415   The server MAY use the rdma_credit value to limit the number of
1416   outstanding requests for each client.  By including the number of
1417   outstanding RDMA_DONE completions in the computation of available
1418   client credits, the server can limit its exposure to each client,
1419   and therefore provide uninterrupted service as its resources
1420   permit.

1422   However, the server must ensure that it does not decrease the
1423   credit count to zero with this method, since the RDMA_DONE message
1424   is not acknowledged.  If the credit count were to drop to zero
1425   solely due to outstanding RDMA_DONE messages, the client would
1426   deadlock, since it would never obtain a new credit with which to
1427   continue.  Therefore, if the server adjusts credits to zero for
1428   outstanding RDMA_DONE, it MUST withhold its reply to at least one
1429   message in order to provide the next credit.  The time-based window
1430   (or any other appropriate method) SHOULD be used by the server to
1431   recover resources in the event that the client never returns.

1433   The "Connection Configuration Protocol", when used, MUST be
1434   protected by an appropriate RPC security flavor, to ensure it is
1435   not attacked in the process of initiating an RPC/RDMA connection.

1437   12.  IANA Considerations

1439   Three new assignments are specified by this document:

1441   - A new set of RPC "netids" for resolving RPC/RDMA services

1443   - Optional service port assignments for upper layer bindings

1444   - An RPC program number assignment for the configuration protocol

1446   These assignments have been established, as described below.

1448   The new RPC transport has been assigned an RPC "netid", which is an
1449   rpcbind [RFC1833] string used to describe the underlying protocol
1450   in order for RPC to select the appropriate transport framing, as
1451   well as the format of the service addresses and ports.
1453 The following "nc_proto" registry strings are defined for this 1454 purpose: 1456 NC_RDMA "rdma" 1457 NC_RDMA6 "rdma6" 1459 These netids MAY be used for any RDMA network satisfying the 1460 requirements of section 2, and able to identify service endpoints 1461 using IP port addressing, possibly through use of a translation 1462 service as described above in section 10, RPC Binding. The "rdma" 1463 netid is to be used when IPv4 addressing is employed by the 1464 underlying transport, and "rdma6" for IPv6 addressing. 1466 The netid assignment policy and registry are defined in [IANA- 1467 NETID]. 1469 As a new RPC transport, this protocol has no effect on RPC program 1470 numbers or existing registered port numbers. However, new port 1471 numbers MAY be registered for use by RPC/RDMA-enabled services, as 1472 appropriate to the new networks over which the services will 1473 operate. 1475 For example, the NFS/RDMA service defined in [NFSDDP] has been 1476 assigned the port 20049, in the IANA registry: 1478 nfsrdma 20049/tcp Network File System (NFS) over RDMA 1479 nfsrdma 20049/udp Network File System (NFS) over RDMA 1480 nfsrdma 20049/sctp Network File System (NFS) over RDMA 1482 The OPTIONAL Connection Configuration protocol described herein 1483 requires an RPC program number assignment. The value "100417" has 1484 been assigned: 1486 rdmaconfig 100417 rpc.rdmaconfig 1488 The RPC program number assignment policy and registry are defined 1489 in [RFC1831bis]. 1491 13. Acknowledgments 1493 The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, 1494 Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve 1495 Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David 1496 Robinson and Mallikarjun Chadalapaka for their contributions to 1497 this document. 1499 14. Normative References 1501 [RFC2119] 1502 S. Bradner, "Key words for use in RFCs to Indicate Requirement 1503 Levels", Best Current Practice, BCP 14, RFC 2119, March 1997. 1505 [RFC1831bis] 1506 R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol 1507 Specification Version 2", Internet Draft Work in Progress, 1508 draft-ietf-nfsv4-rfc1831bis 1510 [RFC4506] 1511 M. Eisler Ed., "XDR: External Data Representation Standard", 1512 Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt 1514 [RFC1833] 1515 R. Srinivasan, "Binding Protocols for ONC RPC Version 2", 1516 Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt 1518 [RFC2203] 1519 M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol 1520 Specification", Standards Track RFC, 1521 http://www.ietf.org/rfc/rfc2203.txt 1523 [RPCSECGSSV2] 1524 M. Eisler, "RPCSEC_GSS Version 2", Internet Draft Work in 1525 Progress, draft-ietf-nfsv4-rpcsec-gss-v2 1527 [RFC5056] 1528 N. Williams, "On the Use of Channel Bindings to Secure 1529 Channels", Standards Track RFC 1530 http://www.ietf.org/rfc/rfc5056.txt 1532 [BTNSLATCH] 1533 N. Williams, "IPsec Channels: Connection Latching", Internet 1534 Draft Work in Progress, draft-ietf-btns-connection-latching 1536 [RFC5042] 1537 J. Pinkerton, E. Deleganes, "Direct Data Placement Protocol 1538 (DDP) / Remote Direct Memory Access Protocol (RDMAP) 1539 Security", Standards Track RFC, 1540 http://www.ietf.org/rfc/rfc5042.txt 1542 [IANA-NETID] 1543 M. Eisler, "IANA Considerations for RPC Net Identifiers and 1544 Universal Address Formats", Internet Draft Work in Progress, 1545 draft-ietf-nfsv4-rpc-netid 1547 15. 
Informative References 1549 [RFC1094] 1550 Sun Microsystems, "NFS: Network File System Protocol 1551 Specification", (NFS version 2) Informational RFC, 1552 http://www.ietf.org/rfc/rfc1094.txt 1554 [RFC1813] 1555 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 1556 Protocol Specification", Informational RFC, 1557 http://www.ietf.org/rfc/rfc1813.txt 1559 [RFC3530] 1560 S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, 1561 M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards 1562 Track RFC, http://www.ietf.org/rfc/rfc3530.txt 1564 [NFSDDP] 1565 B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet 1566 Draft Work in Progress, draft-ietf-nfsv4-nfsdirect 1568 [RFC5040] 1569 R. Recio et al., "A Remote Direct Memory Access Protocol 1570 Specification", Standards Track RFC, 1571 http://www.ietf.org/rfc/rfc5040.txt 1573 [RFC5041] 1574 H. Shah et al., "Direct Data Placement over Reliable 1575 Transports", Standards Track RFC, 1576 http://www.ietf.org/rfc/rfc5041.txt 1578 [NFSRDMAPS] 1579 T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet 1580 Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- 1581 statement 1583 [NFSv4.1] 1584 S. Shepler et al., ed., "NFSv4 Minor Version 1", Internet 1585 Draft Work in Progress, draft-ietf-nfsv4-minorversion1 1587 [IB] 1588 Infiniband Architecture Specification, available from 1589 http://www.infinibandta.org 1591 [IBPORT] 1592 Infiniband Trade Association, "IP Addressing Annex", available 1593 from http://www.infinibandta.org 1595 Authors' Addresses 1597 Tom Talpey 1598 Network Appliance, Inc. 1599 1601 Trapelo Road, #16 1600 Waltham, MA 02451 USA 1602 Phone: +1 781 768 5329 1603 EMail: thomas.talpey@netapp.com 1605 Brent Callaghan 1606 Apple Computer, Inc. 1607 MS: 302-4K 1608 2 Infinite Loop 1609 Cupertino, CA 95014 USA 1611 EMail: brentc@apple.com 1613 Intellectual Property Statement 1615 The IETF Trust takes no position regarding the validity or scope of 1616 any Intellectual Property Rights or other rights that might be 1617 claimed to pertain to the implementation or use of the technology 1618 described in any IETF Document or the extent to which any license 1619 under such rights might or might not be available; nor does it 1620 represent that it has made any independent effort to identify any 1621 such rights. 1623 Copies of Intellectual Property disclosures made to the IETF 1624 Secretariat and any assurances of licenses to be made available, or 1625 the result of an attempt made to obtain a general license or 1626 permission for the use of such proprietary rights by implementers 1627 or users of this specification can be obtained from the IETF on- 1628 line IPR repository at http://www.ietf.org/ipr 1630 The IETF invites any interested party to bring to its attention any 1631 copyrights, patents or patent applications, or other proprietary 1632 rights that may cover technology that may be required to implement 1633 any standard or specification contained in an IETF Document. Please 1634 address the information to the IETF at ietf-ipr@ietf.org 1636 The definitive version of an IETF Document is that published by, or 1637 under the auspices of, the IETF. Versions of IETF Documents that 1638 are published by third parties, including those that are translated 1639 into other languages, should not be considered to be definitive 1640 versions of IETF Documents. The definitive version of these Legal 1641 Provisions is that published by, or under the auspices of, the 1642 IETF. 
Versions of these Legal Provisions that are published by 1643 third parties, including those that are translated into other 1644 languages, should not be considered to be definitive versions of 1645 these Legal Provisions. 1647 For the avoidance of doubt, each Contributor to the IETF Standards 1648 Process licenses each Contribution that he or she makes as part of 1649 the IETF Standards Process to the IETF Trust pursuant to the 1650 provisions of RFC 5378. No language to the contrary, or terms, 1651 conditions or rights that differ from or are inconsistent with the 1652 rights and licenses granted under RFC 5378, shall have any effect 1653 and shall be null and void, whether published or posted by such 1654 Contributor, or included with or in such Contribution. 1656 Disclaimer of Validity 1658 All IETF Documents and the information contained therein are 1659 provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION 1660 HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET 1661 SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE 1662 DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 1663 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION THEREIN 1664 WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1665 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1667 Copyright Statement 1669 Copyright (c) 2008 IETF Trust and the persons identified as the 1670 document authors. All rights reserved. 1672 This document is subject to BCP 78 and the IETF Trust's Legal 1673 Provisions Relating to IETF Documents 1674 (http://trustee.ietf.org/license-info) in effect on the date of 1675 publication of this document. Please review these documents 1676 carefully, as they describe your rights and restrictions with 1677 respect to this document.