-------------------------------------------------------------------------------- 2 NFSv4 Working Group Tom Talpey 3 Internet-Draft NetApp 4 Intended status: Standards Track Brent Callaghan 5 Expires: August 23, 2008 Apple 6 February 22, 2008 8 Remote Direct Memory Access Transport for Remote Procedure Call 9 draft-ietf-nfsv4-rpcrdma-07 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other 25 documents at any time. It is inappropriate to use Internet-Drafts 26 as reference material or to cite them other than as "work in 27 progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 Copyright Notice 37 Copyright (C) The IETF Trust (2008). 39 Abstract 41 A protocol is described providing Remote Direct Memory Access 42 (RDMA) as a new transport for Computing Remote Procedure Call 43 (RPC). The RDMA transport binding conveys the benefits of 44 efficient, bulk data transport over high speed networks, while 45 providing for minimal change to RPC applications and with no 46 required revision of the application RPC protocol, or the RPC 47 protocol itself. 49 Table of Contents 51 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 52 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 53 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 54 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 55 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 56 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 57 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 58 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10 59 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 60 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 61 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 62 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 63 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 64 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 65 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 66 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 67 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 68 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 69 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 70 6. Connection Configuration Protocol . . . . . . . . . . . . 25 71 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 72 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 73 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 74 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 75 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 76 10. RPC Binding . . . 
. . . . . . . . . . . . . . . . . . . . 29 77 11. Security Considerations . . . . . . . . . . . . . . . . . 30 78 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 31 79 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 32 80 14. Normative References . . . . . . . . . . . . . . . . . . 32 81 15. Informative References . . . . . . . . . . . . . . . . . 33 82 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 34 83 17. Intellectual Property and Copyright Statements . . . . . 35 84 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 36 86 Requirements Language 88 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 89 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 90 this document are to be interpreted as described in [RFC2119]. 92 1. Introduction 94 Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a 95 technique for efficient movement of data between end nodes, which 96 becomes increasingly compelling over high speed transports. By 97 directing data into destination buffers as it is sent on a network, 98 and placing it via direct memory access by hardware, the double 99 benefit of faster transfers and reduced host overhead is obtained. 101 Open Network Computing Remote Procedure Call (ONC RPC, or simply, 102 RPC) [RFC1831bis] is a remote procedure call protocol that has been 103 run over a variety of transports. Most RPC implementations today 104 use UDP or TCP. RPC messages are defined in terms of an eXternal 105 Data Representation (XDR) [RFC4506] which provides a canonical data 106 representation across a variety of host architectures. An XDR data 107 stream is conveyed differently on each type of transport. On UDP, 108 RPC messages are encapsulated inside datagrams, while on a TCP byte 109 stream, RPC messages are delineated by a record marking protocol. 110 An RDMA transport also conveys RPC messages in a unique fashion 111 that must be fully described if client and server implementations 112 are to interoperate. 114 RDMA transports present new semantics unlike the behaviors of 115 either UDP or TCP alone. They retain message delineations like UDP 116 while also providing a reliable, sequenced data transfer like TCP. 117 And, they provide the new efficient, bulk transfer service of RDMA. 118 RDMA transports are therefore naturally viewed as a new transport 119 type by RPC. 121 RDMA as a transport will benefit the performance of RPC protocols 122 that move large "chunks" of data, since RDMA hardware excels at 123 moving data efficiently between host memory and a high speed 124 network with little or no host CPU involvement. In this context, 125 the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] 126 [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem 127 statement is discussed in [NFSRDMAPS], and related NFSv4 issues are 128 discussed in [NFSv4.1]. Many other RPC-based protocols will also 129 benefit. 131 Although the RDMA transport described here provides relatively 132 transparent support for any RPC application, the proposal goes 133 further in describing mechanisms that can optimize the use of RDMA 134 with more active participation by the RPC application. 136 2. Abstract RDMA Requirements 138 An RPC transport is responsible for conveying an RPC message from a 139 sender to a receiver. An RPC message is either an RPC call from a 140 client to a server, or an RPC reply from the server back to the 141 client. 
An RPC message contains an RPC call header followed by 142 arguments if the message is an RPC call, or an RPC reply header 143 followed by results if the message is an RPC reply. The call 144 header contains a transaction ID (XID) followed by the program and 145 procedure number as well as a security credential. An RPC reply 146 header begins with an XID that matches that of the RPC call 147 message, followed by a security verifier and results. All data in 148 an RPC message is XDR encoded. For a complete description of the 149 RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. 151 This protocol assumes the following abstract model for RDMA 152 transports. These terms, common in the RDMA lexicon, are used in 153 this document. A more complete glossary of RDMA terms can be found 154 in [RFC5040]. 156 o Registered Memory 157 All data moved via tagged RDMA operations is resident in 158 registered memory at its destination. This protocol assumes 159 that each segment of registered memory MUST be identified with 160 a steering tag of no more than 32 bits and memory addresses of 161 up to 64 bits in length. 163 o RDMA Send 164 The RDMA provider supports an RDMA Send operation with 165 completion signalled at the receiver when data is placed in a 166 pre-posted buffer. The amount of transferred data is limited 167 only by the size of the receiver's buffer. Sends complete at 168 the receiver in the order they were issued at the sender. 170 o RDMA Write 171 The RDMA provider supports an RDMA Write operation to directly 172 place data in the receiver's buffer. An RDMA Write is 173 initiated by the sender and completion is signalled at the 174 sender. No completion is signalled at the receiver. The 175 sender uses a steering tag, memory address and length of the 176 remote destination buffer. RDMA Writes are not necessarily 177 ordered with respect to one another, but are ordered with 178 respect to RDMA Sends; a subsequent RDMA Send completion 179 obtained at the receiver guarantees that prior RDMA Write data 180 has been successfully placed in the receiver's memory. 182 o RDMA Read 183 The RDMA provider supports an RDMA Read operation to directly 184 place peer source data in the requester's buffer. An RDMA 185 Read is initiated by the receiver and completion is signalled 186 at the receiver. The receiver provides steering tags, memory 187 addresses and a length for the remote source and local 188 destination buffers. Since the peer at the data source 189 receives no notification of RDMA Read completion, there is an 190 assumption that on receiving the data the receiver will signal 191 completion with an RDMA Send message, so that the peer can 192 free the source buffers and the associated steering tags. 194 This protocol is designed to be carried over all RDMA transports 195 meeting the stated requirements. This protocol conveys to the RPC 196 peer, information sufficient for that RPC peer to direct an RDMA 197 layer to perform transfers containing RPC data, and to communicate 198 their result(s). For example, it is readily carried over RDMA 199 transports such as iWARP [RFC5040, RFC5041] or Infiniband [IB]. 201 3. Protocol Outline 203 An RPC message can be conveyed in identical fashion, whether it is 204 a call or reply message. In each case, the transmission of the 205 message proper is preceded by transmission of a transport-specific 206 header for use by RPC over RDMA transports. 
This header is 207 analogous to the record marking used for RPC over TCP, but is more 208 extensive, since RDMA transports support several modes of data 209 transfer and it is important to allow the client and server to use 210 the most efficient mode for any given transfer. Multiple segments 211 of a message may be transferred in different ways to different 212 remote memory destinations. 214 All transfers of a call or reply begin with an RDMA Send which 215 transfers at least the RPC over RDMA header, usually with the call 216 or reply message appended, or at least some part thereof. Because 217 the size of what may be transmitted via RDMA Send is limited by the 218 size of the receiver's pre-posted buffer, the RPC over RDMA 219 transport provides a number of methods to reduce the amount 220 transferred by means of the RDMA Send, when necessary, by 221 transferring various parts of the message using RDMA Read and RDMA 222 Write. 224 RPC over RDMA framing replaces all other RPC framing (such as TCP 225 record marking) when used atop an RPC/RDMA association, even though 226 the underlying RDMA protocol may itself be layered atop a protocol 227 with a defined RPC framing (such as TCP). An upper layer may 228 however define an exchange to dynamically enable RPC/RDMA on an 229 existing RPC association. Any such exchange must be carefully 230 architected so as to prevent any ambiguity as to the framing in use 231 for each side of the connection. Because RPC/RDMA framing delimits 232 an entire RPC request or reply, any such shift must occur between 233 distinct RPC messages. 235 3.1. Short Messages 237 Many RPC messages are quite short. For example, the NFS version 3 238 GETATTR request, is only 56 bytes: 20 bytes of RPC header, plus a 239 32 byte file handle argument and 4 bytes of length. The reply to 240 this common request is about 100 bytes. 242 There is no benefit in transferring such small messages with an 243 RDMA Read or Write operation. The overhead in transferring 244 steering tags and memory addresses is justified only by large 245 transfers. The critical message size that justifies RDMA transfer 246 will vary depending on the RDMA implementation and network, but is 247 typically of the order of a few kilobytes. It is appropriate to 248 transfer a short message with an RDMA Send to a pre-posted buffer. 249 The RPC over RDMA header with the short message (call or reply) 250 immediately following is transferred using a single RDMA Send 251 operation. 253 Short RPC messages over an RDMA transport: 255 RPC Client RPC Server 256 | RPC Call | 257 Send | ------------------------------> | 258 | | 259 | RPC Reply | 260 | <------------------------------ | Send 262 3.2. Data Chunks 264 Some protocols, like NFS, have RPC procedures that can transfer 265 very large "chunks" of data in the RPC call or reply and would 266 cause the maximum send size to be exceeded if one tried to transfer 267 them as part of the RDMA Send. These large chunks typically range 268 from a kilobyte to a megabyte or more. An RDMA transport can 269 transfer large chunks of data more efficiently via the direct 270 placement of an RDMA Read or RDMA Write operation. Using direct 271 placement instead of inline transfer not only avoids expensive data 272 copies, but provides correct data alignment at the destination. 274 3.3. Flow Control 276 It is critical to provide RDMA Send flow control for an RDMA 277 connection. 
RDMA receive operations will fail if a pre-posted 278 receive buffer is not available to accept an incoming RDMA Send, 279 and repeated occurrences of such errors can be fatal to the 280 connection. This is a departure from conventional TCP/IP 281 networking where buffers are allocated dynamically on an as-needed 282 basis, and where pre-posting is not required. 284 It is not practical to provide for fixed credit limits at the RPC 285 server. Fixed limits scale poorly, since posted buffers are 286 dedicated to the associated connection until consumed by receive 287 operations. Additionally for protocol correctness, the RPC server 288 must always be able to reply to client requests, whether or not new 289 buffers have been posted to accept future receives. (Note that the 290 RPC server may in fact be a client at some other layer. For 291 example, NFSv4 callbacks are processed by the NFSv4 client, acting 292 as an RPC server. The credit discussions apply equally in either 293 case.) 295 Flow control for RDMA Send operations is implemented as a simple 296 request/grant protocol in the RPC over RDMA header associated with 297 each RPC message. The RPC over RDMA header for RPC call messages 298 contains a requested credit value for the RPC server, which MAY be 299 dynamically adjusted by the caller to match its expected needs. 300 The RPC over RDMA header for the RPC reply messages provides the 301 granted result, which MAY have any value except it MUST NOT be zero 302 when no in-progress operations are present at the server, since 303 such a value would result in deadlock. The value MAY be adjusted 304 up or down at each opportunity to match the server's needs or 305 policies. 307 The RPC client MUST NOT send unacknowledged requests in excess of 308 this granted RPC server credit limit. If the limit is exceeded, 309 the RDMA layer may signal an error, possibly terminating the 310 connection. Even if an error does not occur, it is OPTIONAL that 311 the server handle the excess request(s), and it MAY return an RPC 312 error to the client. Also note that the never-zero requirement 313 implies that an RPC server MUST always provide at least one credit 314 to each connected RPC client from which no requests are 315 outstanding. The client would deadlock otherwise, unable to send 316 another request. 318 While RPC calls complete in any order, the current flow control 319 limit at the RPC server is known to the RPC client from the Send 320 ordering properties. It is always the most recent server-granted 321 credit value minus the number of requests in flight. 323 Certain RDMA implementations may impose additional flow control 324 restrictions, such as limits on RDMA Read operations in progress at 325 the responder. Because these operations are outside the scope of 326 this protocol, they are not addressed and SHOULD be provided for by 327 other layers. For example, a simple upper layer RPC consumer might 328 perform single-issue RDMA Read requests, while a more 329 sophisticated, multithreaded RPC consumer might implement its own 330 FIFO queue of such operations. For further discussion of possible 331 protocol implementations capable of negotiating these values, see 332 section 6 "Connection Configuration Protocol" of this draft, or 333 [NFSv4.1]. 335 3.4. XDR Encoding with Chunks 337 The data comprising an RPC call or reply message is marshaled or 338 serialized into a contiguous stream by an XDR routine. 
XDR data 339 types such as integers, strings, arrays and linked lists are 340 commonly implemented over two very simple functions that encode 341 either an XDR data unit (32 bits) or an array of bytes. 343 Normally, the separate data items in an RPC call or reply are 344 encoded as a contiguous sequence of bytes for network transmission 345 over UDP or TCP. However, in the case of an RDMA transport, local 346 routines such as XDR encode can determine that (for instance) an 347 opaque byte array is large enough to be more efficiently moved via 348 an RDMA data transfer operation like RDMA Read or RDMA Write. 350 Semantically speaking, the protocol has no restriction regarding 351 data types which may or may not be represented by a read or write 352 chunk. In practice however, efficiency considerations lead to the 353 conclusion that certain data types are not generally "chunkable". 354 Typically, only those opaque and aggregate data types that may 355 attain substantial size are considered to be eligible. With 356 today's hardware this size may be a kilobyte or more. However any 357 object MAY be chosen for chunking in any given message. 359 The eligibility of XDR data items to be candidates for being moved 360 as data chunks (as opposed to being marshaled inline) is not 361 specified by the RPC over RDMA protocol. Chunk eligibility 362 criteria MUST be determined by each upper layer in order to provide 363 for an interoperable specification. One such example with 364 rationale, for the NFS protocol family, is provided in [NFSDDP]. 366 The interface by which an upper layer implementation communicates 367 the eligibility of a data item locally to RPC for chunking is out 368 of scope for this specification. In many implementations, it is 369 possible to implement a transparent RPC chunking facility. 370 However, such implementations may lead to inefficiencies, either 371 because they require the RPC layer to perform expensive 372 registration and deregistration of memory "on the fly", or they may 373 require using RDMA chunks in reply messages, along with the 374 resulting additional handshaking with the RPC over RDMA peer. 375 However, these issues are internal and generally confined to the 376 local interface between RPC and its upper layers, one in which 377 implementations are free to innovate. The only requirement is that 378 the resulting RPC RDMA protocol sent to the peer is valid for the 379 upper layer. See for example [NFSDDP]. 381 When sending any message (request or reply) that contains an 382 eligible large data chunk, the XDR encoding routine avoids moving 383 the data into the XDR stream. Instead, it does not encode the data 384 portion, but records the address and size of each chunk in a 385 separate "read chunk list" encoded within RPC RDMA transport- 386 specific headers. Such chunks will be transferred via RDMA Read 387 operations initiated by the receiver. 389 When the read chunks are to be moved via RDMA, the memory for each 390 chunk is registered. This registration may take place within XDR 391 itself, providing for full transparency to upper layers, or it may 392 be performed by any other specific local implementation. 394 Additionally, when making an RPC call that can result in bulk data 395 transferred in the reply, write chunks MAY be provided to accept 396 the data directly via RDMA Write. These write chunks will 397 therefore be pre-filled by the RPC server prior to responding, and 398 XDR decode of the data at the client will not be required. 
These 399 chunks undergo a similar registration and advertisement via "write 400 chunk lists" built as a part of XDR encoding. 402 Some RPC client implementations are not able to determine where an 403 RPC call's results reside during the "encode" phase. This makes it 404 difficult or impossible for the RPC client layer to encode the 405 write chunk list at the time of building the request. In this 406 case, it is difficult for the RPC implementation to provide 407 transparency to the RPC consumer, which may require recoding to 408 provide result information at this earlier stage. 410 Therefore if the RPC client does not make a write chunk list 411 available to receive the result, then the RPC server MAY return 412 data inline in the reply, or if the upper layer specification 413 permits, it MAY be returned via a read chunk list. It is NOT 414 RECOMMENDED that upper layer RPC client protocol specifications 415 omit write chunk lists for eligible replies, due to the lower 416 performance of the additional handshaking to perform data transfer, 417 and the requirement that the RPC server must expose (and preserve) 418 the reply data for a period of time. In the absence of a server- 419 provided read chunk list in the reply, if the encoded reply 420 overflows the posted receive buffer, the RPC will fail with an RDMA 421 transport error. 423 When any data within a message is provided via either read or write 424 chunks, the chunk itself refers only to the data portion of the XDR 425 stream element. In particular, for counted fields (e.g., a "<>" 426 encoding) the byte count which is encoded as part of the field 427 remains in the XDR stream, and is also encoded in the chunk list. 428 The data portion is however elided from the encoded XDR stream, and 429 is transferred as part of chunk list processing. This is important 430 to maintain upper layer implementation compatibility - both the 431 count and the data must be transferred as part of the logical XDR 432 stream. While the chunk list processing results in the data being 433 available to the upper layer peer for XDR decoding, the length 434 present in the chunk list entries is not. Any byte count in the 435 XDR stream MUST match the sum of the byte counts present in the 436 corresponding read or write chunk list. If they do not agree, an 437 RPC protocol encoding error results. 439 The following items are contained in a chunk list entry. 441 Handle 442 Steering tag or handle obtained when the chunk memory is 443 registered for RDMA. 445 Length 446 The length of the chunk in bytes. 448 Offset 449 The offset or beginning memory address of the chunk. In order 450 to support the widest array of RDMA implementations, as well 451 as the most general steering tag scheme, this field is 452 unconditionally included in each chunk list entry. 454 While zero-based offset schemes are available in many RDMA 455 implementations, their use by RPC requires individual 456 registration of each read or write chunk. On many such 457 implementations this can be a significant overhead. By 458 providing an offset in each chunk, many pre-registration or 459 region-based registrations can be readily supported, and by 460 using a single, universal chunk representation, the RPC RDMA 461 protocol implementation is simplified to its most general 462 form. 464 Position 465 For data which is to be encoded, the position in the XDR 466 stream where the chunk would normally reside. 
Note that the 467 chunk therefore inserts its data into the XDR stream at this 468 position, but its transfer is no longer "inline". Also note 469 therefore that all chunks belonging to a single RPC argument 470 or result will have the same position. For data which is to 471 be decoded, no position is used. 473 When XDR marshaling is complete, the chunk list is XDR encoded, 474 then sent to the receiver prepended to the RPC message. Any source 475 data for a read chunk, or the destination of a write chunk, remain 476 behind in the sender's registered memory and their actual payload 477 is not marshaled into the request or reply. 479 +----------------+----------------+------------- 480 | RPC over RDMA | | 481 | header w/ | RPC Header | Non-chunk args/results 482 | chunks | | 483 +----------------+----------------+------------- 485 Read chunk lists and write chunk lists are structured somewhat 486 differently. This is due to the different usage - read chunks are 487 decoded and indexed by their argument's or result's position in the 488 XDR data stream; their size is always known. Write chunks on the 489 other hand are used only for results, and have neither a 490 preassigned offset in the XDR stream, nor a size until the results 491 are produced, since the buffers may be only partially filled, or 492 may not be used for results at all. Their presence in the XDR 493 stream is therefore not known until the reply is processed. The 494 mapping of Write chunks onto designated NFS procedures and their 495 results is described in [NFSDDP]. 497 Therefore, read chunks are encoded into a read chunk list as a 498 single array, with each entry tagged by its (known) size and its 499 argument's or result's position in the XDR stream. Write chunks 500 are encoded as a list of arrays of RDMA buffers, with each list 501 element (an array) providing buffers for a separate result. 502 Individual write chunk list elements MAY thereby result in being 503 partially or fully filled, or in fact not being filled at all. 504 Unused write chunks, or unused bytes in write chunk buffer lists, 505 are not returned as results, and their memory is returned to the 506 upper layer as part of RPC completion. However, the RPC layer MUST 507 NOT assume that the buffers have not been modified. 509 3.5. XDR Decoding with Read Chunks 511 The XDR decode process moves data from an XDR stream into a data 512 structure provided by the RPC client or server application. Where 513 elements of the destination data structure are buffers or strings, 514 the RPC application can either pre-allocate storage to receive the 515 data, or leave the string or buffer fields null and allow the XDR 516 decode stage of RPC processing to automatically allocate storage of 517 sufficient size. 519 When decoding a message from an RDMA transport, the receiver first 520 XDR decodes the chunk lists from the RPC over RDMA header, then 521 proceeds to decode the body of the RPC message (arguments or 522 results). Whenever the XDR offset in the decode stream matches 523 that of a chunk in the read chunk list, the XDR routine initiates 524 an RDMA Read to bring over the chunk data into locally registered 525 memory for the destination buffer. 527 When processing an RPC request, the RPC receiver (RPC server) 528 acknowledges its completion of use of the source buffers by simply 529 replying to the RPC sender (client), and the peer may then free all 530 source buffers advertised by the request. 
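As an illustration of this decode path (not part of the protocol), the following sketch rebuilds the logical XDR stream of a received message by issuing an RDMA Read for each advertised chunk position; all names and helper interfaces are hypothetical:

   from itertools import groupby

   def reassemble_xdr_stream(inline, read_chunks, rdma_read):
       # 'inline' holds the bytes received with the RDMA Send (chunk data
       # elided); 'read_chunks' is a list of (xdr_position, handle, length,
       # offset) entries decoded from the RPC over RDMA header; 'rdma_read'
       # performs the actual transfer and returns the fetched bytes.
       out, consumed = bytearray(), 0
       for pos, group in groupby(sorted(read_chunks), key=lambda c: c[0]):
           # Copy inline bytes up to the logical XDR position of this
           # argument or result, then fetch its elided data via RDMA Read.
           take = pos - len(out)
           out += inline[consumed:consumed + take]
           consumed += take
           total = 0
           for _, handle, length, offset in group:
               # Chunks with the same position belong to the same argument.
               out += rdma_read(handle, length, offset)
               total += length
           # XDR roundup bytes are never transferred (Section 3.7); pad
           # locally so later fields keep their 4-byte alignment.
           out += b"\0" * ((4 - total % 4) % 4)
       out += inline[consumed:]    # remaining inline portion
       return bytes(out)

A production decoder would normally place chunk data directly into the destination buffers registered for it rather than copying into a contiguous stream, but the position bookkeeping is the same.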
532 When processing an RPC reply, after completing such a transfer the 533 RPC receiver (client) MUST issue an RDMA_DONE message (described in 534 Section 3.8) to notify the peer (server) that the source buffers 535 can be freed. 537 The read chunk list is constructed and used entirely within the 538 RPC/XDR layer. Other than specifying the minimum chunk size, the 539 management of the read chunk list is automatic and transparent to 540 an RPC application. 542 3.6. XDR Decoding with Write Chunks 544 When a "write chunk list" is provided for the results of the RPC 545 call, the RPC server MUST provide any corresponding data via RDMA 546 Write to the memory referenced in the chunk list entries. The RPC 547 reply conveys this by returning the write chunk list to the client 548 with the lengths rewritten to match the actual transfer. The XDR 549 "decode" of the reply therefore performs no local data transfer but 550 merely returns the length obtained from the reply. 552 Each decoded result consumes one entry in the write chunk list, 553 which in turn consists of an array of RDMA segments. The length is 554 therefore the sum of all returned lengths in all segments 555 comprising the corresponding list entry. As each list entry is 556 "decoded", the entire entry is consumed. 558 The write chunk list is constructed and used by the RPC 559 application. The RPC/XDR layer simply conveys the list between 560 client and server and initiates the RDMA Writes back to the client. 561 The mapping of write chunk list entries to procedure arguments MUST 562 be determined for each protocol. An example of a mapping is 563 described in [NFSDDP]. 565 3.7. XDR Roundup and Chunks 567 The XDR protocol requires 4-byte alignment of each new encoded 568 element in any XDR stream. This requirement is for efficiency and 569 ease of decode/unmarshaling at the receiver - if the XDR stream 570 buffer begins on a native machine boundary, then the XDR elements 571 will lie on similarly predictable offsets in memory. 573 Within XDR, when non-4-byte encodes (such as an odd-length string 574 or bulk data) are marshaled, their length is encoded literally, 575 while their data is padded to begin the next element at a 4-byte 576 boundary in the XDR stream. For TCP or RDMA inline encoding, this 577 minimal overhead is required because the transport-specific framing 578 relies on the fact that the relative offset of the elements in the 579 XDR stream from the start of the message determines the XDR 580 position during decode. 582 On the other hand, RPC/RDMA Read chunks carry the XDR position of 583 each chunked element and length of the Chunk segment, and can be 584 placed by the receiver exactly where they belong in the receiver's 585 memory without regard to the alignment of their position in the XDR 586 stream. Since any rounded-up data is not actually part of the 587 upper layer's message, the receiver will not reference it, and 588 there is no reason to set it to any particular value in the 589 receiver's memory. 591 When roundup is present at the end of a sequence of chunks, the 592 length of the sequence will terminate it at a non-4-byte XDR 593 position. When the receiver proceeds to decode the remaining part 594 of the XDR stream, it inspects the XDR position indicated by the 595 next chunk. Because this position will not match (else roundup 596 would not have occurred), the receiver decoding will fall back to 597 inspecting the remaining inline portion. 
If in turn, no data remains to be decoded from the inline portion, then the receiver MUST conclude that roundup is present, and therefore advances the XDR decode position to that indicated by the next chunk (if any).  In this way, roundup is passed without ever actually transferring additional XDR bytes.

Some protocol operations over RPC/RDMA, for instance NFS writes of data encountered at the end of a file or in direct I/O situations, commonly yield these roundups within RDMA Read Chunks.  Because any roundup bytes are not actually present in the data buffers being written, memory for these bytes would come from noncontiguous buffers, either as an additional memory registration segment, or as an additional Chunk.  The overhead of these operations can be significant for the sender, which must marshal them, and even higher for the receiver, which must transfer them.  Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for roundup whenever possible.  It is acceptable, but not necessary, to include roundup data in an existing RDMA Read Chunk, but only if it is already present in the XDR stream to carry upper layer data.

Note that there is no exposure of additional data at the sender due to eliding roundup data from the XDR stream, since any additional sender buffers are never exposed to the peer.  The data is literally not there to be transferred.

For RDMA Write Chunks, a simpler encoding method applies.  Again, roundup bytes are not transferred; instead, the chunk length sent to the receiver in the reply is simply increased to include any roundup.  Because of the requirement that the RDMA Write chunks are filled sequentially without gaps, this situation can only occur on the final chunk receiving data.  Therefore there is no opportunity for roundup data to insert misalignment or positional gaps into the XDR stream.

3.8. RPC Call and Reply

The RDMA transport for RPC provides three methods of moving data between RPC client and server:

   Inline
      Data are moved between RPC client and server within an RDMA
      Send.

   RDMA Read
      Data are moved between RPC client and server via an RDMA Read
      operation, using a steering tag, address and offset obtained
      from a read chunk list.

   RDMA Write
      Result data is moved from RPC server to client via an RDMA
      Write operation, using a steering tag, address and offset
      obtained from a write chunk list or reply chunk in the client's
      RPC call message.

These methods of data movement may occur in combinations within a single RPC.  For instance, an RPC call may contain some inline data along with some large chunks to be transferred via RDMA Read to the server.  The reply to that call may have some result chunks that the server RDMA Writes back to the client.
The following protocol 657 interactions illustrate RPC calls that use these methods to move 658 RPC message data: 660 An RPC with write chunks in the call message: 662 RPC Client RPC Server 663 | RPC Call + Write Chunk list | 664 Send | ------------------------------> | 665 | | 666 | Chunk 1 | 667 | <------------------------------ | Write 668 | : | 669 | Chunk n | 670 | <------------------------------ | Write 671 | | 672 | RPC Reply | 673 | <------------------------------ | Send 675 In the presence of write chunks, RDMA ordering provides the 676 guarantee that all data in the RDMA Write operations has been 677 placed in memory prior to the client's RPC reply processing. 679 An RPC with read chunks in the call message: 681 RPC Client RPC Server 682 | RPC Call + Read Chunk list | 683 Send | ------------------------------> | 684 | | 685 | Chunk 1 | 686 | +------------------------------ | Read 687 | v-----------------------------> | 688 | : | 689 | Chunk n | 690 | +------------------------------ | Read 691 | v-----------------------------> | 692 | | 693 | RPC Reply | 694 | <------------------------------ | Send 696 An RPC with read chunks in the reply message: 698 RPC Client RPC Server 699 | RPC Call | 700 Send | ------------------------------> | 701 | | 702 | RPC Reply + Read Chunk list | 703 | <------------------------------ | Send 704 | | 705 | Chunk 1 | 706 Read | ------------------------------+ | 707 | <-----------------------------v | 708 | : | 709 | Chunk n | 710 Read | ------------------------------+ | 711 | <-----------------------------v | 712 | | 713 | Done | 714 Send | ------------------------------> | 716 The final Done message allows the RPC client to signal the server 717 that it has received the chunks, so the server can de-register and 718 free the memory holding the chunks. A Done completion is not 719 necessary for an RPC call, since the RPC reply Send is itself a 720 receive completion notification. In the event that the client 721 fails to return the Done message within some timeout period, the 722 server MAY conclude that a protocol violation has occurred and 723 close the RPC connection, or it MAY proceed with a de-register and 724 free its chunk buffers. This may result in a fatal RDMA error if 725 the client later attempts to perform an RDMA Read operation, which 726 amounts to the same thing. 728 The use of read chunks in RPC reply messages is much less efficient 729 than providing write chunks in the originating RPC calls, due to 730 the additional message exchanges, the need for the RPC server to 731 advertise buffers to the peer, the necessity of the server 732 maintaining a timer for the purpose of recovery from misbehaving 733 clients, and the need for additional memory registration. Their 734 use is NOT RECOMMENDED by upper layers where efficiency is a 735 primary concern. [NFSDDP] However, they MAY be employed by upper 736 layer protocol bindings which are primarily concerned with 737 transparency, since they can frequently be implemented completely 738 within the RPC lower layers. 740 It is important to note that the Done message consumes a credit at 741 the RPC server. The RPC server SHOULD provide sufficient credits 742 to the client to allow the Done message to be sent without deadlock 743 (driving the outstanding credit count to zero). 
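As a purely illustrative sketch of such client-side credit accounting (names are hypothetical; this is not part of the protocol), both RPC calls and Done messages are charged against the most recently granted value described in Section 3.3:

   class SendCredits:
       # Client-side view of the server-granted RDMA Send credit limit.

       def __init__(self, initial_grant=1):
           self.granted = initial_grant   # most recent grant from a reply header
           self.outstanding = 0           # unretired RPC calls plus Done messages

       def reserve(self):
           # Call before posting any RDMA Send (an RPC call or an RDMA_DONE).
           if self.outstanding >= self.granted:
               raise RuntimeError("would exceed the granted credit limit")
           self.outstanding += 1

       def reply_received(self, granted_in_reply):
           # Each reply retires the credit consumed by its call and carries
           # the server's current grant, which may be adjusted up or down.
           self.outstanding -= 1
           self.granted = granted_in_reply

Credits consumed by Done messages are recovered only as the server replenishes the grant in subsequent replies.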
The RPC client 744 MUST account for its required Done messages to the server in its 745 accounting of available credits, and the server SHOULD replenish 746 any credit consumed by its use of such exchanges at its earliest 747 opportunity. 749 Finally, it is possible to conceive of RPC exchanges that involve 750 any or all combinations of write chunks in the RPC call, read 751 chunks in the RPC call, and read chunks in the RPC reply. Support 752 for such exchanges is straightforward from a protocol perspective, 753 but in practice such exchanges would be quite rare, limited to 754 upper layer protocol exchanges which transferred bulk data in both 755 the call and corresponding reply. 757 3.9. Padding 759 Alignment of specific opaque data enables certain scatter/gather 760 optimizations. Padding leverages the useful property that RDMA 761 transfers preserve alignment of data, even when they are placed 762 into pre-posted receive buffers by Sends. 764 Many servers can make good use of such padding. Padding allows the 765 chaining of RDMA receive buffers such that any data transferred by 766 RDMA on behalf of RPC requests will be placed into appropriately 767 aligned buffers on the system that receives the transfer. In this 768 way, the need for servers to perform RDMA Read to satisfy all but 769 the largest client writes is obviated. 771 The effect of padding is demonstrated below showing prior bytes on 772 an XDR stream (XXX) followed by an opaque field consisting of four 773 length bytes (LLLL) followed by data bytes (DDDD). The receiver of 774 the RDMA Send has posted two chained receive buffers. Without 775 padding, the opaque data is split across the two buffers. With the 776 addition of padding bytes ("ppp" in the figure below) prior to the 777 first data byte, the data can be forced to align correctly in the 778 second buffer. 780 Buffer 1 Buffer 2 781 Unpadded -------------- -------------- 783 XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD 785 Padded 787 XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD 789 Padding is implemented completely within the RDMA transport 790 encoding, flagged with a specific message type. Where padding is 791 applied, two values are passed to the peer: an "rdma_align" which 792 is the padding value used, and "rdma_thresh", which is the opaque 793 data size at or above which padding is applied. For instance, if 794 the server is using chained 4 KB receive buffers, then up to (4 KB 795 - 1) padding bytes could be used to achieve alignment of the data. 796 The XDR routine at the peer MUST consult these values when decoding 797 opaque values. Where the decoded length exceeds the rdma_thresh, 798 the XDR decode MUST skip over the appropriate padding as indicated 799 by rdma_align and the current XDR stream position. 801 4. RPC RDMA Message Layout 803 RPC call and reply messages are conveyed across an RDMA transport 804 with a prepended RPC over RDMA header. The RPC over RDMA header 805 includes data for RDMA flow control credits, padding parameters and 806 lists of addresses that provide direct data placement via RDMA Read 807 and Write operations. The layout of the RPC message itself is 808 unchanged from that described in [RFC1831bis] except for the 809 possible exclusion of large data chunks that will be moved by RDMA 810 Read or Write operations. 
If the RPC message (along with the RPC over RDMA header) is too long for the posted receive buffer (even after any large chunks are removed), then the entire RPC message MAY be moved separately as a chunk, leaving just the RPC over RDMA header in the RDMA Send.

4.1. RPC over RDMA Header

The RPC over RDMA header begins with four 32-bit fields that are always present and which control the RDMA interaction including RDMA-specific flow control.  These are then followed by a number of items such as chunk lists and padding which MAY or MUST NOT be present depending on the type of transmission.  The four fields which are always present are:

   1. Transaction ID (XID).
      The XID generated for the RPC call and reply.  Having the XID at
      the beginning of the message makes it easy to establish the
      message context.  This XID MUST be the same as the XID in the
      RPC header.  The receiver MAY perform its processing based
      solely on the XID in the RPC over RDMA header, and thereby
      ignore the XID in the RPC header, if it so chooses.

   2. Version number.
      This version of the RPC RDMA message protocol is 1.  The version
      number MUST be increased by one whenever the format of the RPC
      RDMA messages is changed.

   3. Flow control credit value.
      When sent in an RPC call message, the requested value is
      provided.  When sent in an RPC reply message, the granted value
      is returned.  RPC calls SHOULD NOT be sent in excess of the
      currently granted limit.

   4. Message type.

      o  RDMA_MSG = 0 indicates that chunk lists and RPC message
         follow.

      o  RDMA_NOMSG = 1 indicates that after the chunk lists there is
         no RPC message.  In this case, the chunk lists provide
         information to allow the message proper to be transferred
         using RDMA Read or Write, and thus it is not appended to the
         RPC over RDMA header.

      o  RDMA_MSGP = 2 indicates that a chunk list and RPC message
         with some padding follow.

      o  RDMA_DONE = 3 indicates that the message signals the
         completion of a chunk transfer via RDMA Read.

      o  RDMA_ERROR = 4 is used to signal any detected error(s) in the
         RPC RDMA chunk encoding.

Because the version number is encoded as part of this header, and the RDMA_ERROR message type is used to indicate errors, these first four fields and the start of the following message body MUST always remain aligned at these fixed offsets for all versions of the RPC over RDMA header.

For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write chunk lists follow.  If the Read chunk list is null (a 32-bit word of zeros), then there are no chunks to be transferred separately and the RPC message follows in its entirety.  If non-null, then it is the beginning of an XDR encoded sequence of Read chunk list entries.  If the Write chunk list is non-null, then an XDR encoded sequence of Write chunk entries follows.

If the message type is RDMA_MSGP, then two additional fields that specify the padding alignment and threshold are inserted prior to the Read and Write chunk lists.

A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by the RPC call or RPC reply message body, beginning with the XID.  The XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
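As an illustration only (not a normative definition), the four fixed fields can be produced by any XDR encoder, since each is a big-endian 32-bit word; the on-the-wire layout diagram follows the sketch:

   import struct

   RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR = range(5)
   RPCRDMA_VERSION = 1

   def encode_fixed_header(xid, credits, msg_type):
       # The four always-present fields, XDR-encoded as big-endian words:
       # XID, version, flow control credit value, message type.  Chunk
       # lists, optional padding fields, and the RPC message body follow.
       return struct.pack(">4I", xid, RPCRDMA_VERSION, credits, msg_type)

   # Example: header prefix for an RDMA_MSG call requesting 32 credits.
   hdr = encode_fixed_header(xid=0x12345678, credits=32, msg_type=RDMA_MSG)
   assert len(hdr) == 16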
886 +--------+---------+---------+-----------+-------------+---------- 887 | | | | Message | NULLs | RPC Call 888 | XID | Version | Credits | Type | or | or 889 | | | | | Chunk Lists | Reply Msg 890 +--------+---------+---------+-----------+-------------+---------- 892 Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or 893 RPC message follows. As an implementation hint: a gather operation 894 on the Send of the RDMA RPC message can be used to marshal the 895 initial header, the chunk list, and the RPC message itself. 897 4.2. RPC over RDMA header errors 899 When a peer receives an RPC RDMA message, it MUST perform the 900 following basic validity checks on the header and chunk contents. 901 If such errors are detected in the request, an RDMA_ERROR reply 902 MUST be generated. 904 Two types of errors are defined, version mismatch and invalid chunk 905 format. When the peer detects an RPC over RDMA header version 906 which it does not support (currently this draft defines only 907 version 1), it replies with an error code of ERR_VERS, and provides 908 the low and high inclusive version numbers it does, in fact, 909 support. The version number in this reply MUST be any value 910 otherwise valid at the receiver. When other decoding errors are 911 detected in the header or chunks, either an RPC decode error MAY be 912 returned, or the ROC/RDMA error code ERR_CHUNK MUST be returned. 914 4.3. XDR Language Description 916 Here is the message layout in XDR language. 918 struct xdr_rdma_segment { 919 uint32 handle; /* Registered memory handle */ 920 uint32 length; /* Length of the chunk in bytes */ 921 uint64 offset; /* Chunk virtual address or offset */ 922 }; 924 struct xdr_read_chunk { 925 uint32 position; /* Position in XDR stream */ 926 struct xdr_rdma_segment target; 927 }; 929 struct xdr_read_list { 930 struct xdr_read_chunk entry; 931 struct xdr_read_list *next; 932 }; 934 struct xdr_write_chunk { 935 struct xdr_rdma_segment target<>; 936 }; 938 struct xdr_write_list { 939 struct xdr_write_chunk entry; 940 struct xdr_write_list *next; 941 }; 943 struct rdma_msg { 944 uint32 rdma_xid; /* Mirrors the RPC header xid */ 945 uint32 rdma_vers; /* Version of this protocol */ 946 uint32 rdma_credit; /* Buffers requested/granted */ 947 rdma_body rdma_body; 948 }; 950 enum rdma_proc { 951 RDMA_MSG=0, /* An RPC call or reply msg */ 952 RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */ 953 RDMA_MSGP=2, /* An RPC call or reply msg with padding */ 954 RDMA_DONE=3, /* Client signals reply completion */ 955 RDMA_ERROR=4 /* An RPC RDMA encoding error */ 956 }; 957 union rdma_body switch (rdma_proc proc) { 958 case RDMA_MSG: 959 rpc_rdma_header rdma_msg; 960 case RDMA_NOMSG: 961 rpc_rdma_header_nomsg rdma_nomsg; 962 case RDMA_MSGP: 963 rpc_rdma_header_padded rdma_msgp; 964 case RDMA_DONE: 965 void; 966 case RDMA_ERROR: 967 rpc_rdma_error rdma_error; 968 }; 970 struct rpc_rdma_header { 971 struct xdr_read_list *rdma_reads; 972 struct xdr_write_list *rdma_writes; 973 struct xdr_write_chunk *rdma_reply; 974 /* rpc body follows */ 975 }; 977 struct rpc_rdma_header_nomsg { 978 struct xdr_read_list *rdma_reads; 979 struct xdr_write_list *rdma_writes; 980 struct xdr_write_chunk *rdma_reply; 981 }; 983 struct rpc_rdma_header_padded { 984 uint32 rdma_align; /* Padding alignment */ 985 uint32 rdma_thresh; /* Padding threshold */ 986 struct xdr_read_list *rdma_reads; 987 struct xdr_write_list *rdma_writes; 988 struct xdr_write_chunk *rdma_reply; 989 /* rpc body follows */ 990 }; 991 enum 
rpc_rdma_errcode { 992 ERR_VERS = 1, 993 ERR_CHUNK = 2 994 }; 996 union rpc_rdma_error switch (rpc_rdma_errcode err) { 997 case ERR_VERS: 998 uint32 rdma_vers_low; 999 uint32 rdma_vers_high; 1000 case ERR_CHUNK: 1001 void; 1002 default: 1003 uint32 rdma_extra[8]; 1004 }; 1006 5. Long Messages 1008 The receiver of RDMA Send messages is required by RDMA to have 1009 previously posted one or more adequately sized buffers. The RPC 1010 client can inform the server of the maximum size of its RDMA Send 1011 messages via the Connection Configuration Protocol described later 1012 in this document. 1014 Since RPC messages are frequently small, memory savings can be 1015 achieved by posting small buffers. Even large messages like NFS 1016 READ or WRITE will be quite small once the chunks are removed from 1017 the message. However, there may be large messages that would 1018 demand a very large buffer be posted, where the contents of the 1019 buffer may not be a chunkable XDR element. A good example is an 1020 NFS READDIR reply which may contain a large number of small 1021 filename strings. Also, the NFS version 4 protocol [RFC3530] 1022 features COMPOUND request and reply messages of unbounded length. 1024 Ideally, each upper layer will negotiate these limits. However, it 1025 is frequently necessary to provide a transparent solution. 1027 5.1. Message as an RDMA Read Chunk 1029 One relatively simple method is to have the client identify any RPC 1030 message that exceeds the RPC server's posted buffer size and move 1031 it separately as a chunk, i.e., reference it as the first entry in 1032 the read chunk list with an XDR position of zero. 1034 Normal Message 1036 +--------+---------+---------+------------+-------------+---------- 1037 | | | | | | RPC Call 1038 | XID | Version | Credits | RDMA_MSG | Chunk Lists | or 1039 | | | | | | Reply Msg 1040 +--------+---------+---------+------------+-------------+---------- 1042 Long Message 1044 +--------+---------+---------+------------+-------------+ 1045 | | | | | | 1046 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1047 | | | | | | 1048 +--------+---------+---------+------------+-------------+ 1049 | 1050 | +---------- 1051 | | Long RPC Call 1052 +->| or 1053 | Reply Message 1054 +---------- 1056 If the receiver gets an RPC over RDMA header with a message type of 1057 RDMA_NOMSG and finds an initial read chunk list entry with a zero 1058 XDR position, it allocates a registered buffer and issues an RDMA 1059 Read of the long RPC message into it. The receiver then proceeds 1060 to XDR decode the RPC message as if it had received it inline with 1061 the Send data. Further decoding may issue additional RDMA Reads to 1062 bring over additional chunks. 1064 Although the handling of long messages requires one extra network 1065 turnaround, in practice these messages will be rare if the posted 1066 receive buffers are correctly sized, and of course they will be 1067 non-existent for RDMA-aware upper layers. 
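The sender-side choice reduces to a size test against the peer's inline limit (known, for example, via the Connection Configuration Protocol of Section 6).  The following sketch is illustrative only, with hypothetical names; the corresponding protocol interactions are diagrammed next:

   RDMA_MSG, RDMA_NOMSG = 0, 1

   def choose_transfer(encoded_msg, max_inline, register):
       # 'encoded_msg' is the XDR-encoded RPC message with any chunk-
       # eligible data already removed; 'register' registers memory and
       # returns a (handle, length, offset) triple.  A real implementation
       # would also account for the RPC over RDMA header and chunk lists
       # when comparing against the peer's receive buffer size.
       if len(encoded_msg) <= max_inline:
           # Fits in the peer's posted buffer: send inline after the
           # RPC over RDMA header (message type RDMA_MSG).
           return RDMA_MSG, [], encoded_msg
       # Too large: advertise the entire message as the first read chunk,
       # with XDR position zero, and send only the header (RDMA_NOMSG).
       handle, length, offset = register(encoded_msg)
       return RDMA_NOMSG, [(0, handle, length, offset)], b""

For replies, the server makes the analogous decision, preferring a client-provided reply chunk (Section 5.2) when one is available.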
1069 A long call RPC with request supplied via RDMA Read 1071 RPC Client RPC Server 1072 | RDMA over RPC Header | 1073 Send | ------------------------------> | 1074 | | 1075 | Long RPC Call Msg | 1076 | +------------------------------ | Read 1077 | v-----------------------------> | 1078 | | 1079 | RDMA over RPC Reply | 1080 | <------------------------------ | Send 1082 An RPC with long reply returned via RDMA Read 1084 RPC Client RPC Server 1085 | RPC Call | 1086 Send | ------------------------------> | 1087 | | 1088 | RDMA over RPC Header | 1089 | <------------------------------ | Send 1090 | | 1091 | Long RPC Reply Msg | 1092 Read | ------------------------------+ | 1093 | <-----------------------------v | 1094 | | 1095 | Done | 1096 Send | ------------------------------> | 1098 It is possible for a single RPC procedure to employ both a long 1099 call for its arguments, and a long reply for its results. However, 1100 such an operation is atypical, as few upper layers define such 1101 exchanges. 1103 5.2. RDMA Write of Long Replies (Reply Chunks) 1105 A superior method of handling long RPC replies is to have the RPC 1106 client post a large buffer into which the server can write a large 1107 RPC reply. This has the advantage that an RDMA Write may be 1108 slightly faster in network latency than an RDMA Read, and does not 1109 require the server to wait for the completion as it must for RDMA 1110 Read. Additionally, for a reply it removes the need for an 1111 RDMA_DONE message if the large reply is returned as a Read chunk. 1113 This protocol supports direct return of a large reply via the 1114 inclusion of an OPTIONAL rdma_reply write chunk after the read 1115 chunk list and the write chunk list. The client allocates a buffer 1116 sized to receive a large reply and enters its steering tag, address 1117 and length in the rdma_reply write chunk. If the reply message is 1118 too long to return inline with an RDMA Send (exceeds the size of 1119 the client's posted receive buffer), even with read chunks removed, 1120 then the RPC server performs an RDMA Write of the RPC reply message 1121 into the buffer indicated by the rdma_reply chunk. If the client 1122 doesn't provide an rdma_reply chunk, or if it's too small, then if 1123 the upper layer specification permits, the message MAY be returned 1124 as a Read chunk. 1126 An RPC with long reply returned via RDMA Write 1128 RPC Client RPC Server 1129 | RPC Call with rdma_reply | 1130 Send | ------------------------------> | 1131 | | 1132 | Long RPC Reply Msg | 1133 | <------------------------------ | Write 1134 | | 1135 | RDMA over RPC Header | 1136 | <------------------------------ | Send 1138 The use of RDMA Write to return long replies requires that the 1139 client applications anticipate a long reply and have some knowledge 1140 of its size so that an adequately sized buffer can be allocated. 1141 This is certainly true of NFS READDIR replies; where the client 1142 already provides an upper bound on the size of the encoded 1143 directory fragment to be returned by the server. 1145 The use of these "reply chunks" is highly efficient and convenient 1146 for both RPC client and server. Their use is encouraged for 1147 eligible RPC operations such as NFS READDIR, which would otherwise 1148 require extensive chunk management within the results or use of 1149 RDMA Read and a Done message. [NFSDDP] 1151 6. 
6. Connection Configuration Protocol

RDMA Send operations require the receiver to post one or more buffers at the RDMA connection endpoint, each large enough to receive the largest Send message. Buffers are consumed as Send messages are received. If a buffer is too small, or if there are no buffers posted, the RDMA transport MAY return an error and break the RDMA connection. The receiver MUST post sufficient, adequately sized buffers to avoid buffer overrun or capacity errors.

The protocol described above includes only a mechanism for managing the number of such receive buffers, and no explicit features to allow the RPC client and server to provision or control buffer sizing, nor any other session parameters.

In the past, this type of connection management has not been necessary for RPC. RPC over UDP or TCP does not have a protocol to negotiate the link. The server can get a rough idea of the maximum size of messages from the server protocol code. However, a protocol to negotiate transport features on a more dynamic basis is desirable.

The Connection Configuration Protocol allows the client to pass its connection requirements to the server, and allows the server to inform the client of its connection limits.

Use of the Connection Configuration Protocol by an upper layer is OPTIONAL.

6.1. Initial Connection State

This protocol MAY be used for connection setup prior to the use of another RPC protocol that uses the RDMA transport. It operates in-band, i.e., it uses the connection itself to negotiate the connection parameters. To provide a basis for connection negotiation, the connection is assumed to provide a basic level of interoperability: the ability to exchange at least one RPC message at a time that is at least 1 KB in size. The server MAY exceed this basic level of configuration, but the client MUST NOT assume more than one, and MUST receive a valid reply from the server carrying the actual number of available receive messages, prior to sending its next request.

6.2. Protocol Description

Version 1 of the Connection Configuration protocol consists of a single procedure that allows the client to inform the server of its connection requirements and the server to return connection information to the client.

The maxcall_sendsize argument is the maximum size of an RPC call message that the client MAY send inline in an RDMA Send message to the server. The server MAY return a maxcall_sendsize value that is smaller or larger than the client's request. The client MUST NOT send an inline call message larger than what the server will accept. The maxcall_sendsize limits only the size of inline RPC calls. It does not limit the size of long RPC messages transferred as an initial chunk in the Read chunk list.

The maxreply_sendsize is the maximum size of an inline RPC message that the client will accept from the server.

The maxrdmaread is the maximum number of RDMA Reads that may be active at the peer. This number correlates to the incoming RDMA Read count ("IRD") configured into each originating endpoint by the client or server. If the connected peer issues more than this number of RDMA Read operations simultaneously, connection loss or suboptimal flow control may result; the value therefore SHOULD be observed at all times. The peers' values need not be equal. If zero, the peer MUST NOT issue requests which require RDMA Read to satisfy, as no transfer will be possible.

The align value is the alignment recommended by the server for opaque data values such as strings and counted byte arrays. The client MAY use this value to compute the number of prepended pad bytes when XDR encoding opaque values in the RPC call message.

   typedef unsigned int uint32;

   struct config_rdma_req {
        uint32  maxcall_sendsize;
                /* max size of inline RPC call */
        uint32  maxreply_sendsize;
                /* max size of inline RPC reply */
        uint32  maxrdmaread;
                /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
        uint32  maxcall_sendsize;
                /* max call size accepted by server */
        uint32  align;
                /* server's receive buffer alignment */
        uint32  maxrdmaread;
                /* max active RDMA Reads at server */
   };

   program CONFIG_RDMA_PROG {
        version VERS1 {
             /*
              * Config call/reply
              */
             config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
        } = 1;
   } = 100400;
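As a non-normative illustration, a client using a conventional ONC RPC library might invoke this procedure roughly as follows. The sketch assumes a header generated by rpcgen from the XDR definition above (here called "config_rdma.h", which would supply the structures, the XDR routines, and the CONFIG_RDMA_PROG, VERS1 and CONF_RDMA constants), a TI-RPC implementation that recognizes the "rdma" netid, and simplified error handling.

   /* Non-normative sketch; see the assumptions stated above. */
   #include <stdint.h>
   #include <string.h>
   #include <sys/time.h>
   #include <rpc/rpc.h>
   #include "config_rdma.h"   /* hypothetical rpcgen output */

   int negotiate_rdma_connection(const char *server,
                                 uint32_t *send_size, /* in/out: inline
                                                         call size     */
                                 uint32_t recv_size,
                                 uint32_t ird)
   {
       struct config_rdma_req   req;
       struct config_rdma_reply rep;
       struct timeval tv = { 25, 0 };
       CLIENT *clnt;

       /* The "rdma" netid is the one defined in section 12. */
       clnt = clnt_create(server, CONFIG_RDMA_PROG, VERS1, "rdma");
       if (clnt == NULL)
           return -1;

       req.maxcall_sendsize  = *send_size; /* largest inline call sent */
       req.maxreply_sendsize = recv_size;  /* largest inline reply taken*/
       req.maxrdmaread       = ird;        /* our incoming Read depth  */

       memset(&rep, 0, sizeof(rep));
       if (clnt_call(clnt, CONF_RDMA,
                     (xdrproc_t)xdr_config_rdma_req,   (char *)&req,
                     (xdrproc_t)xdr_config_rdma_reply, (char *)&rep,
                     tv) != RPC_SUCCESS) {
           clnt_destroy(clnt);
           return -1;
       }
       clnt_destroy(clnt);

       /* Honor the server's limits from here on. */
       if (rep.maxcall_sendsize < *send_size)
           *send_size = rep.maxcall_sendsize;  /* clamp inline calls   */

       return 0;
   }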
7. Memory Registration Overhead

RDMA requires that all data be transferred between registered memory regions at the source and destination. All protocol headers as well as separately transferred data chunks use registered memory. Since the cost of registering and de-registering memory can be a large proportion of the RDMA transaction cost, it is important to minimize registration activity. This is easily achieved within RPC controlled memory by allocating chunk list data and RPC headers in a reusable way from pre-registered pools.

The data chunks transferred via RDMA MAY occupy memory that persists outside the bounds of the RPC transaction. Hence, the default behavior of an RPC over RDMA transport is to register and de-register these chunks on every transaction. However, this is not a limitation of the protocol, but only of the existing local RPC API. The API is easily extended through such functions as rpc_control(3) to change the default behavior, so that the application can assume responsibility for controlling memory registration through an RPC-provided registered memory allocator.
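The pre-registered pool approach described above might be structured as in the following non-normative sketch. The registration calls are placeholders for whatever the local RDMA provider offers; none of the names are defined by this document.

   /* Non-normative sketch of a reusable, pre-registered buffer pool. */
   #include <stdlib.h>
   #include <stdint.h>
   #include <stddef.h>

   #define POOL_BUFS 64
   #define BUF_SIZE  1024          /* header and chunk-list buffers    */

   struct reg_buf {
       void     *addr;
       uint32_t  lkey;             /* local registration handle        */
       struct reg_buf *next;
   };

   /* Placeholders for the local RDMA provider's registration API. */
   extern int  rdma_register_memory(void *addr, size_t len,
                                    uint32_t *lkey);
   extern void rdma_deregister_memory(uint32_t lkey);

   static struct reg_buf *free_list;

   /* Register the pool once, at transport setup time. */
   int pool_init(void)
   {
       for (int i = 0; i < POOL_BUFS; i++) {
           struct reg_buf *b = malloc(sizeof(*b));
           if (b == NULL)
               return -1;
           b->addr = malloc(BUF_SIZE);
           if (b->addr == NULL ||
               rdma_register_memory(b->addr, BUF_SIZE, &b->lkey) != 0)
               return -1;
           b->next = free_list;
           free_list = b;
       }
       return 0;
   }

   /* Per-RPC buffers come from the pool: no per-call registration. */
   struct reg_buf *pool_get(void)
   {
       struct reg_buf *b = free_list;
       if (b != NULL)
           free_list = b->next;
       return b;
   }

   void pool_put(struct reg_buf *b)
   {
       b->next = free_list;
       free_list = b;
   }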
8. Errors and Error Recovery

RPC RDMA protocol errors are described in section 4. RPC errors and RPC error recovery are not affected by the protocol, and proceed as for any RPC error condition. RDMA transport error reporting and recovery are outside the scope of this protocol.

It is assumed that the link itself will provide some degree of error detection and retransmission. iWARP's MPA layer (when used over TCP), SCTP, as well as the Infiniband link layer all provide CRC protection of the RDMA payload, and CRC-class protection is a general attribute of such transports. Additionally, the RPC layer itself can accept errors from the link level and recover via retransmission. RPC recovery can handle complete loss and re-establishment of the link.

See section 11 for further discussion of the use of RPC-level integrity schemes to detect errors, and related efficiency issues.

9. Node Addressing

In setting up a new RDMA connection, the first action by an RPC client will be to obtain a transport address for the server. The mechanism used to obtain this address, and to open an RDMA connection, is dependent on the type of RDMA transport and is the responsibility of each RPC protocol binding and its local implementation.

10. RPC Binding

RPC services normally register with a portmap or rpcbind [RFC1833] service, which associates an RPC program number with a service address. (In the case of UDP or TCP, the service address for NFS is normally port 2049.) This policy is no different with RDMA interconnects, although it may require the allocation of port numbers appropriate to each upper layer binding which uses the RPC framing defined here.

When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses IP port addressing due to its layering on TCP and/or SCTP, port mapping is trivial and consists merely of issuing the port in the connection process.

When mapped atop Infiniband [IB], which uses a GID-based service endpoint naming scheme, a translation MUST be employed. One such translation is defined in the Infiniband Port Addressing Annex [IBPORT], which is appropriate for translating IP port addressing to the Infiniband network. Therefore, in this case, IP port addressing may be readily employed by the upper layer.

When a mapping standard or convention exists for IP ports on an RDMA interconnect, there are several possibilities for each upper layer to consider:

   One possibility is to have an upper layer server register its
   mapped IP port with the rpcbind service, under the netid (or
   netids) defined here.  An RPC/RDMA-aware client can then
   resolve its desired service to a mappable port, and proceed to
   connect.  This is the most flexible and compatible approach,
   for those upper layers which are defined to use the rpcbind
   service.

   A second possibility is to have the server's portmapper
   register itself on the RDMA interconnect at a "well known"
   service address.  (On UDP or TCP, this corresponds to port
   111.)  A client could connect to this service address and use
   the portmap protocol to obtain a service address in response
   to a program number, e.g., an iWARP port number, or an
   Infiniband GID.

   Alternatively, the client could simply connect to the mapped
   well-known port for the service itself, if it is appropriately
   defined.

Historically, different RPC protocols have taken different approaches to their port assignment; therefore, the specific method is left to each RPC/RDMA-enabled upper layer binding, and is not addressed here.

This specification defines a new "netid", to be used for registration of upper layers atop iWARP [RFC5040, RFC5041] and (when a suitable port translation service is available) Infiniband [IB] in section 12, "IANA Considerations." Additional RDMA-capable networks MAY define their own netids, or if they provide a port translation, MAY share the one defined here.

11. Security Considerations

RPC provides its own security via the RPCSEC_GSS framework [RFC2203]. RPCSEC_GSS can provide message authentication, integrity checking, and privacy. This security mechanism will be unaffected by the RDMA transport.
The data integrity and privacy features alter the body of the message, presenting it as a single chunk. For large messages the chunk may be large enough to qualify for RDMA Read transfer. However, there is much data movement associated with computation and verification of integrity, or encryption/decryption, so certain performance advantages may be lost.

For efficiency, a more appropriate security mechanism for RDMA links may be link-level protection, such as certain configurations of IPsec, which may be co-located in the RDMA hardware. The use of link-level protection MAY be negotiated through the use of the new RPCSEC_GSS mechanism defined in [RPCSECGSSV2], in conjunction with the Channel Binding mechanism [RFC5056] and IPsec Channel Connection Latching [BTNSLATCH]. Use of such mechanisms is REQUIRED where integrity and/or privacy is desired, and where efficiency is required.

An additional consideration is the protection of the integrity and privacy of local memory by the RDMA transport itself. The use of RDMA by RPC MUST NOT introduce any vulnerabilities to system memory contents, or to memory owned by user processes. These protections are provided by the RDMA layer specifications, and specifically their security models. It is REQUIRED that any RDMA provider used for RPC transport be conformant to the requirements of [RFC5042] in order to satisfy these protections.

Once delivered securely by the RDMA provider, any RDMA-exposed addresses will contain only RPC payloads in the chunk lists, transferred under the protection of RPCSEC_GSS integrity and privacy. By these means, the data will be protected end-to-end, as required by the RPC layer security model.

Where results are supplied to the requester via Read chunks, a server resource deficit can arise if the client does not promptly acknowledge their status via the RDMA_DONE message. This can potentially lead to a denial of service situation, with a single client unfairly (and unnecessarily) consuming server RDMA resources. Servers MUST protect against this situation, originating from one or many clients. For example, a time-based window of buffer availability may be offered; if the client fails to obtain the data within the window, it will simply retry using ordinary RPC retry semantics. Or, a more severe method would be for the server to simply close the client's RDMA connection, freeing the RDMA resources and allowing the server to reclaim them.

A fairer and more useful method is provided by the protocol itself. The server MAY use the rdma_credit value to limit the number of outstanding requests for each client. By including the number of outstanding RDMA_DONE completions in the computation of available client credits, the server can limit its exposure to each client, and therefore provide uninterrupted service as its resources permit.

However, the server must ensure that it does not decrease the credit count to zero with this method, since the RDMA_DONE message is not acknowledged. If the credit count were to drop to zero solely due to outstanding RDMA_DONE messages, the client would deadlock, since it would never obtain a new credit with which to continue. Therefore, if the server adjusts credits to zero for outstanding RDMA_DONE, it MUST withhold its reply to at least one message in order to provide the next credit. The time-based window (or any other appropriate method) SHOULD be used by the server to recover resources in the event that the client never returns.
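The credit computation discussed above can be illustrated with a small non-normative sketch. It simply clamps the advertised credit value so that it never reaches zero; a real server would instead apply the reply-withholding rule above together with a resource-recovery window.

   /* Non-normative sketch of per-client credit accounting. */
   #include <stdint.h>

   /*
    * max_credits:      per-client limit the server supports
    * outstanding_done: replies returned as Read chunks still awaiting
    *                   the client's RDMA_DONE
    *
    * Returns the value to place in the rdma_credit field of the next
    * reply to this client.
    */
   uint32_t grant_credits(uint32_t max_credits, uint32_t outstanding_done)
   {
       /* Charge unacknowledged Read-chunk replies against the limit. */
       uint32_t granted = (outstanding_done < max_credits)
                              ? max_credits - outstanding_done : 0;

       /*
        * Never advertise zero: RDMA_DONE is not itself acknowledged,
        * so a client with zero credits could never send the message
        * that would replenish them.
        */
       return granted > 0 ? granted : 1;
   }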
The "Connection Configuration Protocol", when used, MUST be protected by an appropriate RPC security flavor, to ensure it is not attacked in the process of initiating an RPC/RDMA connection.

12. IANA Considerations

The new RPC transport is to be assigned a new RPC "netid", which is an rpcbind [RFC1833] string used to describe the underlying protocol in order for RPC to select the appropriate transport framing, as well as the format of the service ports.

The following "nc_proto" registry string is hereby defined for this purpose:

   NC_RDMA "rdma"

This netid MAY be used for any RDMA network satisfying the requirements of section 2, and able to identify service endpoints using IP port addressing, possibly through use of a translation service as described above in section 10, RPC Binding.

As a new RPC transport, this protocol has no effect on RPC program numbers or existing registered port numbers. However, new port numbers MAY be registered for use by RPC/RDMA-enabled services, as appropriate to the new networks over which the services will operate.

The OPTIONAL Connection Configuration protocol described herein requires an RPC program number assignment. The value "100400" is hereby assigned:

   rdmaconfig  100400  rpc.rdmaconfig

Currently, neither the nc_proto netids nor the RPC program numbers are assigned by IANA. The list in [RFC1833] has served as the netid registry, and the republication declared in [IANA-RPC] has served as the program number registry. Ideally, IANA will create explicit registries for these objects. However, in the absence of new registries, this document would serve as the repository for the RPC program number assignment, and the protocol netid.

13. Acknowledgements

The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David Robinson and Mallikarjun Chadalapaka for their contributions to this document.

14. Normative References

   [RFC2119]
      S. Bradner, "Key words for use in RFCs to Indicate Requirement
      Levels", Best Current Practice, BCP 14, RFC 2119, March 1997.

   [RFC1094]
      Sun Microsystems, "NFS: Network File System Protocol
      Specification", (NFS version 2) Informational RFC,
      http://www.ietf.org/rfc/rfc1094.txt

   [RFC1831bis]
      R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
      Specification Version 2", Standards Track RFC

   [RFC4506]
      M. Eisler, Ed., "XDR: External Data Representation Standard",
      Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
      Protocol Specification", Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC1833]
      R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
      Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt
Noveck, "NFS version 4 Protocol", Standards 1514 Track RFC, http://www.ietf.org/rfc/rfc3530.txt 1516 [RFC2203] 1517 M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol 1518 Specification", Standards Track RFC, 1519 http://www.ietf.org/rfc/rfc2203.txt 1521 [RPCSECGSSV2] 1522 M. Eisler, "RPCSEC_GSS Version 2", Internet Draft Work in 1523 Progress draft-ietf-nfsv4-rpcsec-gss-v2 1525 [RFC5056] 1526 N. Williams, "On the Use of Channel Bindings to Secure 1527 Channels", Standards Track RFC 1529 [BTNSLATCH] 1530 N. Williams, "IPsec Channels: Connection Latching", Internet 1531 Draft Work in Progress draft-ietf-btns-connection-latching 1533 [RFC5042] 1534 J. Pinkerton, E. Deleganes, "Direct Data Placement Protocol 1535 (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security" 1536 Standards Track RFC 1538 15. Informative References 1540 [NFSDDP] 1541 B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet 1542 Draft Work in Progress, draft-ietf-nfsv4-nfsdirect 1544 [RFC5040] 1545 R. Recio et al., "A Remote Direct Memory Access Protocol 1546 Specification", Standards Track RFC 1548 [RFC5041] 1549 H. Shah et al., "Direct Data Placement over Reliable 1550 Transports", Standards Track RFC 1552 [NFSRDMAPS] 1553 T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet 1554 Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- 1555 statement 1557 [NFSv4.1] 1558 S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft 1559 Work in Progress, draft-ietf-nfsv4-minorversion1 1561 [IB] 1562 Infiniband Architecture Specification, available from 1563 http://www.infinibandta.org 1565 [IBPORT] 1566 Infiniband Trade Association, "IP Addressing Annex", available 1567 from http://www.infinibandta.org 1569 [IANA-RPC] 1570 IANA Sun RPC number statement, 1571 http://www.iana.org/assignments/sun-rpc-numbers 1573 16. Authors' Addresses 1575 Tom Talpey 1576 Network Appliance, Inc. 1577 1601 Trapelo Road, #16 1578 Waltham, MA 02451 USA 1580 Phone: +1 781 768 5329 1581 EMail: thomas.talpey@netapp.com 1582 Brent Callaghan 1583 Apple Computer, Inc. 1584 MS: 302-4K 1585 2 Infinite Loop 1586 Cupertino, CA 95014 USA 1588 EMail: brentc@apple.com 1590 17. Intellectual Property and Copyright Statements 1592 Full Copyright Statement 1594 Copyright (C) The IETF Trust (2008). 1596 This document is subject to the rights, licenses and restrictions 1597 contained in BCP 78, and except as set forth therein, the authors 1598 retain all their rights. 1600 This document and the information contained herein are provided on 1601 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 1602 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE 1603 IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL 1604 WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY 1605 WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE 1606 ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS 1607 FOR A PARTICULAR PURPOSE. 1609 Intellectual Property 1610 The IETF takes no position regarding the validity or scope of any 1611 Intellectual Property Rights or other rights that might be claimed 1612 to pertain to the implementation or use of the technology described 1613 in this document or the extent to which any license under such 1614 rights might or might not be available; nor does it represent that 1615 it has made any independent effort to identify any such rights. 
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).