Internet-Draft                                          Brent Callaghan
Expires: April 2006                                          Tom Talpey

Document: draft-ietf-nfsv4-rpcrdma-02                     October, 2005

                      RDMA Transport for ONC RPC

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

A protocol is described providing RDMA as a new transport for ONC RPC. The RDMA transport binding conveys the benefits of efficient, bulk data transport over high speed networks, while requiring minimal change to RPC applications and no revision of the application RPC protocol, or of the RPC protocol itself.

Table of Contents

   1.    Introduction
   2.    Abstract RDMA Requirements
   3.    Protocol Outline
   3.1.  Short Messages
   3.2.  Data Chunks
   3.3.  Flow Control
   3.4.  XDR Encoding with Chunks
   3.5.  Padding
   3.6.  XDR Decoding with Read Chunks
   3.7.  XDR Decoding with Write Chunks
   3.8.  RPC Call and Reply
   4.    RPC RDMA Message Layout
   4.1.  RPC over RDMA Header
   4.2.  RPC over RDMA Header Errors
   4.3.  XDR Language Description
   5.    Long Messages
   5.1.  Message as an RDMA Read Chunk
   5.2.  RDMA Write of Long Replies (Reply Chunks)
   6.    Connection Configuration Protocol
   6.1.  Initial Connection State
   6.2.  Protocol Description
   7.    Memory Registration Overhead
   8.    Errors and Error Recovery
   9.    Node Addressing
   10.   RPC Binding
   11.   Security
   12.   IANA Considerations
   13.   Acknowledgements
   14.   Normative References
   15.   Informative References
   16.   Authors' Addresses
   17.   Intellectual Property and Copyright Statements
   Acknowledgement

1. Introduction

RDMA is a technique for efficient movement of data between end nodes, which becomes increasingly compelling over high speed transports. By directing data into destination buffers as it is sent on a network, and placing it via direct memory access by hardware, the double benefit of faster transfers and reduced host overhead is obtained.

ONC RPC [RFC1831] is a remote procedure call protocol that has been run over a variety of transports. Most RPC implementations today use UDP or TCP. RPC messages are defined in terms of an eXternal Data Representation (XDR) [RFC1832], which provides a canonical data representation across a variety of host architectures. An XDR data stream is conveyed differently on each type of transport. On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte stream, RPC messages are delineated by a record marking protocol. An RDMA transport also conveys RPC messages in a unique fashion that must be fully described if client and server implementations are to interoperate.

RDMA transports present new semantics unlike the behaviors of either UDP or TCP alone. They retain message delineations like UDP, while also providing reliable, sequenced data transfer like TCP. And they provide the new, efficient bulk transfer service of RDMA. RDMA transports are therefore naturally viewed as a new transport type by ONC RPC.

RDMA as a transport will benefit the performance of RPC protocols that move large "chunks" of data, since RDMA hardware excels at moving data efficiently between host memory and a high speed network with little or no host CPU involvement. In this context, the NFS protocol, in all its versions, is an obvious beneficiary of RDMA. A complete problem statement is discussed in [NFSRDMAPS], and related NFSv4 issues are discussed in [NFSSESS]. Many other RPC-based protocols will also benefit.

Although the RDMA transport described here provides relatively transparent support for any RPC application, this document also describes mechanisms that can optimize the use of RDMA with more active participation by the RPC application.

2. Abstract RDMA Requirements

An RPC transport is responsible for conveying an RPC message from a sender to a receiver. An RPC message is either an RPC call from a client to a server, or an RPC reply from the server back to the client. An RPC message contains an RPC call header followed by arguments if the message is an RPC call, or an RPC reply header followed by results if the message is an RPC reply. The call header contains a transaction ID (XID) followed by the program and procedure number as well as a security credential. An RPC reply header begins with an XID that matches that of the RPC call message, followed by a security verifier and results. All data in an RPC message is XDR encoded. For a complete description of the RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].

This protocol assumes the following abstract model for RDMA transports. These terms, common in the RDMA lexicon, are used in this document. A more complete glossary of RDMA terms can be found in [RDMA].
o  Registered Memory

   All data moved via tagged RDMA operations must be resident in registered memory at its destination. This protocol assumes that each segment of registered memory may be identified with a steering tag of no more than 32 bits and memory addresses of up to 64 bits in length.

o  RDMA Send

   The RDMA provider supports an RDMA Send operation, with completion signalled at the receiver when data is placed in a pre-posted buffer. The amount of transferred data is limited only by the size of the receiver's buffer. Sends complete at the receiver in the order they were issued at the sender.

o  RDMA Write

   The RDMA provider supports an RDMA Write operation to directly place data in the receiver's buffer. An RDMA Write is initiated by the sender and completion is signalled at the sender; no completion is signalled at the receiver. The sender uses a steering tag, memory address and length of the remote destination buffer. RDMA Writes are not necessarily ordered with respect to one another, but are ordered with respect to RDMA Sends; a subsequent RDMA Send completion must be obtained at the receiver to notify that prior RDMA Write data has been successfully placed in the receiver's memory.

o  RDMA Read

   The RDMA provider supports an RDMA Read operation to directly place peer source data in the requester's buffer. An RDMA Read is initiated by the receiver and completion is signalled at the receiver. The receiver provides steering tags, memory addresses and a length for the remote source and local destination buffers. Since the peer at the data source receives no notification of RDMA Read completion, there is an assumption that on receiving the data, the receiver will signal completion with an RDMA Send message, so that the peer can free the source buffers and the associated steering tags.

This protocol is designed to be carried over all RDMA transports meeting the stated requirements. This protocol conveys to the RPC peer information sufficient for that RPC peer to direct an RDMA layer to perform transfers containing RPC data, and to communicate their result(s). For example, it is readily carried over RDMA transports such as iWARP [RDDP] or Infiniband [IB].
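To make the model concrete, the following C sketch summarizes the three operations as a minimal provider interface. It is illustrative only: the type and function names are assumptions of this example and do not correspond to any particular RDMA verbs API.

   #include <stdint.h>
   #include <stddef.h>

   /* Hypothetical connection handle; any real provider defines its own. */
   typedef struct rdma_conn rdma_conn_t;

   /* A registered memory segment: a steering tag of at most 32 bits
      plus a memory address of up to 64 bits, per the abstract model. */
   struct rdma_segment {
       uint32_t stag;     /* steering tag identifying registered memory */
       uint64_t offset;   /* memory address / offset within the region */
       uint32_t length;   /* length of the segment in bytes */
   };

   /* Send: completes at the receiver into a pre-posted buffer. */
   int rdma_send(rdma_conn_t *c, const void *buf, size_t len);

   /* Write: the sender places data directly into the peer's registered
      memory; no completion is signalled at the receiver. */
   int rdma_write(rdma_conn_t *c, const struct rdma_segment *remote,
                  const void *src, size_t len);

   /* Read: the receiver pulls data from the peer's registered memory
      into its own registered buffer. */
   int rdma_read(rdma_conn_t *c, const struct rdma_segment *remote_src,
                 void *local_dst, size_t len);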
3. Protocol Outline

An RPC message can be conveyed in identical fashion, whether it is a call or reply message. In each case, the transmission of the message proper is preceded by transmission of a transport-specific header for use by RPC over RDMA transports. This header is analogous to the record marking used for RPC over TCP, but is more extensive, since RDMA transports support several modes of data transfer and it is important to allow the client and server to use the most efficient mode for any given transfer. Multiple segments of a message may be transferred in different ways to different remote memory destinations.

All transfers of a call or reply begin with an RDMA Send which transfers at least the RPC over RDMA header, usually with the call or reply message appended, or at least some part thereof. Because the size of what may be transmitted via RDMA Send is limited by the size of the receiver's pre-posted buffer, the RPC over RDMA transport provides a number of methods to reduce the amount transferred by means of the RDMA Send, when necessary, by transferring various parts of the message using RDMA Read and RDMA Write.

3.1. Short Messages

Many RPC messages are quite short. For example, the NFS version 3 GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32 byte filehandle argument and 4 bytes of length. The reply to this common request is about 100 bytes.

There is no benefit in transferring such small messages with an RDMA Read or Write operation. The overhead in transferring steering tags and memory addresses is justified only by large transfers. The critical message size that justifies RDMA transfer will vary depending on the RDMA implementation and network, but is typically of the order of a few kilobytes. It is appropriate to transfer a short message with an RDMA Send to a pre-posted buffer. The RPC over RDMA header, with the short message (call or reply) immediately following, is transferred using a single RDMA Send operation.

Short RPC messages over an RDMA transport will look like this:

         RPC Client                           RPC Server
             |          RPC Call               |
        Send | ------------------------------> |
             |                                 |
             |          RPC Reply              |
             | <------------------------------ | Send

3.2. Data Chunks

Some protocols, like NFS, have RPC procedures that can transfer very large "chunks" of data in the RPC call or reply, and would cause the maximum send size to be exceeded if one tried to transfer them as part of the RDMA Send. These large chunks typically range from a kilobyte to a megabyte or more. An RDMA transport can transfer large chunks of data more efficiently via the direct placement of an RDMA Read or RDMA Write operation. Using direct placement instead of inline transfer not only avoids expensive data copies, but provides correct data alignment at the destination.
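As a minimal sketch of the resulting decision, assuming an implementation-chosen threshold of a few kilobytes (the value below is an assumption of this example, not part of the protocol):

   #include <stdbool.h>
   #include <stddef.h>

   /* Illustrative only: the critical size above which direct placement
      pays off is implementation- and network-dependent. */
   #define CHUNK_THRESHOLD 4096

   /* An XDR item at or above the threshold is worth moving as a data
      chunk via RDMA Read/Write; anything smaller is marshalled inline
      and carried by the RDMA Send. */
   static bool chunk_eligible(size_t item_len)
   {
       return item_len >= CHUNK_THRESHOLD;
   }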
3.3. Flow Control

It is critical to provide RDMA Send flow control for an RDMA connection. RDMA receive operations will fail if a pre-posted receive buffer is not available to accept an incoming RDMA Send, and repeated occurrences of such errors can be fatal to the connection. This is a departure from conventional TCP/IP networking, where buffers are allocated dynamically on an as-needed basis, and pre-posting is not required.

It is not practical to provide for fixed credit limits at the RPC server. Fixed limits scale poorly, since posted buffers are dedicated to the associated connection until consumed by receive operations. Additionally, for protocol correctness the RPC server must always be able to reply to client requests, whether or not new buffers have been posted to accept future receives. (Note that the RPC server may in fact be a client at some other layer. For example, NFSv4 callbacks are processed by the NFSv4 client, acting as an RPC server. The credit discussions apply equally in either case.)

Flow control for RDMA Send operations is implemented as a simple request/grant protocol in the RPC over RDMA header associated with each RPC message. The RPC over RDMA header for RPC call messages contains a requested credit value for the RPC server, which may be dynamically adjusted by the caller to match its expected needs. The RPC over RDMA header for RPC reply messages provides the granted result, which may have any value except that it may not be zero when no in-progress operations are present at the server, since such a value would result in deadlock. The value may be adjusted up or down at each opportunity to match the server's needs or policies.

The RPC client must not send unacknowledged requests in excess of this granted RPC server credit limit. If the limit is exceeded, the RDMA layer may signal an error, possibly terminating the connection. Even if an error does not occur, there is no requirement that the server handle the excess request(s), and it may return an RPC error to the client. Also note that the never-zero requirement implies that an RPC server must always provide at least one credit to each connected RPC client. It does not, however, require that the server always be prepared to receive a request from each client, for example when the server is busy processing all granted client requests.

While RPC calls may complete in any order, the current flow control limit at the RPC server is known to the RPC client from the Send ordering properties. It is always the most recent server-granted credit value minus the number of requests in flight.

Certain RDMA implementations may impose additional flow control restrictions, such as limits on RDMA Read operations in progress at the responder. Because these operations are outside the scope of this protocol, they are not addressed and must be provided for by other layers. For example, a simple upper layer RPC consumer might perform single-issue RDMA Read requests, while a more sophisticated, multithreaded RPC consumer might implement its own FIFO queue of such operations. For further discussion of possible protocol implementations capable of negotiating these values, see section 6 "Connection Configuration Protocol" of this draft, or [NFSSESS].
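The following C sketch illustrates the client side of this request/grant accounting. All names are assumptions of this example; the protocol defines only the requested and granted values themselves.

   #include <stdint.h>
   #include <stdbool.h>

   /* Client-side view of the credit protocol. */
   struct rpcrdma_credits {
       uint32_t granted;    /* most recent server-granted credit value */
       uint32_t in_flight;  /* requests sent but not yet replied to */
   };

   /* A new call may be sent only while unacknowledged requests do not
      exceed the granted limit. */
   static bool can_send_call(const struct rpcrdma_credits *c)
   {
       return c->in_flight < c->granted;
   }

   static void on_send_call(struct rpcrdma_credits *c)
   {
       c->in_flight++;
   }

   /* Each reply carries a new grant; because Sends complete in order,
      the current limit is the latest grant minus requests in flight. */
   static void on_receive_reply(struct rpcrdma_credits *c,
                                uint32_t granted_in_reply)
   {
       c->granted = granted_in_reply;   /* may adjust up or down */
       c->in_flight--;
   }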
3.4. XDR Encoding with Chunks

The data comprising an RPC call or reply message is marshalled or serialized into a contiguous stream by an XDR routine. XDR data types such as integers, strings, arrays and linked lists are commonly implemented over two very simple functions that encode either an XDR data unit (32 bits) or an array of bytes.

Normally, the separate data items in an RPC call or reply are encoded as a contiguous sequence of bytes for network transmission over UDP or TCP. However, in the case of an RDMA transport, local routines such as XDR encode can determine that (for instance) an opaque byte array is large enough to be more efficiently moved via an RDMA data transfer operation like RDMA Read or RDMA Write.

Semantically speaking, the protocol has no restriction regarding data types which may or may not be represented by a read or write chunk. In practice, however, efficiency considerations lead to the conclusion that certain data types are not generally "chunkable". Typically, only opaque and aggregate data types which may attain substantial size are considered to be eligible. With today's hardware this size may be a kilobyte or more. However, any object may be chosen for chunking in any given message.

The eligibility of XDR data items to be candidates for being moved as data chunks (as opposed to being marshalled inline) is not specified by the RPC over RDMA protocol. Chunk eligibility criteria must be determined by each upper layer in order to provide for an interoperable specification. One such example with rationale, for the NFS protocol family, is provided in [NFSDDP].

The interface by which an upper layer implementation communicates the eligibility of a data item locally to RPC for chunking is out of scope for this specification. In many implementations, it is possible to implement a transparent RPC chunking facility. However, such implementations may lead to inefficiencies, either because they require the RPC layer to perform expensive registration and deregistration of memory "on the fly", or because they may require using RDMA chunks in reply messages, along with the resulting additional handshaking with the RPC over RDMA peer. These issues are internal, however, and generally confined to the local interface between RPC and its upper layers, one in which implementations are free to innovate. The only requirement is that the resulting RPC RDMA protocol sent to the peer is valid for the upper layer. See, for example, [NFSDDP].

When sending any message (request or reply) that contains an eligible large data chunk, the XDR encoding routine avoids moving the data into the XDR stream. Instead, it does not encode the data portion, but records the address and size of each chunk in a separate "read chunk list" encoded within the RPC RDMA transport-specific headers. Such chunks will be transferred via RDMA Read operations initiated by the receiver.

When the read chunks are to be moved via RDMA, the memory for each chunk must be registered. This registration may take place within XDR itself, providing for full transparency to upper layers, or it may be performed by any other specific local implementation.

Additionally, when making an RPC call that can result in bulk data transferred in the reply, it is desirable to provide chunks to accept the data directly via RDMA Write. These write chunks will therefore be pre-filled by the RPC server prior to responding, and XDR decode of the data at the client will not be required. These chunks undergo a similar registration and are advertised via "write chunk lists" built as a part of XDR encoding.

Some RPC client implementations are not able to determine where an RPC call's results reside during the "encode" phase. This makes it difficult or impossible for the RPC client layer to encode the write chunk list at the time of building the request. In this case, it is difficult for the RPC implementation to provide transparency to the RPC consumer, which may require recoding to provide result information at this earlier stage.

Therefore, if the RPC client does not make a write chunk list available to receive the result, the RPC server must return data inline in the reply, or if it so chooses, via a read chunk list. RPC clients are discouraged from omitting write chunk lists for eligible replies, due to the lower performance of the additional handshaking needed to perform data transfer, and the requirement that the RPC server expose (and preserve) the reply data for a period of time. In the absence of a server-provided read chunk list in the reply, if the encoded reply overflows the posted receive buffer, the RPC will fail.
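As one illustration of this encoding step, the sketch below shows a hypothetical XDR encode routine for a counted opaque field that either marshals the data inline or elides it and records a read chunk. The helper functions (register_memory, xdr_put_u32, xdr_put_bytes) are assumptions of this example and are not defined by this protocol.

   #include <stdint.h>
   #include <stdbool.h>
   #include <stddef.h>

   struct read_chunk {
       uint32_t position;   /* XDR stream position the data belongs at */
       uint32_t handle;     /* steering tag from memory registration */
       uint64_t offset;     /* address of the chunk data */
       uint32_t length;     /* length of the chunk in bytes */
   };

   struct encode_ctx {
       uint32_t xdr_pos;            /* current offset in the XDR stream */
       struct read_chunk chunks[16];
       unsigned nchunks;            /* bounds checking omitted for brevity */
   };

   /* Hypothetical helpers: one-time registration returning a steering
      tag, and the two basic XDR encoders, which advance xdr_pos. */
   extern uint32_t register_memory(const void *addr, size_t len);
   extern void xdr_put_u32(struct encode_ctx *ctx, uint32_t v);
   extern void xdr_put_bytes(struct encode_ctx *ctx, const void *p, size_t n);

   /* Encode a counted opaque field ("<>"): the byte count always stays
      in the XDR stream; the data is either marshalled inline or elided
      and recorded in the read chunk list. */
   void encode_opaque(struct encode_ctx *ctx, const void *data,
                      uint32_t len, bool is_chunk)
   {
       xdr_put_u32(ctx, len);               /* count remains inline */
       if (is_chunk) {
           struct read_chunk *rc = &ctx->chunks[ctx->nchunks++];
           rc->position = ctx->xdr_pos;     /* where the data would sit */
           rc->handle = register_memory(data, len);
           rc->offset = (uint64_t)(uintptr_t)data;
           rc->length = len;                /* data itself is elided */
       } else {
           xdr_put_bytes(ctx, data, len);   /* normal inline encoding */
       }
   }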
When any data within a message is provided via either read or write chunks, the chunk itself refers only to the data portion of the XDR stream element. In particular, for counted fields (e.g. a "<>" encoding) the byte count which is encoded as part of the field remains in the XDR stream, and is also encoded in the chunk list. The data portion is, however, elided from the encoded XDR stream, and is transferred as part of chunk list processing. This is important to maintain upper layer implementation compatibility - both the count and the data must be transferred as part of the logical XDR stream. While the chunk list processing results in the data being available to the upper layer peer for XDR decoding, the length present in the chunk list entries is not. Any byte count in the XDR stream must match the sum of the byte counts present in the corresponding read or write chunk list. If they do not agree, an RPC protocol encoding error results.

The following items are contained in a chunk list entry.

Handle
   Steering tag or handle obtained when the chunk memory is registered for RDMA.

Length
   The length of the chunk in bytes.

Offset
   The offset or beginning memory address of the chunk. In order to support the widest array of RDMA implementations, as well as the most general steering tag scheme, this field is unconditionally included in each chunk list entry.

   While zero-based offset schemes are available in many RDMA implementations, their use by RPC requires individual registration of each read or write chunk. On many such implementations this can be a significant overhead. By providing an offset in each chunk, many pre-registration or region-based registrations can be readily supported, and by using a single, universal chunk representation, the RPC RDMA protocol implementation is simplified to its most general form.

Position
   For data which is to be encoded, the position in the XDR stream where the chunk would normally reside. Note that the chunk therefore inserts its data into the XDR stream at this position, but its transfer is no longer "inline". Also note it is possible that a contiguous sequence of chunks might all have the same position. For data which is to be decoded, no "position" is used.

When XDR marshaling is complete, the chunk list is XDR encoded, then sent to the receiver prepended to the RPC message. Any source data for a read chunk, or the destination of a write chunk, remains behind in the sender's registered memory, and its actual payload is not marshalled into the request or reply.

      +----------------+----------------+-------------
      | RPC over RDMA  |                |
      |    header w/   |   RPC Header   | Non-chunk args/results
      |     chunks     |                |
      +----------------+----------------+-------------
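Recalling the counted-field rule above, a receiver might apply a consistency check along the following lines. This is a sketch only, with the entry layout abbreviated; a mismatch is an RPC protocol encoding error.

   #include <stdint.h>
   #include <stdbool.h>

   /* One entry of a decoded chunk list; handle, offset and position
      are omitted here for brevity. */
   struct chunk_entry {
       uint32_t length;
   };

   /* The byte count encoded in the XDR stream must equal the sum of
      the byte counts in the corresponding chunk list entries. */
   static bool chunk_lengths_consistent(uint32_t xdr_count,
                                        const struct chunk_entry *e,
                                        unsigned n)
   {
       uint64_t sum = 0;
       for (unsigned i = 0; i < n; i++)
           sum += e[i].length;
       return sum == xdr_count;
   }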
Read chunk lists and write chunk lists are structured somewhat differently. This is due to their different usage: read chunks are decoded and indexed by their position in the XDR data stream, their size is always known, and they may be used for both arguments and results. Write chunks, on the other hand, are used only for results, and have neither a preassigned offset in the XDR stream nor a size until the results are produced, since the buffers may not be used for results at all, or may be only partially filled. Their presence in the XDR stream is therefore not known until the reply is processed. The mapping of write chunks onto designated NFS procedures and their results is described in [NFSDDP].

Therefore, read chunks are encoded into a read chunk list as a single array, with each entry tagged by its (known) size and position in the XDR stream. Write chunks are encoded as a list of arrays of RDMA buffers, with each list element (an array) providing buffers for a separate result. Individual write chunk list elements may thereby be partially or fully filled, or in fact not filled at all. Unused write chunks, or unused bytes in write chunk buffer lists, are not returned as results, and their memory is returned to the upper layer as part of RPC completion. However, the RPC layer should not assume that the buffers have not been modified.

3.5. Padding

Alignment of specific opaque data enables certain scatter/gather optimizations. Padding leverages the useful property that RDMA transfers preserve alignment of data, even when they are placed into pre-posted receive buffers by Sends.

Many servers can make good use of such padding. Padding allows the chaining of RDMA receive buffers, such that any data transferred by RDMA on behalf of RPC requests will be placed into appropriately aligned buffers on the system that receives the transfer. In this way, the need for servers to perform RDMA Read to satisfy all but the largest client writes is obviated.

The effect of padding is demonstrated below, showing prior bytes on an XDR stream (XXX) followed by an opaque field consisting of four length bytes (LLLL) followed by data bytes (DDDD). The receiver of the RDMA Send has posted two chained receive buffers. Without padding, the opaque data is split across the two buffers. With the addition of padding bytes (ppp) prior to the first data byte, the data can be forced to align correctly in the second buffer.

                                          Buffer 1       Buffer 2
   Unpadded                             --------------  --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD      --->  XXXXXXXLLLLDDD  DDDDDDDDDDD

   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD  --->  XXXXXXXLLLLppp  DDDDDDDDDDDDDD

Padding is implemented completely within the RDMA transport encoding, flagged with a specific message type. Where padding is applied, two values are passed to the peer: an "rdma_align", which is the padding value used, and an "rdma_thresh", which is the opaque data size at or above which padding is applied. For instance, if the server is using chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes could be used to achieve alignment of the data. If padding is to apply only to chunks at least 1 KB in size, then the threshold should be set to 1 KB. The XDR routine at the peer will consult these values when decoding opaque values. Where the decoded length is at or above the rdma_thresh, the XDR decode will skip over the appropriate padding as indicated by rdma_align and the current XDR stream position.
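For illustration, the pad computation a sender might perform could look like the following C sketch; the protocol itself conveys only rdma_align and rdma_thresh, and the helper below is an assumption of this example.

   #include <stdint.h>

   /* Compute the number of pad bytes ("ppp" above) to insert before
      the data bytes of an opaque field, so that the data lands aligned
      in the receiver's chained buffers. */
   static uint32_t pad_bytes(uint32_t rdma_align, uint32_t rdma_thresh,
                             uint32_t opaque_len, uint32_t xdr_pos)
   {
       if (opaque_len < rdma_thresh || rdma_align == 0)
           return 0;                  /* below threshold: no padding */
       /* xdr_pos is the stream offset where the data bytes would begin
          (after the 4-byte length); pad to the next alignment point. */
       return (rdma_align - (xdr_pos % rdma_align)) % rdma_align;
   }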
3.6. XDR Decoding with Read Chunks

The XDR decode process moves data from an XDR stream into a data structure provided by the RPC client or server application. Where elements of the destination data structure are buffers or strings, the RPC application can either pre-allocate storage to receive the data, or leave the string or buffer fields null and allow the XDR decode stage of RPC processing to automatically allocate storage of sufficient size.

When decoding a message from an RDMA transport, the receiver first XDR decodes the chunk lists from the RPC over RDMA header, then proceeds to decode the body of the RPC message (arguments or results). Whenever the XDR offset in the decode stream matches that of a chunk in the read chunk list, the XDR routine initiates an RDMA Read to bring over the chunk data into locally registered memory for the destination buffer.

When processing an RPC request, the RPC receiver (RPC server) acknowledges its completion of use of the source buffers by simply replying to the RPC sender (client), and the peer may then free all source buffers advertised by the request.

When processing an RPC reply, after completing such a transfer the RPC receiver (client) must issue an RDMA_DONE message (described in Section 3.8) to notify the peer (server) that the source buffers can be freed.

The read chunk list is constructed and used entirely within the RPC/XDR layer. Other than specifying the minimum chunk size, the management of the read chunk list is automatic and transparent to an RPC application.

3.7. XDR Decoding with Write Chunks

When a "write chunk list" is provided for the results of the RPC call, the RPC server must provide any corresponding data via RDMA Write to the memory referenced in the chunk list entries. The RPC reply conveys this by returning the write chunk list to the client with the lengths rewritten to match the actual transfer. The XDR "decode" of the reply therefore performs no local data transfer, but merely returns the length obtained from the reply.

Each decoded result consumes one entry in the write chunk list, which in turn consists of an array of RDMA segments. The length is therefore the sum of all returned lengths in all segments comprising the corresponding list entry. As each list entry is "decoded", the entire entry is consumed.

The write chunk list is constructed and used by the RPC application. The RPC/XDR layer simply conveys the list between client and server and initiates the RDMA Writes back to the client. The mapping of write chunk list entries to procedure arguments must be determined for each protocol. An example of a mapping is described in [NFSDDP].
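A minimal sketch of this "decode" step follows; the segment layout mirrors the XDR of section 4.3, but the function itself is an assumption of this example.

   #include <stdint.h>

   /* One RDMA segment of a write chunk (an array of these forms one
      write chunk list entry). */
   struct rdma_segment {
       uint32_t handle;
       uint32_t length;   /* rewritten by the server to the actual count */
       uint64_t offset;
   };

   /* "Decode" one result returned via a write chunk: no local data
      transfer occurs; the result length is the sum of the returned
      lengths in all segments of the entry, and the entire entry is
      consumed. */
   static uint32_t decode_write_chunk_result(const struct rdma_segment *seg,
                                             unsigned nsegs)
   {
       uint32_t total = 0;
       for (unsigned i = 0; i < nsegs; i++)
           total += seg[i].length;
       return total;
   }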
3.8. RPC Call and Reply

The RDMA transport for RPC provides three methods of moving data between RPC client and server:

Inline
   Data are moved between RPC client and server within an RDMA Send.

RDMA Read
   Data are moved between RPC client and server via an RDMA Read operation, using a steering tag, address and offset obtained from a read chunk list.

RDMA Write
   Result data is moved from RPC server to client via an RDMA Write operation, using a steering tag, address and offset obtained from a write chunk list or reply chunk in the client's RPC call message.

These methods of data movement may occur in combinations within a single RPC. For instance, an RPC call may contain some inline data along with some large chunks to be transferred via RDMA Read to the server. The reply to that call may have some result chunks that the server RDMA Writes back to the client. The following protocol interactions illustrate RPC calls that use these methods to move RPC message data:

An RPC with write chunks in the call message looks like this:

         RPC Client                           RPC Server
             |   RPC Call + Write Chunk list  |
        Send | ------------------------------> |
             |                                 |
             |            Chunk 1              |
             | <------------------------------ | Write
             |               :                 |
             |            Chunk n              |
             | <------------------------------ | Write
             |                                 |
             |           RPC Reply             |
             | <------------------------------ | Send

In the presence of write chunks, RDMA ordering provides the guarantee that all data in the RDMA Write operations has been placed in memory prior to the client's RPC reply processing.

An RPC with read chunks in the call message looks like this:

         RPC Client                           RPC Server
             |   RPC Call + Read Chunk list   |
        Send | ------------------------------> |
             |                                 |
             |            Chunk 1              |
             | +------------------------------ | Read
             | v-----------------------------> |
             |               :                 |
             |            Chunk n              |
             | +------------------------------ | Read
             | v-----------------------------> |
             |                                 |
             |           RPC Reply             |
             | <------------------------------ | Send

And an RPC with read chunks in the reply message looks like this:

         RPC Client                           RPC Server
             |           RPC Call              |
        Send | ------------------------------> |
             |                                 |
             |   RPC Reply + Read Chunk list   |
             | <------------------------------ | Send
             |                                 |
             |            Chunk 1              |
        Read | ------------------------------+ |
             | <-----------------------------v |
             |               :                 |
             |            Chunk n              |
        Read | ------------------------------+ |
             | <-----------------------------v |
             |                                 |
             |             Done                |
        Send | ------------------------------> |

The final Done message allows the RPC client to signal the server that it has received the chunks, so the server can de-register and free the memory holding the chunks. A Done completion is not necessary for an RPC call, since the RPC reply Send is itself a receive completion notification. In the event that the client fails to return the Done message within some timeout period, the server may conclude that a protocol violation has occurred and close the RPC connection, or it may proceed to de-register and free its chunk buffers; the latter may result in a fatal RDMA error if the client later attempts the RDMA Read, which likewise terminates the connection.

The use of read chunks in RPC reply messages is much less efficient than providing write chunks in the originating RPC calls, due to the additional message exchanges, the need for the RPC server to advertise buffers to the peer, the necessity of the server maintaining a timer for the purpose of recovery from misbehaving clients, and the need for additional memory registration. Their use is not recommended for upper layers where efficiency is a primary concern [NFSDDP]. However, they may be employed by upper layer protocol bindings which are primarily concerned with transparency, since they can frequently be implemented completely within the RPC lower layers.

It is important to note that the Done message consumes a credit at the RPC server. The RPC server should provide sufficient credits to the client to allow the Done message to be sent without deadlock (driving the outstanding credit count to zero). The RPC client must account for its required Done messages to the server in its accounting of available credits, and the server should replenish any credit consumed by its use of such exchanges at its earliest opportunity.
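Because Done messages compete with calls for credits, a client might extend the credit accounting sketched in section 3.3 along the following lines; the names are again assumptions of this example.

   #include <stdint.h>
   #include <stdbool.h>

   struct rpcrdma_credits {
       uint32_t granted;
       uint32_t in_flight;
       uint32_t done_pending;   /* RDMA_DONE messages not yet sent */
   };

   /* A Done message consumes a credit just as a call does, so the
      client leaves room for its pending Done messages before issuing
      new calls. */
   static bool can_send_call(const struct rpcrdma_credits *c)
   {
       return c->in_flight + c->done_pending < c->granted;
   }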
Finally, it is possible to conceive of RPC exchanges that involve any or all combinations of write chunks in the RPC call, read chunks in the RPC call, and read chunks in the RPC reply. Support for such exchanges is straightforward from a protocol perspective, but in practice such exchanges would be quite rare, limited to upper layer protocol exchanges which transfer bulk data in both the call and corresponding reply.

4. RPC RDMA Message Layout

RPC call and reply messages are conveyed across an RDMA transport with a prepended RPC over RDMA header. The RPC over RDMA header includes data for RDMA flow control credits, padding parameters and lists of addresses that provide direct data placement via RDMA Read and Write operations. The layout of the RPC message itself is unchanged from that described in [RFC1831], except for the possible exclusion of large data chunks that will be moved by RDMA Read or Write operations. If the RPC message (along with the RPC over RDMA header) is too long for the posted receive buffer (even after any large chunks are removed), then the entire RPC message can be moved separately as a chunk, leaving just the RPC over RDMA header in the RDMA Send.

4.1. RPC over RDMA Header

The RPC over RDMA header begins with four 32-bit fields that are always present and which control the RDMA interaction, including RDMA-specific flow control. These are then followed by a number of items such as chunk lists and padding which may or may not be present depending on the type of transmission. The four fields which are always present are:

1. Transaction ID (XID).
   The XID generated for the RPC call and reply. Having the XID at the beginning of the message makes it easy to establish the message context. This XID mirrors the XID in the RPC header, and takes precedence. The receiver may ignore the XID in the RPC header, if it so chooses.

2. Version number.
   This version of the RPC RDMA message protocol is 1. The version number must be increased by one whenever the format of the RPC RDMA messages is changed.

3. Flow control credit value.
   When sent in an RPC call message, the requested value is provided. When sent in an RPC reply message, the granted value is returned. RPC calls must not be sent in excess of the currently granted limit.

4. Message type.

   o  RDMA_MSG = 0 indicates that chunk lists and the RPC message follow.

   o  RDMA_NOMSG = 1 indicates that after the chunk lists there is no RPC message. In this case, the chunk lists provide information to allow the message proper to be transferred using RDMA Read or Write, and thus it is not appended to the RPC over RDMA header.

   o  RDMA_MSGP = 2 indicates that a chunk list and RPC message with some padding follow.

   o  RDMA_DONE = 3 indicates that the message signals the completion of a chunk transfer via RDMA Read.

   o  RDMA_ERROR = 4 is used to signal any detected error(s) in the RPC RDMA chunk encoding.
Because the version number is encoded as part of this header, and the RDMA_ERROR message type is used to indicate errors, these first four fields and the start of the following message body must always remain aligned at these fixed offsets for all versions of the RPC over RDMA header.

For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write chunk lists follow. If the Read chunk list is null (a 32 bit word of zeros), then there are no chunks to be transferred separately and the RPC message follows in its entirety. If non-null, it is the beginning of an XDR encoded sequence of Read chunk list entries. If the Write chunk list is non-null, then an XDR encoded sequence of Write chunk entries follows.

If the message type is RDMA_MSGP, then two additional fields that specify the padding alignment and threshold are inserted prior to the Read and Write chunk lists.

A header of message type RDMA_MSG or RDMA_MSGP will be followed by the RPC call or RPC reply message body, beginning with the XID. The XID in the RDMA_MSG or RDMA_MSGP header must match this.

   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |    NULLs    | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------

Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or RPC message follows. As an implementation hint: a gather operation on the Send of the RDMA RPC message can be used to marshal the initial header, the chunk list, and the RPC message itself.

4.2. RPC over RDMA Header Errors

When a peer receives an RPC RDMA message, it must perform certain basic validity checks on the header and chunk contents. If errors are detected in an RPC request, an RDMA_ERROR reply should be generated.

Two types of errors are defined: version mismatch and invalid chunk format. When the peer detects an RPC over RDMA header version which it does not support (currently this draft defines only version 1), it replies with an error code of ERR_VERS, and provides the low and high inclusive version numbers it does, in fact, support. The version number in this reply can be any value otherwise valid at the receiver. When other decoding errors are detected in the header or chunks, either an RPC decode error may be returned, or the error code ERR_CHUNK.
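A minimal sketch of these checks, using the constants of section 4.3 (the function itself is an assumption of this example, and chunk-list validation leading to ERR_CHUNK is not shown):

   #include <stdint.h>

   #define RPCRDMA_VERSION 1
   enum { RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR };
   enum rpc_rdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };

   /* Basic validity checks a receiver applies to the fixed header
      fields before decoding further; returns 0 if acceptable. */
   static int check_rdma_header(uint32_t vers, uint32_t msg_type)
   {
       if (vers != RPCRDMA_VERSION)
           return ERR_VERS;   /* reply carries the supported low/high range */
       if (msg_type > RDMA_ERROR)
           return ERR_CHUNK;  /* undecodable header contents */
       return 0;
   }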
4.3. XDR Language Description

Here is the message layout in XDR language.

   struct xdr_rdma_segment {
      uint32 handle;     /* Registered memory handle */
      uint32 length;     /* Length of the chunk in bytes */
      uint64 offset;     /* Chunk virtual address or offset */
   };

   struct xdr_read_chunk {
      uint32 position;   /* Position in XDR stream */
      struct xdr_rdma_segment target;
   };

   struct xdr_read_list {
      struct xdr_read_chunk entry;
      struct xdr_read_list *next;
   };

   struct xdr_write_chunk {
      struct xdr_rdma_segment target<>;
   };

   struct xdr_write_list {
      struct xdr_write_chunk entry;
      struct xdr_write_list *next;
   };

   struct rdma_msg {
      uint32 rdma_xid;     /* Mirrors the RPC header xid */
      uint32 rdma_vers;    /* Version of this protocol */
      uint32 rdma_credit;  /* Buffers requested/granted */
      rdma_body rdma_body;
   };

   enum rdma_proc {
      RDMA_MSG=0,    /* An RPC call or reply msg */
      RDMA_NOMSG=1,  /* An RPC call or reply msg - separate body */
      RDMA_MSGP=2,   /* An RPC call or reply msg with padding */
      RDMA_DONE=3,   /* Client signals reply completion */
      RDMA_ERROR=4   /* An RPC RDMA encoding error */
   };

   union rdma_body switch (rdma_proc proc) {
      case RDMA_MSG:
         rpc_rdma_header rdma_msg;
      case RDMA_NOMSG:
         rpc_rdma_header_nomsg rdma_nomsg;
      case RDMA_MSGP:
         rpc_rdma_header_padded rdma_msgp;
      case RDMA_DONE:
         void;
      case RDMA_ERROR:
         rpc_rdma_error rdma_error;
   };

   struct rpc_rdma_header {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };

   struct rpc_rdma_header_nomsg {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
   };

   struct rpc_rdma_header_padded {
      uint32 rdma_align;   /* Padding alignment */
      uint32 rdma_thresh;  /* Padding threshold */
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };

   enum rpc_rdma_errcode {
      ERR_VERS = 1,
      ERR_CHUNK = 2
   };

   union rpc_rdma_error switch (rpc_rdma_errcode err) {
      case ERR_VERS:
         uint32 rdma_vers_low;
         uint32 rdma_vers_high;
      case ERR_CHUNK:
         void;
      default:
         uint32 rdma_extra[8];
   };

5. Long Messages

The receiver of RDMA Send messages is required to have previously posted one or more adequately sized buffers. The RPC client can inform the server of the maximum size of its RDMA Send messages via the Connection Configuration Protocol described later in this document.

Since RPC messages are frequently small, memory savings can be achieved by posting small buffers. Even large messages like NFS READ or WRITE will be quite small once the chunks are removed from the message. However, there may be large messages that would demand a very large buffer be posted, where the contents of the buffer may not be a chunkable XDR element. A good example is an NFS READDIR reply, which may contain a large number of small filename strings. Also, the NFS version 4 protocol [RFC3530] features COMPOUND request and reply messages of unbounded length.

Ideally, each upper layer will negotiate these limits. However, it is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk

One relatively simple method is to have the client identify any RPC message that exceeds the RPC server's posted buffer size and move it separately as a chunk, i.e. reference it as the first entry in the read chunk list with an XDR position of zero.

Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------

Long Message

   +--------+---------+---------+------------+-------------+
   |        |         |         |            |             |
   |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
   |        |         |         |            |             |
   +--------+---------+---------+------------+-------------+
                                                    |
                                                    |   +----------
                                                    |   | Long RPC Call
                                                    +-->|     or
                                                        | Reply Message
                                                        +----------

If the receiver gets an RPC over RDMA header with a message type of RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR position, it allocates a registered buffer and issues an RDMA Read of the long RPC message into it. The receiver then proceeds to XDR decode the RPC message as if it had received it inline with the Send data. Further decoding may issue additional RDMA Reads to bring over additional chunks.

Although the handling of long messages requires one extra network turnaround, in practice these messages should be rare if the posted receive buffers are correctly sized, and of course they will be non-existent for RDMA-aware upper layers.

An RPC with a long reply returned via RDMA Read looks like this:

         RPC Client                           RPC Server
             |           RPC Call              |
        Send | ------------------------------> |
             |                                 |
             |      RPC over RDMA Header       |
             | <------------------------------ | Send
             |                                 |
             |       Long RPC Reply Msg        |
        Read | ------------------------------+ |
             | <-----------------------------v |
             |                                 |
             |             Done                |
        Send | ------------------------------> |
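A sender-side sketch of this decision, using the chunk entry fields of section 3.4 (register_memory is an assumed helper, not defined by this protocol):

   #include <stdint.h>
   #include <stddef.h>

   struct read_chunk {
       uint32_t position;
       uint32_t handle;
       uint64_t offset;
       uint32_t length;
   };

   extern uint32_t register_memory(const void *addr, size_t len);

   /* If the encoded RPC message exceeds the peer's posted buffer size,
      reference the whole message as the first read chunk list entry
      with XDR position zero, and send only the RPC over RDMA header
      (message type RDMA_NOMSG). Returns 1 when the long-message form
      is required, 0 when the message fits inline (RDMA_MSG). */
   static int prepare_long_message(const void *msg, size_t len,
                                   size_t peer_recv_size,
                                   struct read_chunk *rc)
   {
       if (len <= peer_recv_size)
           return 0;
       rc->position = 0;             /* zero XDR position marks it */
       rc->handle = register_memory(msg, len);
       rc->offset = (uint64_t)(uintptr_t)msg;
       rc->length = (uint32_t)len;
       return 1;
   }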
5.2. RDMA Write of Long Replies (Reply Chunks)

A superior method of handling long RPC replies is to have the RPC client post a large buffer into which the server can write a large RPC reply. This has the advantage that an RDMA Write may be slightly faster in network latency than an RDMA Read. Additionally, it removes the need for the RDMA_DONE message that is required when a large reply is returned as a Read chunk.

This protocol supports direct return of a large reply via the inclusion of an optional rdma_reply write chunk after the read chunk list and the write chunk list. The client allocates a buffer sized to receive a large reply and enters its steering tag, address and length in the rdma_reply write chunk. If the reply message is too long to return inline with an RDMA Send (exceeds the size of the client's posted receive buffer), even with read chunks removed, then the RPC server performs an RDMA Write of the RPC reply message into the buffer indicated by the rdma_reply chunk. If the client doesn't provide an rdma_reply chunk, or if it's too small, then the message must be returned as a Read chunk.

An RPC with a long reply returned via RDMA Write looks like this:

         RPC Client                           RPC Server
             |    RPC Call with rdma_reply     |
        Send | ------------------------------> |
             |                                 |
             |       Long RPC Reply Msg        |
             | <------------------------------ | Write
             |                                 |
             |      RPC over RDMA Header       |
             | <------------------------------ | Send

The use of RDMA Write to return long replies requires that the client application anticipate a long reply and have some knowledge of its size, so that an adequately sized buffer can be allocated. This is certainly true of NFS READDIR replies, where the client already provides an upper bound on the size of the encoded directory fragment to be returned by the server.

The use of these "reply chunks" is highly efficient and convenient for both RPC client and server. Their use is encouraged for eligible RPC operations such as NFS READDIR, which would otherwise require extensive chunk management within the results or use of RDMA Read and a Done message [NFSDDP].
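Pulling sections 5.1 and 5.2 together, a server's choice of reply method might be sketched as follows; the names, and the zero-size convention for an absent reply chunk, are assumptions of this example.

   #include <stddef.h>

   enum reply_method {
       REPLY_INLINE,       /* RDMA Send into the client's posted buffer */
       REPLY_WRITE_CHUNK,  /* RDMA Write into the client's rdma_reply chunk */
       REPLY_READ_CHUNK    /* expose reply; client pulls via RDMA Read,
                              followed by an RDMA_DONE */
   };

   /* Prefer inline, then the client-provided reply chunk, falling back
      to a read chunk (which costs an extra turnaround and a Done). */
   static enum reply_method choose_reply_method(size_t reply_len,
                                                size_t client_recv_size,
                                                size_t reply_chunk_size)
   {
       if (reply_len <= client_recv_size)
           return REPLY_INLINE;
       if (reply_len <= reply_chunk_size)   /* zero if none provided */
           return REPLY_WRITE_CHUNK;
       return REPLY_READ_CHUNK;
   }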
6. Connection Configuration Protocol

RDMA Send operations require the receiver to post one or more buffers at the RDMA connection endpoint, each large enough to receive the largest Send message. Buffers are consumed as Send messages are received. If a buffer is too small, or if there are no buffers posted, the RDMA transport may return an error and break the RDMA connection. The receiver must post sufficient, adequately sized buffers to avoid buffer overrun or capacity errors.

The protocol described above includes only a mechanism for managing the number of such receive buffers, and no explicit features to allow the RPC client and server to provision or control buffer sizing, nor any other session parameters.

In the past, this type of connection management has not been necessary for RPC. RPC over UDP or TCP does not have a protocol to negotiate the link. The server can get a rough idea of the maximum size of messages from the server protocol code. However, a protocol to negotiate transport features on a more dynamic basis is desirable.

The Connection Configuration Protocol allows the client to pass its connection requirements to the server, and allows the server to inform the client of its connection limits.

6.1. Initial Connection State

This protocol will be used for connection setup prior to the use of another RPC protocol that uses the RDMA transport. It operates in-band, i.e. it uses the connection itself to negotiate the connection parameters. To provide a basis for connection negotiation, the connection is assumed to provide a basic level of interoperability: the ability to exchange at least one RPC message at a time that is at least 1 KB in size. The server may exceed this basic level of configuration, but the client must not assume it.

6.2. Protocol Description

Version 1 of the Connection Configuration Protocol consists of a single procedure that allows the client to inform the server of its connection requirements and the server to return connection information to the client.

The maxcall_sendsize argument is the maximum size of an RPC call message that the client will send inline in an RDMA Send message to the server. The server may return a maxcall_sendsize value that is smaller or larger than the client's request. The client must not send an inline call message larger than what the server will accept. The maxcall_sendsize limits only the size of inline RPC calls; it does not limit the size of long RPC messages transferred as an initial chunk in the Read chunk list.

The maxreply_sendsize is the maximum size of an inline RPC message that the client will accept from the server.

The maxrdmaread is the maximum number of RDMA Reads which may be active at the peer. This number correlates to the incoming RDMA Read count ("IRD") configured into each originating endpoint by the client or server. If more than this number of RDMA Read operations by the connected peer are issued simultaneously, connection loss or suboptimal flow control may result; therefore the value should be observed at all times. The peers' values need not be equal. If zero, the peer must not issue requests which require RDMA Read to satisfy, as no transfer will be possible.

The align value is the value recommended by the server for opaque data values such as strings and counted byte arrays. The client can use this value to compute the number of prepended pad bytes when XDR encoding opaque values in the RPC call message.

   typedef unsigned int uint32;

   struct config_rdma_req {
      uint32 maxcall_sendsize;   /* max size of inline RPC call */
      uint32 maxreply_sendsize;  /* max size of inline RPC reply */
      uint32 maxrdmaread;        /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
      uint32 maxcall_sendsize;   /* max call size accepted by server */
      uint32 align;              /* server's receive buffer alignment */
      uint32 maxrdmaread;        /* max active RDMA Reads at server */
   };

   program CONFIG_RDMA_PROG {
      version VERS1 {
         /*
          * Config call/reply
          */
         config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
      } = 1;
   } = nnnnnn;   <-- Need program number assigned

7. Memory Registration Overhead

RDMA requires that all data be transferred between registered memory regions at the source and destination. All protocol headers as well as separately transferred data chunks must use registered memory. Since the cost of registering and de-registering memory can be a large proportion of the RDMA transaction cost, it is important to minimize registration activity. This is easily achieved within RPC-controlled memory by allocating chunk list data and RPC headers in a reusable way from pre-registered pools.

The data chunks transferred via RDMA may occupy memory that persists outside the bounds of the RPC transaction. Hence, the default behavior of an RPC over RDMA transport is to register and de-register these chunks on every transaction. However, this is not a limitation of the protocol - only of the existing local RPC API. The API is easily extended through such functions as rpc_control(3) to change the default behavior, so that the application can assume responsibility for controlling memory registration through an RPC-provided registered memory allocator.
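Such a pre-registered pool can be sketched as a simple free list; all names here are hypothetical, and registration at pool initialization is assumed to have already produced the steering tags.

   #include <stdint.h>

   struct reg_buf {
       void    *addr;
       uint32_t stag;           /* steering tag from one-time registration */
       struct reg_buf *next;
   };

   struct reg_pool {
       struct reg_buf *free_list;   /* buffers registered at pool init */
   };

   /* Allocation reuses an already-registered buffer, avoiding the
      per-transaction register/de-register cost. */
   static struct reg_buf *reg_alloc(struct reg_pool *p)
   {
       struct reg_buf *b = p->free_list;
       if (b)
           p->free_list = b->next;
       return b;                    /* NULL if the pool is exhausted */
   }

   static void reg_free(struct reg_pool *p, struct reg_buf *b)
   {
       b->next = p->free_list;
       p->free_list = b;
   }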
8.  Errors and Error Recovery

RPC RDMA protocol errors are described in section 4.  RPC errors
and RPC error recovery are not affected by the protocol, and
proceed as for any RPC error condition.  RDMA transport error
reporting and recovery are outside the scope of this protocol.

It is assumed that the link itself will provide some degree of
error detection and retransmission.  iWARP's MPA layer (when used
over TCP), SCTP, and the InfiniBand link layer all provide CRC
protection of the RDMA payload, and CRC-class protection is a
general attribute of such transports.  Additionally, the RPC layer
itself can accept errors from the link level and recover via
retransmission.  RPC recovery can handle complete loss and
re-establishment of the link.

See section 11 for further discussion of the use of RPC-level
integrity schemes to detect errors, and related efficiency issues.

9.  Node Addressing

In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server.  The
mechanism used to obtain this address, and to open an RDMA
connection, is dependent on the type of RDMA transport and is the
responsibility of each RPC protocol binding and its local
implementation.

10.  RPC Binding

RPC services normally register with a portmap or rpcbind service,
which associates an RPC program number with a service address.  In
the case of UDP or TCP, the service address for NFS is normally
port 2049.  This policy should be no different with RDMA
interconnects.

One possibility is to have the server's portmapper register itself
on the RDMA interconnect at a "well known" service address.  On UDP
or TCP, this corresponds to port 111.  A client could connect to
this service address and use the portmap protocol to obtain a
service address in response to a program number, e.g., an iWARP
port number or an InfiniBand GID.
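To make the registration model concrete, the following TI-RPC
sketch registers an NFS service address under an RDMA netid.  The
netid string "rdma" is an assumption (this document says only that
a new netid should be assigned), and the address encoding carried
in the netbuf is provider-specific.

      #include <rpc/rpc.h>    /* rpcb_set(), struct netbuf */
      #include <netconfig.h>  /* getnetconfigent()         */

      #define NFS_PROGRAM 100003
      #define NFS_V4      4

      /*
       * Register NFSv4 with rpcbind over an RDMA transport.
       * 'svcaddr' must already hold the server's transport-
       * specific address (e.g., an iWARP port or an InfiniBand
       * GID) in the provider's encoding.
       */
      static bool_t
      register_nfs_rdma(struct netbuf *svcaddr)
      {
          struct netconfig *nconf = getnetconfigent("rdma");

          if (nconf == NULL)
              return FALSE;  /* netid not configured on this host */
          return rpcb_set(NFS_PROGRAM, NFS_V4, nconf, svcaddr);
      }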
11.  Security

ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203].  RPCSEC_GSS can provide message authentication,
integrity checking, and privacy.  This security mechanism is
unaffected by the RDMA transport.  The data integrity and privacy
features alter the body of the message, presenting it as a single
chunk.  For large messages the chunk may be large enough to qualify
for RDMA Read transfer.  There is, however, much data movement
associated with computation and verification of integrity, or with
encryption/decryption, so certain performance advantages may be
lost.

For efficiency, a more appropriate security mechanism for RDMA
links may be link-level protection, such as IPsec, which may be
co-located in the RDMA hardware.  The use of link-level protection
may be negotiated through the use of a new RPCSEC_GSS mechanism
like the Credential Cache GSS Mechanism [CCM].  Use of such
mechanisms is recommended where end-to-end integrity and/or privacy
is desired, and where efficiency is required.

There are no new issues here with exposed addresses.  The only
exposed addresses are those in the chunk lists and in the transport
packets transferred via RDMA.  The data contained at these
addresses continues to be protected by RPCSEC_GSS integrity and
privacy.

12.  IANA Considerations

As a new RPC transport, this protocol should have no effect on RPC
program numbers or registered port numbers.  The new RPC transport
should be assigned a new RPC "netid".  If adopted, the Connection
Configuration protocol described herein will require an RPC program
number assignment.

13.  Acknowledgements

The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
contributions to this document.

14.  Normative References

[RFC1831]
     R. Srinivasan, "RPC: Remote Procedure Call Protocol
     Specification Version 2", Standards Track RFC,
     http://www.ietf.org/rfc/rfc1831.txt

[RFC1832]
     R. Srinivasan, "XDR: External Data Representation Standard",
     Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

[RFC1813]
     B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
     Protocol Specification", Informational RFC,
     http://www.ietf.org/rfc/rfc1813.txt

[RFC3530]
     S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
     M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards
     Track RFC, http://www.ietf.org/rfc/rfc3530.txt

[RFC2203]
     M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
     Specification", Standards Track RFC,
     http://www.ietf.org/rfc/rfc2203.txt

15.  Informative References

[RDMA]
     R. Recio et al., "An RDMA Protocol Specification", Internet
     Draft Work in Progress, draft-ietf-rddp-rdmap

[CCM]
     M. Eisler, N. Williams, "CCM: The Credential Cache GSS
     Mechanism", Internet Draft Work in Progress,
     draft-ietf-nfsv4-ccm

[NFSDDP]
     B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet
     Draft Work in Progress, draft-ietf-nfsv4-nfsdirect

[RDDP]
     Remote Direct Data Placement Working Group Charter,
     http://www.ietf.org/html.charters/rddp-charter.html

[NFSRDMAPS]
     T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
     Draft Work in Progress,
     draft-ietf-nfsv4-nfs-rdma-problem-statement

[NFSSESS]
     T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions",
     Internet Draft Work in Progress, draft-ietf-nfsv4-nfs-sess

[IB]
     Infiniband Architecture Specification,
     http://www.infinibandta.org

16.  Authors' Addresses

Brent Callaghan
1614 Montalto Dr.
Mountain View, California 94040 USA

Phone: +1 650 968 2333
EMail: brent.callaghan@gmail.com

Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com

17.  Intellectual Property and Copyright Statements

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005).  This document is
subject to the rights, licenses and restrictions contained in BCP
78, and except as set forth therein, the authors retain all their
rights.

Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.