Internet-Draft                                               Tom Talpey
Expires: December 2006                                  Brent Callaghan
Document: draft-ietf-nfsv4-rpcrdma-03                        June, 2006

                      RDMA Transport for ONC RPC

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   A protocol is described providing RDMA as a new transport for ONC
   RPC.  The RDMA transport binding conveys the benefits of efficient,
   bulk data transport over high speed networks, while providing for
   minimal change to RPC applications and with no required revision of
   the application RPC protocol, or the RPC protocol itself.

Table of Contents

   1.  Introduction
   2.  Abstract RDMA Requirements
   3.  Protocol Outline
   3.1.  Short Messages
   3.2.  Data Chunks
   3.3.  Flow Control
   3.4.  XDR Encoding with Chunks
   3.5.  Padding
   3.6.  XDR Decoding with Read Chunks
   3.7.  XDR Decoding with Write Chunks
   3.8.  RPC Call and Reply
   4.  RPC RDMA Message Layout
   4.1.  RPC over RDMA Header
   4.2.  RPC over RDMA header errors
   4.3.  XDR Language Description
   5.  Long Messages
   5.1.  Message as an RDMA Read Chunk
   5.2.  RDMA Write of Long Replies (Reply Chunks)
   6.  Connection Configuration Protocol
   6.1.  Initial Connection State
   6.2.  Protocol Description
   7.  Memory Registration Overhead
   8.  Errors and Error Recovery
   9.  Node Addressing
   10.  RPC Binding
   11.  Security
   12.  IANA Considerations
   13.  Acknowledgements
   14.  Normative References
   15.  Informative References
   16.  Authors' Addresses
   17.  Intellectual Property and Copyright Statements
   Acknowledgement

1.  Introduction

   RDMA is a technique for efficient movement of data between end
   nodes, which becomes increasingly compelling over high speed
   transports.  By directing data into destination buffers as it is
   sent on a network, and placing it via direct memory access by
   hardware, the double benefit of faster transfers and reduced host
   overhead is obtained.

   ONC RPC [RFC1831] is a remote procedure call protocol that has been
   run over a variety of transports.  Most RPC implementations today
   use UDP or TCP.  RPC messages are defined in terms of an eXternal
   Data Representation (XDR) [RFC1832] which provides a canonical data
   representation across a variety of host architectures.  An XDR data
   stream is conveyed differently on each type of transport.  On UDP,
   RPC messages are encapsulated inside datagrams, while on a TCP byte
   stream, RPC messages are delineated by a record marking protocol.
   An RDMA transport also conveys RPC messages in a unique fashion
   that must be fully described if client and server implementations
   are to interoperate.

   RDMA transports present new semantics unlike the behaviors of
   either UDP or TCP alone.  They retain message delineations like
   UDP while also providing a reliable, sequenced data transfer like
   TCP.  And, they provide the new efficient, bulk transfer service of
   RDMA.  RDMA transports are therefore naturally viewed as a new
   transport type by ONC RPC.

   RDMA as a transport will benefit the performance of RPC protocols
   that move large "chunks" of data, since RDMA hardware excels at
   moving data efficiently between host memory and a high speed
   network with little or no host CPU involvement.  In this context,
   the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530]
   [NFSv4.1], is an obvious beneficiary of RDMA.  A complete problem
   statement is discussed in [NFSRDMAPS], and related NFSv4 issues are
   discussed in [NFSv4.1].  Many other RPC-based protocols will also
   benefit.

   Although the RDMA transport described here provides relatively
   transparent support for any RPC application, the proposal goes
   further in describing mechanisms that can optimize the use of RDMA
   with more active participation by the RPC application.

2.  Abstract RDMA Requirements

   An RPC transport is responsible for conveying an RPC message from a
   sender to a receiver.  An RPC message is either an RPC call from a
   client to a server, or an RPC reply from the server back to the
   client.  An RPC message contains an RPC call header followed by
   arguments if the message is an RPC call, or an RPC reply header
   followed by results if the message is an RPC reply.  The call
   header contains a transaction ID (XID) followed by the program and
   procedure number as well as a security credential.  An RPC reply
   header begins with an XID that matches that of the RPC call
   message, followed by a security verifier and results.  All data in
   an RPC message is XDR encoded.  For a complete description of the
   RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].

   This protocol assumes the following abstract model for RDMA
   transports.
   These terms, common in the RDMA lexicon, are used in
   this document.  A more complete glossary of RDMA terms can be found
   in [RDMAP].

   o  Registered Memory
      All data moved via tagged RDMA operations must be resident in
      registered memory at its destination.  This protocol assumes
      that each segment of registered memory may be identified with
      a steering tag of no more than 32 bits and memory addresses of
      up to 64 bits in length.

   o  RDMA Send
      The RDMA provider supports an RDMA Send operation with
      completion signalled at the receiver when data is placed in a
      pre-posted buffer.  The amount of transferred data is limited
      only by the size of the receiver's buffer.  Sends complete at
      the receiver in the order they were issued at the sender.

   o  RDMA Write
      The RDMA provider supports an RDMA Write operation to directly
      place data in the receiver's buffer.  An RDMA Write is
      initiated by the sender and completion is signalled at the
      sender.  No completion is signalled at the receiver.  The
      sender uses a steering tag, memory address and length of the
      remote destination buffer.  RDMA Writes are not necessarily
      ordered with respect to one another, but are ordered with
      respect to RDMA Sends; a subsequent RDMA Send completion must
      be obtained at the receiver to notify that prior RDMA Write
      data has been successfully placed in the receiver's memory.

   o  RDMA Read
      The RDMA provider supports an RDMA Read operation to directly
      place peer source data in the requester's buffer.  An RDMA
      Read is initiated by the receiver and completion is signalled
      at the receiver.  The receiver provides steering tags, memory
      addresses and a length for the remote source and local
      destination buffers.  Since the peer at the data source
      receives no notification of RDMA Read completion, there is an
      assumption that on receiving the data the receiver will signal
      completion with an RDMA Send message, so that the peer can
      free the source buffers and the associated steering tags.

   This protocol is designed to be carried over all RDMA transports
   meeting the stated requirements.  This protocol conveys to the RPC
   peer information sufficient for that RPC peer to direct an RDMA
   layer to perform transfers containing RPC data, and to communicate
   their result(s).  For example, it is readily carried over RDMA
   transports such as iWARP [RDDP] or Infiniband [IB].
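
   To make the abstract model concrete, the following is a minimal
   sketch, in C, of the operations an RDMA provider is assumed to
   supply.  The names (rdma_send, rdma_write, rdma_read, struct
   rdma_segment) are hypothetical illustrations only; they are not
   part of this protocol nor of any particular RDMA API.

      /*
       * A hypothetical provider interface capturing only the
       * abstract requirements above; real APIs differ in detail.
       */
      #include <stdint.h>
      #include <stddef.h>

      struct rdma_segment {
          uint32_t handle;   /* steering tag (at most 32 bits) */
          uint32_t length;   /* length in bytes */
          uint64_t offset;   /* memory address (up to 64 bits) */
      };

      /* Send: completes at the receiver into a pre-posted buffer. */
      int rdma_send(void *conn, const void *buf, size_t len);

      /* Write: sender places data directly at the remote segment;
       * no completion is signalled at the receiver. */
      int rdma_write(void *conn, const struct rdma_segment *dst,
                     const void *src, size_t len);

      /* Read: receiver pulls data from the remote segment into a
       * local registered buffer; completes at the requester. */
      int rdma_read(void *conn, const struct rdma_segment *src,
                    void *dst, size_t len);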

3.  Protocol Outline

   An RPC message can be conveyed in identical fashion, whether it is
   a call or reply message.  In each case, the transmission of the
   message proper is preceded by transmission of a transport-specific
   header for use by RPC over RDMA transports.  This header is
   analogous to the record marking used for RPC over TCP, but is more
   extensive, since RDMA transports support several modes of data
   transfer and it is important to allow the client and server to use
   the most efficient mode for any given transfer.  Multiple segments
   of a message may be transferred in different ways to different
   remote memory destinations.

   All transfers of a call or reply begin with an RDMA Send which
   transfers at least the RPC over RDMA header, usually with the call
   or reply message appended, or at least some part thereof.  Because
   the size of what may be transmitted via RDMA Send is limited by the
   size of the receiver's pre-posted buffer, the RPC over RDMA
   transport provides a number of methods to reduce the amount
   transferred by means of the RDMA Send, when necessary, by
   transferring various parts of the message using RDMA Read and RDMA
   Write.

3.1.  Short Messages

   Many RPC messages are quite short.  For example, the NFS version 3
   GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
   byte filehandle argument and 4 bytes of length.  The reply to this
   common request is about 100 bytes.

   There is no benefit in transferring such small messages with an
   RDMA Read or Write operation.  The overhead in transferring
   steering tags and memory addresses is justified only by large
   transfers.  The critical message size that justifies RDMA transfer
   will vary depending on the RDMA implementation and network, but is
   typically of the order of a few kilobytes.  It is appropriate to
   transfer a short message with an RDMA Send to a pre-posted buffer.
   The RPC over RDMA header with the short message (call or reply)
   immediately following is transferred using a single RDMA Send
   operation.

   Short RPC messages over an RDMA transport will look like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

3.2.  Data Chunks

   Some protocols, like NFS, have RPC procedures that can transfer
   very large "chunks" of data in the RPC call or reply and would
   cause the maximum send size to be exceeded if one tried to transfer
   them as part of the RDMA Send.  These large chunks typically range
   from a kilobyte to a megabyte or more.  An RDMA transport can
   transfer large chunks of data more efficiently via the direct
   placement of an RDMA Read or RDMA Write operation.  Using direct
   placement instead of inline transfer not only avoids expensive data
   copies, but provides correct data alignment at the destination.

3.3.  Flow Control

   It is critical to provide RDMA Send flow control for an RDMA
   connection.  RDMA receive operations will fail if a pre-posted
   receive buffer is not available to accept an incoming RDMA Send,
   and repeated occurrences of such errors can be fatal to the
   connection.  This is a departure from conventional TCP/IP
   networking where buffers are allocated dynamically on an as-needed
   basis, and pre-posting is not required.

   It is not practical to provide for fixed credit limits at the RPC
   server.  Fixed limits scale poorly, since posted buffers are
   dedicated to the associated connection until consumed by receive
   operations.  Additionally for protocol correctness, the RPC server
   must always be able to reply to client requests, whether or not new
   buffers have been posted to accept future receives.  (Note that the
   RPC server may in fact be a client at some other layer.  For
   example, NFSv4 callbacks are processed by the NFSv4 client, acting
   as an RPC server.  The credit discussions apply equally in either
   case.)

   Flow control for RDMA Send operations is implemented as a simple
   request/grant protocol in the RPC over RDMA header associated with
   each RPC message.
   The RPC over RDMA header for RPC call messages
   contains a requested credit value for the RPC server, which may be
   dynamically adjusted by the caller to match its expected needs.
   The RPC over RDMA header for RPC reply messages provides the
   granted result, which may be any value except that it must not be
   zero when no operations are in progress at the server, since such
   a value would result in deadlock.  The value may be adjusted up or
   down at each opportunity to match the server's needs or policies.

   The RPC client must not send unacknowledged requests in excess of
   this granted RPC server credit limit.  If the limit is exceeded,
   the RDMA layer may signal an error, possibly terminating the
   connection.  Even if an error does not occur, there is no
   requirement that the server must handle the excess request(s), and
   it may return an RPC error to the client.  Also note that the
   never-zero requirement implies that an RPC server must always
   provide at least one credit to each connected RPC client.  It does
   not however require that the server must always be prepared to
   receive a request from each client, for example when the server is
   busy processing all granted client requests.

   While RPC calls may complete in any order, the current flow control
   limit at the RPC server is known to the RPC client from the Send
   ordering properties.  It is always the most recent server-granted
   credit value minus the number of requests in flight.

   Certain RDMA implementations may impose additional flow control
   restrictions, such as limits on RDMA Read operations in progress at
   the responder.  Because these operations are outside the scope of
   this protocol, they are not addressed and must be provided for by
   other layers.  For example, a simple upper layer RPC consumer might
   perform single-issue RDMA Read requests, while a more
   sophisticated, multithreaded RPC consumer may implement its own
   FIFO queue of such operations.  For further discussion of possible
   protocol implementations capable of negotiating these values, see
   section 6 "Connection Configuration Protocol" of this draft, or
   [NFSv4.1].
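
   As an illustration of the credit rule above, the following is a
   minimal client-side sketch in C; the structure and function names
   are hypothetical and not part of the protocol.

      #include <stdint.h>
      #include <stdbool.h>

      /* Hypothetical per-connection credit state. */
      struct rpcrdma_credits {
          uint32_t granted;    /* most recent server-granted value */
          uint32_t in_flight;  /* unacknowledged requests pending */
      };

      /* A call may be sent only while in-flight requests are below
       * the grant; the grant is refreshed from each reply header. */
      static bool can_send_call(const struct rpcrdma_credits *c)
      {
          return c->in_flight < c->granted;
      }

      static void on_reply_received(struct rpcrdma_credits *c,
                                    uint32_t rdma_credit_from_header)
      {
          c->in_flight--;                        /* one completed */
          c->granted = rdma_credit_from_header;  /* server adjusts */
      }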

3.4.  XDR Encoding with Chunks

   The data comprising an RPC call or reply message is marshaled or
   serialized into a contiguous stream by an XDR routine.  XDR data
   types such as integers, strings, arrays and linked lists are
   commonly implemented over two very simple functions that encode
   either an XDR data unit (32 bits) or an array of bytes.

   Normally, the separate data items in an RPC call or reply are
   encoded as a contiguous sequence of bytes for network transmission
   over UDP or TCP.  However, in the case of an RDMA transport, local
   routines such as XDR encode can determine that (for instance) an
   opaque byte array is large enough to be more efficiently moved via
   an RDMA data transfer operation like RDMA Read or RDMA Write.

   Semantically speaking, the protocol has no restriction regarding
   data types which may or may not be represented by a read or write
   chunk.  In practice however, efficiency considerations lead to the
   conclusion that certain data types are not generally "chunkable".
   Typically, only opaque and aggregate data types which may attain
   substantial size are considered to be eligible.  With today's
   hardware this size may be a kilobyte or more.  However any object
   may be chosen for chunking in any given message.

   The eligibility of XDR data items to be candidates for being moved
   as data chunks (as opposed to being marshalled inline) is not
   specified by the RPC over RDMA protocol.  Chunk eligibility
   criteria must be determined by each upper layer in order to provide
   for an interoperable specification.  One such example with
   rationale, for the NFS protocol family, is provided in [NFSDDP].

   The interface by which an upper layer implementation communicates
   the eligibility of a data item locally to RPC for chunking is out
   of scope for this specification.  In many implementations, it is
   possible to implement a transparent RPC chunking facility.
   However, such implementations may lead to inefficiencies, either
   because they require the RPC layer to perform expensive
   registration and deregistration of memory "on the fly", or they may
   require using RDMA chunks in reply messages, along with the
   resulting additional handshaking with the RPC over RDMA peer.
   However, these issues are internal and generally confined to the
   local interface between RPC and its upper layers, one in which
   implementations are free to innovate.  The only requirement is that
   the resulting RPC RDMA protocol sent to the peer is valid for the
   upper layer.  See for example [NFSDDP].

   When sending any message (request or reply) that contains an
   eligible large data chunk, the XDR encoding routine avoids moving
   the data into the XDR stream.  Instead, it does not encode the data
   portion, but records the address and size of each chunk in a
   separate "read chunk list" encoded within RPC RDMA transport-
   specific headers.  Such chunks will be transferred via RDMA Read
   operations initiated by the receiver.

   When the read chunks are to be moved via RDMA, the memory for each
   chunk must be registered.  This registration may take place within
   XDR itself, providing for full transparency to upper layers, or it
   may be performed by any other specific local implementation.

   Additionally, when making an RPC call that can result in bulk data
   transferred in the reply, it is desirable to provide chunks to
   accept the data directly via RDMA Write.  These write chunks will
   therefore be pre-filled by the RPC server prior to responding, and
   XDR decode at the client will not be required.  These chunks
   undergo a similar registration and advertisement via "write chunk
   lists" built as a part of XDR encoding.

   Some RPC client implementations are not able to determine where an
   RPC call's results reside during the "encode" phase.  This makes it
   difficult or impossible for the RPC client layer to encode the
   write chunk list at the time of building the request.  In this
   case, it is difficult for the RPC implementation to provide
   transparency to the RPC consumer, which may require recoding to
   provide result information at this earlier stage.

   Therefore if the RPC client does not make a write chunk list
   available to receive the result, then the RPC server must return
   data inline in the reply, or if it so chooses, via a read chunk
   list.  RPC clients are discouraged from omitting write chunk lists
   for eligible replies, due to the lower performance of the
   additional handshaking to perform data transfer, and the
   requirement that the RPC server must expose (and preserve) the
   reply data for a period of time.
   In the absence of a server-
   provided read chunk list in the reply, if the encoded reply
   overflows the posted receive buffer, the RPC will fail.

   When any data within a message is provided via either read or write
   chunks, the chunk itself refers only to the data portion of the XDR
   stream element.  In particular, for counted fields (e.g. a "<>"
   encoding) the byte count which is encoded as part of the field
   remains in the XDR stream, and is also encoded in the chunk list.
   The data portion is however elided from the encoded XDR stream, and
   is transferred as part of chunk list processing.  This is important
   to maintain upper layer implementation compatibility - both the
   count and the data must be transferred as part of the logical XDR
   stream.  While the chunk list processing results in the data being
   available to the upper layer peer for XDR decoding, the length
   present in the chunk list entries is not.  Any byte count in the
   XDR stream must match the sum of the byte counts present in the
   corresponding read or write chunk list.  If they do not agree, an
   RPC protocol encoding error results.
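
   For illustration, a receiver might check this agreement as in the
   following minimal sketch in C; the structure and function names
   are hypothetical.

      #include <stdint.h>

      /* Hypothetical decoded chunk entry (see the list items
       * that follow). */
      struct chunk_entry {
          uint32_t handle;
          uint32_t length;
          uint64_t offset;
      };

      /* The byte count decoded from the XDR stream for a counted
       * field must equal the sum of the lengths of its chunks. */
      static int check_chunk_count(uint32_t xdr_count,
                                   const struct chunk_entry *chunks,
                                   unsigned nchunks)
      {
          uint64_t sum = 0;
          for (unsigned i = 0; i < nchunks; i++)
              sum += chunks[i].length;
          return sum == xdr_count ? 0 : -1;  /* -1: encoding error */
      }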

   The following items are contained in a chunk list entry.

   Handle
      Steering tag or handle obtained when the chunk memory is
      registered for RDMA.

   Length
      The length of the chunk in bytes.

   Offset
      The offset or beginning memory address of the chunk.  In order
      to support the widest array of RDMA implementations, as well
      as the most general steering tag scheme, this field is
      unconditionally included in each chunk list entry.

      While zero-based offset schemes are available in many RDMA
      implementations, their use by RPC requires individual
      registration of each read or write chunk.  On many such
      implementations this can be a significant overhead.  By
      providing an offset in each chunk, many pre-registration or
      region-based registrations can be readily supported, and by
      using a single, universal chunk representation, the RPC RDMA
      protocol implementation is simplified to its most general
      form.

   Position
      For data which is to be encoded, the position in the XDR
      stream where the chunk would normally reside.  Note that the
      chunk therefore inserts its data into the XDR stream at this
      position, but its transfer is no longer "inline".  Also note
      it is possible that a contiguous sequence of chunks might all
      have the same position.  For data which is to be decoded, no
      "position" is used.

   When XDR marshaling is complete, the chunk list is XDR encoded,
   then sent to the receiver prepended to the RPC message.  Any source
   data for a read chunk, or the destination of a write chunk, remain
   behind in the sender's registered memory and their actual payload
   is not marshalled into the request or reply.

      +----------------+----------------+-------------
      |  RPC over RDMA |                |
      |    header w/   |   RPC Header   | Non-chunk args/results
      |     chunks     |                |
      +----------------+----------------+-------------

   Read chunk lists and write chunk lists are structured somewhat
   differently.  This is due to the different usage - read chunks are
   decoded and indexed by their position in the XDR data stream, their
   size is always known, and may be used for both arguments and
   results.  Write chunks on the other hand are used only for results,
   and have neither a preassigned offset in the XDR stream, nor a size
   until the results are produced, since the buffers may not be used
   for results at all, or may be partially filled.  Their presence in
   the XDR stream is therefore not known until the reply is processed.
   The mapping of Write chunks onto designated NFS procedures and
   their results is described in [NFSDDP].

   Therefore, read chunks are encoded into a read chunk list as a
   single array, with each entry tagged by its (known) size and
   position in the XDR stream.  Write chunks are encoded as a list of
   arrays of RDMA buffers, with each list element (an array) providing
   buffers for a separate result.  Individual write chunk list
   elements may thereby result in being partially or fully filled, or
   in fact not being filled at all.  Unused write chunks, or unused
   bytes in write chunk buffer lists, are not returned as results, and
   their memory is returned to the upper layer as part of RPC
   completion.  However, the RPC layer should not assume that the
   buffers have not been modified.

3.5.  Padding

   Alignment of specific opaque data enables certain scatter/gather
   optimizations.  Padding leverages the useful property that RDMA
   transfers preserve alignment of data, even when they are placed
   into pre-posted receive buffers by Sends.

   Many servers can make good use of such padding.  Padding allows the
   chaining of RDMA receive buffers such that any data transferred by
   RDMA on behalf of RPC requests will be placed into appropriately
   aligned buffers on the system that receives the transfer.  In this
   way, the need for servers to perform RDMA Read to satisfy all but
   the largest client writes is obviated.

   The effect of padding is demonstrated below showing prior bytes on
   an XDR stream (XXX) followed by an opaque field consisting of four
   length bytes (LLLL) followed by data bytes (DDDD).  The receiver of
   the RDMA Send has posted two chained receive buffers.  Without
   padding, the opaque data is split across the two buffers.  With the
   addition of padding bytes (ppp) prior to the first data byte, the
   data can be forced to align correctly in the second buffer.

                 Buffer 1                 Buffer 2
   Unpadded   --------------           --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD    --->   XXXXXXXLLLLDDD DDDDDDDDDDD

   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD --->   XXXXXXXLLLLppp DDDDDDDDDDDDDD

   Padding is implemented completely within the RDMA transport
   encoding, flagged with a specific message type.  Where padding is
   applied, two values are passed to the peer: an "rdma_align" which
   is the padding value used, and "rdma_thresh", which is the opaque
   data size at or above which padding is applied.  For instance, if
   the server is using chained 4 KB receive buffers, then up to (4 KB
   - 1) padding bytes could be used to achieve alignment of the data.
   If padding is to apply only to chunks at least 1 KB in size, then
   the threshold should be set to 1 KB.  The XDR routine at the peer
   will consult these values when decoding opaque values.  Where the
   decoded length exceeds the rdma_thresh, the XDR decode will skip
   over the appropriate padding as indicated by rdma_align and the
   current XDR stream position.
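
   A decoder might apply these two values as in the following sketch
   in C.  This is one plausible reading of the rule above, under the
   assumption that the pad brings the data portion up to the next
   rdma_align boundary; the function name is hypothetical.

      #include <stdint.h>

      /* When a decoded opaque length meets rdma_thresh, skip
       * enough pad bytes to reach an rdma_align boundary. */
      static uint32_t pad_to_skip(uint32_t xdr_stream_pos,
                                  uint32_t opaque_len,
                                  uint32_t rdma_align,
                                  uint32_t rdma_thresh)
      {
          if (opaque_len < rdma_thresh || rdma_align == 0)
              return 0;              /* field was not padded */
          return (rdma_align - xdr_stream_pos % rdma_align)
                 % rdma_align;
      }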

3.6.  XDR Decoding with Read Chunks

   The XDR decode process moves data from an XDR stream into a data
   structure provided by the RPC client or server application.  Where
   elements of the destination data structure are buffers or strings,
   the RPC application can either pre-allocate storage to receive the
   data, or leave the string or buffer fields null and allow the XDR
   decode stage of RPC processing to automatically allocate storage of
   sufficient size.

   When decoding a message from an RDMA transport, the receiver first
   XDR decodes the chunk lists from the RPC over RDMA header, then
   proceeds to decode the body of the RPC message (arguments or
   results).  Whenever the XDR offset in the decode stream matches
   that of a chunk in the read chunk list, the XDR routine initiates
   an RDMA Read to bring over the chunk data into locally registered
   memory for the destination buffer.

   When processing an RPC request, the RPC receiver (RPC server)
   acknowledges its completion of use of the source buffers by simply
   replying to the RPC sender (client), and the peer may free all
   source buffers advertised by the request.

   When processing an RPC reply, after completing such a transfer the
   RPC receiver (client) must issue an RDMA_DONE message (described in
   Section 3.8) to notify the peer (server) that the source buffers
   can be freed.

   The read chunk list is constructed and used entirely within the
   RPC/XDR layer.  Other than specifying the minimum chunk size, the
   management of the read chunk list is automatic and transparent to
   an RPC application.

3.7.  XDR Decoding with Write Chunks

   When a "write chunk list" is provided for the results of the RPC
   call, the RPC server must provide any corresponding data via RDMA
   Write to the memory referenced in the chunk list entries.  The RPC
   reply conveys this by returning the write chunk list to the client
   with the lengths rewritten to match the actual transfer.  The XDR
   "decode" of the reply therefore performs no local data transfer but
   merely returns the length obtained from the reply.

   Each decoded result consumes one entry in the write chunk list,
   which in turn consists of an array of RDMA segments.  The length is
   therefore the sum of all returned lengths in all segments
   comprising the corresponding list entry.  As each list entry is
   "decoded", the entire entry is consumed.

   The write chunk list is constructed and used by the RPC
   application.  The RPC/XDR layer simply conveys the list between
   client and server and initiates the RDMA Writes back to the client.
   The mapping of write chunk list entries to procedure arguments must
   be determined for each protocol.  An example of a mapping is
   described in [NFSDDP].
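
   A sketch in C of the length computation just described follows;
   the structure mirrors xdr_rdma_segment of section 4.3, and the
   helper function is a hypothetical illustration.

      #include <stdint.h>

      /* C mirror of xdr_rdma_segment (section 4.3). */
      struct xdr_rdma_segment {
          uint32_t handle;
          uint32_t length;   /* rewritten by server in the reply */
          uint64_t offset;
      };

      /* A decoded result's length is the sum over all segments
       * of its write chunk list entry. */
      static uint64_t write_chunk_result_length(
              const struct xdr_rdma_segment *segs, unsigned nsegs)
      {
          uint64_t total = 0;
          for (unsigned i = 0; i < nsegs; i++)
              total += segs[i].length;  /* bytes actually Written */
          return total;
      }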

3.8.  RPC Call and Reply

   The RDMA transport for RPC provides three methods of moving data
   between RPC client and server:

   Inline
      Data are moved between RPC client and server within an RDMA
      Send.

   RDMA Read
      Data are moved between RPC client and server via an RDMA Read
      operation via steering tag, address and offset obtained from a
      read chunk list.

   RDMA Write
      Result data is moved from RPC server to client via an RDMA
      Write operation via steering tag, address and offset obtained
      from a write chunk list or reply chunk in the client's RPC
      call message.

   These methods of data movement may occur in combinations within a
   single RPC.  For instance, an RPC call may contain some inline data
   along with some large chunks to be transferred via RDMA Read to the
   server.  The reply to that call may have some result chunks that
   the server RDMA Writes back to the client.  The following protocol
   interactions illustrate RPC calls that use these methods to move
   RPC message data:

   An RPC with write chunks in the call message looks like this:

      RPC Client                           RPC Server
          |     RPC Call + Write Chunk list      |
     Send |   ------------------------------>    |
          |                                      |
          |               Chunk 1                |
          |   <------------------------------    | Write
          |                  :                   |
          |               Chunk n                |
          |   <------------------------------    | Write
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

   In the presence of write chunks, RDMA ordering provides the
   guarantee that all data in the RDMA Write operations has been
   placed in memory prior to the client's RPC reply processing.

   An RPC with read chunks in the call message looks like this:

      RPC Client                           RPC Server
          |     RPC Call + Read Chunk list       |
     Send |   ------------------------------>    |
          |                                      |
          |               Chunk 1                |
          |   +------------------------------    | Read
          |   v----------------------------->    |
          |                  :                   |
          |               Chunk n                |
          |   +------------------------------    | Read
          |   v----------------------------->    |
          |                                      |
          |               RPC Reply              |
          |   <------------------------------    | Send

   And an RPC with read chunks in the reply message looks like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |     RPC Reply + Read Chunk list      |
          |   <------------------------------    | Send
          |                                      |
          |               Chunk 1                |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                  :                   |
          |               Chunk n                |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                                      |
          |                Done                  |
     Send |   ------------------------------>    |

   The final Done message allows the RPC client to signal the server
   that it has received the chunks, so the server can de-register and
   free the memory holding the chunks.  A Done completion is not
   necessary for an RPC call, since the RPC reply Send is itself a
   receive completion notification.  In the event that the client
   fails to return the Done message within some timeout period, the
   server may conclude that a protocol violation has occurred and
   close the RPC connection, or it may proceed to de-register and
   free its chunk buffers.  The latter may result in a fatal RDMA
   error if the client later attempts to perform an RDMA Read
   operation, which has the same effect as closing the connection.

   The use of read chunks in RPC reply messages is much less efficient
   than providing write chunks in the originating RPC calls, due to
   the additional message exchanges, the need for the RPC server to
   advertise buffers to the peer, the necessity of the server
   maintaining a timer for the purpose of recovery from misbehaving
   clients, and the need for additional memory registration.  Their
   use is not recommended by upper layers where efficiency is a
   primary concern [NFSDDP].  However, they may be employed by upper
   layer protocol bindings which are primarily concerned with
   transparency, since they can frequently be implemented completely
   within the RPC lower layers.

   It is important to note that the Done message consumes a credit at
   the RPC server.  The RPC server should provide sufficient credits
   to the client to allow the Done message to be sent without deadlock
   (driving the outstanding credit count to zero).
   The RPC client
   must account for its required Done messages to the server in its
   accounting of available credits, and the server should replenish
   any credit consumed by its use of such exchanges at its earliest
   opportunity.

   Finally, it is possible to conceive of RPC exchanges that involve
   any or all combinations of write chunks in the RPC call, read
   chunks in the RPC call, and read chunks in the RPC reply.  Support
   for such exchanges is straightforward from a protocol perspective,
   but in practice such exchanges would be quite rare, limited to
   upper layer protocol exchanges which transferred bulk data in both
   the call and corresponding reply.

4.  RPC RDMA Message Layout

   RPC call and reply messages are conveyed across an RDMA transport
   with a prepended RPC over RDMA header.  The RPC over RDMA header
   includes data for RDMA flow control credits, padding parameters and
   lists of addresses that provide direct data placement via RDMA Read
   and Write operations.  The layout of the RPC message itself is
   unchanged from that described in [RFC1831] except for the possible
   exclusion of large data chunks that will be moved by RDMA Read or
   Write operations.  If the RPC message (along with the RPC over RDMA
   header) is too long for the posted receive buffer (even after any
   large chunks are removed), then the entire RPC message can be moved
   separately as a chunk, leaving just the RPC over RDMA header in the
   RDMA Send.

4.1.  RPC over RDMA Header

   The RPC over RDMA header begins with four 32-bit fields that are
   always present and which control the RDMA interaction including
   RDMA-specific flow control.  These are then followed by a number of
   items such as chunk lists and padding which may or may not be
   present depending on the type of transmission.  The four fields
   which are always present are:

   1. Transaction ID (XID).
      The XID generated for the RPC call and reply.  Having the XID
      at the beginning of the message makes it easy to establish the
      message context.  This XID mirrors the XID in the RPC header,
      and takes precedence.  The receiver may ignore the XID in the
      RPC header, if it so chooses.

   2. Version number.
      This version of the RPC RDMA message protocol is 1.  The
      version number must be increased by one whenever the format of
      the RPC RDMA messages is changed.

   3. Flow control credit value.
      When sent in an RPC call message, the requested value is
      provided.  When sent in an RPC reply message, the granted
      value is returned.  RPC calls must not be sent in excess of
      the currently granted limit.

   4. Message type.

      o  RDMA_MSG = 0 indicates that chunk lists and RPC message
         follow.

      o  RDMA_NOMSG = 1 indicates that after the chunk lists there
         is no RPC message.  In this case, the chunk lists provide
         information to allow the message proper to be transferred
         using RDMA Read or Write and thus is not appended to the
         RPC over RDMA header.

      o  RDMA_MSGP = 2 indicates that a chunk list and RPC message
         with some padding follow.

      o  RDMA_DONE = 3 indicates that the message signals the
         completion of a chunk transfer via RDMA Read.

      o  RDMA_ERROR = 4 is used to signal any detected error(s) in
         the RPC RDMA chunk encoding.
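
   As an illustration, a sender could marshal the four fixed fields
   just listed as in this minimal sketch in C, using network byte
   order; the function name is hypothetical.

      #include <stdint.h>
      #include <stddef.h>
      #include <arpa/inet.h>  /* htonl() */

      /* Hypothetical encoder for the four fixed 32-bit header
       * fields.  Returns the number of bytes written (16). */
      static size_t encode_rpcrdma_fixed(uint32_t *out, uint32_t xid,
                                         uint32_t credits,
                                         uint32_t msg_type)
      {
          out[0] = htonl(xid);       /* mirrors the RPC header XID */
          out[1] = htonl(1);         /* protocol version, now 1 */
          out[2] = htonl(credits);   /* requested or granted */
          out[3] = htonl(msg_type);  /* RDMA_MSG ... RDMA_ERROR */
          return 4 * sizeof(uint32_t);
      }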

   Because the version number is encoded as part of this header, and
   the RDMA_ERROR message type is used to indicate errors, these first
   four fields and the start of the following message body must always
   remain aligned at these fixed offsets for all versions of the RPC
   over RDMA header.

   For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
   chunk lists follow.  If the Read chunk list is null (a 32 bit word
   of zeros), then there are no chunks to be transferred separately
   and the RPC message follows in its entirety.  If non-null, it marks
   the beginning of an XDR encoded sequence of Read chunk list
   entries.  If the Write chunk list is non-null, then an XDR encoded
   sequence of Write chunk entries follows.

   If the message type is RDMA_MSGP, then two additional fields that
   specify the padding alignment and threshold are inserted prior to
   the Read and Write chunk lists.

   A header of message type RDMA_MSG or RDMA_MSGP will be followed by
   the RPC call or RPC reply message body, beginning with the XID.
   The XID in the RDMA_MSG or RDMA_MSGP header must match this.

   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |   NULLs     | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------

   Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
   RPC message follows.  As an implementation hint: a gather operation
   on the Send of the RDMA RPC message can be used to marshal the
   initial header, the chunk list, and the RPC message itself.

4.2.  RPC over RDMA header errors

   When a peer receives an RPC RDMA message, it must perform certain
   basic validity checks on the header and chunk contents.  If errors
   are detected in an RPC request, an RDMA_ERROR reply should be
   generated.

   Two types of errors are defined, version mismatch and invalid chunk
   format.  When the peer detects an RPC over RDMA header version
   which it does not support (currently this draft defines only
   version 1), it replies with an error code of ERR_VERS, and provides
   the low and high inclusive version numbers it does, in fact,
   support.  The version number in this reply can be any value
   otherwise valid at the receiver.  When other decoding errors are
   detected in the header or chunks, either an RPC decode error may be
   returned, or the error code ERR_CHUNK.

4.3.  XDR Language Description

   Here is the message layout in XDR language.

      struct xdr_rdma_segment {
         uint32 handle;        /* Registered memory handle */
         uint32 length;        /* Length of the chunk in bytes */
         uint64 offset;        /* Chunk virtual address or offset */
      };

      struct xdr_read_chunk {
         uint32 position;      /* Position in XDR stream */
         struct xdr_rdma_segment target;
      };

      struct xdr_read_list {
         struct xdr_read_chunk entry;
         struct xdr_read_list *next;
      };

      struct xdr_write_chunk {
         struct xdr_rdma_segment target<>;
      };

      struct xdr_write_list {
         struct xdr_write_chunk entry;
         struct xdr_write_list *next;
      };

      struct rdma_msg {
         uint32 rdma_xid;      /* Mirrors the RPC header xid */
         uint32 rdma_vers;     /* Version of this protocol */
         uint32 rdma_credit;   /* Buffers requested/granted */
         rdma_body rdma_body;
      };

      enum rdma_proc {
         RDMA_MSG=0,    /* An RPC call or reply msg */
         RDMA_NOMSG=1,  /* An RPC call or reply msg - separate body */
         RDMA_MSGP=2,   /* An RPC call or reply msg with padding */
         RDMA_DONE=3,   /* Client signals reply completion */
         RDMA_ERROR=4   /* An RPC RDMA encoding error */
      };

      union rdma_body switch (rdma_proc proc) {
         case RDMA_MSG:
            rpc_rdma_header rdma_msg;
         case RDMA_NOMSG:
            rpc_rdma_header_nomsg rdma_nomsg;
         case RDMA_MSGP:
            rpc_rdma_header_padded rdma_msgp;
         case RDMA_DONE:
            void;
         case RDMA_ERROR:
            rpc_rdma_error rdma_error;
      };

      struct rpc_rdma_header {
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };

      struct rpc_rdma_header_nomsg {
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
      };

      struct rpc_rdma_header_padded {
         uint32 rdma_align;    /* Padding alignment */
         uint32 rdma_thresh;   /* Padding threshold */
         struct xdr_read_list *rdma_reads;
         struct xdr_write_list *rdma_writes;
         struct xdr_write_chunk *rdma_reply;
         /* rpc body follows */
      };

      enum rpc_rdma_errcode {
         ERR_VERS = 1,
         ERR_CHUNK = 2
      };

      union rpc_rdma_error switch (rpc_rdma_errcode) {
         case ERR_VERS:
            uint32 rdma_vers_low;
            uint32 rdma_vers_high;
         case ERR_CHUNK:
            void;
         default:
            uint32 rdma_extra[8];
      };
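
   Since the read chunk list is a linked list whose entries carry XDR
   stream positions (see section 3.4), a receiver might locate the
   chunks for the current decode offset as in this sketch, a C mirror
   of the XDR above with hypothetical names.

      #include <stdint.h>
      #include <stddef.h>

      /* C mirror of xdr_read_chunk / xdr_read_list; hypothetical. */
      struct read_chunk {
          uint32_t position;           /* XDR stream position */
          uint32_t handle, length;
          uint64_t offset;
          struct read_chunk *next;
      };

      /* Return the first chunk at a given XDR position; note that a
       * contiguous sequence of chunks may share one position. */
      static struct read_chunk *find_read_chunk(
              struct read_chunk *list, uint32_t xdr_pos)
      {
          for (; list != NULL; list = list->next)
              if (list->position == xdr_pos)
                  return list;
          return NULL;
      }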

5.  Long Messages

   The receiver of RDMA Send messages is required to have previously
   posted one or more adequately sized buffers.  The RPC client can
   inform the server of the maximum size of its RDMA Send messages via
   the Connection Configuration Protocol described later in this
   document.

   Since RPC messages are frequently small, memory savings can be
   achieved by posting small buffers.  Even large messages like NFS
   READ or WRITE will be quite small once the chunks are removed from
   the message.  However, there may be large messages that would
   demand a very large buffer be posted, where the contents of the
   buffer may not be a chunkable XDR element.  A good example is an
   NFS READDIR reply which may contain a large number of small
   filename strings.  Also, the NFS version 4 protocol [RFC3530]
   features COMPOUND request and reply messages of unbounded length.

   Ideally, each upper layer will negotiate these limits.  However, it
   is frequently necessary to provide a transparent solution.

5.1.  Message as an RDMA Read Chunk

   One relatively simple method is to have the client identify any RPC
   message that exceeds the RPC server's posted buffer size and move
   it separately as a chunk, i.e. reference it as the first entry in
   the read chunk list with an XDR position of zero.

   Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------

   Long Message

   +--------+---------+---------+------------+-------------+
   |        |         |         |            |             |
   |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
   |        |         |         |            |             |
   +--------+---------+---------+------------+-------------+
                                                   |
                                                   |   +----------
                                                   |   | Long RPC Call
                                                   +-->|     or
                                                       | Reply Message
                                                       +----------

   If the receiver gets an RPC over RDMA header with a message type of
   RDMA_NOMSG and finds an initial read chunk list entry with a zero
   XDR position, it allocates a registered buffer and issues an RDMA
   Read of the long RPC message into it.  The receiver then proceeds
   to XDR decode the RPC message as if it had received it inline with
   the Send data.  Further decoding may issue additional RDMA Reads to
   bring over additional chunks.
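
   A receiver's dispatch path might recognize a long message as in
   this sketch in C; the structures are minimal hypothetical views of
   the decoded header.

      #include <stdint.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define RDMA_NOMSG 1

      struct read_chunk {
          uint32_t position;         /* XDR stream position */
          struct read_chunk *next;   /* other fields omitted */
      };

      struct rpcrdma_hdr {
          uint32_t msg_type;
          struct read_chunk *reads;
      };

      /* A message was sent "long" when the header carries
       * RDMA_NOMSG and the first read chunk has position zero. */
      static bool is_long_message(const struct rpcrdma_hdr *h)
      {
          return h->msg_type == RDMA_NOMSG &&
                 h->reads != NULL && h->reads->position == 0;
      }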

   Although the handling of long messages requires one extra network
   turnaround, in practice these messages should be rare if the posted
   receive buffers are correctly sized, and of course they will be
   non-existent for RDMA-aware upper layers.

   An RPC with long reply returned via RDMA Read looks like this:

      RPC Client                           RPC Server
          |               RPC Call               |
     Send |   ------------------------------>    |
          |                                      |
          |         RPC over RDMA Header         |
          |   <------------------------------    | Send
          |                                      |
          |          Long RPC Reply Msg          |
     Read |   ------------------------------+    |
          |   <-----------------------------v    |
          |                                      |
          |                Done                  |
     Send |   ------------------------------>    |

5.2.  RDMA Write of Long Replies (Reply Chunks)

   A superior method of handling long RPC replies is to have the RPC
   client post a large buffer into which the server can write a large
   RPC reply.  This has the advantage that an RDMA Write may be
   slightly faster in network latency than an RDMA Read.
   Additionally, it removes the need for the RDMA_DONE message that is
   required when a large reply is returned as a Read chunk.

   This protocol supports direct return of a large reply via the
   inclusion of an optional rdma_reply write chunk after the read
   chunk list and the write chunk list.  The client allocates a buffer
   sized to receive a large reply and enters its steering tag, address
   and length in the rdma_reply write chunk.  If the reply message is
   too long to return inline with an RDMA Send (exceeds the size of
   the client's posted receive buffer), even with read chunks removed,
   then the RPC server performs an RDMA Write of the RPC reply message
   into the buffer indicated by the rdma_reply chunk.  If the client
   does not provide an rdma_reply chunk, or if it is too small, then
   the message must be returned as a Read chunk.

   An RPC with long reply returned via RDMA Write looks like this:

      RPC Client                           RPC Server
          |       RPC Call with rdma_reply       |
     Send |   ------------------------------>    |
          |                                      |
          |          Long RPC Reply Msg          |
          |   <------------------------------    | Write
          |                                      |
          |         RPC over RDMA Header         |
          |   <------------------------------    | Send

   The use of RDMA Write to return long replies requires that the
   client application anticipate a long reply and have some knowledge
   of its size so that an adequately sized buffer can be allocated.
   This is certainly true of NFS READDIR replies, where the client
   already provides an upper bound on the size of the encoded
   directory fragment to be returned by the server.

   The use of these "reply chunks" is highly efficient and convenient
   for both RPC client and server.  Their use is encouraged for
   eligible RPC operations such as NFS READDIR, which would otherwise
   require extensive chunk management within the results or use of
   RDMA Read and a Done message [NFSDDP].

6.  Connection Configuration Protocol

   RDMA Send operations require the receiver to post one or more
   buffers at the RDMA connection endpoint, each large enough to
   receive the largest Send message.  Buffers are consumed as Send
   messages are received.  If a buffer is too small, or if there are
   no buffers posted, the RDMA transport may return an error and break
   the RDMA connection.  The receiver must post sufficient, adequately
   sized buffers to avoid buffer overrun or capacity errors.

   The protocol described above includes only a mechanism for managing
   the number of such receive buffers, and no explicit features to
   allow the RPC client and server to provision or control buffer
   sizing, nor any other session parameters.

   In the past, this type of connection management has not been
   necessary for RPC.  RPC over UDP or TCP does not have a protocol to
   negotiate the link.  The server can get a rough idea of the maximum
   size of messages from the server protocol code.  However, a
   protocol to negotiate transport features on a more dynamic basis is
   desirable.

   The Connection Configuration Protocol allows the client to pass its
   connection requirements to the server, and allows the server to
   inform the client of its connection limits.

6.1.  Initial Connection State

   This protocol will be used for connection setup prior to the use of
   another RPC protocol that uses the RDMA transport.  It operates in-
   band, i.e. it uses the connection itself to negotiate the
   connection parameters.  To provide a basis for connection
   negotiation, the connection is assumed to provide a basic level of
   interoperability: the ability to exchange at least one RPC message
   at a time that is at least 1 KB in size.  The server may exceed
   this basic level of configuration, but the client must not assume
   it.

6.2.  Protocol Description

   Version 1 of the Connection Configuration protocol consists of a
   single procedure that allows the client to inform the server of its
   connection requirements and the server to return connection
   information to the client.

   The maxcall_sendsize argument is the maximum size of an RPC call
   message that the client will send inline in an RDMA Send message to
   the server.  The server may return a maxcall_sendsize value that is
   smaller or larger than the client's request.
The client must not 1085 send an inline call message larger than what the server will 1086 accept. The maxcall_sendsize limits only the size of inline RPC 1087 calls. It does not limit the size of long RPC messages transferred 1088 as an initial chunk in the Read chunk list. 1090 The maxreply_sendsize is the maximum size of an inline RPC message 1091 that the client will accept from the server. 1093 The maxrdmaread is the maximum number of RDMA Reads which may be 1094 active at the peer. This number correlates to the RDMA incoming 1095 RDMA Read count ("IRD") configured into each originating endpoint 1096 by the client or server. If more than this number of RDMA Read 1097 operations by the connected peer are issued simultaneously, 1098 connection loss or suboptimal flow control may result, therefore 1099 the value should be observed at all times. The peers' values need 1100 not be equal. If zero, the peer must not issue requests which 1101 require RDMA Read to satisfy, as no transfer will be possible. 1103 The align value is the value recommended by the server for opaque 1104 data values such as strings and counted byte arrays. The client 1105 can use this value to compute the number of prepended pad bytes 1106 when XDR encoding opaque values in the RPC call message. 1108 typedef unsigned int uint32; 1110 struct config_rdma_req { 1111 uint32 maxcall_sendsize; 1112 /* max size of inline RPC call */ 1113 uint32 maxreply_sendsize; 1114 /* max size of inline RPC reply */ 1115 uint32 maxrdmaread; 1116 /* max active RDMA Reads at client */ 1117 }; 1119 struct config_rdma_reply { 1120 uint32 maxcall_sendsize; 1121 /* max call size accepted by server */ 1122 uint32 align; 1123 /* server's receive buffer alignment */ 1124 uint32 maxrdmaread; 1125 /* max active RDMA Reads at server */ 1126 }; 1128 program CONFIG_RDMA_PROG { 1129 version VERS1 { 1130 /* 1131 * Config call/reply 1132 */ 1133 config_rdma_reply CONF_RDMA(config_rdma_req) = 1; 1134 } = 1; 1135 } = nnnnnn; <-- Need program number assigned 1137 7. Memory Registration Overhead 1139 RDMA requires that all data be transferred between registered 1140 memory regions at the source and destination. All protocol headers 1141 as well as separately transferred data chunks must use registered 1142 memory. Since the cost of registering and de-registering memory 1143 can be a large proportion of the RDMA transaction cost, it is 1144 important to minimize registration activity. This is easily 1145 achieved within RPC controlled memory by allocating chunk list data 1146 and RPC headers in a reusable way from pre-registered pools. 1148 The data chunks transferred via RDMA may occupy memory that 1149 persists outside the bounds of the RPC transaction. Hence, the 1150 default behavior of an RPC over RDMA transport is to register and 1151 de-register these chunks on every transaction. However, this is 1152 not a limitation of the protocol - only of the existing local RPC 1153 API. The API is easily extended through such functions as 1154 rpc_control(3) to change the default behavior so that the 1155 application can assume responsibility for controlling memory 1156 registration through an RPC-provided registered memory allocator. 1158 8. Errors and Error Recovery 1160 RPC RDMA protocol errors are described in section 4. RPC errors 1161 and RPC error recovery are not affected by the protocol, and 1162 proceed as for any RPC error condition. RDMA Transport error 1163 reporting and recovery are outside the scope of this protocol. 
7. Memory Registration Overhead

   RDMA requires that all data be transferred between registered
   memory regions at the source and destination.  All protocol
   headers as well as separately transferred data chunks must use
   registered memory.  Since the cost of registering and
   de-registering memory can be a large proportion of the RDMA
   transaction cost, it is important to minimize registration
   activity.  This is easily achieved within RPC-controlled memory by
   allocating chunk list data and RPC headers in a reusable way from
   pre-registered pools.

   The data chunks transferred via RDMA may occupy memory that
   persists outside the bounds of the RPC transaction.  Hence, the
   default behavior of an RPC over RDMA transport is to register and
   de-register these chunks on every transaction.  However, this is a
   limitation of the existing local RPC API, not of the protocol
   itself.  The API is easily extended through such functions as
   rpc_control(3) to change the default behavior, so that the
   application can assume responsibility for controlling memory
   registration through an RPC-provided registered memory allocator.
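   As a non-normative sketch of this pooling approach (the
   rdma_register() stub stands in for whatever registration call the
   local RDMA provider offers, and the sizes are illustrative), RPC
   headers can be carved from a single region registered once at
   startup, so that per-RPC allocation touches no registration
   machinery:

      #include <stddef.h>
      #include <stdint.h>
      #include <stdlib.h>

      /* Illustrative stand-in for the provider's registration
       * call. */
      extern void *rdma_register(void *addr, size_t len);

      #define HDR_SLOT_SIZE  1024  /* fits any inline RPC header */
      #define HDR_SLOT_COUNT 128   /* hypothetical pool depth */

      struct hdr_pool {
          uint8_t *base;                    /* registered once */
          void    *mr;                      /* registration handle */
          int      free_stack[HDR_SLOT_COUNT];
          int      top;                     /* count of free slots */
      };

      static int hdr_pool_init(struct hdr_pool *p)
      {
          size_t len = (size_t)HDR_SLOT_SIZE * HDR_SLOT_COUNT;

          p->base = malloc(len);
          if (p->base == NULL)
              return -1;

          /* One registration amortized over every header sent. */
          p->mr = rdma_register(p->base, len);
          if (p->mr == NULL)
              return -1;

          for (p->top = 0; p->top < HDR_SLOT_COUNT; p->top++)
              p->free_stack[p->top] = p->top;
          return 0;
      }

      /* Per-RPC allocation: pop a slot, no registration activity. */
      static void *hdr_alloc(struct hdr_pool *p)
      {
          if (p->top == 0)
              return NULL;
          return p->base +
                 (size_t)p->free_stack[--p->top] * HDR_SLOT_SIZE;
      }

      static void hdr_free(struct hdr_pool *p, void *buf)
      {
          p->free_stack[p->top++] =
              (int)(((uint8_t *)buf - p->base) / HDR_SLOT_SIZE);
      }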
8. Errors and Error Recovery

   RPC RDMA protocol errors are described in section 4.  RPC errors
   and RPC error recovery are not affected by the protocol, and
   proceed as for any RPC error condition.  RDMA transport error
   reporting and recovery are outside the scope of this protocol.

   It is assumed that the link itself will provide some degree of
   error detection and retransmission.  iWARP's MPA layer (when used
   over TCP), SCTP, and the Infiniband link layer all provide CRC
   protection of the RDMA payload, and CRC-class protection is a
   general attribute of such transports.  Additionally, the RPC layer
   itself can accept errors from the link level and recover via
   retransmission.  RPC recovery can handle complete loss and
   re-establishment of the link.

   See section 11 for further discussion of the use of RPC-level
   integrity schemes to detect errors, and related efficiency issues.

9. Node Addressing

   In setting up a new RDMA connection, the first action by an RPC
   client will be to obtain a transport address for the server.  The
   mechanism used to obtain this address, and to open an RDMA
   connection, is dependent on the type of RDMA transport, and is the
   responsibility of each RPC protocol binding and its local
   implementation.

10. RPC Binding

   RPC services normally register with a portmap or rpcbind service,
   which associates an RPC program number with a service address.
   (In the case of UDP or TCP, the service address for NFS is
   normally port 2049.)  This policy should be no different for RDMA
   interconnects, although it may require the allocation of port
   numbers appropriate to each.

   When a mapping standard or convention exists for IP ports on an
   RDMA interconnect, there are two possibilities:

      One possibility is to have the server's portmapper register
      itself on the RDMA interconnect at a "well-known" service
      address.  (On UDP or TCP, this corresponds to port 111.)  A
      client could connect to this service address and use the
      portmap protocol to obtain a service address in response to a
      program number, e.g., an iWARP port number or an Infiniband
      GID.

      Alternatively, the client could simply connect to the mapped
      well-known port for the service itself, if it is appropriately
      defined.

   Historically, different RPC protocols have taken different
   approaches to their port assignment; the specific method is
   therefore left to each RPC/RDMA-enabled upper layer binding, and
   is not addressed here.

11. Security

   ONC RPC provides its own security via the RPCSEC_GSS framework
   [RFC2203].  RPCSEC_GSS can provide message authentication,
   integrity checking, and privacy.  This security mechanism will be
   unaffected by the RDMA transport.  The data integrity and privacy
   features alter the body of the message, presenting it as a single
   chunk.  For large messages the chunk may be large enough to
   qualify for RDMA Read transfer.  However, there is much data
   movement associated with computation and verification of
   integrity, or with encryption and decryption, so certain
   performance advantages may be lost.

   For efficiency, a more appropriate security mechanism for RDMA
   links may be link-level protection such as IPsec, which may be
   co-located in the RDMA hardware.  The use of link-level protection
   may be negotiated through the use of a new RPCSEC_GSS mechanism
   like the Credential Cache GSS Mechanism [CCM].  Use of such
   mechanisms is recommended where end-to-end integrity and/or
   privacy is desired and where efficiency is required.

   There are no new issues with exposed addresses: the only addresses
   exposed are in the chunk list and in the transport packets
   transferred via RDMA, and the data contained at these addresses
   continues to be protected by RPCSEC_GSS integrity and privacy.

12. IANA Considerations

   The new RPC transport should be assigned a new RPC "netid".

   As a new RPC transport, this protocol should have no effect on RPC
   program numbers or existing registered port numbers.  However, new
   port numbers may be registered for use by RPC/RDMA-enabled
   services, as appropriate to the new networks over which the
   services will operate.

   If adopted, the Connection Configuration protocol described herein
   will require an RPC program number assignment.

13. Acknowledgements

   The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
   Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
   Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
   Robinson, and Mallikarjun Chadalapaka for their contributions to
   this document.

14. Normative References

   [RFC1094]
        Sun Microsystems, "NFS: Network File System Protocol
        Specification", (NFS version 2) Informational RFC,
        http://www.ietf.org/rfc/rfc1094.txt

   [RFC1831]
        R. Srinivasan, "RPC: Remote Procedure Call Protocol
        Specification Version 2", Standards Track RFC,
        http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
        R. Srinivasan, "XDR: External Data Representation Standard",
        Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

   [RFC1813]
        B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
        Protocol Specification", Informational RFC,
        http://www.ietf.org/rfc/rfc1813.txt

   [RFC3530]
        S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
        M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards
        Track RFC, http://www.ietf.org/rfc/rfc3530.txt

   [RFC2203]
        M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
        Specification", Standards Track RFC,
        http://www.ietf.org/rfc/rfc2203.txt

15. Informative References

   [RDMAP]
        R. Recio et al., "An RDMA Protocol Specification", Internet
        Draft Work in Progress, draft-ietf-rddp-rdmap

   [CCM]
        M. Eisler, N. Williams, "CCM: The Credential Cache GSS
        Mechanism", Internet Draft Work in Progress,
        draft-ietf-nfsv4-ccm

   [NFSDDP]
        B. Callaghan, T. Talpey, "NFS Direct Data Placement",
        Internet Draft Work in Progress, draft-ietf-nfsv4-nfsdirect

   [RDDP]
        Remote Direct Data Placement Working Group Charter,
        http://www.ietf.org/html.charters/rddp-charter.html

   [NFSRDMAPS]
        T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
        Internet Draft Work in Progress,
        draft-ietf-nfsv4-nfs-rdma-problem-statement

   [NFSv4.1]
        S. Shepler, ed., "NFSv4 Minor Version 1", Internet Draft Work
        in Progress, draft-ietf-nfsv4-minorversion1

   [IB]
        Infiniband Architecture Specification,
        http://www.infinibandta.org

16. Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com
17. Intellectual Property and Copyright Statements

   Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology
   described in this document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.  Information on the procedures with respect to rights
   in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention
   any copyrights, patents or patent applications, or other
   proprietary rights that may cover technology that may be required
   to implement this standard.  Please address the information to the
   IETF at ietf-ipr@ietf.org.

   Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

   Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.