-------------------------------------------------------------------------------- 2 NFSv4 Working Group Tom Talpey 3 Internet-Draft Network Appliance, Inc. 4 Intended status: Standards Track Brent Callaghan 5 Expires: January 1, 2008 Apple Computer, Inc. 6 July 1, 2007 8 RDMA Transport for ONC RPC 9 draft-ietf-nfsv4-rpcrdma-06 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other 25 documents at any time. It is inappropriate to use Internet-Drafts 26 as reference material or to cite them other than as "work in 27 progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 Copyright Notice 37 Copyright (C) The IETF Trust (2007). 39 Abstract 41 A protocol is described providing RDMA as a new transport for ONC 42 RPC. The RDMA transport binding conveys the benefits of efficient, 43 bulk data transport over high speed networks, while providing for 44 minimal change to RPC applications and with no required revision of 45 the application RPC protocol, or the RPC protocol itself. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 50 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 51 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 52 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 53 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 54 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 55 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 56 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 11 57 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 58 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 59 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 60 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 61 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 62 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 63 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 64 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 65 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 66 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 67 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 68 6. Connection Configuration Protocol . . . . . . . . . . . . 25 69 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 70 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 71 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 72 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 73 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 74 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 75 11. Security . . . . . . . . 
. . . . . . . . . . . . . . . . 30 76 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 30 77 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 31 78 14. Normative References . . . . . . . . . . . . . . . . . . 31 79 15. Informative References . . . . . . . . . . . . . . . . . 32 80 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 33 81 17. Intellectual Property and Copyright Statements . . . . . 33 82 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 34 84 Requirements Language 86 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 87 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 88 this document are to be interpreted as described in [RFC2119]. 90 1. Introduction 92 RDMA is a technique for efficient movement of data between end 93 nodes, which becomes increasingly compelling over high speed 94 transports. By directing data into destination buffers as it is 95 sent on a network, and placing it via direct memory access by 96 hardware, the double benefit of faster transfers and reduced host 97 overhead is obtained. 99 ONC RPC [RFC1831bis] is a remote procedure call protocol that has 100 been run over a variety of transports. Most RPC implementations 101 today use UDP or TCP. RPC messages are defined in terms of an 102 eXternal Data Representation (XDR) [RFC4506] which provides a 103 canonical data representation across a variety of host 104 architectures. An XDR data stream is conveyed differently on each 105 type of transport. On UDP, RPC messages are encapsulated inside 106 datagrams, while on a TCP byte stream, RPC messages are delineated 107 by a record marking protocol. An RDMA transport also conveys RPC 108 messages in a unique fashion that must be fully described if client 109 and server implementations are to interoperate. 111 RDMA transports present new semantics unlike the behaviors of 112 either UDP or TCP alone. They retain message delineations like 113 UDP while also providing a reliable, sequenced data transfer like 114 TCP. In addition, they provide the new, efficient bulk transfer service of 115 RDMA. RDMA transports are therefore naturally viewed as a new 116 transport type by ONC RPC. 118 RDMA as a transport will benefit the performance of RPC protocols 119 that move large "chunks" of data, since RDMA hardware excels at 120 moving data efficiently between host memory and a high speed 121 network with little or no host CPU involvement. In this context, 122 the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] 123 [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem 124 statement is discussed in [NFSRDMAPS], and related NFSv4 issues are 125 discussed in [NFSv4.1]. Many other RPC-based protocols will also 126 benefit. 128 Although the RDMA transport described here provides relatively 129 transparent support for any RPC application, the proposal goes 130 further in describing mechanisms that can optimize the use of RDMA 131 with more active participation by the RPC application. 133 2. Abstract RDMA Requirements 135 An RPC transport is responsible for conveying an RPC message from a 136 sender to a receiver. An RPC message is either an RPC call from a 137 client to a server, or an RPC reply from the server back to the 138 client. An RPC message contains an RPC call header followed by 139 arguments if the message is an RPC call, or an RPC reply header 140 followed by results if the message is an RPC reply.
The call 141 header contains a transaction ID (XID) followed by the program and 142 procedure number as well as a security credential. An RPC reply 143 header begins with an XID that matches that of the RPC call 144 message, followed by a security verifier and results. All data in 145 an RPC message is XDR encoded. For a complete description of the 146 RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. 148 This protocol assumes the following abstract model for RDMA 149 transports. These terms, common in the RDMA lexicon, are used in 150 this document. A more complete glossary of RDMA terms can be found 151 in [RDMAP]. 153 o Registered Memory 154 All data moved via tagged RDMA operations is resident in 155 registered memory at its destination. This protocol assumes 156 that each segment of registered memory MUST be identified with 157 a steering tag of no more than 32 bits and memory addresses of 158 up to 64 bits in length. 160 o RDMA Send 161 The RDMA provider supports an RDMA Send operation with 162 completion signalled at the receiver when data is placed in a 163 pre-posted buffer. The amount of transferred data is limited 164 only by the size of the receiver's buffer. Sends complete at 165 the receiver in the order they were issued at the sender. 167 o RDMA Write 168 The RDMA provider supports an RDMA Write operation to directly 169 place data in the receiver's buffer. An RDMA Write is 170 initiated by the sender and completion is signalled at the 171 sender. No completion is signalled at the receiver. The 172 sender uses a steering tag, memory address and length of the 173 remote destination buffer. RDMA Writes are not necessarily 174 ordered with respect to one another, but are ordered with 175 respect to RDMA Sends; a subsequent RDMA Send completion 176 obtained at the receiver guarantees that prior RDMA Write data 177 has been successfully placed in the receiver's memory. 179 o RDMA Read 180 The RDMA provider supports an RDMA Read operation to directly 181 place peer source data in the requester's buffer. An RDMA 182 Read is initiated by the receiver and completion is signalled 183 at the receiver. The receiver provides steering tags, memory 184 addresses and a length for the remote source and local 185 destination buffers. Since the peer at the data source 186 receives no notification of RDMA Read completion, there is an 187 assumption that on receiving the data the receiver will signal 188 completion with an RDMA Send message, so that the peer can 189 free the source buffers and the associated steering tags. 191 This protocol is designed to be carried over all RDMA transports 192 meeting the stated requirements. This protocol conveys to the RPC 193 peer, information sufficient for that RPC peer to direct an RDMA 194 layer to perform transfers containing RPC data, and to communicate 195 their result(s). For example, it is readily carried over RDMA 196 transports such as iWARP [RDDP] or Infiniband [IB]. 198 3. Protocol Outline 200 An RPC message can be conveyed in identical fashion, whether it is 201 a call or reply message. In each case, the transmission of the 202 message proper is preceded by transmission of a transport-specific 203 header for use by RPC over RDMA transports. This header is 204 analogous to the record marking used for RPC over TCP, but is more 205 extensive, since RDMA transports support several modes of data 206 transfer and it is important to allow the client and server to use 207 the most efficient mode for any given transfer. 
Multiple segments 208 of a message may be transferred in different ways to different 209 remote memory destinations. 211 All transfers of a call or reply begin with an RDMA Send which 212 transfers at least the RPC over RDMA header, usually with the call 213 or reply message appended, or at least some part thereof. Because 214 the size of what may be transmitted via RDMA Send is limited by the 215 size of the receiver's pre-posted buffer, the RPC over RDMA 216 transport provides a number of methods to reduce the amount 217 transferred by means of the RDMA Send, when necessary, by 218 transferring various parts of the message using RDMA Read and RDMA 219 Write. 221 RPC over RDMA framing replaces all other RPC framing (such as TCP 222 record marking) when used atop an RPC/RDMA association, even though 223 the underlying RDMA protocol may itself be layered atop a protocol 224 with a defined RPC framing (such as TCP). An upper layer may 225 however define an exchange to dynamically enable RPC/RDMA on an 226 existing RPC association. Any such exchange must be carefully 227 architected so as to prevent any ambiguity as to the framing in use 228 for each side of the connection. Because RPC/RDMA framing delimits 229 an entire RPC request or reply, any such shift must occur between 230 distinct RPC messages. 232 3.1. Short Messages 234 Many RPC messages are quite short. For example, the NFS version 3 235 GETATTR request, is only 56 bytes: 20 bytes of RPC header plus a 32 236 byte filehandle argument and 4 bytes of length. The reply to this 237 common request is about 100 bytes. 239 There is no benefit in transferring such small messages with an 240 RDMA Read or Write operation. The overhead in transferring 241 steering tags and memory addresses is justified only by large 242 transfers. The critical message size that justifies RDMA transfer 243 will vary depending on the RDMA implementation and network, but is 244 typically of the order of a few kilobytes. It is appropriate to 245 transfer a short message with an RDMA Send to a pre-posted buffer. 246 The RPC over RDMA header with the short message (call or reply) 247 immediately following is transferred using a single RDMA Send 248 operation. 250 Short RPC messages over an RDMA transport: 252 RPC Client RPC Server 253 | RPC Call | 254 Send | ------------------------------> | 255 | | 256 | RPC Reply | 257 | <------------------------------ | Send 259 3.2. Data Chunks 261 Some protocols, like NFS, have RPC procedures that can transfer 262 very large "chunks" of data in the RPC call or reply and would 263 cause the maximum send size to be exceeded if one tried to transfer 264 them as part of the RDMA Send. These large chunks typically range 265 from a kilobyte to a megabyte or more. An RDMA transport can 266 transfer large chunks of data more efficiently via the direct 267 placement of an RDMA Read or RDMA Write operation. Using direct 268 placement instead of inline transfer not only avoids expensive data 269 copies, but provides correct data alignment at the destination. 271 3.3. Flow Control 273 It is critical to provide RDMA Send flow control for an RDMA 274 connection. RDMA receive operations will fail if a pre-posted 275 receive buffer is not available to accept an incoming RDMA Send, 276 and repeated occurrences of such errors can be fatal to the 277 connection. This is a departure from conventional TCP/IP 278 networking where buffers are allocated dynamically on an as-needed 279 basis, and where pre-posting is not required. 
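A non-normative illustration of the credit mechanism described in the remainder of this section is sketched below in C; the structure and function names are hypothetical, since the protocol itself defines only the credit value carried in each RPC over RDMA header (Section 4.1).

   #include <stdbool.h>
   #include <stdint.h>

   /*
    * Illustrative only.  How an implementation tracks the granted
    * credit value is a local matter; only the credit field in the
    * RPC over RDMA header is defined by this protocol.
    */
   struct rpcrdma_flow {
       uint32_t granted;    /* most recent credit value granted by peer */
       uint32_t in_flight;  /* calls sent but not yet answered */
   };

   /* A new RPC call may be sent only within the granted limit. */
   static bool can_send(const struct rpcrdma_flow *f)
   {
       return f->in_flight < f->granted;
   }

   /* Each reply carries a fresh grant; adopt it and retire the call. */
   static void on_reply(struct rpcrdma_flow *f, uint32_t new_credit)
   {
       f->granted = new_credit;  /* the server always grants at least one */
       if (f->in_flight > 0)
           f->in_flight--;
   }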
281 It is not practical to provide for fixed credit limits at the RPC 282 server. Fixed limits scale poorly, since posted buffers are 283 dedicated to the associated connection until consumed by receive 284 operations. Additionally for protocol correctness, the RPC server 285 must always be able to reply to client requests, whether or not new 286 buffers have been posted to accept future receives. (Note that the 287 RPC server may in fact be a client at some other layer. For 288 example, NFSv4 callbacks are processed by the NFSv4 client, acting 289 as an RPC server. The credit discussions apply equally in either 290 case.) 292 Flow control for RDMA Send operations is implemented as a simple 293 request/grant protocol in the RPC over RDMA header associated with 294 each RPC message. The RPC over RDMA header for RPC call messages 295 contains a requested credit value for the RPC server, which MAY be 296 dynamically adjusted by the caller to match its expected needs. 297 The RPC over RDMA header for the RPC reply messages provides the 298 granted result, which MAY have any value except it MUST NOT be zero 299 when no in-progress operations are present at the server, since 300 such a value would result in deadlock. The value MAY be adjusted 301 up or down at each opportunity to match the server's needs or 302 policies. 304 The RPC client MUST NOT send unacknowledged requests in excess of 305 this granted RPC server credit limit. If the limit is exceeded, 306 the RDMA layer may signal an error, possibly terminating the 307 connection. Even if an error does not occur, it is OPTIONAL that 308 the server handle the excess request(s), and it MAY return an RPC 309 error to the client. Also note that the never-zero requirement 310 implies that an RPC server MUST always provide at least one credit 311 to each connected RPC client. It is however OPTIONAL that the 312 server always be prepared to receive a request from each client, 313 for example when the server is busy processing all granted client 314 requests. 316 While RPC calls complete in any order, the current flow control 317 limit at the RPC server is known to the RPC client from the Send 318 ordering properties. It is always the most recent server-granted 319 credit value minus the number of requests in flight. 321 Certain RDMA implementations may impose additional flow control 322 restrictions, such as limits on RDMA Read operations in progress at 323 the responder. Because these operations are outside the scope of 324 this protocol, they are not addressed and SHOULD be provided for by 325 other layers. For example, a simple upper layer RPC consumer might 326 perform single-issue RDMA Read requests, while a more 327 sophisticated, multithreaded RPC consumer might implement its own 328 FIFO queue of such operations. For further discussion of possible 329 protocol implementations capable of negotiating these values, see 330 section 6 "Connection Configuration Protocol" of this draft, or 331 [NFSv4.1]. 333 3.4. XDR Encoding with Chunks 335 The data comprising an RPC call or reply message is marshaled or 336 serialized into a contiguous stream by an XDR routine. XDR data 337 types such as integers, strings, arrays and linked lists are 338 commonly implemented over two very simple functions that encode 339 either an XDR data unit (32 bits) or an array of bytes. 341 Normally, the separate data items in an RPC call or reply are 342 encoded as a contiguous sequence of bytes for network transmission 343 over UDP or TCP. 
However, in the case of an RDMA transport, local 344 routines such as XDR encode can determine that (for instance) an 345 opaque byte array is large enough to be more efficiently moved via 346 an RDMA data transfer operation like RDMA Read or RDMA Write. 348 Semantically speaking, the protocol has no restriction regarding 349 data types which may or may not be represented by a read or write 350 chunk. In practice however, efficiency considerations lead to the 351 conclusion that certain data types are not generally "chunkable". 352 Typically, only opaque and aggregate data types which may attain 353 substantial size are considered to be eligible. With today's 354 hardware this size may be a kilobyte or more. However any object 355 MAY be chosen for chunking in any given message. 357 The eligibility of XDR data items to be candidates for being moved 358 as data chunks (as opposed to being marshaled inline) is not 359 specified by the RPC over RDMA protocol. Chunk eligibility 360 criteria MUST be determined by each upper layer in order to provide 361 for an interoperable specification. One such example with 362 rationale, for the NFS protocol family, is provided in [NFSDDP]. 364 The interface by which an upper layer implementation communicates 365 the eligibility of a data item locally to RPC for chunking is out 366 of scope for this specification. In many implementations, it is 367 possible to implement a transparent RPC chunking facility. 368 However, such implementations may lead to inefficiencies, either 369 because they require the RPC layer to perform expensive 370 registration and deregistration of memory "on the fly", or they may 371 require using RDMA chunks in reply messages, along with the 372 resulting additional handshaking with the RPC over RDMA peer. 373 However, these issues are internal and generally confined to the 374 local interface between RPC and its upper layers, one in which 375 implementations are free to innovate. The only requirement is that 376 the resulting RPC RDMA protocol sent to the peer is valid for the 377 upper layer. See for example [NFSDDP]. 379 When sending any message (request or reply) that contains an 380 eligible large data chunk, the XDR encoding routine avoids moving 381 the data into the XDR stream. Instead, it does not encode the data 382 portion, but records the address and size of each chunk in a 383 separate "read chunk list" encoded within RPC RDMA transport- 384 specific headers. Such chunks will be transferred via RDMA Read 385 operations initiated by the receiver. 387 When the read chunks are to be moved via RDMA, the memory for each 388 chunk is registered. This registration may take place within XDR 389 itself, providing for full transparency to upper layers, or it may 390 be performed by any other specific local implementation. 392 Additionally, when making an RPC call that can result in bulk data 393 transferred in the reply, write chunks MAY be provided to accept 394 the data directly via RDMA Write. These write chunks will 395 therefore be pre-filled by the RPC server prior to responding, and 396 XDR decode of the data at the client will not be required. These 397 chunks undergo a similar registration and advertisement via "write 398 chunk lists" built as a part of XDR encoding. 400 Some RPC client implementations are not able to determine where an 401 RPC call's results reside during the "encode" phase. 
This makes it 402 difficult or impossible for the RPC client layer to encode the 403 write chunk list at the time of building the request. In this 404 case, it is difficult for the RPC implementation to provide 405 transparency to the RPC consumer, which may require recoding to 406 provide result information at this earlier stage. 408 Therefore if the RPC client does not make a write chunk list 409 available to receive the result, then the RPC server MAY return 410 data inline in the reply, or if the upper layer specification 411 permits, it MAY be returned via a read chunk list. It is NOT 412 RECOMMENDED that upper layer RPC client protocol specifications 413 omit write chunk lists for eligible replies, due to the lower 414 performance of the additional handshaking to perform data transfer, 415 and the requirement that the RPC server must expose (and preserve) 416 the reply data for a period of time. In the absence of a server- 417 provided read chunk list in the reply, if the encoded reply 418 overflows the posted receive buffer, the RPC will fail with an RDMA 419 transport error. 421 When any data within a message is provided via either read or write 422 chunks, the chunk itself refers only to the data portion of the XDR 423 stream element. In particular, for counted fields (e.g., a "<>" 424 encoding) the byte count which is encoded as part of the field 425 remains in the XDR stream, and is also encoded in the chunk list. 426 The data portion is however elided from the encoded XDR stream, and 427 is transferred as part of chunk list processing. This is important 428 to maintain upper layer implementation compatibility - both the 429 count and the data must be transferred as part of the logical XDR 430 stream. While the chunk list processing results in the data being 431 available to the upper layer peer for XDR decoding, the length 432 present in the chunk list entries is not. Any byte count in the 433 XDR stream MUST match the sum of the byte counts present in the 434 corresponding read or write chunk list. If they do not agree, an 435 RPC protocol encoding error results. 437 The following items are contained in a chunk list entry. 439 Handle 440 Steering tag or handle obtained when the chunk memory is 441 registered for RDMA. 443 Length 444 The length of the chunk in bytes. 446 Offset 447 The offset or beginning memory address of the chunk. In order 448 to support the widest array of RDMA implementations, as well 449 as the most general steering tag scheme, this field is 450 unconditionally included in each chunk list entry. 452 While zero-based offset schemes are available in many RDMA 453 implementations, their use by RPC requires individual 454 registration of each read or write chunk. On many such 455 implementations this can be a significant overhead. By 456 providing an offset in each chunk, many pre-registration or 457 region-based registrations can be readily supported, and by 458 using a single, universal chunk representation, the RPC RDMA 459 protocol implementation is simplified to its most general 460 form. 462 Position 463 For data which is to be encoded, the position in the XDR 464 stream where the chunk would normally reside. Note that the 465 chunk therefore inserts its data into the XDR stream at this 466 position, but its transfer is no longer "inline". Also note 467 therefore that all chunks belonging to a single RPC argument 468 or result will have the same position. For data which is to 469 be decoded, no position is used. 
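As a non-normative illustration of the rule above that an encoded byte count must equal the sum of the corresponding chunk lengths, the following C sketch models a chunk list entry carrying the fields just described and performs the consistency check; the type and function names are invented for this example and are not part of the protocol.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Illustrative model of one chunk list entry (this section). */
   struct chunk_entry {
       uint32_t handle;    /* steering tag from memory registration */
       uint32_t length;    /* length of the chunk in bytes */
       uint64_t offset;    /* memory address or offset of the chunk */
       uint32_t position;  /* XDR stream position (encode side only) */
   };

   /*
    * A counted field's byte count, which remains in the XDR stream,
    * must equal the sum of the lengths of the chunk entries that
    * carry its data; a mismatch is an RPC encoding error.
    */
   static bool chunk_lengths_match(uint32_t xdr_byte_count,
                                   const struct chunk_entry *entries,
                                   size_t nentries)
   {
       uint64_t sum = 0;
       size_t i;

       for (i = 0; i < nentries; i++)
           sum += entries[i].length;

       return sum == (uint64_t)xdr_byte_count;
   }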
471 When XDR marshaling is complete, the chunk list is XDR encoded, 472 then sent to the receiver prepended to the RPC message. Any source 473 data for a read chunk, or the destination of a write chunk, remain 474 behind in the sender's registered memory and their actual payload 475 is not marshaled into the request or reply. 477 +----------------+----------------+------------- 478 | RPC over RDMA | | 479 | header w/ | RPC Header | Non-chunk args/results 480 | chunks | | 481 +----------------+----------------+------------- 483 Read chunk lists and write chunk lists are structured somewhat 484 differently. This is due to the different usage - read chunks are 485 decoded and indexed by their argument's or result's position in the 486 XDR data stream; their size is always known. Write chunks on the 487 other hand are used only for results, and have neither a 488 preassigned offset in the XDR stream, nor a size until the results 489 are produced, since the buffers may be only partially filled, or 490 may not be used for results at all. Their presence in the XDR 491 stream is therefore not known until the reply is processed. The 492 mapping of Write chunks onto designated NFS procedures and their 493 results is described in [NFSDDP]. 495 Therefore, read chunks are encoded into a read chunk list as a 496 single array, with each entry tagged by its (known) size and its 497 argument's or result's position in the XDR stream. Write chunks 498 are encoded as a list of arrays of RDMA buffers, with each list 499 element (an array) providing buffers for a separate result. 500 Individual write chunk list elements MAY thereby result in being 501 partially or fully filled, or in fact not being filled at all. 502 Unused write chunks, or unused bytes in write chunk buffer lists, 503 are not returned as results, and their memory is returned to the 504 upper layer as part of RPC completion. However, the RPC layer MUST 505 NOT assume that the buffers have not been modified. 507 3.5. XDR Decoding with Read Chunks 509 The XDR decode process moves data from an XDR stream into a data 510 structure provided by the RPC client or server application. Where 511 elements of the destination data structure are buffers or strings, 512 the RPC application can either pre-allocate storage to receive the 513 data, or leave the string or buffer fields null and allow the XDR 514 decode stage of RPC processing to automatically allocate storage of 515 sufficient size. 517 When decoding a message from an RDMA transport, the receiver first 518 XDR decodes the chunk lists from the RPC over RDMA header, then 519 proceeds to decode the body of the RPC message (arguments or 520 results). Whenever the XDR offset in the decode stream matches 521 that of a chunk in the read chunk list, the XDR routine initiates 522 an RDMA Read to bring over the chunk data into locally registered 523 memory for the destination buffer. 525 When processing an RPC request, the RPC receiver (RPC server) 526 acknowledges its completion of use of the source buffers by simply 527 replying to the RPC sender (client), and the peer may then free all 528 source buffers advertised by the request. 530 When processing an RPC reply, after completing such a transfer the 531 RPC receiver (client) MUST issue an RDMA_DONE message (described in 532 Section 3.8) to notify the peer (server) that the source buffers 533 can be freed. 535 The read chunk list is constructed and used entirely within the 536 RPC/XDR layer. 
Other than specifying the minimum chunk size, the 537 management of the read chunk list is automatic and transparent to 538 an RPC application. 540 3.6. XDR Decoding with Write Chunks 542 When a "write chunk list" is provided for the results of the RPC 543 call, the RPC server MUST provide any corresponding data via RDMA 544 Write to the memory referenced in the chunk list entries. The RPC 545 reply conveys this by returning the write chunk list to the client 546 with the lengths rewritten to match the actual transfer. The XDR 547 "decode" of the reply therefore performs no local data transfer but 548 merely returns the length obtained from the reply. 550 Each decoded result consumes one entry in the write chunk list, 551 which in turn consists of an array of RDMA segments. The length is 552 therefore the sum of all returned lengths in all segments 553 comprising the corresponding list entry. As each list entry is 554 "decoded", the entire entry is consumed. 556 The write chunk list is constructed and used by the RPC 557 application. The RPC/XDR layer simply conveys the list between 558 client and server and initiates the RDMA Writes back to the client. 559 The mapping of write chunk list entries to procedure arguments MUST 560 be determined for each protocol. An example of a mapping is 561 described in [NFSDDP]. 563 3.7. XDR Roundup and Chunks 565 The XDR protocol requires 4-byte alignment of each new encoded 566 element in any XDR stream. This requirement is for efficiency and 567 ease of decode/unmarshaling at the receiver - if the XDR stream 568 buffer begins on a native machine boundary, then the XDR elements 569 will lie on similarly predictable offsets in memory. 571 Within XDR, when non-4-byte encodes (such as an odd-length string 572 or bulk data) are marshaled, their length is encoded literally, 573 while their data is padded to begin the next element at a 4-byte 574 boundary in the XDR stream. For TCP or RDMA inline encoding, this 575 minimal overhead is required because the transport-specific framing 576 relies on the fact that the relative offset of the elements in the 577 XDR stream from the start of the message determines the XDR 578 position during decode. 580 On the other hand, RPC/RDMA Read chunks carry the XDR position of 581 each chunked element and length of the Chunk segment, and can be 582 placed by the receiver exactly where they belong in the receiver's 583 memory without regard to the alignment of their position in the XDR 584 stream. Since any rounded-up data is not actually part of the 585 upper layer's message, the receiver will not reference it, and 586 there is no reason to set it to any particular value in the 587 receiver's memory. 589 When roundup is present at the end of a sequence of chunks, the 590 length of the sequence will terminate it at a non-4-byte XDR 591 position. When the receiver proceeds to decode the remaining part 592 of the XDR stream, it inspects the XDR position indicated by the 593 next chunk. Because this position will not match (else roundup 594 would not have occurred), the receiver decoding will fall back to 595 inspecting the remaining inline portion. If, in turn, no data 596 remains to be decoded from the inline portion, then the receiver 597 MUST conclude that roundup is present, and therefore advances the 598 XDR decode position to that indicated by the next chunk (if any). 599 In this way, roundup is passed without ever actually transferring 600 additional XDR bytes.
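A non-normative sketch of this receiver behavior follows. The C fragment below is illustrative only; it assumes the receiver tracks its current XDR decode position, the position carried by the next read chunk, and the count of inline bytes remaining, none of which are defined by this protocol.

   #include <stdbool.h>
   #include <stdint.h>

   /* XDR: every element begins on a 4-byte boundary. */
   static uint32_t xdr_roundup(uint32_t len)
   {
       return (len + 3u) & ~(uint32_t)3;
   }

   /*
    * Illustrative decode step: if the next read chunk's XDR position
    * does not match the current decode position and no inline data
    * remains, the gap is roundup; advance the position without
    * transferring any bytes.
    */
   static bool skip_roundup(uint32_t *xdr_pos,
                            uint32_t next_chunk_pos,
                            uint32_t inline_bytes_left)
   {
       if (*xdr_pos == next_chunk_pos)
           return false;              /* positions agree; no roundup */
       if (inline_bytes_left != 0)
           return false;              /* decode the inline data first */
       *xdr_pos = next_chunk_pos;     /* conclude roundup and advance */
       return true;
   }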
602 Some protocol operations over RPC/RDMA, for instance NFS writes of 603 data encountered at the end of a file or in direct I/O situations, 604 commonly yield these roundups within RDMA Read Chunks. Because any 605 roundup bytes are not actually present in the data buffers being 606 written, memory for these bytes would come from noncontiguous 607 buffers, either as an additional memory registration segment, or as 608 an additional Chunk. The overhead of these operations can be 609 significant for the sender, which must marshal them, and even higher for 610 the receiver, to which they must be transferred. Senders SHOULD therefore 611 avoid encoding individual RDMA Read Chunks for roundup whenever 612 possible. It is acceptable, but not necessary, to include roundup 613 data in an existing RDMA Read Chunk, but only if it is already 614 present in the XDR stream to carry upper layer data. 616 Note that there is no exposure of additional data at the sender due 617 to eliding roundup data from the XDR stream, since any additional 618 sender buffers are never exposed to the peer. The data is 619 literally not there to be transferred. 621 For RDMA Write Chunks, a simpler encoding method applies. Again, 622 roundup bytes are not transferred; instead, the chunk length sent to 623 the receiver in the reply is simply increased to include any 624 roundup. Because of the requirement that the RDMA Write chunks are 625 filled sequentially without gaps, this situation can only occur on 626 the final chunk receiving data. Therefore, there is no opportunity 627 for roundup data to insert misalignment or positional gaps into the 628 XDR stream. 630 3.8. RPC Call and Reply 632 The RDMA transport for RPC provides three methods of moving data 633 between RPC client and server: 635 Inline 636 Data are moved between RPC client and server within an RDMA 637 Send. 639 RDMA Read 640 Data are moved between RPC client and server via an RDMA Read 641 operation via steering tag, address and offset obtained from a 642 read chunk list. 644 RDMA Write 645 Result data is moved from RPC server to client via an RDMA 646 Write operation via steering tag, address and offset obtained 647 from a write chunk list or reply chunk in the client's RPC 648 call message. 650 These methods of data movement may occur in combinations within a 651 single RPC. For instance, an RPC call may contain some inline data 652 along with some large chunks to be transferred via RDMA Read to the 653 server. The reply to that call may have some result chunks that 654 the server RDMA Writes back to the client. The following protocol 655 interactions illustrate RPC calls that use these methods to move 656 RPC message data: 658 An RPC with write chunks in the call message: 660 RPC Client RPC Server 661 | RPC Call + Write Chunk list | 662 Send | ------------------------------> | 663 | | 664 | Chunk 1 | 665 | <------------------------------ | Write 666 | : | 667 | Chunk n | 668 | <------------------------------ | Write 669 | | 670 | RPC Reply | 671 | <------------------------------ | Send 673 In the presence of write chunks, RDMA ordering provides the 674 guarantee that all data in the RDMA Write operations has been 675 placed in memory prior to the client's RPC reply processing.
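A non-normative sketch of the server-side sequence underlying this guarantee follows; the rdma_write_chunk and rdma_send helpers are hypothetical stand-ins for a local RDMA provider interface and are stubbed out here.

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical provider interface; not defined by this protocol. */
   struct rdma_conn;
   struct chunk_target { uint32_t handle; uint32_t length; uint64_t offset; };

   static int rdma_write_chunk(struct rdma_conn *c,
                               const struct chunk_target *dst,
                               const void *src, uint32_t len)
   {
       (void)c; (void)dst; (void)src; (void)len;
       return 0;  /* placeholder: a real provider posts an RDMA Write */
   }

   static int rdma_send(struct rdma_conn *c, const void *msg, uint32_t len)
   {
       (void)c; (void)msg; (void)len;
       return 0;  /* placeholder: a real provider posts an RDMA Send */
   }

   /*
    * Illustrative server reply path: RDMA Write each result chunk,
    * then Send the reply.  Because Writes are ordered ahead of a
    * subsequent Send, the client's receive completion for the reply
    * implies the chunk data has already been placed in its memory.
    */
   static int reply_with_write_chunks(struct rdma_conn *c,
                                      const struct chunk_target *chunks,
                                      const void *const *results,
                                      const uint32_t *lengths, size_t n,
                                      const void *reply, uint32_t reply_len)
   {
       size_t i;

       for (i = 0; i < n; i++) {
           int rc = rdma_write_chunk(c, &chunks[i], results[i], lengths[i]);
           if (rc != 0)
               return rc;
       }
       return rdma_send(c, reply, reply_len);
   }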
677 An RPC with read chunks in the call message: 679 RPC Client RPC Server 680 | RPC Call + Read Chunk list | 681 Send | ------------------------------> | 682 | | 683 | Chunk 1 | 684 | +------------------------------ | Read 685 | v-----------------------------> | 686 | : | 687 | Chunk n | 688 | +------------------------------ | Read 689 | v-----------------------------> | 690 | | 691 | RPC Reply | 692 | <------------------------------ | Send 694 An RPC with read chunks in the reply message: 696 RPC Client RPC Server 697 | RPC Call | 698 Send | ------------------------------> | 699 | | 700 | RPC Reply + Read Chunk list | 701 | <------------------------------ | Send 702 | | 703 | Chunk 1 | 704 Read | ------------------------------+ | 705 | <-----------------------------v | 706 | : | 707 | Chunk n | 708 Read | ------------------------------+ | 709 | <-----------------------------v | 710 | | 711 | Done | 712 Send | ------------------------------> | 714 The final Done message allows the RPC client to signal the server 715 that it has received the chunks, so the server can de-register and 716 free the memory holding the chunks. A Done completion is not 717 necessary for an RPC call, since the RPC reply Send is itself a 718 receive completion notification. In the event that the client 719 fails to return the Done message within some timeout period, the 720 server MAY conclude that a protocol violation has occurred and 721 close the RPC connection, or it MAY proceed with a de-register and 722 free its chunk buffers. This may result in a fatal RDMA error if 723 the client later attempts to perform an RDMA Read operation, which 724 amounts to the same thing. 726 The use of read chunks in RPC reply messages is much less efficient 727 than providing write chunks in the originating RPC calls, due to 728 the additional message exchanges, the need for the RPC server to 729 advertise buffers to the peer, the necessity of the server 730 maintaining a timer for the purpose of recovery from misbehaving 731 clients, and the need for additional memory registration. Their 732 use is NOT RECOMMENDED by upper layers where efficiency is a 733 primary concern. [NFSDDP] However, they MAY be employed by upper 734 layer protocol bindings which are primarily concerned with 735 transparency, since they can frequently be implemented completely 736 within the RPC lower layers. 738 It is important to note that the Done message consumes a credit at 739 the RPC server. The RPC server SHOULD provide sufficient credits 740 to the client to allow the Done message to be sent without deadlock 741 (driving the outstanding credit count to zero). The RPC client 742 MUST account for its required Done messages to the server in its 743 accounting of available credits, and the server SHOULD replenish 744 any credit consumed by its use of such exchanges at its earliest 745 opportunity. 747 Finally, it is possible to conceive of RPC exchanges that involve 748 any or all combinations of write chunks in the RPC call, read 749 chunks in the RPC call, and read chunks in the RPC reply. Support 750 for such exchanges is straightforward from a protocol perspective, 751 but in practice such exchanges would be quite rare, limited to 752 upper layer protocol exchanges which transferred bulk data in both 753 the call and corresponding reply. 755 3.9. Padding 757 Alignment of specific opaque data enables certain scatter/gather 758 optimizations. 
Padding leverages the useful property that RDMA 759 transfers preserve alignment of data, even when they are placed 760 into pre-posted receive buffers by Sends. 762 Many servers can make good use of such padding. Padding allows the 763 chaining of RDMA receive buffers such that any data transferred by 764 RDMA on behalf of RPC requests will be placed into appropriately 765 aligned buffers on the system that receives the transfer. In this 766 way, the need for servers to perform RDMA Read to satisfy all but 767 the largest client writes is obviated. 769 The effect of padding is demonstrated below showing prior bytes on 770 an XDR stream (XXX) followed by an opaque field consisting of four 771 length bytes (LLLL) followed by data bytes (DDDD). The receiver of 772 the RDMA Send has posted two chained receive buffers. Without 773 padding, the opaque data is split across the two buffers. With the 774 addition of padding bytes (ppp) prior to the first data byte, the 775 data can be forced to align correctly in the second buffer. 777 Buffer 1 Buffer 2 778 Unpadded -------------- -------------- 780 XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD 782 Padded 784 XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD 786 Padding is implemented completely within the RDMA transport 787 encoding, flagged with a specific message type. Where padding is 788 applied, two values are passed to the peer: an "rdma_align" which 789 is the padding value used, and "rdma_thresh", which is the opaque 790 data size at or above which padding is applied. For instance, if 791 the server is using chained 4 KB receive buffers, then up to (4 KB 792 - 1) padding bytes could be used to achieve alignment of the data. 793 The XDR routine at the peer MUST consult these values when decoding 794 opaque values. Where the decoded length exceeds the rdma_thresh, 795 the XDR decode MUST skip over the appropriate padding as indicated 796 by rdma_align and the current XDR stream position. 798 4. RPC RDMA Message Layout 800 RPC call and reply messages are conveyed across an RDMA transport 801 with a prepended RPC over RDMA header. The RPC over RDMA header 802 includes data for RDMA flow control credits, padding parameters and 803 lists of addresses that provide direct data placement via RDMA Read 804 and Write operations. The layout of the RPC message itself is 805 unchanged from that described in [RFC1831bis] except for the 806 possible exclusion of large data chunks that will be moved by RDMA 807 Read or Write operations. If the RPC message (along with the RPC 808 over RDMA header) is too long for the posted receive buffer (even 809 after any large chunks are removed), then the entire RPC message 810 MAY be moved separately as a chunk, leaving just the RPC over RDMA 811 header in the RDMA Send. 813 4.1. RPC over RDMA Header 815 The RPC over RDMA header begins with four 32-bit fields that are 816 always present and which control the RDMA interaction including 817 RDMA-specific flow control. These are then followed by a number of 818 items such as chunk lists and padding which MAY or MUST NOT be 819 present depending on the type of transmission. The four fields 820 which are always present are: 822 1. Transaction ID (XID). 823 The XID generated for the RPC call and reply. Having the XID 824 at the beginning of the message makes it easy to establish the 825 message context. This XID MUST be the same as the XID in the 826 RPC header. 
The receiver MAY perform its processing based 827 solely on the XID in the RPC over RDMA header, and thereby 828 ignore the XID in the RPC header, if it so chooses. 830 2. Version number. 831 This version of the RPC RDMA message protocol is 1. The 832 version number MUST be increased by one whenever the format of 833 the RPC RDMA messages is changed. 835 3. Flow control credit value. 836 When sent in an RPC call message, the requested value is 837 provided. When sent in an RPC reply message, the granted 838 value is returned. RPC calls SHOULD NOT be sent in excess of 839 the currently granted limit. 841 4. Message type. 843 o RDMA_MSG = 0 indicates that chunk lists and RPC message 844 follow. 846 o RDMA_NOMSG = 1 indicates that after the chunk lists there 847 is no RPC message. In this case, the chunk lists provide 848 information to allow the message proper to be transferred 849 using RDMA Read or Write and thus is not appended to the 850 RPC over RDMA header. 852 o RDMA_MSGP = 2 indicates that a chunk list and RPC message 853 with some padding follow. 855 o RDMA_DONE = 3 indicates that the message signals the 856 completion of a chunk transfer via RDMA Read. 858 o RDMA_ERROR = 4 is used to signal any detected error(s) in 859 the RPC RDMA chunk encoding. 861 Because the version number is encoded as part of this header, and 862 the RDMA_ERROR message type is used to indicate errors, these first 863 four fields and the start of the following message body MUST always 864 remain aligned at these fixed offsets for all versions of the RPC 865 over RDMA header. 867 For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write 868 chunk lists follow. If the Read chunk list is null (a 32-bit word 869 of zeros), then there are no chunks to be transferred separately 870 and the RPC message follows in its entirety. If non-null, then 871 it is the beginning of an XDR encoded sequence of Read chunk list 872 entries. If the Write chunk list is non-null, then an XDR encoded 873 sequence of Write chunk entries follows. 875 If the message type is RDMA_MSGP, then two additional fields that 876 specify the padding alignment and threshold are inserted prior to 877 the Read and Write chunk lists. 879 A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by 880 the RPC call or RPC reply message body, beginning with the XID. 881 The XID in the RDMA_MSG or RDMA_MSGP header MUST match this. 883 +--------+---------+---------+-----------+-------------+---------- 884 | | | | Message | NULLs | RPC Call 885 | XID | Version | Credits | Type | or | or 886 | | | | | Chunk Lists | Reply Msg 887 +--------+---------+---------+-----------+-------------+---------- 889 Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or 890 RPC message follows. As an implementation hint: a gather operation 891 on the Send of the RDMA RPC message can be used to marshal the 892 initial header, the chunk list, and the RPC message itself. 894 4.2. RPC over RDMA header errors 896 When a peer receives an RPC RDMA message, it MUST perform the 897 following basic validity checks on the header and chunk contents. 898 If such errors are detected in the request, an RDMA_ERROR reply 899 MUST be generated. 901 Two types of errors are defined: version mismatch and invalid chunk 902 format.
When the peer detects an RPC over RDMA header version 903 which it does not support (currently this draft defines only 904 version 1), it replies with an error code of ERR_VERS, and provides 905 the low and high inclusive version numbers it does, in fact, 906 support. The version number in this reply MUST be any value 907 otherwise valid at the receiver. When other decoding errors are 908 detected in the header or chunks, either an RPC decode error MAY be 909 returned, or the RPC/RDMA error code ERR_CHUNK MUST be returned. 911 4.3. XDR Language Description 913 Here is the message layout in XDR language. 915 struct xdr_rdma_segment { 916 uint32 handle; /* Registered memory handle */ 917 uint32 length; /* Length of the chunk in bytes */ 918 uint64 offset; /* Chunk virtual address or offset */ 919 }; 921 struct xdr_read_chunk { 922 uint32 position; /* Position in XDR stream */ 923 struct xdr_rdma_segment target; 924 }; 926 struct xdr_read_list { 927 struct xdr_read_chunk entry; 928 struct xdr_read_list *next; 929 }; 931 struct xdr_write_chunk { 932 struct xdr_rdma_segment target<>; 933 }; 935 struct xdr_write_list { 936 struct xdr_write_chunk entry; 937 struct xdr_write_list *next; 938 }; 940 struct rdma_msg { 941 uint32 rdma_xid; /* Mirrors the RPC header xid */ 942 uint32 rdma_vers; /* Version of this protocol */ 943 uint32 rdma_credit; /* Buffers requested/granted */ 944 rdma_body rdma_body; 945 }; 947 enum rdma_proc { 948 RDMA_MSG=0, /* An RPC call or reply msg */ 949 RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */ 950 RDMA_MSGP=2, /* An RPC call or reply msg with padding */ 951 RDMA_DONE=3, /* Client signals reply completion */ 952 RDMA_ERROR=4 /* An RPC RDMA encoding error */ 953 }; 954 union rdma_body switch (rdma_proc proc) { 955 case RDMA_MSG: 956 rpc_rdma_header rdma_msg; 957 case RDMA_NOMSG: 958 rpc_rdma_header_nomsg rdma_nomsg; 959 case RDMA_MSGP: 960 rpc_rdma_header_padded rdma_msgp; 961 case RDMA_DONE: 962 void; 963 case RDMA_ERROR: 964 rpc_rdma_error rdma_error; 965 }; 967 struct rpc_rdma_header { 968 struct xdr_read_list *rdma_reads; 969 struct xdr_write_list *rdma_writes; 970 struct xdr_write_chunk *rdma_reply; 971 /* rpc body follows */ 972 }; 974 struct rpc_rdma_header_nomsg { 975 struct xdr_read_list *rdma_reads; 976 struct xdr_write_list *rdma_writes; 977 struct xdr_write_chunk *rdma_reply; 978 }; 980 struct rpc_rdma_header_padded { 981 uint32 rdma_align; /* Padding alignment */ 982 uint32 rdma_thresh; /* Padding threshold */ 983 struct xdr_read_list *rdma_reads; 984 struct xdr_write_list *rdma_writes; 985 struct xdr_write_chunk *rdma_reply; 986 /* rpc body follows */ 987 }; 988 enum rpc_rdma_errcode { 989 ERR_VERS = 1, 990 ERR_CHUNK = 2 991 }; 993 union rpc_rdma_error switch (rpc_rdma_errcode err) { 994 case ERR_VERS: 995 uint32 rdma_vers_low; 996 uint32 rdma_vers_high; 997 case ERR_CHUNK: 998 void; 999 default: 1000 uint32 rdma_extra[8]; 1001 }; 1003 5. Long Messages 1005 The receiver of RDMA Send messages is required by RDMA to have 1006 previously posted one or more adequately sized buffers. The RPC 1007 client can inform the server of the maximum size of its RDMA Send 1008 messages via the Connection Configuration Protocol described later 1009 in this document. 1011 Since RPC messages are frequently small, memory savings can be 1012 achieved by posting small buffers. Even large messages like NFS 1013 READ or WRITE will be quite small once the chunks are removed from 1014 the message.
However, there may be large messages that would 1015 demand a very large buffer be posted, where the contents of the 1016 buffer may not be a chunkable XDR element. A good example is an 1017 NFS READDIR reply which may contain a large number of small 1018 filename strings. Also, the NFS version 4 protocol [RFC3530] 1019 features COMPOUND request and reply messages of unbounded length. 1021 Ideally, each upper layer will negotiate these limits. However, it 1022 is frequently necessary to provide a transparent solution. 1024 5.1. Message as an RDMA Read Chunk 1026 One relatively simple method is to have the client identify any RPC 1027 message that exceeds the RPC server's posted buffer size and move 1028 it separately as a chunk, i.e., reference it as the first entry in 1029 the read chunk list with an XDR position of zero. 1031 Normal Message 1033 +--------+---------+---------+------------+-------------+---------- 1034 | | | | | | RPC Call 1035 | XID | Version | Credits | RDMA_MSG | Chunk Lists | or 1036 | | | | | | Reply Msg 1037 +--------+---------+---------+------------+-------------+---------- 1039 Long Message 1041 +--------+---------+---------+------------+-------------+ 1042 | | | | | | 1043 | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | 1044 | | | | | | 1045 +--------+---------+---------+------------+-------------+ 1046 | 1047 | +---------- 1048 | | Long RPC Call 1049 +->| or 1050 | Reply Message 1051 +---------- 1053 If the receiver gets an RPC over RDMA header with a message type of 1054 RDMA_NOMSG and finds an initial read chunk list entry with a zero 1055 XDR position, it allocates a registered buffer and issues an RDMA 1056 Read of the long RPC message into it. The receiver then proceeds 1057 to XDR decode the RPC message as if it had received it inline with 1058 the Send data. Further decoding may issue additional RDMA Reads to 1059 bring over additional chunks. 1061 Although the handling of long messages requires one extra network 1062 turnaround, in practice these messages will be rare if the posted 1063 receive buffers are correctly sized, and of course they will be 1064 non-existent for RDMA-aware upper layers. 1066 A long call RPC with request supplied via RDMA Read 1068 RPC Client RPC Server 1069 | RDMA over RPC Header | 1070 Send | ------------------------------> | 1071 | | 1072 | Long RPC Call Msg | 1073 | +------------------------------ | Read 1074 | v-----------------------------> | 1075 | | 1076 | RDMA over RPC Reply | 1077 | <------------------------------ | Send 1079 An RPC with long reply returned via RDMA Read 1081 RPC Client RPC Server 1082 | RPC Call | 1083 Send | ------------------------------> | 1084 | | 1085 | RDMA over RPC Header | 1086 | <------------------------------ | Send 1087 | | 1088 | Long RPC Reply Msg | 1089 Read | ------------------------------+ | 1090 | <-----------------------------v | 1091 | | 1092 | Done | 1093 Send | ------------------------------> | 1095 It is possible for a single RPC procedure to employ both a long 1096 call for its arguments, and a long reply for its results. However, 1097 such an operation is atypical, as few upper layers define such 1098 exchanges. 1100 5.2. RDMA Write of Long Replies (Reply Chunks) 1102 A superior method of handling long RPC replies is to have the RPC 1103 client post a large buffer into which the server can write a large 1104 RPC reply. 
This has the advantage that an RDMA Write may be 1105 slightly faster in network latency than an RDMA Read, and does not 1106 require the server to wait for the completion as it must for RDMA 1107 Read. Additionally, for a reply it removes the need for the 1108 RDMA_DONE message that would be required if the large reply were returned as a Read chunk. 1110 This protocol supports direct return of a large reply via the 1111 inclusion of an OPTIONAL rdma_reply write chunk after the read 1112 chunk list and the write chunk list. The client allocates a buffer 1113 sized to receive a large reply and enters its steering tag, address 1114 and length in the rdma_reply write chunk. If the reply message is 1115 too long to return inline with an RDMA Send (exceeds the size of 1116 the client's posted receive buffer), even with read chunks removed, 1117 then the RPC server performs an RDMA Write of the RPC reply message 1118 into the buffer indicated by the rdma_reply chunk. If the client 1119 does not provide an rdma_reply chunk, or if it is too small, then if 1120 the upper layer specification permits, the message MAY be returned 1121 as a Read chunk. 1123 An RPC with long reply returned via RDMA Write 1125 RPC Client RPC Server 1126 | RPC Call with rdma_reply | 1127 Send | ------------------------------> | 1128 | | 1129 | Long RPC Reply Msg | 1130 | <------------------------------ | Write 1131 | | 1132 | RDMA over RPC Header | 1133 | <------------------------------ | Send 1135 The use of RDMA Write to return long replies requires that the 1136 client application anticipate a long reply and have some knowledge 1137 of its size so that an adequately sized buffer can be allocated. 1138 This is certainly true of NFS READDIR replies, where the client 1139 already provides an upper bound on the size of the encoded 1140 directory fragment to be returned by the server. 1142 The use of these "reply chunks" is highly efficient and convenient 1143 for both RPC client and server. Their use is encouraged for 1144 eligible RPC operations such as NFS READDIR, which would otherwise 1145 require extensive chunk management within the results or use of 1146 RDMA Read and a Done message. [NFSDDP] 1148 6. Connection Configuration Protocol 1150 RDMA Send operations require the receiver to post one or more 1151 buffers at the RDMA connection endpoint, each large enough to 1152 receive the largest Send message. Buffers are consumed as Send 1153 messages are received. If a buffer is too small, or if there are 1154 no buffers posted, the RDMA transport MAY return an error and break 1155 the RDMA connection. The receiver MUST post sufficient, adequately 1156 sized buffers to avoid buffer overrun or capacity errors. 1158 The protocol described above includes only a mechanism for managing 1159 the number of such receive buffers, and no explicit features to 1160 allow the RPC client and server to provision or control buffer 1161 sizing, nor any other session parameters. 1163 In the past, this type of connection management has not been 1164 necessary for RPC. RPC over UDP or TCP does not have a protocol to 1165 negotiate the link. The server can get a rough idea of the maximum 1166 size of messages from the server protocol code. However, a 1167 protocol to negotiate transport features on a more dynamic basis is 1168 desirable. 1170 The Connection Configuration Protocol allows the client to pass its 1171 connection requirements to the server, and allows the server to 1172 inform the client of its connection limits.
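A non-normative sketch of such an exchange follows, using the config_rdma_req and config_rdma_reply structures defined in Section 6.2 below; the ccp_call helper and the numeric values are invented for this example.

   #include <stdint.h>

   typedef uint32_t uint32;

   /* Request and reply bodies, as defined in Section 6.2. */
   struct config_rdma_req {
       uint32 maxcall_sendsize;    /* max size of inline RPC call */
       uint32 maxreply_sendsize;   /* max size of inline RPC reply */
       uint32 maxrdmaread;         /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
       uint32 maxcall_sendsize;    /* max call size accepted by server */
       uint32 align;               /* server's receive buffer alignment */
       uint32 maxrdmaread;         /* max active RDMA Reads at server */
   };

   /* Hypothetical stand-in for issuing the CONF_RDMA procedure. */
   static struct config_rdma_reply ccp_call(const struct config_rdma_req *req)
   {
       /* A real implementation sends the request as RPC program
        * 100400, procedure 1, and decodes the server's reply;
        * this stub merely echoes the client's proposal. */
       struct config_rdma_reply rep = { req->maxcall_sendsize, 0,
                                        req->maxrdmaread };
       return rep;
   }

   /* Illustrative negotiation: propose limits, then adopt the reply. */
   static void negotiate(uint32 *inline_call_limit, uint32 *reads_limit)
   {
       struct config_rdma_req req = { 4096, 4096, 8 };  /* example values */
       struct config_rdma_reply rep = ccp_call(&req);

       *inline_call_limit = rep.maxcall_sendsize;  /* largest inline call */
       *reads_limit = rep.maxrdmaread;  /* Reads the client may have
                                           outstanding at the server */
   }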
1148 6. Connection Configuration Protocol

1150 RDMA Send operations require the receiver to post one or more 1151 buffers at the RDMA connection endpoint, each large enough to 1152 receive the largest Send message. Buffers are consumed as Send 1153 messages are received. If a buffer is too small, or if there are 1154 no buffers posted, the RDMA transport MAY return an error and break 1155 the RDMA connection. The receiver MUST post sufficient, adequately 1156 sized buffers to avoid buffer overrun or capacity errors.

1158 The protocol described above includes only a mechanism for managing 1159 the number of such receive buffers, and no explicit features to 1160 allow the RPC client and server to provision or control buffer 1161 sizing, nor any other session parameters.

1163 In the past, this type of connection management has not been 1164 necessary for RPC. RPC over UDP or TCP does not have a protocol to 1165 negotiate the link. The server can get a rough idea of the maximum 1166 size of messages from the server protocol code. However, a 1167 protocol to negotiate transport features on a more dynamic basis is 1168 desirable.

1170 The Connection Configuration Protocol allows the client to pass its 1171 connection requirements to the server, and allows the server to 1172 inform the client of its connection limits.

1174 Use of the Connection Configuration Protocol by an upper layer is 1175 OPTIONAL.

1177 6.1. Initial Connection State

1179 This protocol MAY be used for connection setup prior to the use of 1180 another RPC protocol that uses the RDMA transport. It operates in- 1181 band, i.e., it uses the connection itself to negotiate the 1182 connection parameters. To provide a basis for connection 1183 negotiation, the connection is assumed to provide a basic level of 1184 interoperability: the ability to exchange at least one RPC message 1185 at a time that is at least 1 KB in size. The server MAY exceed 1186 this basic level of configuration, but the client MUST NOT assume 1187 more than one, and MUST receive a valid reply from the server 1188 carrying the actual number of available receive messages, prior to 1189 sending its next request.

1191 6.2. Protocol Description

1193 Version 1 of the Connection Configuration protocol consists of a 1194 single procedure that allows the client to inform the server of its 1195 connection requirements and the server to return connection 1196 information to the client.

1198 The maxcall_sendsize argument is the maximum size of an RPC call 1199 message that the client MAY send inline in an RDMA Send message to 1200 the server. The server MAY return a maxcall_sendsize value that is 1201 smaller or larger than the client's request. The client MUST NOT 1202 send an inline call message larger than what the server will 1203 accept. The maxcall_sendsize limits only the size of inline RPC 1204 calls. It does not limit the size of long RPC messages transferred 1205 as an initial chunk in the Read chunk list.

1207 The maxreply_sendsize is the maximum size of an inline RPC message 1208 that the client will accept from the server.

1210 The maxrdmaread is the maximum number of RDMA Reads which may be 1211 active at the peer. This number corresponds to the incoming 1212 RDMA Read count ("IRD") configured into each originating endpoint 1213 by the client or server. If more than this number of RDMA Read 1214 operations by the connected peer are issued simultaneously, 1215 connection loss or suboptimal flow control may result; therefore, 1216 the value SHOULD be observed at all times. The peers' values need 1217 not be equal. If zero, the peer MUST NOT issue requests which 1218 require RDMA Read to satisfy, as no transfer will be possible.

1220 The align value is the alignment recommended by the server for opaque 1221 data values such as strings and counted byte arrays. The client 1222 MAY use this value to compute the number of prepended pad bytes 1223 when XDR encoding opaque values in the RPC call message.

1225     typedef unsigned int uint32;

1227     struct config_rdma_req {
1228             uint32  maxcall_sendsize;
1229                     /* max size of inline RPC call */
1230             uint32  maxreply_sendsize;
1231                     /* max size of inline RPC reply */
1232             uint32  maxrdmaread;
1233                     /* max active RDMA Reads at client */
1234     };

1236     struct config_rdma_reply {
1237             uint32  maxcall_sendsize;
1238                     /* max call size accepted by server */
1239             uint32  align;
1240                     /* server's receive buffer alignment */
1241             uint32  maxrdmaread;
1242                     /* max active RDMA Reads at server */
1243     };

1245     program CONFIG_RDMA_PROG {
1246             version VERS1 {
1247                     /*
1248                      * Config call/reply
1249                      */
1250                     config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
1251             } = 1;
1252     } = 100400;
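As a non-normative illustration, the following C fragment shows how a client might invoke the procedure above and apply the results. It assumes the XDR has been processed by rpcgen to produce the config_rdma_req and config_rdma_reply types and a conventional client stub conf_rdma_1(); the "config_rdma.h" header name, the rdma_session structure, and its fields are local assumptions, not protocol requirements.

      /*
       * Non-normative sketch.  Assumes rpcgen output for the XDR above
       * (types and a conf_rdma_1() client stub); the rdma_session
       * structure and header name are local inventions.
       */
      #include <rpc/rpc.h>
      #include "config_rdma.h"       /* hypothetical rpcgen-generated header */

      struct rdma_session {
          unsigned int send_inline_max;  /* largest inline call we will Send */
          unsigned int recv_inline_max;  /* largest inline reply we accept */
          unsigned int my_ird;           /* RDMA Reads we can service at once */
          unsigned int peer_read_limit;  /* RDMA Reads we may issue to server */
          unsigned int pad_align;        /* server-recommended opaque alignment */
      };

      int negotiate_connection(CLIENT *clnt, struct rdma_session *sess)
      {
          config_rdma_req req;
          config_rdma_reply *rep;

          req.maxcall_sendsize  = sess->send_inline_max;
          req.maxreply_sendsize = sess->recv_inline_max;
          req.maxrdmaread       = sess->my_ird;

          rep = conf_rdma_1(&req, clnt);
          if (rep == NULL)
              return -1;

          /* Never send an inline call larger than the server will accept. */
          if (rep->maxcall_sendsize < sess->send_inline_max)
              sess->send_inline_max = rep->maxcall_sendsize;

          /* Never exceed the server's incoming RDMA Read limit ("IRD"). */
          sess->peer_read_limit = rep->maxrdmaread;

          /* Optionally pad opaque data to the server's preferred alignment. */
          sess->pad_align = rep->align;
          return 0;
      }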
1254 7. Memory Registration Overhead

1256 RDMA requires that all data be transferred between registered 1257 memory regions at the source and destination. All protocol headers 1258 as well as separately transferred data chunks use registered 1259 memory. Since the cost of registering and de-registering memory 1260 can be a large proportion of the RDMA transaction cost, it is 1261 important to minimize registration activity. This is easily 1262 achieved within RPC-controlled memory by allocating chunk list data 1263 and RPC headers in a reusable way from pre-registered pools.

1265 The data chunks transferred via RDMA MAY occupy memory that 1266 persists outside the bounds of the RPC transaction. Hence, the 1267 default behavior of an RPC over RDMA transport is to register and 1268 de-register these chunks on every transaction. However, this is 1269 a limitation not of the protocol, but only of the existing local RPC 1270 API. The API is easily extended through such functions as 1271 rpc_control(3) to change the default behavior so that the 1272 application can assume responsibility for controlling memory 1273 registration through an RPC-provided registered memory allocator.
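The following non-normative C fragment sketches the kind of pre-registered pool suggested above for RPC headers and chunk list data. The xprt_mr_register() and xprt_mr_deregister() routines stand in for whatever registration interface the local RDMA provider offers, and the single-threaded free list is only an implementation convenience; none of this is required by the protocol.

      /*
       * Non-normative sketch: a pool of buffers registered once at
       * transport setup, so per-RPC allocation avoids per-transaction
       * memory registration.  xprt_mr_* are hypothetical provider hooks.
       */
      #include <stdlib.h>
      #include <stddef.h>

      #define POOL_BUFS    32
      #define POOL_BUFSIZE 1024      /* large enough for header + chunk lists */

      struct pool_buf {
          struct pool_buf *next;     /* free-list linkage */
          void            *mr;       /* provider registration handle */
          char             data[POOL_BUFSIZE];
      };

      static struct pool_buf *free_list;

      void *xprt_mr_register(void *addr, size_t len);   /* assumed hook */
      void  xprt_mr_deregister(void *mr);               /* assumed hook */

      /* Register every buffer once, at transport setup time. */
      int pool_init(void)
      {
          for (int i = 0; i < POOL_BUFS; i++) {
              struct pool_buf *b = malloc(sizeof(*b));
              if (b == NULL)
                  return -1;
              b->mr = xprt_mr_register(b->data, POOL_BUFSIZE);
              b->next = free_list;
              free_list = b;
          }
          return 0;
      }

      /* Per-RPC allocation costs a list operation, not a registration. */
      struct pool_buf *pool_get(void)
      {
          struct pool_buf *b = free_list;
          if (b != NULL)
              free_list = b->next;
          return b;
      }

      void pool_put(struct pool_buf *b)
      {
          b->next = free_list;
          free_list = b;
      }

      /* Tear the pool down, de-registering each buffer exactly once. */
      void pool_destroy(void)
      {
          struct pool_buf *b;

          while ((b = free_list) != NULL) {
              free_list = b->next;
              xprt_mr_deregister(b->mr);
              free(b);
          }
      }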
1275 8. Errors and Error Recovery

1277 RPC RDMA protocol errors are described in section 4. RPC errors 1278 and RPC error recovery are not affected by the protocol, and 1279 proceed as for any RPC error condition. RDMA Transport error 1280 reporting and recovery are outside the scope of this protocol.

1282 It is assumed that the link itself will provide some degree of 1283 error detection and retransmission. iWARP's MPA layer (when used 1284 over TCP), SCTP, and the Infiniband link layer all provide 1285 CRC protection of the RDMA payload, and CRC-class protection is a 1286 general attribute of such transports. Additionally, the RPC layer 1287 itself can accept errors from the link level and recover via 1288 retransmission. RPC recovery can handle complete loss and re- 1289 establishment of the link.

1291 See section 11 for further discussion of the use of RPC-level 1292 integrity schemes to detect errors, and related efficiency issues.

1294 9. Node Addressing

1296 In setting up a new RDMA connection, the first action by an RPC 1297 client will be to obtain a transport address for the server. The 1298 mechanism used to obtain this address and to open an RDMA 1299 connection is dependent on the type of RDMA transport, and is the 1300 responsibility of each RPC protocol binding and its local 1301 implementation.

1303 10. RPC Binding

1305 RPC services normally register with a portmap or rpcbind [RFC1833] 1306 service, which associates an RPC program number with a service 1307 address. (In the case of UDP or TCP, the service address for NFS 1308 is normally port 2049.) This policy is no different with RDMA 1309 interconnects, although it may require the allocation of port 1310 numbers appropriate to each upper layer binding that uses the RPC 1311 framing defined here.

1313 When mapped atop the iWARP [RDDP] transport, which uses IP port 1314 addressing due to its layering on TCP and/or SCTP, port mapping is 1315 trivial and consists merely of issuing the port in the connection 1316 process.

1318 When mapped atop Infiniband [IB], which uses a GID-based service 1319 endpoint naming scheme, a translation MUST be employed. One such 1320 translation is defined in the Infiniband Port Addressing Annex 1321 [IBPORT], which is appropriate for translating IP port addressing 1322 to the Infiniband network. Therefore, in this case, IP port 1323 addressing may be readily employed by the upper layer.

1325 When a mapping standard or convention exists for IP ports on an 1326 RDMA interconnect, there are several possibilities for each upper 1327 layer to consider:

1329    One possibility is to have an upper layer server register its 1330 mapped IP port with the rpcbind service, under the netid (or 1331 netids) defined here. An RPC/RDMA-aware client can then 1332 resolve its desired service to a mappable port, and proceed to 1333 connect. This is the most flexible and compatible approach, 1334 for those upper layers which are defined to use the rpcbind 1335 service.

1337    A second possibility is to have the server's portmapper 1338 register itself on the RDMA interconnect at a "well known" 1339 service address. (On UDP or TCP, this corresponds to port 1340 111.) A client could connect to this service address and use 1341 the portmap protocol to obtain a service address in response 1342 to a program number, e.g., an iWARP port number, or an 1343 Infiniband GID.

1345    Alternatively, the client could simply connect to the mapped 1346 well-known port for the service itself, if it is appropriately 1347 defined.

1349 Historically, different RPC protocols have taken different 1350 approaches to their port assignment; therefore, the specific method 1351 is left to each RPC/RDMA-enabled upper layer binding, and not 1352 addressed here.

1354 This specification defines a new "netid", to be used for 1355 registration of upper layers atop iWARP [RDDP] and (when a suitable 1356 port translation service is available) Infiniband [IB] in section 1357 12, "IANA Considerations." Additional RDMA-capable networks MAY 1358 define their own netids, or, if they provide a port translation, MAY 1359 share the one defined here.

1361 11. Security

1363 ONC RPC provides its own security via the RPCSEC_GSS framework 1364 [RFC2203]. RPCSEC_GSS can provide message authentication, 1365 integrity checking, and privacy. This security mechanism will be 1366 unaffected by the RDMA transport. The data integrity and privacy 1367 features alter the body of the message, presenting it as a single 1368 chunk. For large messages the chunk may be large enough to qualify 1369 for RDMA Read transfer. However, there is much data movement 1370 associated with computation and verification of integrity, or 1371 encryption/decryption, so certain performance advantages may be 1372 lost.

1374 For efficiency, a more appropriate security mechanism for RDMA links 1375 may be link-level protection, such as certain configurations of 1376 IPsec, which may be co-located in the RDMA hardware. The use of 1377 link-level protection MAY be negotiated through the use of a new 1378 RPCSEC_GSS mechanism like the Credential Cache GSS Mechanism [CCM]. 1379 Use of such mechanisms is RECOMMENDED where end-to-end integrity 1380 and/or privacy is desired, and where efficiency is required.

1382 This transport introduces no new issues with exposed addresses. The only 1383 exposed addresses are in the chunk list and in the transport 1384 packets transferred via RDMA. The data referenced by these 1385 addresses continues to be protected by RPCSEC_GSS integrity and 1386 privacy.

1388 12. IANA Considerations

1390 The new RPC transport is to be assigned a new RPC "netid", which is 1391 an rpcbind [RFC1833] string used to describe the underlying 1392 protocol in order for RPC to select the appropriate transport 1393 framing, as well as the format of the service ports.
1395 The following "nc_proto" registry string is hereby defined for this 1396 purpose: 1398 NC_RDMA "rdma" 1400 The mechanism of adding this value to the RPC netid registry is 1401 outside the scope of this document and is an IANA consideration. 1403 This netid MAY be used for any RDMA network satisfying the 1404 requirements of section 2, and able to identify service endpoints 1405 using IP port addressing, possibly through use of a translation 1406 service as described above in section 10, RPC Binding. 1408 As a new RPC transport, this protocol has no effect on RPC program 1409 numbers or existing registered port numbers. However, new port 1410 numbers MAY be registered for use by RPC/RDMA-enabled services, as 1411 appropriate to the new networks over which the services will 1412 operate. 1414 The OPTIONAL Connection Configuration protocol described herein 1415 requires an RPC program number assignment. The value "100400" is 1416 hereby assigned: 1418 rdmaconfig 100400 rpc.rdmaconfig 1420 Currently, these numbers are not assigned by IANA, they are merely 1421 republished [IANA-RPC]. The mechanism of this republishing is 1422 outside the scope of this document and is an IANA consideration. 1424 13. Acknowledgements 1426 The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, 1427 Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve 1428 Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David 1429 Robinson and Mallikarjun Chadalapaka for their contributions to 1430 this document. 1432 14. Normative References 1434 [RFC2119] 1435 S. Bradner, "Key words for use in RFCs to Indicate Requirement 1436 Levels", Best Current Practice, BCP 14, RFC 2119, March 1997. 1438 [RFC1094] 1439 Sun Microsystems, "NFS: Network File System Protocol 1440 Specification", (NFS version 2) Informational RFC, 1441 http://www.ietf.org/rfc/rfc1094.txt 1443 [RFC1831bis] 1444 R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol 1445 Specification Version 2", Standards Track RFC 1447 [RFC4506] 1448 M. Eisler Ed., "XDR: External Data Representation Standard", 1449 Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt 1451 [RFC1813] 1452 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 1453 Protocol Specification", Informational RFC, 1454 http://www.ietf.org/rfc/rfc1813.txt 1456 [RFC1833] 1457 R. Srinivasan, "Binding Protocols for ONC RPC Version 2", 1458 Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt 1460 [RFC3530] 1461 S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, 1462 M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards 1463 Track RFC, http://www.ietf.org/rfc/rfc3530.txt 1465 [RFC2203] 1466 M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol 1467 Specification", Standards Track RFC, 1468 http://www.ietf.org/rfc/rfc2203.txt 1470 15. Informative References 1472 [RDMAP] 1473 R. Recio et al., "A Remote Direct Memory Access Protocol 1474 Specification", Standards Track RFC, draft-ietf-rddp-rdmap 1476 [CCM] 1477 M. Eisler, N. Williams, "CCM: The Credential Cache GSS 1478 Mechanism", Internet Draft Work in Progress, draft-ietf- 1479 nfsv4-ccm 1481 [NFSDDP] 1482 B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet 1483 Draft Work in Progress, draft-ietf-nfsv4-nfsdirect 1485 [RDDP] 1486 H. Shah et al., "Direct Data Placement over Reliable 1487 Transports", Standards Track RFC, draft-ietf-rddp-ddp 1489 [NFSRDMAPS] 1490 T. Talpey, C. 
Juszczak, "NFS RDMA Problem Statement", Internet 1491 Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- 1492 statement 1494 [NFSv4.1] 1495 S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft 1496 Work in Progress, draft-ietf-nfsv4-minorversion1 1498 [IB] 1499 Infiniband Architecture Specification, available from 1500 http://www.infinibandta.org 1502 [IBPORT] 1503 Infiniband Trade Association, "IP Addressing Annex", available 1504 from http://www.infinibandta.org 1506 [IANA-RPC] 1507 IANA Sun RPC number statement, 1508 http://www.iana.org/assignments/sun-rpc-numbers 1510 16. Authors' Addresses 1512 Tom Talpey 1513 Network Appliance, Inc. 1514 375 Totten Pond Road 1515 Waltham, MA 02451 USA 1517 Phone: +1 781 768 5329 1518 EMail: thomas.talpey@netapp.com 1520 Brent Callaghan 1521 Apple Computer, Inc. 1522 MS: 302-4K 1523 2 Infinite Loop 1524 Cupertino, CA 95014 USA 1526 EMail: brentc@apple.com 1528 17. Intellectual Property and Copyright Statements 1530 Full Copyright Statement 1532 Copyright (C) The IETF Trust (2007). 1534 This document is subject to the rights, licenses and restrictions 1535 contained in BCP 78, and except as set forth therein, the authors 1536 retain all their rights. 1538 This document and the information contained herein are provided on 1539 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 1540 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE 1541 IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL 1542 WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY 1543 WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE 1544 ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS 1545 FOR A PARTICULAR PURPOSE. 1547 Intellectual Property 1548 The IETF takes no position regarding the validity or scope of any 1549 Intellectual Property Rights or other rights that might be claimed 1550 to pertain to the implementation or use of the technology described 1551 in this document or the extent to which any license under such 1552 rights might or might not be available; nor does it represent that 1553 it has made any independent effort to identify any such rights. 1554 Information on the procedures with respect to rights in RFC 1555 documents can be found in BCP 78 and BCP 79. 1557 Copies of IPR disclosures made to the IETF Secretariat and any 1558 assurances of licenses to be made available, or the result of an 1559 attempt made to obtain a general license or permission for the use 1560 of such proprietary rights by implementers or users of this 1561 specification can be obtained from the IETF on-line IPR repository 1562 at http://www.ietf.org/ipr. 1564 The IETF invites any interested party to bring to its attention any 1565 copyrights, patents or patent applications, or other proprietary 1566 rights that may cover technology that may be required to implement 1567 this standard. Please address the information to the IETF at ietf- 1568 ipr@ietf.org. 1570 Acknowledgment 1571 Funding for the RFC Editor function is provided by the IETF 1572 Administrative Support Activity (IASA).