idnits 2.17.1 draft-cel-nfsv4-rpcrdma-version-two-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (January 9, 2018) is 2289 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC5662' is defined on line 1561, but no explicit reference was found in the text == Unused Reference: 'RFC5666' is defined on line 1556, but no explicit reference was found in the text -- Obsolete informational reference (is this intentional?): RFC 5661 (Obsoleted by RFC 8881) -- Obsolete informational reference (is this intentional?): RFC 5666 (Obsoleted by RFC 8166) Summary: 0 errors (**), 0 flaws (~~), 5 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network File System Version 4 C. Lever, Ed. 3 Internet-Draft Oracle 4 Intended status: Standards Track D. Noveck 5 Expires: July 13, 2018 NetApp 6 January 9, 2018 8 RPC-over-RDMA Version 2 Protocol 9 draft-cel-nfsv4-rpcrdma-version-two-06 11 Abstract 13 This document specifies an improved protocol for conveying Remote 14 Procedure Call (RPC) messages on physical transports capable of 15 Remote Direct Memory Access (RDMA), based on RPC-over-RDMA version 1. 17 Status of This Memo 19 This Internet-Draft is submitted in full conformance with the 20 provisions of BCP 78 and BCP 79. 22 Internet-Drafts are working documents of the Internet Engineering 23 Task Force (IETF). Note that other groups may also distribute 24 working documents as Internet-Drafts. The list of current Internet- 25 Drafts is at https://datatracker.ietf.org/drafts/current/. 27 Internet-Drafts are draft documents valid for a maximum of six months 28 and may be updated, replaced, or obsoleted by other documents at any 29 time. 
It is inappropriate to use Internet-Drafts as reference 30 material or to cite them other than as "work in progress." 32 This Internet-Draft will expire on July 13, 2018. 34 Copyright Notice 36 Copyright (c) 2018 IETF Trust and the persons identified as the 37 document authors. All rights reserved. 39 This document is subject to BCP 78 and the IETF Trust's Legal 40 Provisions Relating to IETF Documents 41 (https://trustee.ietf.org/license-info) in effect on the date of 42 publication of this document. Please review these documents 43 carefully, as they describe your rights and restrictions with respect 44 to this document. Code Components extracted from this document must 45 include Simplified BSD License text as described in Section 4.e of 46 the Trust Legal Provisions and are provided without warranty as 47 described in the Simplified BSD License. 49 This document may contain material from IETF Documents or IETF 50 Contributions published or made publicly available before November 51 10, 2008. The person(s) controlling the copyright in some of this 52 material may not have granted the IETF Trust the right to allow 53 modifications of such material outside the IETF Standards Process. 54 Without obtaining an adequate license from the person(s) controlling 55 the copyright in such materials, this document may not be modified 56 outside the IETF Standards Process, and derivative works of it may 57 not be created outside the IETF Standards Process, except to format 58 it for publication as an RFC or to translate it into languages other 59 than English. 61 Table of Contents
63 1. Introduction
64 2. Requirements Language
65 3. Inline Threshold
66   3.1. Terminology
67   3.2. Motivation
68   3.3. Default Values
69 4. Remote Invalidation
70   4.1. Backward-Direction Remote Invalidation
71 5. Protocol Extensibility
72   5.1. Optional Features
73   5.2. Message Direction
74   5.3. Documentation Requirements
75 6. Transport Properties
76   6.1. Introduction to Transport Properties
77   6.2. Basic Transport Properties
78   6.3. New Operations
79   6.4. Extensibility
80 7. XDR Protocol Definition
81   7.1. Code Component License
82   7.2. RPC-Over-RDMA Version 2 XDR
83 8. Protocol Version Negotiation
84   8.1. Server Does Support RPC-over-RDMA Version 2
85   8.2. Server Does Not Support RPC-over-RDMA Version 2
86   8.3. Client Does Not Support RPC-over-RDMA Version 2
87   8.4. Security Considerations
88 9. IANA Considerations
89 10. References
90   10.1. Normative References
91   10.2. Informative References
92 Acknowledgments
93 Authors' Addresses
95 1. Introduction 97 Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBARCH] is a 98 technique for moving data efficiently between end nodes.
By 99 directing data into destination buffers as it is sent on a network 100 and placing it via direct memory access by hardware, the 101 complementary benefits of faster transfers and reduced host overhead 102 are obtained. 104 A protocol already exists that enables ONC RPC [RFC5531] messages to 105 be conveyed on RDMA transports. That protocol is RPC-over-RDMA 106 version 1, specified in [RFC8166]. RPC-over-RDMA version 1 is 107 deployed and in use, though there are some shortcomings to this 108 protocol, such as: 110 o The use of small Receive buffers forces the use of RDMA Read and 111 Write transfers for small payloads, and limits the size of 112 backchannel messages. 114 o Lack of support for potential optimizations, such as remote 115 invalidation, that require changes to on-the-wire behavior. 117 To address these issues in a way that is compatible with existing 118 RPC-over-RDMA version 1 deployments, a new version of RPC-over-RDMA 119 is presented in this document. RPC-over-RDMA version 2 contains only 120 incremental changes over RPC-over-RDMA version 1 to facilitate 121 adoption of version 2 by existing version 1 implementations. 123 The major new feature in RPC-over-RDMA version 2 is extensibility of 124 the RPC-over-RDMA header. Extensibility enables narrow changes to 125 RPC-over-RDMA version 2 so that new optional capabilities can be 126 introduced without a protocol version change and while maintaining 127 interoperability with existing implementations. 129 New capabilities can be proposed and developed independently of each 130 other, and implementers can choose among them, making it 131 straightforward to create and document experimental features and then 132 bring them through the standards process. 134 As part of this new extensibility feature set, a mechanism for 135 exchanging transport properties is introduced.
This mechanism allows 136 RPC-over-RDMA version 2 connection endpoints to communicate 137 properties of their implementations, to request changes in properties 138 of the other endpoint, and to notify peer endpoints of changes to 139 properties that occur during operation. 141 In addition to extensibility, the default inline threshold value is 142 larger in RPC-over-RDMA version 2. This change is driven by the 143 increase in average size of RPC messages containing common NFS 144 operations. With NFS version 4.1 [RFC5661] and later, compound 145 operations convey more data per RPC message. The default 1KB inline 146 threshold in RPC-over-RDMA version 1 prevents attaining the best 147 possible performance. 149 Support for Remote Invalidation has been introduced into RPC-over- 150 RDMA version 2. An RPC-over-RDMA responder can now request 151 invalidation of an STag as part of sending an RPC Reply, saving the 152 requester the effort of invalidating after message receipt. This new 153 feature is general enough to enable a requester to control precisely 154 when Remote Invalidation may be utilized by responders. 156 RPC-over-RDMA version 2 expands the repertoire of error codes to 157 enable extensibility, to report overruns of specific resources, and 158 to avoid requester retries when an error is permanent. 160 2. Requirements Language 162 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 163 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 164 document are to be interpreted as described in BCP 14 [RFC2119] 165 [RFC8174] when, and only when, they appear in all capitals, as shown 166 here. 168 3. Inline Threshold 170 3.1. Terminology 172 The term "inline threshold" is defined in Section 4 of [RFC8166]. An 173 "inline threshold" value is the largest message size (in octets) that 174 can be conveyed in one direction on an RDMA connection using only 175 RDMA Send and Receive. 
Each connection has two inline threshold 176 values: one for messages flowing from requester-to-responder 177 (referred to as the "call inline threshold"), and one for messages 178 flowing from responder-to-requester (referred to as the "reply inline 179 threshold"). Inline threshold values are not advertised to peers via 180 the base RPC-over-RDMA version 2 protocol. 182 A connection's inline threshold determines when RDMA Read or Write 183 operations are required because the RPC message to be sent cannot be 184 conveyed via RDMA Send and Receive. When an RPC message does not 185 contain DDP-eligible data items, a requester prepares a Long Call or 186 Reply to convey the whole RPC message using RDMA Read or Write 187 operations. 189 3.2. Motivation 191 RDMA Read and Write operations require that each data payload resides 192 in a region of memory that is registered with the RNIC. When an RPC 193 is complete, that region is invalidated, fencing it from the 194 responder. 196 Both registration and invalidation have a latency cost which is 197 insignificant compared to data handling costs. When a data payload 198 is small, however, the cost of registering and invalidating the 199 memory where the payload resides becomes a relatively significant 200 part of total RPC latency. Therefore the most efficient operation of 201 RPC-over-RDMA occurs when RDMA Read and Write operations are used for 202 large payloads, and avoided for small payloads. 204 When RPC-over-RDMA version 1 was conceived, the typical size of RPC 205 messages that did not involve a significant data payload was under 206 500 bytes. A 1024-byte inline threshold adequately minimized the 207 frequency of inefficient Long Calls and Replies. 209 Starting with NFS version 4.1 [RFC5661], NFS COMPOUND messages are 210 larger and more complex than before. 
With a 1024-byte inline 211 threshold, RDMA Read or Write operations are needed for frequent 212 operations that do not bear a data payload, such as GETATTR and 213 LOOKUP, reducing the efficiency of the transport. 215 To reduce the need to use Long Calls and Replies, RPC-over-RDMA 216 version 2 increases the default inline threshold size. This also 217 increases the maximum size of backward direction RPC messages. 219 3.3. Default Values 221 RPC-over-RDMA version 2 receiver implementations MUST support an 222 inline threshold of 4096 bytes, but MAY support larger inline 223 threshold values. A mechanism for discovering a peer's preferred 224 inline threshold value (not defined in this document) may be used to 225 optimize RDMA Send operations further. In the absence of such a 226 mechanism, senders MUST assume a receiver's inline threshold is 4096 227 bytes. 229 The new default inline threshold size is no larger than the size of a 230 hardware page on typical platforms. This conserves the resources 231 needed to Send and Receive base level RPC-over-RDMA version 2 232 messages, enabling RPC-over-RDMA version 2 to be used on a broad 233 variety of hardware. 235 4. Remote Invalidation 237 An STag that is registered using the FRWR mechanism (in a privileged 238 execution context), or is registered via a Memory Window (in user 239 space), may be invalidated remotely [RFC5040]. These mechanisms are 240 available only when a requester's RNIC supports MEM_MGT_EXTENSIONS. 242 For the purposes of this discussion, there are two classes of STags. 243 Dynamically-registered STags are used in a single RPC, then 244 invalidated. Persistently-registered STags live longer than one RPC. 245 They may persist for the life of an RPC-over-RDMA connection, or 246 longer. 248 An RPC-over-RDMA requester may provide more than one STag in one 249 transport header.
It may provide a combination of dynamically- and 250 persistently-registered STags in one RPC message, or any combination 251 of these in a series of RPCs on the same connection. Only 252 dynamically-registered STags using Memory Windows or FRWR (i.e., 253 registered via MEM_MGT_EXTENSIONS) may be invalidated remotely. 255 There is no transport-level mechanism by which a responder can 256 determine how a requester-provided STag was registered, nor whether 257 it is eligible to be invalidated remotely. A requester that mixes 258 persistently- and dynamically-registered STags in one RPC, or mixes 259 them across RPCs on the same connection, must therefore indicate 260 which handles may be invalidated via a mechanism provided in the 261 Upper Layer Protocol. RPC-over-RDMA version 2 provides such a 262 mechanism. 264 The RDMA Send With Invalidate operation is used to invalidate an STag 265 on a remote system. It is available only when a responder's RNIC 266 supports MEM_MGT_EXTENSIONS, and must be utilized only when a 267 requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and 268 recognize an IETH). 270 4.1. Backward-Direction Remote Invalidation 272 Existing RPC-over-RDMA protocol specifications [RFC8166] [RFC8167] do 273 not forbid direct data placement in the backward direction, even 274 though there is currently no Upper Layer Protocol that may use it. 276 When chunks are present in a backward-direction RPC request, Remote 277 Invalidation allows the responder to trigger invalidation of a 278 requester's STags as part of sending a reply, the same as in the 279 forward direction. 281 However, in the backward direction, the server acts as the requester, 282 and the client is the responder. The server's RNIC, therefore, must 283 support receiving an IETH, and the server must have registered the 284 STags with an appropriate registration mechanism. 286 5.
Protocol Extensibility 288 The core RPC-over-RDMA version 2 header format is specified in 289 Section 7 as a complete and stand-alone piece of XDR. Any change to 290 this XDR description requires a protocol version number change. 292 5.1. Optional Features 294 RPC-over-RDMA version 2 introduces the ability to extend the core 295 protocol via optional features. Extensibility enables minor protocol 296 issues to be addressed and incremental enhancements to be made 297 without the need to change the protocol version. The key capability 298 is that both sides can detect whether a feature is supported by their 299 peer or not. With this ability, OPTIONAL features can be introduced 300 over time to an otherwise stable protocol. 302 The rdma_opttype field carries a 32-bit unsigned integer. The value 303 in this field denotes an optional operation that MAY be supported by 304 the receiver. The values of this field and their meaning are defined 305 in other Standards Track documents. 307 The rdma_optinfo field carries opaque data. The content of this 308 field is data meaningful to the optional operation denoted by the 309 value in rdma_opttype. The content of this field is not defined in 310 the base RPC-over-RDMA version 2 protocol, but is defined in other 311 Standards Track documents. 313 When an implementation does not recognize or support the value 314 contained in the rdma_opttype field, it MUST send an RPC-over-RDMA 315 message with the rdma_xid field set to the same value as in the 316 erroneous message, the rdma_proc field set to RDMA2_ERROR, and the 317 rdma_err field set to RDMA2_ERR_INVAL_OPTION. 319 5.2. Message Direction 321 Backward direction operation depends on the ability of the receiver 322 to distinguish between incoming forward and backward direction calls 323 and replies.
This needs to be done because both the XID field and 324 the flow control value (RPC-over-RDMA credits) in the RPC-over-RDMA 325 header are interpreted in the context of each message's direction. 327 A receiver typically distinguishes message direction by examining the 328 mtype field in the RPC header of each incoming payload message. 329 However, RDMA2_OPTIONAL type messages may not carry an RPC message 330 payload. 332 To enable RDMA2_OPTIONAL type messages that do not carry an RPC 333 message payload to be interpreted unambiguously, the rdma2_optional 334 structure contains a field that identifies the message direction. A 335 similar field has been added to the rpcrdma2_chunk_lists and 336 rpcrdma2_error structures to simplify parsing the RPC-over-RDMA 337 header at the receiver. 339 5.3. Documentation Requirements 341 RPC-over-RDMA version 2 may be extended by defining a new 342 rdma_opttype value, and then by providing an XDR description of the 343 rdma_optinfo content that corresponds with the new rdma_opttype 344 value. As a result, a new header type is effectively created. 346 A Standards Track document introduces each set of such protocol 347 elements. Together these elements are considered an OPTIONAL 348 feature. Each implementation is either aware of all the protocol 349 elements introduced by that feature, or is aware of none of them. 
351 Documents describing extensions to RPC-over-RDMA version 2 should 352 contain: 354 o An explanation of the purpose and use of each new protocol element 355 added 357 o An XDR description of the protocol elements, and a script to 358 extract it 360 o A mechanism for reporting errors when the error is outside the 361 choices already available in the base protocol or in 362 other extensions 364 o An indication of whether a Payload stream must be present, and a 365 description of its contents 367 o A description of interactions with existing extensions 369 The last bullet includes requirements that another OPTIONAL feature 370 needs to be present for new protocol elements to work, or that a 371 particular level of support be provided for some particular facility 372 for the new extension to work. 374 Implementers combine the XDR descriptions of the new features they 375 intend to use with the XDR description of the base protocol in this 376 document. This may be necessary to create a valid XDR input file 377 because extensions are free to use XDR types defined in the base 378 protocol, and later extensions may use types defined by earlier 379 extensions. 381 The XDR description for the RPC-over-RDMA version 2 protocol combined 382 with that for any selected extensions should provide an adequate 383 human-readable description of the extended protocol. 385 6. Transport Properties 387 6.1. Introduction to Transport Properties 389 6.1.1. Property Model 391 A basic set of receiver and sender properties is specified in this 392 document. An extensible approach is used, allowing new properties to 393 be defined in future Standards Track documents. 395 Such properties are specified using: 397 o A code identifying the particular transport property being 398 specified. 400 o A nominally opaque array which contains within it the XDR encoding 401 of the specific property indicated by the associated code.
403 The following XDR types are used by operations that deal with 404 transport properties:
406 <CODE BEGINS>
408    typedef uint32 rpcrdma2_propid;
410    struct rpcrdma2_propval {
411            rpcrdma2_propid rdma_which;
412            opaque          rdma_data<>;
413    };
415    typedef rpcrdma2_propval rpcrdma2_propset<>;
417    typedef uint32 rpcrdma2_propsubset<>;
419 <CODE ENDS>
421 An rpcrdma2_propid specifies a particular transport property. In 422 order to allow easier XDR extension of the set of properties by 423 concatenating XDR files, specific properties are defined as const 424 values rather than as elements in an enum. 426 An rpcrdma2_propval specifies a value of a particular transport 427 property, with the property identified by rdma_which and 428 the associated value of that property contained within rdma_data. 430 An rdma_data field which is of zero length is interpreted as 431 indicating the default value of the property indicated by rdma_which. 433 While rdma_data is defined as opaque within the XDR, the contents are 434 interpreted (except when of length zero) using the XDR typedef 435 associated with the property specified by rdma_which. The receiver 436 of a message containing an rpcrdma2_propval MUST report an XDR error 437 [ cel: which error? BAD_XDR, or do we want to add a new one? ] if 438 the length of rdma_data is such that it extends beyond the bounds of 439 the message transferred. 441 In cases in which the rpcrdma2_propid specified by rdma_which is 442 understood by the receiver, the receiver also MUST report an XDR 443 error if either of the following occurs: [ cel: which error? BAD_XDR, 444 or do we want to add a new one? ] 446 o The nominally opaque data within rdma_data is not valid when 447 interpreted using the property-associated typedef. 449 o The length of rdma_data is insufficient to contain the data 450 represented by the property-associated typedef. 452 Note that no error is to be reported if rdma_which is unknown to the 453 receiver.
In that case, that rpcrdma2_propval is not processed and 454 processing continues using the next rpcrdma2_propval, if any. 456 An rpcrdma2_propset specifies a set of transport properties. No 457 particular ordering of the rpcrdma2_propval items within it is 458 imposed. 460 An rpcrdma2_propsubset identifies a subset of the properties in a 461 previously specified rpcrdma2_propset. Each bit in the mask denotes 462 a particular element in a previously specified rpcrdma2_propset. If 463 a particular rpcrdma2_propval is at position N in the array, then bit 464 number N mod 32 in word N div 32 specifies whether that particular 465 rpcrdma2_propval is included in the defined subset. Words beyond the 466 last one specified are treated as containing zero. 468 Property subsets (rpcrdma2_propsubset values) are useful in a number of contexts: 470 o In the specification of transport properties at connection time, they 471 allow the sender to specify what subset of those is subject to 472 later change. 474 o In responding to a request to modify a set of transport 475 properties, they allow the responding endpoint to specify the 476 subsets of those properties for which the requested change has 477 been performed or been rejected. 479 6.1.2. Transport Property Groups 481 Transport properties are divided into a number of groups: 483 o A basic set of transport properties defined in this document. See 484 Section 6.2 for the complete list. 486 o Additional transport properties defined in future Standards Track 487 documents as specified in Section 6.4.1. 489 o Experimental transport properties being explored preparatory to 490 being considered for Standards Track definition. See the 491 description in Section 6.4.2. 493 6.1.3. Operations Related to Transport Properties 495 There are a number of operations defined in Section 6.3 which are 496 used to communicate and manage transport properties.
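As an illustration of the receiver-side rules of Section 6.1.1, the following is a minimal Python sketch, not a definitive implementation. It walks an XDR-encoded rpcrdma2_propset and tests membership in an rpcrdma2_propsubset. The names XdrError, decode_propset, and propsubset_test are hypothetical. The sketch handles only uint32-valued properties, and it assumes that bit 0 is the least significant bit of each mask word, which the text above does not state.

```python
import struct
from typing import Dict, List

RDMA2_PROPID_RBSIZ = 1                      # property code from Section 6.2.1
DEFAULTS: Dict[int, int] = {RDMA2_PROPID_RBSIZ: 4096}  # default from Table 1


class XdrError(Exception):
    """Hypothetical stand-in for whichever XDR error the protocol reports."""


def decode_propset(buf: bytes) -> Dict[int, int]:
    """Return the understood uint32 properties found in an encoded propset."""
    def u32(off: int) -> int:
        if off + 4 > len(buf):
            raise XdrError("truncated message")
        return struct.unpack_from(">I", buf, off)[0]

    values: Dict[int, int] = {}
    count, off = u32(0), 4
    for _ in range(count):
        which, length = u32(off), u32(off + 4)
        off += 8
        padded = (length + 3) & ~3          # XDR pads opaque data to 4 bytes
        if off + padded > len(buf):
            raise XdrError("rdma_data extends beyond the message")
        data = buf[off:off + length]
        off += padded
        if which not in DEFAULTS:
            continue                        # unknown property: skip, no error
        if length == 0:
            values[which] = DEFAULTS[which]  # zero length means the default
        elif length != 4:
            raise XdrError("rdma_data does not match the property typedef")
        else:
            values[which] = struct.unpack(">I", data)[0]
    return values


def propsubset_test(mask: List[int], n: int) -> bool:
    """True if the propval at position n is in the subset."""
    word, bit = divmod(n, 32)               # word N div 32, bit N mod 32
    # Assumption: bit 0 is the least significant bit of each word.
    # Words beyond the end of the mask are treated as zero.
    return word < len(mask) and bool(mask[word] & (1 << bit))


# Two properties: an unknown one (code 99) and Receive Buffer Size = 8192.
encoded = struct.pack(">7I", 2, 99, 4, 7, RDMA2_PROPID_RBSIZ, 4, 8192)
decoded = decode_propset(encoded)
```

In this sketch the unknown property with code 99 is skipped silently, while a length that runs past the end of the buffer raises the placeholder error.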
498 Prime among these is RDMA2_CONNPROP (defined in Section 6.3.1), which 499 serves as a means by which an endpoint's transport properties may be 500 presented to its peer, typically upon establishing a connection. 502 In addition, there is a set of related operations concerned with 503 requesting, effecting and reporting changes in transport properties: 505 o RDMA2_REQPROP (defined in Section 6.3.2) serves as a way for 506 an endpoint to request that a peer change the values for a set of 507 transport properties. 509 o RDMA2_RESPROP (defined in Section 6.3.3) is used to report on the 510 disposition of each of the individual transport property changes 511 requested in a previous RDMA2_REQPROP. 513 o RDMA2_UPDPROP (defined in Section 6.3.4) is used to report an 514 unsolicited change in a transport property. 516 Unlike many other operation types, the above are not used to effect 517 transfer of RPC requests but are internal one-way information 518 transfers. However, an RDMA2_REQPROP and the corresponding 519 RDMA2_RESPROP do constitute an RPC-like remote call. The other 520 operations are not part of a remote call transaction. 522 6.2. Basic Transport Properties 524 Although the set of transport properties is subject to later 525 extension, a basic set of transport properties is defined below in 526 Table 1. 528 In that table, the columns contain the following information: 530 o The column labeled "property" identifies the transport property 531 described by the current row. 533 o The column labeled "code" specifies the rpcrdma2_propid value used 534 to identify this property. 536 o The column labeled "XDR type" gives the XDR type of the data used 537 to communicate the value of this property. This data type 538 overlays the data portion of the nominally opaque field rdma_data 539 in an rpcrdma2_propval.
541 o The column labeled "default" gives the default value for the 542 property which is to be assumed by those who do not receive, or 543 are unable to interpret, information about the actual value of the 544 property. 546 o The column labeled "section" indicates the section (within this 547 document) that explains the semantics and use of this transport 548 property.
550 +--------------------------+------+------------------------+-----------------------+---------+
551 | property                 | code | XDR type               | default               | section |
553 +--------------------------+------+------------------------+-----------------------+---------+
554 | Receive Buffer Size      | 1    | uint32                 | 4096                  | 6.2.1   |
557 | Backward Request Support | 2    | enum rpcrdma2_bkreqsup | RDMA2_BKREQSUP_INLINE | 6.2.2   |
561 +--------------------------+------+------------------------+-----------------------+---------+
563 Table 1
565 Note that this table does not provide any indication regarding 566 whether a particular property can change or whether a change in the 567 value may be requested (see Section 6.3.2). Such matters are not 568 addressed by the protocol definition. An implementation may provide 569 information about its readiness to make changes in a particular 570 property using the rdma_nochg field in the RDMA2_CONNPROP message. 572 A partner implementation can always request a change, but peers MAY 573 reject a request to change a property for any reason. 574 Implementations are always free to reject such requests if they 575 cannot or do not wish to effect the requested change. 577 Either of the following will result in effective rejection of requests 578 to change specific properties: 580 o If an endpoint does not wish to accept requests to change 581 particular properties, it may reject such requests as described in 582 Section 6.3.3.
584 o If an endpoint does not support the RDMA2_REQPROP operation, the 585 effect would be the same as if every request to change a set of 586 properties were rejected. 588 With regard to unrequested changes in transport properties, it is the 589 responsibility of the implementation making the change to do so in a 590 fashion that does not interfere with the other partner's 591 continued correct operation (see Section 6.2.1). 593 6.2.1. Receive Buffer Size 595 The Receive Buffer Size specifies the minimum size, in octets, of 596 pre-posted receive buffers. It is the responsibility of the 597 participant sending this value to ensure that its pre-posted receives 598 are at least the size specified, allowing the participant receiving 599 this value to send messages of up to this size.
601 <CODE BEGINS>
603    const uint32 RDMA2_PROPID_RBSIZ = 1;
604    typedef uint32 rpcrdma2_prop_rbsiz;
606 <CODE ENDS>
608 The sender may use its knowledge of the receiver's buffer size to 609 determine when the message to be sent will fit in the pre-posted 610 receive buffers that the receiver has set up. In particular: 612 o Requesters may use the value to determine when it is necessary to 613 provide a Position-Zero read chunk when sending a request. 615 o Requesters may use the value to determine when it is necessary to 616 provide a Reply chunk when sending a request, based on the maximum 617 possible size of the reply. 619 o Responders may use the value to determine when it is necessary, 620 given the actual size of the reply, to actually use a Reply chunk 621 provided by the requester. 623 Because there may be pre-posted receives with buffer sizes that 624 reflect earlier values of the buffer size property, changing this 625 property poses special difficulties: 627 o When the size is being raised, the partner should not be informed 628 of the change until all pending receives using the older value 629 have been eliminated.
631 o The size should not be reduced until the partner is aware of the 632 need to reduce the size of future sends to conform to this reduced 633 value. To ensure this, such a change should only occur in 634 response to an explicit request by the other endpoint (see 635 Section 6.3.2). The participant making the request should use 636 that lower size as the send size limit until the request is 637 rejected (see Section 6.3.3) or an update to a size larger than 638 the requested value becomes effective and the requested change is 639 no longer pending (see Section 6.3.4). 641 6.2.2. Backward Request Support 643 The value of this property is used to indicate a client 644 implementation's readiness to accept and process messages that are 645 part of backward-direction RPC requests.
647 <CODE BEGINS>
649    enum rpcrdma2_bkreqsup {
650            RDMA2_BKREQSUP_NONE = 0,
651            RDMA2_BKREQSUP_INLINE = 1,
652            RDMA2_BKREQSUP_GENL = 2
653    };
655    const uint32 RDMA2_PROPID_BRS = 2;
656    typedef rpcrdma2_bkreqsup rpcrdma2_prop_brs;
658 <CODE ENDS>
660 Multiple levels of support are distinguished: 662 o The value RDMA2_BKREQSUP_NONE indicates that receipt of backward- 663 direction requests and replies is not supported. 665 o The value RDMA2_BKREQSUP_INLINE indicates that receipt of 666 backward-direction requests or replies is only supported using 667 inline messages and that use of explicit RDMA operations or other 668 forms of Direct Data Placement for backward-direction requests or 669 responses is not supported. 671 o The value RDMA2_BKREQSUP_GENL indicates that receipt of backward-direction 672 requests or replies is supported in the same ways that forward- 673 direction requests or replies typically are. 675 When information about this property is not provided, the support 676 level of servers can be inferred from the backward-direction 677 requests that they issue, assuming that issuing a request implicitly 678 indicates support for receiving the corresponding reply.
   On this basis, support for receiving inline replies can be assumed
   when requests without read chunks, write chunks, or Reply chunks are
   issued, while requests with any of these elements allow the client to
   assume that general support for backward-direction replies is present
   on the server.

6.3.  New Operations

   The proposed new operations are set forth in Table 2 below.  In that
   table, the columns contain the following information:

   o  The column labeled "operation" specifies the particular operation.

   o  The column labeled "code" specifies the value of rdma_opttype for
      this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to describe the information in this new message
      type.  This data overlays the data portion of the nominally opaque
      field rdma_optinfo in an RDMA2_OPTIONAL message.

   o  The column labeled "msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "section" indicates the section (within this
      document) that explains the semantics and use of this optional
      operation.

   +------------------------+------+-------------------+-----+---------+
   | operation              | code | XDR type          | msg | section |
   +------------------------+------+-------------------+-----+---------+
   | Specify Properties at  | 1    | rpcrdma2_connprop | No  | 6.3.1   |
   | Connection             |      |                   |     |         |
   | Request Property       | 2    | rpcrdma2_reqprop  | No  | 6.3.2   |
   | Modification           |      |                   |     |         |
   | Respond to             | 3    | rpcrdma2_resprop  | No  | 6.3.3   |
   | Modification Request   |      |                   |     |         |
   | Report Updated         | 4    | rpcrdma2_updprop  | No  | 6.3.4   |
   | Properties             |      |                   |     |         |
   +------------------------+------+-------------------+-----+---------+

                                  Table 2

   Support for all of the operations above is OPTIONAL.
   RPC-over-RDMA version 2 implementations that receive an operation
   that is not supported MUST respond with an RDMA2_ERROR message with
   an error code of RDMA2_ERR_INVAL_OPTION.

   The only operation support requirements are as follows:

   o  Implementations that send RDMA2_REQPROP messages must support
      RDMA2_RESPROP messages.

   o  Implementations that support RDMA2_RESPROP or RDMA2_UPDPROP
      messages must also support RDMA2_CONNPROP messages.

6.3.1.  RDMA2_CONNPROP: Specify Properties at Connection

   The RDMA2_CONNPROP message type allows an RPC-over-RDMA participant,
   whether client or server, to indicate to its partner relevant
   transport properties that the partner might need to be aware of.

   The message definition for this operation is as follows:

      struct rpcrdma2_connprop {
         rpcrdma2_propset    rdma_start;
         rpcrdma2_propsubset rdma_nochg;
      };

   All relevant transport properties that the sender is aware of should
   be included in rdma_start.  Since support of this request is
   OPTIONAL, and since each of the properties is OPTIONAL as well, the
   sender cannot assume that the receiver will necessarily take note of
   these properties.  The sender should therefore be prepared for cases
   in which the partner continues to assume that the default value for a
   particular property is still in effect.

   Values of the subset of transport properties specified by rdma_nochg
   are not expected to change during the lifetime of the connection.

   Generally, a participant will send an RDMA2_CONNPROP message as the
   first message after a connection is established.  Given that fact,
   the sender should make sure that the message can be received by
   partners who use the default Receive Buffer Size.  The connection's
   initial receive buffer size is typically 1 KB (1024 bytes), but it
   depends on the initial connection state of the RPC-over-RDMA version
   in use.

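   The size constraint on the first RDMA2_CONNPROP message can be
   checked mechanically.  The following Python sketch is illustrative
   only and is not part of the protocol definition.  It XDR-encodes an
   rpcrdma2_connprop body and tests it against a 1024-byte receive
   buffer.  It assumes that each property value is carried XDR-encoded
   inside rdma_data, and that the fixed transport header fields
   (rdma_xid, rdma_vers, rdma_credit, rdma_proc) occupy 16 octets.  The
   helper names are hypothetical.

```python
import struct

# Property identifiers from Sections 6.2.1 and 6.2.2.
RDMA2_PROPID_RBSIZ = 1
RDMA2_PROPID_BRS = 2

# Assumed initial receive buffer size (the 1 KB figure cited above).
DEFAULT_RBSIZ = 1024

# Assumed fixed header overhead: rdma_xid, rdma_vers, rdma_credit, and
# rdma_proc, each a 4-octet XDR uint32.
FIXED_HEADER_LEN = 16


def xdr_uint32(value):
    """Encode an XDR unsigned 32-bit integer (big-endian)."""
    return struct.pack(">I", value)


def xdr_opaque(data):
    """Encode variable-length opaque data: length, bytes, zero padding."""
    pad = (4 - len(data) % 4) % 4
    return xdr_uint32(len(data)) + data + b"\x00" * pad


def encode_connprop(start, nochg):
    """Encode an rpcrdma2_connprop body.

    start: list of (propid, encoded_value) pairs for rdma_start.
    nochg: list of propids for rdma_nochg.
    """
    out = xdr_uint32(len(start))
    for propid, value in start:
        out += xdr_uint32(propid) + xdr_opaque(value)
    out += xdr_uint32(len(nochg))
    for propid in nochg:
        out += xdr_uint32(propid)
    return out


def fits_default_buffer(body):
    """Return True if header plus body fit in the default receive buffer."""
    return FIXED_HEADER_LEN + len(body) <= DEFAULT_RBSIZ


body = encode_connprop(
    start=[
        (RDMA2_PROPID_RBSIZ, xdr_uint32(4096)),
        (RDMA2_PROPID_BRS, xdr_uint32(1)),  # RDMA2_BKREQSUP_INLINE
    ],
    nochg=[RDMA2_PROPID_BRS],
)
```

   With only the two properties defined in this document, the encoded
   body is far smaller than the default buffer.  The check matters
   mainly once additional or experimental properties are included.
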
   Properties not included in rdma_start are to be treated by the peer
   endpoint as having the default value and are not allowed to change
   subsequently.  The peer should not request changes in such
   properties.

   Those receiving an RDMA2_CONNPROP message may encounter properties
   that they do not support or are unaware of.  In such cases, these
   properties are simply ignored without any error response being
   generated.

6.3.2.  RDMA2_REQPROP: Request Modification of Properties

   The RDMA2_REQPROP message type allows an RPC-over-RDMA participant,
   whether client or server, to request of its partner that relevant
   transport properties be changed.

   The rdma_xid field allows the request to be tied to a corresponding
   response of type RDMA2_RESPROP (see Section 6.3.3).  In assigning the
   value of this field, the sender does not need to avoid conflict with
   XIDs associated with RPC messages or with RDMA2_REQPROP messages
   sent by the peer endpoint.

   The partner need not change the properties as requested by the
   sender, but if it does support the message type, it will generate an
   RDMA2_RESPROP message indicating the disposition of the request.

   The message definition for this operation is as follows:

      struct rpcrdma2_reqprop {
         rpcrdma2_propset rdma_want;
      };

   The rpcrdma2_propset rdma_want is a set of transport properties
   together with the desired values requested by the sender.

6.3.3.  RDMA2_RESPROP: Respond to Request to Modify Transport Properties

   The RDMA2_RESPROP message type allows an RPC-over-RDMA participant to
   respond to a request to change properties by its partner, indicating
   how the request was dealt with.

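   One way to pair an RDMA2_RESPROP message with the RDMA2_REQPROP
   message it answers is to keep a table of pending requests keyed by
   rdma_xid.  The Python sketch below is illustrative only.  The class
   and method names are hypothetical, and the use of a private counter
   relies on the statement in Section 6.3.2 that these XIDs need not
   avoid conflict with XIDs used for RPC messages.

```python
import itertools


class PropertyRequestTracker:
    """Track outstanding RDMA2_REQPROP requests by rdma_xid (sketch)."""

    def __init__(self):
        # Private XID counter, independent of the RPC XID space.
        self._next_xid = itertools.count(1)
        self._pending = {}

    def send_reqprop(self, want):
        """Record a request; 'want' maps propid -> requested value."""
        xid = next(self._next_xid) & 0xFFFFFFFF
        self._pending[xid] = dict(want)
        return xid

    def match_resprop(self, xid):
        """Return the request matching a response, or None if unknown."""
        return self._pending.pop(xid, None)


tracker = PropertyRequestTracker()
xid = tracker.send_reqprop({1: 8192})  # ask for a larger Receive Buffer Size
```

   A response whose rdma_xid matches no pending request yields None
   here.  How an implementation reacts to such a response is outside
   the scope of this sketch.
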
   The message definition for this operation is as follows:

      struct rpcrdma2_resprop {
         rpcrdma2_propsubset rdma_done;
         rpcrdma2_propsubset rdma_rejected;
         rpcrdma2_propset    rdma_other;
      };

   The rdma_xid field of this message must match that used in the
   RDMA2_REQPROP message to which this message is responding.

   The rdma_done field indicates which of the requested transport
   property changes have been effected as requested.  For each such
   property, the receiver is entitled to conclude that the requested
   change has been made and that future transmissions may be made based
   on the new value.

   The rdma_rejected field indicates which of the requested transport
   property changes have been rejected by the sender of the
   RDMA2_RESPROP message.  A change may be rejected for any of the
   following reasons:

   o  The particular property specified is not known or supported by the
      receiver of the RDMA2_REQPROP message.

   o  The implementation receiving the RDMA2_REQPROP message does not
      support modification of this property.

   o  The implementation receiving the RDMA2_REQPROP message has chosen
      to reject the modification for another reason.

   The rdma_other field contains new values for properties for which a
   change was requested.  Each value included may differ both from the
   original value in effect when the change was requested and from the
   requested value.  This is useful when the new value of some property
   is not as large as requested but still different from the original
   value, indicating a partial satisfaction of the peer's property
   change request.

   The sender MUST NOT include rpcrdma2_propval items within rdma_other
   that are for properties other than the ones for which the
   corresponding property request has requested a change.  If the
   receiver finds such a situation, it MUST ignore the erroneous
   rpcrdma2_propval items.

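   The receiver-side rule for rdma_other can be sketched as a simple
   filter.  The Python fragment below is illustrative only.  It models
   rdma_other as a list of (propid, data) pairs and the original request
   as a collection of propid values, and the function name is
   hypothetical.

```python
def filter_other(requested_ids, other):
    """Split rdma_other items into those to honor and those to ignore.

    requested_ids: propid values named in rdma_want of the request.
    other: list of (propid, data) pairs received in rdma_other.
    Returns (kept, ignored); items for unrequested properties are
    ignored rather than treated as an error.
    """
    wanted = set(requested_ids)
    kept = [(p, d) for p, d in other if p in wanted]
    ignored = [(p, d) for p, d in other if p not in wanted]
    return kept, ignored


kept, ignored = filter_other(
    requested_ids=[1],
    other=[(1, b"\x00\x00\x10\x00"), (2, b"\x00\x00\x00\x01")],
)
```

   Here the item for property 2 is dropped because the request asked
   only for a change to property 1.
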
   The subsets of properties specified by rdma_done and rdma_rejected,
   and the set of properties included in rdma_other, MUST NOT overlap
   and, taken together, should cover the entire set of properties
   specified by rdma_want in the corresponding request.  If the receiver
   finds such an overlap or mismatch, it SHOULD treat properties that
   are missing or within the overlap as having been rejected.

6.3.4.  RDMA2_UPDPROP: Update Transport Properties

   The RDMA2_UPDPROP message type allows an RPC-over-RDMA participant to
   notify the other participant that a change to the transport
   properties has occurred.  It is sent when the sender has decided,
   independently, to modify one or more transport properties and is
   notifying the receiver of these changes.

   The message definition for this operation is as follows:

      struct rpcrdma2_updprop {
         rpcrdma2_propset rdma_now;
      };

   The rdma_now field defines the new property values to be used.

6.4.  Extensibility

6.4.1.  Additional Properties

   The set of transport properties is designed to be extensible.  As a
   result, once new properties are defined in standards track documents,
   the operations defined in this document may reference these new
   transport properties, as well as the ones described in this document.

   A standards track document defining a new transport property should
   include the following information, paralleling that provided in this
   document for the transport properties defined herein:

   o  The rpcrdma2_propid value used to identify this property.

   o  The XDR typedef specifying the form in which the property value is
      communicated.

   o  A description of the transport property that is communicated by
      the sender of RDMA2_CONNPROP and RDMA2_UPDPROP and requested by
      the sender of RDMA2_REQPROP.

   o  An explanation of how this knowledge could be used by the
      participant receiving this information.

   o  Information giving rules governing possible changes of values of
      this property.

   The definition of transport property structures is such as to make it
   easy to assign unique values.  There is no requirement that a
   contiguous set of values be used, and implementations should not rely
   on all such values being small integers.  A unique value should be
   selected when the defining document is first published as an
   Internet-Draft.  When the document becomes a standards track
   document, the working group should ensure that:

   o  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   o  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 6.4.2.

   Documents defining new properties fall into a number of categories:

   o  Those defining new properties and explaining (only) how they
      affect use of existing message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable to the operation of those new message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable both to new and existing message types.

   When additional transport properties are proposed, the review of the
   associated standards track document should deal with possible
   security issues raised by those new transport properties.

6.4.2.  Experimental Properties

   Given the design of the transport properties data structure, it is
   possible to use the operations to implement experimental, possibly
   unpublished, transport properties.

   rpcrdma2_propid values in the range from 4,294,967,040 to
   4,294,967,295 are reserved for experimental use, and these values
   should not be assigned to new properties in standards track
   documents.

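   The reserved range corresponds to the top 256 values of the 32-bit
   rpcrdma2_propid space (0xFFFFFF00 through 0xFFFFFFFF).  A minimal
   Python sketch of a range check follows.  It is illustrative only,
   and the constant and function names are hypothetical.

```python
# Bounds of the experimental range given in Section 6.4.2.
EXPERIMENTAL_PROPID_MIN = 4_294_967_040  # 0xFFFFFF00
EXPERIMENTAL_PROPID_MAX = 4_294_967_295  # 0xFFFFFFFF


def is_experimental_propid(propid):
    """Return True if propid lies in the experimental-use range."""
    if not 0 <= propid <= 0xFFFFFFFF:
        raise ValueError("rpcrdma2_propid must fit in a uint32")
    return EXPERIMENTAL_PROPID_MIN <= propid <= EXPERIMENTAL_PROPID_MAX
```

   A document reviewer could apply such a check to confirm that a newly
   assigned rpcrdma2_propid value does not fall within the reserved
   range.
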
   When values in this range are used, there is no guarantee of
   successful interoperation among independent implementations.

7.  XDR Protocol Definition

   This section contains a description of the core features of the RPC-
   over-RDMA version 2 protocol, expressed in the XDR language
   [RFC4506].

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form.  The reader can apply the following shell
   script to this document to produce a machine-readable XDR description
   of the RPC-over-RDMA version 2 protocol without any OPTIONAL
   extensions.

      #!/bin/sh
      grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called "extract.sh"
   and this document is in a file called "spec.txt", then the reader can
   do the following to extract an XDR description file:

      sh extract.sh < spec.txt > rpcrdma_corev2.x

   Optional extensions to RPC-over-RDMA version 2, published as
   Standards Track documents, will have similar means of providing XDR
   that describes those extensions.  Once XDR for all desired extensions
   is also extracted, it can be appended to the XDR description file
   extracted from this document to produce a consolidated XDR
   description file reflecting all extensions selected for an RPC-over-
   RDMA implementation.

7.1.  Code Component License

   Code components extracted from this document must include the
   following license text.  When the extracted XDR code is combined with
   other complementary XDR code that itself has an identical license,
   only a single copy of the license text need be preserved.

   /// /*
   ///  * Copyright (c) 2010-2017 IETF Trust and the persons
   ///  * identified as authors of the code.  All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */

7.2.  RPC-Over-RDMA Version 2 XDR

   The XDR defined in this section is used to encode the Transport
   Header Stream in each RPC-over-RDMA version 2 message.  The terms
   "Transport Header Stream" and "RPC Payload Stream" are defined in
   Section 4 of [RFC8166].

   /// /* From RFC 5531, Section 9 */
   /// enum msg_type {
   ///    CALL = 0,
   ///    REPLY = 1
   /// };
   ///
   /// struct rpcrdma2_segment {
   ///    uint32 rdma_handle;
   ///    uint32 rdma_length;
   ///    uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///    uint32                  rdma_position;
   ///    struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///    struct rpcrdma2_read_segment rdma_entry;
   ///    struct rpcrdma2_read_list    *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///    struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///    struct rpcrdma2_write_chunk rdma_entry;
   ///    struct rpcrdma2_write_list  *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_chunk_lists {
   ///    enum msg_type               rdma_direction;
   ///    uint32                      rdma_inv_handle;
   ///    struct rpcrdma2_read_list   *rdma_reads;
   ///    struct rpcrdma2_write_list  *rdma_writes;
   ///    struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// enum rpcrdma2_errcode {
   ///    RDMA2_ERR_VERS = 1,
   ///    RDMA2_ERR_BAD_XDR = 2,
   ///    RDMA2_ERR_INVAL_PROC = 3,
   ///    RDMA2_ERR_READ_CHUNKS = 4,
   ///    RDMA2_ERR_WRITE_CHUNKS = 5,
   ///    RDMA2_ERR_SEGMENTS = 6,
   ///    RDMA2_ERR_WRITE_RESOURCE = 7,
   ///    RDMA2_ERR_REPLY_RESOURCE = 8,
   ///    RDMA2_ERR_INVAL_OPTION = 9,
   ///    RDMA2_ERR_SYSTEM = 10
   /// };
   ///
   /// struct rpcrdma2_err_vers {
   ///    uint32 rdma_vers_low;
   ///    uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///    uint32 rdma_chunk_index;
   ///    uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
   ///    case RDMA2_ERR_VERS:
   ///       rpcrdma2_err_vers rdma_vrange;
   ///    case RDMA2_ERR_BAD_XDR:
   ///       void;
   ///    case RDMA2_ERR_INVAL_PROC:
   ///       void;
   ///    case RDMA2_ERR_READ_CHUNKS:
   ///       uint32 rdma_max_chunks;
   ///    case RDMA2_ERR_WRITE_CHUNKS:
   ///       uint32 rdma_max_chunks;
   ///    case RDMA2_ERR_SEGMENTS:
   ///       uint32 rdma_max_segments;
   ///    case RDMA2_ERR_WRITE_RESOURCE:
   ///       rpcrdma2_err_write rdma_writeres;
   ///    case RDMA2_ERR_REPLY_RESOURCE:
   ///       uint32 rdma_length_needed;
   ///    case RDMA2_ERR_INVAL_OPTION:
   ///       void;
   ///    case RDMA2_ERR_SYSTEM:
   ///       void;
   /// };
   ///
   /// struct rpcrdma2_optional {
   ///    enum msg_type rdma_optdir;
   ///    uint32        rdma_opttype;
   ///    opaque        rdma_optinfo<>;
   /// };
   ///
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///    rpcrdma2_propid rdma_which;
   ///    opaque          rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// struct rpcrdma2_connprop {
   ///    rpcrdma2_propset    rdma_start;
   ///    rpcrdma2_propsubset rdma_nochg;
   /// };
   ///
   /// struct rpcrdma2_reqprop {
   ///    rpcrdma2_propset rdma_want;
   /// };
   ///
   /// struct rpcrdma2_resprop {
   ///    rpcrdma2_propsubset rdma_done;
   ///    rpcrdma2_propsubset rdma_rejected;
   ///    rpcrdma2_propset    rdma_other;
   /// };
   ///
   /// struct rpcrdma2_updprop {
   ///    rpcrdma2_propset rdma_now;
   /// };
   ///
   /// enum rpcrdma2_proc {
   ///    RDMA2_MSG = 0,
   ///    RDMA2_NOMSG = 1,
   ///    RDMA2_ERROR = 4,
   ///    RDMA2_OPTIONAL = 5,
   ///    RDMA2_CONNPROP = 6,
   ///    RDMA2_REQPROP = 7,
   ///    RDMA2_RESPROP = 8,
   ///    RDMA2_UPDPROP = 9
   /// };
   ///
   /// union rpcrdma2_body switch (rpcrdma2_proc rdma_proc) {
   ///    case RDMA2_MSG:
   ///       rpcrdma2_chunk_lists rdma_chunks;
   ///    case RDMA2_NOMSG:
   ///       rpcrdma2_chunk_lists rdma_chunks;
   ///    case RDMA2_ERROR:
   ///       rpcrdma2_error rdma_error;
   ///    case RDMA2_OPTIONAL:
   ///       rpcrdma2_optional rdma_optional;
   ///    case RDMA2_CONNPROP:
   ///       rpcrdma2_connprop rdma_connprop;
   ///    case RDMA2_REQPROP:
   ///       rpcrdma2_reqprop rdma_reqprop;
   ///    case RDMA2_RESPROP:
   ///       rpcrdma2_resprop rdma_resprop;
   ///    case RDMA2_UPDPROP:
   ///       rpcrdma2_updprop rdma_updprop;
   /// };
   ///
   /// struct rpcrdma2_xprt_hdr {
   ///    uint32        rdma_xid;
   ///    uint32        rdma_vers;
   ///    uint32        rdma_credit;
   ///    rpcrdma2_body rdma_body;
   /// };
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const uint32 RDMA2_PROPID_RBSIZ = 1;
   /// const uint32 RDMA2_PROPID_BRS = 2;
   ///
   /// enum rpcrdma2_bkreqsup {
   ///    RDMA2_BKREQSUP_NONE = 0,
   ///    RDMA2_BKREQSUP_INLINE = 1,
   ///    RDMA2_BKREQSUP_GENL = 2
   /// };
   ///
   /// /*
   ///  * Transport property typedefs
   ///  */
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef rpcrdma2_bkreqsup rpcrdma2_prop_brs;

7.2.1.  Presence Of Payload

   o  When the rdma_proc field has the value RDMA2_MSG, an RPC Payload
      Stream MUST follow the Transport Header Stream in the Send buffer.

   o  When the rdma_proc field has the value RDMA2_ERROR, an RPC Payload
      Stream MUST NOT follow the Transport Header Stream.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL, all, part
      of, or no RPC Payload Stream MAY follow the Transport Header
      Stream in the Send buffer.

7.2.2.  Message Direction

   Implementations of RPC-over-RDMA version 2 are REQUIRED to support
   backwards-direction operation as described in [RFC8167].
   RPC-over-RDMA version 2 introduces the rdma_direction field in its
   transport header to optimize the process of distinguishing between
   forward- and backwards-direction messages.

   The rdma_direction field qualifies the value contained in the
   transport header's rdma_xid field.  This enables a receiver to
   reliably avoid performing an XID lookup on incoming backwards-
   direction Call messages.

   In general, when a message carries an XID that was generated by the
   message's receiver (that is, the receiver is acting as a requester),
   the message's sender sets the rdma_direction field to REPLY (1).
   Otherwise, the rdma_direction field is set to CALL (0).  For example:

   o  When the rdma_proc field has the value RDMA2_MSG or RDMA2_NOMSG,
      the value of the rdma_direction field MUST be the same as the
      value of the associated RPC message's msg_type field.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL and a whole
      or partial RPC message payload is present, the value of the
      rdma_optdir field MUST be the same as the value of the associated
      RPC message's msg_type field.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL and no RPC
      message payload is present, a Requester MUST set the value of the
      rdma_optdir field to CALL, and a Responder MUST set the value of
      the rdma_optdir field to REPLY.  The Requester chooses a value for
      the rdma_xid field from the XID space that matches the message's
      direction.  Requesters and Responders set the rdma_credit field in
      a similar fashion: a value is set that is appropriate for the
      direction of the message.

   o  When the rdma_proc field has the value RDMA2_ERROR, the direction
      of the message is always Responder-to-Requester (REPLY).

7.2.3.  Remote Invalidation

   To request Remote Invalidation, a requester MUST set the value of the
   rdma_inv_handle field in an RPC Call's transport header to a non-zero
   value that matches one of the rdma_handle fields in that header.  If
   none of the rdma_handle values in the Call may be invalidated by the
   responder, the requester MUST set the RPC Call's rdma_inv_handle
   field to the value zero.

   If the responder chooses not to use Remote Invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the responder MUST use RDMA Send to transmit
   the matching RPC Reply.

   If a requester has provided a non-zero value in the RPC Call's
   rdma_inv_handle field and the responder chooses to use Remote
   Invalidation for the matching RPC Reply, the responder MUST use RDMA
   Send With Invalidate to transmit that RPC Reply, and MUST use the
   value in the RPC Call's rdma_inv_handle field to construct the Send
   With Invalidate Work Request.

7.2.4.  Transport Errors

   Error handling works the same way in RPC-over-RDMA version 2 as it
   does in RPC-over-RDMA version 1, with the addition of several new
   error codes.  Error messages never flow from requester to responder.
   Version 1 error handling is described in Section 5 of [RFC8166].

   In all cases below, the responder copies the values of the rdma_xid
   and rdma_vers fields from the incoming transport header that
   generated the error to the transport header of the error response.
   The responder sets the rdma_proc field to RDMA2_ERROR, and the
   rdma_credit field is set to the credit grant value for this
   connection.

   RDMA2_ERR_VERS
      This is the equivalent of ERR_VERS in RPC-over-RDMA version 1.
      The error code value, semantics, and utilization are the same.

   RDMA2_ERR_INVAL_PROC
      If a responder recognizes the value in the rdma_vers field, but it
      does not recognize the value in the rdma_proc field, it MUST set
      the rdma_err field to RDMA2_ERR_INVAL_PROC.

   RDMA2_ERR_BAD_XDR
      If a responder recognizes the values in the rdma_vers and
      rdma_proc fields, but the incoming RPC-over-RDMA transport header
      cannot be parsed, it MUST set the rdma_err field to
      RDMA2_ERR_BAD_XDR.  The error code value of RDMA2_ERR_BAD_XDR is
      the same as the error code value of ERR_CHUNK in RPC-over-RDMA
      version 1.  The responder MUST NOT process the request in any way
      except to send an error message.

   RDMA2_ERR_READ_CHUNKS
      If a requester presents more DDP-eligible arguments than the
      responder is prepared to Read, the responder MUST set the rdma_err
      field to RDMA2_ERR_READ_CHUNKS, and set the rdma_max_chunks field
      to the maximum number of Read chunks the responder can receive and
      process.

      If the responder implementation cannot handle any Read chunks for
      a request, it MUST set the rdma_max_chunks field to zero in this
      response.  The requester SHOULD resend the request using a
      Position-Zero Read chunk.  If this was a request using a Position-
      Zero Read chunk, the requester MUST terminate the transaction with
      an error.

   RDMA2_ERR_WRITE_CHUNKS
      If a requester has constructed an RPC Call message with more DDP-
      eligible results than the responder is prepared to Write, the
      responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS,
      and set the rdma_max_chunks field to the maximum number of Write
      chunks the responder can process and return.

      If the responder implementation cannot handle any Write chunks for
      a request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE
      (below).  The requester SHOULD resend the request with no Write
      chunks and a Reply chunk of appropriate size.

   RDMA2_ERR_SEGMENTS
      If a requester has constructed an RPC Call message with a chunk
      that contains more segments than the responder supports, the
      responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS, and
      set the rdma_max_segments field to the maximum number of segments
      the responder can process.

   RDMA2_ERR_WRITE_RESOURCE
      If a requester has provided a Write chunk that is not large enough
      to convey a DDP-eligible result, the responder MUST set the
      rdma_err field to RDMA2_ERR_WRITE_RESOURCE.

      The responder MUST set the rdma_chunk_index field to point to the
      first Write chunk in the transport header that is too short, or to
      zero to indicate that it was not possible to determine which chunk
      is too small.  Indexing starts at one (1), which represents the
      first Write chunk.  The responder MUST set the rdma_length_needed
      field to the number of bytes needed in that chunk in order to
      convey the result data item.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      index and length fields to zero), or it MAY send the request again
      using the same XID and more reply resources.

   RDMA2_ERR_REPLY_RESOURCE
      If an RPC Reply's Payload stream does not fit inline and the
      requester has not provided a large enough Reply chunk to convey
      the stream, the responder MUST set the rdma_err field to
      RDMA2_ERR_REPLY_RESOURCE.  The responder MUST set the
      rdma_length_needed field to the number of Reply chunk bytes needed
      to convey the reply.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      length field to zero), or it MAY send the request again using the
      same XID and larger reply resources.

   RDMA2_ERR_INVAL_OPTION
      A responder MUST set the rdma_err field to RDMA2_ERR_INVAL_OPTION
      when an RDMA2_OPTIONAL message is received and the responder does
      not recognize the value in the rdma_opttype field.

   RDMA2_ERR_SYSTEM
      If some problem occurs on a responder that does not fit into the
      above categories, the responder MAY report it to the requester by
      setting the rdma_err field to RDMA2_ERR_SYSTEM.

      This is a permanent error: a requester that receives this error
      MUST terminate the RPC transaction associated with the XID value
      in the rdma_xid field.

8.  Protocol Version Negotiation

   When an RPC-over-RDMA version 2 client establishes a connection to a
   server, the first order of business is to determine the server's
   highest supported protocol version.

   As with RPC-over-RDMA version 1, a client MUST assume the ability to
   exchange only a single RPC-over-RDMA message at a time until it
   receives a valid non-error RPC-over-RDMA message from the server that
   reports the server's credit limit.

   First, the client sends a single valid RPC-over-RDMA message with the
   value two (2) in the rdma_vers field.  Because the server might
   support only RPC-over-RDMA version 1, this initial message can be no
   larger than the version 1 default inline threshold of 1024 bytes.

8.1.  Server Does Support RPC-over-RDMA Version 2

   If the server does support RPC-over-RDMA version 2, it sends RPC-
   over-RDMA messages back to the client with the value two (2) in the
   rdma_vers field.  Both peers may use the default inline threshold
   value for RPC-over-RDMA version 2 connections (4096 bytes).

8.2.  Server Does Not Support RPC-over-RDMA Version 2

   If the server does not support RPC-over-RDMA version 2, it MUST send
   an RPC-over-RDMA message to the client with the same XID, with
   RDMA2_ERROR in the rdma_proc field, and with the error code
   RDMA2_ERR_VERS.
   This message also reports a range of protocol versions that the
   server supports. To continue operation, the client selects a
   protocol version in the range of server-supported versions for
   subsequent messages on this connection.

   If the connection is lost immediately after an RDMA2_ERROR /
   RDMA2_ERR_VERS message is received, a client can avoid a possible
   version negotiation loop when re-establishing another connection by
   assuming that particular server does not support RPC-over-RDMA
   version 2. A client can assume the same situation (no server support
   for RPC-over-RDMA version 2) if the initial negotiation message is
   lost or dropped. Once the negotiation exchange is complete, both
   peers may use the default inline threshold value for the transport
   protocol version that has been selected.

8.3.  Client Does Not Support RPC-over-RDMA Version 2

   If the server supports the RPC-over-RDMA protocol version used in
   Call messages from a client, it MUST send Replies with the same
   RPC-over-RDMA protocol version that the client uses to send its
   Calls.

8.4.  Security Considerations

   The security considerations for RPC-over-RDMA version 2 are the same
   as those for RPC-over-RDMA version 1.

8.4.1.  Security Considerations (Transport Properties)

   Like other fields that appear in each RPC-over-RDMA header, property
   information is sent in the clear on the fabric with no integrity
   protection, making it vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size or the Requester Remote Invalidation boolean, it
   could reduce connection performance or trigger loss of connection.
   Repeated connection loss can impact performance or even prevent a new
   connection from being established. Recourse is to deploy on a
   private network or use link-layer encryption.

9.  IANA Considerations

   This document does not require actions by IANA.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

10.2.  Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T.
              Talpey, "Remote Direct Memory Access Transport for Remote
              Procedure Call Version 1", RFC 8166,
              DOI 10.17487/RFC8166, June 2017.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on
              RPC-over-RDMA Transports", RFC 8167,
              DOI 10.17487/RFC8167, June 2017.

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA version 1 specification
   [RFC5666]. The authors also wish to thank Bill Baker, Greg Marsden,
   and Matt Benjamin for their support of this work.

   The extract.sh shell script and formatting conventions were first
   described by the authors of the NFS version 4.1 XDR specification
   [RFC5662].

   Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
   Working Group Chair Spencer Shepler, and NFSV4 Working Group
   Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   United States of America

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com

   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA 02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com
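
   The client-side decision described in Section 8 (Protocol Version
   Negotiation) can be sketched in Python as follows. This is a minimal
   illustration and not part of the protocol. The FirstReply type, its
   field names, the select_version and inline_threshold helpers, and the
   policy of picking the highest mutually supported version are all
   assumptions made for this example. Only the default inline threshold
   values (1024 and 4096 bytes) and the rule that a lost initial message
   is treated as "no version 2 support" are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Default inline thresholds, in bytes, as stated in Section 8.
DEFAULT_INLINE_THRESHOLD = {1: 1024, 2: 4096}

# The first message must fit a version 1 server's default threshold.
INITIAL_MESSAGE_LIMIT = DEFAULT_INLINE_THRESHOLD[1]


@dataclass
class FirstReply:
    """Hypothetical, simplified view of the server's first reply."""
    rdma_vers: int            # value of the rdma_vers field in the reply
    vers_error: bool = False  # True for RDMA2_ERROR / RDMA2_ERR_VERS
    vers_low: int = 0         # lowest version reported (assumed name)
    vers_high: int = 0        # highest version reported (assumed name)


def select_version(reply: Optional[FirstReply],
                   client_versions: Sequence[int] = (1, 2)) -> Optional[int]:
    """Pick the version for subsequent messages, or None if none fits."""
    if reply is None:
        # Initial message lost or dropped: assume no version 2 support.
        candidates = [v for v in client_versions if v < 2]
    elif reply.vers_error:
        # Server reported a supported range; stay inside it.
        candidates = [v for v in client_versions
                      if reply.vers_low <= v <= reply.vers_high]
    else:
        # Server answered in the version the client used.
        candidates = [v for v in client_versions if v == reply.rdma_vers]
    # Assumed policy: prefer the highest mutually supported version.
    return max(candidates) if candidates else None


def inline_threshold(version: int) -> int:
    """Default inline threshold for the selected protocol version."""
    return DEFAULT_INLINE_THRESHOLD[version]
```

   For instance, a reply built as FirstReply(rdma_vers=1,
   vers_error=True, vers_low=1, vers_high=1) leads select_version to
   choose version 1, after which inline_threshold gives 1024 bytes. A
   real client would also remember that result for the server, so that a
   reconnect after connection loss does not repeat the version 2
   attempt.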