idnits 2.17.1

draft-cel-nfsv4-rpcrdma-version-two-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (December 1, 2016) is 2704 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-11) exists of
     draft-ietf-nfsv4-rfc5666bis-08

  == Outdated reference: A later version (-08) exists of
     draft-ietf-nfsv4-rpcrdma-bidirection-05

  -- Obsolete informational reference (is this intentional?): RFC 5661
     (Obsoleted by RFC 8881)

  -- Obsolete informational reference (is this intentional?): RFC 5666
     (Obsoleted by RFC 8166)

     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: June 4, 2017                                                HPE
                                                        December 1, 2016

                   RPC-over-RDMA Version Two Protocol
                 draft-cel-nfsv4-rpcrdma-version-two-03

Abstract

   This document specifies an improved protocol for conveying Remote
   Procedure Call (RPC) messages on physical transports capable of
   Remote Direct Memory Access (RDMA), based on RPC-over-RDMA Version
   One.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Inline Threshold
     2.1.  Terminology
     2.2.  Motivation
     2.3.  Default Values
   3.  Remote Invalidation
     3.1.  Backward-Direction Remote Invalidation
   4.  Protocol Extensibility
     4.1.  Optional Features
     4.2.  Message Direction
     4.3.  Documentation Requirements
   5.  Transport Properties
     5.1.  Introduction To Transport Properties
     5.2.  Basic Transport Properties
     5.3.  New Operations
     5.4.  Extensibility
   6.  XDR Protocol Definition
     6.1.  Code Component License
     6.2.  RPC-Over-RDMA Version Two XDR
   7.  Protocol Version Negotiation
     7.1.  Server Does Support RPC-over-RDMA Version Two
     7.2.  Server Does Not Support RPC-over-RDMA Version Two
     7.3.  Client Does Not Support RPC-over-RDMA Version Two
     7.4.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgments
   Authors' Addresses

1. Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a
   technique for moving data efficiently between end nodes. By
   directing data into destination buffers as it is sent on a network
   and placing it via direct memory access by hardware, the
   complementary benefits of faster transfers and reduced host overhead
   are obtained.

   A protocol already exists that enables ONC RPC [RFC5531] messages to
   be conveyed on RDMA transports. That protocol is RPC-over-RDMA
   Version One, specified in [I-D.ietf-nfsv4-rfc5666bis]. RPC-over-RDMA
   Version One is deployed and in use, though there are some
   shortcomings to this protocol, such as:

   o  The use of small Receive buffers forces the use of RDMA Read and
      Write transfers for small payloads, and limits the size of
      backchannel messages.

   o  Lack of support for potential optimizations, such as remote
      invalidation, that require changes to on-the-wire behavior.

   To address these issues in a way that is compatible with existing
   RPC-over-RDMA Version One deployments, a new version of RPC-over-RDMA
   is presented in this document. RPC-over-RDMA Version Two contains
   only incremental changes over RPC-over-RDMA Version One to facilitate
   adoption of Version Two by existing Version One implementations.

   The major new feature in RPC-over-RDMA Version Two is extensibility
   of the RPC-over-RDMA header. Extensibility enables narrow changes to
   RPC-over-RDMA Version Two so that new optional capabilities can be
   introduced without a protocol version change and while maintaining
   interoperability with existing implementations.

   New capabilities can be proposed and developed independently of each
   other, and implementers can choose among them, making it
   straightforward to create and document experimental features and then
   bring them through the standards process.

   As part of this new extensibility feature set, a mechanism for
   exchanging transport properties is introduced. This mechanism allows
   RPC-over-RDMA Version Two connection endpoints to communicate
   properties of their implementations, to request changes in properties
   of the other endpoint, and to notify peer endpoints of changes to
   properties that occur during operation.

   In addition to extensibility, the default inline threshold value is
   larger in RPC-over-RDMA Version Two. This change is driven by the
   increase in average size of RPC messages containing common NFS
   operations. With NFSv4.1 [RFC5661] and later, compound operations
   convey more data per RPC message. The default 1KB inline threshold
   in RPC-over-RDMA Version One prevents attaining the best possible
   performance.

   Support for Remote Invalidation has been introduced into RPC-over-
   RDMA Version Two. An RPC-over-RDMA responder can now request
   invalidation of an STag as part of sending an RPC Reply, saving the
   requester the effort of invalidating after message receipt. This new
   feature is general enough to enable a requester to control precisely
   when Remote Invalidation may be utilized by responders.

   RPC-over-RDMA Version Two expands the repertoire of error codes to
   enable extensibility, to support debugging, and to prevent requester
   retries when an error is permanent.

2. Inline Threshold

2.1. Terminology

   The term "inline threshold" is defined in Section 4 of
   [I-D.ietf-nfsv4-rfc5666bis]. An "inline threshold" value is the
   largest message size (in octets) that can be conveyed in one
   direction on an RDMA connection using only RDMA Send and Receive.
   Each connection has two inline threshold values: one for messages
   flowing from requester-to-responder (referred to as the "call inline
   threshold"), and one for messages flowing from responder-to-requester
   (referred to as the "reply inline threshold"). Inline threshold
   values are not advertised to peers via the base RPC-over-RDMA Version
   Two protocol.

   A connection's inline threshold determines when RDMA Read or Write
   operations are required because the RPC message to be sent cannot be
   conveyed via RDMA Send and Receive. When an RPC message does not
   contain DDP-eligible data items, a requester prepares a Long Call or
   Reply to convey the whole RPC message using RDMA Read or Write
   operations.

2.2. Motivation

   RDMA Read and Write operations require that each data payload resides
   in a region of memory that is registered with the RNIC. When an RPC
   is complete, that region is invalidated, fencing it from the
   responder.

   Both registration and invalidation have a latency cost which is
   insignificant compared to data handling costs. When a data payload
   is small, however, the cost of registering and invalidating the
   memory where the payload resides becomes a relatively significant
   part of total RPC latency. Therefore the most efficient operation of
   RPC-over-RDMA occurs when RDMA Read and Write operations are used for
   large payloads, and avoided for small payloads.

   When RPC-over-RDMA Version One was conceived, the typical size of RPC
   messages that did not involve a significant data payload was under
   500 bytes. A 1024-byte inline threshold adequately minimized the
   frequency of inefficient Long Calls and Replies.

   Starting with NFSv4.1 [RFC5661], NFS COMPOUND RPC messages are larger
   and more complex than before.
   With a 1024-byte inline threshold,
   RDMA Read or Write operations are needed for frequent operations that
   do not bear a data payload, such as GETATTR and LOOKUP, reducing the
   efficiency of the transport.

   To reduce the need to use Long Calls and Replies, RPC-over-RDMA
   Version Two increases the default inline threshold size. This also
   increases the maximum size of backward direction RPC messages.

2.3. Default Values

   RPC-over-RDMA Version Two receiver implementations MUST support an
   inline threshold of 4096 bytes, but MAY support larger inline
   threshold values. A mechanism for discovering a peer's preferred
   inline threshold value (not defined in this document) may be used to
   optimize RDMA Send operations further. In the absence of such a
   mechanism, senders MUST assume a receiver's inline threshold is 4096
   bytes.

   The new default inline threshold size is no larger than the size of a
   hardware page on typical platforms. This conserves the resources
   needed to Send and Receive base level RPC-over-RDMA Version Two
   messages, enabling RPC-over-RDMA Version Two to be used on a broad
   variety of hardware.

3. Remote Invalidation

   An STag that is registered using the FRWR mechanism (in a privileged
   execution context), or is registered via a Memory Window (in user
   space), may be invalidated remotely [RFC5040]. These mechanisms are
   available only when a requester's RNIC supports MEM_MGT_EXTENSIONS.

   For the purposes of this discussion, there are two classes of STags.
   Dynamically-registered STags are used in a single RPC, then
   invalidated. Persistently-registered STags live longer than one RPC.
   They may persist for the life of an RPC-over-RDMA connection, or
   longer.

   An RPC-over-RDMA requester may provide more than one STag in one
   transport header.
   It may provide a combination of dynamically- and
   persistently-registered STags in one RPC message, or any combination
   of these in a series of RPCs on the same connection. Only
   dynamically-registered STags using Memory Windows or FRWR (i.e.,
   registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.

   There is no transport-level mechanism by which a responder can
   determine how a requester-provided STag was registered, nor whether
   it is eligible to be invalidated remotely. A requester that mixes
   persistently- and dynamically-registered STags in one RPC, or mixes
   them across RPCs on the same connection, must therefore indicate
   which handles may be invalidated via a mechanism provided in the
   Upper Layer Protocol. RPC-over-RDMA Version Two provides such a
   mechanism.

   The RDMA Send With Invalidate operation is used to invalidate an STag
   on a remote system. It is available only when a responder's RNIC
   supports MEM_MGT_EXTENSIONS, and must be utilized only when a
   requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and
   recognize an IETH).

3.1. Backward-Direction Remote Invalidation

   Existing RPC-over-RDMA protocol specifications
   [I-D.ietf-nfsv4-rfc5666bis] [I-D.ietf-nfsv4-rpcrdma-bidirection] do
   not forbid direct data placement in the backward-direction, even
   though there is currently no Upper Layer Protocol that may use it.

   When chunks are present in a backward-direction RPC request, Remote
   Invalidation allows the responder to trigger invalidation of a
   requester's STags as part of sending a reply, the same as in the
   forward direction.

   However, in the backward direction, the server acts as the requester,
   and the client is the responder. The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered the
   STags with an appropriate registration mechanism.

4. Protocol Extensibility

   The core RPC-over-RDMA Version Two header format is specified in
   Section 6 as a complete and stand-alone piece of XDR. Any change to
   this XDR description requires a protocol version number change.

4.1. Optional Features

   RPC-over-RDMA Version Two introduces the ability to extend the core
   protocol via optional features. Extensibility enables minor protocol
   issues to be addressed and incremental enhancements to be made
   without the need to change the protocol version. The key capability
   is that both sides can detect whether a feature is supported by their
   peer or not. With this ability, OPTIONAL features can be introduced
   over time to an otherwise stable protocol.

   The rdma_opttype field carries a 32-bit unsigned integer. The value
   in this field denotes an optional operation that MAY be supported by
   the receiver. The values of this field and their meaning are defined
   in other Standards Track documents.

   The rdma_optinfo field carries opaque data. The content of this
   field is data meaningful to the optional operation denoted by the
   value in rdma_opttype. The content of this field is not defined in
   the base RPC-over-RDMA Version Two protocol, but is defined in other
   Standards Track documents.

   When an implementation does not recognize or support the value
   contained in the rdma_opttype field, it MUST send an RPC-over-RDMA
   message with the rdma_xid field set to the same value as the
   erroneous message, the rdma_proc field set to RDMA2_ERROR, and the
   rdma_err field set to RDMA2_ERR_INVAL_OPTION.

4.2. Message Direction

   Backward direction operation depends on the ability of the receiver
   to distinguish between incoming forward and backward direction calls
   and replies.
   This needs to be done because both the XID field and
   the flow control value (RPC-over-RDMA credits) in the RPC-over-RDMA
   header are interpreted in the context of each message's direction.

   A receiver typically distinguishes message direction by examining the
   mtype field in the RPC header of each incoming payload message.
   However, RDMA2_OPTIONAL type messages may not carry an RPC message
   payload.

   To enable RDMA2_OPTIONAL type messages that do not carry an RPC
   message payload to be interpreted unambiguously, the rdma2_optional
   structure contains a field that identifies the message direction. A
   similar field has been added to the rpcrdma2_chunk_lists and
   rpcrdma2_error structures to simplify parsing the RPC-over-RDMA
   header at the receiver.

4.3. Documentation Requirements

   RPC-over-RDMA Version Two may be extended by defining a new
   rdma_opttype value, and then by providing an XDR description of the
   rdma_optinfo content that corresponds with the new rdma_opttype
   value. As a result, a new header type is effectively created.

   A Standards Track document introduces each set of such protocol
   elements. Together these elements are considered an OPTIONAL
   feature. Each implementation is either aware of all the protocol
   elements introduced by that feature, or is aware of none of them.
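
   The "all or none" awareness described above pairs with the Section
   4.1 rule for unrecognized rdma_opttype values. The following is a
   minimal sketch of a receiver's dispatch step, not a definitive
   implementation. It assumes an already-decoded message represented
   as a dictionary; the function name handle_optional and the handler
   table are hypothetical, and the error constants are kept symbolic
   because their numeric values are assigned by the XDR in Section 6.

```python
# Symbolic stand-ins (assumption): the numeric values come from the XDR
# in Section 6 and are deliberately not guessed here.
RDMA2_ERROR = "RDMA2_ERROR"
RDMA2_ERR_INVAL_OPTION = "RDMA2_ERR_INVAL_OPTION"


def handle_optional(msg, handlers):
    """Dispatch a decoded RDMA2_OPTIONAL message (hypothetical helper).

    msg is a dict with rdma_xid, rdma_opttype and rdma_optinfo keys.
    handlers maps each supported rdma_opttype value to a callable.
    """
    handler = handlers.get(msg["rdma_opttype"])
    if handler is None:
        # Unrecognized or unsupported option: reply with an error that
        # echoes the XID of the offending message (Section 4.1).
        return {
            "rdma_xid": msg["rdma_xid"],
            "rdma_proc": RDMA2_ERROR,
            "rdma_err": RDMA2_ERR_INVAL_OPTION,
        }
    return handler(msg)


# A receiver that supports no optional features at all.
reply = handle_optional(
    {"rdma_xid": 7, "rdma_opttype": 99, "rdma_optinfo": b""}, {}
)
```

   Keeping the handler table as the single source of truth means a
   feature is either fully registered or entirely absent, which matches
   the requirement that an implementation knows all of a feature's
   protocol elements or none of them.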

   Documents describing extensions to RPC-over-RDMA Version Two should
   contain:

   o  An explanation of the purpose and use of each new protocol element
      added

   o  An XDR description of the protocol elements, and a script to
      extract it

   o  A mechanism for reporting errors when the error is outside the
      choices already available in the base protocol or in other
      extensions

   o  An indication of whether a Payload stream must be present, and a
      description of its contents

   o  A description of interactions with existing extensions

   The last bullet includes requirements that another OPTIONAL feature
   needs to be present for new protocol elements to work, or that a
   particular level of support be provided for some particular facility
   for the new extension to work.

   Implementers combine the XDR descriptions of the new features they
   intend to use with the XDR description of the base protocol in this
   document. This may be necessary to create a valid XDR input file
   because extensions are free to use XDR types defined in the base
   protocol, and later extensions may use types defined by earlier
   extensions.

   The XDR description for the RPC-over-RDMA Version Two protocol
   combined with that for any selected extensions should provide an
   adequate human-readable description of the extended protocol.

5. Transport Properties

5.1. Introduction To Transport Properties

5.1.1. Property Model

   A basic set of receiver and sender properties is specified in this
   document. An extensible approach is used, allowing new properties to
   be defined in future standards track documents.

   Such properties are specified using:

   o  A code identifying the particular transport property being
      specified.

   o  A nominally opaque array which contains within it the XDR encoding
      of the specific property indicated by the associated code.

   The following XDR types are used by operations that deal with
   transport properties:

      typedef uint32 rpcrdma2_propid;

      struct rpcrdma2_propval {
              rpcrdma2_propid rdma_which;
              opaque          rdma_data<>;
      };

      typedef rpcrdma2_propval rpcrdma2_propset<>;

      typedef uint32 rpcrdma2_propsubset<>;

   An rpcrdma2_propid specifies a particular transport property. In
   order to allow easier XDR extension of the set of properties by
   concatenating XDR files, specific properties are defined as const
   values rather than as elements in an enum.

   An rpcrdma2_propval specifies a value of a particular transport
   property with the particular property identified by rdma_which, while
   the associated value of that property is contained within rdma_data.

   An rdma_data field which is of zero length is interpreted as
   indicating the default value of the property indicated by rdma_which.

   While rdma_data is defined as opaque within the XDR, the contents are
   interpreted (except when of length zero) using the XDR typedef
   associated with the property specified by rdma_which. The receiver
   of a message containing an rpcrdma2_propval MUST report an XDR error
   [ cel: which error? BAD_XDR, or do we want to add a new one? ] if
   the length of rdma_data is such that it extends beyond the bounds of
   the message transferred.

   In cases in which the rpcrdma2_propid specified by rdma_which is
   understood by the receiver, the receiver also MUST report an XDR
   error if either of the following occur: [ cel: which error? BAD_XDR,
   or do we want to add a new one? ]

   o  The nominally opaque data within rdma_data is not valid when
      interpreted using the property-associated typedef.

   o  The length of rdma_data is insufficient to contain the data
      represented by the property-associated typedef.

   Note that no error is to be reported if rdma_which is unknown to the
   receiver.
   In that case, that rpcrdma2_propval is not processed and
   processing continues using the next rpcrdma2_propval, if any.

   An rpcrdma2_propset specifies a set of transport properties. No
   particular ordering of the rpcrdma2_propval items within it is
   imposed.

   An rpcrdma2_propsubset identifies a subset of the properties in a
   previously specified rpcrdma2_propset. Each bit in the mask denotes
   a particular element in a previously specified rpcrdma2_propset. If
   a particular rpcrdma2_propval is at position N in the array, then bit
   number N mod 32 in word N div 32 specifies whether that particular
   rpcrdma2_propval is included in the defined subset. Words beyond the
   last one specified are treated as containing zero.

   Property subsets are useful in a number of contexts:

   o  In the specification of transport properties at connection time,
      they allow the sender to specify which of those properties are
      subject to later change.

   o  In responding to a request to modify a set of transport
      properties, they allow the responding endpoint to specify the
      subsets of those properties for which the requested change has
      been performed or been rejected.

5.1.2. Transport Property Groups

   Transport properties are divided into a number of groups:

   o  A basic set of transport properties defined in this document. See
      Section 5.2 for the complete list.

   o  Additional transport properties defined in future standards track
      documents as specified in Section 5.4.1.

   o  Experimental transport properties being explored preparatory to
      being considered for standards track definition. See the
      description in Section 5.4.2.

5.1.3. Operations Related to Transport Properties

   There are a number of operations defined in Section 5.3 which are
   used to communicate and manage transport properties.
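
   The receiver rules and the subset bit numbering given in Section
   5.1.1 can be sketched as follows. This is an illustration under
   stated assumptions rather than a definitive implementation: the
   names XdrError, decode_uint32, decode_propset and subset_contains
   are hypothetical, bit 0 is assumed to be the least significant bit
   of each word (the text does not say), and a uint32 property whose
   rdma_data is not exactly four octets is treated as invalid.

```python
import struct


class XdrError(Exception):
    """Stand-in for the XDR error the receiver is required to report."""


def decode_uint32(data):
    # Assumption: any length other than 4 is invalid for a uint32 value.
    if len(data) != 4:
        raise XdrError("rdma_data does not hold a uint32")
    return struct.unpack(">I", data)[0]


def decode_propset(buf, known):
    """Decode an rpcrdma2_propset; known maps propid -> decoder."""
    off = 0

    def take(n):
        nonlocal off
        if off + n > len(buf):
            raise XdrError("rdma_data extends beyond the message")
        chunk = buf[off:off + n]
        off += n
        return chunk

    (count,) = struct.unpack(">I", take(4))
    values = {}
    for _ in range(count):
        propid, length = struct.unpack(">II", take(8))
        data = take(length)
        take((-length) % 4)          # XDR pads opaque data to 4 octets
        decoder = known.get(propid)
        if decoder is None:
            continue                 # unknown property: skip, no error
        # Zero length means "default value"; recorded here as None.
        values[propid] = decoder(data) if length else None
    return values


def subset_contains(words, n):
    """True if position n is selected in an rpcrdma2_propsubset."""
    word_index, bit = divmod(n, 32)  # word N div 32, bit N mod 32
    if word_index >= len(words):
        return False                 # missing words are treated as zero
    return bool((words[word_index] >> bit) & 1)


# Two propvals: property 1 (uint32 8192) and an unknown property 99.
example = (struct.pack(">I", 2)
           + struct.pack(">II", 1, 4) + struct.pack(">I", 8192)
           + struct.pack(">II", 99, 2) + b"ab\x00\x00")
decoded = decode_propset(example, {1: decode_uint32})
```

   Here decoded holds only property 1, because property 99 is unknown
   to this receiver and is skipped without an error.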

   Prime among these is RDMA2_CONNPROP (defined in Section 5.3.1), which
   serves as a means by which an endpoint's transport properties may be
   presented to its peer, typically upon establishing a connection.

   In addition, there is a set of related operations concerned with
   requesting, effecting and reporting changes in transport properties:

   o  RDMA2_REQPROP (defined in Section 5.3.2) serves as a way for an
      endpoint to request that a peer change the values for a set of
      transport properties.

   o  RDMA2_RESPROP (defined in Section 5.3.3) is used to report on the
      disposition of each of the individual transport property changes
      requested in a previous RDMA2_REQPROP.

   o  RDMA2_UPDPROP (defined in Section 5.3.4) is used to report an
      unsolicited change in a transport property.

   Unlike many other operation types, the above are not used to effect
   transfer of RPC requests but are internal one-way information
   transfers. However, an RDMA2_REQPROP and the corresponding
   RDMA2_RESPROP do constitute an RPC-like remote call. The other
   operations are not part of a remote call transaction.

5.2. Basic Transport Properties

   Although the set of transport properties is subject to later
   extension, a basic set of transport properties is defined below in
   Table 1.

   In that table, the columns contain the following information:

   o  The column labeled "property" identifies the transport property
      described by the current row.

   o  The column labeled "code" specifies the rpcrdma2_propid value used
      to identify this property.

   o  The column labeled "XDR type" gives the XDR type of the data used
      to communicate the value of this property. This data type
      overlays the data portion of the nominally opaque field rdma_data
      in an rpcrdma2_propval.

   o  The column labeled "default" gives the default value for the
      property which is to be assumed by those who do not receive, or
      are unable to interpret, information about the actual value of the
      property.

   o  The column labeled "section" indicates the section (within this
      document) that explains the semantics and use of this transport
      property.

   +--------------+------+-------------------+-----------------------+---------+
   | property     | code | XDR type          | default               | section |
   +--------------+------+-------------------+-----------------------+---------+
   | Receive      | 1    | uint32            | 4096                  | 5.2.1   |
   | Buffer Size  |      |                   |                       |         |
   | Backward     | 2    | enum              | RDMA2_BKREQSUP_INLINE | 5.2.2   |
   | Request      |      | rpcrdma2_bkreqsup |                       |         |
   | Support      |      |                   |                       |         |
   +--------------+------+-------------------+-----------------------+---------+

                                  Table 1

   Note that this table does not provide any indication regarding
   whether a particular property can change or whether a change in the
   value may be requested (see Section 5.3.2). Such matters are not
   addressed by the protocol definition. An implementation may provide
   information about its readiness to make changes to a particular
   property using the rdma_nochg field in the RDMA2_CONNPROP message.

   A partner implementation can always request a change but peers MAY
   reject a request to change a property for any reason.
   Implementations are always free to reject such requests if they
   cannot or do not wish to effect the requested change.

   Either of the following will result in effective rejection of
   requests to change specific properties:

   o  If an endpoint does not wish to accept requests to change
      particular properties, it may reject such requests as described in
      Section 5.3.3.

   o  If an endpoint does not support the RDMA2_REQPROP operation, the
      effect would be the same as if every request to change a set of
      properties were rejected.

   With regard to unrequested changes in transport properties, it is the
   responsibility of the implementation making the change to do so in a
   fashion that does not interfere with the other partner's continued
   correct operation (see Section 5.2.1).

5.2.1. Receive Buffer Size

   The Receive Buffer Size specifies the minimum size, in octets, of
   pre-posted receive buffers. It is the responsibility of the
   participant sending this value to ensure that its pre-posted receives
   are at least the size specified, allowing the participant receiving
   this value to send messages that are of this size.

      const uint32 RDMA2_PROPID_RBSIZ = 1;
      typedef uint32 rpcrdma2_prop_rbsiz;

   The sender may use its knowledge of the receiver's buffer size to
   determine when the message to be sent will fit in the pre-posted
   receive buffers that the receiver has set up. In particular:

   o  Requesters may use the value to determine when it is necessary to
      provide a Position-Zero read chunk when sending a request.

   o  Requesters may use the value to determine when it is necessary to
      provide a Reply chunk when sending a request, based on the maximum
      possible size of the reply.

   o  Responders may use the value to determine when it is necessary,
      given the actual size of the reply, to actually use a Reply chunk
      provided by the requester.

   Because there may be pre-posted receives with buffer sizes that
   reflect earlier values of the buffer size property, changing this
   property poses special difficulties:

   o  When the size is being raised, the partner should not be informed
      of the change until all pending receives using the older value
      have been eliminated.

   o  The size should not be reduced until the partner is aware of the
      need to reduce the size of future sends to conform to this reduced
      value. To ensure this, such a change should only occur in
      response to an explicit request by the other endpoint (see
      Section 5.3.2). The participant making the request should use
      that lower size as the send size limit until the request is
      rejected (see Section 5.3.3) or an update to a size larger than
      the requested value becomes effective and the requested change is
      no longer pending (see Section 5.3.4).

5.2.2. Backward Request Support

   The value of this property is used to indicate a client
   implementation's readiness to accept and process messages that are
   part of backward-direction RPC requests.

      enum rpcrdma2_bkreqsup {
              RDMA2_BKREQSUP_NONE = 0,
              RDMA2_BKREQSUP_INLINE = 1,
              RDMA2_BKREQSUP_GENL = 2
      };

      const uint32 RDMA2_PROPID_BRS = 2;
      typedef rpcrdma2_bkreqsup rpcrdma2_prop_brs;

   Multiple levels of support are distinguished:

   o  The value RDMA2_BKREQSUP_NONE indicates that receipt of backward-
      direction requests and replies is not supported.

   o  The value RDMA2_BKREQSUP_INLINE indicates that receipt of
      backward-direction requests or replies is only supported using
      inline messages and that use of explicit RDMA operations or other
      form of Direct Data Placement for backward direction requests or
      responses is not supported.

   o  The value RDMA2_BKREQSUP_GENL indicates that receipt of backward-
      direction requests or replies is supported in the same ways that
      forward-direction requests or replies typically are.

   When information about this property is not provided, the support
   level of servers can be inferred from the backward-direction
   requests that they issue, assuming that issuing a request implicitly
   indicates support for receiving the corresponding reply.
   On this
   basis, support for receiving inline replies can be assumed when
   requests without read chunks, write chunks, or Reply chunks are
   issued, while requests with any of these elements allow the client to
   assume that general support for backward-direction replies is present
   on the server.

5.3. New Operations

   The proposed new operations are set forth in Table 2 below. In that
   table, the columns contain the following information:

   o  The column labeled "operation" specifies the particular operation.

   o  The column labeled "code" specifies the value of rdma_opttype for
      this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to describe the information in this new message
      type. This data overlays the data portion of the nominally opaque
      field rdma_optinfo in an RDMA2_OPTIONAL message.

   o  The column labeled "msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "section" indicates the section (within this
      document) that explains the semantics and use of this optional
      operation.

   +------------------------+------+-------------------+-----+---------+
   | operation              | code | XDR type          | msg | section |
   +------------------------+------+-------------------+-----+---------+
   | Specify Properties at  | 1    | rpcrdma2_connprop | No  | 5.3.1   |
   | Connection             |      |                   |     |         |
   | Request Property       | 2    | rpcrdma2_reqprop  | No  | 5.3.2   |
   | Modification           |      |                   |     |         |
   | Respond to             | 3    | rpcrdma2_resprop  | No  | 5.3.3   |
   | Modification Request   |      |                   |     |         |
   | Report Updated         | 4    | rpcrdma2_updprop  | No  | 5.3.4   |
   | Properties             |      |                   |     |         |
   +------------------------+------+-------------------+-----+---------+

                                  Table 2

   Support for all of the operations above is OPTIONAL.
   RPC-over-RDMA Version Two implementations that receive an operation
   that is not supported MUST respond with an RDMA2_ERROR message with
   an error code of RDMA2_ERR_INVAL_OPTION.

   The only operation support requirements are as follows:

   o  Implementations which send RDMA2_REQPROP messages must support
      RDMA2_RESPROP messages.

   o  Implementations which support RDMA2_RESPROP or RDMA2_UPDPROP
      messages must also support RDMA2_CONNPROP messages.

5.3.1. RDMA2_CONNPROP: Specify Properties at Connection

   The RDMA2_CONNPROP message type allows an RPC-over-RDMA participant,
   whether client or server, to indicate to its partner relevant
   transport properties that the partner might need to be aware of.

   The message definition for this operation is as follows:

   <CODE BEGINS>

   struct rpcrdma2_connprop {
      rpcrdma2_propset rdma_start;
      rpcrdma2_propsubset rdma_nochg;
   };

   <CODE ENDS>

   All relevant transport properties that the sender is aware of should
   be included in rdma_start. Since support of this request is
   OPTIONAL, and since each of the properties is OPTIONAL as well, the
   sender cannot assume that the receiver will necessarily take note of
   these properties, and so the sender should be prepared for cases in
   which the partner continues to assume that the default value for a
   particular property is still in effect.

   Values of the subset of transport properties specified by rdma_nochg
   are not expected to change during the lifetime of the connection.

   Generally, a participant will send an RDMA2_CONNPROP message as the
   first message after a connection is established. Given that fact,
   the sender should make sure that the message can be received by
   partners who use the default Receive Buffer Size. The connection's
   initial receive buffer size is typically 1KB, but it depends on the
   initial connection state of the RPC-over-RDMA version in use.
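
   The sizing concern above can be checked mechanically. The following
   Python sketch XDR-encodes a small RDMA2_CONNPROP message and compares
   its length with an assumed 1024-byte initial receive buffer. The
   helper names, the sample property values, and the choice to carry
   each property value as an XDR-encoded uint32 inside rdma_data are
   illustrative assumptions, not part of this specification.

```python
import struct

# Constants taken from the XDR in this document.
RDMA2_CONNPROP = 6            # rpcrdma2_proc discriminant
RDMA2_PROPID_RBSIZ = 1
RDMA2_PROPID_BRS = 2
RDMA2_BKREQSUP_INLINE = 1

# Assumption: the "typically 1KB" initial receive buffer size.
DEFAULT_RECV_BUF = 1024


def xdr_uint32(value):
    """Encode one XDR unsigned 32-bit integer (big-endian)."""
    return struct.pack(">I", value)


def xdr_opaque(data):
    """Encode variable-length opaque data: length, bytes, zero padding."""
    pad = (4 - len(data) % 4) % 4
    return xdr_uint32(len(data)) + data + b"\x00" * pad


def encode_propset(props):
    """Encode rpcrdma2_propset: a counted array of (propid, opaque)."""
    out = xdr_uint32(len(props))
    for propid, data in props:
        out += xdr_uint32(propid) + xdr_opaque(data)
    return out


def encode_propsubset(propids):
    """Encode rpcrdma2_propsubset: a counted array of uint32."""
    return xdr_uint32(len(propids)) + b"".join(xdr_uint32(p) for p in propids)


def encode_connprop_message(xid, credit, start, nochg):
    """Encode rdma_xid, rdma_vers, rdma_credit, rdma_proc, then the body."""
    header = (xdr_uint32(xid) + xdr_uint32(2) + xdr_uint32(credit)
              + xdr_uint32(RDMA2_CONNPROP))
    return header + encode_propset(start) + encode_propsubset(nochg)


msg = encode_connprop_message(
    xid=1,
    credit=1,
    start=[(RDMA2_PROPID_RBSIZ, xdr_uint32(4096)),
           (RDMA2_PROPID_BRS, xdr_uint32(RDMA2_BKREQSUP_INLINE))],
    nochg=[RDMA2_PROPID_BRS],
)
fits_default_buffer = len(msg) <= DEFAULT_RECV_BUF
```

   With only the two properties defined in this document, the message
   is far below the assumed limit. The check becomes relevant mainly
   when many properties, or properties with large values, are sent.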

   Properties not included in rdma_start are to be treated by the peer
   endpoint as having the default value and are not allowed to change
   subsequently. The peer should not request changes in such
   properties.

   Those receiving an RDMA2_CONNPROP may encounter properties that they
   do not support or are unaware of. In such cases, these properties
   are simply ignored without any error response being generated.

5.3.2. RDMA2_REQPROP: Request Modification of Properties

   The RDMA2_REQPROP message type allows an RPC-over-RDMA participant,
   whether client or server, to request of its partner that relevant
   transport properties be changed.

   The rdma_xid field allows the request to be tied to a corresponding
   response of type RDMA2_RESPROP (see Section 5.3.3). In assigning the
   value of this field, the sender does not need to avoid conflict with
   XIDs associated with RPC messages or with RDMA2_REQPROP messages
   sent by the peer endpoint.

   The partner need not change the properties as requested by the
   sender, but if it does support the message type, it will generate an
   RDMA2_RESPROP message indicating the disposition of the request.

   The message definition for this operation is as follows:

   <CODE BEGINS>

   struct rpcrdma2_reqprop {
      rpcrdma2_propset rdma_want;
   };

   <CODE ENDS>

   The rpcrdma2_propset rdma_want is a set of transport properties
   together with the desired values requested by the sender.

5.3.3. RDMA2_RESPROP: Respond to Request to Modify Transport Properties

   The RDMA2_RESPROP message type allows an RPC-over-RDMA participant to
   respond to a request to change properties by its partner, indicating
   how the request was dealt with.

   The message definition for this operation is as follows:

   <CODE BEGINS>

   struct rpcrdma2_resprop {
      rpcrdma2_propsubset rdma_done;
      rpcrdma2_propsubset rdma_rejected;
      rpcrdma2_propset rdma_other;
   };

   <CODE ENDS>

   The rdma_xid field of this message must match that used in the
   RDMA2_REQPROP message to which this message is responding.

   The rdma_done field indicates which of the requested transport
   property changes have been effected as requested. For each such
   property, the receiver is entitled to conclude that the requested
   change has been made and that future transmissions may be made based
   on the new value.

   The rdma_rejected field indicates which of the requested transport
   property changes have been rejected by the sender. This may be
   because of any of the following reasons:

   o  The particular property specified is not known or supported by the
      receiver of the RDMA2_REQPROP message.

   o  The implementation receiving the RDMA2_REQPROP message does not
      support modification of this property.

   o  The implementation receiving the RDMA2_REQPROP message has chosen
      to reject the modification for another reason.

   The rdma_other field contains new values for properties where a
   change is requested. The new value of the property is included and
   may be a value different from the original value in effect when the
   change was requested and from the requested value. This is useful
   when the new value of some property is not as large as requested but
   still different from the original value, indicating a partial
   satisfaction of the peer's property change request.

   The sender MUST NOT include rpcrdma2_propval items within rdma_other
   that are for properties other than the ones for which the
   corresponding property request has requested a change. If the
   receiver finds such a situation, it MUST ignore the erroneous
   rpcrdma2_propval items.
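
   The treatment of the three RDMA2_RESPROP fields described so far can
   be sketched as follows. This is a minimal Python illustration under
   stated assumptions: property sets are modeled as dictionaries keyed
   by property ID, the function name apply_resprop is hypothetical, and
   the sketch makes no attempt to validate the response as a whole.

```python
def apply_resprop(current, want, done, rejected, other):
    """Return the property values in effect after an RDMA2_RESPROP.

    current  -- dict: propid -> value in effect before the request
    want     -- dict: propid -> value asked for in RDMA2_REQPROP
    done     -- iterable of propids reported in rdma_done
    rejected -- iterable of propids reported in rdma_rejected
    other    -- iterable of (propid, value) pairs from rdma_other
    """
    updated = dict(current)

    # rdma_done: the requested value is now in effect.
    for propid in done:
        if propid in want:
            updated[propid] = want[propid]

    # rdma_other: the responder chose a value of its own. Items for
    # properties that were never requested are ignored.
    for propid, value in other:
        if propid in want:
            updated[propid] = value

    # rdma_rejected: nothing to do, the previous value stays in effect.
    return updated


# Example: a request to raise property 1 to 16384 is partially
# satisfied with 8192; an unrequested item for property 2 is ignored.
result = apply_resprop(
    current={1: 4096, 2: 0},
    want={1: 16384},
    done=[],
    rejected=[],
    other=[(1, 8192), (2, 2)],
)
```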

   The subsets of properties specified by rdma_done, rdma_rejected, and
   included in rdma_other MUST NOT overlap and, taken together, should
   cover the entire set of properties specified by rdma_want in the
   corresponding request. If the receiver finds such an overlap or
   mismatch, it SHOULD treat properties missing or within the overlap as
   having been rejected.

5.3.4. RDMA2_UPDPROP: Update Transport Properties

   The RDMA2_UPDPROP message type allows an RPC-over-RDMA participant to
   notify the other participant that a change to the transport
   properties has occurred. This is because the sender has decided,
   independently, to modify one or more transport properties and is
   notifying the receiver of these changes.

   The message definition for this operation is as follows:

   <CODE BEGINS>

   struct rpcrdma2_updprop {
      rpcrdma2_propset rdma_now;
   };

   <CODE ENDS>

   rdma_now defines the new property values to be used.

5.4. Extensibility

5.4.1. Additional Properties

   The set of transport properties is designed to be extensible. As a
   result, once new properties are defined in standards track documents,
   the operations defined in this document may reference these new
   transport properties, as well as the ones described in this document.

   A standards track document defining a new transport property should
   include the following information, paralleling that provided in this
   document for the transport properties defined herein.

   o  The rpcrdma2_propid value used to identify this property.

   o  The XDR typedef specifying the form in which the property value is
      communicated.

   o  A description of the transport property that is communicated by
      the sender of RDMA2_CONNPROP and RDMA2_UPDPROP and requested by
      the sender of RDMA2_REQPROP.

   o  An explanation of how this knowledge could be used by the
      participant receiving this information.

   o  Information giving rules governing possible changes of values of
      this property.

   The definition of transport property structures is such as to make it
   easy to assign unique values. There is no requirement that a
   continuous set of values be used, and implementations should not rely
   on all such values being small integers. A unique value should be
   selected when the defining document is first published as an
   Internet-Draft. When the document becomes a standards track
   document, the working group should ensure that:

   o  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   o  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 5.4.2.

   Documents defining new properties fall into a number of categories.

   o  Those defining new properties and explaining (only) how they
      affect use of existing message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable to the operation of those new message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable both to new and existing message types.

   When additional transport properties are proposed, the review of the
   associated standards track document should deal with possible
   security issues raised by those new transport properties.

5.4.2. Experimental Properties

   Given the design of the transport properties data structure, it is
   possible to use the operations to implement experimental, possibly
   unpublished, transport properties.

   rpcrdma2_propid values in the range from 4,294,967,040 to
   4,294,967,295 are reserved for experimental use and these values
   should not be assigned to new properties in standards track
   documents.
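
   The reserved range above corresponds to the top 256 values of the
   32-bit property ID space (0xFFFFFF00 through 0xFFFFFFFF). A short
   Python sketch of a range check follows. The constant and function
   names are hypothetical and are not defined by this document.

```python
# Hypothetical names; the numeric bounds are those given in the text.
RDMA2_PROPID_EXPERIMENTAL_MIN = 4_294_967_040   # 0xFFFFFF00
RDMA2_PROPID_EXPERIMENTAL_MAX = 4_294_967_295   # 0xFFFFFFFF


def is_experimental_propid(propid):
    """Return True if propid lies in the range reserved for experiments."""
    return (RDMA2_PROPID_EXPERIMENTAL_MIN
            <= propid
            <= RDMA2_PROPID_EXPERIMENTAL_MAX)
```

   A document review tool could use such a check to flag a proposed
   standards track property ID that collides with the experimental
   range.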

   When values in this range are used, there is no guarantee of
   successful interoperation among independent implementations.

6. XDR Protocol Definition

   This section contains a description of the core features of the
   RPC-over-RDMA Version Two protocol, expressed in the XDR language
   [RFC4506].

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form. The reader can apply the following shell
   script to this document to produce a machine-readable XDR description
   of the RPC-over-RDMA Version Two protocol without any OPTIONAL
   extensions.

   <CODE BEGINS>

   #!/bin/sh
   grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

   <CODE ENDS>

   That is, if the above script is stored in a file called "extract.sh"
   and this document is in a file called "spec.txt" then the reader can
   do the following to extract an XDR description file:

   <CODE BEGINS>

   sh extract.sh < spec.txt > rpcrdma_corev2.x

   <CODE ENDS>

   Optional extensions to RPC-over-RDMA Version Two, published as
   Standards Track documents, will have similar means of providing XDR
   that describes those extensions. Once XDR for all desired extensions
   is also extracted, it can be appended to the XDR description file
   extracted from this document to produce a consolidated XDR
   description file reflecting all extensions selected for an
   RPC-over-RDMA implementation.

6.1. Code Component License

   Code components extracted from this document must include the
   following license text. When the extracted XDR code is combined with
   other complementary XDR code which itself has an identical license,
   only a single copy of the license text need be preserved.

   <CODE BEGINS>

   /// /*
   ///  * Copyright (c) 2010, 2016 IETF Trust and the persons
   ///  * identified as authors of the code. All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */

   <CODE ENDS>

6.2. RPC-Over-RDMA Version Two XDR

   The XDR defined in this section is used to encode the Transport
   Header Stream in each RPC-over-RDMA Version Two message. The terms
   "Transport Header Stream" and "RPC Payload Stream" are defined in
   Section 4 of [I-D.ietf-nfsv4-rfc5666bis].

   <CODE BEGINS>

   /// /* From RFC 5531, Section 9 */
   /// enum msg_type {
   ///    CALL = 0,
   ///    REPLY = 1
   /// };
   ///
   /// struct rpcrdma2_segment {
   ///    uint32 rdma_handle;
   ///    uint32 rdma_length;
   ///    uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///    uint32 rdma_position;
   ///    struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///    struct rpcrdma2_read_segment rdma_entry;
   ///    struct rpcrdma2_read_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///    struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///    struct rpcrdma2_write_chunk rdma_entry;
   ///    struct rpcrdma2_write_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_chunk_lists {
   ///    enum msg_type rdma_direction;
   ///    uint32 rdma_inv_handle;
   ///    struct rpcrdma2_read_list *rdma_reads;
   ///    struct rpcrdma2_write_list *rdma_writes;
   ///    struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// enum rpcrdma2_errcode {
   ///    RDMA2_ERR_VERS = 1,
   ///    RDMA2_ERR_BAD_XDR = 2,
   ///    RDMA2_ERR_REPLY_RESOURCE = 3,
   ///    RDMA2_ERR_INVAL_PROC = 4,
   ///    RDMA2_ERR_INVAL_OPTION = 5,
   ///    RDMA2_ERR_SYSTEM = 6
   /// };
   ///
   /// struct rpcrdma2_err_vers {
   ///    uint32 rdma_vers_low;
   ///    uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_reply {
   ///    uint32 rdma_segment_index;
   ///    uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
   ///    case RDMA2_ERR_VERS:
   ///       rpcrdma2_err_vers rdma_vrange;
   ///    case RDMA2_ERR_BAD_XDR:
   ///       void;
   ///    case RDMA2_ERR_REPLY_RESOURCE:
   ///       rpcrdma2_err_reply rdma_reply;
   ///    case RDMA2_ERR_INVAL_PROC:
   ///       void;
   ///    case RDMA2_ERR_INVAL_OPTION:
   ///       void;
   ///    case RDMA2_ERR_SYSTEM:
   ///       void;
   /// };
   ///
   /// struct rpcrdma2_optional {
   ///    enum msg_type rdma_optdir;
   ///    uint32 rdma_opttype;
   ///    opaque rdma_optinfo<>;
   /// };
   ///
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///    rpcrdma2_propid rdma_which;
   ///    opaque rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// struct rpcrdma2_connprop {
   ///    rpcrdma2_propset rdma_start;
   ///    rpcrdma2_propsubset rdma_nochg;
   /// };
   ///
   /// struct rpcrdma2_reqprop {
   ///    rpcrdma2_propset rdma_want;
   /// };
   ///
   /// struct rpcrdma2_resprop {
   ///    rpcrdma2_propsubset rdma_done;
   ///    rpcrdma2_propsubset rdma_rejected;
   ///    rpcrdma2_propset rdma_other;
   /// };
   ///
   /// struct rpcrdma2_updprop {
   ///    rpcrdma2_propset rdma_now;
   /// };
   ///
   /// enum rpcrdma2_proc {
   ///    RDMA2_MSG = 0,
   ///    RDMA2_NOMSG = 1,
   ///    RDMA2_ERROR = 4,
   ///    RDMA2_OPTIONAL = 5,
   ///    RDMA2_CONNPROP = 6,
   ///    RDMA2_REQPROP = 7,
   ///    RDMA2_RESPROP = 8,
   ///    RDMA2_UPDPROP = 9
   /// };
   ///
   /// union rpcrdma2_body switch (rpcrdma2_proc rdma_proc) {
   ///    case RDMA2_MSG:
   ///       rpcrdma2_chunk_lists rdma_chunks;
   ///    case RDMA2_NOMSG:
   ///       rpcrdma2_chunk_lists rdma_chunks;
   ///    case RDMA2_ERROR:
   ///       rpcrdma2_error rdma_error;
   ///    case RDMA2_OPTIONAL:
   ///       rpcrdma2_optional rdma_optional;
   ///    case RDMA2_CONNPROP:
   ///       rpcrdma2_connprop rdma_connprop;
   ///    case RDMA2_REQPROP:
   ///       rpcrdma2_reqprop rdma_reqprop;
   ///    case RDMA2_RESPROP:
   ///       rpcrdma2_resprop rdma_resprop;
   ///    case RDMA2_UPDPROP:
   ///       rpcrdma2_updprop rdma_updprop;
   /// };
   ///
   /// struct rpcrdma2_xprt_hdr {
   ///    uint32 rdma_xid;
   ///    uint32 rdma_vers;
   ///    uint32 rdma_credit;
   ///    rpcrdma2_body rdma_body;
   /// };
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const uint32 RDMA2_PROPID_RBSIZ = 1;
   /// const uint32 RDMA2_PROPID_BRS = 2;
   ///
   /// enum rpcrdma2_bkreqsup {
   ///    RDMA2_BKREQSUP_NONE = 0,
   ///    RDMA2_BKREQSUP_INLINE = 1,
   ///    RDMA2_BKREQSUP_GENL = 2
   /// };
   ///
   /// /*
   ///  * Transport property typedefs
   ///  */
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef rpcrdma2_bkreqsup rpcrdma2_prop_brs;

   <CODE ENDS>

6.2.1. Presence Of Payload

   o  When the rdma_proc field has the value RDMA2_MSG, an RPC Payload
      Stream MUST follow the Transport Header Stream in the Send buffer.

   o  When the rdma_proc field has the value RDMA2_ERROR, an RPC Payload
      Stream MUST NOT follow the Transport Header Stream.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL, all, part
      of, or no RPC Payload Stream MAY follow the Transport Header
      Stream in the Send buffer.

6.2.2. Message Direction

   Implementations of RPC-over-RDMA Version Two are REQUIRED to support
   backwards direction operation as described in
   [I-D.ietf-nfsv4-rpcrdma-bidirection]. RPC-over-RDMA Version Two
   introduces the rdma_direction field in its transport header to
   optimize the process of distinguishing between forward- and
   backwards-direction messages.

   The rdma_direction field qualifies the value contained in the
   transport header's rdma_xid field. This enables a receiver to
   reliably avoid performing an XID lookup on incoming
   backwards-direction Call messages.
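
   The benefit of the direction field can be illustrated with a small
   receive-side dispatch routine. The Python sketch below is a
   simplified model under stated assumptions: the function name
   route_incoming and the use of a dictionary keyed by XID for
   outstanding requests are hypothetical, and no real transport is
   involved.

```python
CALL = 0    # msg_type values from the XDR in this document
REPLY = 1


def route_incoming(direction, xid, pending):
    """Classify an incoming message using its direction field.

    pending -- dict mapping XIDs of requests this endpoint has sent
               to whatever context it keeps for them.
    Returns a (kind, context) tuple.
    """
    if direction == CALL:
        # The XID was chosen by the peer, so it cannot match any of
        # our outstanding requests: skip the lookup entirely.
        return ("call", None)

    # direction == REPLY: the XID came from our own XID space.
    context = pending.pop(xid, None)
    if context is None:
        return ("unmatched", None)
    return ("reply", context)


# Example: XID 5 is outstanding locally, yet an incoming Call that
# happens to carry XID 5 is not confused with its reply.
outstanding = {5: "read request"}
first = route_incoming(CALL, 5, outstanding)
second = route_incoming(REPLY, 5, outstanding)
third = route_incoming(REPLY, 9, outstanding)
```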

   In general, when a message carries an XID that was generated by the
   message's receiver (that is, the receiver is acting as a requester),
   the message's sender sets the rdma_direction field to REPLY (1).
   Otherwise the rdma_direction field is set to CALL (0). For example:

   o  When the rdma_proc field has the value RDMA2_MSG or RDMA2_NOMSG,
      the value of the rdma_direction field MUST be the same as the
      value of the associated RPC message's msg_type field.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL and a whole
      or partial RPC message payload is present, the value of the
      rdma_optdir field MUST be the same as the value of the associated
      RPC message's msg_type field.

   o  When the rdma_proc field has the value RDMA2_OPTIONAL and no RPC
      message payload is present, a Requester MUST set the value of the
      rdma_optdir field to CALL, and a Responder MUST set the value of
      the rdma_optdir field to REPLY. The Requester chooses a value for
      the rdma_xid field from the XID space that matches the message's
      direction. Requesters and Responders set the rdma_credit field in
      a similar fashion: a value is set that is appropriate for the
      direction of the message.

   o  When the rdma_proc field has the value RDMA2_ERROR, the direction
      of the message is always Responder-to-Requester (REPLY).

6.2.3. Remote Invalidation

   To request Remote Invalidation, a requester MUST set the value of the
   rdma_inv_handle field in an RPC Call's transport header to a non-zero
   value that matches one of the rdma_handle fields in that header. If
   none of the rdma_handle values in the Call may be invalidated by the
   responder, the requester MUST set the RPC Call's rdma_inv_handle
   field to the value zero.
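
   The requester-side rule above can be sketched in a few lines of
   Python. The function name choose_inv_handle, the representation of
   handles as plain integers, and the policy of picking the first
   eligible handle are illustrative assumptions. The text only requires
   that a non-zero rdma_inv_handle match one of the rdma_handle values
   in the same header.

```python
def choose_inv_handle(handles, may_invalidate):
    """Pick a value for rdma_inv_handle in an RPC Call.

    handles        -- rdma_handle values present in this Call's
                      transport header, in order
    may_invalidate -- set of handles the requester is willing to let
                      the responder invalidate
    Returns a matching non-zero handle, or zero when none qualifies.
    """
    for handle in handles:
        if handle != 0 and handle in may_invalidate:
            return handle
    return 0


# Example: only the second registered handle may be invalidated.
picked = choose_inv_handle([0x10, 0x20], {0x20})
none_allowed = choose_inv_handle([0x10, 0x20], set())
```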

   If the responder chooses not to use Remote Invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the responder MUST use RDMA Send to transmit
   the matching RPC reply.

   If a requester has provided a non-zero value in the RPC Call's
   rdma_inv_handle field and the responder chooses to use Remote
   Invalidation for the matching RPC Reply, the responder MUST use RDMA
   Send With Invalidate to transmit that RPC reply, and MUST use the
   value in the RPC Call's rdma_inv_handle field to construct the Send
   With Invalidate Work Request.

6.2.4. Transport Errors

   Error handling works the same way in RPC-over-RDMA Version Two as it
   does in RPC-over-RDMA Version One, with the addition of several new
   error codes. Version One error handling is described in Section 5 of
   [I-D.ietf-nfsv4-rfc5666bis].

   In all cases below, the sender copies the values of the rdma_xid and
   rdma_vers fields from the incoming transport header that generated
   the error to the transport header of the error response. The
   rdma_proc field is set to RDMA2_ERROR.

   RDMA2_ERR_VERS
      This is the equivalent of ERR_VERS in RPC-over-RDMA Version One.
      The error code value, semantics, and utilization are the same.

   RDMA2_ERR_INVAL_PROC
      This is a new error code in RPC-over-RDMA Version Two. If a
      receiver recognizes the value in the rdma_vers field, but it does
      not recognize the value in the rdma_proc field, it MUST send
      RDMA2_ERR_INVAL_PROC.

   RDMA2_ERR_BAD_XDR
      This is the equivalent of ERR_CHUNK in RPC-over-RDMA Version One,
      with a few extra restrictions. The error code value is the same.

      If a receiver recognizes the values in the rdma_vers and rdma_proc
      fields, but the incoming RPC-over-RDMA transport header cannot be
      parsed, the receiver MUST send RDMA2_ERR_BAD_XDR before Upper
      Layer Protocol processing starts.
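
   The order of the three checks described so far, and the construction
   of the matching error response, can be sketched as follows. This is
   a hedged Python illustration: the function names, the parse_ok flag
   standing in for a real XDR decoder, and the credit value placed in
   the response are assumptions made for the example.

```python
import struct

# Values from the XDR in this document.
RDMA2_ERROR = 4
RDMA2_ERR_VERS = 1
RDMA2_ERR_BAD_XDR = 2
RDMA2_ERR_INVAL_PROC = 4
KNOWN_PROCS = {0, 1, 4, 5, 6, 7, 8, 9}
SUPPORTED_VERS = 2


def check_header(vers, proc, parse_ok):
    """Return the error code to send, or None if the header is usable."""
    if vers != SUPPORTED_VERS:
        return RDMA2_ERR_VERS
    if proc not in KNOWN_PROCS:
        return RDMA2_ERR_INVAL_PROC
    if not parse_ok:
        return RDMA2_ERR_BAD_XDR
    return None


def encode_error(xid, vers, credit, errcode, vers_low=2, vers_high=2):
    """Encode an RDMA2_ERROR header, copying xid and vers from the
    message that triggered the error."""
    out = struct.pack(">IIIII", xid, vers, credit, RDMA2_ERROR, errcode)
    if errcode == RDMA2_ERR_VERS:
        # rpcrdma2_err_vers: range of supported versions.
        out += struct.pack(">II", vers_low, vers_high)
    return out


# Example: a Version Two header carrying an unknown rdma_proc value.
code = check_header(vers=2, proc=42, parse_ok=True)
reply = encode_error(xid=7, vers=2, credit=1, errcode=code)
```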

   RDMA2_ERR_REPLY_RESOURCE
      This is a new error code in RPC-over-RDMA Version Two. If the
      RPC-over-RDMA transport header is otherwise correct, but the
      requester has not provided enough Write or Reply chunk resources
      to transmit the reply, the responder MUST send
      RDMA2_ERR_REPLY_RESOURCE.

      The responder MUST set the rdma_segment_index field to point to
      the first segment in the transport header that is too short, or to
      zero to indicate that it was not possible to determine which
      segment was too small. Indexing starts at one (1), which
      represents the first segment in the first Write chunk (in either
      the Write list or Reply chunk). The responder MUST set the
      rdma_length_needed field to the number of bytes needed in that
      segment in order to convey the reply.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      index and length fields to zero), or it MAY send the request again
      using the same XID and larger reply resources.

   RDMA2_ERR_INVAL_OPTION
      This is a new error code in RPC-over-RDMA Version Two. A receiver
      MUST send RDMA2_ERR_INVAL_OPTION when an RDMA2_OPTIONAL message is
      received and the receiver does not recognize the value in the
      rdma_opttype field.

   RDMA2_ERR_SYSTEM
      This is a new error code in RPC-over-RDMA Version Two. If some
      problem occurs on a receiver that does not fit into the above
      categories, the receiver MAY report it to the sender using the
      error code RDMA2_ERR_SYSTEM. This is a permanent error: a
      requester that receives this error MUST terminate the RPC
      transaction associated with the XID value in the rdma_xid field.

7. Protocol Version Negotiation

   When an RPC-over-RDMA Version Two client establishes a connection to
   a server, the first order of business is to determine the server's
   highest supported protocol version.

   As with RPC-over-RDMA Version One, a client MUST assume the ability
   to exchange only a single RPC-over-RDMA message at a time until it
   receives a valid non-error RPC-over-RDMA message from the server that
   reports the server's credit limit.

   First, the client sends a single valid RPC-over-RDMA message with the
   value two (2) in the rdma_vers field. Because the server might
   support only RPC-over-RDMA Version One, this initial message can be
   no larger than the Version One default inline threshold of 1024
   bytes.

7.1. Server Does Support RPC-over-RDMA Version Two

   If the server does support RPC-over-RDMA Version Two, it sends
   RPC-over-RDMA messages back to the client with the value two (2) in
   the rdma_vers field. Both peers may use the default inline threshold
   value for RPC-over-RDMA Version Two connections (4096 bytes).

7.2. Server Does Not Support RPC-over-RDMA Version Two

   If the server does not support RPC-over-RDMA Version Two, it MUST
   send an RPC-over-RDMA message to the client with the same XID, with
   RDMA2_ERROR in the rdma_proc field, and with the error code
   RDMA2_ERR_VERS. This message also reports a range of protocol
   versions that the server supports. To continue operation, the client
   selects a protocol version in the range of server-supported versions
   for subsequent messages on this connection.

   If the connection is lost immediately after an RDMA2_ERROR /
   RDMA2_ERR_VERS message is received, a client can avoid a possible
   version negotiation loop when re-establishing another connection by
   assuming that particular server does not support RPC-over-RDMA
   Version Two. A client can assume the same situation (no server
   support for RPC-over-RDMA Version Two) if the initial negotiation
   message is lost or dropped.
   Once the negotiation exchange is complete, both peers may use the
   default inline threshold value for the transport protocol version
   that has been selected.

7.3. Client Does Not Support RPC-over-RDMA Version Two

   If the server supports the RPC-over-RDMA protocol version used in
   Call messages from a client, it MUST send Replies with the same
   RPC-over-RDMA protocol version that the client uses to send its
   Calls.

7.4. Security Considerations

   The security considerations for RPC-over-RDMA Version Two are the
   same as those for RPC-over-RDMA Version One.

7.4.1. Security Considerations (Transport Properties)

   Like other fields that appear in each RPC-over-RDMA header, property
   information is sent in the clear on the fabric with no integrity
   protection, making it vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size or the Requester Remote Invalidation boolean, it
   could reduce connection performance or trigger loss of connection.
   Repeated connection loss can impact performance or even prevent a new
   connection from being established. Recourse is to deploy on a
   private network or use link-layer encryption.

8. IANA Considerations

   There are no IANA considerations at this time.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009.

9.2. Informative References

   [I-D.ietf-nfsv4-rfc5666bis]
              Lever, C., Simpson, W., and T.
              Talpey, "Remote Direct Memory Access Transport for Remote
              Procedure Call, Version One",
              draft-ietf-nfsv4-rfc5666bis-08 (work in progress),
              November 2016.

   [I-D.ietf-nfsv4-rpcrdma-bidirection]
              Lever, C., "Bi-directional Remote Procedure Call On
              RPC-over-RDMA Transports",
              draft-ietf-nfsv4-rpcrdma-bidirection-05 (work in
              progress), June 2016.

   [IB]       InfiniBand Trade Association, "InfiniBand Architecture
              Specifications".

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC5666]  Talpey, T. and B. Callaghan, "Remote Direct Memory Access
              Transport for Remote Procedure Call", RFC 5666,
              DOI 10.17487/RFC5666, January 2010.

Appendix A. Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA Version One specification
   [RFC5666]. The authors also wish to thank Bill Baker, Greg Marsden,
   and Matt Benjamin for their support of this work.

   The extract.sh shell script and formatting conventions were first
   described by the authors of the NFSv4.1 XDR specification [RFC5662].

   Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
   nfsv4 Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   USA

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com

   David Noveck
   Hewlett Packard Enterprise
   165 Dascomb Road
   Andover, MA 01810
   USA

   Phone: +1 978 474 2011
   Email: davenoveck@gmail.com