idnits 2.17.1 draft-cel-nfsv4-rpcrdma-version-two-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 503 has weird spacing: '... bool rdma...' -- The document date (October 25, 2016) is 2730 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-11) exists of draft-ietf-nfsv4-rfc5666bis-07 == Outdated reference: A later version (-08) exists of draft-ietf-nfsv4-rpcrdma-bidirection-05 -- Obsolete informational reference (is this intentional?): RFC 5661 (Obsoleted by RFC 8881) -- Obsolete informational reference (is this intentional?): RFC 5666 (Obsoleted by RFC 8166) Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network File System Version 4 C. Lever, Ed. 3 Internet-Draft Oracle 4 Intended status: Standards Track D. 
Noveck 5 Expires: April 28, 2017 HPE 6 October 25, 2016 8 RPC-over-RDMA Version Two Protocol 9 draft-cel-nfsv4-rpcrdma-version-two-02 11 Abstract 13 This document specifies an improved protocol for conveying Remote 14 Procedure Call (RPC) messages on physical transports capable of 15 Remote Direct Memory Access (RDMA), based on RPC-over-RDMA Version 16 One. 18 Requirements Language 20 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 21 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 22 document are to be interpreted as described in [RFC2119]. 24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on April 28, 2017. 41 Copyright Notice 43 Copyright (c) 2016 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 
56 Table of Contents 58 1. Introduction 59 2. Inline Threshold 60 2.1. Terminology 61 2.2. Motivation 62 2.3. Default Values 63 3. Remote Invalidation 64 3.1. Backward-Direction Remote Invalidation 65 4. Protocol Extensibility 66 4.1. Optional Features 67 4.2. Message Direction 68 4.3. Documentation Requirements 69 5. XDR Protocol Definition 70 5.1. Code Component License 71 5.2. RPC-Over-RDMA Version Two XDR 72 6. Protocol Version Negotiation 73 6.1. Responder Does Support RPC-over-RDMA Version Two 74 6.2. Responder Does Not Support RPC-over-RDMA Version Two 75 6.3. Requester Does Not Support RPC-over-RDMA Version Two 76 7. Security Considerations 77 8. IANA Considerations 78 9. References 79 9.1. Normative References 80 9.2. Informative References 81 Appendix A. Acknowledgments 82 Authors' Addresses 84 1. Introduction 86 Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a 87 technique for moving data efficiently between end nodes.
By 88 directing data into destination buffers as it is sent on a network 89 and placing it via direct memory access by hardware, the 90 complementary benefits of faster transfers and reduced host overhead 91 are obtained. 93 A protocol already exists that enables ONC RPC [RFC5531] messages to 94 be conveyed on RDMA transports. That protocol is RPC-over-RDMA 95 Version One, specified in [I-D.ietf-nfsv4-rfc5666bis]. RPC-over-RDMA 96 Version One is deployed and in use, though there are some 97 shortcomings to this protocol, such as: 99 o The use of small Receive buffers forces the use of RDMA Read and 100 Write transfers for small payloads, and limits the size of 101 backchannel messages. 103 o Lack of support for potential optimizations, such as remote 104 invalidation, that require changes to on-the-wire behavior. 106 To address these issues in a way that is compatible with existing 107 RPC-over-RDMA Version One deployments, a new version of RPC-over-RDMA 108 is presented in this document. RPC-over-RDMA Version Two contains 109 only incremental changes over RPC-over-RDMA Version One to facilitate 110 adoption of Version Two by existing Version One implementations. 112 The major new feature in RPC-over-RDMA Version Two is extensibility 113 of the RPC-over-RDMA header. Extensibility enables narrow changes to 114 RPC-over-RDMA Version Two so that new optional capabilities can be 115 introduced without a protocol version change and while maintaining 116 interoperability with existing implementations. New capabilities can 117 be proposed and developed independently of each other, and 118 implementers can choose among them. It should be straightforward 119 to create and document experimental features and then bring them 120 through the standards process. 122 In addition to extensibility, the default inline threshold value is 123 larger in RPC-over-RDMA Version Two.
This change is driven by the 124 increase in average size of RPC messages containing common NFS 125 operations. With NFSv4.1 [RFC5661] and later, compound operations 126 convey more data per RPC message. The default 1KB inline threshold 127 in RPC-over-RDMA Version One prevents attaining the best possible 128 performance. 130 Other new features include support for Remote Invalidation. 132 2. Inline Threshold 134 2.1. Terminology 136 The term "inline threshold" is defined in Section 4 of 137 [I-D.ietf-nfsv4-rfc5666bis]. An "inline threshold" value is the 138 largest message size (in octets) that can be conveyed in one 139 direction on an RDMA connection using only RDMA Send and Receive. 140 Each connection has two inline threshold values: one for messages 141 flowing from requester to responder (referred to as the "call inline 142 threshold"), and one for messages flowing from responder to requester 143 (referred to as the "reply inline threshold"). Inline threshold 144 values are not advertised to peers via the base RPC-over-RDMA Version 145 Two protocol. 147 A connection's inline threshold determines when RDMA Read or Write 148 operations are required because the RPC message to be sent cannot be 149 conveyed via RDMA Send and Receive. When such an RPC message does not 150 contain DDP-eligible data items, a requester prepares a Long Call or 151 Reply to convey the whole RPC message using RDMA Read or Write 152 operations. 154 2.2. Motivation 156 RDMA Read and Write operations require that each data payload reside 157 in a region of memory that is registered with the RNIC. When an RPC 158 is complete, that region is invalidated, fencing it from the 159 responder. 161 Both registration and invalidation have a latency cost that is 162 insignificant compared to data handling costs when the payload is large. When a data payload 163 is small, however, the cost of registering and invalidating the 164 memory where the payload resides becomes a relatively significant 165 part of total RPC latency.
Therefore, the most efficient operation of 166 RPC-over-RDMA occurs when RDMA Read and Write operations are used for 167 large payloads, and avoided for small payloads. 169 When RPC-over-RDMA Version One was conceived, the typical size of RPC 170 messages that did not involve a significant data payload was under 171 500 bytes. A 1024-byte inline threshold adequately minimized the 172 frequency of inefficient Long Calls and Replies. 174 Starting with NFSv4.1 [RFC5661], NFS COMPOUND RPC messages are larger 175 and more complex than before. With a 1024-byte inline threshold, 176 RDMA Read or Write operations are needed for frequent operations that 177 do not bear a data payload, such as GETATTR and LOOKUP, reducing the 178 efficiency of the transport. 180 To reduce the need to use Long Calls and Replies, RPC-over-RDMA 181 Version Two increases the default inline threshold size. This also 182 increases the maximum size of backward direction RPC messages. 184 2.3. Default Values 186 RPC-over-RDMA Version Two receiver implementations MUST support an 187 inline threshold of 4096 bytes, but MAY support larger inline 188 threshold values. A mechanism for discovering a peer's preferred 189 inline threshold value (not defined in this document) may be used to 190 optimize RDMA Send operations further. In the absence of such a 191 mechanism, senders MUST assume a receiver's inline threshold is 4096 192 bytes. 194 The new default inline threshold size is no larger than the size of a 195 hardware page on typical platforms. This conserves the resources 196 needed to Send and Receive base level RPC-over-RDMA Version Two 197 messages, enabling RPC-over-RDMA Version Two to be used on a broad 198 variety of hardware. 200 3. Remote Invalidation 202 An STag that is registered using the FRWR mechanism (in a privileged 203 execution context), or is registered via a Memory Window (in user 204 space), may be invalidated remotely [RFC5040].
These mechanisms are 205 available only when a requester's RNIC supports MEM_MGT_EXTENSIONS. 207 For the purposes of this discussion, there are two classes of STags. 208 Dynamically-registered STags are used in a single RPC, then 209 invalidated. Persistently-registered STags live longer than one RPC. 210 They may persist for the life of an RPC-over-RDMA connection, or 211 longer. 213 An RPC-over-RDMA requester may provide more than one STag in one 214 transport header. It may provide a combination of dynamically- and 215 persistently-registered STags in one RPC message, or any combination 216 of these in a series of RPCs on the same connection. Only 217 dynamically-registered STags using Memory Windows or FRWR (i.e., 218 registered via MEM_MGT_EXTENSIONS) may be invalidated remotely. 220 There is no transport-level mechanism by which a responder can 221 determine how a requester-provided STag was registered, nor whether 222 it is eligible to be invalidated remotely. A requester that mixes 223 persistently- and dynamically-registered STags in one RPC, or mixes 224 them across RPCs on the same connection, must therefore indicate 225 which handles may be invalidated via a mechanism provided in the 226 Upper Layer Protocol. RPC-over-RDMA Version Two provides such a 227 mechanism. 229 The RDMA Send With Invalidate operation is used to invalidate an STag 230 on a remote system. It is available only when a responder's RNIC 231 supports MEM_MGT_EXTENSIONS, and must be used only when a 232 requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and 233 recognize an IETH). 235 3.1. Backward-Direction Remote Invalidation 237 Existing RPC-over-RDMA protocol specifications 238 [I-D.ietf-nfsv4-rfc5666bis] [I-D.ietf-nfsv4-rpcrdma-bidirection] do 239 not forbid direct data placement in the backward direction, even 240 though there is currently no Upper Layer Protocol that may use it.
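The eligibility rules stated in Section 3 can be illustrated with a short, non-normative sketch. It assumes a requester tracks, for each STag, how the memory was registered and whether the registration is dynamic. The names Registration, STag, remotely_invalidatable, and may_send_with_invalidate are hypothetical and appear nowhere in the protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Registration(Enum):
    """How the requester registered the memory behind an STag (hypothetical labels)."""
    FRWR = "frwr"                    # requires MEM_MGT_EXTENSIONS
    MEMORY_WINDOW = "memory_window"  # requires MEM_MGT_EXTENSIONS
    OTHER = "other"                  # any mechanism that cannot be invalidated remotely


@dataclass(frozen=True)
class STag:
    handle: int
    registration: Registration
    dynamic: bool  # True: used in a single RPC, then invalidated


def remotely_invalidatable(stag: STag) -> bool:
    """Only dynamically-registered FRWR or Memory Window STags qualify."""
    uses_extensions = stag.registration in (
        Registration.FRWR,
        Registration.MEMORY_WINDOW,
    )
    return stag.dynamic and uses_extensions


def may_send_with_invalidate(responder_has_ext: bool, requester_has_ext: bool) -> bool:
    """Send With Invalidate requires MEM_MGT_EXTENSIONS support on both RNICs."""
    return responder_has_ext and requester_has_ext
```

Because a responder cannot observe any of these properties on the wire, only the requester can evaluate a check of this kind; the responder relies on what the requester indicates in the transport header.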
242 When chunks are present in a backward-direction RPC request, Remote 243 Invalidation allows the responder to trigger invalidation of a 244 requester's STags as part of sending a reply, the same as in the 245 forward direction. 247 However, in the backward direction, the server acts as the requester, 248 and the client is the responder. The server's RNIC, therefore, must 249 support receiving an IETH, and the server must have registered the 250 STags with an appropriate registration mechanism. 252 4. Protocol Extensibility 254 The core RPC-over-RDMA Version Two header format is specified in 255 Section 5 as a complete and stand-alone piece of XDR. Any change to 256 this XDR description requires a protocol version number change. 258 4.1. Optional Features 260 RPC-over-RDMA Version Two introduces the ability to extend the core 261 protocol via optional features. Extensibility enables minor protocol 262 issues to be addressed and incremental enhancements to be made 263 without the need to change the protocol version. The key capability 264 is that both sides can detect whether a feature is supported by their 265 peer or not. With this ability, OPTIONAL features can be introduced 266 over time to an otherwise stable protocol. 268 The rdma_opttype field carries a 32-bit unsigned integer. The value 269 in this field denotes an optional operation that MAY be supported by 270 the receiver. The values of this field and their meaning are defined 271 in other Standards Track documents. 273 The rdma_optinfo field carries opaque data. The content of this 274 field is data meaningful to the optional operation denoted by the 275 value in rdma_opttype. 
The content of this field is not defined in 276 the base RPC-over-RDMA Version Two protocol, but is defined in other 277 Standards Track documents. 279 When an implementation does not recognize or support the value 280 contained in the rdma_opttype field, it MUST send an RPC-over-RDMA 281 message with the rdma_xid field set to the same value as in the 282 erroneous message, the rdma_proc field set to RDMA2_ERROR, and the 283 rdma_err field set to RDMA2_ERR_INVAL_OPTION. 285 4.2. Message Direction 287 Backward direction operation depends on the ability of the receiver 288 to distinguish between incoming forward and backward direction calls 289 and replies. This needs to be done because both the XID field and 290 the flow control value (RPC-over-RDMA credits) in the RPC-over-RDMA 291 header are interpreted in the context of each message's direction. 293 A receiver typically distinguishes message direction by examining the 294 mtype field in the RPC header of each incoming payload message. 295 However, RDMA2_OPTIONAL type messages might not carry an RPC message 296 payload. 298 To enable RDMA2_OPTIONAL type messages that do not carry an RPC 299 message payload to be interpreted unambiguously, the rpcrdma2_optional 300 structure contains a field that identifies the message direction. A 301 similar field has been added to the rpcrdma2_chunk_lists 302 structure to simplify parsing the RPC-over-RDMA 303 header at the receiver. 305 4.3. Documentation Requirements 307 RPC-over-RDMA Version Two may be extended by defining a new 308 rdma_opttype value, and then by providing an XDR description of the 309 rdma_optinfo content that corresponds with the new rdma_opttype 310 value. As a result, a new header type is effectively created. 312 A Standards Track document introduces each set of such protocol 313 elements. Together these elements are considered an OPTIONAL 314 feature.
Each implementation is either aware of all the protocol 315 elements introduced by that feature, or is aware of none of them. 317 Documents describing extensions to RPC-over-RDMA Version Two should 318 contain: 320 o An explanation of the purpose and use of each new protocol element 321 added 323 o An XDR description of the protocol elements, and a script to 324 extract it 326 o A mechanism for reporting errors when the error is outside the 327 choices already available in the base protocol or in 328 other extensions 330 o An indication of whether a Payload stream must be present, and a 331 description of its contents 333 o A description of interactions with existing extensions 335 The last bullet includes requirements that another OPTIONAL feature 336 needs to be present for new protocol elements to work, or that a 337 particular level of support be provided for some particular facility 338 for the new extension to work. 340 Implementers combine the XDR descriptions of the new features they 341 intend to use with the XDR description of the base protocol in this 342 document. This may be necessary to create a valid XDR input file 343 because extensions are free to use XDR types defined in the base 344 protocol, and later extensions may use types defined by earlier 345 extensions. 347 The XDR description for the RPC-over-RDMA Version Two protocol 348 combined with that for any selected extensions should provide an 349 adequate human-readable description of the extended protocol. 351 5. XDR Protocol Definition 353 This section contains a description of the core features of the RPC- 354 over-RDMA Version Two protocol, expressed in the XDR language 355 [RFC4506]. 357 This description is provided in a way that makes it simple to extract 358 into ready-to-compile form.
The reader can apply the following shell 359 script to this document to produce a machine-readable XDR description 360 of the RPC-over-RDMA Version Two protocol without any OPTIONAL 361 extensions. 363 365 #!/bin/sh 366 grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??' 368 370 That is, if the above script is stored in a file called "extract.sh" 371 and this document is in a file called "spec.txt", then the reader can 372 do the following to extract an XDR description file: 374 376 sh extract.sh < spec.txt > rpcrdma_corev2.x 378 380 Optional extensions to RPC-over-RDMA Version Two, published as 381 Standards Track documents, will have similar means of providing XDR 382 that describes those extensions. Once XDR for all desired extensions 383 is also extracted, it can be appended to the XDR description file 384 extracted from this document to produce a consolidated XDR 385 description file reflecting all extensions selected for an RPC-over- 386 RDMA implementation. 388 5.1. Code Component License 390 Code components extracted from this document must include the 391 following license text. When the extracted XDR code is combined with 392 other complementary XDR code which itself has an identical license, 393 only a single copy of the license text need be preserved. 395 397 /// /* 398 /// * Copyright (c) 2010, 2016 IETF Trust and the persons 399 /// * identified as authors of the code. All rights reserved. 400 /// * 401 /// * The authors of the code are: 402 /// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck. 403 /// * 404 /// * Redistribution and use in source and binary forms, with 405 /// * or without modification, are permitted provided that the 406 /// * following conditions are met: 407 /// * 408 /// * - Redistributions of source code must retain the above 409 /// * copyright notice, this list of conditions and the 410 /// * following disclaimer.
411 /// * 412 /// * - Redistributions in binary form must reproduce the above 413 /// * copyright notice, this list of conditions and the 414 /// * following disclaimer in the documentation and/or other 415 /// * materials provided with the distribution. 416 /// * 417 /// * - Neither the name of Internet Society, IETF or IETF 418 /// * Trust, nor the names of specific contributors, may be 419 /// * used to endorse or promote products derived from this 420 /// * software without specific prior written permission. 421 /// * 422 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 423 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 424 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 425 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 426 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 427 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 428 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 429 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 430 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 431 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 432 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 433 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 434 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 435 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 436 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 437 /// */ 439 441 5.2. RPC-Over-RDMA Version Two XDR 443 The XDR defined in this section is used to encode the Transport 444 Header Stream in each RPC-over-RDMA Version Two message. The terms 445 "Transport Header Stream" and "RPC Payload Stream" are defined in 446 Section 4 of [I-D.ietf-nfsv4-rfc5666bis]. 
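To make the wire layout produced by the XDR in this section concrete, the following non-normative sketch encodes the simplest possible Transport Header Stream: an RDMA2_MSG header whose Read list, Write list, and Reply chunk are all absent. It assumes standard XDR encoding per [RFC4506], where every field here occupies four big-endian octets and an absent optional-data item ("*" in the XDR) is encoded as a zero discriminant. The function name encode_rdma2_msg_header and the constant names are hypothetical.

```python
import struct

RPCRDMA_VERSION_TWO = 2  # value carried in rdma_vers
RDMA2_MSG = 0            # rpcrdma2_proc value
CALL, REPLY = 0, 1       # msg_type values from RFC 5531


def encode_rdma2_msg_header(xid: int, credits: int, direction: int,
                            inv_handle: int = 0) -> bytes:
    """Encode an rpcrdma2_xprt_hdr carrying RDMA2_MSG with empty chunk lists.

    XDR enums are signed 32-bit integers; the values used here are
    non-negative, so packing them as unsigned words yields the same octets.
    """
    return struct.pack(
        ">9I",
        xid,                  # rdma_xid
        RPCRDMA_VERSION_TWO,  # rdma_vers
        credits,              # rdma_credit
        RDMA2_MSG,            # rdma_proc (union discriminant)
        direction,            # rdma_direction
        inv_handle,           # rdma_inv_handle (zero: nothing to invalidate)
        0,                    # rdma_reads absent
        0,                    # rdma_writes absent
        0,                    # rdma_reply absent
    )


header = encode_rdma2_msg_header(xid=0x1A2B3C4D, credits=32, direction=CALL)
```

Under these assumptions the header occupies nine XDR words (36 octets), and the RPC Payload Stream would follow it in the same Send buffer.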
448 450 /// /* From RFC 5531, Section 9 */ 451 /// enum msg_type { 452 /// CALL = 0, 453 /// REPLY = 1 454 /// }; 455 /// 456 /// struct rpcrdma2_segment { 457 /// uint32 rdma_handle; 458 /// uint32 rdma_length; 459 /// uint64 rdma_offset; 460 /// }; 461 /// 462 /// struct rpcrdma2_read_segment { 463 /// uint32 rdma_position; 464 /// struct rpcrdma2_segment rdma_target; 465 /// }; 466 /// 467 /// struct rpcrdma2_read_list { 468 /// struct rpcrdma2_read_segment rdma_entry; 469 /// struct rpcrdma2_read_list *rdma_next; 470 /// }; 471 /// 472 /// struct rpcrdma2_write_chunk { 473 /// struct rpcrdma2_segment rdma_target<>; 474 /// }; 475 /// 476 /// struct rpcrdma2_write_list { 477 /// struct rpcrdma2_write_chunk rdma_entry; 478 /// struct rpcrdma2_write_list *rdma_next; 479 /// }; 480 /// 481 /// struct rpcrdma2_chunk_lists { 482 /// enum msg_type rdma_direction; 483 /// uint32 rdma_inv_handle; 484 /// struct rpcrdma2_read_list *rdma_reads; 485 /// struct rpcrdma2_write_list *rdma_writes; 486 /// struct rpcrdma2_write_chunk *rdma_reply; 487 /// }; 488 /// 489 /// enum rpcrdma2_errcode { 490 /// RDMA2_ERR_VERS = 1, 491 /// RDMA2_ERR_BAD_XDR = 2, 492 /// RDMA2_ERR_CANT_REPLY = 3, 493 /// RDMA2_ERR_INVAL_PROC = 4, 494 /// RDMA2_ERR_INVAL_OPTION = 5 495 /// }; 496 /// 497 /// struct rpcrdma2_err_vers { 498 /// uint32 rdma_vers_low; 499 /// uint32 rdma_vers_high; 500 /// }; 501 /// 502 /// struct rpcrdma2_err_reply { 503 /// bool rdma_processed; 504 /// uint32 rdma_segment_index; 505 /// uint32 rdma_length_needed; 506 /// }; 507 /// 508 /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) { 509 /// case RDMA2_ERR_VERS: 510 /// rpcrdma2_err_vers rdma_vrange; 511 /// case RDMA2_ERR_BAD_XDR: 512 /// void; 513 /// case RDMA2_ERR_CANT_REPLY: 514 /// rpcrdma2_err_reply rdma_reply; 515 /// case RDMA2_ERR_INVAL_PROC: 516 /// void; 517 /// case RDMA2_ERR_INVAL_OPTION: 518 /// void; 519 /// }; 520 /// 521 /// struct rpcrdma2_optional { 522 /// enum msg_type rdma_optdir; 523 
/// uint32 rdma_opttype; 524 /// opaque rdma_optinfo<>; 525 /// }; 526 /// 527 /// enum rpcrdma2_proc { 528 /// RDMA2_MSG = 0, 529 /// RDMA2_NOMSG = 1, 530 /// RDMA2_ERROR = 4, 531 /// RDMA2_OPTIONAL = 5 532 /// }; 533 /// 534 /// union rpcrdma2_body switch (rpcrdma2_proc rdma_proc) { 535 /// case RDMA2_MSG: 537 /// rpcrdma2_chunk_lists rdma_chunks; 538 /// case RDMA2_NOMSG: 539 /// rpcrdma2_chunk_lists rdma_chunks; 540 /// case RDMA2_ERROR: 541 /// rpcrdma2_error rdma_error; 542 /// case RDMA2_OPTIONAL: 543 /// rpcrdma2_optional rdma_optional; 544 /// }; 545 /// 546 /// struct rpcrdma2_xprt_hdr { 547 /// uint32 rdma_xid; 548 /// uint32 rdma_vers; 549 /// uint32 rdma_credit; 550 /// rpcrdma2_body rdma_body; 551 /// }; 553 555 5.2.1. Presence Of Payload 557 o When the rdma_proc field has the value RDMA2_MSG, an RPC Payload 558 Stream MUST follow the Transport Header Stream in the Send buffer. 560 o When the rdma_proc field has the value RDMA2_ERROR, an RPC Payload 561 Stream MUST NOT follow the Transport Header Stream. 563 o When the rdma_proc field has the value RDMA2_OPTIONAL, all, part 564 of, or no RPC Payload Stream MAY follow the Transport Header 565 Stream in the Send buffer. 567 5.2.2. Message Direction 569 Implementations of RPC-over-RDMA Version Two are REQUIRED to support 570 backward direction operation as described in 571 [I-D.ietf-nfsv4-rpcrdma-bidirection]. 573 o When the rdma_proc field has the value RDMA2_MSG or RDMA2_NOMSG, 574 the value of the rdma_direction field MUST be the same as the 575 value of the associated RPC message's msg_type field. 577 o When the rdma_proc field has the value RDMA2_ERROR, the direction 578 of the message is always Responder-to-Requester (REPLY). 580 o When the rdma_proc field has the value RDMA2_OPTIONAL and a whole 581 or partial RPC message payload is present, the value of the 582 rdma_optdir field MUST be the same as the value of the associated 583 RPC message's msg_type field.
585 o When the rdma_proc field has the value RDMA2_OPTIONAL and no RPC 586 message payload is present, a Requester MUST set the value of the 587 rdma_optdir field to CALL, and a Responder MUST set the value of 588 the rdma_optdir field to REPLY. The Requester chooses a value for 589 the rdma_xid field from the XID space that matches the message's 590 direction. Requesters and Responders set the rdma_credit field in 591 a similar fashion: a value is set that is appropriate for the 592 direction of the message. 594 5.2.3. Remote Invalidation 596 Among the set of handles in the RPC Call's transport header, the 597 requester selects one handle that may be invalidated remotely. The 598 requester sets the rdma_inv_handle field to that value. If none of 599 the rdma_handle values in the Call may be invalidated by the 600 responder, the requester MUST set the rdma_inv_handle field to the 601 value zero. The requester MUST NOT set the value of the 602 rdma_inv_handle field to any other value. 604 The responder copies the value of the rdma_inv_handle field set by 605 the requester to the rdma_inv_handle field in the matching reply. If 606 the rdma_inv_handle field contains zero, the responder MUST NOT use 607 RDMA Send With Invalidate to transmit the matching RPC reply. 608 Otherwise, the responder SHOULD use RDMA Send With Invalidate to 609 transmit the reply, specifying the value in the rdma_inv_handle field 610 as the handle to be invalidated remotely. The responder MUST NOT 611 specify any other handle for this operation. 613 5.2.4. Transport Errors 615 Error handling works the same way in RPC-over-RDMA Version Two as it 616 does in RPC-over-RDMA Version One, with the addition of several new 617 error codes. Version One error handling is described in Section 5 of 618 [I-D.ietf-nfsv4-rfc5666bis].
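Returning briefly to the rdma_inv_handle rules in Section 5.2.3, the following non-normative sketch shows one way the requester and responder decisions could be expressed. The function names, the tuple returned by plan_reply_send, and the "SEND" and "SEND_WITH_INVALIDATE" labels are hypothetical. The sketch assumes that zero is never a valid handle, and that the responder always follows the SHOULD by using Send With Invalidate whenever it is permitted.

```python
def choose_inv_handle(handles, invalidatable):
    """Requester side: pick one handle the responder may invalidate, else zero.

    `handles` lists the rdma_handle values in the Call's transport header;
    `invalidatable` is the subset the requester allows to be invalidated.
    """
    for handle in handles:
        if handle in invalidatable:
            return handle
    return 0  # no handle in this Call may be invalidated remotely


def plan_reply_send(call_inv_handle):
    """Responder side: decide how to transmit the matching reply.

    Returns (send_operation, handle_to_invalidate, reply_rdma_inv_handle).
    The rdma_inv_handle value from the Call is always copied into the reply.
    """
    if call_inv_handle == 0:
        # Send With Invalidate MUST NOT be used for this reply.
        return ("SEND", None, 0)
    # Only the handle named by the requester may be invalidated.
    return ("SEND_WITH_INVALIDATE", call_inv_handle, call_inv_handle)
```

For example, a Call carrying handles 0x10 and 0x20, of which only 0x20 is dynamically registered, would set rdma_inv_handle to 0x20, and the responder would invalidate exactly that handle when sending the reply.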
620 In all cases below, the sender copies the values of the rdma_xid and 621 rdma_vers fields from the incoming transport header that generated 622 the error to the transport header of the error response. The rdma_proc 623 field is set to RDMA2_ERROR. 625 RDMA2_ERR_VERS 626 This is the equivalent of ERR_VERS in RPC-over-RDMA Version One. 627 The error code value, semantics, and use are the same. 629 RDMA2_ERR_INVAL_PROC 630 This is a new error code in RPC-over-RDMA Version Two. If a 631 receiver recognizes the value in the rdma_vers field, but it does 632 not recognize the value in the rdma_proc field, it MUST send 633 RDMA2_ERR_INVAL_PROC. 635 RDMA2_ERR_BAD_XDR 636 This is the equivalent of ERR_CHUNK in RPC-over-RDMA Version One, 637 with a few extra restrictions; the error code value is the same. 638 If a receiver recognizes the value in the rdma_proc field but the 639 incoming RPC-over-RDMA transport header cannot be parsed, it MUST 640 send RDMA2_ERR_BAD_XDR before Upper Layer Protocol processing 641 starts. 643 RDMA2_ERR_CANT_REPLY 644 This is a new error code in RPC-over-RDMA Version Two. If a 645 message is otherwise correct but the requester has not provided 646 enough Write or Reply chunk resources to transmit the reply, the 647 responder MUST send RDMA2_ERR_CANT_REPLY. The responder MUST set 648 the rdma_processed field to TRUE if the responder discovered the 649 shortage after the Upper Layer Protocol has finished processing 650 the request; otherwise the field MUST be set to FALSE. The 651 responder MUST set the rdma_segment_index field to point to the 652 first segment in the transport header that is too short, or to 653 zero to indicate that it was not possible to determine which 654 segment was too small. Indexing starts at one (1), which 655 represents the first segment in the first Write chunk (in either 656 the Write list or Reply chunk).
The responder MUST set the 657 rdma_length_needed field to the number of bytes needed in that segment 658 in order to convey the reply. Upon receipt of this error code, a 659 requester may choose to terminate the operation (for instance, if 660 the responder set both fields above to zero), or it may send the 661 request again using the same XID and larger reply resources. 663 RDMA2_ERR_INVAL_OPTION 664 This is a new error code in RPC-over-RDMA Version Two. A receiver 665 MUST send RDMA2_ERR_INVAL_OPTION when an RDMA2_OPTIONAL message is 666 received and the receiver does not recognize the value in the 667 rdma_opttype field. 669 6. Protocol Version Negotiation 671 When an RPC-over-RDMA Version Two requester establishes a connection 672 to a responder, the first order of business is to determine the 673 responder's highest supported protocol version. 675 As with RPC-over-RDMA Version One, a requester MUST assume the 676 ability to exchange only a single RPC-over-RDMA message at a time 677 until it receives a non-error RPC-over-RDMA message from the 678 responder that reports the responder's actual credit limit. 680 First, the requester sends a single valid RPC-over-RDMA message with 681 the value two (2) in the rdma_vers field. Because the responder 682 might support only RPC-over-RDMA Version One, this initial message 683 can be no larger than the Version One default inline threshold of 684 1024 bytes. 686 6.1. Responder Does Support RPC-over-RDMA Version Two 688 If the responder does support RPC-over-RDMA Version Two, it sends an 689 RPC-over-RDMA message back to the requester with the same XID 690 containing a valid non-error response. Subsequently, both peers use 691 the default inline threshold value for RPC-over-RDMA Version Two 692 connections (4096 bytes). 694 6.2.
Responder Does Not Support RPC-over-RDMA Version Two 696 If the responder does not support RPC-over-RDMA Version Two, 697 [I-D.ietf-nfsv4-rfc5666bis] REQUIRES that it send an RPC-over-RDMA 698 message to the requester with the same XID, with RDMA2_ERROR in the 699 rdma_proc field, and with the error code RDMA2_ERR_VERS. This 700 message also reports a range of protocol versions that the responder 701 supports. To continue operation, the requester selects a protocol 702 version in the range of responder-supported versions for subsequent 703 messages on this connection. 705 If the connection is lost immediately after the RDMA2_ERROR reply is 706 received, a requester can avoid a possible version negotiation loop 707 when re-establishing another connection by assuming that particular 708 responder does not support RPC-over-RDMA Version Two. A requester 709 can assume the same situation (no responder support for RPC-over-RDMA 710 Version Two) if the initial negotiation message is lost or dropped. 712 Once the negotiation exchange is complete, both peers use the default 713 inline threshold value for the protocol version that will be used for 714 the remainder of the connection lifetime. To permit inline threshold 715 values to change during negotiation of protocol version, RPC-over- 716 RDMA Version Two implementations MUST allow inline threshold values 717 to change without triggering a connection loss. 719 6.3. Requester Does Not Support RPC-over-RDMA Version Two 721 [I-D.ietf-nfsv4-rfc5666bis] REQUIRES that a responder MUST send 722 Replies with the same RPC-over-RDMA protocol version that the 723 requester uses to send its Calls. 725 7. Security Considerations 727 The security considerations for RPC-over-RDMA Version Two are the 728 same as those for RPC-over-RDMA Version One. 730 8. IANA Considerations 732 There are no IANA considerations at this time. 734 9. References 736 9.1. 
Normative References 738 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 739 Requirement Levels", BCP 14, RFC 2119, 740 DOI 10.17487/RFC2119, March 1997, 741 . 743 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 744 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 745 2006, . 747 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 748 Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, 749 May 2009, . 751 9.2. Informative References 753 [I-D.ietf-nfsv4-rfc5666bis] 754 Lever, C., Simpson, W., and T. Talpey, "Remote Direct 755 Memory Access Transport for Remote Procedure Call, Version 756 One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress), 757 May 2016. 759 [I-D.ietf-nfsv4-rpcrdma-bidirection] 760 Lever, C., "Bi-directional Remote Procedure Call On RPC- 761 over-RDMA Transports", draft-ietf-nfsv4-rpcrdma- 762 bidirection-05 (work in progress), June 2016. 764 [IB] InfiniBand Trade Association, "InfiniBand Architecture 765 Specifications", . 767 [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 768 Garcia, "A Remote Direct Memory Access Protocol 769 Specification", RFC 5040, DOI 10.17487/RFC5040, October 770 2007, . 772 [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct 773 Data Placement over Reliable Transports", RFC 5041, 774 DOI 10.17487/RFC5041, October 2007, 775 . 777 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 778 "Network File System (NFS) Version 4 Minor Version 1 779 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 780 . 782 [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 783 "Network File System (NFS) Version 4 Minor Version 1 784 External Data Representation Standard (XDR) Description", 785 RFC 5662, DOI 10.17487/RFC5662, January 2010, 786 . 788 [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access 789 Transport for Remote Procedure Call", RFC 5666, 790 DOI 10.17487/RFC5666, January 2010, 791 . 793 Appendix A. 
Acknowledgments 795 The authors gratefully acknowledge the work of Brent Callaghan and 796 Tom Talpey on the original RPC-over-RDMA Version One specification 797 [RFC5666]. The authors also wish to thank Bill Baker, Greg Marsden, 798 and Matt Benjamin for their support of this work. 800 The extract.sh shell script and formatting conventions were first 801 described by the authors of the NFSv4.1 XDR specification [RFC5662]. 803 Special thanks go to nfsv4 Working Group Chair Spencer Shepler and 804 nfsv4 Working Group Secretary Thomas Haynes for their support. 806 Authors' Addresses 808 Charles Lever (editor) 809 Oracle Corporation 810 1015 Granger Avenue 811 Ann Arbor, MI 48104 812 USA 814 Phone: +1 734 274 2396 815 Email: chuck.lever@oracle.com 816 David Noveck 817 Hewlett Packard Enterprise 818 165 Dascomb Road 819 Andover, MA 01810 820 USA 822 Phone: +1 978 474 2011 823 Email: davenoveck@gmail.com
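
   As a non-normative illustration, the requester-side version
   negotiation of Section 6 might be sketched as follows. This is only
   a sketch under stated assumptions: the FirstReply record, the
   negotiate() helper, and the REQUESTER_VERSIONS tuple are
   hypothetical names that do not appear in this document, and the
   reply is reduced to the few fields the decision depends on. The
   inline threshold values (1024 and 4096 bytes) are the defaults
   given in Section 6 and Section 6.1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Default inline thresholds from Section 6 (Version One) and
# Section 6.1 (Version Two), keyed by protocol version.
DEFAULT_INLINE = {1: 1024, 2: 4096}

# Assumption: this example requester implements Versions One and Two.
REQUESTER_VERSIONS = (1, 2)


@dataclass
class FirstReply:
    """Hypothetical, simplified view of the responder's first reply."""
    xid: int
    is_error: bool = False      # rdma_proc carried RDMA2_ERROR
    is_err_vers: bool = False   # error code was RDMA2_ERR_VERS
    vers_low: int = 0           # lowest version the responder reports
    vers_high: int = 0          # highest version the responder reports


def negotiate(sent_xid: int,
              reply: Optional[FirstReply]) -> Tuple[int, int]:
    """Return (protocol version, default inline threshold in bytes)."""
    if reply is None:
        # Negotiation message lost or dropped: Section 6.2 allows the
        # requester to assume there is no Version Two support.
        return 1, DEFAULT_INLINE[1]
    if reply.xid != sent_xid:
        raise ValueError("reply XID does not match negotiation message")
    if not reply.is_error:
        # Section 6.1: a valid non-error response selects Version Two.
        return 2, DEFAULT_INLINE[2]
    if not reply.is_err_vers:
        raise ValueError("unexpected error during negotiation")
    # Section 6.2: pick a version inside the responder's reported range.
    # Choosing the highest common version is this sketch's own policy.
    usable = [v for v in REQUESTER_VERSIONS
              if reply.vers_low <= v <= reply.vers_high]
    if not usable:
        raise ValueError("no protocol version in common with responder")
    version = max(usable)
    return version, DEFAULT_INLINE[version]
```

   A requester would call negotiate() once per new connection, after
   sending its first message with the value two (2) in the rdma_vers
   field, and would use the returned threshold for the remainder of
   the connection lifetime.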