idnits 2.17.1 draft-cel-nfsv4-rpcrdma-version-two-08.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 653 has weird spacing: '...k_lists rdma_...' == Line 676 has weird spacing: '...k_lists rdma_...' == Line 1142 has weird spacing: '...k_lists rdma_...' == Line 1150 has weird spacing: '...k_lists rdma_...' == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 8, 2018) is 1995 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Obsolete informational reference (is this intentional?): RFC 5661 (Obsoleted by RFC 8881) Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network File System Version 4 C. Lever, Ed. 3 Internet-Draft Oracle 4 Intended status: Standards Track D. Noveck 5 Expires: May 12, 2019 NetApp 6 November 8, 2018 8 RPC-over-RDMA Version 2 Protocol 9 draft-cel-nfsv4-rpcrdma-version-two-08 11 Abstract 13 This document specifies a new version of the transport protocol that 14 conveys Remote Procedure Call (RPC) messages on physical transports 15 capable of Remote Direct Memory Access (RDMA). The new version of 16 this protocol is extensible. 18 Status of This Memo 20 This Internet-Draft is submitted in full conformance with the 21 provisions of BCP 78 and BCP 79. 23 Internet-Drafts are working documents of the Internet Engineering 24 Task Force (IETF). Note that other groups may also distribute 25 working documents as Internet-Drafts. The list of current Internet- 26 Drafts is at https://datatracker.ietf.org/drafts/current/. 28 Internet-Drafts are draft documents valid for a maximum of six months 29 and may be updated, replaced, or obsoleted by other documents at any 30 time. It is inappropriate to use Internet-Drafts as reference 31 material or to cite them other than as "work in progress." 33 This Internet-Draft will expire on May 12, 2019. 35 Copyright Notice 37 Copyright (c) 2018 IETF Trust and the persons identified as the 38 document authors. All rights reserved. 
40 This document is subject to BCP 78 and the IETF Trust's Legal 41 Provisions Relating to IETF Documents 42 (https://trustee.ietf.org/license-info) in effect on the date of 43 publication of this document. Please review these documents 44 carefully, as they describe your rights and restrictions with respect 45 to this document. Code Components extracted from this document must 46 include Simplified BSD License text as described in Section 4.e of 47 the Trust Legal Provisions and are provided without warranty as 48 described in the Simplified BSD License. 50 This document may contain material from IETF Documents or IETF 51 Contributions published or made publicly available before November 52 10, 2008. The person(s) controlling the copyright in some of this 53 material may not have granted the IETF Trust the right to allow 54 modifications of such material outside the IETF Standards Process. 55 Without obtaining an adequate license from the person(s) controlling 56 the copyright in such materials, this document may not be modified 57 outside the IETF Standards Process, and derivative works of it may 58 not be created outside the IETF Standards Process, except to format 59 it for publication as an RFC or to translate it into languages other 60 than English. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 66 3. RPC-over-RDMA Version 2 Headers and Chunks . . . . . . . . . 5 67 3.1. rpcrdma_common: Common Transport Header Prefix . . . . . 5 68 3.2. rpcrdma2_hdr_prefix: Version 2 Transport Header Prefix . 6 69 3.3. rpcrdma2_chunk_lists: Describe External Data Payload . . 7 70 4. Transport Properties . . . . . . . . . . . . . . . . . . . . 8 71 4.1. Transport Properties Model . . . . . . . . . . . . . . . 8 72 4.2. Current Transport Properties . . . . . . . . . . . . . . 10 73 4.2.1. Receive Buffer Size . . . . . . . . . . . . . . . . . 11 74 4.2.2. 
Reverse Request Support . . . . . . . . . . . . . . . 12 75 5. RPC-over-RDMA Version 2 Transport Messages . . . . . . . . . 13 76 5.1. Overall Transport Message Structure . . . . . . . . . . . 13 77 5.2. Transport Header Types . . . . . . . . . . . . . . . . . 13 78 5.3. Header Types Defined in RPC-over-RDMA version 2 . . . . . 14 79 5.3.1. RDMA2_MSG: Convey RPC Message Inline . . . . . . . . 15 80 5.3.2. RDMA2_NOMSG: Convey External RPC Message . . . . . . 15 81 5.3.3. RDMA2_ERROR: Report Transport Error . . . . . . . . . 15 82 5.3.4. RDMA2_CONNPROP: Advertise Transport Properties . . . 18 83 6. XDR Protocol Definition . . . . . . . . . . . . . . . . . . . 19 84 6.1. Code Component License . . . . . . . . . . . . . . . . . 20 85 6.2. Extraction and Use of XDR Definitions . . . . . . . . . . 22 86 6.3. XDR Definition for RPC-over-RDMA Version 2 Core 87 Structures . . . . . . . . . . . . . . . . . . . . . . . 24 88 6.4. XDR Definition for RPC-over-RDMA Version 2 Base Header 89 Types . . . . . . . . . . . . . . . . . . . . . . . . . . 26 90 6.5. Use of the XDR Description Files . . . . . . . . . . . . 27 91 7. Protocol Version Negotiation . . . . . . . . . . . . . . . . 29 92 7.1. Server Does Support RPC-over-RDMA Version 2 . . . . . . . 29 93 7.2. Server Does Not Support RPC-over-RDMA Version 2 . . . . . 29 94 7.3. Client Does Not Support RPC-over-RDMA Version 2 . . . . . 30 95 8. Differences from the RPC-over-RDMA Version 1 Protocol . . . . 30 96 8.1. Transport Properties . . . . . . . . . . . . . . . . . . 30 97 8.2. Credit Management Changes . . . . . . . . . . . . . . . . 30 98 8.3. Inline Threshold Changes . . . . . . . . . . . . . . . . 32 99 8.4. Support for Remote Invalidation . . . . . . . . . . . . . 33 100 8.4.1. Reverse Direction Remote Invalidation . . . . . . . . 33 101 8.5. Error Reporting Changes . . . . . . . . . . . . . . . . . 34 102 9. Extending the Version 2 Protocol . . . . . . . . . . . . . . 34 103 9.1. 
Adding New Header Types to RPC-over-RDMA Version 2 . . . 35 104 9.2. Adding New Transport properties to the Protocol . . . . . 36 105 9.3. Adding New Error Codes to the Protocol . . . . . . . . . 37 106 9.4. Adding New Header Flags to the Protocol . . . . . . . . . 38 107 10. Relationship to other RPC-over-RDMA Versions . . . . . . . . 38 108 10.1. Relationship to RPC-over-RDMA Version 1 . . . . . . . . 38 109 10.2. Extensibility Beyond RPC-over-RDMA Version 2 . . . . . . 40 110 11. Security Considerations . . . . . . . . . . . . . . . . . . . 40 111 11.1. Security Considerations (Transport Properties) . . . . . 40 112 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 41 113 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 41 114 13.1. Normative References . . . . . . . . . . . . . . . . . . 41 115 13.2. Informative References . . . . . . . . . . . . . . . . . 41 116 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 42 117 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 42 119 1. Introduction 121 Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBARCH] is a 122 technique for moving data efficiently between end nodes. By 123 directing data into destination buffers as it is sent on a network 124 and placing it using direct memory access implemented by hardware, 125 the complementary benefits of faster transfers and reduced host 126 overhead are obtained. 128 RPC-over-RDMA version 1 enables ONC RPC [RFC5531] messages to be 129 conveyed on RDMA transports. That protocol is specified in 130 [RFC8166]. RPC-over-RDMA version 1 is deployed and in use, although 131 there are known shortcomings to this protocol: 133 o The protocol's default size of Receive buffers forces the use of 134 RDMA Read and Write transfers for small payloads, and limits the 135 size of reverse direction messages. 137 o It is difficult to make optimizations or protocol fixes that 138 require changes to on-the-wire behavior. 
140 To address these issues in a way that is compatible with existing 141 RPC-over-RDMA version 1 deployments, a new version of the RPC-over- 142 RDMA transport protocol is presented in this document. 144 This new version of RPC-over-RDMA is extensible, enabling OPTIONAL 145 extensions to be added without impacting existing implementations. 147 To enable protocol extension, the XDR definition for RPC-over-RDMA 148 version 2 is organized differently than the definition of version 1. 149 These changes, which are discussed in Section 10.1, do not affect the 150 on-the-wire format. 152 In addition, RPC-over-RDMA version 2 contains a set of incremental 153 changes that relieve certain performance constraints and enable 154 recovery from certain abnormal corner cases. These changes include: 156 o The exchange of transport properties as described in Section 8.1. 158 o A more flexible credit accounting mechanism, detailed in Section TBD. 160 o Larger default inline thresholds as described in Section 8.3. 162 o Support for remote invalidation as explained in Section 8.4. 164 o Support for reverse direction operation, as described in 165 [RFC8167], is now REQUIRED. Details are in Section 3.2. 167 o An expansion of error reporting capabilities, described in 168 Section 5.3.3. A summary of the reasons for this expansion 169 appears in Section 8.5. This expansion supports the addition of 170 new error codes as described in Section 9.3. 172 Because of the way in which RPC-over-RDMA version 2 builds upon the 173 facilities present in RPC-over-RDMA version 1, knowledge of the 174 basic structure of RPC-over-RDMA version 1, as described in 175 [RFC8166], is assumed in this document.
177 As in that document, the terms "RPC Payload Stream" and "Transport 178 Header Stream" (defined in Section 3.2 of that document) are used to 179 distinguish between an RPC message as defined by [RFC5531] and the 180 header whose job it is to describe the RPC message and its associated 181 memory resources. In that regard, the reader is assumed to 182 understand how RDMA is used to transfer chunks between client and 183 server, the use of Position-Zero Read chunks and Reply chunks to 184 convey Long RPC messages, and the role of DDP-eligibility in 185 constraining how data payloads are to be conveyed. 187 2. Requirements Language 189 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 190 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 191 document are to be interpreted as described in [RFC2119] [RFC8174] 192 when, and only when, they appear in all capitals, as shown here. 194 3. RPC-over-RDMA Version 2 Headers and Chunks 196 Most RPC-over-RDMA version 2 data structures are derived from 197 corresponding structures in RPC-over-RDMA version 1. As is typical 198 for new versions of an existing protocol, the XDR data structures 199 have new names and there are a few small changes in content. In some 200 cases, there have been structural re-organizations to enable 201 protocol extensibility. 203 3.1. rpcrdma_common: Common Transport Header Prefix 205 The rpcrdma_common prefix describes the first part of each RPC-over- 206 RDMA transport header for version 2 and subsequent versions. 208 210 struct rpcrdma_common { 211 uint32 rdma_xid; 212 uint32 rdma_vers; 213 uint32 rdma_credit; 214 uint32 rdma_htype; 215 }; 217 219 RPC-over-RDMA version 2's use of these first four words matches that 220 of version 1 as required by [RFC8166].
However, there are important 221 structural differences in the way that these words are described by 222 the respective XDR descriptions: 224 o The header type is represented as a uint32 rather than as an enum 225 that would need to be modified to reflect additions to the set of 226 header types made by later extensions. 228 o The header type field is part of an XDR structure devoted to 229 representing the transport header prefix, rather than being part 230 of a discriminated union that includes the body of each transport 231 header type. 233 o There is now a prefix structure (see Section 3.2) of which the 234 rpcrdma_common structure is the initial segment. This is a newly 235 defined XDR object within the protocol description, in contrast 236 with RPC-over-RDMA version 1, which limits the common portion of 237 all header types to the four words in rpcrdma_common. 239 These changes are part of a larger structural change in the XDR 240 description of RPC-over-RDMA version 2 that enables a cleaner 241 treatment of protocol extension. The XDR appearing in Section 6 242 reflects these changes, which are discussed in further detail in 243 Section 10.1. 245 3.2. rpcrdma2_hdr_prefix: Version 2 Transport Header Prefix 247 The following prefix structure appears at the start of any RPC-over- 248 RDMA version 2 transport header. 250 252 const RPCRDMA2_F_RESPONSE = 0x00000001; 254 struct rpcrdma2_hdr_prefix { 255 struct rpcrdma_common rdma_start; 256 uint32 rdma_flags; 257 }; 259 261 The rdma_flags field is new to RPC-over-RDMA version 2. Currently, the 262 only flag defined within this word is the RPCRDMA2_F_RESPONSE flag. 263 The other bits are reserved for future use as described in 264 Section 9.4. The sender MUST set these to zero. 266 The RPCRDMA2_F_RESPONSE flag qualifies the values contained in the 267 transport header's rdma_start.rdma_xid and rdma_start.rdma_credit 268 fields.
The RPCRDMA2_F_RESPONSE flag enables a receiver to reliably 269 avoid performing an XID lookup on incoming reverse direction Call 270 messages, and to apply the value of the rdma_start.rdma_credit field 271 correctly, based on the direction of the message being conveyed. 273 In general, when a message carries an XID that was generated by the 274 message's receiver (that is, the receiver is acting as a requester), 275 the message's sender sets the RPCRDMA2_F_RESPONSE flag. Otherwise 276 that flag is clear. For example: 278 o When the rdma_start.rdma_htype field has the value RDMA2_MSG or 279 RDMA2_NOMSG, the value of the RPCRDMA2_F_RESPONSE flag MUST be the 280 same as the value of the associated RPC message's msg_type field. 282 o When the header type is anything else and a whole or partial RPC 283 message payload is present, the value of the RPCRDMA2_F_RESPONSE 284 flag MUST be the same as the value of the associated RPC message's 285 msg_type field. 287 o When no RPC message payload is present, a requester MUST set the 288 value of RPCRDMA2_F_RESPONSE to reflect how the receiver is to 289 interpret the rdma_start.rdma_credit and rdma_start.rdma_xid 290 fields. 292 o When the rdma_start.rdma_htype field has the value RDMA2_ERROR, 293 the RPCRDMA2_F_RESPONSE flag MUST be set. 295 3.3. rpcrdma2_chunk_lists: Describe External Data Payload 297 The rpcrdma2_chunk_lists structure specifies how an RPC message is 298 conveyed using explicit RDMA operations. 300 302 struct rpcrdma2_chunk_lists { 303 uint32 rdma_inv_handle; 304 struct rpcrdma2_read_list *rdma_reads; 305 struct rpcrdma2_write_list *rdma_writes; 306 struct rpcrdma2_write_chunk *rdma_reply; 307 }; 309 311 For the most part this structure parallels its RPC-over-RDMA version 312 1 equivalent.
That is, rdma_reads, rdma_writes, and rdma_reply provide, 313 respectively, descriptions of the chunks used to read a Long request 314 or directly placed data from the requester, to write directly placed 315 response data into the requester's memory, and to write a Long reply 316 into the requester's memory. 318 An important addition relative to the corresponding RPC-over-RDMA 319 version 1 rdma_header structures is the rdma_inv_handle field. This 320 field supports remote invalidation of requester memory registrations 321 via the RDMA Send With Invalidate operation. 323 To request remote invalidation, a requester sets the value of the 324 rdma_inv_handle field in an RPC Call's transport header to a non-zero 325 value that matches one of the rdma_handle fields in that header. If 326 none of the rdma_handle values in the header conveying the Call may 327 be invalidated by the responder, the requester sets the RPC Call's 328 rdma_inv_handle field to the value zero. 330 If the responder chooses not to use remote invalidation for this 331 particular RPC Reply, or the RPC Call's rdma_inv_handle field 332 contains the value zero, the responder uses RDMA Send to transmit the 333 matching RPC Reply. 335 If a requester has provided a non-zero value in the RPC Call's 336 rdma_inv_handle field and the responder chooses to use remote 337 invalidation for the matching RPC Reply, the responder uses RDMA Send 338 With Invalidate to transmit that RPC Reply, and uses the value in the 339 corresponding Call's rdma_inv_handle field to construct the Send With 340 Invalidate Work Request. 342 4. Transport Properties 344 RPC-over-RDMA version 2 provides a mechanism for connection endpoints 345 to communicate information about implementation properties, enabling 346 compatible endpoints to optimize data transfer. Initially, only a 347 small set of transport properties is defined and a single operation 348 is provided to exchange transport properties (see Section 5.3.4).
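As an aside on the rdma_inv_handle rules of Section 3.3, the following is a minimal sketch of the two decisions involved, not a definitive implementation. The function names, the string labels for the two Send variants, and the idea of passing a set of handles that may be invalidated are all hypothetical and chosen only for illustration; a real implementation would post RDMA Work Requests through its verbs provider.

```python
def pick_inv_handle(rdma_handles, may_invalidate):
    """Requester side: choose the rdma_inv_handle value for an RPC Call.

    rdma_handles   -- handles that appear in this Call's transport header
    may_invalidate -- hypothetical set of handles the requester allows
                      the responder to invalidate
    Returns a non-zero handle from the header, or zero if none qualifies.
    """
    for handle in rdma_handles:
        if handle != 0 and handle in may_invalidate:
            return handle
    return 0


def choose_reply_send(call_inv_handle, responder_wants_invalidation):
    """Responder side: choose how to transmit the matching RPC Reply.

    Returns a (operation, handle) pair. The operation labels are
    illustrative names, not identifiers from the protocol.
    """
    if call_inv_handle != 0 and responder_wants_invalidation:
        # Send With Invalidate carries the handle from the Call.
        return ("SEND_WITH_INVALIDATE", call_inv_handle)
    # Either the requester sent zero or the responder declined.
    return ("SEND", None)
```

Note that a zero rdma_inv_handle always results in a plain Send, regardless of the responder's preference.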
350 Both the set of transport properties and the operations used to 351 communicate them may be extended. Within RPC-over-RDMA version 2, all 352 such extensions are OPTIONAL. For information about existing 353 transport properties, see Sections 4.1 and 4.2. For discussion 354 of extensions to the set of transport properties, see Section 9.2. 356 4.1. Transport Properties Model 358 A basic set of receiver and sender properties is specified in this 359 document. An extensible approach is used, allowing new properties to 360 be defined in future Standards Track documents. 362 Such properties are specified using: 364 o A code point identifying the particular transport property being 365 specified. 367 o A nominally opaque array which contains within it the XDR encoding 368 of the specific property indicated by the associated code point. 370 The following XDR types are used by operations that deal with 371 transport properties: 373 375 typedef uint32 rpcrdma2_propid; 377 struct rpcrdma2_propval { 378 rpcrdma2_propid rdma_which; 379 opaque rdma_data<>; 380 }; 382 typedef rpcrdma2_propval rpcrdma2_propset<>; 384 typedef uint32 rpcrdma2_propsubset<>; 386 388 An rpcrdma2_propid specifies a particular transport property. In 389 order to facilitate XDR extension of the set of properties by 390 concatenating XDR definition files, specific properties are defined 391 as const values rather than as elements in an enum. 393 An rpcrdma2_propval specifies a value of a particular transport 394 property with the particular property identified by rdma_which, while 395 the associated value of that property is contained within rdma_data. 397 An rdma_data field which is of zero length is interpreted as 398 indicating the default value for the property indicated by rdma_which. 400 While rdma_data is defined as opaque within the XDR, the contents are 401 interpreted (except when of length zero) using the XDR typedef 402 associated with the property specified by rdma_which.
As a result, 403 when rpcrdma2_propval does not conform to that typedef, the receiver 404 is REQUIRED to return the error RDMA2_ERR_BAD_XDR using the header 405 type RDMA2_ERROR as described in Section 5.3.3. For example, the 406 receiver of a message containing a valid rpcrdma2_propval returns 407 this error if the length of rdma_data is such that it extends beyond 408 the bounds of the message being transferred. 410 In cases in which the rpcrdma2_propid specified by rdma_which is 411 understood by the receiver, the receiver also MUST report the error 412 RDMA2_ERR_BAD_XDR if either of the following occur: 414 o The nominally opaque data within rdma_data is not valid when 415 interpreted using the property-associated typedef. 417 o The length of rdma_data is insufficient to contain the data 418 represented by the property-associated typedef. 420 Note that no error is to be reported if rdma_which is unknown to the 421 receiver. In that case, that rpcrdma2_propval is not processed and 422 processing continues using the next rpcrdma2_propval, if any. 424 A rpcrdma2_propset specifies a set of transport properties. No 425 particular ordering of the rpcrdma2_propval items within it is 426 imposed. 428 A rpcrdma2_propsubset identifies a subset of the properties in a 429 previously specified rpcrdma2_propset. Each bit in the mask denotes 430 a particular element in a previously specified rpcrdma2_propset. If 431 a particular rpcrdma2_propval is at position N in the array, then bit 432 number N mod 32 in word N div 32 specifies whether that particular 433 rpcrdma2_propval is included in the defined subset. Words beyond the 434 last one specified are treated as containing zero. 436 4.2. Current Transport Properties 438 Although the set of transport properties may be extended, a basic set 439 of transport properties is defined in Table 1. 
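The rpcrdma2_propsubset bit numbering described in Section 4.1 (bit N mod 32 in word N div 32) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and it assumes that bit number 0 is the least significant bit of each 32-bit word, which the text above does not state explicitly.

```python
def propsubset_from_positions(positions):
    """Build an rpcrdma2_propsubset (list of 32-bit words) from the
    array positions of the selected rpcrdma2_propval elements."""
    words = []
    for n in positions:
        word, bit = divmod(n, 32)      # word N div 32, bit N mod 32
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit        # assumption: bit 0 is the LSB
    return words


def propsubset_contains(words, n):
    """Report whether the element at position n is in the subset."""
    word, bit = divmod(n, 32)
    if word >= len(words):
        # Words beyond the last one specified are treated as zero.
        return False
    return bool((words[word] >> bit) & 1)
```

For instance, selecting positions 0 and 33 yields two words: the first with bit 0 set and the second with bit 1 set.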
441 In Table 1, the columns contain the following information: 443 o The column labeled "Property" identifies the transport property 444 described by the current row. 446 o The column labeled "Code" specifies the rpcrdma2_propid value used 447 to identify this property. 449 o The column labeled "XDR type" gives the XDR type of the data used 450 to communicate the value of this property. This data type 451 overlays the data portion of the nominally opaque field rdma_data 452 in an rpcrdma2_propval. 454 o The column labeled "Default" gives the default value for the 455 property which is to be assumed by those who do not receive, or 456 are unable to interpret, information about the actual value of the 457 property. 459 o The column labeled "Sec" indicates the section within this 460 document that explains the semantics and use of this transport 461 property.
+-------------------------+------+------------------------+-----------------------+---------------+
| Property                | Code | XDR type               | Default               | Sec           |
+-------------------------+------+------------------------+-----------------------+---------------+
| Receive Buffer Size     | 1    | uint32                 | 4096                  | Section 4.2.1 |
| Reverse Request Support | 2    | enum rpcrdma2_rvreqsup | RDMA2_RVREQSUP_INLINE | Section 4.2.2 |
+-------------------------+------+------------------------+-----------------------+---------------+
475 Table 1 477 4.2.1. Receive Buffer Size 479 The Receive Buffer Size specifies the minimum size, in octets, of 480 pre-posted receive buffers. It is the responsibility of the endpoint 481 sending this value to ensure that its pre-posted receive buffers are 482 at least the size specified, allowing the endpoint receiving this 483 value to send messages that are of this size.
485 487 const uint32 RDMA2_PROPID_RBSIZ = 1; 488 typedef uint32 rpcrdma2_prop_rbsiz; 490 492 The sender may use its knowledge of the receiver's buffer size to 493 determine when the message to be sent will fit in the pre-posted 494 receive buffers that the receiver has set up. In particular: 496 o Requesters may use the value to determine when it is necessary to 497 provide a Position-Zero Read chunk when sending a request. 499 o Requesters may use the value to determine when it is necessary to 500 provide a Reply chunk when sending a request, based on the maximum 501 possible size of the reply. 503 o Responders may use the value to determine when it is necessary, 504 given the actual size of the reply, to actually use a Reply chunk 505 provided by the requester. 507 4.2.2. Reverse Request Support 509 The value of this property is used to indicate a client 510 implementation's readiness to accept and process messages that are 511 part of reverse direction RPC requests. 513 515 enum rpcrdma2_rvreqsup { 516 RDMA2_RVREQSUP_NONE = 0, 517 RDMA2_RVREQSUP_INLINE = 1, 518 RDMA2_RVREQSUP_GENL = 2 519 }; 521 const uint32 RDMA2_PROPID_BRS = 2; 522 typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs; 524 526 Multiple levels of support are distinguished: 528 o The value RDMA2_RVREQSUP_NONE indicates that receipt of reverse 529 direction requests and replies is not supported. 531 o The value RDMA2_RVREQSUP_INLINE indicates that receipt of reverse 532 direction requests or replies is only supported using inline 533 messages and that use of explicit RDMA operations or other forms of 534 Direct Data Placement for reverse direction requests or responses 535 is not supported. 537 o The value RDMA2_RVREQSUP_GENL indicates that receipt of reverse direction 538 requests or replies is supported in the same ways that forward 539 direction requests or replies typically are.
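To illustrate how the two properties defined so far might be interpreted by a receiver, here is a minimal sketch that follows the rules of Section 4.1: a zero-length rdma_data selects the default from Table 1, an unknown property is skipped without error, and malformed data is reported as RDMA2_ERR_BAD_XDR. The function name, the BadXdr exception, and the choice to look only at the first four octets of rdma_data are assumptions made for this example; the text does not say how trailing octets are to be treated.

```python
import struct

RDMA2_PROPID_RBSIZ = 1
RDMA2_PROPID_BRS = 2

RDMA2_RVREQSUP_NONE = 0
RDMA2_RVREQSUP_INLINE = 1
RDMA2_RVREQSUP_GENL = 2

# Default values taken from Table 1.
DEFAULTS = {
    RDMA2_PROPID_RBSIZ: 4096,
    RDMA2_PROPID_BRS: RDMA2_RVREQSUP_INLINE,
}


class BadXdr(Exception):
    """Hypothetical stand-in for replying with RDMA2_ERR_BAD_XDR."""


def interpret_propval(rdma_which, rdma_data):
    """Return the property value, or None when the property is unknown."""
    if rdma_which not in DEFAULTS:
        return None                      # unknown property: skip, no error
    if len(rdma_data) == 0:
        return DEFAULTS[rdma_which]      # zero length means the default
    if len(rdma_data) < 4:
        raise BadXdr("rdma_data too short for the property typedef")
    (value,) = struct.unpack(">I", rdma_data[:4])   # XDR is big-endian
    if rdma_which == RDMA2_PROPID_BRS and value not in (
        RDMA2_RVREQSUP_NONE, RDMA2_RVREQSUP_INLINE, RDMA2_RVREQSUP_GENL
    ):
        raise BadXdr("value is not a valid rpcrdma2_rvreqsup")
    return value
```

A caller would apply this to each rpcrdma2_propval in a received rpcrdma2_propset in turn, continuing with the next element whenever None is returned.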
541 When information about this property is not provided, the support 542 level of servers can be inferred from the reverse direction requests 543 that they issue, assuming that issuing a request implicitly indicates 544 support for receiving the corresponding reply. On this basis, 545 support for receiving inline replies can be assumed when requests 546 without Read chunks, Write chunks, or Reply chunks are issued, while 547 requests with any of these elements allow the client to assume that 548 general support for reverse direction replies is present on the 549 server. 551 5. RPC-over-RDMA Version 2 Transport Messages 553 5.1. Overall Transport Message Structure 555 Each transport message consists of multiple sections: 557 o A transport header prefix, as defined in Section 3.2. Among other 558 things, this structure indicates the header type. 560 o The transport header proper, as defined by one of the sub-sections 561 below. See Section 5.2 for the mapping between header types and 562 the corresponding header structure. 564 o Potentially, an RPC message being conveyed as an addendum to the 565 header. 567 This organization differs from that presented in the definition of 568 RPC-over-RDMA version 1 [RFC8166], which presented the first and 569 second of the items above as a single XDR item. The new organization 570 is more in keeping with RPC-over-RDMA version 2's extensibility model 571 in that new header types can be defined without modifying the 572 existing set of header types. 574 5.2. Transport Header Types 576 The new header types within RPC-over-RDMA version 2 are set forth in 577 Table 2. In that table, the columns contain the following 578 information: 580 o The column labeled "Operation" specifies the particular operation. 582 o The column labeled "Code" specifies the value of header type for 583 this operation. 585 o The column labeled "XDR type" gives the XDR type of the data 586 structure used to describe the information in this new message 587 type. 
This data immediately follows the universal portion of the 588 transport header present in every RPC-over-RDMA transport header. 590 o The column labeled "Msg" indicates whether this operation is 591 followed (or not) by an RPC message payload. 593 o The column labeled "Sec" indicates the section (within this 594 document) that explains the semantics and use of this operation.
+----------------------------------+------+-------------------+-----+---------------+
| Operation                        | Code | XDR type          | Msg | Sec           |
+----------------------------------+------+-------------------+-----+---------------+
| Convey Appended RPC Message      | 0    | rpcrdma2_msg      | Yes | Section 5.3.1 |
| Convey External RPC Message      | 1    | rpcrdma2_nomsg    | No  | Section 5.3.2 |
| Report Transport Error           | 4    | rpcrdma2_err      | No  | Section 5.3.3 |
| Specify Properties at Connection | 5    | rpcrdma2_connprop | No  | Section 5.3.4 |
+----------------------------------+------+-------------------+-----+---------------+
609 Table 2 611 Support for the operations in Table 2 is REQUIRED. Support for 612 additional operations will be OPTIONAL. RPC-over-RDMA version 2 613 implementations that receive an OPTIONAL operation that is not 614 supported MUST respond with an RDMA2_ERROR message with an error code 615 of RDMA2_ERR_INVAL_HTYPE. 617 5.3. Header Types Defined in RPC-over-RDMA version 2 619 The header types defined and used in RPC-over-RDMA version 1 are all 620 carried over into RPC-over-RDMA version 2, although there may be 621 limited changes in the definition of existing header types. 623 In comparison with the header types of RPC-over-RDMA version 1, the 624 changes can be summarized as follows: 626 o To simplify interoperability with RPC-over-RDMA version 1, only 627 the RDMA2_ERROR header (defined in Section 5.3.3) has an XDR 628 definition that differs from that in RPC-over-RDMA version 1, and 629 its modifications are all compatible extensions.
631 o RDMA2_MSG and RDMA2_NOMSG (defined in Sections 5.3.1 and 632 5.3.2) have XDR definitions that match the corresponding 633 RPC-over-RDMA version 1 header types. However, because of the 634 changes to the header prefix, the version 1 and version 2 header 635 types differ in on-the-wire format. 637 o RDMA2_CONNPROP (defined in Section 5.3.4) is a completely new 638 header type devoted to enabling connection peers to exchange 639 information about their transport properties. 641 5.3.1. RDMA2_MSG: Convey RPC Message Inline 643 RDMA2_MSG is used to convey an RPC message that immediately follows 644 the Transport Header in the Send buffer. This is either an RPC 645 request that has no Position-Zero Read chunk or an RPC reply that is 646 not sent using a Reply chunk. 648 650 const rpcrdma2_proc RDMA2_MSG = 0; 652 struct rpcrdma2_msg { 653 struct rpcrdma2_chunk_lists rdma_chunks; 655 /* The RPC message starts here and continues 656 * through the end of the transmission. */ 657 uint32 rdma_rpc_first_word; 658 }; 660 662 5.3.2. RDMA2_NOMSG: Convey External RPC Message 664 RDMA2_NOMSG is used to convey an entire RPC message using explicit 665 RDMA operations. Usually this is because the RPC message does not 666 fit within the size limits that result from the receiver's inline 667 threshold. The message may be a Long request, which is read from a 668 memory area specified by a Position-Zero Read chunk; or a Long reply, 669 which is written into a memory area specified by a Reply chunk. 671 673 const rpcrdma2_proc RDMA2_NOMSG = 1; 675 struct rpcrdma2_nomsg { 676 struct rpcrdma2_chunk_lists rdma_chunks; 677 }; 679 681 5.3.3. RDMA2_ERROR: Report Transport Error 683 RDMA2_ERROR provides a way of reporting the occurrence of transport 684 errors on a previous transmission. This header type MUST NOT be 685 transmitted by a requester.
[ cel: how is the XID field set when 686 sending an error report from a requester, or when the error occurred 687 on a non-RPC message? ] 688 690 const rpcrdma2_proc RDMA2_ERROR = 4; 692 struct rpcrdma2_err_vers { 693 uint32 rdma_vers_low; 694 uint32 rdma_vers_high; 695 }; 697 struct rpcrdma2_err_write { 698 uint32 rdma_chunk_index; 699 uint32 rdma_length_needed; 700 }; 702 union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) { 703 case RDMA2_ERR_VERS: 704 rpcrdma2_err_vers rdma_vrange; 705 case RDMA2_ERR_READ_CHUNKS: 706 uint32 rdma_max_chunks; 707 case RDMA2_ERR_WRITE_CHUNKS: 708 uint32 rdma_max_chunks; 709 case RDMA2_ERR_SEGMENTS: 710 uint32 rdma_max_segments; 711 case RDMA2_ERR_WRITE_RESOURCE: 712 rpcrdma2_err_write rdma_writeres; 713 case RDMA2_ERR_REPLY_RESOURCE: 714 uint32 rdma_length_needed; 715 default: 716 void; 717 }; 719 721 Error reporting is addressed in RPC-over-RDMA version 2 in a fashion 722 similar to RPC-over-RDMA version 1. Several new error codes are introduced, and 723 error messages never flow from requester to responder. RPC-over-RDMA 724 version 1 error reporting is described in Section 5 of [RFC8166]. 726 In all cases below, the responder copies the values of the 727 rdma_start.rdma_xid and rdma_start.rdma_vers fields from the incoming 728 transport header that generated the error to the transport header of the 729 error response. The responder sets the rdma_start.rdma_htype field 730 of the transport header prefix to RDMA2_ERROR, and the 731 rdma_start.rdma_credit field is set to the credit grant value for 732 this connection. The receiver of this header type MUST ignore the 733 value of the rdma_start.rdma_credit field. 735 RDMA2_ERR_VERS 736 This is the equivalent of ERR_VERS in RPC-over-RDMA version 1. 737 The error code value, semantics, and utilization are the same.
739 RDMA2_ERR_INVAL_HTYPE 740 If a responder recognizes the value in the rdma_start.rdma_vers 741 field, but it does not recognize the value in the 742 rdma_start.rdma_htype field or does not support that header type, 743 it MUST set the rdma_err field to RDMA2_ERR_INVAL_HTYPE. 745 RDMA2_ERR_BAD_XDR 746 If a responder recognizes the values in the rdma_start.rdma_vers 747 and rdma_start.rdma_htype fields, but the incoming RPC-over-RDMA 748 transport header cannot be parsed, it MUST set the rdma_err field 749 to RDMA2_ERR_BAD_XDR. This includes cases in which a nominally 750 opaque property value field cannot be parsed using the XDR typedef 751 associated with the transport property definition. The error code 752 value of RDMA2_ERR_BAD_XDR is the same as the error code value of 753 ERR_CHUNK in RPC-over-RDMA version 1. The responder MUST NOT 754 process the request in any way except to send an error message. 756 RDMA2_ERR_READ_CHUNKS 757 If a requester presents more DDP-eligible arguments than the 758 responder is prepared to Read, the responder MUST set the rdma_err 759 field to RDMA2_ERR_READ_CHUNKS, and set the rdma_max_chunks field 760 to the maximum number of Read chunks the responder can receive and 761 process. 762 If the responder implementation cannot handle any Read chunks for 763 a request, it MUST set the rdma_max_chunks field to zero in this 764 response. The requester SHOULD resend the request using a 765 Position-Zero Read chunk. If this was a request using a Position- 766 Zero Read chunk, the requester MUST terminate the transaction with 767 an error. 769 RDMA2_ERR_WRITE_CHUNKS 770 If a requester has constructed an RPC Call message with more DDP- 771 eligible results than the server is prepared to Write, the 772 responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS, 773 and set the rdma_max_chunks field to the maximum number of Write 774 chunks the responder can process and return.
775 If the responder implementation cannot handle any Write chunks for 776 a request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE 777 (below). The requester SHOULD resend the request with no Write 778 chunks and a Reply chunk of appropriate size. 780 RDMA2_ERR_SEGMENTS 781 If a requester has constructed an RPC Call message with a chunk 782 that contains more segments than the responder supports, the 783 responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS, and 784 set the rdma_max_segments field to the maximum number of segments 785 the responder can process. 787 RDMA2_ERR_WRITE_RESOURCE 788 If a requester has provided a Write chunk that is not large enough 789 to fully convey a DDP-eligible result, the responder MUST set the 790 rdma_err field to RDMA2_ERR_WRITE_RESOURCE. 792 The responder MUST set the rdma_chunk_index field to point to the 793 first Write chunk in the transport header that is too short, or to 794 zero to indicate that it was not possible to determine which chunk 795 is too small. Indexing starts at one (1), which represents the 796 first Write chunk. The responder MUST set the rdma_length_needed 797 field to the number of bytes needed in that chunk in order to convey the 798 result data item. 800 Upon receipt of this error code, a requester MAY choose to 801 terminate the operation (for instance, if the responder set the 802 index and length fields to zero), or it MAY send the request again 803 using the same XID and more reply resources. 805 RDMA2_ERR_REPLY_RESOURCE 806 If an RPC Reply's Payload stream does not fit inline and the 807 requester has not provided a large enough Reply chunk to convey 808 the stream, the responder MUST set the rdma_err field to 809 RDMA2_ERR_REPLY_RESOURCE. The responder MUST set the 810 rdma_length_needed field to the number of Reply chunk bytes needed to 811 convey the reply.
813 Upon receipt of this error code, a requester MAY choose to 814 terminate the operation (for instance, if the responder set the 815 length field to zero), or it MAY send the request again 816 using the same XID and larger reply resources. 818 RDMA2_ERR_SYSTEM 819 If some problem occurs on a responder that does not fit into the 820 above categories, the responder MAY report it to the sender by 821 setting the rdma_err field to RDMA2_ERR_SYSTEM. 823 This is a permanent error: a requester that receives this error 824 MUST terminate the RPC transaction associated with the XID value 825 in the rdma_start.rdma_xid field. 827 5.3.4. RDMA2_CONNPROP: Advertise Transport Properties 829 The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint, 830 whether client or server, to indicate to its partner relevant 831 transport properties that the partner might need to be aware of. 833 The message definition for this operation is as follows: 835 837 struct rpcrdma2_connprop { 838 rpcrdma2_propset rdma_props; 839 }; 841 843 All relevant transport properties that the sender is aware of should 844 be included in rdma_props. Since support of each of the properties 845 is OPTIONAL, the sender cannot assume that the receiver will 846 necessarily take note of these properties. The sender should be 847 prepared for cases in which the receiver continues to assume that the 848 default value for a particular property is still in effect. 850 Generally, a participant will send an RDMA2_CONNPROP message as the 851 first message after a connection is established. Given that fact, 852 the sender should make sure that the message can be received by peers 853 who use the default Receive Buffer Size. The connection's initial 854 receive buffer size is typically 1KB, but it depends on the initial 855 connection state of the RPC-over-RDMA version in use.
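The size constraint on an initial RDMA2_CONNPROP message can be checked before sending. The following Python sketch is illustrative only and is not part of the protocol definition. It assumes that the header prefix is encoded as five XDR unsigned integers (rpcrdma2_hdr_prefix), that the property value is carried as an XDR variable-length opaque, and that the initial receive buffer is 1024 bytes. The helper names xdr_opaque and encode_connprop, and the XID, credit, and flags values, are hypothetical.

```python
import struct

RDMA2_CONNPROP = 5          # header type code from Table 2
RDMA2_PROPID_RBSIZ = 1      # propid value from the XDR in Section 6.3
INITIAL_RECV_BUFFER = 1024  # assumed initial receive buffer size, in bytes


def xdr_opaque(data: bytes) -> bytes:
    """Encode a variable-length XDR opaque: length, data, zero pad to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_connprop(xid: int, credit: int, props: dict) -> bytes:
    """Encode a header prefix followed by an rpcrdma2_connprop body.

    props maps a propid to its already XDR-encoded value (hypothetical API).
    """
    # rpcrdma2_hdr_prefix: rdma_xid, rdma_vers, rdma_credit, rdma_htype,
    # rdma_flags (flags assumed to be zero here).
    prefix = struct.pack(">5I", xid, 2, credit, RDMA2_CONNPROP, 0)
    # rpcrdma2_propset is a variable-length array of rpcrdma2_propval.
    body = struct.pack(">I", len(props))
    for propid, value in props.items():
        body += struct.pack(">I", propid) + xdr_opaque(value)
    return prefix + body


# Advertise a 4096-byte receive buffer size (rpcrdma2_prop_rbsiz is a uint32).
msg = encode_connprop(
    xid=0x1234,
    credit=32,
    props={RDMA2_PROPID_RBSIZ: struct.pack(">I", 4096)},
)
fits = len(msg) <= INITIAL_RECV_BUFFER
print(len(msg), fits)
```

A sender advertising many properties could use the same check to decide whether the property set has to be trimmed so that a peer using the default buffer size can still receive it.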
857 Properties not included in rdma_props are to be treated by the peer 858 endpoint as having the default value and are not allowed to change 859 subsequently. The peer should not request changes in such 860 properties. 862 Those receiving an RDMA2_CONNPROP may encounter properties that they 863 do not support or are unaware of. In such cases, these properties 864 are simply ignored without any error response being generated. 866 6. XDR Protocol Definition 868 This section contains a description of the core features of the RPC- 869 over-RDMA version 2 protocol expressed in the XDR language [RFC4506]. 871 Because of the need to provide for protocol extensibility without 872 modifying an existing XDR definition, this description has some 873 important structural differences from the corresponding XDR 874 description for RPC-over-RDMA version 1, which appears in [RFC8166]. 876 This description is divided into three parts: 878 o A code component license which appears in Section 6.1. 880 o An XDR description of the structures that are generally available 881 for use by transport header types including both those defined in 882 this document and those that may be defined as extensions. This 883 includes definitions of the chunk-related structures derived from 884 RPC-over-RDMA version 1, the transport property model introduced 885 in this document, and a definition of the transport header 886 prefixes that precede the various transport header types. This 887 appears in Section 6.3. 889 o An XDR description of the transport header types defined in this 890 document, including those derived from RPC-over-RDMA version 1 and 891 those introduced in RPC-over-RDMA version 2. This appears in 892 Section 6.4. 894 This description is provided in a way that makes it simple to extract 895 into ready-to-compile form. 
To enable the combination of this 896 description with the descriptions of subsequent extensions to RPC- 897 over-RDMA version 2, the extracted description can be combined with 898 similar descriptions published later, or those descriptions can be 899 compiled separately. Refer to Section 6.2 for details. 901 6.1. Code Component License 903 Code components extracted from this document must include the 904 following license text. When the extracted XDR code is combined with 905 other complementary XDR code which itself has an identical license, 906 only a single copy of the license text need be preserved. 908 910 /// /* 911 /// * Copyright (c) 2010-2018 IETF Trust and the persons 912 /// * identified as authors of the code. All rights reserved. 913 /// * 914 /// * The authors of the code are: 915 /// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck. 916 /// * 917 /// * Redistribution and use in source and binary forms, with 918 /// * or without modification, are permitted provided that the 919 /// * following conditions are met: 920 /// * 921 /// * - Redistributions of source code must retain the above 922 /// * copyright notice, this list of conditions and the 923 /// * following disclaimer. 924 /// * 925 /// * - Redistributions in binary form must reproduce the above 926 /// * copyright notice, this list of conditions and the 927 /// * following disclaimer in the documentation and/or other 928 /// * materials provided with the distribution. 929 /// * 930 /// * - Neither the name of Internet Society, IETF or IETF 931 /// * Trust, nor the names of specific contributors, may be 932 /// * used to endorse or promote products derived from this 933 /// * software without specific prior written permission. 
934 /// * 935 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 936 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 937 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 938 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 939 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 940 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 941 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 942 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 943 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 944 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 945 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 946 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 947 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 948 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 949 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 950 /// */ 951 /// 953 955 6.2. Extraction and Use of XDR Definitions 957 The reader can apply the following sed script to this document to 958 produce a machine-readable XDR description of the RPC-over-RDMA 959 version 2 protocol without any OPTIONAL extensions. 961 963 sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' 965 967 That is, if this document is in a file called "spec.txt" then the 968 reader can do the following to extract an XDR description file and 969 store it in the file rpcrdma-v2.x. 971 973 sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \ 974 < spec.txt > rpcrdma-v2.x 976 978 Although this file is a usable description of the base protocol, when 979 extensions are to be supported, it may be desirable to divide it into 980 multiple files.
The following script can be used for that purpose: 982
#!/usr/local/bin/perl
# Split rpcrdma-v2.x into one file per "FILE ENDS: <name>;" marker.
open(IN, "<", "rpcrdma-v2.x") or die "rpcrdma-v2.x: $!";
open(OUT, ">", "temp.x") or die "temp.x: $!";
while (<IN>)
{
  # Capture only the file name, not the "; */" that follows it.
  if (m/FILE ENDS: ([^;\s]+);/)
  {
    close(OUT);
    rename("temp.x", $1);
    open(OUT, ">", "temp.x") or die "temp.x: $!";
  }
  else
  {
    print OUT $_;
  }
}
close(IN);
close(OUT);
1003 1005 Running the above script will result in two files: 1007 o The file common.x, containing the license plus the common XDR 1008 definitions which need to be made available to both the base 1009 operations and any subsequent extensions. 1011 o The file baseops.x containing the XDR definitions for the base 1012 operations, defined in this document. 1014 Optional extensions to RPC-over-RDMA version 2, published as 1015 Standards Track documents, will have similar means of providing XDR 1016 that describes those extensions. Once XDR for all desired extensions 1017 is also extracted, it can be appended to the XDR description file 1018 extracted from this document to produce a consolidated XDR 1019 description file reflecting all extensions selected for an RPC-over- 1020 RDMA implementation. 1022 Alternatively, the XDR descriptions can be compiled separately. In 1023 this case the combination of common.x and baseops.x serves to define 1024 the base transport, while the XDR description for each extension consists of 1025 the XDR from the document defining that extension, together with the 1026 file common.x, obtained from this document. 1028 6.3.
XDR Definition for RPC-over-RDMA Version 2 Core Structures 1030 1031 /// /******************************************************************* 1032 /// * Transport Header Prefixes 1033 /// ******************************************************************/ 1034 /// 1035 /// struct rpcrdma_common { 1036 /// uint32 rdma_xid; 1037 /// uint32 rdma_vers; 1038 /// uint32 rdma_credit; 1039 /// uint32 rdma_htype; 1040 /// }; 1041 /// 1042 /// const RPCRDMA2_F_RESPONSE = 0x00000001; 1043 /// 1044 /// struct rpcrdma2_hdr_prefix { 1045 /// struct rpcrdma_common rdma_start; 1046 /// uint32 rdma_flags; 1047 /// }; 1048 /// 1049 /// /******************************************************************* 1050 /// * Chunks and Chunk Lists 1051 /// ******************************************************************/ 1052 /// 1053 /// struct rpcrdma2_segment { 1054 /// uint32 rdma_handle; 1055 /// uint32 rdma_length; 1056 /// uint64 rdma_offset; 1057 /// }; 1058 /// 1059 /// struct rpcrdma2_read_segment { 1060 /// uint32 rdma_position; 1061 /// struct rpcrdma2_segment rdma_target; 1062 /// }; 1063 /// 1064 /// struct rpcrdma2_read_list { 1065 /// struct rpcrdma2_read_segment rdma_entry; 1066 /// struct rpcrdma2_read_list *rdma_next; 1067 /// }; 1068 /// 1069 /// struct rpcrdma2_write_chunk { 1070 /// struct rpcrdma2_segment rdma_target<>; 1071 /// }; 1072 /// 1073 /// struct rpcrdma2_write_list { 1074 /// struct rpcrdma2_write_chunk rdma_entry; 1075 /// struct rpcrdma2_write_list *rdma_next; 1076 /// }; 1077 /// 1078 /// struct rpcrdma2_chunk_lists { 1079 /// uint32 rdma_inv_handle; 1080 /// struct rpcrdma2_read_list *rdma_reads; 1081 /// struct rpcrdma2_write_list *rdma_writes; 1082 /// struct rpcrdma2_write_chunk *rdma_reply; 1083 /// }; 1084 /// 1085 /// /******************************************************************* 1086 /// * Transport Properties 1087 /// ******************************************************************/ 1088 /// 1089 /// /* 1090 /// * Types for transport
properties model 1091 /// */ 1092 /// typedef uint32 rpcrdma2_propid; 1093 /// 1094 /// struct rpcrdma2_propval { 1095 /// rpcrdma2_propid rdma_which; 1096 /// opaque rdma_data<>; 1097 /// }; 1098 /// 1099 /// typedef rpcrdma2_propval rpcrdma2_propset<>; 1100 /// typedef uint32 rpcrdma2_propsubset<>; 1101 /// 1102 /// /* 1103 /// * Transport propid values for basic properties 1104 /// */ 1105 /// const uint32 RDMA2_PROPID_RBSIZ = 1; 1106 /// const uint32 RDMA2_PROPID_BRS = 2; 1107 /// 1108 /// /* 1109 /// * Types specific to particular properties 1110 /// */ 1111 /// typedef uint32 rpcrdma2_prop_rbsiz; 1112 /// typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs; 1113 /// 1114 /// enum rpcrdma2_rvreqsup { 1115 /// RDMA2_RVREQSUP_NONE = 0, 1116 /// RDMA2_RVREQSUP_INLINE = 1, 1117 /// RDMA2_RVREQSUP_GENL = 2 1118 /// }; 1119 /// 1120 /// /* FILE ENDS: common.x; */ 1122 1123 6.4. XDR Definition for RPC-over-RDMA Version 2 Base Header Types 1125 1126 /// /******************************************************************* 1127 /// * Descriptions of RPC-over-RDMA Header Types 1128 /// ******************************************************************/ 1129 /// 1130 /// /* 1131 /// * Header Type Codes. 1132 /// */ 1133 /// const rpcrdma2_proc RDMA2_MSG = 0; 1134 /// const rpcrdma2_proc RDMA2_NOMSG = 1; 1135 /// const rpcrdma2_proc RDMA2_ERROR = 4; 1136 /// const rpcrdma2_proc RDMA2_CONNPROP = 5; 1137 /// 1138 /// /* 1139 /// * Header Types to Convey RPC Messages. 1140 /// */ 1141 /// struct rpcrdma2_msg { 1142 /// struct rpcrdma2_chunk_lists rdma_chunks; 1143 /// 1144 /// /* The rpc message starts here and continues 1145 /// * through the end of the transmission. */ 1146 /// uint32 rdma_rpc_first_word; 1147 /// }; 1148 /// 1149 /// struct rpcrdma2_nomsg { 1150 /// struct rpcrdma2_chunk_lists rdma_chunks; 1151 /// }; 1152 /// 1153 /// /* 1154 /// * Header Type to Report Errors.
1155 /// */ 1156 /// const uint32 RDMA2_ERR_VERS = 1; 1157 /// const uint32 RDMA2_ERR_BAD_XDR = 2; 1158 /// const uint32 RDMA2_ERR_INVAL_HTYPE = 3; 1159 /// const uint32 RDMA2_ERR_READ_CHUNKS = 4; 1160 /// const uint32 RDMA2_ERR_WRITE_CHUNKS = 5; 1161 /// const uint32 RDMA2_ERR_SEGMENTS = 6; 1162 /// const uint32 RDMA2_ERR_WRITE_RESOURCE = 7; 1163 /// const uint32 RDMA2_ERR_REPLY_RESOURCE = 8; 1164 /// const uint32 RDMA2_ERR_SYSTEM = 9; 1165 /// 1166 /// struct rpcrdma2_err_vers { 1167 /// uint32 rdma_vers_low; 1168 /// uint32 rdma_vers_high; 1169 /// }; 1170 /// 1171 /// struct rpcrdma2_err_write { 1172 /// uint32 rdma_chunk_index; 1173 /// uint32 rdma_length_needed; 1174 /// }; 1175 /// 1176 /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) { 1177 /// case RDMA2_ERR_VERS: 1178 /// rpcrdma2_err_vers rdma_vrange; 1179 /// case RDMA2_ERR_READ_CHUNKS: 1180 /// uint32 rdma_max_chunks; 1181 /// case RDMA2_ERR_WRITE_CHUNKS: 1182 /// uint32 rdma_max_chunks; 1183 /// case RDMA2_ERR_SEGMENTS: 1184 /// uint32 rdma_max_segments; 1185 /// case RDMA2_ERR_WRITE_RESOURCE: 1186 /// rpcrdma2_err_write rdma_writeres; 1187 /// case RDMA2_ERR_REPLY_RESOURCE: 1188 /// uint32 rdma_length_needed; 1189 /// default: 1190 /// void; 1191 /// }; 1192 /// 1193 /// /* 1194 /// * Header Type to Exchange Transport Properties. 1195 /// */ 1196 /// struct rpcrdma2_connprop { 1197 /// rpcrdma2_propset rdma_props; 1198 /// }; 1199 /// 1200 /// /* FILE ENDS: baseops.x; */ 1202 1204 6.5. Use of the XDR Description Files 1206 The two files common.x and baseops.x, when combined with the XDR 1207 descriptions for extensions defined later, produce a human-readable 1208 and compilable description of the RPC-over-RDMA version 2 protocol 1209 with the included extensions.
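As an illustration of how code built from these descriptions might be driven, the following Python sketch decodes the header prefix by hand, uses the header type codes from Table 2 to select the XDR type that follows, and decodes the rpcrdma2_error union from Section 6.4. It is a sketch under stated assumptions, not generated rpcgen output: the function names decode_prefix, decode_error, and classify are hypothetical, and it assumes that the body of an RDMA2_ERROR header is the rpcrdma2_error union.

```python
import struct

# Header type codes from Table 2, mapped to the XDR type that follows the prefix.
HEADER_TYPES = {
    0: "rpcrdma2_msg",
    1: "rpcrdma2_nomsg",
    4: "rpcrdma2_err",
    5: "rpcrdma2_connprop",
}

# Number of uint32 words in each non-void arm of the rpcrdma2_error union,
# keyed by the RDMA2_ERR_* code values from Section 6.4.
ERROR_ARM_WORDS = {1: 2, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1}

PREFIX_LEN = 20  # five uint32 fields in rpcrdma2_hdr_prefix


def decode_prefix(buf: bytes):
    """Split a transport header into its decoded prefix and the remaining bytes."""
    xid, vers, credit, htype, flags = struct.unpack_from(">5I", buf, 0)
    prefix = {"xid": xid, "vers": vers, "credit": credit,
              "htype": htype, "flags": flags}
    return prefix, buf[PREFIX_LEN:]


def decode_error(body: bytes):
    """Decode an rpcrdma2_error union into (code, tuple of arm words)."""
    (code,) = struct.unpack_from(">I", body, 0)
    nwords = ERROR_ARM_WORDS.get(code, 0)  # the default arm is void
    words = struct.unpack_from(">%dI" % nwords, body, 4) if nwords else ()
    return code, words


def classify(buf: bytes):
    """Select the decoder for the header type named in the prefix."""
    prefix, body = decode_prefix(buf)
    xdr_type = HEADER_TYPES.get(prefix["htype"])
    if xdr_type is None:
        # A responder would answer this with RDMA2_ERR_INVAL_HTYPE.
        return prefix, None, None
    if xdr_type == "rpcrdma2_err":
        return prefix, xdr_type, decode_error(body)
    return prefix, xdr_type, body  # left for a type-specific decoder


# Example: an RDMA2_ERROR header carrying RDMA2_ERR_VERS with the range 1..1.
sample = struct.pack(">5I", 7, 2, 16, 4, 1) + struct.pack(">3I", 1, 1, 1)
prefix, xdr_type, decoded = classify(sample)
print(xdr_type, decoded)
```

Keeping the code-to-type mapping in a table mirrors the way this document defines it, so an extension could add an entry without touching the prefix decoder.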
1211 Although this XDR description can be useful in generating code to 1212 encode and decode the transport and payload streams, there are 1213 elements of the structure of RPC-over-RDMA version 2 which are not 1214 expressible within the XDR language as currently defined. This 1215 requires implementations that use the output of the XDR processor to 1216 provide additional code to bridge the gaps. 1218 o The values of transport properties are represented within XDR as 1219 opaque values. However, the actual structures of each of the 1220 properties are represented by XDR typedefs, with the selection of 1221 the appropriate typedef described by text in this document. The 1222 determination of the appropriate typedef is not specified by XDR, 1223 which does not possess the facilities necessary for that 1224 determination to be specified in an extensible way. 1226 This is similar to the way in which NFSv4 attributes are handled 1227 [RFC7530] [RFC5661]. As in that case, implementations that need 1228 to encode and decode these nominally opaque entities need to use 1229 the protocol description to determine the actual XDR 1230 representation that underlies the items described as opaque. 1232 o The transport stream is not represented as a single XDR object. 1233 Instead, the header prefix is described by one XDR object while 1234 the rest of the header is described as another XDR object with the 1235 mapping between the header type in the header prefix and the XDR 1236 object representing the header type represented by tables 1237 contained in this document, with additional mappings being 1238 specifiable by a later extension document. 1240 This situation is similar to that in which RPC message headers 1241 contain program and procedure numbers, so that the XDR for those 1242 requests and replies can be used to encode and decode the 1243 associated messages without requiring that all be present in a 1244 single XDR specification.
As in that case, implementations need 1245 to use the header specification to select the appropriate XDR- 1246 generated code to be used in message processing. 1248 o The relationship between the transport stream and the payload 1249 stream is not specified in the XDR itself, although comments 1250 within the XDR text make clear where transported messages, 1251 described by their own XDR, need to appear. Such data by its 1252 nature is opaque to the transport, although its form differs from XDR 1253 opaque arrays. 1255 Potential extensions allowing continuation of RPC messages across 1256 transport message boundaries will require that message assembly 1257 facilities, not specifiable within XDR, also be part of transport 1258 implementations. 1260 To summarize, the role of XDR in this specification is more limited 1261 than for protocols which are themselves XDR programs, where the 1262 totality of the protocol is expressible within the XDR paradigm 1263 established for that purpose. This more limited role reflects the 1264 fact that XDR lacks facilities to represent the embedding of 1265 transported material within the transport framework. In addition, 1266 the need to cleanly accommodate extensions has meant that those using 1267 rpcgen in their applications need to take a more active role in 1268 providing the facilities that cannot be expressed within XDR. 1270 7. Protocol Version Negotiation 1272 When an RPC-over-RDMA version 2 client establishes a connection to a 1273 server, its first order of business is to determine the server's 1274 highest supported protocol version. 1276 As with RPC-over-RDMA version 1, upon connection establishment a 1277 client MUST NOT send more than a single RPC-over-RDMA message at a 1278 time until it receives a valid non-error RPC-over-RDMA message from 1279 the server that grants client credits. 1281 The second word of each transport header is used to convey the 1282 transport protocol version.
In the interest of simplicity, we refer 1283 to that word as rdma_vers even though in the RPC-over-RDMA version 2 1284 XDR definition it is described as rdma_start.rdma_vers. 1286 First, the client sends a single valid RPC-over-RDMA message with the 1287 value two (2) in the rdma_vers field. Because the server might 1288 support only RPC-over-RDMA version 1, this initial message can be no 1289 larger than the version 1 default inline threshold of 1024 bytes. 1291 7.1. Server Does Support RPC-over-RDMA Version 2 1293 If the server does support RPC-over-RDMA version 2, it sends RPC- 1294 over-RDMA messages back to the client with the value two (2) in the 1295 rdma_vers field. Both peers may use the default inline threshold 1296 value for RPC-over-RDMA version 2 connections (4096 bytes). 1298 7.2. Server Does Not Support RPC-over-RDMA Version 2 1300 If the server does not support RPC-over-RDMA version 2, it MUST send 1301 an RPC-over-RDMA message to the client with the same XID, with 1302 RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error 1303 code RDMA2_ERR_VERS. This message also reports a range of protocol 1304 versions that the server supports. To continue operation, the client 1305 selects a protocol version in the range of server-supported versions 1306 for subsequent messages on this connection. 1308 If the connection is lost immediately after an RDMA2_ERROR / 1309 RDMA2_ERR_VERS message is received, a client can avoid a possible 1310 version negotiation loop when re-establishing another connection by 1311 assuming that particular server does not support RPC-over-RDMA 1312 version 2. A client can assume the same situation (no server support 1313 for RPC-over-RDMA version 2) if the initial negotiation message is 1314 lost or dropped. Once the negotiation exchange is complete, both 1315 peers may use the default inline threshold value for the transport 1316 protocol version that has been selected. 1318 7.3. 
Client Does Not Support RPC-over-RDMA Version 2 1320 If the server supports the RPC-over-RDMA protocol version used in 1321 Call messages from a client, it MUST send Replies with the same RPC- 1322 over-RDMA protocol version that the client uses to send its Calls. 1323 The client MUST NOT change the version for the duration of the 1324 connection. 1326 8. Differences from the RPC-over-RDMA Version 1 Protocol 1328 This section describes the substantive changes made in RPC-over-RDMA 1329 version 2, as opposed to the structural changes to enable 1330 extensibility, which are discussed in Section 10.1. 1332 8.1. Transport Properties 1334 RPC-over-RDMA version 2 provides a mechanism for exchanging the 1335 transport's operational properties. This mechanism allows connection 1336 endpoints to communicate the properties of their implementation at 1337 connection setup. The mechanism could be expanded to enable an 1338 endpoint to request changes in properties of the other endpoint and 1339 to notify peer endpoints of changes to properties that occur during 1340 operation. Transport properties are described in Section 4. 1342 8.2. Credit Management Changes 1344 RPC-over-RDMA transports employ credit-based flow control to ensure 1345 that a requester does not emit more RDMA Sends than the responder is 1346 prepared to receive. Section 3.3.1 of [RFC8166] explains the purpose 1347 and operation of RPC-over-RDMA version 1 credit management in detail. 1349 In the RPC-over-RDMA version 1 design, each RDMA Send from a 1350 requester contains an RPC Call with a credit request, and each RDMA 1351 Send from a responder contains an RPC Reply with a credit grant. The 1352 credit grant implies that enough Receives have been posted on the 1353 responder to handle the credit grant minus the number of pending RPC 1354 transactions (the number of remaining Receive buffers might be zero).
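The version 1 accounting described above can be sketched from the requester's point of view as follows. This Python fragment is a simplified model for illustration, not an implementation of [RFC8166]: the class name CreditTracker and its methods are hypothetical, and it assumes that exactly one Reply arrives for each Call and that a requester may have only one message outstanding before the first grant arrives.

```python
class CreditTracker:
    """Requester-side model of version 1 style credit accounting."""

    def __init__(self):
        # Before the first credit grant arrives, only one message may be
        # outstanding on a new connection.
        self.grant = 1
        self.outstanding = 0

    def available(self) -> int:
        """Sends the requester may still issue: grant minus pending RPCs."""
        return max(self.grant - self.outstanding, 0)

    def on_send(self) -> None:
        """Account for one RDMA Send that carries an RPC Call."""
        if self.available() == 0:
            raise RuntimeError("no credits available: wait for a Reply")
        self.outstanding += 1

    def on_reply(self, credit_grant: int) -> None:
        """A Reply completes one RPC and carries the responder's new grant."""
        self.outstanding -= 1
        self.grant = credit_grant


tracker = CreditTracker()
tracker.on_send()                 # first Call on the connection
tracker.on_reply(credit_grant=4)  # Reply grants four credits
for _ in range(4):
    tracker.on_send()
print(tracker.available())
```

In this model the only event that frees a credit is a Reply, so a Call that never produces a Reply leaves its credit consumed for the life of the connection.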
1356 In other words, each RPC Reply acts as an implicit ACK for a previous 1357 RPC Call from the requester, indicating that the responder has posted 1358 a Receive to replace the Receive consumed by the requester's RDMA 1359 Send. Without an RPC Reply message, the requester has no way to know 1360 that the responder is properly prepared for subsequent RPC Calls. 1362 Aside from being a bit of a layering violation, there are basic (but 1363 rare) cases where this arrangement is inadequate: 1365 o When a requester retransmits an RPC Call on the same connection as 1366 an earlier RPC Call for the same transaction. 1368 o When a requester transmits an RPC operation that requires no 1369 reply. 1371 o When more than one RPC-over-RDMA message is needed to complete the 1372 transaction (e.g., RDMA_DONE). 1374 Typically, the connection must be replaced in these cases. This 1375 resets the credit accounting mechanism but has an undesirable impact 1376 on other ongoing RPC transactions on that connection. 1378 Because credit management accompanies each RPC message, there is a 1379 strict one-to-one ratio between RDMA Send and RPC message. There are 1380 interesting use cases that might be enabled if this relationship were 1381 more flexible: 1383 o RPC-over-RDMA operations which do not carry an RPC message; e.g., 1384 control plane operations. 1386 o A single RDMA Send that conveys more than one RPC message for the 1387 purpose of interrupt mitigation. 1389 o An RPC message that is conveyed via several sequential RDMA Sends 1390 to reduce the use of explicit RDMA operations for moderate-sized 1391 RPC messages. 1393 o An RPC transaction that needs multiple exchanges or an odd number 1394 of RPC-over-RDMA operations to complete. 1396 Bi-directional RPC operation also introduces an ambiguity. 
If the 1397 RPC-over-RDMA message does not carry an RPC message, then it is not 1398 possible to determine whether the sender is a requester or a 1399 responder, and thus whether the rdma_credit field contains a credit 1400 request or a credit grant. 1402 A more sophisticated credit accounting mechanism is provided in RPC- 1403 over-RDMA version 2 in an attempt to address some of these 1404 shortcomings. This new mechanism is detailed in Section TBD. 1406 8.3. Inline Threshold Changes 1408 The term "inline threshold" is defined in Section 3.3.2 of [RFC8166]. 1409 An "inline threshold" value is the largest message size (in octets) 1410 that can be conveyed on an RDMA connection using only RDMA Send and 1411 Receive. Each connection has two inline threshold values: one for 1412 messages flowing from client-to-server (referred to as the "client- 1413 to-server inline threshold") and one for messages flowing from 1414 server-to-client (referred to as the "server-to-client inline 1415 threshold"). Note that [RFC8166] uses somewhat different 1416 terminology. This is because it was written with only forward- 1417 direction RPC transactions in mind. 1419 A connection's inline thresholds determine when RDMA Read or Write 1420 operations are required because the RPC message to be sent cannot be 1421 conveyed via a single RDMA Send and Receive pair. When an RPC 1422 message does not contain DDP-eligible data items, a requester 1423 prepares a Long Call or Reply to convey the whole RPC message using 1424 RDMA Read or Write operations. 1426 RDMA Read and Write operations require that each data payload resides 1427 in a region of memory that is registered with the RNIC. When an RPC 1428 is complete, that region is invalidated, fencing it from the 1429 responder. Memory registration and invalidation typically have a 1430 latency cost that is insignificant compared to data handling costs. 
1431 When a data payload is small, however, the cost of registering and 1432 invalidating the memory where the payload resides becomes a 1433 relatively significant part of total RPC latency. Therefore the most 1434 efficient operation of RPC-over-RDMA occurs when explicit RDMA Read 1435 and Write operations are used for large payloads, and are avoided for 1436 small payloads. 1438 When RPC-over-RDMA version 1 was conceived, the typical size of RPC 1439 messages that did not involve a significant data payload was under 1440 500 bytes. A 1024-byte inline threshold adequately minimized the 1441 frequency of inefficient Long Calls and Replies. 1443 With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND 1444 operations resulted in RPC messages that are on average larger and 1445 more complex than previous versions of NFS. With 1024-byte inline 1446 thresholds, RDMA Read or Write operations are needed for frequent 1447 operations that do not bear a data payload, such as GETATTR and 1448 LOOKUP, reducing the efficiency of the transport. 1450 To reduce the need to use Long Calls and Replies, RPC-over-RDMA 1451 version 2 increases the default size of inline thresholds. This also 1452 increases the maximum size of reverse-direction RPC messages. 1454 8.4. Support for Remote Invalidation 1456 An STag that is registered using the FRWR mechanism in a privileged 1457 execution context or is registered via a Memory Window in an 1458 unprivileged context may be invalidated remotely [RFC5040]. These 1459 mechanisms are available when a requester's RNIC supports 1460 MEM_MGT_EXTENSIONS. 1462 For the purposes of this discussion, there are two classes of STags. 1463 Dynamically-registered STags are used in a single RPC, then 1464 invalidated. Persistently-registered STags live longer than one RPC. 1465 They may persist for the life of an RPC-over-RDMA connection, or 1466 longer. 1468 An RPC-over-RDMA requester may provide more than one STag in one 1469 transport header. 
It may provide a combination of dynamically- and 1470 persistently-registered STags in one RPC message, or any combination 1471 of these in a series of RPCs on the same connection. Only 1472 dynamically-registered STags using Memory Windows or FRWR (i.e., 1473 registered via MEM_MGT_EXTENSIONS) may be invalidated remotely. 1475 There is no transport-level mechanism by which a responder can 1476 determine how a requester-provided STag was registered, nor whether 1477 it is eligible to be invalidated remotely. A requester that mixes 1478 persistently- and dynamically-registered STags in one RPC, or mixes 1479 them across RPCs on the same connection, must therefore indicate 1480 which handles may be invalidated via a mechanism provided in the 1481 Upper Layer Protocol. RPC-over-RDMA version 2 provides such a 1482 mechanism. 1484 The RDMA Send With Invalidate operation is used to invalidate an STag 1485 on a remote system. It is available only when a responder's RNIC 1486 supports MEM_MGT_EXTENSIONS, and must be utilized only when a 1487 requester's RNIC supports MEM_MGT_EXTENSIONS (that is, can receive and 1488 recognize an IETH). 1490 8.4.1. Reverse Direction Remote Invalidation 1492 Existing RPC-over-RDMA transport protocol specifications [RFC8166] 1493 [RFC8167] do not forbid direct data placement in the reverse 1494 direction, even though there is currently no Upper Layer Protocol 1495 that makes data items in reverse direction operations eligible for 1496 direct data placement. 1498 When chunks are present in a reverse direction RPC request, Remote 1499 Invalidation allows the responder to trigger invalidation of a 1500 requester's STags as part of sending a reply, the same way as is done 1501 in the forward direction. 1503 However, in the reverse direction, the server acts as the requester, 1504 and the client is the responder.
The server's RNIC, therefore, must 1505 support receiving an IETH, and the server must have registered the 1506 STags with an appropriate registration mechanism. 1508 8.5. Error Reporting Changes 1510 RPC-over-RDMA version 2 expands the repertoire of errors that may be 1511 reported by connection endpoints. This change, which is structured 1512 to enable extensibility, allows a peer to report overruns of specific 1513 resources and to avoid requester retries when an error is permanent. 1515 9. Extending the Version 2 Protocol 1517 RPC-over-RDMA version 2 is designed to be extensible in a way that 1518 enables the addition of OPTIONAL features that may subsequently be 1519 converted to REQUIRED status in a future protocol version. The 1520 protocol may be extended by Standards Track documents in a way 1521 analogous to that provided for Network File System Version 4 as 1522 described in [RFC8178]. 1524 This form of extensibility enables limited extensions to the base 1525 RPC-over-RDMA version 2 protocol presented in this document so that 1526 new optional capabilities can be introduced without a protocol 1527 version change, while maintaining robust interoperability with 1528 existing RPC-over-RDMA version 2 implementations. The design allows 1529 extensions to be defined, including the definition of new protocol 1530 elements, without requiring modification or recompilation of the 1531 existing XDR. 1533 A Standards Track document introduces each set of such protocol 1534 elements. Together these elements are considered an OPTIONAL 1535 feature. Each implementation is either aware of all the protocol 1536 elements introduced by that feature or is aware of none of them. 1538 Documents describing extensions to RPC-over-RDMA version 2 should 1539 contain: 1541 o An explanation of the purpose and use of each new protocol element 1542 added. 1544 o An XDR description including all of the new protocol elements, and 1545 a script to extract it. 
1547 o A description of interactions with existing extensions. 1549 This includes possible requirements of other OPTIONAL features to 1550 be present for new protocol elements to work, or that a particular 1551 level of support for an OPTIONAL facility is required for the new 1552 extension to work. 1554 Implementers combine the XDR descriptions of the new features they 1555 intend to use with the XDR description of the base protocol in this 1556 document. This may be necessary to create a valid XDR input file 1557 because extensions are free to use XDR types defined in the base 1558 protocol, and later extensions may use types defined by earlier 1559 extensions. 1561 The XDR description for the RPC-over-RDMA version 2 base protocol 1562 combined with that for any selected extensions should provide an 1563 adequate human-readable description of the extended protocol. 1565 The base protocol specified in this document may be extended within 1566 RPC-over-RDMA version 2 in two ways: 1568 o New OPTIONAL transport header types may be introduced by later 1569 Standards Track documents. Such transport header types will be 1570 documented as described in Section 9.1. 1572 o New OPTIONAL transport properties may be defined in later 1573 Standards Track documents. Such transport properties will be 1574 documented as described in Section 9.2. 1576 The following sorts of ancillary protocol elements may be added to 1577 the protocol to support the addition of new transport properties and 1578 header types. 1580 o New error codes may be created as described in Section 9.3. 1582 o New flags to use within the rdma_flags field may be created as 1583 described in Section 9.4. 1585 New capabilities can be proposed and developed independently of each 1586 other, and implementers can choose among them. This makes it 1587 straightforward to create and document experimental features and then 1588 bring them through the standards process. 1590 9.1. 
Adding New Header Types to RPC-over-RDMA Version 2 1592 New transport header types are to be defined in a manner similar to the 1593 way existing ones are described in Section 5.3.1 through 1594 Section 5.3.4. Specifically, what is needed is: 1596 o A description of the function and use of the new header type. 1598 o A complete XDR description of the new header type, including a 1599 description of the use of all fields within the header. 1601 o A description of how errors are reported, including the definition 1602 of a mechanism for reporting errors when the error is outside the 1603 choices already available in the base protocol or in 1604 other existing extensions. 1606 o An indication of whether a Payload stream must be present, and a 1607 description of its contents and how such Payload streams are used 1608 to construct RPC messages for processing. 1610 In addition, further documentation is made 1611 necessary by the OPTIONAL status of new transport header types. 1613 o Information about constraints on support for the new header types 1614 should be provided. For example, if support for one header type 1615 is implied or foreclosed by another one, this needs to be 1616 documented. 1618 o A preferred method by which a sender should determine whether the 1619 peer supports a particular header type needs to be provided. 1620 While it is always possible for a sender to send a test invocation of a 1621 particular header type to see if support is available, when more 1622 efficient means are available (e.g., the value of a transport 1623 property), this should be noted. 1625 9.2. Adding New Transport Properties to the Protocol 1627 The set of transport properties is designed to be extensible. As a 1628 result, once new properties are defined in Standards Track documents, 1629 the operations defined in this document may reference these new 1630 transport properties, as well as the ones described in this document.
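As an illustration of what an extensible property set can mean for an implementation, the following is a minimal sketch, not part of the protocol definition. It assumes that each property arrives as an rpcrdma2_propid paired with an opaque XDR-encoded value, and that a receiver skips identifiers it does not recognize rather than treating them as errors; this section does not state that behavior. The PropertyTable class, the decoder helpers, and the identifier values in the usage note are all hypothetical.

```python
import struct


def decode_uint32(raw):
    """Decode a 4-byte big-endian XDR unsigned integer."""
    (value,) = struct.unpack(">I", raw)
    return value


def decode_bool(raw):
    """Decode an XDR boolean (a 4-byte integer that is 0 or 1)."""
    return decode_uint32(raw) != 0


class PropertyTable:
    """Hypothetical per-implementation table of understood properties."""

    def __init__(self):
        self._decoders = {}

    def register(self, propid, decoder):
        # Supporting a property defined by a later document only adds
        # an entry here; existing entries are left untouched.
        self._decoders[propid] = decoder

    def decode(self, pairs):
        """Split (propid, raw) pairs into decoded values and unknown ids."""
        known, unknown = {}, []
        for propid, raw in pairs:
            decoder = self._decoders.get(propid)
            if decoder is None:
                # Assumption: unrecognized properties are skipped.
                unknown.append(propid)
            else:
                known[propid] = decoder(raw)
        return known, unknown
```

For example, a table that registers only identifier 1 with decode_uint32 would decode the pair (1, b"\x00\x00\x10\x00") to 4096 and report any other identifier as unknown until a decoder for it is registered.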
1632 A Standards Track document defining a new transport property should 1633 include the following information, paralleling that provided in this 1634 document for the transport properties defined herein: 1636 o The rpcrdma2_propid value used to identify this property. 1638 o The XDR typedef specifying the form in which the property value is 1639 communicated. 1641 o A description of the transport property that is communicated by 1642 the sender of RDMA2_CONNPROP. 1644 o An explanation of how this knowledge could be used by the peer 1645 receiving this information. 1647 The definition of transport property structures makes it 1648 easy to assign unique values. There is no requirement that a 1649 contiguous set of values be used, and implementations should not rely 1650 on all such values being small integers. A unique value should be 1651 selected when the defining document is first published as an Internet- 1652 Draft. When the document becomes a Standards Track document, the 1653 working group should ensure that: 1655 o rpcrdma2_propid values specified in the document do not conflict 1656 with those currently assigned or in use by other pending working 1657 group documents defining transport properties. 1659 o rpcrdma2_propid values specified in the document do not conflict 1660 with the range reserved for experimental use, as defined in 1661 Section 8.2. 1663 Documents defining new properties fall into one of three categories: 1665 o Those defining new properties and explaining (only) how they 1666 affect use of existing message types. 1668 o Those defining new OPTIONAL message types and new properties 1669 applicable to the operation of those new message types. 1671 o Those defining new OPTIONAL message types and new properties 1672 applicable both to new and existing message types.
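The uniqueness checks on rpcrdma2_propid values described in this section could be automated along the following lines. This is only a sketch under stated assumptions: the check_propids function is hypothetical, and the experimental range used here is an illustrative placeholder, not the range actually reserved by the specification.

```python
# Placeholder experimental range, for illustration only; the range
# actually reserved is defined by the specification, not here.
EXPERIMENTAL_PROPIDS = range(0xFFFF0000, 0xFFFFFFFF + 1)


def check_propids(proposed, assigned, pending,
                  experimental=EXPERIMENTAL_PROPIDS):
    """Map each problematic proposed propid to a list of reasons.

    proposed: propid values specified in the document under review
    assigned: propid values already assigned
    pending:  propid values in use by other pending documents
    """
    problems = {}
    for propid in proposed:
        reasons = []
        if propid in assigned:
            reasons.append("already assigned")
        if propid in pending:
            reasons.append("in use by a pending document")
        if propid in experimental:
            reasons.append("inside the experimental range")
        if reasons:
            problems[propid] = reasons
    return problems
```

An empty result means none of the proposed values collides. Because values need not be contiguous or small, the check treats them as arbitrary integers and does not assume any ordering.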
1674 When additional transport properties are proposed, the review of the 1675 associated standards track document should deal with possible 1676 security issues raised by those new transport properties. 1678 9.3. Adding New Error Codes to the Protocol 1680 New error codes to be returned when using new header types may be 1681 introduced in the same Standards Track document that defines the new 1682 header type. [ cel: what about adding a new error code that is 1683 returned for an existing header type? ] 1685 For error codes that do not require that additional error information 1686 be returned with them, the existing RDMA_ERR2 header can be used to 1687 report the new error. The new error code is set as the value of 1688 rdma_err with the result that the default switch arm of the 1689 rpcrdma2_error (i.e. void) is selected. 1691 For error codes that do require the return of additional error- 1692 related information together with the error, a new header type should 1693 be defined for the purpose of returning the error together with 1694 needed additional information. It should be documented just like any 1695 other new header type. 1697 When a new header type is sent, the sender needs to be prepared to 1698 accept header types necessary to report associated errors. 1700 9.4. Adding New Header Flags to the Protocol 1702 There are currently thirty-one flags available for later assignment. 1703 One possible use for such flags would be in a later protocol version, 1704 should that version retain the same general header structure as 1705 version 2. 1707 In addition, it is possible to assign unused flags within extensions 1708 made to version 2, as long as the following practices are adhered to: 1710 o Flags should not be added to the flag word in the prefix structure 1711 if those flags only apply to a single header type. New flags 1712 should only be defined for conditions applying to multiple header 1713 types. 
1715 o The document defining the new flag should indicate for which 1716 header types the flag value is meaningful and for which header 1717 types it is an error to set the flag or to leave it unset. 1719 o The sender needs to be provided with a means to determine whether 1720 the receiver is prepared to receive transport headers with the new 1721 flag set. This is most likely to take the form of a transport 1722 property together with the definition of suitable defaults to use 1723 when that property is not supported. Another possibility is to 1724 REQUIRE that receivers supporting a particular header type also 1725 support a set of additional flags. 1727 10. Relationship to Other RPC-over-RDMA Versions 1729 10.1. Relationship to RPC-over-RDMA Version 1 1731 In addition to the substantive protocol changes discussed in 1732 Section 8, there are a number of structural XDR changes whose goal is 1733 to enable within-version protocol extensibility. 1735 The RPC-over-RDMA version 1 transport header is defined as a single 1736 XDR object, with an RPC message proper potentially following it. In 1737 RPC-over-RDMA version 2, as described in Section 5.1, there are 1738 separate XDR definitions of the transport header prefix (see 1739 Section 3.2), which specifies the transport header type to be used, and 1740 of the specific transport header (defined within one of the subsections 1741 of Section 5). This is similar to the way that an RPC message 1742 consists of an RPC header (defined in [RFC5531]) and an RPC request 1743 or reply, defined by the Upper Layer Protocol being conveyed. 1745 As a new version of the RPC-over-RDMA transport protocol, RPC-over- 1746 RDMA version 2 exists within the versioning rules defined in 1747 [RFC8166].
In particular, it maintains the first four words of the 1748 protocol header as sent and received, as specified in Section 4.2 of 1749 [RFC8166], even though, as explained in Section 3.1 of this document, 1750 the XDR definition of those words is structured differently. 1752 Although each of the first four words retains its semantic function, 1753 there are important differences of field interpretation, besides the 1754 fact that the words have different names and different roles within the 1755 XDR constructs of which they are parts. 1757 o The first word of the header, previously the rdma_xid field, 1758 retains the format and function that it had in RPC-over-RDMA 1759 version 1. Within RPC-over-RDMA version 2, this word is the 1760 rdma_xid field of the structure rdma_start. However, to 1761 accommodate the use of request-response pairing of non-RPC 1762 messages and the potential use of message continuation, it cannot 1763 be assumed that it will always have the same value it would have 1764 had in RPC-over-RDMA version 1. As a result, the contents of this 1765 field should not be used without consideration of the associated 1766 protocol version identification. 1768 o The second word of the header, previously the rdma_vers field, 1769 retains the format and function that it had in RPC-over-RDMA 1770 version 1. Within RPC-over-RDMA version 2, this word is the 1771 rdma_vers field of the structure rdma_start. To clearly 1772 distinguish version 1 and version 2 messages, senders MUST fill in 1773 the correct version (fixed after version negotiation) and 1774 receivers MUST check that the content of the rdma_vers field is correct 1775 before referencing any other header field. 1777 o The third word of the header, previously the rdma_credit field, 1778 retains the format and general purpose that it had in RPC-over- 1779 RDMA version 1. Within RPC-over-RDMA version 2, this word is the 1780 rdma_credit field of the structure rdma_start.
The RPC-over-RDMA 1781 version 2 protocol provides additional mechanisms that determine 1782 whether the value contained in this field is a credit request or 1783 grant. Also, the way in which credits are accounted for may be 1784 different in RPC-over-RDMA version 2. 1786 o The fourth word of the header, previously the union discriminator 1787 field rdma_proc, retains its format and general function even 1788 though the set of valid values has changed. The value of this 1789 field is now considered an unsigned 32-bit integer rather than an 1790 enum. Within RPC-over-RDMA version 2, this word is the rdma_htype 1791 field of the structure rdma_start. 1793 Beyond conforming to the restrictions specified in [RFC8166], RPC- 1794 over-RDMA version 2 tightly limits the scope of the changes made in 1795 order to ensure interoperability. It makes no major structural 1796 changes to the protocol, and all existing transport header types used 1797 in version 1 (as defined in [RFC8166]) are retained in version 2. 1798 Chunks are expressed using the same on-the-wire format and are used 1799 in the same way in both versions. 1801 10.2. Extensibility Beyond RPC-over-RDMA Version 2 1803 Subsequent RPC-over-RDMA versions are free to change the protocol in 1804 any way they choose as long as they maintain the first four header 1805 words as currently specified by [RFC8166]. 1807 Such changes might involve deletion or major re-organization of 1808 existing transport headers. However, the need for interoperability 1809 between adjacent versions will often limit the scope of changes that 1810 can be made in a single version. 1812 In some cases it may prove desirable to transition to a new version 1813 by using the extension features described for use with RPC-over-RDMA 1814 version 2, by continuing the same basic extension model but allowing 1815 header types and properties that were OPTIONAL in one version to 1816 become REQUIRED in the subsequent version. 1818 11. 
Security Considerations 1820 The security considerations for RPC-over-RDMA version 2 are the same 1821 as those for RPC-over-RDMA version 1. 1823 11.1. Security Considerations (Transport Properties) 1825 Like other fields that appear in each RPC-over-RDMA header, property 1826 information is sent in the clear on the fabric with no integrity 1827 protection, making it vulnerable to man-in-the-middle attacks. 1829 For example, if a man-in-the-middle were to change the value of the 1830 Receive buffer size or the Requester Remote Invalidation boolean, it 1831 could reduce connection performance or trigger loss of connection. 1832 Repeated connection loss can impact performance or even prevent a new 1833 connection from being established. Recourse is to deploy on a 1834 private network or use link-layer encryption. 1836 12. IANA Considerations 1838 This document does not require actions by IANA. 1840 13. References 1842 13.1. Normative References 1844 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1845 Requirement Levels", BCP 14, RFC 2119, 1846 DOI 10.17487/RFC2119, March 1997, 1847 . 1849 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 1850 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 1851 2006, . 1853 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 1854 Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, 1855 May 2009, . 1857 [RFC8166] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct 1858 Memory Access Transport for Remote Procedure Call Version 1859 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, 1860 . 1862 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1863 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1864 May 2017, . 1866 13.2. Informative References 1868 [IBARCH] InfiniBand Trade Association, "InfiniBand Architecture 1869 Specification Volume 1", Release 1.3, March 2015, 1870 . 1873 [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 
1874 Garcia, "A Remote Direct Memory Access Protocol 1875 Specification", RFC 5040, DOI 10.17487/RFC5040, October 1876 2007, . 1878 [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct 1879 Data Placement over Reliable Transports", RFC 5041, 1880 DOI 10.17487/RFC5041, October 2007, 1881 . 1883 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1884 "Network File System (NFS) Version 4 Minor Version 1 1885 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 1886 . 1888 [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1889 "Network File System (NFS) Version 4 Minor Version 1 1890 External Data Representation Standard (XDR) Description", 1891 RFC 5662, DOI 10.17487/RFC5662, January 2010, 1892 . 1894 [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System 1895 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, 1896 March 2015, . 1898 [RFC8167] Lever, C., "Bidirectional Remote Procedure Call on RPC- 1899 over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167, 1900 June 2017, . 1902 [RFC8178] Noveck, D., "Rules for NFSv4 Extensions and Minor 1903 Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017, 1904 . 1906 Acknowledgments 1908 The authors gratefully acknowledge the work of Brent Callaghan and 1909 Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC 1910 5666). The authors also wish to thank Bill Baker, Greg Marsden, and 1911 Matt Benjamin for their support of this work. 1913 The XDR extraction conventions were first described by the authors of 1914 the NFS version 4.1 XDR specification [RFC5662]. Herbert van den 1915 Bergh suggested the replacement sed script used in this document. 1917 Special thanks go to Transport Area Director Spencer Dawkins, NFSV4 1918 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 1919 Working Group Secretary Thomas Haynes for their support. 
1921 Authors' Addresses 1923 Charles Lever (editor) 1924 Oracle Corporation 1925 1015 Granger Avenue 1926 Ann Arbor, MI 48104 1927 United States of America 1929 Phone: +1 248 816 6463 1930 Email: chuck.lever@oracle.com 1931 David Noveck 1932 NetApp 1933 1601 Trapelo Road 1934 Waltham, MA 02451 1935 United States of America 1937 Phone: +1 781 572 8038 1938 Email: davenoveck@gmail.com