idnits 2.17.1

draft-cel-nfsv4-rpcrdma-version-two-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 654 has weird spacing: '...k_lists rdma_...'

  == Line 677 has weird spacing: '...k_lists rdma_...'

  == Line 1143 has weird spacing: '...k_lists rdma_...'

  == Line 1151 has weird spacing: '...k_lists rdma_...'

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008. The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs. If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 6, 2019) is 1817 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Obsolete informational reference (is this intentional?): RFC 5661
     (Obsoleted by RFC 8881)

     Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: November 7, 2019                                         NetApp
                                                             May 6, 2019

                    RPC-over-RDMA Version 2 Protocol
                 draft-cel-nfsv4-rpcrdma-version-two-09

Abstract

   This document specifies a new version of the transport protocol that
   conveys Remote Procedure Call (RPC) messages on physical transports
   capable of Remote Direct Memory Access (RDMA). The new version of
   this protocol is extensible.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 7, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  RPC-over-RDMA Version 2 Headers and Chunks
     3.1.  rpcrdma_common: Common Transport Header Prefix
     3.2.  rpcrdma2_hdr_prefix: Version 2 Transport Header Prefix
     3.3.  rpcrdma2_chunk_lists: Describe External Data Payload
   4.  Transport Properties
     4.1.  Transport Properties Model
     4.2.  Current Transport Properties
       4.2.1.  Receive Buffer Size
       4.2.2.  Reverse Request Support
   5.  RPC-over-RDMA Version 2 Transport Messages
     5.1.  Overall Transport Message Structure
     5.2.  Transport Header Types
     5.3.  Header Types Defined in RPC-over-RDMA version 2
       5.3.1.  RDMA2_MSG: Convey RPC Message Inline
       5.3.2.  RDMA2_NOMSG: Convey External RPC Message
       5.3.3.  RDMA2_ERROR: Report Transport Error
       5.3.4.  RDMA2_CONNPROP: Advertise Transport Properties
   6.  XDR Protocol Definition
     6.1.  Code Component License
     6.2.  Extraction and Use of XDR Definitions
     6.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures
     6.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types
     6.5.  Use of the XDR Description Files
   7.  Protocol Version Negotiation
     7.1.  Server Does Support RPC-over-RDMA Version 2
     7.2.  Server Does Not Support RPC-over-RDMA Version 2
     7.3.  Client Does Not Support RPC-over-RDMA Version 2
   8.  Differences from the RPC-over-RDMA Version 1 Protocol
     8.1.  Transport Properties
     8.2.  Credit Management Changes
     8.3.  Inline Threshold Changes
     8.4.  Support for Remote Invalidation
       8.4.1.  Reverse Direction Remote Invalidation
     8.5.  Error Reporting Changes
   9.  Extending the Version 2 Protocol
     9.1.  Adding New Header Types to RPC-over-RDMA Version 2
     9.2.  Adding New Transport properties to the Protocol
     9.3.  Adding New Error Codes to the Protocol
     9.4.  Adding New Header Flags to the Protocol
   10. Relationship to other RPC-over-RDMA Versions
     10.1.  Relationship to RPC-over-RDMA Version 1
     10.2.  Extensibility Beyond RPC-over-RDMA Version 2
   11. Security Considerations
     11.1.  Security Considerations (Transport Properties)
   12. IANA Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Acknowledgments
   Authors' Addresses

1.  Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBARCH] is a
   technique for moving data efficiently between end nodes. By
   directing data into destination buffers as it is sent on a network
   and placing it using direct memory access implemented by hardware,
   the complementary benefits of faster transfers and reduced host
   overhead are obtained.

   RPC-over-RDMA version 1 enables ONC RPC [RFC5531] messages to be
   conveyed on RDMA transports. That protocol is specified in
   [RFC8166]. RPC-over-RDMA version 1 is deployed and in use, although
   there are known shortcomings to this protocol:

   o  The protocol's default size of Receive buffers forces the use of
      RDMA Read and Write transfers for small payloads, and limits the
      size of reverse direction messages.

   o  It is difficult to make optimizations or protocol fixes that
      require changes to on-the-wire behavior.

   To address these issues in a way that is compatible with existing
   RPC-over-RDMA version 1 deployments, a new version of the
   RPC-over-RDMA transport protocol is presented in this document.

   This new version of RPC-over-RDMA is extensible, enabling OPTIONAL
   extensions to be added without impacting existing implementations.

   To enable protocol extension, the XDR definition for RPC-over-RDMA
   version 2 is organized differently than the definition of version 1.
   These changes, which are discussed in Section 10.1, do not affect the
   on-the-wire format.

   In addition, RPC-over-RDMA version 2 contains a set of incremental
   changes that relieve certain performance constraints and enable
   recovery from certain abnormal corner cases. These changes include:

   o  The exchange of transport properties as described in Section 8.1.

   o  A more flexible credit account mechanism, detailed in Section TBD.

   o  Larger default inline thresholds as described in Section 8.3.

   o  Support for remote invalidation as explained in Section 8.4.

   o  Support for reverse direction operation, as described in
      [RFC8167], is now REQUIRED. Details are in Section 3.2.

   o  An expansion of error reporting capabilities, described in
      Section 5.3.3. A summary of the reasons for this expansion
      appears in Section 8.5. This expansion supports the addition of
      new error codes as described in Section 9.3.

   Because of the way in which RPC-over-RDMA version 2 builds upon the
   facilities present in RPC-over-RDMA version 1, a knowledge of the
   basic structure of RPC-over-RDMA version 1, as described in
   [RFC8166], is assumed in this document.

   As in that document, the terms "RPC Payload Stream" and "Transport
   Header Stream" (defined in Section 3.2 of that document) are used to
   distinguish between an RPC message as defined by [RFC5531] and the
   header whose job it is to describe the RPC message and its associated
   memory resources. In that regard, the reader is assumed to
   understand how RDMA is used to transfer chunks between client and
   server, the use of Position-Zero Read chunks and Reply chunks to
   convey Long RPC messages, and the role of DDP-eligibility in
   constraining how data payloads are to be conveyed.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  RPC-over-RDMA Version 2 Headers and Chunks

   Most RPC-over-RDMA version 2 data structures are derived from
   corresponding structures in RPC-over-RDMA version 1. As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names and there are a few small changes in content. In some
   cases, there have been structural re-organizations to enable
   protocol extensibility.

3.1.  rpcrdma_common: Common Transport Header Prefix

   The rpcrdma_common prefix describes the first part of each
   RPC-over-RDMA transport header for version 2 and subsequent versions.

   <CODE BEGINS>

   struct rpcrdma_common {
           uint32         rdma_xid;
           uint32         rdma_vers;
           uint32         rdma_credit;
           uint32         rdma_htype;
   };

   <CODE ENDS>

   RPC-over-RDMA version 2's use of these first four words matches that
   of version 1 as required by [RFC8166].
   However, there are important structural differences in the way that
   these words are described by the respective XDR descriptions:

   o  The header type is represented as a uint32 rather than as an enum
      that would need to be modified to reflect additions to the set of
      header types made by later extensions.

   o  The header type field is part of an XDR structure devoted to
      representing the transport header prefix, rather than being part
      of a discriminated union that includes the body of each transport
      header type.

   o  There is now a prefix structure (see Section 3.2) of which the
      rpcrdma_common structure is the initial segment. This is a newly
      defined XDR object within the protocol description, in contrast
      with RPC-over-RDMA version 1, which limits the common portion of
      all header types to the four words in rpcrdma_common.

   These changes are part of a larger structural change in the XDR
   description of RPC-over-RDMA version 2 that enables a cleaner
   treatment of protocol extension. The XDR appearing in Section 6
   reflects these changes, which are discussed in further detail in
   Section 10.1.

3.2.  rpcrdma2_hdr_prefix: Version 2 Transport Header Prefix

   The following prefix structure appears at the start of any
   RPC-over-RDMA version 2 transport header.

   <CODE BEGINS>

   const RPCRDMA2_F_RESPONSE = 0x00000001;

   struct rpcrdma2_hdr_prefix {
           struct rpcrdma_common rdma_start;
           uint32                rdma_flags;
   };

   <CODE ENDS>

   The rdma_flags field is new to RPC-over-RDMA version 2. Currently,
   the only flag defined within this word is the RPCRDMA2_F_RESPONSE
   flag. The other bits are reserved for future use as described in
   Section 9.4. The sender MUST set these to zero.

   The RPCRDMA2_F_RESPONSE flag qualifies the values contained in the
   transport header's rdma_start.rdma_xid and rdma_start.rdma_credit
   fields.
   The RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid
   performing an XID lookup on incoming reverse direction Call
   messages, and to apply the value of the rdma_start.rdma_credit field
   correctly, based on the direction of the message being conveyed.

   In general, when a message carries an XID that was generated by the
   message's receiver (that is, the receiver is acting as a requester),
   the message's sender sets the RPCRDMA2_F_RESPONSE flag. Otherwise
   that flag is clear. For example:

   o  When the rdma_start.rdma_htype field has the value RDMA2_MSG or
      RDMA2_NOMSG, the value of the RPCRDMA2_F_RESPONSE flag MUST be the
      same as the value of the associated RPC message's msg_type field.

   o  When the header type is anything else and a whole or partial RPC
      message payload is present, the value of the RPCRDMA2_F_RESPONSE
      flag MUST be the same as the value of the associated RPC message's
      msg_type field.

   o  When no RPC message payload is present, a requester MUST set the
      value of RPCRDMA2_F_RESPONSE to reflect how the receiver is to
      interpret the rdma_start.rdma_credit and rdma_start.rdma_xid
      fields.

   o  When the rdma_start.rdma_htype field has the value RDMA2_ERROR,
      the RPCRDMA2_F_RESPONSE flag MUST be set.

3.3.  rpcrdma2_chunk_lists: Describe External Data Payload

   The rpcrdma2_chunk_lists structure specifies how an RPC message is
   conveyed using explicit RDMA operations.

   <CODE BEGINS>

   struct rpcrdma2_chunk_lists {
           uint32                      rdma_inv_handle;
           struct rpcrdma2_read_list   *rdma_reads;
           struct rpcrdma2_write_list  *rdma_writes;
           struct rpcrdma2_write_chunk *rdma_reply;
   };

   <CODE ENDS>

   For the most part this structure parallels its RPC-over-RDMA version
   1 equivalent.
   That is, rdma_reads, rdma_writes, and rdma_reply provide,
   respectively, descriptions of the chunks used to read a Long request
   or directly placed data from the requester, to write directly placed
   response data into the requester's memory, and to write a Long reply
   into the requester's memory.

   An important addition relative to the corresponding RPC-over-RDMA
   version 1 rdma_header structures is the rdma_inv_handle field. This
   field supports remote invalidation of requester memory registrations
   via the RDMA Send With Invalidate operation.

   To request Remote Invalidation, a requester sets the value of the
   rdma_inv_handle field in an RPC Call's transport header to a non-zero
   value that matches one of the rdma_handle fields in that header. If
   none of the rdma_handle values in the header conveying the Call may
   be invalidated by the responder, the requester sets the RPC Call's
   rdma_inv_handle field to the value zero.

   If the responder chooses not to use Remote Invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the responder uses RDMA Send to transmit the
   matching RPC Reply.

   If a requester has provided a non-zero value in the RPC Call's
   rdma_inv_handle field and the responder chooses to use Remote
   Invalidation for the matching RPC Reply, the responder uses RDMA Send
   With Invalidate to transmit that RPC Reply, and uses the value in the
   corresponding Call's rdma_inv_handle field to construct the Send With
   Invalidate Work Request.

4.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for connection endpoints
   to communicate information about implementation properties, enabling
   compatible endpoints to optimize data transfer. Initially only a
   small set of transport properties is defined and a single operation
   is provided to exchange transport properties (see Section 5.3.4).
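
   Looking back at Section 3.3, the Remote Invalidation rules can be
   sketched as follows. This is a minimal, non-normative illustration:
   the function names, the may_invalidate set, and the tuples returned
   for the Send operation are assumptions made for the example and are
   not defined by this document.

```python
def choose_inv_handle(rdma_handles, may_invalidate):
    """Requester side: pick the rdma_inv_handle value for an RPC Call.

    rdma_handles   -- rdma_handle values present in the Call's header
    may_invalidate -- hypothetical set of handles the requester permits
                      the responder to invalidate
    """
    for handle in rdma_handles:
        # Zero is reserved to mean "Remote Invalidation not requested".
        if handle != 0 and handle in may_invalidate:
            return handle
    return 0


def choose_send_operation(rdma_inv_handle, responder_wants_invalidation):
    """Responder side: select the RDMA operation for the matching Reply."""
    if rdma_inv_handle != 0 and responder_wants_invalidation:
        # The handle from the Call is used to build the Work Request.
        return ("SEND_WITH_INVALIDATE", rdma_inv_handle)
    return ("SEND", None)
```

   In this sketch the responder falls back to a plain RDMA Send
   whenever either side declines Remote Invalidation.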

   Both the set of transport properties and the operations used to
   communicate them may be extended. Within RPC-over-RDMA version 2,
   all such extensions are OPTIONAL. For information about existing
   transport properties, see Sections 4.1 through 4.2. For discussion
   of extensions to the set of transport properties, see Section 9.2.

4.1.  Transport Properties Model

   A basic set of receiver and sender properties is specified in this
   document. An extensible approach is used, allowing new properties to
   be defined in future Standards Track documents.

   Such properties are specified using:

   o  A code point identifying the particular transport property being
      specified.

   o  A nominally opaque array which contains within it the XDR encoding
      of the specific property indicated by the associated code point.

   The following XDR types are used by operations that deal with
   transport properties:

   <CODE BEGINS>

   typedef uint32 rpcrdma2_propid;

   struct rpcrdma2_propval {
           rpcrdma2_propid rdma_which;
           opaque          rdma_data<>;
   };

   typedef rpcrdma2_propval rpcrdma2_propset<>;

   typedef uint32 rpcrdma2_propsubset<>;

   <CODE ENDS>

   An rpcrdma2_propid specifies a particular transport property. In
   order to facilitate XDR extension of the set of properties by
   concatenating XDR definition files, specific properties are defined
   as const values rather than as elements in an enum.

   An rpcrdma2_propval specifies a value of a particular transport
   property with the particular property identified by rdma_which, while
   the associated value of that property is contained within rdma_data.

   An rdma_data field which is of zero length is interpreted as
   indicating the default value of the property indicated by rdma_which.

   While rdma_data is defined as opaque within the XDR, the contents are
   interpreted (except when of length zero) using the XDR typedef
   associated with the property specified by rdma_which.
   As a result, when rpcrdma2_propval does not conform to that typedef,
   the receiver is REQUIRED to return the error RDMA2_ERR_BAD_XDR using
   the header type RDMA2_ERROR as described in Section 5.3.3. For
   example, the receiver of a message containing an otherwise valid
   rpcrdma2_propval returns this error if the length of rdma_data is
   such that it extends beyond the bounds of the message being
   transferred.

   In cases in which the rpcrdma2_propid specified by rdma_which is
   understood by the receiver, the receiver also MUST report the error
   RDMA2_ERR_BAD_XDR if either of the following occurs:

   o  The nominally opaque data within rdma_data is not valid when
      interpreted using the property-associated typedef.

   o  The length of rdma_data is insufficient to contain the data
      represented by the property-associated typedef.

   Note that no error is to be reported if rdma_which is unknown to the
   receiver. In that case, that rpcrdma2_propval is not processed and
   processing continues using the next rpcrdma2_propval, if any.

   An rpcrdma2_propset specifies a set of transport properties. No
   particular ordering of the rpcrdma2_propval items within it is
   imposed.

   An rpcrdma2_propsubset identifies a subset of the properties in a
   previously specified rpcrdma2_propset. Each bit in the mask denotes
   a particular element in a previously specified rpcrdma2_propset. If
   a particular rpcrdma2_propval is at position N in the array, then bit
   number N mod 32 in word N div 32 specifies whether that particular
   rpcrdma2_propval is included in the defined subset. Words beyond the
   last one specified are treated as containing zero.

4.2.  Current Transport Properties

   Although the set of transport properties may be extended, a basic set
   of transport properties is defined in Table 1.
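
   As a non-normative aside on the Section 4.1 encoding, the sketch
   below decodes an rpcrdma2_propset and tests membership in an
   rpcrdma2_propsubset. It assumes standard XDR encoding (big-endian
   words, opaque data padded to a multiple of four octets). The BadXdr
   exception is a hypothetical stand-in for replying with
   RDMA2_ERR_BAD_XDR, and treating bit 0 as the least significant bit
   of each word is an assumption, since the text above does not state
   the bit numbering.

```python
import struct


class BadXdr(Exception):
    """Hypothetical stand-in for replying with RDMA2_ERR_BAD_XDR."""


def decode_propset(buf):
    """Decode an rpcrdma2_propset into a list of (rdma_which, rdma_data)."""
    def read_u32(offset):
        if offset + 4 > len(buf):
            raise BadXdr("message truncated")
        return struct.unpack_from(">I", buf, offset)[0], offset + 4

    count, offset = read_u32(0)          # array length precedes the items
    props = []
    for _ in range(count):
        which, offset = read_u32(offset)
        length, offset = read_u32(offset)
        padded = (length + 3) & ~3       # XDR pads opaque data to 4 octets
        if offset + padded > len(buf):
            raise BadXdr("rdma_data extends beyond the message")
        props.append((which, bytes(buf[offset:offset + length])))
        offset += padded
    return props


def propsubset_contains(words, n):
    """True if the rpcrdma2_propval at array position n is in the subset."""
    word, bit = divmod(n, 32)            # word N div 32, bit N mod 32
    if word >= len(words):
        return False                     # missing words are treated as zero
    return bool((words[word] >> bit) & 1)  # assumes bit 0 is the LSB
```

   A receiver built this way would look up each rdma_which value after
   decoding and skip, without error, any value it does not recognize.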

   In that table, the columns contain the following information:

   o  The column labeled "Property" identifies the transport property
      described by the current row.

   o  The column labeled "Code" specifies the rpcrdma2_propid value used
      to identify this property.

   o  The column labeled "XDR type" gives the XDR type of the data used
      to communicate the value of this property. This data type
      overlays the data portion of the nominally opaque field rdma_data
      in a rpcrdma2_propval.

   o  The column labeled "Default" gives the default value for the
      property which is to be assumed by those who do not receive, or
      are unable to interpret, information about the actual value of the
      property.

   o  The column labeled "Sec" indicates the section within this
      document that explains the semantics and use of this transport
      property.

   +-------------------------+------+------------------------+-----------------------+-------+
   | Property                | Code | XDR type               | Default               | Sec   |
   +-------------------------+------+------------------------+-----------------------+-------+
   | Receive Buffer Size     | 1    | uint32                 | 4096                  | 4.2.1 |
   | Reverse Request Support | 2    | enum rpcrdma2_rvreqsup | RDMA2_RVREQSUP_INLINE | 4.2.2 |
   +-------------------------+------+------------------------+-----------------------+-------+

                                  Table 1

4.2.1.  Receive Buffer Size

   The Receive Buffer Size specifies the minimum size, in octets, of
   pre-posted receive buffers. It is the responsibility of the endpoint
   sending this value to ensure that its pre-posted receive buffers are
   at least the size specified, allowing the endpoint receiving this
   value to send messages that are of this size.
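
   The guarantee described above can be sketched as follows. This is a
   minimal illustration, not a definitive implementation: the function
   names are hypothetical, and message_len is assumed to count the
   whole Send payload (transport header plus any inline RPC message).
   The default of 4096 octets is the one given in Table 1, and a
   zero-length rdma_data selects that default as described in
   Section 4.1.

```python
import struct

DEFAULT_RBSIZ = 4096  # default Receive Buffer Size, in octets (Table 1)


def decode_rbsiz(rdma_data):
    """Interpret the rdma_data of a Receive Buffer Size property."""
    if len(rdma_data) == 0:
        return DEFAULT_RBSIZ            # zero length selects the default
    if len(rdma_data) != 4:
        # A real receiver would report a bad-XDR error to its peer here.
        raise ValueError("rdma_data is not a single uint32")
    return struct.unpack(">I", rdma_data)[0]


def fits_receive_buffer(message_len, peer_rbsiz=DEFAULT_RBSIZ):
    """True if a message of message_len octets fits the peer's buffers."""
    return message_len <= peer_rbsiz
```

   A sender that never learns its peer's value would call
   fits_receive_buffer() with the default and fall back to explicit
   RDMA for anything larger.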

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_RBSIZ = 1;
   typedef uint32 rpcrdma2_prop_rbsiz;

   <CODE ENDS>

   The sender may use its knowledge of the receiver's buffer size to
   determine when the message to be sent will fit in the pre-posted
   receive buffers that the receiver has set up. In particular,

   o  Requesters may use the value to determine when it is necessary to
      provide a Position-Zero Read chunk when sending a request.

   o  Requesters may use the value to determine when it is necessary to
      provide a Reply chunk when sending a request, based on the maximum
      possible size of the reply.

   o  Responders may use the value to determine when it is necessary,
      given the actual size of the reply, to actually use a Reply chunk
      provided by the requester.

4.2.2.  Reverse Request Support

   The value of this property is used to indicate a client
   implementation's readiness to accept and process messages that are
   part of reverse direction RPC requests.

   <CODE BEGINS>

   enum rpcrdma2_rvreqsup {
           RDMA2_RVREQSUP_NONE = 0,
           RDMA2_RVREQSUP_INLINE = 1,
           RDMA2_RVREQSUP_GENL = 2
   };

   const uint32 RDMA2_PROPID_BRS = 2;
   typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;

   <CODE ENDS>

   Multiple levels of support are distinguished:

   o  The value RDMA2_RVREQSUP_NONE indicates that receipt of reverse
      direction requests and replies is not supported.

   o  The value RDMA2_RVREQSUP_INLINE indicates that receipt of reverse
      direction requests or replies is only supported using inline
      messages and that use of explicit RDMA operations or other forms
      of Direct Data Placement for reverse direction requests or
      responses is not supported.

   o  The value RDMA2_RVREQSUP_GENL indicates that receipt of reverse
      direction requests or replies is supported in the same ways that
      forward direction requests or replies typically are.
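
   The three support levels listed above can be sketched as a simple
   check made before sending a reverse direction message. The class
   and function names are hypothetical; only the numeric values and the
   RDMA2_RVREQSUP_INLINE default come from this document.

```python
from enum import IntEnum


class RvReqSup(IntEnum):
    """Mirrors the values of enum rpcrdma2_rvreqsup."""
    NONE = 0
    INLINE = 1
    GENL = 2


# Default assumed when the peer did not advertise the property.
DEFAULT_RVREQSUP = RvReqSup.INLINE


def may_send_reverse_message(peer_level, uses_explicit_rdma):
    """Decide whether a reverse direction message may be sent to a peer.

    peer_level         -- the peer's advertised (or default) RvReqSup
    uses_explicit_rdma -- True if the message needs Read, Write, or
                          Reply chunks rather than being fully inline
    """
    if peer_level == RvReqSup.NONE:
        return False
    if peer_level == RvReqSup.INLINE:
        return not uses_explicit_rdma   # inline messages only
    return True                         # GENL: same as forward direction
```

   For example, may_send_reverse_message(DEFAULT_RVREQSUP, True) is
   False, since the default level only covers inline messages.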

   When information about this property is not provided, the support
   level of servers can be inferred from the reverse direction requests
   that they issue, assuming that issuing a request implicitly indicates
   support for receiving the corresponding reply. On this basis,
   support for receiving inline replies can be assumed when requests
   without Read chunks, Write chunks, or Reply chunks are issued, while
   requests with any of these elements allow the client to assume that
   general support for reverse direction replies is present on the
   server.

5.  RPC-over-RDMA Version 2 Transport Messages

5.1.  Overall Transport Message Structure

   Each transport message consists of multiple sections:

   o  A transport header prefix, as defined in Section 3.2. Among other
      things, this structure indicates the header type.

   o  The transport header proper, as defined by one of the sub-sections
      below. See Section 5.2 for the mapping between header types and
      the corresponding header structure.

   o  Potentially, an RPC message being conveyed as an addendum to the
      header.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which presented the first and
   second of the items above as a single XDR item. The new organization
   is more in keeping with RPC-over-RDMA version 2's extensibility model
   in that new header types can be defined without modifying the
   existing set of header types.

5.2.  Transport Header Types

   The new header types within RPC-over-RDMA version 2 are set forth in
   Table 2. In that table, the columns contain the following
   information:

   o  The column labeled "Operation" specifies the particular operation.

   o  The column labeled "Code" specifies the value of header type for
      this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to describe the information in this new message
      type.
      This data immediately follows the universal portion of the
      transport header present in every RPC-over-RDMA transport header.

   o  The column labeled "Msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "Sec" indicates the section (within this
      document) that explains the semantics and use of this operation.

   +----------------------------------+------+-------------------+-----+-------+
   | Operation                        | Code | XDR type          | Msg | Sec   |
   +----------------------------------+------+-------------------+-----+-------+
   | Convey Appended RPC Message      | 0    | rpcrdma2_msg      | Yes | 5.3.1 |
   | Convey External RPC Message      | 1    | rpcrdma2_nomsg    | No  | 5.3.2 |
   | Report Transport Error           | 4    | rpcrdma2_err      | No  | 5.3.3 |
   | Specify Properties at Connection | 5    | rpcrdma2_connprop | No  | 5.3.4 |
   +----------------------------------+------+-------------------+-----+-------+

                                  Table 2

   Support for the operations in Table 2 is REQUIRED. Support for
   additional operations will be OPTIONAL. RPC-over-RDMA version 2
   implementations that receive an OPTIONAL operation that is not
   supported MUST respond with an RDMA2_ERROR message with an error code
   of RDMA2_ERR_INVAL_HTYPE.

5.3.  Header Types Defined in RPC-over-RDMA version 2

   The header types defined and used in RPC-over-RDMA version 1 are all
   carried over into RPC-over-RDMA version 2, although there may be
   limited changes in the definition of existing header types.

   In comparison with the header types of RPC-over-RDMA version 1, the
   changes can be summarized as follows:

   o  To simplify interoperability with RPC-over-RDMA version 1, only
      the RDMA2_ERROR header (defined in Section 5.3.3) has an XDR
      definition that differs from that in RPC-over-RDMA version 1, and
      its modifications are all compatible extensions.

   o  RDMA2_MSG and RDMA2_NOMSG (defined in Sections 5.3.1 and 5.3.2)
      have XDR definitions that match the corresponding RPC-over-RDMA
      version 1 header types. However, because of the changes to the
      header prefix, the version 1 and version 2 header types differ in
      on-the-wire format.

   o  RDMA2_CONNPROP (defined in Section 5.3.4) is a completely new
      header type devoted to enabling connection peers to exchange
      information about their transport properties.

5.3.1.  RDMA2_MSG: Convey RPC Message Inline

   RDMA2_MSG is used to convey an RPC message that immediately follows
   the Transport Header in the Send buffer. This is either an RPC
   request that has no Position-Zero Read chunk or an RPC reply that is
   not sent using a Reply chunk.

   <CODE BEGINS>

   const rpcrdma2_proc RDMA2_MSG = 0;

   struct rpcrdma2_msg {
           struct rpcrdma2_chunk_lists rdma_chunks;

           /* The rpc message starts here and continues
            * through the end of the transmission. */
           uint32                      rdma_rpc_first_word;
   };

   <CODE ENDS>

5.3.2.  RDMA2_NOMSG: Convey External RPC Message

   RDMA2_NOMSG is used to convey an entire RPC message using explicit
   RDMA operations. Usually this is because the RPC message does not
   fit within the size limits that result from the receiver's inline
   threshold. The message may be a Long request, which is read from a
   memory area specified by a Position-Zero Read chunk; or a Long reply,
   which is written into a memory area specified by a Reply chunk.

   <CODE BEGINS>

   const rpcrdma2_proc RDMA2_NOMSG = 1;

   struct rpcrdma2_nomsg {
           struct rpcrdma2_chunk_lists rdma_chunks;
   };

   <CODE ENDS>

5.3.3.  RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR provides a way of reporting the occurrence of transport
   errors on a previous transmission. This header type MUST NOT be
   transmitted by a requester.
   [ cel: how is the XID field set when sending an error report from a
   requester, or when the error occurred on a non-RPC message? ]

      const rpcrdma2_proc RDMA2_ERROR = 4;

      struct rpcrdma2_err_vers {
              uint32 rdma_vers_low;
              uint32 rdma_vers_high;
      };

      struct rpcrdma2_err_write {
              uint32 rdma_chunk_index;
              uint32 rdma_length_needed;
      };

      union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
              case RDMA2_ERR_VERS:
                rpcrdma2_err_vers rdma_vrange;
              case RDMA2_ERR_READ_CHUNKS:
                uint32 rdma_max_chunks;
              case RDMA2_ERR_WRITE_CHUNKS:
                uint32 rdma_max_chunks;
              case RDMA2_ERR_SEGMENTS:
                uint32 rdma_max_segments;
              case RDMA2_ERR_WRITE_RESOURCE:
                rpcrdma2_err_write rdma_writeres;
              case RDMA2_ERR_REPLY_RESOURCE:
                uint32 rdma_length_needed;
              default:
                void;
      };

   Error reporting is addressed in RPC-over-RDMA version 2 in a fashion
   similar to RPC-over-RDMA version 1.  Several new error codes are
   added, and error messages never flow from requester to responder.
   RPC-over-RDMA version 1 error reporting is described in Section 5 of
   [RFC8166].

   In all cases below, the responder copies the values of the
   rdma_start.rdma_xid and rdma_start.rdma_vers fields from the incoming
   transport header that generated the error to the transport header of
   the error response.  The responder sets the rdma_start.rdma_htype
   field of the transport header prefix to RDMA2_ERROR, and the
   rdma_start.rdma_credit field is set to the credit grant value for
   this connection.  The receiver of this header type MUST ignore the
   value of the rdma_start.rdma_credit field.

   RDMA2_ERR_VERS
      This is the equivalent of ERR_VERS in RPC-over-RDMA version 1.
      The error code value, semantics, and utilization are the same.
   RDMA2_ERR_INVAL_HTYPE
      If a responder recognizes the value in the rdma_start.rdma_vers
      field, but it does not recognize the value in the
      rdma_start.rdma_htype field or does not support that header type,
      it MUST set the rdma_err field to RDMA2_ERR_INVAL_HTYPE.

   RDMA2_ERR_BAD_XDR
      If a responder recognizes the values in the rdma_start.rdma_vers
      and rdma_start.rdma_htype fields, but the incoming RPC-over-RDMA
      transport header cannot be parsed, it MUST set the rdma_err field
      to RDMA2_ERR_BAD_XDR.  This includes cases in which a nominally
      opaque property value field cannot be parsed using the XDR typedef
      associated with the transport property definition.  The error code
      value of RDMA2_ERR_BAD_XDR is the same as the error code value of
      ERR_CHUNK in RPC-over-RDMA version 1.  The responder MUST NOT
      process the request in any way except to send an error message.

   RDMA2_ERR_READ_CHUNKS
      If a requester presents more DDP-eligible arguments than the
      responder is prepared to Read, the responder MUST set the rdma_err
      field to RDMA2_ERR_READ_CHUNKS, and set the rdma_max_chunks field
      to the maximum number of Read chunks the responder can receive and
      process.

      If the responder implementation cannot handle any Read chunks for
      a request, it MUST set the rdma_max_chunks to zero in this
      response.  The requester SHOULD resend the request using a
      Position-Zero Read chunk.  If this was a request using a
      Position-Zero Read chunk, the requester MUST terminate the
      transaction with an error.

   RDMA2_ERR_WRITE_CHUNKS
      If a requester has constructed an RPC Call message with more
      DDP-eligible results than the responder is prepared to Write, the
      responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS,
      and set the rdma_max_chunks field to the maximum number of Write
      chunks the responder can process and return.
      If the responder implementation cannot handle any Write chunks for
      a request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE
      (below).  The requester SHOULD resend the request with no Write
      chunks and a Reply chunk of appropriate size.

   RDMA2_ERR_SEGMENTS
      If a requester has constructed an RPC Call message with a chunk
      that contains more segments than the responder supports, the
      responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS, and
      set the rdma_max_segments field to the maximum number of segments
      the responder can process.

   RDMA2_ERR_WRITE_RESOURCE
      If a requester has provided a Write chunk that is not large enough
      to fully convey a DDP-eligible result, the responder MUST set the
      rdma_err field to RDMA2_ERR_WRITE_RESOURCE.

      The responder MUST set the rdma_chunk_index field to point to the
      first Write chunk in the transport header that is too short, or to
      zero to indicate that it was not possible to determine which chunk
      is too small.  Indexing starts at one (1), which represents the
      first Write chunk.  The responder MUST set the rdma_length_needed
      to the number of bytes needed in that chunk in order to convey the
      result data item.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      index and length fields to zero), or it MAY send the request again
      using the same XID and more reply resources.

   RDMA2_ERR_REPLY_RESOURCE
      If an RPC Reply's Payload stream does not fit inline and the
      requester has not provided a large enough Reply chunk to convey
      the stream, the responder MUST set the rdma_err field to
      RDMA2_ERR_REPLY_RESOURCE.  The responder MUST set the
      rdma_length_needed to the number of Reply chunk bytes needed to
      convey the reply.
      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      rdma_length_needed field to zero), or it MAY send the request
      again using the same XID and larger reply resources.

   RDMA2_ERR_SYSTEM
      If some problem occurs on a responder that does not fit into the
      above categories, the responder MAY report it to the sender by
      setting the rdma_err field to RDMA2_ERR_SYSTEM.

      This is a permanent error: a requester that receives this error
      MUST terminate the RPC transaction associated with the XID value
      in the rdma_start.rdma_xid field.

5.3.4.  RDMA2_CONNPROP: Advertise Transport Properties

   The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint,
   whether client or server, to indicate to its partner relevant
   transport properties that the partner might need to be aware of.

   The message definition for this operation is as follows:

      struct rpcrdma2_connprop {
              rpcrdma2_propset rdma_props;
      };

   All relevant transport properties that the sender is aware of should
   be included in rdma_props.  Since support of each of the properties
   is OPTIONAL, the sender cannot assume that the receiver will
   necessarily take note of these properties.  The sender should be
   prepared for cases in which the receiver continues to assume that the
   default value for a particular property is still in effect.

   Generally, a participant will send an RDMA2_CONNPROP message as the
   first message after a connection is established.  Given that fact,
   the sender should make sure that the message can be received by peers
   who use the default Receive Buffer Size.  The connection's initial
   receive buffer size is typically 1KB, but it depends on the initial
   connection state of the RPC-over-RDMA version in use.
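   As a rough illustration of the size constraint above, the following
   Python sketch hand-encodes an RDMA2_CONNPROP message that carries a
   single Receive Buffer Size property and checks that it fits in a
   1024-byte buffer.  The field layout follows the XDR in Section 6.
   The function names, the zero rdma_flags value, the sample XID and
   credit value, and the 4096-byte property value are assumptions made
   only for this example; they are not defined by this document.

```python
import struct

# Values taken from the XDR in this document.
RPCRDMA_VERSION = 2
RDMA2_CONNPROP = 5
RDMA2_PROPID_RBSIZ = 1

# Assumed initial receive buffer size ("typically 1KB" per the text).
INITIAL_RECV_BUFFER = 1024


def xdr_opaque(data: bytes) -> bytes:
    """Encode a variable-length XDR opaque: length, data, zero padding."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_propval(propid: int, value: bytes) -> bytes:
    """Encode one rpcrdma2_propval: rdma_which, then opaque rdma_data<>."""
    return struct.pack(">I", propid) + xdr_opaque(value)


def encode_connprop(xid: int, credit: int, props) -> bytes:
    """Encode rpcrdma2_hdr_prefix followed by rpcrdma2_connprop."""
    prefix = struct.pack(">IIII", xid, RPCRDMA_VERSION, credit, RDMA2_CONNPROP)
    flags = struct.pack(">I", 0)  # hypothetical: no flags set
    propset = struct.pack(">I", len(props))  # array length of rpcrdma2_propset
    for propid, value in props:
        propset += encode_propval(propid, value)
    return prefix + flags + propset


# Hypothetical example: advertise a 4096-byte receive buffer.
rbsiz = struct.pack(">I", 4096)
message = encode_connprop(xid=0x1234, credit=1,
                          props=[(RDMA2_PROPID_RBSIZ, rbsiz)])
fits = len(message) <= INITIAL_RECV_BUFFER
```

   With one small property the encoded message is far below the
   initial buffer size.  A sender that advertises many or large
   properties could apply the same length check before posting the
   Send.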
   Properties not included in rdma_props are to be treated by the peer
   endpoint as having the default value and are not allowed to change
   subsequently.  The peer should not request changes in such
   properties.

   Those receiving an RDMA2_CONNPROP may encounter properties that they
   do not support or are unaware of.  In such cases, these properties
   are simply ignored without any error response being generated.

6.  XDR Protocol Definition

   This section contains a description of the core features of the
   RPC-over-RDMA version 2 protocol expressed in the XDR language
   [RFC4506].

   Because of the need to provide for protocol extensibility without
   modifying an existing XDR definition, this description has some
   important structural differences from the corresponding XDR
   description for RPC-over-RDMA version 1, which appears in [RFC8166].

   This description is divided into three parts:

   o  A code component license which appears in Section 6.1.

   o  An XDR description of the structures that are generally available
      for use by transport header types including both those defined in
      this document and those that may be defined as extensions.  This
      includes definitions of the chunk-related structures derived from
      RPC-over-RDMA version 1, the transport property model introduced
      in this document, and a definition of the transport header
      prefixes that precede the various transport header types.  This
      appears in Section 6.3.

   o  An XDR description of the transport header types defined in this
      document, including those derived from RPC-over-RDMA version 1 and
      those introduced in RPC-over-RDMA version 2.  This appears in
      Section 6.4.

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form.
   To enable the combination of this description with the descriptions
   of subsequent extensions to RPC-over-RDMA version 2, the extracted
   description can be combined with similar descriptions published
   later, or those descriptions can be compiled separately.  Refer to
   Section 6.2 for details.

6.1.  Code Component License

   Code components extracted from this document must include the
   following license text.  When the extracted XDR code is combined with
   other complementary XDR code which itself has an identical license,
   only a single copy of the license text need be preserved.

   /// /*
   ///  * Copyright (c) 2010-2018 IETF Trust and the persons
   ///  * identified as authors of the code.  All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///

6.2.  Extraction and Use of XDR Definitions

   The reader can apply the following sed script to this document to
   produce a machine-readable XDR description of the RPC-over-RDMA
   version 2 protocol without any OPTIONAL extensions.

      sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

   That is, if this document is in a file called "spec.txt" then the
   reader can do the following to extract an XDR description file and
   store it in the file rpcrdma-v2.x.

      sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
           < spec.txt > rpcrdma-v2.x

   Although this file is a usable description of the base protocol, when
   extensions are to be supported, it may be desirable to divide it into
   multiple files.
   The following script can be used for that purpose:

      #!/usr/local/bin/perl
      # Split rpcrdma-v2.x at each "FILE ENDS:" marker.
      open(IN,"rpcrdma-v2.x");
      open(OUT,">temp.x");
      while(<IN>)
      {
        # Capture only the file name, not the trailing "; */".
        if (m/FILE ENDS: ([\w.-]+)/)
        {
           close(OUT);
           rename("temp.x", $1);
           open(OUT,">temp.x");
        }
        else
        {
           print OUT $_;
        }
      }
      close(IN);
      close(OUT);

   Running the above script will result in two files:

   o  The file common.x, containing the license plus the common XDR
      definitions which need to be made available to both the base
      operations and any subsequent extensions.

   o  The file baseops.x containing the XDR definitions for the base
      operations, defined in this document.

   Optional extensions to RPC-over-RDMA version 2, published as
   Standards Track documents, will have similar means of providing XDR
   that describes those extensions.  Once XDR for all desired extensions
   is also extracted, it can be appended to the XDR description file
   extracted from this document to produce a consolidated XDR
   description file reflecting all extensions selected for an
   RPC-over-RDMA implementation.

   Alternatively, the XDR descriptions can be compiled separately.  In
   this case, the combination of common.x and baseops.x serves to define
   the base transport, while the XDR description for each extension
   consists of the XDR from the document defining that extension,
   together with the file common.x obtained from this document.

6.3.
XDR Definition for RPC-over-RDMA Version 2 Core Structures

   /// /*******************************************************************
   ///  *    Transport Header Prefixes
   ///  ******************************************************************/
   ///
   /// struct rpcrdma_common {
   ///         uint32 rdma_xid;
   ///         uint32 rdma_vers;
   ///         uint32 rdma_credit;
   ///         uint32 rdma_htype;
   /// };
   ///
   /// const RPCRDMA2_F_RESPONSE = 0x00000001;
   ///
   /// struct rpcrdma2_hdr_prefix {
   ///         struct rpcrdma_common rdma_start;
   ///         uint32 rdma_flags;
   /// };
   ///
   /// /*******************************************************************
   ///  *    Chunks and Chunk Lists
   ///  ******************************************************************/
   ///
   /// struct rpcrdma2_segment {
   ///         uint32 rdma_handle;
   ///         uint32 rdma_length;
   ///         uint64 rdma_offset;
   /// };
   ///
   /// struct rpcrdma2_read_segment {
   ///         uint32 rdma_position;
   ///         struct rpcrdma2_segment rdma_target;
   /// };
   ///
   /// struct rpcrdma2_read_list {
   ///         struct rpcrdma2_read_segment rdma_entry;
   ///         struct rpcrdma2_read_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_write_chunk {
   ///         struct rpcrdma2_segment rdma_target<>;
   /// };
   ///
   /// struct rpcrdma2_write_list {
   ///         struct rpcrdma2_write_chunk rdma_entry;
   ///         struct rpcrdma2_write_list *rdma_next;
   /// };
   ///
   /// struct rpcrdma2_chunk_lists {
   ///         uint32 rdma_inv_handle;
   ///         struct rpcrdma2_read_list *rdma_reads;
   ///         struct rpcrdma2_write_list *rdma_writes;
   ///         struct rpcrdma2_write_chunk *rdma_reply;
   /// };
   ///
   /// /*******************************************************************
   ///  *    Transport Properties
   ///  ******************************************************************/
   ///
   /// /*
   ///  * Types for transport
   ///  * properties model
   ///  */
   /// typedef uint32 rpcrdma2_propid;
   ///
   /// struct rpcrdma2_propval {
   ///         rpcrdma2_propid rdma_which;
   ///         opaque rdma_data<>;
   /// };
   ///
   /// typedef rpcrdma2_propval rpcrdma2_propset<>;
   /// typedef uint32 rpcrdma2_propsubset<>;
   ///
   /// /*
   ///  * Transport propid values for basic properties
   ///  */
   /// const uint32 RDMA2_PROPID_RBSIZ = 1;
   /// const uint32 RDMA2_PROPID_BRS = 2;
   ///
   /// /*
   ///  * Types specific to particular properties
   ///  */
   /// typedef uint32 rpcrdma2_prop_rbsiz;
   /// typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;
   ///
   /// enum rpcrdma2_rvreqsup {
   ///         RDMA2_RVREQSUP_NONE = 0,
   ///         RDMA2_RVREQSUP_INLINE = 1,
   ///         RDMA2_RVREQSUP_GENL = 2
   /// };
   ///
   /// /* FILE ENDS: common.x; */

6.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types

   /// /*******************************************************************
   ///  *    Descriptions of RPC-over-RDMA Header Types
   ///  ******************************************************************/
   ///
   /// /*
   ///  * Header Type Codes.
   ///  */
   /// const rpcrdma2_proc RDMA2_MSG = 0;
   /// const rpcrdma2_proc RDMA2_NOMSG = 1;
   /// const rpcrdma2_proc RDMA2_ERROR = 4;
   /// const rpcrdma2_proc RDMA2_CONNPROP = 5;
   ///
   /// /*
   ///  * Header Types to Convey RPC Messages.
   ///  */
   /// struct rpcrdma2_msg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   ///
   ///         /* The rpc message starts here and continues
   ///          * through the end of the transmission. */
   ///         uint32 rdma_rpc_first_word;
   /// };
   ///
   /// struct rpcrdma2_nomsg {
   ///         struct rpcrdma2_chunk_lists rdma_chunks;
   /// };
   ///
   /// /*
   ///  * Header Type to Report Errors.
   ///  */
   /// const uint32 RDMA2_ERR_VERS = 1;
   /// const uint32 RDMA2_ERR_BAD_XDR = 2;
   /// const uint32 RDMA2_ERR_INVAL_HTYPE = 3;
   /// const uint32 RDMA2_ERR_READ_CHUNKS = 4;
   /// const uint32 RDMA2_ERR_WRITE_CHUNKS = 5;
   /// const uint32 RDMA2_ERR_SEGMENTS = 6;
   /// const uint32 RDMA2_ERR_WRITE_RESOURCE = 7;
   /// const uint32 RDMA2_ERR_REPLY_RESOURCE = 8;
   /// const uint32 RDMA2_ERR_SYSTEM = 9;
   ///
   /// struct rpcrdma2_err_vers {
   ///         uint32 rdma_vers_low;
   ///         uint32 rdma_vers_high;
   /// };
   ///
   /// struct rpcrdma2_err_write {
   ///         uint32 rdma_chunk_index;
   ///         uint32 rdma_length_needed;
   /// };
   ///
   /// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
   ///         case RDMA2_ERR_VERS:
   ///           rpcrdma2_err_vers rdma_vrange;
   ///         case RDMA2_ERR_READ_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_WRITE_CHUNKS:
   ///           uint32 rdma_max_chunks;
   ///         case RDMA2_ERR_SEGMENTS:
   ///           uint32 rdma_max_segments;
   ///         case RDMA2_ERR_WRITE_RESOURCE:
   ///           rpcrdma2_err_write rdma_writeres;
   ///         case RDMA2_ERR_REPLY_RESOURCE:
   ///           uint32 rdma_length_needed;
   ///         default:
   ///           void;
   /// };
   ///
   /// /*
   ///  * Header Type to Exchange Transport Properties.
   ///  */
   /// struct rpcrdma2_connprop {
   ///         rpcrdma2_propset rdma_props;
   /// };
   ///
   /// /* FILE ENDS: baseops.x; */

6.5.  Use of the XDR Description Files

   The two files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.
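   One way to picture how these files are produced is to combine the
   extraction and splitting steps of Section 6.2 in a single pass.  The
   Python sketch below is illustrative only and is not part of this
   specification: the function names and the sample input are
   hypothetical, and the marker pattern assumes that file names consist
   of word characters, dots, and hyphens.

```python
import re


def extract_xdr(spec_text: str) -> list:
    """Keep only lines marked with '///', as the sed script does."""
    lines = []
    for line in spec_text.splitlines():
        stripped = line.lstrip(" ")
        if stripped.startswith("/// "):
            lines.append(stripped[4:])   # drop the "/// " marker
        elif stripped == "///":
            lines.append("")             # bare marker becomes an empty line
    return lines


def split_files(xdr_lines: list) -> dict:
    """Group lines into files, closing one at each 'FILE ENDS:' marker."""
    files = {}
    current = []
    for line in xdr_lines:
        marker = re.search(r"FILE ENDS: ([\w.-]+)", line)
        if marker:
            files[marker.group(1)] = "\n".join(current) + "\n"
            current = []
        else:
            current.append(line)
    return files


# Tiny stand-in for the specification text (not the real XDR).
sample = """\
   Some prose that is not extracted.
   /// struct a { uint32 x; };
   ///
   /// /* FILE ENDS: common.x; */
   /// struct b { uint32 y; };
   /// /* FILE ENDS: baseops.x; */
"""
files = split_files(extract_xdr(sample))
```

   The resulting dictionary maps each file name to its contents, so a
   caller could write the entries to disk or concatenate them with
   extension XDR before compiling.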
   Although this XDR description can be useful in generating code to
   encode and decode the transport and payload streams, there are
   elements of the structure of RPC-over-RDMA version 2 which are not
   expressible within the XDR language as currently defined.  This
   requires implementations that use the output of the XDR processor to
   provide additional code to bridge the gaps.

   o  The values of transport properties are represented within XDR as
      opaque values.  However, the actual structures of each of the
      properties are represented by XDR typedefs, with the selection of
      the appropriate typedef described by text in this document.  The
      determination of the appropriate typedef is not specified by XDR,
      which does not possess the facilities necessary for that
      determination to be specified in an extensible way.

      This is similar to the way in which NFSv4 attributes are handled
      [RFC7530] [RFC5661].  As in that case, implementations that need
      to encode and decode these nominally opaque entities need to use
      the protocol description to determine the actual XDR
      representation that underlies the items described as opaque.

   o  The transport stream is not represented as a single XDR object.
      Instead, the header prefix is described by one XDR object while
      the rest of the header is described as another XDR object with the
      mapping between the header type in the header prefix and the XDR
      object representing the header type represented by tables
      contained in this document, with additional mappings being
      specifiable by a later extension document.

      This situation is similar to that in which RPC message headers
      contain program and procedure numbers, so that the XDR for those
      requests and replies can be used to encode and decode the
      associated messages without requiring that all be present in a
      single XDR specification.
      As in that case, implementations need to use the header
      specification to select the appropriate XDR-generated code to be
      used in message processing.

   o  The relationship between the transport stream and the payload
      stream is not specified in the XDR itself, although comments
      within the XDR text make clear where transported messages,
      described by their own XDR, need to appear.  Such data by its
      nature is opaque to the transport, although its form differs from
      XDR opaque arrays.

      Potential extensions allowing continuation of RPC messages across
      transport message boundaries will require that message assembly
      facilities, not specifiable within XDR, also be part of transport
      implementations.

   To summarize, the role of XDR in this specification is more limited
   than for protocols which are themselves XDR programs, where the
   totality of the protocol is expressible within the XDR paradigm
   established for that purpose.  This more limited role reflects the
   fact that XDR lacks facilities to represent the embedding of
   transported material within the transport framework.  In addition,
   the need to cleanly accommodate extensions has meant that those using
   rpcgen in their applications need to take a more active role in
   providing the facilities that cannot be expressed within XDR.

7.  Protocol Version Negotiation

   When an RPC-over-RDMA version 2 client establishes a connection to a
   server, its first order of business is to determine the server's
   highest supported protocol version.

   As with RPC-over-RDMA version 1, upon connection establishment a
   client MUST NOT send more than a single RPC-over-RDMA message at a
   time until it receives a valid non-error RPC-over-RDMA message from
   the server that grants client credits.

   The second word of each transport header is used to convey the
   transport protocol version.
   In the interest of simplicity, we refer to that word as rdma_vers
   even though in the RPC-over-RDMA version 2 XDR definition it is
   described as rdma_start.rdma_vers.

   First, the client sends a single valid RPC-over-RDMA message with the
   value two (2) in the rdma_vers field.  Because the server might
   support only RPC-over-RDMA version 1, this initial message can be no
   larger than the version 1 default inline threshold of 1024 bytes.

7.1.  Server Does Support RPC-over-RDMA Version 2

   If the server does support RPC-over-RDMA version 2, it sends
   RPC-over-RDMA messages back to the client with the value two (2) in
   the rdma_vers field.  Both peers may use the default inline threshold
   value for RPC-over-RDMA version 2 connections (4096 bytes).

7.2.  Server Does Not Support RPC-over-RDMA Version 2

   If the server does not support RPC-over-RDMA version 2, it MUST send
   an RPC-over-RDMA message to the client with the same XID, with
   RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error
   code RDMA2_ERR_VERS.  This message also reports a range of protocol
   versions that the server supports.  To continue operation, the client
   selects a protocol version in the range of server-supported versions
   for subsequent messages on this connection.

   If the connection is lost immediately after an RDMA2_ERROR /
   RDMA2_ERR_VERS message is received, a client can avoid a possible
   version negotiation loop when re-establishing another connection by
   assuming that particular server does not support RPC-over-RDMA
   version 2.  A client can assume the same situation (no server support
   for RPC-over-RDMA version 2) if the initial negotiation message is
   lost or dropped.  Once the negotiation exchange is complete, both
   peers may use the default inline threshold value for the transport
   protocol version that has been selected.

7.3.
Client Does Not Support RPC-over-RDMA Version 2

   If the server supports the RPC-over-RDMA protocol version used in
   Call messages from a client, it MUST send Replies with the same
   RPC-over-RDMA protocol version that the client uses to send its
   Calls.  The client MUST NOT change the version for the duration of
   the connection.

8.  Differences from the RPC-over-RDMA Version 1 Protocol

   This section describes the substantive changes made in RPC-over-RDMA
   version 2, as opposed to the structural changes to enable
   extensibility, which are discussed in Section 10.1.

8.1.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for exchanging the
   transport's operational properties.  This mechanism allows connection
   endpoints to communicate the properties of their implementation at
   connection setup.  The mechanism could be expanded to enable an
   endpoint to request changes in properties of the other endpoint and
   to notify peer endpoints of changes to properties that occur during
   operation.  Transport properties are described in Section 4.

8.2.  Credit Management Changes

   RPC-over-RDMA transports employ credit-based flow control to ensure
   that a requester does not emit more RDMA Sends than the responder is
   prepared to receive.  Section 3.3.1 of [RFC8166] explains the purpose
   and operation of RPC-over-RDMA version 1 credit management in detail.

   In the RPC-over-RDMA version 1 design, each RDMA Send from a
   requester contains an RPC Call with a credit request, and each RDMA
   Send from a responder contains an RPC Reply with a credit grant.  The
   credit grant implies that enough Receives have been posted on the
   responder to handle the credit grant minus the number of pending RPC
   transactions (the number of remaining Receive buffers might be zero).
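   The version 1 accounting described above can be pictured with a
   small requester-side model.  This Python sketch is a deliberate
   simplification, not an implementation of either protocol version:
   the class and method names are hypothetical, and the starting window
   of one message reflects the rule in Section 7 that only a single
   message may be outstanding until the first credit grant arrives.

```python
class V1CreditWindow:
    """Toy requester-side view of version 1 credit accounting."""

    def __init__(self):
        # Assumption: one message may be outstanding before the first grant.
        self.granted = 1
        self.in_flight = set()

    def can_send(self) -> bool:
        """True while pending transactions are below the credit grant."""
        return len(self.in_flight) < self.granted

    def send_call(self, xid: int) -> None:
        """Record an RPC Call that consumes one Receive on the responder."""
        if not self.can_send():
            raise RuntimeError("no credit available")
        self.in_flight.add(xid)

    def receive_reply(self, xid: int, credit_grant: int) -> None:
        """The Reply retires its Call and carries the current grant."""
        self.in_flight.discard(xid)
        self.granted = credit_grant


window = V1CreditWindow()
window.send_call(1)
blocked_before_reply = not window.can_send()
window.receive_reply(1, credit_grant=4)
window.send_call(2)
window.send_call(3)
remaining = window.granted - len(window.in_flight)
```

   In this model the only event that retires a pending transaction is
   the arrival of its Reply, so a Call that never receives a Reply
   keeps its slot occupied for the life of the connection.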
   In other words, each RPC Reply acts as an implicit ACK for a previous
   RPC Call from the requester, indicating that the responder has posted
   a Receive to replace the Receive consumed by the requester's RDMA
   Send.  Without an RPC Reply message, the requester has no way to know
   that the responder is properly prepared for subsequent RPC Calls.

   Aside from being a bit of a layering violation, there are basic (but
   rare) cases where this arrangement is inadequate:

   o  When a requester retransmits an RPC Call on the same connection as
      an earlier RPC Call for the same transaction.

   o  When a requester transmits an RPC operation that requires no
      reply.

   o  When more than one RPC-over-RDMA message is needed to complete the
      transaction (e.g., RDMA_DONE).

   Typically, the connection must be replaced in these cases.  This
   resets the credit accounting mechanism but has an undesirable impact
   on other ongoing RPC transactions on that connection.

   Because credit management accompanies each RPC message, there is a
   strict one-to-one ratio between RDMA Send and RPC message.  There are
   interesting use cases that might be enabled if this relationship were
   more flexible:

   o  RPC-over-RDMA operations which do not carry an RPC message; e.g.,
      control plane operations.

   o  A single RDMA Send that conveys more than one RPC message for the
      purpose of interrupt mitigation.

   o  An RPC message that is conveyed via several sequential RDMA Sends
      to reduce the use of explicit RDMA operations for moderate-sized
      RPC messages.

   o  An RPC transaction that needs multiple exchanges or an odd number
      of RPC-over-RDMA operations to complete.

   Bi-directional RPC operation also introduces an ambiguity.
   If the RPC-over-RDMA message does not carry an RPC message, then it
   is not possible to determine whether the sender is a requester or a
   responder, and thus whether the rdma_credit field contains a credit
   request or a credit grant.

   A more sophisticated credit accounting mechanism is provided in
   RPC-over-RDMA version 2 in an attempt to address some of these
   shortcomings.  This new mechanism is detailed in Section TBD.

8.3.  Inline Threshold Changes

   The term "inline threshold" is defined in Section 3.3.2 of [RFC8166].
   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed on an RDMA connection using only RDMA Send and
   Receive.  Each connection has two inline threshold values: one for
   messages flowing from client-to-server (referred to as the
   "client-to-server inline threshold") and one for messages flowing
   from server-to-client (referred to as the "server-to-client inline
   threshold").  Note that [RFC8166] uses somewhat different
   terminology.  This is because it was written with only
   forward-direction RPC transactions in mind.

   A connection's inline thresholds determine when RDMA Read or Write
   operations are required because the RPC message to be sent cannot be
   conveyed via a single RDMA Send and Receive pair.  When such an RPC
   message does not contain DDP-eligible data items, the sender
   prepares a Long Call or Long Reply to convey the whole RPC message
   using RDMA Read or Write operations.

   RDMA Read and Write operations require that each data payload resides
   in a region of memory that is registered with the RNIC.  When an RPC
   is complete, that region is invalidated, fencing it from the
   responder.  Memory registration and invalidation typically have a
   latency cost that is insignificant compared to data handling costs.
   When a data payload is small, however, the cost of registering and
   invalidating the memory where the payload resides becomes a
   relatively significant part of total RPC latency.  Therefore, the
   most efficient operation of RPC-over-RDMA occurs when explicit RDMA
   Read and Write operations are used for large payloads, and are
   avoided for small payloads.

   When RPC-over-RDMA version 1 was conceived, the typical size of RPC
   messages that did not involve a significant data payload was under
   500 bytes.  A 1024-byte inline threshold adequately minimized the
   frequency of inefficient Long Calls and Replies.

   With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND
   operations resulted in RPC messages that are on average larger and
   more complex than previous versions of NFS.  With 1024-byte inline
   thresholds, RDMA Read or Write operations are needed for frequent
   operations that do not bear a data payload, such as GETATTR and
   LOOKUP, reducing the efficiency of the transport.

   To reduce the need to use Long Calls and Replies, RPC-over-RDMA
   version 2 increases the default size of inline thresholds.  This also
   increases the maximum size of reverse-direction RPC messages.

8.4.  Support for Remote Invalidation

   An STag that is registered using the FRWR mechanism in a privileged
   execution context or is registered via a Memory Window in an
   unprivileged context may be invalidated remotely [RFC5040].  These
   mechanisms are available when a requester's RNIC supports
   MEM_MGT_EXTENSIONS.

   For the purposes of this discussion, there are two classes of STags.
   Dynamically-registered STags are used in a single RPC, then
   invalidated.  Persistently-registered STags live longer than one RPC.
   They may persist for the life of an RPC-over-RDMA connection, or
   longer.

   An RPC-over-RDMA requester may provide more than one STag in one
   transport header.
   It may provide a combination of dynamically- and persistently-registered STags in one RPC message, or any combination of these in a series of RPCs on the same connection. Only dynamically-registered STags using Memory Windows or FRWR (i.e., registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.

   There is no transport-level mechanism by which a responder can determine how a requester-provided STag was registered, nor whether it is eligible to be invalidated remotely. A requester that mixes persistently- and dynamically-registered STags in one RPC, or mixes them across RPCs on the same connection, must therefore indicate which handles may be invalidated via a mechanism provided in the Upper Layer Protocol. RPC-over-RDMA version 2 provides such a mechanism.

   The RDMA Send With Invalidate operation is used to invalidate an STag on a remote system. It is available only when a responder's RNIC supports MEM_MGT_EXTENSIONS, and must be utilized only when a requester's RNIC supports MEM_MGT_EXTENSIONS (i.e., can receive and recognize an IETH).

8.4.1. Reverse Direction Remote Invalidation

   Existing RPC-over-RDMA transport protocol specifications [RFC8166] [RFC8167] do not forbid direct data placement in the reverse direction, even though there is currently no Upper Layer Protocol that makes data items in reverse direction operations eligible for direct data placement.

   When chunks are present in a reverse direction RPC request, Remote Invalidation allows the responder to trigger invalidation of a requester's STags as part of sending a reply, the same way as is done in the forward direction.

   However, in the reverse direction, the server acts as the requester, and the client is the responder.
   The server's RNIC, therefore, must support receiving an IETH, and the server must have registered the STags with an appropriate registration mechanism.

8.5. Error Reporting Changes

   RPC-over-RDMA version 2 expands the repertoire of errors that may be reported by connection endpoints. This change, which is structured to enable extensibility, allows a peer to report overruns of specific resources and to avoid requester retries when an error is permanent.

9. Extending the Version 2 Protocol

   RPC-over-RDMA version 2 is designed to be extensible in a way that enables the addition of OPTIONAL features that may subsequently be converted to REQUIRED status in a future protocol version. The protocol may be extended by Standards Track documents in a way analogous to that provided for Network File System Version 4 as described in [RFC8178].

   This form of extensibility enables limited extensions to the base RPC-over-RDMA version 2 protocol presented in this document so that new optional capabilities can be introduced without a protocol version change, while maintaining robust interoperability with existing RPC-over-RDMA version 2 implementations. The design allows extensions to be defined, including the definition of new protocol elements, without requiring modification or recompilation of the existing XDR.

   A Standards Track document introduces each set of such protocol elements. Together these elements are considered an OPTIONAL feature. Each implementation is either aware of all the protocol elements introduced by that feature or is aware of none of them.

   Documents describing extensions to RPC-over-RDMA version 2 should contain:

   o  An explanation of the purpose and use of each new protocol element added.

   o  An XDR description including all of the new protocol elements, and a script to extract it.

   o  A description of interactions with existing extensions.

      This includes possible requirements of other OPTIONAL features to be present for new protocol elements to work, or that a particular level of support for an OPTIONAL facility is required for the new extension to work.

   Implementers combine the XDR descriptions of the new features they intend to use with the XDR description of the base protocol in this document. This may be necessary to create a valid XDR input file because extensions are free to use XDR types defined in the base protocol, and later extensions may use types defined by earlier extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol combined with that for any selected extensions should provide an adequate human-readable description of the extended protocol.

   The base protocol specified in this document may be extended within RPC-over-RDMA version 2 in two ways:

   o  New OPTIONAL transport header types may be introduced by later Standards Track documents. Such transport header types will be documented as described in Section 9.1.

   o  New OPTIONAL transport properties may be defined in later Standards Track documents. Such transport properties will be documented as described in Section 9.2.

   The following sorts of ancillary protocol elements may be added to the protocol to support the addition of new transport properties and header types.

   o  New error codes may be created as described in Section 9.3.

   o  New flags to use within the rdma_flags field may be created as described in Section 9.4.

   New capabilities can be proposed and developed independently of each other, and implementers can choose among them. This makes it straightforward to create and document experimental features and then bring them through the standards process.

9.1. Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are to be defined in a manner similar to the way existing ones are described in Sections 5.3.1 through 5.3.4. Specifically, what is needed is:

   o  A description of the function and use of the new header type.

   o  A complete XDR description of the new header type including a description of the use of all fields within the header.

   o  A description of how errors are reported, including the definition of a mechanism for reporting errors when the error is outside the choices already available in the base protocol or in other existing extensions.

   o  An indication of whether a Payload stream must be present, and a description of its contents and how such payload streams are used to construct RPC messages for processing.

   In addition, further documentation is needed because of the OPTIONAL status of new transport header types.

   o  Information about constraints on support for the new header types should be provided. For example, if support for one header type is implied or foreclosed by another one, this needs to be documented.

   o  A preferred method by which a sender should determine whether the peer supports a particular header type needs to be provided. While it is always possible for a sender to send a test invocation of a particular header type to see if support is available, when more efficient means are available (e.g., the value of a transport property), this should be noted.

9.2. Adding New Transport Properties to the Protocol

   The set of transport properties is designed to be extensible. As a result, once new properties are defined in Standards Track documents, the operations defined in this document may reference these new transport properties, as well as the ones described in this document.

   A Standards Track document defining a new transport property should include the following information, paralleling that provided in this document for the transport properties defined herein.

   o  The rpcrdma2_propid value used to identify this property.

   o  The XDR typedef specifying the form in which the property value is communicated.

   o  A description of the transport property that is communicated by the sender of RDMA2_CONNPROP.

   o  An explanation of how this knowledge could be used by the peer receiving this information.

   Transport property structures are defined so that unique values are easy to assign. There is no requirement that a contiguous set of values be used, and implementations should not rely on all such values being small integers. A unique value should be selected when the defining document is first published as an Internet-Draft. When the document becomes a Standards Track document, the working group should ensure that:

   o  rpcrdma2_propid values specified in the document do not conflict with those currently assigned or in use by other pending working group documents defining transport properties.

   o  rpcrdma2_propid values specified in the document do not conflict with the range reserved for experimental use, as defined in Section 8.2.

   Documents defining new properties fall into a number of categories.

   o  Those defining new properties and explaining (only) how they affect use of existing message types.

   o  Those defining new OPTIONAL message types and new properties applicable to the operation of those new message types.

   o  Those defining new OPTIONAL message types and new properties applicable both to new and existing message types.
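
   The assignment checks described in this section can be illustrated with a short sketch. It is not part of the protocol. The property names, the identifier values, the assumption that an rpcrdma2_propid fits in an unsigned 32-bit word, and the bounds of the experimental range are all hypothetical and chosen only for illustration. The sketch stores assigned identifiers in a mapping rather than an array, because identifier values need not be contiguous or small, and it checks a proposed value against both the existing assignments and a reserved range.

```python
# Illustrative sketch only. None of these names, identifier values, or
# range bounds are taken from the specification.

# Hypothetical range reserved for experimental use.
EXPERIMENTAL_RANGE = range(0xFFFF0000, 0x100000000)

# Hypothetical table of assigned property identifiers. A mapping is used
# because identifier values need not be contiguous or small integers.
assigned_propids = {
    1: "example_property_a",
    2: "example_property_b",
    0x00A50001: "example_property_c",
}


def check_new_propid(propid, assigned=assigned_propids):
    """Return a list of conflicts for a proposed property identifier."""
    problems = []
    # Assumption: identifiers are carried as unsigned 32-bit values.
    if not 0 <= propid <= 0xFFFFFFFF:
        problems.append("propid does not fit in an unsigned 32-bit word")
    if propid in assigned:
        problems.append(f"propid already assigned to {assigned[propid]}")
    if propid in EXPERIMENTAL_RANGE:
        problems.append("propid falls in the experimental range")
    return problems


def lookup_property(propid, assigned=assigned_propids):
    """Look up a property by identifier; unknown identifiers yield None."""
    return assigned.get(propid)
```

   In this sketch, a receiver that calls lookup_property() on an identifier it does not recognize simply gets None back, so a sparse or widely scattered identifier space costs nothing at lookup time.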

   When additional transport properties are proposed, the review of the associated Standards Track document should deal with possible security issues raised by those new transport properties.

9.3. Adding New Error Codes to the Protocol

   New error codes to be returned when using new header types may be introduced in the same Standards Track document that defines the new header type. [ cel: what about adding a new error code that is returned for an existing header type? ]

   For error codes that do not require that additional error information be returned with them, the existing RDMA_ERR2 header can be used to report the new error. The new error code is set as the value of rdma_err, with the result that the default switch arm of the rpcrdma2_error (i.e., void) is selected.

   For error codes that do require the return of additional error-related information together with the error, a new header type should be defined for the purpose of returning the error together with the needed additional information. It should be documented just like any other new header type.

   When a new header type is sent, the sender needs to be prepared to accept header types necessary to report associated errors.

9.4. Adding New Header Flags to the Protocol

   There are currently thirty-one flags available for later assignment. One possible use for such flags would be in a later protocol version, should that version retain the same general header structure as version 2.

   In addition, it is possible to assign unused flags within extensions made to version 2, as long as the following practices are adhered to:

   o  Flags should not be added to the flag word in the prefix structure if those flags only apply to a single header type. New flags should only be defined for conditions applying to multiple header types.

   o  The document defining the new flag should indicate for which header types the flag value is meaningful and for which header types it is an error to set the flag or to leave it unset.

   o  The sender needs to be provided with a means to determine whether the receiver is prepared to receive transport headers with the new flag set. This is most likely to take the form of a transport property together with the definition of suitable defaults to use when that property is not supported. Another possibility is to REQUIRE that receivers supporting a particular header type also support a set of additional flags.

10. Relationship to other RPC-over-RDMA Versions

10.1. Relationship to RPC-over-RDMA Version 1

   In addition to the substantive protocol changes discussed in Section 8, there are a number of structural XDR changes whose goal is to enable within-version protocol extensibility.

   The RPC-over-RDMA version 1 transport header is defined as a single XDR object, with an RPC message proper potentially following it. In RPC-over-RDMA version 2, as described in Section 5.1, there are separate XDR definitions of the transport header prefix (see Section 3.2), which specifies the transport header type to be used, and the specific transport header (defined within one of the subsections of Section 5). This is similar to the way that an RPC message consists of an RPC header (defined in [RFC5531]) and an RPC request or reply, defined by the Upper Layer Protocol being conveyed.

   As a new version of the RPC-over-RDMA transport protocol, RPC-over-RDMA version 2 exists within the versioning rules defined in [RFC8166].
   In particular, it maintains the first four words of the protocol header as sent and received, as specified in Section 4.2 of [RFC8166], even though, as explained in Section 3.1 of this document, the XDR definition of those words is structured differently.

   Although each of the first four words retains its semantic function, there are important differences of field interpretation, besides the fact that the words have different names and different roles within the XDR construct of which they are parts.

   o  The first word of the header, previously the rdma_xid field, retains the format and function that it had in RPC-over-RDMA version 1. Within RPC-over-RDMA version 2, this word is the rdma_xid field of the structure rdma_start. However, to accommodate the use of request-response pairing of non-RPC messages and the potential use of message continuation, it cannot be assumed that it will always have the same value it would have had in RPC-over-RDMA version 1. As a result, the contents of this field should not be used without consideration of the associated protocol version identification.

   o  The second word of the header, previously the rdma_vers field, retains the format and function that it had in RPC-over-RDMA version 1. Within RPC-over-RDMA version 2, this word is the rdma_vers field of the structure rdma_start. To clearly distinguish version 1 and version 2 messages, senders MUST fill in the correct version (fixed after version negotiation) and receivers MUST check that the content of the rdma_vers field is correct before referencing any other header field.

   o  The third word of the header, previously the rdma_credit field, retains the format and general purpose that it had in RPC-over-RDMA version 1. Within RPC-over-RDMA version 2, this word is the rdma_credit field of the structure rdma_start.
      The RPC-over-RDMA version 2 protocol provides additional mechanisms that determine whether the value contained in this field is a credit request or grant. Also, the way in which credits are accounted for may be different in RPC-over-RDMA version 2.

   o  The fourth word of the header, previously the union discriminator field rdma_proc, retains its format and general function even though the set of valid values has changed. The value of this field is now considered an unsigned 32-bit integer rather than an enum. Within RPC-over-RDMA version 2, this word is the rdma_htype field of the structure rdma_start.

   Beyond conforming to the restrictions specified in [RFC8166], RPC-over-RDMA version 2 tightly limits the scope of the changes made in order to ensure interoperability. It makes no major structural changes to the protocol, and all existing transport header types used in version 1 (as defined in [RFC8166]) are retained in version 2. Chunks are expressed using the same on-the-wire format and are used in the same way in both versions.

10.2. Extensibility Beyond RPC-over-RDMA Version 2

   Subsequent RPC-over-RDMA versions are free to change the protocol in any way they choose as long as they maintain the first four header words as currently specified by [RFC8166].

   Such changes might involve deletion or major re-organization of existing transport headers. However, the need for interoperability between adjacent versions will often limit the scope of changes that can be made in a single version.

   In some cases it may prove desirable to transition to a new version by using the extension features described for use with RPC-over-RDMA version 2, by continuing the same basic extension model but allowing header types and properties that were OPTIONAL in one version to become REQUIRED in the subsequent version.

11. Security Considerations

   The security considerations for RPC-over-RDMA version 2 are the same as those for RPC-over-RDMA version 1.

11.1. Security Considerations (Transport Properties)

   Like other fields that appear in each RPC-over-RDMA header, property information is sent in the clear on the fabric with no integrity protection, making it vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the Receive buffer size or the Requester Remote Invalidation boolean, it could reduce connection performance or trigger loss of connection. Repeated connection loss can impact performance or even prevent a new connection from being established. Recourse is to deploy on a private network or use link-layer encryption.

12. IANA Considerations

   This document does not require actions by IANA.

13. References

13.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, May 2009.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

13.2. Informative References

   [IBARCH]   InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1", Release 1.3, March 2015.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, October 2007.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, DOI 10.17487/RFC5662, January 2010.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, March 2015.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on RPC-over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167, June 2017.

   [RFC8178]  Noveck, D., "Rules for NFSv4 Extensions and Minor Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017.

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC 5666). The authors also wish to thank Bill Baker, Greg Marsden, and Matt Benjamin for their support of this work.

   The XDR extraction conventions were first described by the authors of the NFS version 4.1 XDR specification [RFC5662]. Herbert van den Bergh suggested the replacement sed script used in this document.

   Special thanks go to Transport Area Director Spencer Dawkins, NFSV4 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   United States of America

   Phone: +1 248 816 6463
   Email: chuck.lever@oracle.com

   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA 02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com