<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" >
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>

<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<rfc
 docName="draft-cel-nfsv4-rpcrdma-version-two-09"
 category="std"
 ipr="pre5378Trust200902">

<front>

<title abbrev="RDMA Transport for RPC V2">
RPC-over-RDMA Version 2 Protocol
</title>

<author initials="C.L." surname="Lever" fullname="Charles Lever" role="editor">
<organization abbrev="Oracle">Oracle Corporation</organization>
<address>
<postal>
<street>1015 Granger Avenue</street>
<city>Ann Arbor</city>
<region>MI</region>
<code>48104</code>
<country>United States of America</country>
</postal>
<phone>+1 248 816 6463</phone>
<email>chuck.lever@oracle.com</email>
</address>
</author>

<author initials="D.N." surname="Noveck" fullname="David Noveck">
<organization>NetApp</organization>
<address>
<postal>
<street>1601 Trapelo Road</street>
<city>Waltham</city>
<region>MA</region>
<code>02451</code>
<country>United States of America</country>
</postal>
<phone>+1 781 572 8038</phone>
<email>davenoveck@gmail.com</email>
</address>
</author>

<date />

<area>Transport</area>
<workgroup>Network File System Version 4</workgroup>
<keyword>NFS-Over-RDMA</keyword>

<abstract>
<t>
This document specifies a new version of the transport protocol
that conveys Remote Procedure Call (RPC) messages
on physical transports capable of Remote Direct Memory Access (RDMA).
The new version of this protocol is extensible.
</t>
</abstract>

</front>

<middle>

<section
 title="Introduction"
 anchor="section:72f6ba4a-aafb-4e9d-8b87-800ebccc5879">
<t>
Remote Direct Memory Access (RDMA)
<xref target="RFC5040"/>
<xref target="RFC5041"/>
<xref target="IBARCH"/>
is a technique for moving data efficiently between end nodes.
By directing data into destination buffers as it is sent
on a network and placing it using direct memory access
implemented by hardware,
the complementary benefits of
faster transfers
and
reduced host overhead
are obtained.
</t>
<t>
RPC-over-RDMA version 1 enables ONC RPC
<xref target="RFC5531"/>
messages to be conveyed on RDMA transports.
That protocol is specified in
<xref target="RFC8166"/>.
RPC-over-RDMA version 1 is deployed and in use,
although there are known shortcomings to this protocol:
<list style="symbols">
<t>
The protocol's default size of Receive buffers forces
the use of RDMA Read and Write transfers for small payloads,
and limits the size of reverse direction messages.
</t>
<t>
It is difficult to make optimizations or protocol fixes
that require changes to on-the-wire behavior.
</t>
</list>
</t>
<t>
To address these issues in a way that is
compatible with existing RPC-over-RDMA version 1
deployments, a new version of the RPC-over-RDMA transport protocol
is presented in this document.
</t>
<t>
This new version of RPC-over-RDMA is extensible,
enabling OPTIONAL extensions to be added
without impacting existing implementations.
To enable protocol extension,
the XDR definition for RPC-over-RDMA version 2 is
organized differently than the definition of version 1.
These changes, which are discussed in
<xref target="section:d945b9f0-0666-4db7-9126-be57cf7b5f4f"/>,
do not affect the on-the-wire format.
</t>
<t>
In addition, RPC-over-RDMA version 2 contains
a set of incremental changes that relieve certain
performance constraints and enable recovery from
certain abnormal corner cases.
These changes include:
<list style="symbols">
<t>
The exchange of transport properties as described in
<xref target="section:630314a8-1cf5-40f7-a5ad-5bc12c719233"/>.
</t>
<t>
A more flexible credit accounting mechanism, detailed in
Section TBD.
</t>
<t>
Larger default inline thresholds as described in
<xref target="section:F7FB5108-58EA-4718-84BE-5119A302F5F5"/>.
</t>
<t>
Support for remote invalidation as explained in
<xref target="section:57C034D6-7129-4F7B-B8DF-31E8BC691964"/>.
</t>
<t>
Support for reverse direction operation, as described in
<xref target="RFC8167"/>,
is now REQUIRED.
Details are in
<xref target="section:2d1735f0-c465-43c6-9c18-3da6b7979862"/>.
</t>
<t>
An expansion of error reporting capabilities, described in
<xref target="section:b1d23e5c-31df-483f-adb7-25430b5de38d"/>.
A summary of the reasons for this expansion appears in
<xref target="section:E554DF42-6E82-4E28-96A0-8F9872EB476C"/>.
This expansion supports the addition of new error codes as described in
<xref target="section:C2F5E937-D612-4E3A-A380-E0B15261F6A0"/>.
</t>
</list>
</t>
<t>
Because of the way in which RPC-over-RDMA version 2
builds upon the facilities present in RPC-over-RDMA version 1,
a knowledge of the basic structure of RPC-over-RDMA version 1,
as described in
<xref target="RFC8166"/>,
is assumed in this document.
</t>
<t>
As in that document, the terms
"RPC Payload Stream"
and
"Transport Header Stream"
(defined in Section 3.2 of that document)
are used to distinguish between an RPC message as defined by
<xref
 target="RFC5531"/>
and the header whose job it is to describe the RPC message
and its associated memory resources.
In that regard, the reader is assumed to understand
how RDMA is used to transfer chunks between client and server,
the use of Position-Zero Read chunks and Reply chunks
to convey Long RPC messages,
and
the role of DDP-eligibility in constraining how data payloads are to be conveyed.
</t>
</section>

<section
 title="Requirements Language"
 anchor="section:ef1a2819-4d22-40af-8d38-fde10849c872">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
and "OPTIONAL" in this document are to be interpreted
as described in BCP 14
<xref target="RFC2119"/>
<xref target="RFC8174"/>
when, and only when, they appear in all capitals, as shown here.
</t>
</section>

<section
 title="RPC-over-RDMA Version 2 Headers and Chunks"
 anchor="section:2e577c75-4e43-4e13-8b17-75afa849f0b6">
<t>
Most RPC-over-RDMA version 2 data structures are derived
from corresponding structures in RPC-over-RDMA version 1.
As is typical for new versions of an existing protocol,
the XDR data structures have new names and there are a
few small changes in content.
In some cases,
there have been structural re-organizations to enable
protocol extensibility.
</t>

<section
 title="rpcrdma_common: Common Transport Header Prefix"
 anchor="section:e21d4f74-b536-47f2-9d07-c03a27a20de4">
<t>
The rpcrdma_common prefix describes the first part
of each RPC-over-RDMA transport header for version 2
and subsequent versions.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

struct rpcrdma_common {
             uint32         rdma_xid;
             uint32         rdma_vers;
             uint32         rdma_credit;
             uint32         rdma_htype;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
RPC-over-RDMA version 2's use of these first four words
matches that of version 1 as required by
<xref target="RFC8166"/>.
However, there are important structural differences
in the way that these words are described
by the respective XDR descriptions:
<list style="symbols">
<t>
The header type is represented as a uint32 rather than as an enum
that would need to be modified to reflect additions to the set of
header types made by later extensions.
</t>
<t>
The header type field is part of an XDR structure devoted to
representing the transport header prefix,
rather than being part of a discriminated union
that includes the body of each transport header type.
</t>
<t>
There is now a prefix structure
(see
<xref target="section:2d1735f0-c465-43c6-9c18-3da6b7979862"/>)
of which the rpcrdma_common structure is the initial segment.
This is a newly defined XDR object within the protocol description,
in contrast with RPC-over-RDMA version 1, which limits the common
portion of all header types to the four words in rpcrdma_common.
</t>
</list>
These changes are part of a larger structural change
in the XDR description of RPC-over-RDMA version 2
that enables a cleaner treatment of protocol extension.
The XDR appearing in
<xref target="section:bf53e759-d97f-487d-a5e2-9b8153db1803"/>
reflects these changes, which are discussed in further detail in
<xref target="section:d945b9f0-0666-4db7-9126-be57cf7b5f4f"/>.
</t>
</section>

<section
 title="rpcrdma2_hdr_prefix: Version 2 Transport Header Prefix"
 anchor="section:2d1735f0-c465-43c6-9c18-3da6b7979862">
<t>
The following prefix structure appears at the start of any
RPC-over-RDMA version 2 transport header.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const RPCRDMA2_F_RESPONSE         = 0x00000001;

struct rpcrdma2_hdr_prefix {
        struct rpcrdma_common       rdma_start;
        uint32                      rdma_flags;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
The rdma_flags field is new to RPC-over-RDMA version 2.
Currently, the only flag defined within this word is the
RPCRDMA2_F_RESPONSE flag.
The other bits are reserved for future use as described in
<xref target="section:15C76685-A24D-4799-BC8A-311D03AC3510"/>.
The sender MUST set these to zero.
</t>
<t>
The RPCRDMA2_F_RESPONSE flag qualifies the values contained in the
transport header's rdma_start.rdma_xid and rdma_start.rdma_credit fields.
The RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid
performing an XID lookup on incoming reverse direction Call messages,
and apply the value of the rdma_start.rdma_credit field correctly,
based on the direction of the message being conveyed.
</t>
<t>
In general, when a message carries an XID that was generated
by the message's receiver
(that is, the receiver is acting as a requester),
the message's sender sets the RPCRDMA2_F_RESPONSE flag.
Otherwise that flag is clear.
For example:
<list style="symbols">
<t>
When the rdma_start.rdma_htype field has the value RDMA2_MSG or
RDMA2_NOMSG, the value of the RPCRDMA2_F_RESPONSE flag MUST be the
same as the value of the associated RPC message's msg_type field.
</t>
<t>
When the header type is anything else
and
a whole or partial RPC message payload is present,
the value of the RPCRDMA2_F_RESPONSE flag MUST be the same
as the value of the associated RPC message's msg_type field.
</t>
<t>
When no RPC message payload is present,
a Requester MUST set the value of RPCRDMA2_F_RESPONSE
to reflect how the receiver is to interpret the
rdma_start.rdma_credit and rdma_start.rdma_xid fields.
</t>
<t>
When the rdma_start.rdma_htype field has the value RDMA2_ERROR,
the RPCRDMA2_F_RESPONSE flag MUST be set.
</t>
</list>
</t>
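<t>
The rules above can be illustrated with a short, non-normative sketch.
Python is used purely for illustration.
The function name response_flag and its parameters are hypothetical
and are not part of this protocol;
the CALL and REPLY values are the msg_type values defined in
<xref target="RFC5531"/>.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
RPCRDMA2_F_RESPONSE = 0x00000001

# Header type codes defined in this document.
RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR = 0, 1, 4

# msg_type values defined in RFC 5531.
CALL, REPLY = 0, 1


def response_flag(htype, msg_type=None, xid_from_receiver=False):
    """Return the rdma_flags value a sender would use (hypothetical).

    msg_type is the msg_type of the associated RPC message, or None
    when no whole or partial RPC message is associated with this
    transport message.  xid_from_receiver is consulted only in that
    case: it says whether the XID was generated by the receiver.
    """
    if htype == RDMA2_ERROR:
        return RPCRDMA2_F_RESPONSE
    if msg_type is not None:
        # The flag mirrors msg_type: clear for CALL, set for REPLY.
        return RPCRDMA2_F_RESPONSE if msg_type == REPLY else 0
    return RPCRDMA2_F_RESPONSE if xid_from_receiver else 0
```
]]></artwork>
</figure>
</t>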
</section>

<section
 title="rpcrdma2_chunk_lists: Describe External Data Payload"
 anchor="section:af116198-1815-4308-99ab-0197b2c5ea0b">
<t>
The rpcrdma2_chunk_lists structure specifies how an RPC message
is conveyed using explicit RDMA operations.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

struct rpcrdma2_chunk_lists {
        uint32                      rdma_inv_handle;
        struct rpcrdma2_read_list   *rdma_reads;
        struct rpcrdma2_write_list  *rdma_writes;
        struct rpcrdma2_write_chunk *rdma_reply;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
For the most part this structure parallels
its RPC-over-RDMA version 1 equivalent.
That is, rdma_reads, rdma_writes, and rdma_reply provide, respectively,
descriptions of the chunks used to read a Long request or directly
placed data from the requester, to write directly placed response
data into the requester's memory, and to write a Long reply into the
requester's memory.
</t>
<t>
An important addition relative to the corresponding RPC-over-RDMA version 1
rdma_header structures is the rdma_inv_handle field.
This field supports remote invalidation
of requester memory registrations
via the RDMA Send With Invalidate operation.
</t>
<t>
To request Remote Invalidation, a requester sets the value of the
rdma_inv_handle field in an RPC Call's transport header to a non-zero
value that matches one of the rdma_handle fields in that header.  If
none of the rdma_handle values in the header conveying the Call may
be invalidated by the responder, the requester sets the RPC Call's
rdma_inv_handle field to the value zero.
</t>
<t>
If the responder chooses not to use remote invalidation for this
particular RPC Reply, or the RPC Call's rdma_inv_handle field
contains the value zero, the responder uses RDMA Send to transmit the
matching RPC reply.
</t>
<t>
If a requester has provided a non-zero value in the RPC Call's
rdma_inv_handle field and the responder chooses to use Remote
Invalidation for the matching RPC Reply, the responder uses RDMA Send
With Invalidate to transmit that RPC reply, and uses the value in the
corresponding Call's rdma_inv_handle field to construct the Send With
Invalidate Work Request.
</t>
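<t>
The following non-normative sketch, written in Python for illustration only,
summarizes the choices described above.
The function names, the tuple return value, and the
responder_wants_invalidation parameter are hypothetical;
how a responder decides whether to use Remote Invalidation
is outside the scope of the sketch.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
def requester_inv_handle(invalidatable_handles):
    """Choose the rdma_inv_handle value for an RPC Call (hypothetical).

    invalidatable_handles lists the rdma_handle values in the Call's
    transport header that the responder may invalidate.  Zero means
    that Remote Invalidation is not requested.
    """
    for handle in invalidatable_handles:
        if handle != 0:
            return handle
    return 0


def choose_send_operation(inv_handle, responder_wants_invalidation):
    """Pick the operation that transmits the matching RPC Reply."""
    if inv_handle != 0 and responder_wants_invalidation:
        # The Call's rdma_inv_handle value goes into the Work Request.
        return ("SEND_WITH_INVALIDATE", inv_handle)
    return ("SEND", None)
```
]]></artwork>
</figure>
</t>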
</section>

</section>

<section
 title="Transport Properties"
 anchor="section:86248e99-ca60-478a-8aff-3fb387410077">
<t>
RPC-over-RDMA version 2 provides a mechanism
for connection endpoints
to communicate information about implementation properties,
enabling compatible endpoints to optimize data transfer.
Initially only a small set of transport properties is defined
and a single operation is provided to exchange transport properties
(see
<xref target="section:07e8c178-62df-46a7-a57e-dcf107821d93"/>).
</t>
<t>
Both the set of transport properties and the operations used to
communicate them may be extended.
Within RPC-over-RDMA version 2, all such extensions are OPTIONAL.
For information about existing transport properties, see Sections
<xref
 target="section:d5ac12f6-6735-48f3-b4ba-b44a19ff9298"
 pageno="false"
 format="counter"/>
through
<xref
 target="section:943010bd-c342-46b7-9fcd-df746437dd6f"
 pageno="false"
 format="counter"/>.
For discussion of extensions to the set of transport properties, see
<xref target="section:A355ADAD-F03B-41A6-94A8-4128B10301BB"/>.
</t>

<section
 title="Transport Properties Model"
 anchor="section:d5ac12f6-6735-48f3-b4ba-b44a19ff9298">
<t>
A basic set of receiver and sender properties is specified in this document.
An extensible approach is used, allowing new properties to be defined
in future Standards Track documents.
</t>
<t>
Such properties are specified using:
<list style="symbols">
<t>
A code point identifying the particular transport property being specified.
</t>
<t>
A nominally opaque array which contains within it the XDR encoding
of the specific property indicated by the associated code point.
</t>
</list>
</t>
<t>
The following XDR types are used by operations that deal with
transport properties:
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

typedef uint32 rpcrdma2_propid;

struct rpcrdma2_propval {
        rpcrdma2_propid rdma_which;
        opaque          rdma_data&lt;&gt;;
};

typedef rpcrdma2_propval rpcrdma2_propset&lt;&gt;;

typedef uint32 rpcrdma2_propsubset&lt;&gt;;

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
An rpcrdma2_propid specifies a particular transport property.
In order to facilitate XDR extension of the set of properties
by concatenating XDR definition files,
specific properties are defined as const values
rather than as elements in an enum.
</t>
<t>
An rpcrdma2_propval specifies a value of a particular transport
property with the particular property identified by rdma_which,
while the associated value of that property is contained within rdma_data.
</t>
<t>
An rdma_data field of zero length is interpreted as
indicating the default value for the property indicated by rdma_which.
</t>
<t>
While rdma_data is defined as opaque within the XDR,
the contents are interpreted (except when of length zero)
using the XDR typedef associated with the property specified by rdma_which.
As a result, when rpcrdma2_propval does not conform to that typedef,
the receiver is REQUIRED to return the error RDMA2_ERR_BAD_XDR
using the header type RDMA2_ERROR as described in
<xref target="section:b1d23e5c-31df-483f-adb7-25430b5de38d"/>.
For example, the receiver of a message
containing an rpcrdma2_propval returns this error
if the length of rdma_data is such that it extends beyond
the bounds of the message being transferred.
</t>
<t>
In cases in which the rpcrdma2_propid specified by rdma_which is
understood by the receiver, the receiver also MUST report the error
RDMA2_ERR_BAD_XDR if either of the following occurs:
<list style="symbols">
<t>
The nominally opaque data within rdma_data is not valid when
interpreted using the property-associated typedef.
</t>
<t>
The length of rdma_data is insufficient to contain the data
represented by the property-associated typedef.
</t>
</list>
Note that no error is to be reported if rdma_which is unknown to the receiver.
In that case, that rpcrdma2_propval is not processed and processing continues
using the next rpcrdma2_propval, if any.
</t>
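<t>
The processing rules above can be sketched as follows.
This Python fragment is non-normative and illustrative only.
The names apply_propset, BadXdr, DECODERS, and DEFAULTS are hypothetical,
as is the representation of an rpcrdma2_propset as a list of
(rdma_which, rdma_data) pairs.
The sketch does not address rdma_data that is longer than
the property-associated typedef requires.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import struct

RDMA2_PROPID_RBSIZ = 1
RDMA2_PROPID_BRS = 2

# Default values: 4096 octets and RDMA2_RVREQSUP_INLINE (1).
DEFAULTS = {RDMA2_PROPID_RBSIZ: 4096, RDMA2_PROPID_BRS: 1}


class BadXdr(Exception):
    """Stands in for reporting RDMA2_ERR_BAD_XDR (hypothetical)."""


def decode_uint32(data):
    if len(data) < 4:
        raise BadXdr("rdma_data too short for uint32")
    return struct.unpack(">I", data[:4])[0]


def decode_rvreqsup(data):
    value = decode_uint32(data)
    if value not in (0, 1, 2):
        raise BadXdr("not a valid rpcrdma2_rvreqsup value")
    return value


DECODERS = {
    RDMA2_PROPID_RBSIZ: decode_uint32,
    RDMA2_PROPID_BRS: decode_rvreqsup,
}


def apply_propset(propset):
    """Return the effective property values for a received propset."""
    result = dict(DEFAULTS)
    for which, data in propset:
        decoder = DECODERS.get(which)
        if decoder is None:
            continue  # unknown property: skipped, no error reported
        if len(data) == 0:
            result[which] = DEFAULTS[which]  # zero length: default
        else:
            result[which] = decoder(data)
    return result
```
]]></artwork>
</figure>
</t>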
<t>
An rpcrdma2_propset specifies a set of transport properties.
No particular ordering of the rpcrdma2_propval items within it is imposed.
</t>
<t>
An rpcrdma2_propsubset identifies a subset of the properties in a
previously specified rpcrdma2_propset.
Each bit in the mask denotes a particular element in a previously
specified rpcrdma2_propset.
If a particular rpcrdma2_propval is at position N in the array,
then bit number N mod 32 in word N div 32 specifies whether
that particular rpcrdma2_propval is included in the defined subset.
Words beyond the last one specified are treated as containing zero.
</t>
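<t>
As a non-normative illustration of the mask layout just described,
the following Python sketch tests and builds an rpcrdma2_propsubset
represented as a list of 32-bit words.
The function names are hypothetical.
The sketch assumes that bit number 0 is the least significant bit
of each word, which the text above does not state explicitly.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
def in_subset(propsubset, n):
    """True if the rpcrdma2_propval at position n is in the subset."""
    word_index, bit = divmod(n, 32)
    if word_index >= len(propsubset):
        return False  # words beyond the last are treated as zero
    # Assumption: bit 0 is the least significant bit of the word.
    return bool((propsubset[word_index] >> bit) & 1)


def make_subset(positions):
    """Build the list of mask words selecting the given positions."""
    words = []
    for n in positions:
        word_index, bit = divmod(n, 32)
        while len(words) <= word_index:
            words.append(0)
        words[word_index] |= 1 << bit
    return words
```
]]></artwork>
</figure>
</t>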
</section>

<section
 title="Current Transport Properties"
 anchor="section:943010bd-c342-46b7-9fcd-df746437dd6f">
<t>
Although the set of transport properties may be extended,
a basic set of transport properties is defined in
<xref target="table:99d0e7cc-da81-4f16-9bd0-471f806bc0b6"/>.
</t>
<t>
In that table, the columns contain the following information:
<list style="symbols">
<t>
The column labeled "Property" identifies the transport property
described by the current row.
</t>
<t>
The column labeled "Code" specifies the rpcrdma2_propid value used
to identify this property.
</t>
<t>
The column labeled "XDR type" gives the XDR type of the data used
to communicate the value of this property.
This data type overlays the data portion
of the nominally opaque field rdma_data in an rpcrdma2_propval.
</t>
<t>
The column labeled "Default" gives the default value for the
property which is to be assumed by those who do not receive,
or are unable to interpret,
information about the actual value of the property.
</t>
<t>
The column labeled "Sec" indicates the section within this
document that explains the semantics and use of this transport
property.
</t>
</list>
</t>
<texttable
 align="left"
 style="full"
 anchor="table:99d0e7cc-da81-4f16-9bd0-471f806bc0b6">
<ttcol align="left">Property</ttcol>
<ttcol align="left">Code</ttcol>
<ttcol align="left">XDR type</ttcol>
<ttcol align="left">Default</ttcol>
<ttcol align="left">Sec</ttcol>
<c>Receive Buffer Size</c>
<c>1</c>
<c>uint32</c>
<c>4096</c>
<c>
<xref target="section:5101b1f1-b1ad-4b6b-9fa4-d6fa324ffc0d"/>
</c>
<c>Reverse Request Support</c>
<c>2</c>
<c>enum rpcrdma2_rvreqsup</c>
<c>RDMA2_RVREQSUP_INLINE</c>
<c>
<xref target="section:6ace2d7f-044b-491f-97ea-5760345a2e8f"/>
</c>
</texttable>

<section
 title="Receive Buffer Size"
 anchor="section:5101b1f1-b1ad-4b6b-9fa4-d6fa324ffc0d">
<t>
The Receive Buffer Size specifies the minimum size, in octets,
of pre-posted receive buffers.
It is the responsibility of the endpoint sending this value
to ensure that its pre-posted receive buffers are at least the size specified,
allowing the endpoint receiving this value to send messages
that are of this size.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32 RDMA2_PROPID_RBSIZ = 1;
typedef uint32 rpcrdma2_prop_rbsiz;

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
The sender may use its knowledge of the receiver's buffer size to
determine when the message to be sent will fit in the pre-posted
receive buffers that the receiver has set up.
In particular,
<list style="symbols">
<t>
Requesters may use the value to determine when it is necessary to
provide a Position-Zero Read chunk when sending a request.
</t>
<t>
Requesters may use the value to determine when it is necessary to
provide a Reply chunk when sending a request, based on the maximum
possible size of the reply.
</t>
<t>
Responders may use the value to determine when it is necessary,
given the actual size of the reply, to actually use a Reply chunk
provided by the requester.
</t>
</list>
</t>
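<t>
A minimal, non-normative sketch of these decisions follows,
written in Python for illustration only.
The function and parameter names are hypothetical.
The sketch assumes that message sizes are measured in octets
and include the transport header,
and that a reply is bounded by the Receive Buffer Size of the requester,
since the requester is the endpoint that receives it.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
DEFAULT_RBSIZ = 4096  # default Receive Buffer Size, in octets


def fits_inline(message_size, receiver_rbsiz=DEFAULT_RBSIZ):
    """True if the message fits in one pre-posted receive buffer."""
    return message_size <= receiver_rbsiz


def plan_call(call_size, max_reply_size,
              responder_rbsiz=DEFAULT_RBSIZ,
              requester_rbsiz=DEFAULT_RBSIZ):
    """Decide which chunks a requester provides with a request."""
    return {
        "position_zero_read_chunk":
            not fits_inline(call_size, responder_rbsiz),
        "reply_chunk":
            not fits_inline(max_reply_size, requester_rbsiz),
    }


def responder_uses_reply_chunk(actual_reply_size,
                               requester_rbsiz=DEFAULT_RBSIZ):
    """Decide whether a provided Reply chunk is actually used."""
    return not fits_inline(actual_reply_size, requester_rbsiz)
```
]]></artwork>
</figure>
</t>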
</section>

<section
 title="Reverse Request Support"
 anchor="section:6ace2d7f-044b-491f-97ea-5760345a2e8f">
<t>
The value of this property is used to indicate a client implementation's
readiness to accept and process messages that are part
of reverse direction RPC requests.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

enum rpcrdma2_rvreqsup {
        RDMA2_RVREQSUP_NONE    = 0,
        RDMA2_RVREQSUP_INLINE  = 1,
        RDMA2_RVREQSUP_GENL    = 2
};

const uint32 RDMA2_PROPID_BRS = 2;
typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
Multiple levels of support are distinguished:
<list style="symbols">
<t>
The value RDMA2_RVREQSUP_NONE indicates that receipt of reverse
direction requests and replies is not supported.
</t>
<t>
The value RDMA2_RVREQSUP_INLINE indicates that receipt of reverse
direction requests or replies is only supported using inline
messages and that use of explicit RDMA operations or other forms of
Direct Data Placement for reverse direction requests or responses
is not supported.
</t>
<t>
The value RDMA2_RVREQSUP_GENL indicates that receipt of reverse direction
requests or replies is supported in the same ways that forward
direction requests or replies typically are.
</t>
</list>
</t>
<t>
When information about this property is not provided,
the support level of servers can be inferred
from the reverse direction requests that they issue,
assuming that issuing a request implicitly indicates support
for receiving the corresponding reply.
On this basis, support for receiving inline replies
can be assumed when requests without
Read chunks, Write chunks, or Reply chunks are issued,
while requests with any of these elements allow the client to assume
that general support for reverse direction replies is present on the server.
</t>
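<t>
The inference described above can be sketched as follows.
This Python fragment is non-normative;
the function name and its parameters are hypothetical.
It applies only when the server has not provided
the Reverse Request Support property.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
RDMA2_RVREQSUP_NONE = 0
RDMA2_RVREQSUP_INLINE = 1
RDMA2_RVREQSUP_GENL = 2


def inferred_server_support(read_chunks, write_chunks, has_reply_chunk):
    """Infer support from one reverse direction request (hypothetical).

    read_chunks and write_chunks are the numbers of Read and Write
    chunks in the request; has_reply_chunk says whether a Reply chunk
    was provided.
    """
    if read_chunks > 0 or write_chunks > 0 or has_reply_chunk:
        return RDMA2_RVREQSUP_GENL
    return RDMA2_RVREQSUP_INLINE
```
]]></artwork>
</figure>
</t>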
</section>

</section>

</section>

<section
 title="RPC-over-RDMA Version 2 Transport Messages"
 anchor="section:eef6a22e-2633-44a2-a8f0-821fec8bf824">

<section
 title="Overall Transport Message Structure"
 anchor="section:417e749e-efec-455f-aae7-12535b9ee8dc">
<t>
Each transport message consists of multiple sections:
<list style="symbols">
<t>
A transport header prefix, as defined in
<xref target="section:2d1735f0-c465-43c6-9c18-3da6b7979862"/>.
Among other things, this structure indicates the header type.
</t>
<t>
The transport header proper, as defined by one of the sub-sections below.
See
<xref target="section:67b34950-5376-49fd-93d7-b4fdf80d1c9b"/>
for the mapping between header types and the corresponding header structure.
</t>
<t>
Potentially, an RPC message being conveyed as an addendum to the header.
</t>
</list>
</t>
<t>
This organization differs from that presented in the definition of
RPC-over-RDMA version 1
<xref target="RFC8166"/>,
which presented the first and second of the items above as a single XDR item.
The new organization is more in keeping with RPC-over-RDMA version 2's
extensibility model in that new header types can be defined without
modifying the existing set of header types.
</t>
</section>

<section
 title="Transport Header Types"
 anchor="section:67b34950-5376-49fd-93d7-b4fdf80d1c9b">
<t>
The new header types within RPC-over-RDMA version 2
are set forth in
<xref target="table:b5c31bf9-d623-4957-97db-29fc1d416cb8"/>.
In that table, the columns contain the following information:
<list style="symbols">
<t>
The column labeled "Operation" specifies the particular operation.
</t>
<t>
The column labeled "Code" specifies the value of header type for
this operation.
</t>
<t>
The column labeled "XDR type" gives the XDR type of the data
structure used to describe the information in this new message type.
This data immediately follows the universal portion of the
transport header present in every RPC-over-RDMA transport header.
</t>
<t>
The column labeled "Msg" indicates whether this operation is
followed (or not) by an RPC message payload.
</t>
<t>
The column labeled "Sec" indicates the section (within this
document) that explains the semantics and use of this operation.
</t>
</list>
</t>
<texttable
 align="left"
 style="full"
 anchor="table:b5c31bf9-d623-4957-97db-29fc1d416cb8"
 title=""
 suppress-title="false">
<ttcol align="left">Operation</ttcol>
<ttcol align="left">Code</ttcol>
<ttcol align="left">XDR type</ttcol>
<ttcol align="left">Msg</ttcol>
<ttcol align="left">Sec</ttcol>
<c>Convey Appended RPC Message</c>
<c>0</c>
<c>rpcrdma2_msg</c>
<c>Yes</c>
<c>
<xref target="section:9af0d451-2ef3-454f-adb9-827664ccc39c"/>
</c>
<c>Convey External RPC Message</c>
<c>1</c>
<c>rpcrdma2_nomsg</c>
<c>No</c>
<c>
<xref target="section:1c401555-4b7d-4e35-a9a3-8aa14228170e"/>
</c>
<c>Report Transport Error</c>
<c>4</c>
<c>rpcrdma2_err</c>
<c>No</c>
<c>
<xref target="section:b1d23e5c-31df-483f-adb7-25430b5de38d"/>
</c>
<c>Specify Properties at Connection</c>
<c>5</c>
<c>rpcrdma2_connprop</c>
<c>No</c>
<c>
<xref target="section:07e8c178-62df-46a7-a57e-dcf107821d93"/>
</c>
</texttable>
<t>
Support for the operations in
<xref target="table:b5c31bf9-d623-4957-97db-29fc1d416cb8"/>
is REQUIRED.
Support for additional operations will be OPTIONAL.
RPC-over-RDMA version 2 implementations that receive an OPTIONAL operation
that is not supported MUST respond with an RDMA2_ERROR message
with an error code of RDMA2_ERR_INVAL_HTYPE.
</t>
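<t>
A receiver's handling of the header type can be sketched as follows.
This Python fragment is non-normative and illustrative only;
the function name, its return values,
and the optional_supported parameter are hypothetical.
The error is identified by name
because its numeric value is defined elsewhere in this document.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
# Header type codes whose support is REQUIRED.
RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR, RDMA2_CONNPROP = 0, 1, 4, 5
REQUIRED_HTYPES = frozenset(
    (RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR, RDMA2_CONNPROP))


def classify_htype(htype, optional_supported=frozenset()):
    """Return "process", or the error to report (hypothetical).

    optional_supported holds the codes of OPTIONAL operations that
    this implementation has chosen to support.
    """
    if htype in REQUIRED_HTYPES or htype in optional_supported:
        return "process"
    return "RDMA2_ERR_INVAL_HTYPE"
```
]]></artwork>
</figure>
</t>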
</section>

<section
 title="Header Types Defined in RPC-over-RDMA version 2"
 anchor="section:8039c7b8-9068-401e-9cbd-5c1e67d403e7">
<t>
The header types defined and used in RPC-over-RDMA version 1
are all carried over into RPC-over-RDMA version 2,
although there may be limited changes
in the definition of existing header types.
</t>
<t>
In comparison with the header types of RPC-over-RDMA version 1,
the changes can be summarized as follows:
<list style="symbols">
<t>
To simplify interoperability with RPC-over-RDMA version 1,
only the RDMA2_ERROR header (defined in
<xref target="section:b1d23e5c-31df-483f-adb7-25430b5de38d"/>)
has an XDR definition that differs from that in RPC-over-RDMA version 1,
and its modifications are all compatible extensions.
</t>
<t>
RDMA2_MSG and RDMA2_NOMSG
(defined in Sections
<xref target="section:9af0d451-2ef3-454f-adb9-827664ccc39c"/>
and
<xref target="section:1c401555-4b7d-4e35-a9a3-8aa14228170e"/>)
have XDR definitions that match the corresponding
RPC-over-RDMA version 1 header types.
However, because of the changes to the header prefix,
the version 1 and version 2 header types
differ in on-the-wire format.
</t>
<t>
RDMA2_CONNPROP
(defined in
<xref target="section:07e8c178-62df-46a7-a57e-dcf107821d93"/>)
is a completely new header type devoted to enabling
connection peers to exchange information about their transport properties.
</t>
</list>
</t>

<section
 title="RDMA2_MSG: Convey RPC Message Inline"
 anchor="section:9af0d451-2ef3-454f-adb9-827664ccc39c">
<t>
RDMA2_MSG is used to convey an RPC message that immediately
follows the Transport Header in the Send buffer.
This is either an RPC request that has no Position-Zero Read chunk
or an RPC reply that is not sent using a Reply chunk.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const rpcrdma2_proc RDMA2_MSG = 0;

struct rpcrdma2_msg {
        struct rpcrdma2_chunk_lists  rdma_chunks;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                       rdma_rpc_first_word;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
</section>

<section
 title="RDMA2_NOMSG: Convey External RPC Message"
 anchor="section:1c401555-4b7d-4e35-a9a3-8aa14228170e">
<t>
RDMA2_NOMSG is used to convey an entire RPC message using explicit RDMA operations.
Usually this is because the RPC message does not fit
within the size limits that result from the
receiver's inline threshold.
The message may be a Long request,
which is read from a memory area specified by a Position-Zero Read chunk;
or a Long reply,
which is written into a memory area specified by a Reply chunk.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const rpcrdma2_proc RDMA2_NOMSG = 1;

struct rpcrdma2_nomsg {
        struct rpcrdma2_chunk_lists  rdma_chunks;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
</section>

<section
 title="RDMA2_ERROR: Report Transport Error"
 anchor="section:b1d23e5c-31df-483f-adb7-25430b5de38d">
<t>
RDMA2_ERROR provides a way of reporting the occurrence of transport
errors on a previous transmission.
This header type MUST NOT be transmitted by a requester.
[ cel: how is the XID field set when sending an error report
from a requester, or when the error occurred on a non-RPC message? ]
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const rpcrdma2_proc RDMA2_ERROR = 4;

struct rpcrdma2_err_vers {
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
};

struct rpcrdma2_err_write {
        uint32 rdma_chunk_index;
        uint32 rdma_length_needed;
};

union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
        case RDMA2_ERR_VERS:
          rpcrdma2_err_vers rdma_vrange;
        case RDMA2_ERR_READ_CHUNKS:
          uint32 rdma_max_chunks;
        case RDMA2_ERR_WRITE_CHUNKS:
          uint32 rdma_max_chunks;
        case RDMA2_ERR_SEGMENTS:
          uint32 rdma_max_segments;
        case RDMA2_ERR_WRITE_RESOURCE:
          rpcrdma2_err_write rdma_writeres;
        case RDMA2_ERR_REPLY_RESOURCE:
          uint32 rdma_length_needed;
        default:
          void;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
Error reporting is addressed in RPC-over-RDMA version 2
in a fashion similar to RPC-over-RDMA version 1.
There are several new error codes, and error messages
never flow from requester to responder.
RPC-over-RDMA version 1 error reporting
is described in Section 5 of
<xref target="RFC8166"/>.
</t>
<t>
In all cases below, the responder copies the values of the
rdma_start.rdma_xid
and
rdma_start.rdma_vers
fields from the incoming transport header that
generated the error to the transport header of the error response.
The responder sets the rdma_start.rdma_htype field of the transport
header prefix to RDMA2_ERROR, and the rdma_start.rdma_credit field is
set to the credit grant value for this connection.
The receiver of this header type MUST ignore the value of the
rdma_start.rdma_credit field.
</t>
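<t>
As a non-normative illustration,
the following Python sketch builds the five-word transport header prefix
of an error response from the prefix of the offending message.
The function name and its parameters are hypothetical.
XDR encodes each uint32 as four octets in big-endian order,
and the RPCRDMA2_F_RESPONSE flag is set because the header type is RDMA2_ERROR.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import struct

RDMA2_ERROR = 4
RPCRDMA2_F_RESPONSE = 0x00000001


def error_prefix(incoming_prefix, credit_grant):
    """Build the rpcrdma2_hdr_prefix of an error response.

    incoming_prefix holds at least the first eight octets of the
    message that generated the error (rdma_xid and rdma_vers).
    """
    xid, vers = struct.unpack(">II", incoming_prefix[:8])
    # Field order: rdma_xid, rdma_vers, rdma_credit, rdma_htype,
    # then rdma_flags.
    return struct.pack(">IIIII", xid, vers, credit_grant,
                       RDMA2_ERROR, RPCRDMA2_F_RESPONSE)
```
]]></artwork>
</figure>
</t>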
<t>
<list style="hanging">
<t hangText="RDMA2_ERR_VERS">
<vspace/>
This is the equivalent of ERR_VERS in RPC-over-RDMA version 1.
The error code value, semantics, and utilization are the same.
</t>
<t hangText="RDMA2_ERR_INVAL_HTYPE">
<vspace/>
If a responder recognizes the value in the rdma_start.rdma_vers field,
but it does not recognize the value in the rdma_start.rdma_htype field
or does not support that header type,
it MUST set the rdma_err field to RDMA2_ERR_INVAL_HTYPE.
</t>
<t hangText="RDMA2_ERR_BAD_XDR">
<vspace/>
If a responder recognizes the values in the
rdma_start.rdma_vers
and
rdma_start.rdma_htype
fields,
but the incoming RPC-over-RDMA transport header cannot be parsed,
it MUST set the rdma_err field to RDMA2_ERR_BAD_XDR.
This includes cases in which a nominally opaque property value
field cannot be parsed
using the XDR typedef associated with the transport property definition.
The error code value of RDMA2_ERR_BAD_XDR is the same as
the error code value of ERR_CHUNK in RPC-over-RDMA version 1.
The responder MUST NOT process the request in any way
except to send an error message.
</t>
<t hangText="RDMA2_ERR_READ_CHUNKS">
<vspace/>
If a requester presents more DDP-eligible arguments than the responder
is prepared to Read,
the responder MUST set the rdma_err field to RDMA2_ERR_READ_CHUNKS,
and set the rdma_max_chunks field to the maximum number of
Read chunks the responder can receive and process.
<vspace/>
If the responder implementation cannot handle any Read chunks
for a request, it MUST set the rdma_max_chunks field to zero in this response.
The requester SHOULD resend the request using a Position-Zero Read chunk.
If this was a request using a Position-Zero Read chunk,
the requester MUST terminate the transaction with an error.
</t>
<t hangText="RDMA2_ERR_WRITE_CHUNKS">
<vspace/>
If a requester has constructed an RPC Call message with
more DDP-eligible results than the responder is prepared to Write,
the responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS,
and set the rdma_max_chunks field to the maximum number of
Write chunks the responder can process and return.
<vspace/>
If the responder implementation cannot handle any Write chunks for a
request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE (below).
The requester SHOULD resend the request with no Write chunks and
a Reply chunk of appropriate size.
</t>
<t hangText="RDMA2_ERR_SEGMENTS">
<vspace/>
If a requester has constructed an RPC Call message with a
chunk that contains more segments than the responder supports,
the responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS,
and set the rdma_max_segments field to the maximum number of
segments the responder can process.
</t>
<t hangText="RDMA2_ERR_WRITE_RESOURCE">
<vspace/>
If a requester has provided a Write chunk that is not large enough
to fully convey a DDP-eligible result,
the responder MUST set the rdma_err field to RDMA2_ERR_WRITE_RESOURCE.
<vspace/>
<vspace/>
The responder MUST set the rdma_chunk_index field to point to the
first Write chunk in the transport header that is too short, or to
zero to indicate that it was not possible to determine which chunk
is too small.
Indexing starts at one (1), which represents the first Write chunk.
The responder MUST set the rdma_length_needed field to the number of bytes
needed in that chunk in order to convey the result data item.
<vspace/>
<vspace/>
Upon receipt of this error code,
a requester MAY choose to terminate the operation
(for instance, if the responder set the index and length fields to zero),
or it MAY send the request again using the same XID and more
reply resources.
</t>
<t hangText="RDMA2_ERR_REPLY_RESOURCE">
<vspace/>
If an RPC Reply's Payload stream does not fit inline
and the requester has not provided a large enough Reply chunk
to convey the stream,
the responder MUST set the rdma_err field to RDMA2_ERR_REPLY_RESOURCE.
The responder MUST set the rdma_length_needed field to the number of
Reply chunk bytes needed to convey the reply.
<vspace/>
<vspace/>
Upon receipt of this error code,
a requester MAY choose to terminate the operation
(for instance, if the responder set the length field to zero),
or it MAY send the request again using the same XID and larger
reply resources.
</t>
<t hangText="RDMA2_ERR_SYSTEM">
<vspace/>
If some problem occurs on a responder that does not fit
into the above categories,
the responder MAY report it to the requester by setting
the rdma_err field to RDMA2_ERR_SYSTEM.
<vspace/>
<vspace/>
This is a permanent error: a requester that receives this error MUST
terminate the RPC transaction associated with the XID value in the
rdma_start.rdma_xid field.
</t>
</list>
</t>
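<t>
As a non-normative illustration, the construction of an RDMA2_ERROR
header described above can be sketched in Python.
The helper name build_error_header, the ARM_WORDS table, and the choice
to pass rdma_flags as a caller-supplied value are assumptions of this
sketch and are not defined by this document.
The sketch also assumes that the rpcrdma2_error union immediately follows
the header prefix and that every field is a big-endian 32-bit XDR word.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import struct

# Constant values copied from the XDR in this document.
RDMA2_ERROR = 4
RDMA2_ERR_VERS = 1
RDMA2_ERR_READ_CHUNKS = 4
RDMA2_ERR_WRITE_CHUNKS = 5
RDMA2_ERR_SEGMENTS = 6
RDMA2_ERR_WRITE_RESOURCE = 7
RDMA2_ERR_REPLY_RESOURCE = 8

# Number of uint32 words carried by each non-void arm of
# rpcrdma2_error. Error codes not listed here have a void arm.
ARM_WORDS = {
    RDMA2_ERR_VERS: 2,            # rdma_vers_low, rdma_vers_high
    RDMA2_ERR_READ_CHUNKS: 1,     # rdma_max_chunks
    RDMA2_ERR_WRITE_CHUNKS: 1,    # rdma_max_chunks
    RDMA2_ERR_SEGMENTS: 1,        # rdma_max_segments
    RDMA2_ERR_WRITE_RESOURCE: 2,  # rdma_chunk_index, rdma_length_needed
    RDMA2_ERR_REPLY_RESOURCE: 1,  # rdma_length_needed
}


def build_error_header(xid, vers, credit, err, args=(), flags=0):
    """Hypothetical helper: encode the prefix plus rpcrdma2_error.

    xid and vers are copied from the header that triggered the error.
    The flags value is an assumption supplied by the caller.
    """
    if len(args) != ARM_WORDS.get(err, 0):
        raise ValueError("wrong number of arguments for error code")
    # rpcrdma2_hdr_prefix: rdma_xid, rdma_vers, rdma_credit,
    # rdma_htype, rdma_flags.
    prefix = struct.pack(">5I", xid, vers, credit, RDMA2_ERROR, flags)
    # rpcrdma2_error: discriminant followed by the selected arm.
    body = struct.pack(">%dI" % (1 + len(args)), err, *args)
    return prefix + body
```
]]></artwork>
</figure>
For instance, a responder reporting that the first Write chunk needs
8192 bytes might call
build_error_header(xid, 2, credit, RDMA2_ERR_WRITE_RESOURCE, (1, 8192)).
</t>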
</section>

<section
 title="RDMA2_CONNPROP: Advertise Transport Properties"
 anchor="section:07e8c178-62df-46a7-a57e-dcf107821d93">
<t>
The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint,
whether client or server, to indicate to its partner relevant
transport properties that the partner might need to be aware of.
</t>
<t>
The message definition for this operation is as follows:
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

struct rpcrdma2_connprop {
        rpcrdma2_propset rdma_props;
};

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
All relevant transport properties that the sender is aware of should
be included in rdma_props.
Since support of each of the properties is OPTIONAL,
the sender cannot assume that the receiver will necessarily take note
of these properties.
The sender should be prepared for cases in which the receiver
continues to assume that the default value for a particular property
is still in effect.
</t>
<t>
Generally, a participant will send an RDMA2_CONNPROP message as the
first message after a connection is established.
Given that fact, the sender should make sure that the message
can be received by peers that use the default Receive Buffer Size.
The connection's initial receive buffer size is typically 1KB,
but it depends on the initial connection state of the RPC-over-RDMA
version in use.
</t>
<t>
Properties not included in rdma_props are to be treated by the peer
endpoint as having the default value and are not allowed to change
subsequently.
The peer should not request changes in such properties.
</t>
<t>
Those receiving an RDMA2_CONNPROP may encounter properties that they
do not support or are unaware of.
In such cases, these properties are simply ignored
without any error response being generated.
</t>
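<t>
The following non-normative Python sketch illustrates how a receiver
might skip properties it does not recognize and fall back to a default
value for a property that was not advertised.
The function names encode_propset, decode_propset, and receive_buffer_size
are hypothetical, and the default value is supplied by the caller rather
than taken from this document.
The sketch assumes standard XDR encoding: a variable-length array is
preceded by its element count, and opaque data is preceded by its length
and padded to a multiple of four bytes.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import struct

RDMA2_PROPID_RBSIZ = 1  # value copied from the XDR in this document


def encode_propset(props):
    """Encode {propid: value_bytes} as an rpcrdma2_propset."""
    out = [struct.pack(">I", len(props))]
    for propid, data in sorted(props.items()):
        pad = (4 - len(data) % 4) % 4
        out.append(struct.pack(">II", propid, len(data)))
        out.append(data + b"\0" * pad)
    return b"".join(out)


def decode_propset(buf, known):
    """Return {propid: value_bytes} for known propids only.

    Properties whose propid is not in 'known' are skipped without
    raising an error.
    """
    (count,) = struct.unpack_from(">I", buf, 0)
    offset, found = 4, {}
    for _ in range(count):
        propid, length = struct.unpack_from(">II", buf, offset)
        offset += 8
        if propid in known:
            found[propid] = buf[offset:offset + length]
        # Step over the value and its XDR padding.
        offset += length + (4 - length % 4) % 4
    return found


def receive_buffer_size(found, default):
    """Use the default when the peer did not advertise RBSIZ."""
    if RDMA2_PROPID_RBSIZ not in found:
        return default
    return struct.unpack(">I", found[RDMA2_PROPID_RBSIZ])[0]
```
]]></artwork>
</figure>
</t>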
</section>

</section>

</section>

<section
 title="XDR Protocol Definition"
 anchor="section:bf53e759-d97f-487d-a5e2-9b8153db1803">
<t>
This section contains a description of the core features of
the RPC-over-RDMA version 2 protocol expressed in the XDR language
<xref target="RFC4506"/>.
</t>
<t>
Because of the need to provide for protocol extensibility
without modifying an existing XDR definition,
this description has some important structural differences
from the corresponding XDR description for RPC-over-RDMA version 1,
which appears in
<xref target="RFC8166"/>.
</t>
<t>
This description is divided into three parts:
<list style="symbols">
<t>
A code component license which appears in
<xref target="section:aaab9699-eae3-46ca-a1d5-a8776a5ecb7d"/>.
</t>
<t>
An XDR description of the structures that are generally available
for use by transport header types including both those defined in
this document and those that may be defined as extensions.
This includes definitions of the chunk-related structures
derived from RPC-over-RDMA version 1,
the transport property model introduced in this document,
and a definition of the transport header prefixes that precede the
various transport header types.
This appears in
<xref target="section:b25ffcfc-511f-4383-8025-4a68cfcb4f49"/>.
</t>
<t>
An XDR description of the transport header types defined in this document,
including those derived from RPC-over-RDMA version 1
and those introduced in RPC-over-RDMA version 2.
This appears in
<xref target="section:84e950a5-c842-4d19-b56d-0458c3e219b2"/>.
</t>
</list>
</t>
<t>
This description is provided in a way that makes it simple
to extract into ready-to-compile form.
To enable the combination of this description with the descriptions
of subsequent extensions to RPC-over-RDMA version 2,
the extracted description can be combined with similar descriptions
published later, or those descriptions can be compiled separately.
Refer to
<xref target="section:a288b3e6-5e73-412d-91e8-f87c031cb05b"/>
for details.
</t>

<section
 title="Code Component License"
 anchor="section:aaab9699-eae3-46ca-a1d5-a8776a5ecb7d">
<t>
Code components extracted from this document must include the
following license text.
When the extracted XDR code is combined with other complementary
XDR code which itself has an identical license, only a single
copy of the license text need be preserved.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

/// /*
///  * Copyright (c) 2010-2018 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The authors of the code are:
///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
</section>

<section
 title="Extraction and Use of XDR Definitions"
 anchor="section:a288b3e6-5e73-412d-91e8-f87c031cb05b">
<t>
The reader can apply the following sed script to this
document to produce a machine-readable XDR description of
the RPC-over-RDMA version 2 protocol without any OPTIONAL
extensions.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
That is, if this document is in a file called
"spec.txt" then the reader can do the following to extract
an XDR description file and store it in the file rpcrdma-v2.x.
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
     &lt; spec.txt &gt; rpcrdma-v2.x

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
Although this file is a usable description of the base protocol,
when extensions are to be supported, it may be desirable to divide it into
multiple files.
The following script can be used for that purpose:
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

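#!/usr/local/bin/perl
# Split rpcrdma-v2.x into the files named by its FILE ENDS markers.
use strict;
use warnings;

open(my $in, "&lt;", "rpcrdma-v2.x") or die "rpcrdma-v2.x: $!";
open(my $out, "&gt;", "temp.x") or die "temp.x: $!";
while (my $line = &lt;$in&gt;)
{
  # A marker line looks like: /* FILE ENDS: common.x; */
  # Capture only the file name, not the trailing "; */".
  if ($line =~ m/FILE ENDS: ([^;\s]+);/)
    {
      close($out);
      rename("temp.x", $1) or die "$1: $!";
      open($out, "&gt;", "temp.x") or die "temp.x: $!";
    }
  else
    {
      print $out $line;
    }
}
close($in);
close($out);
# Discard whatever followed the last marker.
unlink("temp.x");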

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
<t>
Running the above script will result in two files:
<list style="symbols">
<t>
The file common.x, containing the license plus the common XDR
definitions which need to be made available to both the base
operations and any subsequent extensions.
</t>
<t>
The file baseops.x, containing the XDR definitions for the base
operations defined in this document.
</t>
</list>
</t>
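<t>
For readers without sed or perl at hand, the two steps above
(extracting the marked lines, then splitting at the "FILE ENDS" markers)
can be approximated by the following non-normative Python sketch.
The function names extract_xdr and split_files are hypothetical,
and the sketch operates on lists of lines held in memory
rather than on files.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import re


def extract_xdr(spec_lines):
    """Mimic the sed script: keep only the lines marked for extraction."""
    out = []
    for line in spec_lines:
        line = line.rstrip("\n")
        marked = re.match(r"^ */// (.*)$", line)
        if marked:
            # Marker followed by a space: keep the rest of the line.
            out.append(marked.group(1))
        elif re.match(r"^ *///$", line):
            # Bare marker: emit an empty line.
            out.append("")
    return out


def split_files(xdr_lines):
    """Group lines under the file named by the marker that follows them."""
    files, pending = {}, []
    for line in xdr_lines:
        marker = re.search(r"FILE ENDS: ([^;\s]+);", line)
        if marker:
            files[marker.group(1)] = pending
            pending = []
        else:
            pending.append(line)
    # Lines after the last marker are dropped.
    return files
```
]]></artwork>
</figure>
Applied to the text of this document, split_files(extract_xdr(lines))
is expected to return a dictionary with the keys common.x and baseops.x.
</t>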
<t>
Optional extensions to RPC-over-RDMA version 2,
published as Standards Track documents,
will have similar means of providing XDR that describes
those extensions.
Once XDR for all desired extensions is also extracted,
it can be appended to the XDR description file extracted
from this document to produce a consolidated XDR description
file reflecting all extensions selected for an RPC-over-RDMA
implementation.
</t>
<t>
Alternatively, the XDR descriptions can be compiled separately.
In this case, the combination of common.x and baseops.x
defines the base transport.
The XDR description for each extension consists of the XDR extracted
from the document defining that extension,
together with the file common.x obtained from this document.
</t>
</section>

<section
 title="XDR Definition for RPC-over-RDMA Version 2 Core Structures"
 anchor="section:b25ffcfc-511f-4383-8025-4a68cfcb4f49">
<t>
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;
/// /*******************************************************************
///  *    Transport Header Prefixes
///  ******************************************************************/
///
/// struct rpcrdma_common {
///         uint32         rdma_xid;
///         uint32         rdma_vers;
///         uint32         rdma_credit;
///         uint32         rdma_htype;
/// };
///
/// const RPCRDMA2_F_RESPONSE = 0x00000001;
///
/// struct rpcrdma2_hdr_prefix {
///         struct rpcrdma_common       rdma_start;
///         uint32                      rdma_flags;
/// };
///
/// /*******************************************************************
///  *    Chunks and Chunk Lists
///  ******************************************************************/
///
/// struct rpcrdma2_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///         struct rpcrdma2_read_segment rdma_entry;
///         struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///         struct rpcrdma2_segment rdma_target&lt;&gt;;
/// };
///
/// struct rpcrdma2_write_list {
///         struct rpcrdma2_write_chunk rdma_entry;
///         struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// struct rpcrdma2_chunk_lists {
///         uint32                      rdma_inv_handle;
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_writes;
///         struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// /*******************************************************************
///  *    Transport Properties
///  ******************************************************************/
///
/// /*
///  * Types for transport properties model
///  */
/// typedef uint32 rpcrdma2_propid;
///
/// struct rpcrdma2_propval {
///         rpcrdma2_propid rdma_which;
///         opaque          rdma_data&lt;&gt;;
/// };
///
/// typedef rpcrdma2_propval rpcrdma2_propset&lt;&gt;;
/// typedef uint32 rpcrdma2_propsubset&lt;&gt;;
///
/// /*
///  * Transport propid values for basic properties
///  */
/// const uint32 RDMA2_PROPID_RBSIZ = 1;
/// const uint32 RDMA2_PROPID_BRS = 2;
///
/// /*
///  * Types specific to particular properties
///  */
/// typedef uint32 rpcrdma2_prop_rbsiz;
/// typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;
///
/// enum rpcrdma2_rvreqsup {
///         RDMA2_RVREQSUP_NONE = 0,
///         RDMA2_RVREQSUP_INLINE = 1,
///         RDMA2_RVREQSUP_GENL = 2
/// };
///
/// /* FILE ENDS: common.x; */

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
</section>

<section
 title="XDR Definition for RPC-over-RDMA Version 2 Base Header Types"
 anchor="section:84e950a5-c842-4d19-b56d-0458c3e219b2">
<t>
<figure align="left">
<artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;
/// /*******************************************************************
///  *    Descriptions of RPC-over-RDMA Header Types
///  ******************************************************************/
///
/// /*
///  * Header Type Codes.
///  */
/// const rpcrdma2_proc RDMA2_MSG = 0;
/// const rpcrdma2_proc RDMA2_NOMSG = 1;
/// const rpcrdma2_proc RDMA2_ERROR = 4;
/// const rpcrdma2_proc RDMA2_CONNPROP = 5;
///
/// /*
///  * Header Types to Convey RPC Messages.
///  */
/// struct rpcrdma2_msg {
///         struct rpcrdma2_chunk_lists  rdma_chunks;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                       rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_nomsg {
///         struct rpcrdma2_chunk_lists  rdma_chunks;
/// };
///
/// /*
///  * Header Type to Report Errors.
///  */
/// const uint32 RDMA2_ERR_VERS = 1;
/// const uint32 RDMA2_ERR_BAD_XDR = 2;
/// const uint32 RDMA2_ERR_INVAL_HTYPE = 3;
/// const uint32 RDMA2_ERR_READ_CHUNKS = 4;
/// const uint32 RDMA2_ERR_WRITE_CHUNKS = 5;
/// const uint32 RDMA2_ERR_SEGMENTS = 6;
/// const uint32 RDMA2_ERR_WRITE_RESOURCE = 7;
/// const uint32 RDMA2_ERR_REPLY_RESOURCE = 8;
/// const uint32 RDMA2_ERR_SYSTEM = 9;
///
/// struct rpcrdma2_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_write {
///         uint32 rdma_chunk_index;
///         uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
///         case RDMA2_ERR_VERS:
///           rpcrdma2_err_vers rdma_vrange;
///         case RDMA2_ERR_READ_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_WRITE_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_SEGMENTS:
///           uint32 rdma_max_segments;
///         case RDMA2_ERR_WRITE_RESOURCE:
///           rpcrdma2_err_write rdma_writeres;
///         case RDMA2_ERR_REPLY_RESOURCE:
///           uint32 rdma_length_needed;
///         default:
///           void;
/// };
///
/// /*
///  * Header Type to Exchange Transport Properties.
///  */
/// struct rpcrdma2_connprop {
///         rpcrdma2_propset rdma_props;
/// };
///
/// /* FILE ENDS: baseops.x; */

&lt;CODE ENDS&gt;
</artwork>
</figure>
</t>
</section>

<section
 title="Use of the XDR Description Files"
 anchor="section:5541f0da-efbb-4431-af9c-6f82aa773963">
<t>
The two files common.x and baseops.x,
when combined with the XDR descriptions for extensions defined later,
produce a human-readable and compilable description
of the RPC-over-RDMA version 2 protocol with the included extensions.
</t>
<t>
Although this XDR description can be useful in generating code
to encode and decode the transport and payload streams,
there are elements of the structure of RPC-over-RDMA version 2
which are not expressible within the XDR language as currently defined.
This requires implementations that use the output of the XDR processor
to provide additional code to bridge the gaps.
<list style="symbols">
<t>
The values of transport properties are represented
within XDR as opaque values.
However, the actual structures of each
of the properties are represented by XDR typedefs,
with the selection of the appropriate typedef described by text in
this document.
The determination of the appropriate typedef is not specified by XDR,
which does not possess the facilities necessary for that determination
to be specified in an extensible way.
<vspace blankLines="1"/>
This is similar to the way in which NFSv4 attributes are handled
<xref target="RFC7530"/>
<xref target="RFC5661"/>.
As in that case, implementations that need to encode and decode
these nominally opaque entities need to use the protocol description
to determine the actual XDR representation that underlies the
items described as opaque.
</t>
<t>
The transport stream is not represented as a single XDR object.
Instead, the header prefix is described by one XDR object, while
the rest of the header is described by another XDR object.
The mapping between the header type in the header prefix and the
XDR object representing that header type is given by tables
contained in this document, with additional mappings being
specifiable by a later extension document.
<vspace blankLines="1"/>
This situation is similar to that in which RPC message headers
contain program and procedure numbers, so that the XDR for
those requests and replies can be used to encode and decode the
associated messages without requiring that all be present in a
single XDR specification.
As in that case, implementations need to use the header specification
to select the appropriate XDR-generated code to be used
in message processing.
</t>
<t>
The relationship between the transport stream and the payload
stream is not specified in the XDR itself,
although comments within the XDR text make clear
where transported messages, described by their own XDR, need to appear.
Such data by its nature is opaque to the transport,
although its form differs from that of XDR opaque arrays.
<vspace blankLines="1"/>
Potential extensions allowing continuation of RPC messages
across transport message boundaries will require that message
assembly facilities, not specifiable within XDR, also be part
of transport implementations.
</t>
</list>
</t>
<t>
To summarize, the role of XDR in this specification
is more limited than for protocols which are themselves XDR programs,
where the totality of the protocol is expressible within the
XDR paradigm established for that purpose.
This more limited role reflects the fact that XDR lacks facilities
to represent the embedding of transported material
within the transport framework.
In addition, the need to cleanly accommodate extensions
has meant that those using rpcgen in their applications
need to take a more active role in providing the facilities that
cannot be expressed within XDR.
</t>
</section>

</section>

<section
 title="Protocol Version Negotiation"
 anchor="section:86d76c8e-8954-40dc-b94c-8fb7fa9ec86f">
<t>
When an RPC-over-RDMA version 2 client establishes a
connection to a server, its first order of business is to
determine the server's highest supported protocol version.
</t>
<t>
As with RPC-over-RDMA version 1,
upon connection establishment a client
MUST NOT send more than a single RPC-over-RDMA message at a
time until it receives a valid non-error RPC-over-RDMA message
from the server that grants client credits.
</t>
<t>
The second word of each transport header is used to convey
the transport protocol version.
In the interest of simplicity, we refer to that word as
rdma_vers even though in the RPC-over-RDMA version 2
XDR definition it is described as rdma_start.rdma_vers.
</t>
<t>
First, the client sends a single valid RPC-over-RDMA message
with the value two (2) in the rdma_vers field.
Because the server might support only RPC-over-RDMA
version 1, this initial message can be no larger than the
version 1 default inline threshold of 1024 bytes.
</t>

<section
 title="Server Does Support RPC-over-RDMA Version 2"
 anchor="section:8db4c54e-c1ce-43ba-93b4-031e829960f5">
<t>
If the server does support RPC-over-RDMA version 2,
it sends RPC-over-RDMA messages back to the client
with the value two (2) in the rdma_vers field.
Both peers may use the default inline threshold value
for RPC-over-RDMA version 2 connections (4096 bytes).
</t>
</section>

<section
 title="Server Does Not Support RPC-over-RDMA Version 2"
 anchor="section:bedc4e66-4295-4dd6-8ac9-dd06907a08ad">
<t>
If the server does not support RPC-over-RDMA version 2,
it MUST send an RPC-over-RDMA message to the client with the
same XID, with RDMA2_ERROR in the rdma_start.rdma_htype field,
and with the error code RDMA2_ERR_VERS.
This message also reports a range of protocol versions that
the server supports.
To continue operation, the client selects a protocol
version in the range of server-supported versions for
subsequent messages on this connection.
</t>
<t>
If the connection is lost immediately after an
RDMA2_ERROR / RDMA2_ERR_VERS message is received,
a client can avoid a possible version negotiation loop
when re-establishing another connection by assuming
that particular server does not support RPC-over-RDMA version 2.
A client can assume the same situation (no server support
for RPC-over-RDMA version 2) if the initial negotiation message
is lost or dropped.
Once the negotiation exchange is complete,
both peers may use the default inline threshold value
for the transport protocol version that has been selected.
</t>
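<t>
The client side of this exchange can be sketched in Python as follows.
This is a non-normative illustration: the function names peek_version and
select_version are hypothetical, and choosing the highest mutually
supported version is an assumption of the sketch, since this document
requires only that the client select a version within the range
reported by the server.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
import struct


def peek_version(header):
    """Return rdma_vers, the second 32-bit word of a transport header."""
    return struct.unpack_from(">I", header, 4)[0]


def select_version(client_versions, vers_low, vers_high):
    """Pick a version from the range reported with RDMA2_ERR_VERS.

    Returns the highest version the client supports that lies within
    [vers_low, vers_high], or None when there is no common version.
    """
    usable = [v for v in client_versions if vers_low <= v <= vers_high]
    return max(usable) if usable else None
```
]]></artwork>
</figure>
A client that supports versions 1 and 2 and receives a reported range
of 1 through 1 would continue with version 1;
a client that supports only version 2 would find no usable version.
</t>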
</section>

<section
 title="Client Does Not Support RPC-over-RDMA Version 2"
 anchor="section:1ffe4c69-b516-476a-bba7-41863709f48d">
<t>
If the server supports the RPC-over-RDMA protocol version
used in Call messages from a client,
it MUST send Replies with the same RPC-over-RDMA protocol version
that the client uses to send its Calls.
The client MUST NOT change the version for the duration of the connection.
</t>
</section>

</section>

<section
 title="Differences from the RPC-over-RDMA Version 1 Protocol"
 anchor="section:c2574344-5aec-427d-a5ed-048d7fcc0d95">
<t>
This section describes the substantive changes made in
RPC-over-RDMA version 2,
as opposed to the structural changes to enable extensibility,
which are discussed in
<xref target="section:d945b9f0-0666-4db7-9126-be57cf7b5f4f"/>.
</t>

<section
 title="Transport Properties"
 anchor="section:630314a8-1cf5-40f7-a5ad-5bc12c719233">
<t>
RPC-over-RDMA version 2 provides a mechanism for
exchanging the transport's operational properties.
This mechanism allows connection endpoints to communicate the properties
of their implementation at connection setup.
The mechanism could be expanded to enable an endpoint to request changes
in properties of the other endpoint and to notify peer endpoints of
changes to properties that occur during operation.
Transport properties are described in
<xref target="section:86248e99-ca60-478a-8aff-3fb387410077"/>.
</t>
</section>

<section
 title="Credit Management Changes"
 anchor="section:5A2E5FF8-0F0B-454D-B9B3-C6773CD77780">
<t>
RPC-over-RDMA transports employ credit-based flow control
to ensure that a requester does not emit more RDMA Sends
than the responder is prepared to receive.
Section 3.3.1 of
<xref target="RFC8166"/>
explains the purpose and operation
of RPC-over-RDMA version 1 credit management in detail.
</t>
<t>
In the RPC-over-RDMA version 1 design,
each RDMA Send from a requester contains an RPC Call with a credit request,
and each RDMA Send from a responder contains an RPC Reply with a credit grant.
The credit grant implies that enough Receives have been posted on
the responder to handle the credit grant minus the number of pending
RPC transactions (the number of remaining Receive buffers might be zero).
</t>
<t>
In other words, each RPC Reply acts as an implicit ACK
for a previous RPC Call from the requester,
indicating that the responder has posted a Receive to replace
the Receive consumed by the requester's RDMA Send.
Without an RPC Reply message, the requester has no way to know
that the responder is properly prepared for subsequent RPC Calls.
</t>
<t>
This arrangement is a bit of a layering violation.
There are also basic (but rare) cases where it is inadequate:
<list style="symbols">
<t>
When a requester retransmits an RPC Call on the same connection
as an earlier RPC Call for the same transaction.
</t>
<t>
When a requester transmits an RPC operation that requires no reply.
</t>
<t>
When more than one RPC-over-RDMA message is needed to complete the
transaction (e.g., RDMA_DONE).
</t>
</list>
Typically, the connection must be replaced in these cases.
This resets the credit accounting mechanism but has an undesirable impact
on other ongoing RPC transactions on that connection.
</t>
<t>
Because credit management accompanies each RPC message,
there is a strict one-to-one ratio between RDMA Send and RPC message.
There are interesting use cases that might be enabled if this relationship
were more flexible:
<list style="symbols">
<t>
RPC-over-RDMA operations which do not carry an RPC message;
e.g., control plane operations.
</t>
<t>
A single RDMA Send that conveys more than one RPC message
for the purpose of interrupt mitigation.
</t>
<t>
An RPC message that is conveyed via several sequential RDMA Sends
to reduce the use of explicit RDMA operations for moderate-sized RPC messages.
</t>
<t>
An RPC transaction that needs multiple exchanges
or an odd number of RPC-over-RDMA operations
to complete.
</t>
</list>
Bi-directional RPC operation also introduces an ambiguity.
If the RPC-over-RDMA message does not carry an RPC message, then
it is not possible to determine whether the sender is a requester
or a responder, and thus whether the rdma_credit field contains
a credit request or a credit grant.
</t>
<t>
A more sophisticated credit accounting mechanism is provided in
RPC-over-RDMA version 2 in an attempt to address some of these shortcomings.
This new mechanism is detailed in Section TBD.
</t>
</section>

<section
 title="Inline Threshold Changes"
 anchor="section:F7FB5108-58EA-4718-84BE-5119A302F5F5">
<t>
The term "inline threshold" is defined in Section 3.3.2 of
<xref target="RFC8166"/>.
An "inline threshold" value is the largest message size (in octets)
that can be conveyed on an RDMA connection using only RDMA Send and Receive.
Each connection has two inline threshold values: one for messages
flowing from
client-to-server (referred to as the "client-to-server inline threshold")
and one for messages flowing from server-to-client
(referred to as the "server-to-client inline threshold").
Note that
<xref target="RFC8166"/>
uses somewhat different terminology.
This is because it was written
with only forward-direction RPC transactions in mind.
</t>
<t>
A connection's inline thresholds determine when RDMA Read or
Write operations are required because the RPC message to be
sent cannot be conveyed via a single RDMA Send and Receive pair.
When such an RPC message does not contain DDP-eligible data items,
a requester prepares a Long Call or Reply to convey the whole
RPC message using RDMA Read or Write operations.
</t>
<t>
RDMA Read and Write operations require that each data payload
resides in a region of memory that is registered with the RNIC.
When an RPC is complete, that region is invalidated, fencing it
from the responder.
Memory registration and invalidation typically have a latency cost
that is insignificant compared to data handling costs.
When a data payload is small, however, the cost of registering and
invalidating the memory where the payload resides becomes
a relatively significant part of total RPC latency.
Therefore the most efficient operation of RPC-over-RDMA occurs
when explicit RDMA Read and Write operations are used for large payloads,
and are avoided for small payloads.
</t>
<t>
When RPC-over-RDMA version 1 was conceived, the typical size
of RPC messages that did not involve a significant data payload
was under 500 bytes.
A 1024-byte inline threshold adequately minimized the frequency
of inefficient Long Calls and Replies.
</t>
<t>
With NFS version 4.1
<xref target="RFC5661"/>,
the increased size of NFS COMPOUND operations
resulted in RPC messages that are on average larger
and more complex than those of previous versions of NFS.
With 1024-byte inline thresholds, RDMA Read or Write operations
are needed for frequent operations that do not bear a data payload,
such as GETATTR and LOOKUP,
reducing the efficiency of the transport.
</t>
<t>
To reduce the need to use Long Calls and Replies, RPC-over-RDMA
version 2 increases the default size of inline thresholds.
This also increases the maximum size of reverse-direction
RPC messages.
</t>
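<t>
The effect of the larger default can be illustrated with a short,
non-normative Python sketch.
The function name fits_inline is hypothetical.
The two default values are the ones cited in the Protocol Version
Negotiation section of this document, and the message size is assumed
to cover the complete RPC-over-RDMA message carried by one RDMA Send.
<figure align="left">
<artwork xml:space="preserve" align="left"><![CDATA[
```python
V1_DEFAULT_INLINE_THRESHOLD = 1024  # bytes, RPC-over-RDMA version 1
V2_DEFAULT_INLINE_THRESHOLD = 4096  # bytes, RPC-over-RDMA version 2


def fits_inline(message_len, inline_threshold):
    """True when a single RDMA Send can carry the whole message."""
    return message_len <= inline_threshold
```
]]></artwork>
</figure>
Under these assumptions, a 1500-byte message with no DDP-eligible data
items requires a Long Call or Reply at the version 1 default,
but can be sent inline at the version 2 default.
</t>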
</section>

<section
 title="Support for Remote Invalidation"
 anchor="section:57C034D6-7129-4F7B-B8DF-31E8BC691964">
<t>
An STag that is registered using
the FRWR mechanism in a privileged execution context
or is registered via a Memory Window in an unprivileged context
may be invalidated remotely
<xref target="RFC5040"/>.
These mechanisms are available when a requester's
RNIC supports MEM_MGT_EXTENSIONS.
</t>
<t>
For the purposes of this discussion, there are two classes of STags.
Dynamically-registered STags are used in a single RPC, then invalidated.
Persistently-registered STags live longer than one RPC.
They may persist for the life of an RPC-over-RDMA connection, or longer.
</t>
<t>
An RPC-over-RDMA requester may provide more than one STag
in one transport header.
It may provide a combination of dynamically- and
persistently-registered STags in one RPC message, or
any combination of these in a series of RPCs on the same connection.
Only dynamically-registered STags using Memory Windows
or FRWR (i.e., registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.
</t>
<t>
There is no transport-level mechanism by which a responder can determine
how a requester-provided STag was registered, nor whether it is
eligible to be invalidated remotely.
A requester that mixes persistently- and dynamically-registered STags in one RPC,
or mixes them across RPCs on the same connection,
must therefore indicate which handles may be invalidated via a mechanism
provided in the Upper Layer Protocol.
RPC-over-RDMA version 2 provides such a mechanism.
</t>
<t>
The RDMA Send With Invalidate operation is used to invalidate an
STag on a remote system.
It is available only when a responder's RNIC supports MEM_MGT_EXTENSIONS,
and must be utilized only when a requester's RNIC supports MEM_MGT_EXTENSIONS
(can receive and recognize an IETH).
</t>
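<t>
The conditions above can be sketched informally in Python as follows.
This is an illustration under stated assumptions, not a definitive
implementation: the Segment type, its may_invalidate hint, and the
function stag_to_invalidate are hypothetical stand-ins for whatever
mechanism the requester uses to mark handles as eligible.
The sketch returns at most one STag, on the assumption that a single
RDMA Send With Invalidate names one STag.
</t>
<figure>
<artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Segment:
    stag: int
    may_invalidate: bool  # hypothetical hint set by the requester


def stag_to_invalidate(
    segments: Sequence[Segment],
    responder_has_ext: bool,
    requester_has_ext: bool,
) -> Optional[int]:
    """Return an STag for Send With Invalidate, or None for Send."""
    # Both RNICs need MEM_MGT_EXTENSIONS support.
    if not (responder_has_ext and requester_has_ext):
        return None
    # Only handles the requester marked eligible may be chosen.
    for seg in segments:
        if seg.may_invalidate:
            return seg.stag
    return None


segs = [Segment(0x10, False), Segment(0x20, True)]
chosen = stag_to_invalidate(segs, True, True)
```
]]></artwork>
</figure>
<t>
When the function returns no STag, the responder in this sketch falls
back to an ordinary RDMA Send and the requester invalidates its
STags locally.
</t>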

<section
 title="Reverse Direction Remote Invalidation"
 anchor="section:750A8D5E-1D5A-4FD7-AFA3-E156561D20E5">
<t>
Existing RPC-over-RDMA transport protocol specifications
<xref target="RFC8166"/>
<xref target="RFC8167"/>
do not forbid direct data placement in the reverse direction,
even though there is currently no Upper Layer Protocol that
makes data items in reverse direction operations eligible
for direct data placement.
</t>
<t>
When chunks are present in a reverse direction RPC request,
Remote Invalidation allows the responder
to trigger invalidation of a requester's STags as part of sending a reply,
the same way as is done in the forward direction.
</t>
<t>
However, in the reverse direction,
the server acts as the requester,
and the client is the responder.
The server's RNIC, therefore, must support receiving an IETH,
and the server must have registered the STags
with an appropriate registration mechanism.
</t>
</section>

</section>

<section
 title="Error Reporting Changes"
 anchor="section:E554DF42-6E82-4E28-96A0-8F9872EB476C">
<t>
RPC-over-RDMA version 2 expands the repertoire of errors that
may be reported by connection endpoints.
This change, which is structured to enable extensibility,
allows a peer to report overruns of specific resources
and to avoid requester retries when an error is permanent.
</t>
</section>

</section>

<section
 title="Extending the Version 2 Protocol"
 anchor="section:84E1FFC4-D916-4EB4-9FD8-A8218D084503">
<t>
RPC-over-RDMA version 2 is designed to be extensible
in a way that enables the addition of OPTIONAL features
that may subsequently be converted to REQUIRED status
in a future protocol version.
The protocol may be extended by Standards Track documents
in a way analogous to that provided for Network File
System Version 4 as described in
<xref target="RFC8178"/>.
</t>
<t>
This form of extensibility enables limited extensions
to the base RPC-over-RDMA version 2 protocol presented
in this document so that new optional capabilities
can be introduced without a protocol version change,
while maintaining robust interoperability
with existing RPC-over-RDMA version 2 implementations.
The design allows extensions to be
defined, including the definition of new protocol elements, without
requiring modification or recompilation of the existing XDR.
</t>
<t>
A Standards Track document introduces each set of such protocol elements.
Together these elements are considered an OPTIONAL feature.
Each implementation is either aware of all the protocol
elements introduced by that feature or is aware of none of them.
</t>
<t>
Documents describing extensions to RPC-over-RDMA version 2 should
contain:
<list style="symbols">
<t>
An explanation of the purpose and use of each new protocol element added.
</t>
<t>
An XDR description including all of the new protocol elements,
and a script to extract it.
</t>
<t>
A description of interactions with existing extensions.
<vspace blankLines="1"/>
This includes possible requirements of other OPTIONAL features
to be present for new protocol elements to work,
or that a particular level of support
for an OPTIONAL facility is required for the new extension to work.
</t>
</list>
</t>
<t>
Implementers combine the XDR descriptions of the new features they
intend to use with the XDR description of the base protocol in this
document.
This may be necessary to create a valid XDR input file
because extensions are free to use XDR types defined in the base
protocol, and later extensions may use types defined by earlier
extensions.
</t>
<t>
The XDR description for the RPC-over-RDMA version 2 base protocol
combined with that for any selected extensions
should provide an adequate human-readable description
of the extended protocol.
</t>
<t>
The base protocol specified in this document may be extended within
RPC-over-RDMA version 2 in two ways:
<list style="symbols">
<t>
New OPTIONAL transport header types may be introduced by later
Standards Track documents.
Such transport header types will be documented as described in
<xref target="section:D4650151-40F0-4E85-8755-02C38CF8F444"/>.
</t>
<t>
New OPTIONAL transport properties may be defined in later
Standards Track documents.
Such transport properties will be documented as described in
<xref target="section:A355ADAD-F03B-41A6-94A8-4128B10301BB"/>.
</t>
</list>
</t>
<t>
The following sorts of ancillary protocol elements may be added
to the protocol to support the addition of new transport properties
and header types.
<list style="symbols">
<t>
New error codes may be created as described in
<xref target="section:C2F5E937-D612-4E3A-A380-E0B15261F6A0"/>.
</t>
<t>
New flags to use within the rdma_flags field may be created as described in
<xref target="section:15C76685-A24D-4799-BC8A-311D03AC3510"/>.
</t>
</list>
</t>
<t>
New capabilities can be proposed and developed independently of each other,
and implementers can choose among them.
This makes it straightforward to create and document experimental features
and then bring them through the standards process.
</t>

<section
 title="Adding New Header Types to RPC-over-RDMA Version 2"
 anchor="section:D4650151-40F0-4E85-8755-02C38CF8F444">
<t>
New transport header types are to be defined in a manner similar to
the way existing ones are described in Sections
<xref target="section:9af0d451-2ef3-454f-adb9-827664ccc39c"/>
through
<xref target="section:07e8c178-62df-46a7-a57e-dcf107821d93"/>.
Specifically, what is needed is:
<list style="symbols">
<t>
A description of the function and use of the new header type.
</t>
<t>
A complete XDR description of the new header type including a description
of the use of all fields within the header.
</t>
<t>
A description of how errors are reported, including the definition
of a mechanism for reporting errors when the error is outside the
choices already available in the base protocol or in
other existing extensions.
</t>
<t>
An indication of whether a Payload stream must be present,
and a description of its contents and how such payload streams
are used to construct RPC messages for processing.
</t>
</list>
</t>
<t>
In addition, the OPTIONAL status of new transport header types
makes further documentation necessary.
<list style="symbols">
<t>
Information about constraints on support for the new header types should
be provided.
For example, if support for one header type is implied
or foreclosed by another one,
this needs to be documented.
</t>
<t>
A preferred method by which a sender should determine whether the peer
supports a particular header type needs to be provided.
While it is always possible for a sender to send a test invocation
of a particular header type to see if support is available,
when more efficient means are available
(e.g., the value of a transport property),
this should be noted.
</t>
</list>
</t>
</section>

<section
 title="Adding New Transport Properties to the Protocol"
 anchor="section:A355ADAD-F03B-41A6-94A8-4128B10301BB">
<t>
The set of transport properties is designed to be extensible.
As a result, once new properties are defined in standards track documents,
the operations defined in this document may reference these new
transport properties, as well as the ones described in this document.
</t>
<t>
A standards track document defining a new transport property should
include the following information paralleling that provided in this
document for the transport properties defined herein.
<list style="symbols">
<t>
The rpcrdma2_propid value used to identify this property.
</t>
<t>
The XDR typedef specifying the form in which the property value is
communicated.
</t>
<t>
A description of the transport property that is communicated by
the sender of RDMA2_CONNPROP.
</t>
<t>
An explanation of how this knowledge could be used by the
peer receiving this information.
</t>
</list>
</t>
<t>
Transport property structures are defined so that unique values
are easy to assign.
There is no requirement that a contiguous set of values be used, and
implementations should not rely on all such values being small integers.
A unique value should be selected when the defining document is first
published as an Internet-Draft.
When the document becomes a standards track document,
the working group should ensure that:
<list style="symbols">
<t>
rpcrdma2_propid values specified in the document do not conflict
with those currently assigned or in use by other pending working
group documents defining transport properties.
</t>
<t>
rpcrdma2_propid values specified in the document do not conflict
with the range reserved for experimental use, as defined in
Section 8.2.
</t>
</list>
</t>
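<t>
Because rpcrdma2_propid values need not be contiguous or small,
an implementation might look properties up by key
rather than by array index.
The Python sketch below illustrates this approach informally.
The identifiers PROP_EXAMPLE_A and PROP_EXAMPLE_B, their numeric values,
and the function find_property are hypothetical and are not assigned
by this document.
</t>
<figure>
<artwork><![CDATA[
```python
from typing import Dict, Optional

# Hypothetical propid values, deliberately sparse and large.
PROP_EXAMPLE_A = 1
PROP_EXAMPLE_B = 0x4E460001


def find_property(props: Dict[int, bytes],
                  propid: int) -> Optional[bytes]:
    """Look up an encoded property value by its propid.

    A mapping works for any 32-bit propid, whereas a list indexed
    by propid would assume small, densely packed values.
    """
    return props.get(propid)


received = {
    PROP_EXAMPLE_A: b"\x00\x00\x10\x00",
    PROP_EXAMPLE_B: b"\x00\x00\x00\x01",
}
value_b = find_property(received, PROP_EXAMPLE_B)
```
]]></artwork>
</figure>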
<t>
Documents defining new properties fall into a number of categories.
<list style="symbols">
<t>
Those defining new properties and explaining (only) how they
affect use of existing message types.
</t>
<t>
Those defining new OPTIONAL message types and new properties
applicable to the operation of those new message types.
</t>
<t>
Those defining new OPTIONAL message types and new properties
applicable both to new and existing message types.
</t>
</list>
</t>
<t>
When additional transport properties are proposed, the review of the
associated standards track document should deal with possible
security issues raised by those new transport properties.
</t>
</section>

<section
 title="Adding New Error Codes to the Protocol"
 anchor="section:C2F5E937-D612-4E3A-A380-E0B15261F6A0">
<t>
New error codes to be returned when using new header types
may be introduced in the same Standards Track document
that defines the new header type.
[ cel: what about adding a new error code that is returned
 for an existing header type? ]
</t>
<t>
For error codes that do not require that additional error
information be returned with them,
the existing RDMA_ERR2 header can be used to report the new error.
The new error code is set as the value of rdma_err, with
the result that the default switch arm of the rpcrdma2_error union
(i.e., void) is selected.
</t>
<t>
For error codes that do require the return of additional error-related
information together with the error, a new header type should be defined
for the purpose of returning the error together with needed additional
information.
It should be documented just like any other new header type.
</t>
<t>
When a new header type is sent, the sender needs to be prepared to accept
header types necessary to report associated errors.
</t>
</section>

<section
 title="Adding New Header Flags to the Protocol"
 anchor="section:15C76685-A24D-4799-BC8A-311D03AC3510">
<t>
There are currently thirty-one flags available for later assignment.
One possible use for such flags would be in a later protocol version,
should that version retain the same general header structure as version 2.
</t>
<t>
In addition, it is possible to assign unused flags within
extensions made to version 2, as long as the following practices are adhered to:
<list style="symbols">
<t>
Flags should not be added to the flag word in the prefix structure if those
flags only apply to a single header type.
New flags should only be defined for conditions applying to multiple header types.
</t>
<t>
The document defining the new flag should indicate for which header types
the flag value is meaningful and for which header types it is an error to set
the flag or to leave it unset.
</t>
<t>
The sender needs to be provided with a means to determine whether the receiver
is prepared to receive transport headers with the new flag set.
This is most likely to take the form of a transport property
together with the definition of suitable defaults
to use when that property is not supported.
Another possibility is to REQUIRE that receivers
supporting a particular header type also support a set of additional flags.
</t>
</list>
</t>
</section>

</section>

<section
 title="Relationship to other RPC-over-RDMA Versions"
 anchor="section:e8a03b68-4e05-4f06-b907-9817715a348f">

<section
 title="Relationship to RPC-over-RDMA Version 1"
 anchor="section:d945b9f0-0666-4db7-9126-be57cf7b5f4f">
<t>
In addition to the substantive protocol changes discussed in
<xref target="section:c2574344-5aec-427d-a5ed-048d7fcc0d95"/>,
there are a number of structural XDR changes whose goal
is to enable within-version protocol extensibility.
</t>
<t>
The RPC-over-RDMA version 1 transport header is defined as a single XDR object,
with an RPC message proper potentially following it.
In RPC-over-RDMA version 2, as described in
<xref target="section:417e749e-efec-455f-aae7-12535b9ee8dc"/>,
there are separate XDR definitions of the transport header prefix
(see
<xref target="section:2d1735f0-c465-43c6-9c18-3da6b7979862"/>),
which specifies the transport header type to be used,
and the specific transport header, defined within one of the subsections of
<xref target="section:eef6a22e-2633-44a2-a8f0-821fec8bf824"/>.
This is similar to the way that an RPC message consists of
an RPC header (defined in
<xref target="RFC5531"/>)
and an RPC request or reply,
defined by the Upper Layer Protocol being conveyed.
</t>
<t>
As a new version of the RPC-over-RDMA transport protocol,
RPC-over-RDMA version 2 exists within the versioning rules defined in
<xref target="RFC8166"/>.
In particular, it maintains the first four words of the protocol header
as sent and received,
as specified in Section 4.2 of
<xref target="RFC8166"/>,
even though, as explained in
<xref target="section:e21d4f74-b536-47f2-9d07-c03a27a20de4"/>
of this document,
the XDR definition of those words is structured differently.
</t>
<t>
Although each of the first four words retains its semantic function,
there are important differences of field interpretation, besides the fact
that the words have different names and different roles within the XDR
constructs of which they are parts.
<list style="symbols">
<t>
The first word of the header, previously the rdma_xid field,
retains the format and function that it had in
RPC-over-RDMA version 1.
Within RPC-over-RDMA version 2,
this word is the rdma_xid field of the structure rdma_start.
However, to accommodate the use of request-response
pairing of non-RPC messages and the potential use of message continuation,
it cannot be assumed that it will always have the same value
it would have had in RPC-over-RDMA version 1.
As a result, the contents of this field should not be used
without consideration of the associated protocol version identification.
</t>
<t>
The second word of the header, previously the rdma_vers field,
retains the format and function that it had in RPC-over-RDMA version 1.
Within RPC-over-RDMA version 2,
this word is the rdma_vers field of the structure rdma_start.
To clearly distinguish version 1 and version 2 messages,
senders MUST fill in the correct version (fixed after version negotiation)
and receivers MUST check that the content of the rdma_vers field is correct
before referencing any other header field.
</t>
<t>
The third word of the header, previously the rdma_credit field,
retains the format and general purpose that it had in RPC-over-RDMA version 1.
Within RPC-over-RDMA version 2,
this word is the rdma_credit field of the structure rdma_start.
The RPC-over-RDMA version 2 protocol provides additional mechanisms
that determine whether the value contained in this field
is a credit request or grant.
Also, the way in which credits are accounted for may be different
in RPC-over-RDMA version 2.
</t>
<t>
The fourth word of the header,
previously the union discriminator field rdma_proc,
retains its format and general function even though
the set of valid values has changed.
The value of this field is now considered an unsigned 32-bit integer
rather than an enum.
Within RPC-over-RDMA version 2,
this word is the rdma_htype field of the structure rdma_start.
</t>
</list>
</t>
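<t>
The following Python sketch illustrates, informally, a receiver that
examines the rdma_vers word before interpreting the other three words.
It assumes only what is stated above: four consecutive 32-bit XDR words
named rdma_xid, rdma_vers, rdma_credit, and rdma_htype.
The StartWords type, the decode_start function, and the sample field
values are hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
import struct
from typing import NamedTuple


class StartWords(NamedTuple):
    rdma_xid: int
    rdma_vers: int
    rdma_credit: int
    rdma_htype: int


def decode_start(buf: bytes, negotiated_vers: int = 2) -> StartWords:
    """Decode the first four XDR words of a transport header."""
    if len(buf) < 16:
        raise ValueError("short header")
    # Examine rdma_vers (second word) before any other field.
    (vers,) = struct.unpack_from("!I", buf, 4)
    if vers != negotiated_vers:
        raise ValueError(f"unexpected rdma_vers {vers}")
    # XDR encodes each word as a 4-byte big-endian integer.
    return StartWords(*struct.unpack_from("!4I", buf, 0))


# Sample header: the xid, credit, and htype values are arbitrary.
hdr = struct.pack("!4I", 0x1234, 2, 16, 0)
start = decode_start(hdr)
```
]]></artwork>
</figure>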
<t>
Beyond conforming to the restrictions specified in
<xref target="RFC8166"/>,
RPC-over-RDMA version 2 tightly limits the scope
of the changes made in order to ensure interoperability.
It makes no major structural changes to the protocol,
and all existing transport header types used in version 1
(as defined in
<xref target="RFC8166"/>)
are retained in version 2.
Chunks are expressed using the same on-the-wire format and
are used in the same way in both versions.
</t>
</section>

<section
 title="Extensibility Beyond RPC-over-RDMA Version 2"
 anchor="section:E965E6CE-1B64-4579-A8AB-B960807F15C4">
<t>
Subsequent RPC-over-RDMA versions are free to change the protocol
in any way they choose as long as they maintain the first four header words
as currently specified by
<xref target="RFC8166"/>.
</t>
<t>
Such changes might involve deletion or major re-organization of
existing transport headers.
However, the need for interoperability between adjacent versions
will often limit the scope of changes that can be made in a single version.
</t>
<t>
In some cases it may prove desirable to transition to a new version
by using the extension features described for use with RPC-over-RDMA version 2,
by continuing the same basic extension model but allowing header types
and properties that were OPTIONAL in one version to become REQUIRED
in the subsequent version.
</t>
</section>

</section>

<section
 title="Security Considerations"
 anchor="section:912A2C09-95EC-4CB6-AA2B-2245726D9EDF">
<t>
The security considerations for RPC-over-RDMA version 2
are the same as those for RPC-over-RDMA version 1.
</t>

<section
 title="Security Considerations (Transport Properties)"
 anchor="section:3B0E673B-98D7-436D-BD6F-180180503DF6">
<t>
Like other fields that appear in each RPC-over-RDMA header,
property information is sent in the clear on the fabric
with no integrity protection, making it vulnerable to
man-in-the-middle attacks.
</t>
<t>
For example, if a man-in-the-middle were to change the value of
the Receive buffer size or the Requester Remote Invalidation boolean,
it could reduce connection performance or trigger loss of connection.
Repeated connection loss can impact performance or even prevent a
new connection from being established. Recourse is to deploy on a
private network or use link-layer encryption.
</t>
</section>

</section>

<section
 title="IANA Considerations"
 anchor="section:D235C884-6463-411F-BA34-6BCC82AB7A9F">
<t>
This document does not require actions by IANA.
</t>
</section>

</middle>

<back>

<references title="Normative References">
<?rfc include="reference.RFC.2119.xml"?>
<?rfc include="reference.RFC.4506.xml"?>
<?rfc include="reference.RFC.5531.xml"?>
<?rfc include="reference.RFC.8166.xml"?>
<?rfc include="reference.RFC.8174.xml"?>
</references>

<references title="Informative References">

<reference
 anchor="IBARCH"
 target="http://www.infinibandta.org/content/pages.php?pg=technology_download">
<front>
<title>InfiniBand Architecture Specification Volume 1</title>
<author>
<organization>InfiniBand Trade Association</organization>
</author>
<date month='March' year='2015'/>
</front>
<seriesInfo name='Release' value='1.3'/>
</reference>

<?rfc include="reference.RFC.5040.xml"?>
<?rfc include="reference.RFC.5041.xml"?>
<?rfc include="reference.RFC.5661.xml"?>
<?rfc include="reference.RFC.5662.xml"?>
<?rfc include="reference.RFC.7530.xml"?>
<?rfc include="reference.RFC.8167.xml"?>
<?rfc include="reference.RFC.8178.xml"?>
</references>

<section
 title="Acknowledgments"
 anchor="section:7b212a81-9c2a-4c05-891a-369cc7184585"
 numbered="no">
<t>
The authors gratefully acknowledge the work of Brent Callaghan
and Tom Talpey on the original RPC-over-RDMA version 1 specification
(RFC 5666).
The authors also wish to thank
Bill Baker, Greg Marsden, and Matt Benjamin
for their support of this work.
</t>
<t>
The XDR extraction conventions were
first described by the authors of the NFS version 4.1
XDR specification
<xref target="RFC5662"/>.
Herbert van den Bergh suggested the replacement sed
script used in this document.
</t>
<t>
Special thanks go to
Transport Area Director Spencer Dawkins,
NFSV4 Working Group Chairs Spencer Shepler
and
Brian Pawlowski,
and
NFSV4 Working Group Secretary Thomas Haynes
for their support.
</t>
</section>

</back>

</rfc>
