<?xml version="1.0" encoding="UTF-8"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->

]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="3"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc ipr="trust200902"
     category="info" 
     docName="draft-dnoveck-nfsv4-rpcrdma-rtrext-03">
  <front>
    <title abbrev="RPC/RDMA Round-trip Reductions">
      RPC-over-RDMA Extensions to Reduce Internode Round-trips
    </title>
   <author initials='D.' surname='Noveck'
           fullname = 'David Noveck'>
     <organization>
       NetApp
     </organization>
     <address>
       <postal>
         <street>1601 Trapelo Road</street>
         <city>Waltham</city >   
         <region>MA</region>
         <code>02451</code>
         <country>US</country>
       </postal>

       <phone>+1 781 572 8038</phone>
       <email>davenoveck@gmail.com</email>
     </address>
   </author> 
    <date year="2017"/>

    <area>Transport</area>
    <workgroup>Network File System Version 4</workgroup>
    <abstract>
      <t>
        It is expected that a future version of the RPC-over-RDMA transport 
        will
        allow protocol extensions to be defined.  This would provide for 
        the specification of OPTIONAL features allowing
        participants who implement such features to cooperate
        as specified by that extension, while still
        interoperating with participants who do not support 
        that extension.
      </t>
      <t>
        A particular extension is described herein, whose motivation
        is to reduce the latency due to inter-node round-trips needed 
        to effect operations which involve direct data placement or 
        which transfer RPC messages longer than the fixed inline 
        buffer size limit. 
      </t>
    </abstract>
  </front>
  <middle>
    <section title="Preliminaries" anchor="PRELIM">	
      <section title="Requirements Language" anchor="INTRO-req">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", 
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", 
          "MAY", and "OPTIONAL" in this document are to be interpreted 
          as described in <xref target="RFC2119"/>.  
        </t>
      </section>
      <section title="Introduction" anchor="PRELIM-intro">
        <t>
          This document describes a potential extension to the 
          RPC-over-RDMA protocol, which would allow participating
          implementations to have more flexibility in how they use
          RDMA sends and receives to effect necessary transmission
          of RPC requests and replies.
        </t>	
        <t>
          In contrast to existing facilities defined in RPC-over-RDMA
          Version One in which the mapping between RPC messages and
          RPC-over-RDMA messages is strictly one-to-one and placement
	  of bulk data is
          effected only through use of explicit RDMA operations, the
          following features are made available through this extension:
        <list style="symbols">
          <t>
            The ability to effect data placement in the context of a
            single RPC-over-RDMA transmission, rather than requiring 
            explicit RDMA operations to effect the necessary placement.
          </t>
          <t>
            The ability to continue an RPC request or reply over
            multiple RPC-over-RDMA transmissions.
          </t>
        </list>
        </t>	
      </section>
      <section title="Role of this Document" anchor="PRELIM-rols">
        <t>
          This is not a standards-track document, despite the fact that it
          contains many of the sorts of items (e.g. proposed XDR, detailed
          field descriptions) that normally appear in such documents.
        </t>	
        <t>
          Although this document is in the informational category, it is
          not expected to result in an Informational RFC, as the material
          within it is not expected to be of interest to the Internet 
          community in general.  Its target audience is the nfsv4
          working group 
          itself, and it is not expected to evolve into an RFC. 
 	
        </t>	
        <t>
          The function of this document is essentially exploratory, 
          in that it looks at 
          a number of possible ways that the RPC-over-RDMA 
          transport could be extended.
          Although many of these might well be followed up on eventually
          with standards-track documents, it should not be assumed that all
          will or that the relation among the various elements of any
          extension to address these issues will be the same as laid out
          here.
        </t>	
      </section>

      <section title="Prerequisites" anchor="PREREQ">
        <t>
          This document is written assuming that certain underlying 
          facilities will be made available to build upon, in the context of 
          a future version of RPC-over-RDMA.  It is most likely 
          that such facilities will be first available in Version Two of
          RPC-over-RDMA.
          

        <list style="symbols">
          <t>
            A protocol extension mechanism is needed to enable the extensions
            to RPC-over-RDMA described here.
          <vspace blankLines="1"/>
            This document is currently written to conform to the extension
            model for the proposed RPC-over-RDMA Version Two as described in
            <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>. 	
          </t>
          <t>
            An existing means of communicating transport properties 
            between the RPC-over-RDMA endpoints is assumed.
          <vspace blankLines="1"/>
            This document is currently written assuming that the transport 
            property model defined in 
            <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>
            will be available and can be extended to meet the needs of this
            extension.	
          </t>
        </list>
        </t>	
        <t>
          As the document referred to above is currently a personal Internet
          Draft, and subject to change, adjustments to this document
          are expected to be necessary when and if the needed facilities
          are defined in one or more working group documents leading to 
          the potential publication of Standards-track RFCs.	
        </t>	
        <t>
          Such an RFC for a new RPC-over-RDMA version might differ from	
          <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/> in significant
          ways even if it
          provides the prerequisites listed above.  For example, 
        <list style="symbols">
          <t>
            The extension model might be significantly different.  For 
            example, it might use an approach more like that used in 
            <xref target="RFC8178"/> rather than using a single message
            type as a vehicle for OPTIONAL extensions.
          </t>
          <t>
            There is the possibility of significant change in the credit 
            model.  While <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/> 
            contains support for one-way messages, much of the text 
            regarding credits is inherited from <xref target="RFC8166"/>
            which assumes a one-to-one mapping between requests and responses.
            It is not clear whether this mismatch will be resolved by changing
            (only) the description of the credit mechanism or whether a more
            basic protocol change is required.  Whichever approach is taken,
            the treatment of message continuation is likely to follow.    
          </t>
        </list>
       </t>	
      </section>

      <section title="Participant Terminology" anchor="PRELIM-term">
        <t>
          A number of different terms are used regarding the roles of the
          two participants in an RPC-over-RDMA connection.  Some of these
          roles last for the duration of a connection while others vary 
          from request to request or from
          message to message.
        </t>	        
        <t>
          The roles of the client and server are fixed for the lifetime of
          the connection, with the client defined as the endpoint which 
          initiated the connection. 	
        </t>	
        <t>
          The roles of requester and responder often parallel those of
          client and server, although this is not always the case.  
          Most requests are made in the forward direction, in which the client
          is the requester and the server is the responder.  However,
          backward direction requests are possible, in which case the server
          is the requester and the client is the responder.  As a result 
          clients and servers may both act as requesters and responders
          for different requests issued on the same connection.
        </t>	
        <t>
          The roles of sender and receiver vary from message to message.  With 
          regard to the messages described in this document, the sender
          may act as a requester by sending RPC requests, as a responder 
          by sending RPC replies, or as both at the same time by
          sending a mix of the two.  	
        </t>	
      </section>
    </section>
    <section title="Extension Overview" anchor="OVIEW">
      <t>
        This extension is intended to function as part of
        RPC-over-RDMA and implementations should successfully interoperate 
        with existing RPC-over-RDMA Version One implementations.
        Nevertheless, this extension seeks to take a somewhat different
        approach to high-performance RPC operation than has been used
        previously in that it seeks to de-emphasize the use of explicit RDMA
        operations.  It does this in two ways:
      <list style="symbols">
        <t>
          By implementing a send-based form of data placement 
          (see <xref target="DDP" />), use of explicit RDMA operations
          can be avoided in many common cases in which data is to be
          placed at an appropriate location in the receiver's memory.
        </t>
        <t>
          Use of explicit RDMA to support reply chunks and position-zero
          read chunks can be avoided by
          allowing a single message to be split into multiple
          transmissions.  This can be used to avoid many instances of
          the only existing
          use of explicit RDMA operations not associated with 
          Direct Data Placement.
        </t>
      </list>
      </t>
      <t>
        While use of explicit RDMA operations allows the cost of the
        actual data transfer to be
        offloaded from the client and server CPUs to the RNIC, there are
        ancillary costs in setting up the transfer that cannot be ignored.
        As a result, send-based functions are often preferable, since the
        RNIC also uses DMA to effect these operations.  In addition,
        the cost of the additional inter-node round trips required by 
        explicit RDMA operations can be an issue, one that becomes 
        increasingly troublesome as internode distances increase.   Once
        one moves from in-machine-room to campus-wide or metropolitan-area
        distances, the additional 
        round-trip delay of 16 microseconds per mile becomes an issue
        impeding use of explicit RDMA operations.   
      </t>
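      <t>
        The effect of distance on latency can be illustrated with a
        short calculation.  The following is only a sketch: it takes
        the figure of 16 microseconds of round-trip delay per mile
        from the text above, and the function and parameter names are
        hypothetical, introduced purely for illustration.
      </t>
      <figure>
        <artwork><![CDATA[
```python
# Figure taken from the text above: about 16 microseconds of added
# round-trip delay per mile of internode distance.
ROUND_TRIP_US_PER_MILE = 16


def added_delay_us(distance_miles, extra_round_trips):
    """Return the latency, in microseconds, added by extra round trips.

    Both parameters are illustrative: distance_miles is the internode
    distance and extra_round_trips is the number of additional
    round trips an operation incurs (for example, because explicit
    RDMA operations are used).
    """
    return ROUND_TRIP_US_PER_MILE * distance_miles * extra_round_trips
```
]]></artwork>
      </figure>
      <t>
        Under these assumptions, each additional round trip over a
        10-mile link adds 160 microseconds to the latency of an
        operation.
      </t>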
    </section>
    <section title="Data Placement Features" anchor="DDP">
      <t>
      </t>
      <section title="Current Situation" anchor="DDP-cur">
        <t>
          Although explicit RDMA operations are used in the existing 
          RPC-over-RDMA protocol for purposes unrelated to 
          Direct Data Placement, all placement of bulk data is
	  effected using 
          explicit RDMA operations.
        </t>
        <t>
          As a result, many operations requiring placement of
          bulk data
          involve multiple internode round trips.
        </t>
      </section>
      <section title="RDMA_MSGP" anchor="DDP-mgp">
        <t>
          Although this was not stated explicitly, it appears that
          RDMA_MSGP (defined in <xref target="RFC5666"/>, removed from
          RPC-over-RDMA Version One by <xref target="RFC8166"/>), 
          was an early
          attempt to effect correct placement of bulk data within 
          a single RPC-over-RDMA transmission.  
        </t>
        <t>
          As things turned out, the fields within 
          the RDMA_MSGP header were not
          described in <xref target="RFC5666"/> in a way that allowed 
          this message type to be implemented.  
        </t>
        <t>
          In attempting to provide the appropriate data placement 
          functionality, we have to keep in mind
          and avoid the problems that led to failure of RDMA_MSGP. 
          It appears that the problems go deeper than neglecting to write 
          a few relevant sentences.  It is helpful to note that: 
        <list style="symbols">
          <t>
            The inline message size limits eventually adopted were too small
            to allow RDMA_MSGP to be used effectively.  This is true of both 
            the 1K limit in Version One <xref target="RFC8166"/>
            and the 4K limit specified in 
            <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>. 
          <vspace blankLines="1" />
            On the other hand, there is text within <xref target="RFC5667"/>
            that suggests that much longer messages were anticipated at some
            points during the evolution of RPC-over-RDMA.
          </t>
          <t>
            The fact that NFSv4 COMPOUNDs often have additional operations
            beyond the one including the bulk data means that the RDMA_MSGP
            model cannot be extended to NFSv4.  As a result, the bulk data
            needs to be excised from the data stream just as chunks are,
            so that the payload stream can include non-bulk data both
            before and after the logical position of the excised bulk data.
          </t>	
          <t>
            In order for the sender to determine the appropriate amount 
            of padding necessary within a transmission to place the bulk 
            data at the proper
            position within the receive buffer,
            the sender must know more about the structure of
            the receiver's buffers.  Since the padding needs to bring the
            bulk data to a position within the buffer that is appropriate
            to receive the bulk data, the sender needs to know where
            within the receive buffers such placement-eligible areas
	    are located.
          </t>
          <t>
            While appropriate padding could place the bulk data within a
            large WRITE into an appropriately aligned buffer or set of
            buffers, there is no corresponding provision for the 
            bulk data associated with a READ.  In short, there is no
            way to indicate to the responder that it should use RDMA_MSGP
            to appropriately place bulk data in the response.  
          </t>
          <t>
            There is no explicit discussion of the required padding's use
            in effecting proper data placement or connection with the
            ULB's specification of DDP-eligible XDR items.
          </t>
        </list>
        </t>
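        <t>
          The padding computation mentioned in the list above can be
          sketched as follows.  This is a minimal illustration and not
          a definition of any protocol element: the representation of
          the receiver's buffer as a list of (offset, length) pairs
          for placement-eligible areas, and all names used, are
          assumptions made for the example.
        </t>
        <figure>
          <artwork><![CDATA[
```python
def padding_for_placement(prefix_len, bulk_len, eligible_areas):
    """Return the padding that puts bulk data in an eligible area.

    prefix_len is the number of bytes the sender places ahead of the
    bulk data; bulk_len is the size of the bulk data.  eligible_areas
    is a hypothetical list of (offset, length) pairs describing where
    in the receive buffer bulk data may be placed.

    The result is the number of padding bytes needed so that the bulk
    data starts at the first eligible area able to hold it, or None
    if no such area exists.
    """
    for offset, length in sorted(eligible_areas):
        # The area must start at or after the data already sent and
        # must be large enough for the whole bulk data item.
        if offset >= prefix_len and length >= bulk_len:
            return offset - prefix_len
    return None
```
]]></artwork>
        </figure>
        <t>
          For example, with 200 bytes preceding the bulk data and a
          single 8192-byte eligible area at offset 1024, the sketch
          yields 824 bytes of padding for a 4096-byte bulk data item.
          Without knowledge of where such areas are located, the
          sender has no basis for this calculation.
        </t>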
        <t>
          To summarize, RDMA_MSGP was an attempt to properly place bulk data
          that was thought of as a local optimization, and insufficient
          attention was given to it to make it successful.  As a result,
          as RPC-over-RDMA Version One was developed, data 
          placement was identified with the use of explicit RDMA operations
	  providing DDP,
          and the possibility of data placement within sends was not
          recognized.
      </section>
      <section title="Send-based Data Placement" anchor="DDP-send">
        <t>
          In this extension, we describe a more complete way
          to provide send-based data placement, as follows:
        <list style="symbols">
          <t>
            By defining the structure of receive buffers as a transport
            property available to be interrogated by the peer implementation.
          </t>
          <t>
            By treating positioning of bulk data within a message as an
            instance of data placement, causing the bulk data to be excised 
            from 
            the payload XDR stream, as is the case with other forms of 
            bulk data placement (e.g. DDP).
          </t>
          <t>
            By defining new data structures to control placement of 
            bulk data that support both send-based data placement
            and DDP using explicit RDMA operations,
            which was an integral part of RPC-over-RDMA Version One.  These new
            control structures, described in <xref target="DDPS-oview" />,
            are organized differently from the chunk-based structures
            described in <xref target="RFC8166" />.
          </t>
        </list>
        </t>
        <t>
          In making these changes, we will retain certain aspects of the
          DDP model:
        <list style="symbols">
          <t>
	    The set of bulk data items eligible for special data placement
	    is exactly the same as with DDP, as defined by the RPC protocol's
	    upper-layer binding document.
          </t>
          <t>
	    The concept of an inline XDR stream is retained, with specially
	    placed items appearing outside it, but with references to them
	    retained so that the receiver has access
	    to all of the message data.
          </t>
        </list>
        </t>
      </section>
      <section title="Other Extensions Relating to Data Placement"
	       anchor="DDP-other">
        <t>
          In order to support send-based data placement, new
	  placement-related data
          structures have been defined, as described in Sections
          <xref target="DDPS-msg" format="counter" /> and         
          <xref target="DDPS-resp" format="counter" />.
        </t>
        <t>
          These new data structures support both send-based and
          RDMA-operation-based data placement.  In addition, because of the
          restructuring described in <xref target="DDPS-oview"/>,
          a number of additional facilities are made available:
        <list style="symbols">
          <t>
            The ability to restrict entries regarding data placement
	    in response 
            data to XDR data items generated in response to performing
            particular constituent operations
            within a given RPC request (e.g. specific operations
            within an NFSv4 COMPOUND).
          </t>
          <t>
            The ability to make use of special data placement
	    contingent on the actual 
            length of a placement-eligible data item in the response.
          </t>
          <t>
            The ability to specify whether use of data placement
	    for a particular 
            placement-eligible data item is required or optional. 
          </t>
        </list>
        </t>
        <t>
          These additional facilities will be available to implementations
          that do not support send-based data placement,
	  as long as both parties support 
          the OPTIONAL Header types that include these new structures.  For
          more information about the relationships among the new 
          transport properties, operations, and features, see 
          <xref target="ADD" />.
        </t>
      </section>
        

    </section>
    <section title="Message Continuation Feature" anchor="CONT">
      <t>
      </t>
      <section title="Current Situation" anchor="REORG-cur">
        <t>
          Within RPC-over-RDMA Version One <xref target="RFC8166"/>,
          each transmission of a request or reply involves  
          sending a single RDMA send message and conversely 
          each message-related
          transmission involves only a single RPC request or reply.
        </t>
        <t>
          This strict one-to-one model leads to some potential performance
          issues.
        <list style="symbols">
          <t>
            Because of RDMA's use of fixed-size receives, some requests
            and replies
            will inevitably not fit in the limited space available, even
            if they do not contain any DDP-eligible bulk data.
          <vspace blankLines="1" />
            Such cases will raise performance issues because, to deal with 
            them, the server is interrupted twice to receive a single request
            and all the necessary transfers are serialized.   In particular, 
            there are two server interrupt latencies involved before the 
            server can process the actual request, in addition to the OTW
            round-trip latencies. 
          </t>
          <t>
            In the case of replies, there may be cases in which reply 
            chunks need 
            to be allocated and registered even if the actual reply would fit
            within the fixed receive-size limit.   Because the decision to
            create a reply chunk is made at the time the request is sent,
            even an extremely low probability of a longer reply will trigger
            allocation of a reply chunk.
          <vspace blankLines="1" />
            Because this decision is made in conformance with ULB rules,
            which, by their nature, may only reference a limited set of data,
            a reply chunk may be required even when the actual 
            probability of a long 
            reply is exactly zero.  For example, a GETATTR request can generate
            a long reply due to a long ACL, and thus a COMPOUND containing this 
            operation might allocate a reply chunk, even if the specific 
            file system being interrogated only supports ACLs of limited 
            sizes, or the GETATTR in question does not interrogate
            one of the ACL attributes.  Also, the OWNER attribute is a string
            and it may be impossible to determine a priori that the owner of any
            particular file has no chance of requiring more than 
            4K bytes of space, for example.  The assumption that there are 
            no such user names,
            while it probably is valid, is not a fact that RPC-over-RDMA
            implementations can depend on. 
          </t>
        </list>
        </t>
      </section>
      <section title="Message Continuation Changes" anchor="CONT-chg">
        <t>
          Continuing a single RPC request or reply is addressed by
          defining separate optional header types to begin and to
          continue sending a single RPC message.  This is instead
          of creating a header with a continuation bit.  In
          this approach, all of the fields relating to data
	  placement, which include support for send-based data
	  placement, appear in the starting header 
          (of types ROPT_XMTREQ and ROPT_XMTRESP) and apply to the 
          RPC message as a whole. 
        </t>
        <t>
          Later RPC-over-RDMA messages (of type ROPT_XMTCONT) may 
          extend the payload stream and/or provide additional
          buffers to which bulk data can be directed.  
        </t>
        <t>
          In this case, all of the RPC-over-RDMA messages used together
          are referred to as a transmission group and must be received 
          in order without any intervening message.  
        </t>
        <t>
          In implementations using this optional facility, those
          decoding RPC messages received using RPC-over-RDMA no longer
          have the assurance that each RPC message is in a 
          contiguous buffer.  As most XDR implementations are built
          based on the assumption that input will not be contiguous,
          this will not affect performance in most cases.
            
        </t>
      </section> 
      <section title="Message Continuation and Credits" 
               anchor="CONT-cred">
        <t>
          Using multiple transmissions to send a single request or
          response can complicate credit management.  In the case of
          the message continuation feature, deadlocks can be avoided
          because use of message continuation is not obligatory.  The
          requester or responder can use explicit RDMA operations
          if sufficient credits to use message continuation are not
          available.
        </t>
        <t>
          A requester is well positioned to make this choice with regard
          to the sending of requests.  The requester must
          know, before sending a request, how long it will be, and therefore,
          how many credits it would require to send the request using message
          continuation.  If these are not available, it can avoid
          message continuation by either creating read chunks sufficient
          to make the payload stream fit in a single transmission or 
          by creating a position-zero read chunk. 
        </t>
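        <t>
          The requester's choice described above can be sketched as
          follows.  This is an illustration under stated assumptions
          rather than a normative algorithm: it assumes that every
          transmission carries the same amount of payload and
          consumes one credit, and the function names and return
          values are hypothetical.
        </t>
        <figure>
          <artwork><![CDATA[
```python
def transmissions_needed(message_len, payload_per_transmission):
    """Return how many transmissions a message of message_len needs.

    Assumes each transmission carries payload_per_transmission bytes
    of the payload stream.  Even an empty message needs one
    transmission.
    """
    # Ceiling division using integer arithmetic only.
    return max(1, -(-message_len // payload_per_transmission))


def choose_request_method(message_len, payload_per_transmission,
                          credits_available):
    """Pick how to send a request, assuming one credit per transmission.

    Returns "single" when the request fits in one transmission,
    "continuation" when enough credits are available to send it as a
    transmission group, and "read-chunk" when the requester has to
    fall back on explicit RDMA (read chunks or a position-zero read
    chunk).
    """
    needed = transmissions_needed(message_len, payload_per_transmission)
    if needed == 1:
        return "single"
    if needed <= credits_available:
        return "continuation"
    return "read-chunk"
```
]]></artwork>
        </figure>
        <t>
          For instance, a 10000-byte request with 4096 bytes of
          payload per transmission needs three transmissions under
          these assumptions.  With three credits available, message
          continuation can be used; with only two, the sketch falls
          back to read chunks.
        </t>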
        <t>
          With regard to the response, the requester is not in position
          to know exactly how long the response will be.  However, the
          ULB will allow the maximum response length to be determined based
          on the request.  This value can be used:
        <list style="symbols">
          <t>
            To determine the maximum number of receive buffers that might
            be required to receive any response sent.
          </t>
          <t>
            To allocate and register a reply chunk to hold a possible large
            reply.
          </t>
        </list>
        </t>
        <t>
          The requester can avoid doing the second of these if the responder
          has indicated it can use message continuation to send the response.
          In this case, it makes sure that the buffers will be 
          available and indicates to 
          the responder how many additional buffers (in the form of pre-posted
          receives) have been made available to accommodate continuation
          transmissions.   
        </t>
        <t>
          When the responder processes the request, those additional receive
          buffers may be
          used or not, or used only in part.  This may be because the
          response is shorter than the maximum possible response, or because
          a reply chunk was used to transmit the response.         
        </t>          
        <t> 
          After the first or only transmission associated with the response 
          is received by the requester, it can be determined how many of
          the additional buffers were used for the response.  Any unused 
          buffers
          can be made available for other uses such as expanding the pool
          of receive buffers available for the initial transmissions of
          responses or for receiving opposite direction requests.
          Alternatively, they can be kept in reserve for future uses, such
          as being made available to future requests which have potentially 
          long responses.
        </t>
      </section> 

    </section>
    <section title="Using Protocol Additions" anchor="ADD">
      <t>
        In using existing RPC-over-RDMA facilities for protocol extension,
        interoperability with existing implementations needs to be assured.
        Because this document describes support for multiple features,
        we need to clearly specify the various possible extensions and
        how peers can determine whether certain facilities are supported by 
        both ends of the connection.  
      </t>
    <section title="New Operation Support" anchor="ADD-ops">
      <t>
        Note that most of the new operations defined in this extension are
        not tightly tied to a specific feature.  XOPT_XMTREQ and
        XOPT_XMTRESP are designed to support implementations that 
        support either or both Send-based data placement
	or message continuation.
        However, the converse is not the case and these header types can be 
        implemented by those not supporting either of these features.  
        For example, implementations may only need support for 
        the facilities described in <xref target="DDP-other" />.
      </t>
      <t>
        Implementations may determine whether a peer implementation
        supports XOPT_XMTREQ, XOPT_XMTRESP, or XOPT_XMTCONT by attempting
        these operations.  An alternative is to interrogate the 
        RTR Support Property for information about which 
        operations are supported.
      </t>
    </section>
    <section title="Message Continuation Support" anchor="ADD-msgc">
      <t>
        Implementations may determine and act based on the 
        level of peer implementation of 
        support for message continuation as follows:
      <list style="symbols">
        <t>
          To deal with issues relating to sending the peer
          multi-transmission requests, the requester can interrogate
          the peer's value of the Request Transmission Receive Limit
          (<xref target="CHAR-rqrxl"/>).  In cases in which the 
          property is not
          provided or has the value one, the requester implementation can avoid
          sending multi-transmission requests, and use the equivalent
          of position-zero read chunks to convey a request larger than the
          receive buffer limit.
        <vspace blankLines="1"/>
          Similarly, if the request is longer than can fit in a set of
          transmissions given that limit, the request can be
          conveyed in the same fashion.
        </t>
        <t>
          To deal with issues relating to sending the peer
          multi-transmission responses, responders will only send
          multi-transmission responses for requests conveyed using
          XOPT_XMTREQ where the number of response transmissions is
          less than or equal to the buffer reservation count (in the
          field optxrq_rsbuf).  The requester can avoid receiving 
          a message consisting of too many transmissions by setting
          this field appropriately.  This includes the case in which
          the requester cannot handle any multi-transmission responses.  
        </t>
        <t>
          To avoid reserving receive buffers that the responder is 
          not prepared to use, the requester can interrogate
          the peer's value of the Response Transmission Send Limit
          (<xref target="CHAR-rssxl"/>).  In cases in which it is
          possible that a request might result in a response too large 
          for this set of buffers, the requester can provide a 
          reply chunk to receive the response, which the responder can use if
          the count of buffers provided is insufficient.
        </t>
      </list>
      </t>
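      <t>
        The first two checks in the list above can be sketched as
        follows.  This is only an illustration: the way the peer's
        property value and the buffer reservation count are passed in
        is an assumption (with None standing for a property that was
        not provided), and the function names and return values are
        hypothetical.
      </t>
      <figure>
        <artwork><![CDATA[
```python
def request_conveyance(transmissions, peer_receive_limit):
    """Decide how a requester conveys a request.

    transmissions is the number of transmissions the request would
    need; peer_receive_limit stands for the peer's Request
    Transmission Receive Limit, or None if the property was not
    provided.
    """
    if transmissions <= 1:
        return "single"
    # Property absent or equal to one: avoid multi-transmission
    # requests altogether.
    if peer_receive_limit is None or peer_receive_limit <= 1:
        return "position-zero-read-chunk"
    # The request needs more transmissions than the peer accepts.
    if transmissions > peer_receive_limit:
        return "position-zero-read-chunk"
    return "continuation"


def responder_may_continue(response_transmissions, reserved_buffers):
    """Check whether a multi-transmission response may be sent.

    reserved_buffers stands for the buffer reservation count supplied
    by the requester (the optxrq_rsbuf field in the text).
    """
    return response_transmissions <= reserved_buffers
```
]]></artwork>
      </figure>
      <t>
        In this sketch, a requester that cannot handle any
        multi-transmission response would supply a reservation count
        that causes responder_may_continue to return false for every
        response needing more than one transmission.
      </t>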
    </section>
    <section title="Support for Send-based Data Placement" anchor="ADD-sbddp">
      <t>
        Implementations may determine and adapt to the level of 
        peer implementation
        support for send-based data placement as described below.  Note that
        an implementation may be able to send messages containing
        bulk data items placed using send-based data placement while not being
        prepared to receive them, or the reverse.  
      <list style="symbols">
        <t>
          The requester can interrogate the responder's Receive 
          Buffer Structure Property.  In cases in which the 
          property is not
          provided or shows no placement-targetable buffer segments, an
          implementation knows that messages containing bulk data may not
          be sent using send-based data placement.  In such cases, 
          when XOPT_XMTREQ
          is used to send a request, bulk items may be transferred
          by setting the associated placement information to indicate 
          that the bulk
          data is to be fetched using explicit RDMA operations.
        </t>
        <t>
          In cases in which a requester is unprepared to accept messages using
          send-based data placement, its Receive Buffer Structure Property 
          will make this clear to the responder.  Nevertheless, the requester
          will generally indicate to the responder that bulk data items
          are to be returned using explicit RDMA operations.  As a result,
          requesters may use XOPT_XMTREQ (and get the benefit of the 
          placement-related features discussed in <xref target="DDP-other"/>)
          even if they support neither
          message continuation nor send-based data placement.
        </t>
        <t>
          Since it is possible for a responder to generate responses containing
          bulk data using send-based data placement
          even if it is not prepared to receive
          such messages, a requester who is prepared to accept such messages
          should specify in the request that the responses are to contain
          (or may contain) bulk data placed in this way.  In deciding whether
          this is to be done, the requester can interrogate the responder's 
          RTR Support Property for information about whether the 
          peer can send responses in this form.  It can do this without 
          regard to whether the responder can accept messages containing
          bulk data items placed using send-based data placement.
        </t>
      </list>
      </t>
      <t>
        In determining whether bulk data will be placed using 
        send-based data placement
        or via explicit RDMA operations, the 
        level of support for message
        continuation will have a role.  This is because DDP using explicit
        RDMA will reduce message size while send-based data
	placement reduces the size
        of the payload stream by rearranging the message, leaving the 
        message size the same.  As a result, the considerations discussed
        in <xref target="CONT-cred"/> will have to be attended to by
        the sender in determining which form of data placement is to be used.

      </t>
    </section>
    <section title="Error Reporting" anchor="ADD-err">
      <t>
        The more extensive transport layer functionality described in this
        document requires its own means 
        of reporting errors, to deal with issues that are distinct from:
      <list style="symbols">
        <t>
          Errors (including XDR errors) in the XDR stream as received by
          responder or requester.
        </t>
        <t>
          XDR errors detected in the XDR headers defined by the base
          protocol.
        </t>
        <t>
          XDR errors detected in the new operations defined in this 
          document. 
        </t>
      </list>
      </t>
      <t>
        Beyond the above, the following sorts of errors will have to
        be dealt with, depending on which of the features of the extension
        are implemented.
      <list style="symbols">
        <t>
          Information associated with send-based data placement
	  may be inconsistent
          or otherwise invalid, even though it conforms to the XDR
          definition. 
        </t>
        <t>
          There may be problems with the organization of transmission groups
          in that there are missing or extraneous transmissions.
        </t>
      </list>
      </t>
      <t>
        In each of the above cases, the problem will be reported to the sender
        using the Error Reporting operation which needs to be supported by
        every endpoint that sends ROPT_XMTREQ, ROPT_XMTRESP, or 
        ROPT_XMTCONT. 
        This includes cases in which the
        problem is one with a reply.  The function of the Error Reporting 
        operation is to aid in diagnosing transport protocol errors and
        to allow the sender to recover or to decide that recovery is not possible.
        Reporting failure to the requesting process is dealt with indirectly.
        For example,
      <list style="symbols">
        <t>
          When the transmissions used to send a request are ill-formed, the
          requester can respond to the error indication by
          proceeding to send the request using existing (i.e., non-extended)
          facilities.  If it
          chooses not to do so, the requester can report an RPC request failure
          to the initiator of the RPC. 

        </t>
        <t>
          When the transmissions used to send a response are ill-formed, the
          responder needs to know about the problem since it will otherwise 
          assume that the transmissions succeeded.  It can proceed to resend
          the reply using existing (i.e. non-extended) facilities.  If it
          chooses not to do so, the requester will not see a response and
          eventually an RPC timeout will occur.

        </t>
      </list>
      </t>
    </section>
    </section>
    <section title="XDR Preliminaries" anchor="XDRP">
      <section title="Message Continuation Preliminaries" 
               anchor="XDRP-struct">
        <t>
          In order to implement message continuation,
          we have occasion to refer to particular RPC-over-RDMA
          transmissions within a transmission group or to
          characteristics of a later transmission group.
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

typedef uint32  xms_grpxn;
typedef uint32  xms_grpxc;
struct xms_id {
        uint32         xmsi_xid;
        msg_type       xmsi_dir;
        xms_grpxn      xmsi_seq;
};

&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          An xms_grpxn designates a particular RPC-over-RDMA transmission
          within a set of transmissions devoted to sending a single
          RPC message.
        </t>
        <t>
          An xms_grpxc specifies the number of RPC-over-RDMA transmissions
          in a potential group of transmissions devoted to sending a single
          RPC message.
        </t>
      </section>
      <section title="Data Placement Preliminaries" 
               anchor="XDRP-ddp">
        <t>
          Data structures related to data placement use a number of XDR 
          typedefs to help clarify the meaning of fields in the data 
          structures which use these typedefs.
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

typedef uint32  xmdp_itemlen;
typedef uint32  xmdp_pldisp;
typedef uint32  xmdp_vsdisp;

typedef uint32  xmdp_tbsn;
 
enum xmdp_type {
        XMPTYPE_EXRW = 1,
        XMPTYPE_TBSN = 2,
        XMPTYPE_CHOICE = 3,
        XMPTYPE_BYSIZE = 4,
        XMPTYPE_TOOSHORT = 5,
        XMPTYPE_NOITEM = 6
};
                
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          An xmdp_itemlen specifies the length of an XDR item.  Because
          items excised from the XDR stream are XDR items, lengths of
          items excised from the XDR stream are denoted by xmdp_itemlens.   
        </t>
        <t>
          An xmdp_pldisp specifies a specific displacement within the
          payload stream associated with a single RPC-over-RDMA transmission
          or a group of such transmissions.  Note that when multiple 
          transmissions are used for a single message, all of the payload 
          streams within a transmission group 
          are considered concatenated.
        </t>
        <t>
          An xmdp_vsdisp specifies a displacement within the virtual 
          XDR stream associated with the set of RPC messages transferred
          by a single RPC-over-RDMA transmission
          or a group of such transmissions.  The virtual XDR stream
          includes bulk data excised from the payload stream and so
          displacements within it reflect those of the corresponding 
          objects in the XDR stream that might be sent and received if 
          no bulk data excision facilities were involved in the RPC
          transmission.  
         </t>
        <t>
          An xmdp_tbsn designates a particular target buffer segment
          within a (trivial or non-trivial) RPC-over-RDMA transmission group.
          Each placement-targetable buffer segment is assigned a 
          number starting 
          with zero and proceeding through all the buffer segments for all
          the RPC-over-RDMA transmissions in the group.  This includes buffer 
          segments not actually used because transmissions are 
          shorter than the maximum size and those in 
          which a placement-targetable buffer segment is used to hold part of
          the payload XDR stream rather than bulk data. 
        </t>
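        <t>
          As a non-normative illustration, the numbering rule above can
          be sketched in Python.  The representation of a receive buffer
          as a list of (length, flags) tuples, the function name
          number_target_segments, and the assumption that every
          transmission in the group is received into a buffer of the
          same structure are all hypothetical and are not defined by
          this document.  XMRSFLAG_PLT is the flag bit defined in
          <xref target="DDPS-buf"/>.
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
XMRSFLAG_PLT = 0x01  # flag bit from the buffer structure definition

def number_target_segments(buffer_segments, transmission_count):
    """Map each xmdp_tbsn to a (transmission, segment index) pair.

    buffer_segments is a hypothetical list of (length, flags) tuples
    describing one receive buffer; every transmission in the group
    is assumed to land in a buffer with this same structure.
    """
    numbering = []
    for xmit in range(transmission_count):
        for seg_index, (_length, flags) in enumerate(buffer_segments):
            if flags & XMRSFLAG_PLT:
                # Numbered even if the segment ends up unused or
                # holds payload stream data rather than bulk data.
                numbering.append((xmit, seg_index))
    return numbering

# One 256-byte non-targetable segment followed by two 4096-byte
# placement-targetable segments, for a group of two transmissions.
layout = [(256, 0), (4096, XMRSFLAG_PLT), (4096, XMRSFLAG_PLT)]
tbsn_map = number_target_segments(layout, 2)
]]></artwork>
        </figure>
        </t>
        <t>
          Under these assumptions, the index of an entry in tbsn_map is
          the xmdp_tbsn value, so the second transmission's first
          targetable segment is numbered two.
        </t>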
        <t>
          An xmdp_type allows a selection between placement using explicit RDMA
          operations (i.e. DDP) and send-based data placement.
	  Fields of this type are used in a number
          of contexts.  The specific context governs which subset of the
          types is valid:   
        <list style="symbols">
          <t>
            In request messages, they indicate where each of the 
            specially placed data items within the request has been placed.
            In this case, xmdp_type appears as the discriminator within an
            xmdp_loc which is part of an xmdp_mitem that is an element 
            within a request's optxrq_dp field.
          </t>
          <t>
            In request messages, they direct the responder as to where 
            potential specially placed items are to be placed.
            In this case, xmdp_type appears as the discriminator within an
            xmdp_rsdloc which is part of an xmdp_rsditem that is an element 
            within a request's optxrq_rsd field.
          </t>
          <t>
            In response messages, they indicate how each of the 
            potential specially placed items has been dealt with.  A subset
            of these are specially placed data items and are presented
            in the same form as that used for
            specially placed data items within a request. 
            In this case, xmdp_type appears as the discriminator within an
            xmdp_loc which is part of an xmdp_mitem that is an element 
            within a response's optxrs_dp field.

          </t>
        </list>
        </t>
        <t>
          A number of these types are valid in all of these contexts, since
          they specify use of a specific mode of data placement which
          is to be used or has been used. 
        <list style="symbols">
          <t>
            XMPTYPE_EXRW selects DDP using explicit RDMA reads and writes. 
          </t>
          <t>
            XMPTYPE_TBSN selects use of send-based data placement in which
	    placement-eligible
            data is located in placement-targetable buffer segments.
          </t>
        </list>
        </t>
        <t>
          Another set of types is used to direct the use of specific sets of
          types but cannot specify an actual choice that has been made.
        <list style="symbols">
          <t>
            XMPTYPE_CHOICE indicates that the responder may use either 
            send-based data placement or chunk-based DDP using 
            explicit RDMA operations, 
            with a target location for the latter having been 
            provided by the requester.
          </t>
          <t>
            XMPTYPE_BYSIZE indicates that the responder is to use either 
            send-based data placement
	    or chunk-based DDP using explicit RDMA operations, 
            with the choice between the two governed by the actual size of
            the associated DDP-eligible XDR item.
          </t>
        </list>
        </t>
        <t>
          The following types are used when no actual special placement 
          has occurred. They are used in responses to indicate ways
          in which a direction to govern data placement
	  in a reply was responded
          to without resulting in special placement.
        <list style="symbols">
          <t>
            XMPTYPE_TOOSHORT indicates that the corresponding entry in an
            xmdp_rsdset was matched with a DDP-eligible item which was 
            too small to be handled using special placement, resulting in the
            DDP-eligible item being placed inline. 
          </t>
          <t>
            XMPTYPE_NOITEM indicates that the corresponding entry in an
            xmdp_rsdset was not matched with a DDP-eligible item in the
            reply.
          </t>
        </list>
        </t>
        <t>
          The following table indicates which of the above types is valid
          in each of the contexts in which these types may appear.  For
          valid occurrences, it distinguishes those which give 
          sender-generated information about the message, and those that direct
          reply construction, from those that indicate how those 
          directions governed the construction of a reply.  For invalid
          occurrences, we distinguish between those that result in XDR
          decode errors and those which are valid from the XDR point of
          view but are semantically invalid. 
        </t>
      <texttable align="left" style="full" anchor="typetaba">
        <ttcol>
          Type
        </ttcol>
        <ttcol>
          xmdp_loc in request
        </ttcol>
        <ttcol>
          xmdp_rsdloc in request
        </ttcol>
        <ttcol>
          xmdp_loc in response
        </ttcol>
        <c>
          XMPTYPE_EXRW
        </c>
        <c>
          Valid Info
        </c>
        <c>
          Valid Direction
        </c>
        <c>
          Valid Result
        </c>
        <c>
          XMPTYPE_TBSN
        </c>
        <c>
          Valid Info
        </c>
        <c>
          Valid Direction
        </c>
        <c>
          Valid Result
        </c>
        <c>
          XMPTYPE_BYSIZE
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          Valid Direction
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          XMPTYPE_CHOICE
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          Valid Direction
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          XMPTYPE_TOOSHORT
        </c>
        <c>
          Sem. Invalid
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          Valid Result
        </c>
        <c>
          XMPTYPE_NOITEM
        </c>
        <c>
          Sem. Invalid
        </c>
        <c>
          XDR Invalid
        </c>
        <c>
          Valid Result
        </c>

      </texttable>

      </section>
    </section>
    <section title="Data Placement Structures" anchor="DDPS">
      <section title="Data Placement Overview" anchor="DDPS-oview">
        <t>
          To understand the new data placement structures defined here, it is 
          necessary to review the existing DDP structures used in
          RPC-over-RDMA Version One and look at the corresponding
          structures in the new message transmission headers defined
          in this document.
        </t>
        <t>
          We look first at the existing structures.
        <list style="symbols">
          <t>
            Read chunks are specified on requests to indicate data
            items to be excised from the payload stream and fetched
            from the requester's memory by the responder.  As such,
            they serve as a means of supplying data excised from the 
            payload XDR stream. 
          <vspace blankLines="1"/>
            Read chunks appear in replies but they have no clear function
            there.
          </t>
          <t>
            Write chunks are specified on requests to provide locations
            in requester memory to which DDP-eligible items in the 
            corresponding reply are to be transferred.  They do not
            describe data in the request but serve to direct reply
            construction. 
          <vspace blankLines="1"/>
            When write chunks appear in replies they serve to indicate
            the length of the data transferred.  The addresses to which
            the bulk reply data has been transferred are available, but this
            information is already known to the requester.
          </t>
          <t>
            Reply chunks are specified to provide a location in the
            requester's memory to which the responder can transfer the
            response using RDMA Write. Like write chunks, they do not
            describe data in the request but serve to direct reply
            construction. 
          <vspace blankLines="1"/>
            When reply chunks appear in reply message headers, they serve
            mainly to indicate whether the reply chunk was actually used.
          </t>
        </list>
        </t>
        <t>
          Within the data placement structures defined here,
          a different organization
          is used, even where DDP using explicit RDMA operations is
          supported.
        <list style="symbols">
          <t>
            All messages that contain bulk data contain structures that 
            indicate where the excised data is located.  
            See <xref target="DDPS-msg" /> for details.
          </t>
          <t>
            Requests that might generate replies containing bulk
            data contain structures that provide guidance as to where
            the bulk data is to be placed.
            See <xref target="DDPS-resp" /> for details.
          </t>
        </list>
        </t>
        <t>
          Both sets of data structures are defined at the granularity of
          an RPC-over-RDMA transmission group.  That is, they describe the
          placement of data within an RPC message and the scope of description
          is not limited to a single RPC-over-RDMA transmission.
        </t>
      </section>
      <section title="Buffer Structure Definition" anchor="DDPS-buf">
        <t>
          Buffer structure definition information is used to allow the
          sender to know how receive buffers are constructed, to allow
          it to appropriately pad messages being sent so that bulk data
          will be received into a memory area with the appropriate 
          characteristics.
        </t>
        <t>
          In this case, data placement will not place data at a specific
          address, picked and registered in advance, as is done to effect
          DDP using explicit RDMA operations.  Instead, a message is sent
          so that when it is matched with one of the preposted receives,
          the bulk data will be received into a memory area with the 
          appropriate characteristics, including:
        <list style="symbols">
          <t>
            size
          </t>
          <t>
            alignment
          </t>
          <t>
            placement-targetability and potentially other 
            memory characteristics 
            such as speed and persistence.
          </t>
        </list>
        </t>
        <t>
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

struct xmrbs_seg {
        uint32          xmrseg_length;
        uint32          xmrseg_align;
        uint32          xmrseg_flags;
};

const uint32    XMRSFLAG_PLT = 0x01;


struct xmrbs_group {
        uint32          xmrgrp_count;
        xmrbs_seg       xmrgrp_info;
};

struct xmrbs_buf {
        uint32          xmrbuf_length;
        xmrbs_group     xmrbuf_groups&lt;&gt;;
};

&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          Buffers can be, and typically are, structured to contain multiple
          segments.  Preposted receives that target a buffer use a scatter 
          list to place received messages in successive buffer segments.
        </t>
        <t>
          An xmrbs_seg defines a single buffer segment.  The fields included
          are:
        <list style="symbols">
          <t>
            xmrseg_length is the length of this contiguous buffer segment.
          </t>
          <t>
            xmrseg_align specifies the guaranteed alignment for the
            corresponding buffer segment.  
          </t>
          <t>
            xmrseg_flags specifies some noteworthy characteristics 
            of the associated buffer segment.
          </t>
        </list>
        </t>
        <t>
           The following flag bit is the only one currently defined:
        <list style="symbols">
          <t>
            XMRSFLAG_PLT indicates that the buffer segment in question 
            is to be considered suitable as a target for data placement.
          </t>
        </list>
        </t>
        <t>
          An xmrbs_group designates a set of buffer segments, all with
          the same buffer segment characteristics as indicated by xmrgrp_info.
          The buffer segments are contiguous within the buffer although
          they are likely not to be physically contiguous.
        </t>
        <t>
          An xmrbs_buf defines a receiver's buffer structure and consists of
          multiple xmrbs_groups.  This buffer structure, when made available as
          a transport property, allows the sender to structure 
          transmissions so as to place DDP-eligible data in appropriate
          target buffer segments.
        </t>
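        <t>
          The following Python sketch, which is illustrative only, shows
          one way a sender might expand a peer's buffer structure into
          individual segments in order to learn where
          placement-targetable segments begin.  Representing each
          xmrbs_group as a (count, length, align, flags) tuple is an
          assumption, as are the function names expand_groups and
          targetable_offsets.
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
XMRSFLAG_PLT = 0x01  # the only flag bit currently defined

def expand_groups(groups):
    """Expand xmrbs_group-like tuples into per-segment records.

    Each group is a hypothetical (count, length, align, flags)
    tuple.  Returns a list of (offset, length, align, flags), where
    offset is the position of the segment within the buffer.
    """
    segments = []
    offset = 0
    for count, length, align, flags in groups:
        for _ in range(count):
            segments.append((offset, length, align, flags))
            offset += length
    return segments

def targetable_offsets(groups):
    """Buffer offsets at which placement-targetable segments begin."""
    return [off for off, _len, _align, flags in expand_groups(groups)
            if flags & XMRSFLAG_PLT]

# One 512-byte segment followed by two page-aligned 4096-byte
# placement-targetable segments.
groups = [(1, 512, 8, 0), (2, 4096, 4096, XMRSFLAG_PLT)]
offsets = targetable_offsets(groups)
]]></artwork>
        </figure>
        </t>
        <t>
          With this example layout, a sender wishing to have bulk data
          received into the first placement-targetable segment would
          arrange for that data to begin 512 bytes into the
          transmission.
        </t>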

      </section>
      <section title="Message Data Placement Structures" anchor="DDPS-msg">
        <t>
          These data structures show where in the virtual XDR stream
          for the set of messages, data is to be excised from that 
          XDR stream and where that excised bulk data is to be found
          instead.
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

union xmdp_loc switch(xmdp_type type) {

        case XMPTYPE_EXRW:
                rpcrdma1_segment        xmdl_ex&lt;&gt;;
        case XMPTYPE_TBSN:
                xmdp_itemlen            xmdl_offset;
                xmdp_tbsn               xmdl_bsnum&lt;&gt;;
        case XMPTYPE_TOOSHORT:
        case XMPTYPE_NOITEM:
                void;
};


struct xmdp_mitem {
        xmdp_vsdisp     xmdmi_disp;
        xmdp_itemlen    xmdmi_length;
        xmdp_loc        xmdmi_where;
};

typedef xmdp_mitem      xmdp_grpinfo&lt;&gt;;
         
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          An xmdp_loc shows where a particular piece of bulk data is located.
          This information exists in multiple forms.
        <list style="symbols">
          <t>
            The case for DDP using explicit RDMA operations contains,
            in xmdl_ex, an array of rpcrdma1_segments showing where
            bulk data is to be fetched from or has been transferred to.   
          </t>
          <t>
            The case for send-based data placement contains,
            in xmdl_bsnum, an array designating
            placement-targetable buffer segments, indicating where bulk data,
            excised from the payload stream, is actually located.  The
            bulk data starts xmdl_offset bytes into the buffer segment
            designated by xmdl_bsnum[0] and then proceeds through buffer
            segments denoted by successive xmdl_bsnum entries until the
            length of the data item is exhausted.
          </t>
          <t>
            The cases for XMPTYPE_TOOSHORT and XMPTYPE_NOITEM are void and
            are only valid in responses.

          </t>
        </list>
        </t>
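        <t>
          As an illustration of the send-based case, the following
          Python sketch splits a bulk data item into the pieces found in
          each designated buffer segment.  It assumes, beyond what is
          stated above, that all placement-targetable segments have the
          same length and that every segment after the first is filled
          from its beginning.  The function name locate_item and its
          parameters are hypothetical.
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
def locate_item(item_length, offset, bsnums, segment_length):
    """Split a bulk item into (tbsn, start, nbytes) pieces.

    offset and bsnums stand for xmdl_offset and xmdl_bsnum.  All
    targetable segments are assumed to share one segment_length.
    """
    if not 0 <= offset < segment_length:
        raise ValueError("offset lies outside the first segment")
    pieces = []
    remaining = item_length
    start = offset
    for tbsn in bsnums:
        if remaining == 0:
            break
        nbytes = min(remaining, segment_length - start)
        pieces.append((tbsn, start, nbytes))
        remaining -= nbytes
        start = 0  # assumed: later segments start at their beginning
    if remaining:
        raise ValueError("segment list does not cover the item")
    return pieces

# A 6000-byte item starting 1024 bytes into segment 4 and
# continuing into segment 5, with 4096-byte segments.
pieces = locate_item(6000, 1024, [4, 5], 4096)
]]></artwork>
        </figure>
        </t>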
        <t>
          An xmdp_mitem denotes a specific item of bulk data.  It consists
          of: 
        <list style="symbols">
          <t>
            The XDR stream displacement of the bulk data excised from the
            payload stream, in xmdmi_disp.
          </t>
          <t>
            The length of the data item, in xmdmi_length.
          </t>
          <t>
            The actual location of the bulk data, in xmdmi_where.
          </t>
        </list>
        </t>
        <t>
          An xmdp_grpinfo consists of an array of xmdp_mitems describing
          all of the bulk data excised from all RPC messages sent in a
          single RPC-over-RDMA transmission group. Some possible cases:
        <list style="symbols">
          <t>
            The array is of length zero, indicating that there is no
            DDP-eligible data excised from the virtual XDR stream. In
            this case, the virtual XDR stream and the payload stream are 
            identical.
          </t>
          <t>
            The array consists of one or more xmdp_mitems, each of whose
            xmdmi_where fields is of type XMPTYPE_EXRW.  In this case, the
            placement data corresponds to read chunks in the case in
	    which a request
            is being sent and to write chunks in the case in which a reply
            is being sent.
          </t>
          <t> 
            The array consists of one or more xmdp_mitems, each of whose
            xmdmi_where fields is of type XMPTYPE_TBSN. In this case, each
            entry, whether it applies to bulk data in a request or a reply,
            describes data logically part of the message being sent, which
            may be part of any RPC-over-RDMA transmissions in the same
            transmission group.
          </t>
          <t>
            The array consists of one or more xmdp_mitems, with 
            xmdmi_where fields of a mixture of types.  In this case, each
            entry, whether it applies to bulk data in a request or a reply,
            describes data logically part of the message being sent, although
            the method of getting access to that data may vary from entry to 
            entry.
          </t>
        </list>

        </t>
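        <t>
          The relationship between the payload stream and the virtual
          XDR stream can be sketched as follows.  This Python fragment
          is a simplified illustration: it assumes that the bulk data
          for each item has already been retrieved, whatever the
          placement method, and it ignores any XDR padding associated
          with excised items.  The function name rebuild_virtual_stream
          and the (disp, data) pair representation are hypothetical.
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
def rebuild_virtual_stream(payload, items):
    """Reinsert excised bulk data into the payload stream.

    items is a list of (disp, data) pairs, where disp plays the
    role of xmdmi_disp (a displacement in the virtual XDR stream)
    and data is the bulk data already retrieved for that item.
    """
    virtual = bytearray()
    consumed = 0  # bytes of payload copied so far
    for disp, data in sorted(items):
        gap = disp - len(virtual)  # payload bytes before this item
        if gap < 0 or consumed + gap > len(payload):
            raise ValueError("inconsistent displacement")
        virtual += payload[consumed:consumed + gap]
        consumed += gap
        virtual += data
    virtual += payload[consumed:]
    return bytes(virtual)

stream = rebuild_virtual_stream(b"HEADTAIL", [(4, b"bulk")])
]]></artwork>
        </figure>
        </t>
        <t>
          When the item list is empty, the function returns the payload
          unchanged, matching the zero-length array case described
          above.
        </t>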
      </section>
      <section title="Response Direction Data Placement Structures" anchor="DDPS-resp">
        <t>
          These data structures, when sent as part of the request, instruct
          the responder how to use data placement to place response
          data subject to special data placement.
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

union xmdp_rsdloc switch(xmdp_type type) {

        case XMPTYPE_EXRW:
        case XMPTYPE_CHOICE:
                rpcrdma1_segment        xmdrsdl_ex&lt;&gt;;
        case XMPTYPE_BYSIZE:
                xmdp_itemlen            xmdrsdl_dsdov;
                rpcrdma1_segment        xmdrsdl_bsex&lt;&gt;;
        case XMPTYPE_TBSN:
                void;
};

struct xmdp_rsdrange {
        xmdp_vsdisp     xmdrsdr_begin;
        xmdp_vsdisp     xmdrsdr_end;
};

struct xmdp_rsditem {
        xmdp_itemlen    xmdrsdi_minlen;
        xmdp_rsdloc     xmdrsdi_loc;
};

struct xmdp_rsdset {
        xmdp_rsdrange   xmdrsds_range;
        xmdp_rsditem    xmdrsds_items&lt;&gt;;
};

typedef xmdp_rsdset     xmdp_rsdgroup&lt;&gt;;

&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          An xmdp_rsdloc contains information specifying where bulk data
          generated as part of a reply is to be placed.
          This information is defined as a union with the following
          cases: 

        <list style="symbols">
          <t>
            The case for DDP using explicit RDMA operations, XMPTYPE_EXRW,
            contains,
            in xmdrsdl_ex, an array of rpcrdma1_segments showing where
            bulk data generated by the corresponding reply is to be 
            transferred to.   
          </t>
          <t>
            The case allowing the responder to freely choose the data
            placement method,
            XMPTYPE_CHOICE, is identical.
            It also contains,
            in xmdrsdl_ex, an array of rpcrdma1_segments showing where
            bulk data generated by the corresponding reply is to be 
            transferred to if explicit RDMA operations are to be used.   
          </t>
          <t>
            The case for send-based data placement, XMPTYPE_TBSN,
	    is void, since the decisions as to 
            where bulk data is to be placed are made by the responder.
          </t>
          <t>
            In the case directing the responder to choose the data placement
            method
            based on item size, XMPTYPE_BYSIZE, an array of rpcrdma1_segments 
            is in xmdrsdl_bsex.

          </t>
        </list>
        </t>
        <t>
          In all cases, each xmdp_rsdloc sent as part of a request has a 
          corresponding xmdp_loc in the associated response.  The xmdp_type
          specified in the request will affect the type in the response, but
          the types are not necessarily the same.  The table below
          describes the valid combinations of request and response
          xmdp_type values. 
        </t>
        <t>
          In this table, rows correspond to types in requests directing
          the responder as to the desired placement in the response, while
          the columns correspond to types in the ensuing response.  Invalid
          combinations are labelled "Inv." while valid combinations are labelled
          either "NDR" denoting no need to deregister memory, or "DR" to 
          indicate that memory previously registered will need to be 
          deregistered.
        </t>
        <texttable align="left" style="full" anchor="typetab">
          <ttcol>
            Type
          </ttcol>
          <ttcol>
            EXRW
          </ttcol>
          <ttcol>
            TBSN
          </ttcol>
          <ttcol>
            TOOSHORT
          </ttcol>
          <ttcol>
            NOITEM
          </ttcol>
          <c>
            EXRW
          </c>
          <c>
            DR
          </c>
          <c>
            Inv.
          </c>
          <c>
            DR
          </c>
          <c>
            DR
          </c>
          <c>
            TBSN
          </c>
          <c>
            Inv.
          </c>
          <c>
            NDR
          </c>
          <c>
            NDR
          </c>
          <c>
            NDR
          </c>
          <c>
            CHOICE
          </c>
          <c>
            DR
          </c>
          <c>
            NDR
          </c>
          <c>
            DR
          </c>
          <c>
            DR
          </c>
          <c>
            BYSIZE
          </c>
          <c>
            DR
          </c>
          <c>
            NDR
          </c>
          <c>
            DR
          </c>
          <c>
            DR
          </c>
        </texttable>
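        <t>
          A requester might consult the table above when processing a
          response, as in the following Python sketch.  The dictionary
          simply transcribes the table; the names OUTCOME and
          must_deregister are hypothetical, and the type names are
          abbreviated as in the table.
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
# Rows: type in the request's xmdp_rsdloc.
# Columns: type in the response's xmdp_loc.
# "DR": deregistration needed; "NDR": not needed; None: invalid.
OUTCOME = {
    "EXRW":   {"EXRW": "DR", "TBSN": None,
               "TOOSHORT": "DR", "NOITEM": "DR"},
    "TBSN":   {"EXRW": None, "TBSN": "NDR",
               "TOOSHORT": "NDR", "NOITEM": "NDR"},
    "CHOICE": {"EXRW": "DR", "TBSN": "NDR",
               "TOOSHORT": "DR", "NOITEM": "DR"},
    "BYSIZE": {"EXRW": "DR", "TBSN": "NDR",
               "TOOSHORT": "DR", "NOITEM": "DR"},
}

def must_deregister(request_type, response_type):
    """True if previously registered memory must be deregistered."""
    outcome = OUTCOME[request_type].get(response_type)
    if outcome is None:
        raise ValueError("invalid request/response combination")
    return outcome == "DR"
]]></artwork>
        </figure>
        </t>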
        <t>
          An xmdp_rsdrange denotes a range of positions in the XDR stream
          associated with a request.  Particular directions regarding
          bulk data in the corresponding response are limited to such 
          ranges, where response XDR stream positions and request XDR stream
          positions can be reliably tied together.  
        </t>
        <t>
          When the ULP supports multiple individual operations per RPC 
          request (e.g., COMPOUND and CB_COMPOUND in NFSv4), an
          xmdp_rsdrange can isolate elements of the reply due to particular
          operations.
        </t>
        <t>
          An xmdp_rsditem specifies the handling of one potential item of
          bulk data.  The handling specified is qualified by a length range.
          If the item is smaller than xmdrsdi_minlen, it is not treated as
          bulk data and the corresponding data item appears in the payload
          stream, while that particular xmdp_rsditem is considered used up,
          making the next xmdp_rsditem in the xmdp_rsdset the target of the 
          next DDP-eligible data item in the reply. 
          Note that in the case in which xmdrsdi_loc specifies
          use of explicit RDMA operations, the area specified is not used
          and the requester is responsible for deregistering it.
        </t>
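        <t>
          The consumption rule above can be sketched in Python.  This is an
          illustration only, not part of the protocol: the function name and
          the representation of an xmdp_rsditem as a (minlen, location) pair
          are hypothetical, and the choice to leave items in the payload
          stream once the xmdp_rsditems are exhausted is an assumption not
          stated here.
        </t>
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
```python
def place_reply_items(rsditems, item_lengths):
    """Decide where each DDP-eligible reply item goes.

    rsditems:     list of (minlen, location) pairs, in order.
    item_lengths: lengths of the DDP-eligible reply items, in order.
    Returns one entry per reply item: the location label taken from
    the matching rsditem, or "inline" for the payload stream.
    """
    placements = []
    for index, length in enumerate(item_lengths):
        if index >= len(rsditems):
            # Assumption: with no rsditem left, the item stays inline.
            placements.append("inline")
            continue
        minlen, location = rsditems[index]
        # Each reply item uses up one rsditem, even when the item is
        # shorter than minlen and therefore stays in the payload stream.
        placements.append("inline" if length < minlen else location)
    return placements
```
]]></artwork>
        </figure>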
        <t>
          For each xmdp_rsditem, there will be a corresponding xmdp_mitem
          in the ensuing response.
        </t>
        <t>
          An xmdp_rsdset contains a set of xmdp_rsditems applicable to
          a given xmdp_rsdrange in the request.
        </t>
        <t>
          An xmdp_rsdgroup designates a set of xmdp_rsdsets applicable to
          a particular RPC-over-RDMA transmission group.  The xmdrsds_range
          fields of successive xmdp_rsdsets must be disjoint and in 
          strictly increasing order. 
        </t>
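        <t>
          The ordering constraint on the xmdrsds_range fields lends itself
          to a simple check.  The following Python sketch is illustrative
          only: it models each xmdp_rsdrange as a (begin, end) pair and
          assumes that xmdrsdr_end is exclusive, which this document does
          not state.  The function name is hypothetical.
        </t>
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
```python
def ranges_are_valid(ranges):
    """Return True if the ranges are disjoint and strictly increasing.

    ranges is a list of (begin, end) pairs.  Assumption: 'end' is
    exclusive, so a range may begin where the previous one ended.
    """
    prev_end = 0
    for begin, end in ranges:
        if begin >= end:
            # Empty or inverted range.
            return False
        if begin < prev_end:
            # Overlaps, or sorts before, the previous range.
            return False
        prev_end = end
    return True
```
]]></artwork>
        </figure>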


      </section>
    </section>

    <section title="Transport Properties" anchor="CHAR">
      <section title="Property List" anchor="CHAR-list">

      <t>
        In this document we take advantage of the fact that the set of
        transport properties defined in 
        <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>
        is subject to later extension.  The additional transport 
        properties are summarized below in <xref target="chtab"/>.
      </t>
      <t>
        In that table the columns have the following values: 
        <list style="symbols">
          <t>
            The column labeled "property" identifies the transport 
            property described by the current row.
          </t>
          <t>

            The column labeled "#" specifies the propid value
            used to identify this property.
          </t>
          <t>
            The column labeled "XDR type" gives the XDR type of the data used 
            to communicate the value of this property.  This data 
            overlays the nominally opaque field pv_data in a
            propval. 
          </t>
          <t>
            The column labeled "default" gives the default value for the
            property which is to be assumed by those who do not
            receive, or are unable to interpret, information about the
            actual value of the property.
          </t>
          <t>
            The column labeled "section" indicates the section (within this
            document) that explains the semantics and use of this transport 
            property.
          </t>
        </list>
      </t> 
      <texttable align="left" style="full" anchor="chtab">
        <ttcol>
          property
        </ttcol>
        <ttcol>
          #
        </ttcol>
        <ttcol>
          XDR type
        </ttcol>
        <ttcol>
          default
        </ttcol>
        <ttcol>
          section
        </ttcol>
        <c>
          RTR Support
        </c>
        <c>
          3
        </c>
        <c>
          uint32
        </c>
        <c>
          0
        </c>
        <c>
          <xref format="counter" target="CHAR-rtrs"/>
        </c>
        <c>
          Receive Buffer Structure
        </c>
        <c>
          4
        </c>
        <c>
          xmrbs_buf
        </c>
        <c>
          <xref target="cNote1" format="none">Note1</xref>
        </c>
        <c>
          <xref format="counter" target="CHAR-rbs"/>
        </c>
        <c>
          Request Transmission Receive Limit
        </c>
        <c>
          5
        </c>
        <c>
          xms_grpxc
        </c>
        <c>
          1
        </c>
        <c>
          <xref format="counter" target="CHAR-rqrxl"/>
        </c>
        <c>
          Response Transmission Send Limit
        </c>
        <c>
          6
        </c>
        <c>
          xms_grpxc
        </c>
        <c>
          1
        </c>
        <c>
          <xref format="counter" target="CHAR-rssxl"/>
        </c>


      </texttable>
      <t>
        The following notes apply to the above table:
      <list style="numbers">
        <t anchor="cNote1">
          The default value for the Receive Buffer Structure always
          consists of a single buffer segment, without any alignment
          restrictions and not targetable for DDP.  The length of
          that buffer segment derives from the Receive Buffer Size
          Property if available, and from the default receive
          buffer size otherwise.    
        </t>
      </list>
      </t>
    </section>
    <section title="RTR Support Property"
             anchor="CHAR-rtrs">
      <t>
      <figure align="left">
        <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32           XPROP_RTRSUPP = 3;
typedef uint32         xpr_rtrs;

const uint32           RTRS_XREQ = 1;
const uint32           RTRS_XRESP = 2;
const uint32           RTRS_XCONT = 4;

&lt;CODE ENDS&gt;
          </artwork>
        </figure>       
      </t>
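      <t>
        The RTRS_ constants above are distinct powers of two, which 
        suggests that an xpr_rtrs value is a bit mask.  The following 
        Python sketch decodes such a mask under that assumption.  The 
        mapping of each flag to an operation is inferred from the flag 
        names only, and the function name is hypothetical.
      </t>
      <figure align="left">
        <artwork xml:space="preserve" align="left"><![CDATA[
```python
RTRS_XREQ = 1
RTRS_XRESP = 2
RTRS_XCONT = 4

# Assumed meaning of each flag, inferred from its name only.
FLAG_NAMES = (
    (RTRS_XREQ, "Transmit Request"),
    (RTRS_XRESP, "Transmit Response"),
    (RTRS_XCONT, "Transmit Continue"),
)


def decode_rtr_support(xpr_rtrs):
    """Return the names of the operations flagged in xpr_rtrs."""
    return [name for flag, name in FLAG_NAMES if xpr_rtrs & flag]
```
]]></artwork>
      </figure>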
    </section>
    <section title="Receive Buffer Structure Property"
             anchor="CHAR-rbs">
      <t>
        This property defines the structure of the endpoint's
        receive buffers, in order to give a sender the ability to 
        place bulk
        data in specific DDP-targetable buffer segments.
      <figure align="left">
        <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32           XPROP_RBSTRUCT = 4;
typedef xmrbs_buf      xpr_rbs;

&lt;CODE ENDS&gt;
          </artwork>
        </figure>       
      </t>
      <t>
        Normally, this property, if specified, should be in agreement
        with the Receive Buffer Size Property.  However, the following rules 
        apply:
      <list style="symbols">
        <t>
          If the value of the Receive Buffer Structure Property is not 
          specified, it is derived from the Receive Buffer Size Property,
          if known, and from the default buffer size otherwise.
          The buffer is considered to consist of a single non-DDP-targetable 
          segment whose size is the buffer size.  
        </t>
        <t>
          If the value of the Receive Buffer Size Property is not 
          specified and the Receive Buffer Structure Property is 
          specified, the value of the former is derived from the latter by
          adding up the lengths of all buffer segments specified.
        </t>
      </list>
      </t>
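      <t>
        The two derivation rules above can be sketched in Python.  This is
        a minimal illustration under stated assumptions: each group is
        modelled as a hypothetical (count, length, ddp_targetable) tuple,
        the count is taken to be the number of consecutive segments of 
        that length, and the default buffer size is supplied by the caller 
        rather than fixed here.
      </t>
      <figure align="left">
        <artwork xml:space="preserve" align="left"><![CDATA[
```python
def total_segment_length(groups):
    """Sum the lengths of all segments described by the groups.

    Each group is a (count, length, ddp_targetable) tuple.
    Assumption: 'count' segments in a row share that length.
    """
    return sum(count * length for count, length, _ in groups)


def resolve_receive_buffer(buffer_size, groups, default_size):
    """Return (size, groups) after applying the derivation rules.

    Pass None for a property that was not specified.
    """
    if groups is None:
        # Structure not specified: one non-DDP-targetable segment.
        if buffer_size is None:
            size = default_size
        else:
            size = buffer_size
        return size, [(1, size, False)]
    if buffer_size is None:
        # Size not specified: add up the segment lengths.
        return total_segment_length(groups), groups
    # Both specified: returned as-is; agreement is not checked.
    return buffer_size, groups
```
]]></artwork>
      </figure>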
    </section>
    <section title="Request Transmission Receive Limit Property"
             anchor="CHAR-rqrxl">
      <t>
        This property specifies the length of the longest request 
        message (in terms of number of transmissions) 
        that a responder will accept.
      <figure align="left">
        <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32           XPROP_REQRXLIM = 5;
typedef uint32         xpr_rqrxl;

&lt;CODE ENDS&gt;
         </artwork>
       </figure>       
      </t>
      <t>
        A requester can use this property to determine whether to
        send long requests by using message continuation or by
        using a position-zero read chunk. 
      </t>
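      <t>
        The requester's decision might be sketched as follows in Python.
        The sketch assumes, for simplicity, that every transmission can 
        carry the same number of payload bytes and ignores header 
        overhead; the function and parameter names are hypothetical.
      </t>
      <figure align="left">
        <artwork xml:space="preserve" align="left"><![CDATA[
```python
def choose_request_method(payload_len, capacity, rx_limit=1):
    """Pick how to send a request of payload_len bytes.

    capacity: payload bytes assumed to fit in one transmission.
    rx_limit: the responder's Request Transmission Receive Limit;
              the default of 1 matches the property table.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Ceiling division: transmissions needed to carry the payload.
    needed = max(1, -(-payload_len // capacity))
    if needed == 1:
        return "single transmission"
    if needed <= rx_limit:
        return "message continuation"
    return "position-zero read chunk"
```
]]></artwork>
      </figure>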
    </section>
    <section title="Response Transmission Send Limit Property"
             anchor="CHAR-rssxl">
      <t>
        This property specifies the length of the longest response 
        message (in terms of number of transmissions) 
        that a responder will generate.
      <figure align="left">
        <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32           XPROP_RESPSXLIM = 6;
typedef uint32         xpr_rssxl;

&lt;CODE ENDS&gt;
          </artwork>
        </figure>       
      </t>
    </section>

  </section>
    <section title="New Operations" anchor="OPS">
      <section title="Operations List" anchor="OPS-list">

        <t>
          The proposed new operations are set forth in <xref target="optab"/>
          below. In that table, the columns have the following values:
        <list style="symbols">
          <t>
            The column labeled "operation" specifies the particular operation. 
          </t>
          <t>
            The column labeled "#" specifies the value of opttype for this
            operation.
          </t>
          <t>
            The column labeled "XDR type" gives the XDR type of the data structure 
            used to describe the information in this new message type.  
            This data overlays the nominally opaque field optinfo in an
            RDMA_OPTIONAL message.
          </t>
          <t>
            The column labeled "msg" indicates whether this operation is
            followed (or not) by an RPC message payload (or something else).
          </t>
          <t>
            The column labeled "section" indicates the section (within this
            document) that explains the semantics and use of this optional 
            operation.
          </t>
        </list>
        </t>
        <texttable align="left" style="full" anchor="optab">
          <ttcol>
            operation
          </ttcol>
          <ttcol>
            #
          </ttcol>
          <ttcol>
            XDR type
          </ttcol>
          <ttcol>
            msg
          </ttcol>
          <ttcol>
            section
          </ttcol>
          <c>
            Transmit Request
          </c>
          <c>
            5
          </c>
          <c>
            optxmt_req 
          </c>
          <c>
            <xref target="oNote1" format="none">Note1</xref>
          </c>
          <c>
            <xref format="counter" target="OPS-xrq"/>
          </c>
          <c>
            Transmit Response
          </c>
          <c>
            6
          </c>
          <c>
            optxmt_resp
          </c>
          <c>
            <xref target="oNote1" format="none">Note1</xref>
          </c>
          <c>
            <xref format="counter" target="OPS-xrs"/>
          </c>
          <c>
            Transmit Continue
          </c>
          <c>
            7
          </c>
          <c>
            optxmt_cont
          </c>
          <c>
            <xref target="oNote2" format="none">Note2</xref>
          </c>
          <c>
            <xref format="counter" target="OPS-xmc"/>
          </c>
          <c>
            Report Error
          </c>
          <c>
            8
          </c>
          <c>
            optrept_err
          </c>
          <c>
            No.
          </c>
          <c>
            <xref format="counter" target="OPS-xme"/>
          </c>
        </texttable>
        <t>
          The following notes apply to the above table:
        <list style="numbers">
          <t anchor="oNote1">
            Contains an initial segment of the message payload stream for an
            RPC message, or the entire payload stream.  The 
            optxr[qs]_pslen field indicates the length of the section
            present.
          </t>
          <t anchor="oNote2">
            May contain a part of a message payload stream for an
            RPC message, although not the entire payload stream.  The 
            optxc_pslen field, if non-zero, indicates that this portion 
            is present, and gives the length of the section.
          </t>

        </list>
        </t>
      </section>
      <section title="Transmit Request Operation" 
               anchor="OPS-xrq">
        <t>
          The message definition for this operation is as follows: 
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32     ROPT_XMTREQ = 1;

struct optxmt_req {
        xmdp_grpinfo    optxrq_dp;
        xmdp_rsdgroup   optxrq_rsd;
        xms_grpxc       optxrq_count;
        xms_grpxc       optxrq_rsbuf;
        xmdp_pldisp     optxrq_pslen;

};        
 
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          The field optxrq_dp describes the fields in virtual XDR 
          stream which have been excised in forming the payload stream, 
          and information about where the corresponding bulk data is 
          located. 
        </t>
        <t>
          The field optxrq_rsd consists of information directing the responder
          as to how to construct the reply, in terms of DDP.  
        </t>
        <t>
          The field optxrq_count specifies the count of transmissions
          in this group of transmissions used to send a request.
        </t>
        <t>
          The field optrq_repch provides a way to transfer a reply chunk
          to the responder, by means of which a reply longer than 
          the inline size limit may be transferred.  Although not prohibited
          by the protocol, it is unlikely to be used in environments in which
          message continuation is supported.
        </t>
        <t>
          The field optxrq_pslen gives the length of the payload stream for the
          RPC transmitted.  The payload stream begins right after the end of 
          the optxmt_req and proceeds for optxrq_pslen bytes.  This can include
          crossing buffer segment boundaries.
        </t>
      </section>
      <section title="Transmit Response Operation" 
               anchor="OPS-xrs">
        <t>
          The message definition for this operation is as follows: 
        <figure align="left">
          <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32     ROPT_XMTRESP = 2;

struct optxmt_resp {
        xmdp_grpinfo    optxrs_dp;
        xms_grpxn       optxrs_count;
        xmdp_pldisp     optxrs_pslen;

};        
 
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          The field optxrs_dp describes the fields in virtual XDR 
          stream which have been excised in forming the payload stream, 
          and information about where the corresponding bulk data is 
          located. 
        </t>
        <t>
          The field optxrs_count specifies the count of transmissions
          in this group of transmissions used to send a reply.
        </t>
        <t>
          The field optxrs_pslen gives the length of the payload stream for the
          RPC transmitted.  The payload stream begins right after the end of 
          the optxmt_resp and proceeds for optxrs_pslen bytes.  This can include
          crossing buffer segment boundaries.
        </t>
      </section>
     <section title="Transmit Continue Operation" 
               anchor="OPS-xmc">
       <t>
         RPC-over-RDMA headers of this type are used to continue
         RPC messages begun by an RPC-over-RDMA message 
         of type ROPT_XMTREQ or ROPT_XMTRESP.  The xid field of this 
         message must match that in the initial transmission. 
       </t>
       <t>
         This operation needs to be supported for the message 
         continuation feature to be used. 
       </t>
       <t>
         The message definition for this operation is as follows: 
       <figure align="left">
         <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32     ROPT_XMTCONT = 3;

struct optxmt_cont {
        xms_grpxn       optxc_xnum;
        uint32          optxc_itype;
        xmdp_pldisp     optxc_pslen;
};        
 
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          The field optxc_xnum indicates the transmission number of
          this transmission
          within its transmission group.
        </t>
        <t>
          The field optxc_pslen gives the length of the section of
          the payload stream which is located in the current 
          RPC-over-RDMA transmission.  It is
          valid for this length to be zero, indicating that there is no
          portion of the payload stream in this transmission.
          Except when the length is zero, the payload stream 
          begins right after the end of 
          the optxmt_cont and proceeds for optxc_pslen bytes.  
          This can include crossing buffer segment boundaries.
          In any case, the payload streams for all transmissions
          within the same group are considered concatenated.
        </t>
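        <t>
          The concatenation rule can be sketched in Python as follows.  
          This is an illustration under stated assumptions: the initial 
          transmission is taken to be transmission number 1, with 
          continuations numbered from 2 up to the count announced in the 
          initial transmission.  That numbering, and the function name, 
          are hypothetical.
        </t>
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
```python
def reassemble_payload(initial_payload, count, continuations):
    """Concatenate the payload sections of one transmission group.

    initial_payload: bytes carried by the initial transmission.
    count:           number of transmissions in the group.
    continuations:   (xnum, payload) pairs in arrival order; a
                     payload may be empty (pslen of zero).
    """
    parts = [initial_payload]
    expected = 2  # assumption: the initial transmission is number 1
    for xnum, payload in continuations:
        if xnum != expected:
            raise ValueError(
                f"expected transmission {expected}, got {xnum}")
        parts.append(payload)
        expected += 1
    if expected - 1 != count:
        raise ValueError("transmission group is incomplete")
    return b"".join(parts)
```
]]></artwork>
        </figure>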
      </section>
    <section title="Error Reporting Operation" 
             anchor="OPS-xme">
       <t>
         This RPC-over-RDMA message type is used to signal the occurrence
         of errors that do not involve:
       <list style="numbers">
         <t>
           Transmission of a message that violates the rules specified
           in <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>.  
         </t>
         <t>
           Transmission of a message described in this document which
           does not conform to the XDR specified here.
         </t>
         <t>
           The transmission of a message, which, when assembled according
           to the rules here, cannot be decoded according to the XDR
           for the ULP.
         </t>
       </list>
       </t>
       <t>
         Such errors can arise if the rules specified in this document are
         not followed and can be the result of a mismatch between multiple
         transmissions, each of which is valid when considered on its own.
       </t>
       <t>
         The preliminary error-related definition is as follows: 
       <figure align="left">
         <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

enum optr_err {
        OPTRERR_BADHMT = 1,
        OPTRERR_BADOMT = 2,
        OPTRERR_BADCONT = 3,
        OPTRERR_BADSEQ = 4,
        OPTRERR_BADXID = 5,
        OPTRERR_BADOFF = 6,
        OPTRERR_BADTBSN = 7,
        OPTRERR_BADPL = 8
};

union optr_info switch(optr_err optre_which) {

  case OPTRERR_BADHMT:
  case OPTRERR_BADOMT:
  case OPTRERR_BADSEQ:
  case OPTRERR_BADXID:
        uint32          optri_expect;
        uint32          optri_current;

  case OPTRERR_BADCONT:
        void;


  case OPTRERR_BADTBSN:
  case OPTRERR_BADOFF:
  case OPTRERR_BADPL:
        uint32          optri_value;
        uint32          optri_min;
        uint32          optri_max;

};

&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          optr_err enumerates the various error conditions that might be
          reported.
        <list style="symbols">
          <t>
            OPTRERR_BADHMT indicates that a header message type other than
            the one expected was received.  In this context, a particular
            message type can be considered "expected" only because 
            of message or group continuation. 
          </t>
          <t>
            OPTRERR_BADOMT indicates that an optional message type other than
            the one expected was received.  In this context, a particular
            message type can be considered "expected" only because 
            of message or group continuation. 
          </t>
          <t>
            OPTRERR_BADCONT indicates that a continuation message was received
            when there was no reason to expect one.
          </t>
          <t>
            OPTRERR_BADSEQ indicates that a transmission sequence number 
            other than the one expected was received. 
          </t>
          <t>
            OPTRERR_BADXID indicates that an xid other than the one expected
            was received in a continuation context.
          </t>
          <t>
            OPTRERR_BADTBSN indicates that an invalid target buffer sequence 
            number was received.
          </t>
          <t>
            OPTRERR_BADOFF indicates that a bad offset was received as part 
            of an xmdp_loc.  This is typically because the offset is larger
            than the buffer segment size.
          </t>
          <t>
            OPTRERR_BADPL indicates that a bad value was received for the
            payload length.  This is typically because the length would make 
            the area devoted to the payload stream not a subset of the 
            actual transmission.
          </t>
        </list>
        </t>
        <t>
          The optr_info gives information about the specific invalid field 
          being reported.  The additional information given depends on the
          specific error.
        <list style="symbols">
          <t>
            For the errors OPTRERR_BADHMT, OPTRERR_BADOMT, OPTRERR_BADSEQ,
            and OPTRERR_BADXID, the expected and actual values of the field 
            are reported.
          </t>
          <t>
            For the error OPTRERR_BADCONT, no additional information is provided.
          </t>
          <t>
            For the errors OPTRERR_BADTBSN, OPTRERR_BADOFF, and OPTRERR_BADPL,
            the actual value together with a range of valid values is provided.
            When the actual value is within the valid range, it can be inferred 
            that the actual value is not properly aligned (e.g., not on a 32-bit
            boundary).
          </t>
        </list>
        </t>
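        <t>
          A receiver of such a report might interpret the value and range
          as in the following Python sketch.  It assumes that the valid 
          range is inclusive at both ends, which is not stated here, and 
          the function name is hypothetical.
        </t>
        <figure align="left">
          <artwork xml:space="preserve" align="left"><![CDATA[
```python
def classify_range_error(value, minimum, maximum):
    """Interpret an (optri_value, optri_min, optri_max) triple.

    Assumption: the valid range is inclusive at both ends.
    """
    if minimum <= value <= maximum:
        # In range, so the fault is inferred to be misalignment.
        return "misaligned"
    return "out of range"
```
]]></artwork>
        </figure>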

        <t>
          The message definition for this operation is as follows: 
        <figure align="left">
           <artwork xml:space="preserve" align="left">
&lt;CODE BEGINS&gt;

const uint32     ROPT_REPTERR = 4;

struct optrept_err {
        xms_id          optre_bad;
        xms_id          *optre_lead;
        optr_info       optre_info;
};        
 
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
        <t>
          The field optre_bad is a description of the transmission on which 
          the error was actually detected.
        </t>
        <t>
          The optional field optre_lead is a description of an 
          earlier transmission that might have led to the error reported.
        </t>
        <t>
          The field optre_info provides information about the specific
          invalid field being reported.
        </t>
      </section>
    </section>
    <section title="XDR" anchor="XDR">

      <t>
        This section contains an XDR <xref target="RFC4506"/>  description 
        of the proposed extension.
      </t>
      <t>
       This description is provided in a way that makes it simple to 
       extract into ready-to-use form.  The reader can apply the 
       following shell script to this document to produce a machine-readable 
       XDR description of the extension, which can be combined with the
       XDR for the base protocol to produce an XDR that includes the base 
       protocol together with the optional extensions.
     <figure align="left">
       <artwork xml:space="preserve" align="left">

&lt;CODE BEGINS&gt;

#!/bin/sh
grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

&lt;CODE ENDS&gt;

       </artwork>
      </figure>
      </t>
      <t>
        That is, if the above script is stored in a file called 
        "extract.sh" and this document is in a file called "ext.txt" then 
        the reader can do the following to extract an XDR description file
        for this extension: 
      <figure align="left">
        <artwork xml:space="preserve" align="left">

&lt;CODE BEGINS&gt;

sh extract.sh &lt; ext.txt &gt; xmitext.x

&lt;CODE ENDS&gt;

        </artwork>
      </figure>
      </t>
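      <t>
        Where a POSIX shell is not available, the same extraction can be 
        approximated in Python.  The sketch below is meant as an equivalent 
        in intent rather than a replacement for the script: the function 
        name is hypothetical, and it accepts any amount of leading 
        indentation before the marker.
      </t>
      <figure align="left">
        <artwork xml:space="preserve" align="left"><![CDATA[
```python
import re


def extract_xdr(lines):
    """Return the XDR lines from a document, markers removed.

    lines: the document's lines, without trailing newlines.
    """
    extracted = []
    for line in lines:
        # Keep only lines whose first non-blank text is the marker.
        if not re.match(r" *///", line):
            continue
        # Strip the marker plus one following space, if present.
        line = re.sub(r"^ */// ", "", line)
        # A bare marker becomes an empty line.
        line = re.sub(r"^ *///$", "", line)
        extracted.append(line)
    return extracted
```
]]></artwork>
      </figure>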
      <t>
        The XDR description for this extension can be combined
        with that for other extensions and that for the base protocol.
        While this is a complete description and can be processed
        by the XDR compiler, the result might not be usable to
        process the extended protocol, for a number of reasons:
      <list style="symbols">
        <t>
          The RPC-over-RDMA transport headers do not constitute an
          RPC program, and version negotiation and message selection
          are part of the XDR, rather than being external to it.
        </t>
        <t>
          Headers used for requests and replies are not necessarily 
          paired, as they would be in an RPC program.
        </t>
        <t>
          Header types defined as optional extensions overlay
          existing nominally opaque fields in the base protocol.
          While this overlay architecture allows code aware of the
          overlay relationships to have a more complete view of
          header structure, this overlay relationship cannot be expressed
          within the XDR language.
        </t>
      </list>
      </t>
      <section title="Code Component License" toc="default">
        <t>
          Code components extracted from this document must include 
          the following license text.  When the extracted XDR code is 
          combined with other complementary XDR code which itself has 
          an identical license, only a single copy of the license text 
          need be preserved.  
        <figure align="left">
          <artwork xml:space="preserve" align="left">

&lt;CODE BEGINS&gt;

/// /*
///  * Copyright (c) 2010, 2016 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The author of the code is: D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */

&lt;CODE ENDS&gt;

          </artwork>
        </figure> 
        </t>
      </section>
      <section title="XDR Proper for Extension">
        <t>
        <figure align="left">
          <artwork xml:space="preserve" align="left">


&lt;CODE BEGINS&gt;
/// /*******************************************************************
///  *******************************************************************
///  ** 
///  **  XDR for OPTIONAL protocol extension.
///  ** 
///  **  Includes support for both message continuation and send-based 
///  **  DDP. The latter is supported by a new structure for the 
///  **  specification of data placements which can be used for both 
///  **  send-based data placement and DDP using explicit RDMA
///  **  operations.
///  ** 
///  **  Extensions include:
///  ** 
///  **     o Four new transport properties.
///  **     o Four new OPTIONAL message types
///  **     
///  *******************************************************************
///  ******************************************************************/
///
/// /*******************************************************************
///  *
///  *                   Core XDR Definitions
///  *
///  ******************************************************************/

/// /* 
///  * General XDR preliminaries for these features.
///  */
/// typedef uint32  xms_grpxn;
/// typedef uint32  xms_grpxc;
/// 
/// /* 
///  * Basic XDR typedefs for the new approach to the specification of
///  * data placement.
///  */
/// typedef uint32  xmdp_itemlen;
/// typedef uint32  xmdp_pldisp;
/// typedef uint32  xmdp_vsdisp;
/// typedef uint32  xmdp_tbsn;
///  
/// /* 
///  * Define the possible types of data placement items.
///  */
/// enum xmdp_type {
///         XMPTYPE_EXRW = 1,
///         XMPTYPE_TBSN = 2,
///         XMPTYPE_CHOICE = 3,
///         XMPTYPE_BYSIZE = 4,
///         XMPTYPE_TOOSHORT = 5,
///         XMPTYPE_NOITEM = 6
/// };
/// 
/// /*
///  * XDR defining the placement of bulk items in the message being 
///  * sent.
///  */
/// union xmdp_loc switch(xmdp_type type) {
/// 
///         case XMPTYPE_EXRW:
///                 rpcrdma1_segment        xmdl_ex&lt;&gt;;
///         case XMPTYPE_TBSN:
///                 xmdp_itemlen            xmdl_offset;
///                 xmdp_tbsn               xmdl_bsnum&lt;&gt;;
///         case XMPTYPE_TOOSHORT:
///         case XMPTYPE_NOITEM:
///                 void;
/// };
/// 
/// 
/// 
/// struct xmdp_mitem {
///         xmdp_vsdisp     xmdmi_disp;
///         xmdp_itemlen    xmdmi_length;
///         xmdp_loc        xmdmi_where;
/// };
/// 
/// typedef xmdp_mitem     xmdp_grpinfo&lt;&gt;;
/// 
/// /*
///  * XDR defining the placement of bulk items in the response to the
///  * message being sent.
///  */
/// union xmdp_rsdloc switch(xmdp_type type) {
/// 
///         case XMPTYPE_EXRW:
///         case XMPTYPE_CHOICE:
///                 rpcrdma1_segment        xmdrsdl_ex&lt;&gt;;
///         case XMPTYPE_BYSIZE:
///                 xmdp_itemlen            xmdrsdl_dsdov;
///                 rpcrdma1_segment        xmdrsdl_bsex&lt;&gt;;
///         case XMPTYPE_TBSN:
///                 void;
/// };
/// 
/// struct xmdp_rsdrange {
///         xmdp_vsdisp     xmdrsdr_begin;
///         xmdp_vsdisp     xmdrsdr_end;
/// };
/// 
/// struct xmdp_rsditem {
///         xmdp_itemlen    xmdrsdi_minlen;
///         xmdp_rsdloc     xmdrsdi_loc;
/// };
/// 
/// struct xmdp_rsdset {
///         xmdp_rsdrange   xmdrsds_range;
///         xmdp_rsditem    xmdrsds_items&lt;&gt;;
/// };
/// 
/// typedef xmdp_rsdset     xmdp_rsdgroup&lt;&gt;;
///
/// /*******************************************************************
///  *
///  *                     New Transport Properties           
///  *
///  ******************************************************************/
///
/// /* 
///  * New Transport Property codes 
///  */ 
/// const uint32           XPROP_RTRSUPP = 3;
/// const uint32           XPROP_RBSTRUCT = 4;
/// const uint32           XPROP_REQRXLIM = 5;
/// const uint32           XPROP_RESPSXLIM = 6;
/// 
/// /*
///  * XDR relating to RTR Support Property
///  */
/// typedef uint32         xpr_rtrs;
/// 
/// const uint32           RTRS_XREQ = 1;
/// const uint32           RTRS_XRESP = 2;
/// const uint32           RTRS_XCONT = 4;
/// 
/// /* 
///  * Items related to Receive Buffer Structure Property
///  */
/// struct xmrbs_seg {
///         uint32          xmrseg_length;
///         uint32          xmrseg_align;
///         uint32          xmrseg_flags;
/// };
/// 
/// const uint32    XMRSFLAG_PLT = 0x01; 
/// 
/// struct xmrbs_group {
///         uint32          xmrgrp_count;
///         xmrbs_seg       xmrgrp_info;
/// };
/// 
/// struct xmrbs_buf {
///         uint32          xmrbuf_length;
///         xmrbs_group     xmrbuf_groups&lt;&gt;;
/// };
/// typedef xmrbs_buf      xpr_rbs;
/// 
/// /*
///  * XDR relating to transmission limit properties
///  */
/// typedef uint32         xpr_rqrxl;
/// 
/// typedef uint32         xpr_rssxl;
///
/// /*******************************************************************
///  *
///  *                     New OPTIONAL Message Types           
///  *
///  ******************************************************************/
///
/// /* 
///  * New message type codes 
///  */ 
/// const uint32     ROPT_XMTREQ = 1;
/// const uint32     ROPT_XMTRESP = 2;
/// const uint32     ROPT_XMTCONT = 3;
/// const uint32     ROPT_REPTERR = 4;
/// 
/// 
/// /*
///  * New message type to do the initial transmission of a request.
///  */ 
/// struct optxmt_req {
///         xmdp_grpinfo    optxrq_dp;
///         xmdp_rsdgroup   optxrq_rsd;
///         xms_grpxc       optxrq_count;
///         xms_grpxc       optxrq_rsbuf;
///         xmdp_pldisp     optxrq_pslen;
/// 
/// };        
///  
/// /*
///  * New message type to do the initial transmission of a response.
///  */ 
/// struct optxmt_resp {
///         xmdp_grpinfo    optxrs_dp;
///         xms_grpxn       optxrs_count;
///         xmdp_pldisp     optxrs_pslen;
/// 
/// };        
/// 
/// /*
///  * New message type to transmit the continuation of a request or
///  * response.
///  */ 
/// struct optxmt_cont {
///         xms_grpxn       optxc_xnum;
///         uint32          optxc_itype;
///         xmdp_pldisp     optxc_pslen;
/// };        
/// 
/// /*
///  * XDR definitions to support error reporting.
///  */ 
/// enum optr_err {
///         OPTRERR_BADHMT = 1,
///         OPTRERR_BADOMT = 2,
///         OPTRERR_BADCONT = 3,
///         OPTRERR_BADSEQ = 4,
///         OPTRERR_BADXID = 5,
///         OPTRERR_BADOFF = 6,
///         OPTRERR_BADTBSN = 7,
///         OPTRERR_BADPL = 8
/// };
/// 
/// union optr_info switch(optr_err optre_which) {
/// 
///   case OPTRERR_BADHMT:
///   case OPTRERR_BADOMT:
///   case OPTRERR_BADSEQ:
///   case OPTRERR_BADXID:
///         uint32          optri_expect;
///         uint32          optri_current;
/// 
///   case OPTRERR_BADCONT:
///         void;
/// 
/// 
///   case OPTRERR_BADTBSN:
///   case OPTRERR_BADOFF:
///   case OPTRERR_BADPL:
///         uint32          optri_value;
///         uint32          optri_min;
///         uint32          optri_max;
/// 
/// };
/// 
/// struct xms_id {
///         uint32         xmsi_xid;
///         msg_type       xmsi_dir;
///         xms_grpxn      xmsi_seq;
/// };
///
/// /*
///  * New message type for error reporting.
///  */ 
/// struct optrept_err {
///         xms_id          optre_bad;
///         xms_id          *optre_lead;
///         optr_info       optre_info;
/// };        
///  
/// 
&lt;CODE ENDS&gt;
          </artwork>
        </figure> 
        </t>
      </section>
    </section>
    <section title="Security Considerations" anchor="SEC">
      <t>
        The extension described here has the same security considerations 
        as those described in <xref target="RFC8166" /> and
        <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/>. 
        With regard to the transport
        properties introduced in this document, it is possible that
        a man-in-the-middle could interfere with the communication
        of transport properties with possible negative effects.  To
        prevent such interference, the steps described in 
        <xref target="I-D.cel-nfsv4-rpcrdma-version-two"/> 
        should be followed.
      </t>
      <t>
        The use of the techniques described in this document to reduce
        use of explicit RDMA operations raises important issues that
        implementers should consider:
      <list style="symbols">
        <t>
          While the use of these techniques may be expedient in certain cases,
          their
          use is not likely to be universal, at least for a considerable
          time.  As a result, implementers should remain aware of the
          issues discussed in Section 9.1 of <xref target="RFC8166" />,
          unless and until it is certain that none of a requester's 
          memory can be registered 
          for remote access.
        </t>
        <t>
          Extra care needs to be taken in cases in which padding 
          needs to be inserted in a
          transmission to ensure that a
          DDP-targetable data item will be received in an appropriately
          aligned buffer segment.  In some implementations,
          sensitive data could be inadvertently sent within the padding.
          To prevent this, the padding can be zeroed or it can be sent 
          from a pre-zeroed area using a gather list.  
        </t>
      </list>
      </t>
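      <t>
        The second approach to padding mentioned above, sending 
        it from a pre-zeroed area using a gather list, might be sketched 
        in C as shown below.  This is only an illustration under stated 
        assumptions and is not part of the protocol: the POSIX iovec 
        structure stands in for whatever scatter/gather element the 
        transport implementation actually uses, and the names PAD_MAX, 
        zero_pad, pad_needed, and build_gather are hypothetical.
        <figure>
          <artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical upper bound on padding; not taken from this document. */
#define PAD_MAX 4096

/*
 * Static storage is zero-initialized and is never written, so it
 * can be shared by every send that needs padding.
 */
static const uint8_t zero_pad[PAD_MAX];

/*
 * Bytes needed to advance offset to the next multiple of align.
 * align must be a power of two.
 */
static size_t pad_needed(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}

/*
 * Fill iov with header, zero padding (if any), and the data item.
 * Returns the number of entries used, or -1 if the padding needed
 * exceeds PAD_MAX.  No sender memory other than hdr, zero_pad, and
 * item is ever referenced, so stale data cannot leak into the pad.
 */
static int build_gather(struct iovec iov[3],
                        void *hdr, size_t hdr_len,
                        void *item, size_t item_len,
                        size_t align)
{
    size_t pad = pad_needed(hdr_len, align);
    int n = 0;

    if (pad > PAD_MAX)
        return -1;

    iov[n].iov_base = hdr;
    iov[n].iov_len = hdr_len;
    n++;

    if (pad != 0) {
        /* Cast drops const; the transport only reads this area. */
        iov[n].iov_base = (void *)zero_pad;
        iov[n].iov_len = pad;
        n++;
    }

    iov[n].iov_base = item;
    iov[n].iov_len = item_len;
    n++;

    return n;
}
```
]]></artwork>
        </figure>
        Compared with zeroing the padding in place, this choice avoids 
        writing to the send buffer on each transmission, at the cost 
        of one additional gather element whenever padding is needed.
      </t>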
    </section>
    <section title="IANA Considerations" anchor="IANA">
      <t>
        This document does not require any actions by IANA.
      </t>
    </section>
  </middle>
  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119.xml"?>
      <?rfc include="reference.RFC.4506.xml"?>
      <?rfc include="reference.RFC.8166.xml"?>

    </references>
    <references title="Informative References">
      <?rfc include="reference.RFC.5662.xml"?>
      <?rfc include="reference.RFC.5666.xml"?>
      <?rfc include="reference.RFC.5667.xml"?>
      <?rfc include="reference.RFC.8178.xml"?>
      <?rfc include="reference.I-D.cel-nfsv4-rpcrdma-version-two.xml"?>
    </references>
    <section title="Acknowledgments" anchor="ACK" numbered="no">
      <t>
        The author gratefully acknowledges the work of Brent Callaghan and
        Tom Talpey in producing the original RPC-over-RDMA Version One 
        specification <xref target="RFC5666" /> and also Tom's work in
        helping to clarify that specification. 
      </t>
      <t>
        The author also wishes to thank Chuck Lever for his work resurrecting 
        NFS support for RDMA in <xref target="RFC8166"/>, for 
        clarifying the relationship between RDMA and direct data placement,
        and for beginning the work on RPC-over-RDMA Version Two.
      </t>
      <t>
       The extract.sh shell script and formatting conventions were first
       described by the authors of the NFSv4.1 XDR specification 
       <xref target="RFC5662"/>.
      </t>
    </section>
  </back>
</rfc>
