NFSv4 (provisionally)                                          T. Talpey
Internet-Draft                                                 Microsoft
Updates: 5040 7306 (if approved)                               T. Hurson
Intended status: Standards Track                                   Intel
Expires: September 10, 2020                                   G. Agarwal
                                                                 Marvell
                                                                  T. Reu
                                                                 Chelsio
                                                           March 9, 2020

             RDMA Extensions for Enhanced Memory Placement
                      draft-talpey-rdma-commit-01

Abstract

This document specifies extensions to RDMA (Remote Direct Memory Access) protocols to provide capabilities in support of enhanced remotely-directed data placement on persistent memory-addressable devices. The extensions include new operations supporting remote commitment to persistence of remotely-managed buffers, which can provide enhanced guarantees and improve performance for low-latency storage applications. In addition to, and in support of these, extensions to local behaviors are described, which may be used to guide implementation, and to ease adoption. This document updates RFC5040 (Remote Direct Memory Access Protocol (RDMAP)) and updates RFC7306 (RDMA Protocol Extensions).

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at https://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on September 10, 2020. 51 Copyright Notice 53 Copyright (c) 2020 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (https://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 4 67 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 68 2.1. Requirements for RDMA Flush . . . . . . . . . . . . . . . 10 69 2.1.1. Non-Requirements . . . . . . . . . . . . . . . . . . 12 70 2.2. Requirements for Atomic Write . . . . . . . . . . . . . . 14 71 2.3. Requirements for RDMA Verify . . . . . . . . . . . . . . 15 72 2.4. Local Semantics . . . . . . . . . . . . . . . . . . . . . 16 73 3. RDMA Protocol Extensions . . . . . . . . . . . . . . . . . . 17 74 3.1. RDMAP Extensions . . . . . . . . . . . . . . . . . . . . 17 75 3.1.1. RDMA Flush . . . . . . . . . . . . . . . . . . . . . 20 76 3.1.2. RDMA Verify . . . . . . . . . . . . . . . . . . . . . 23 77 3.1.3. Atomic Write . . . . . . . . . . . . . . . . . . . . 25 78 3.1.4. Discovery of RDMAP Extensions . . . . . . . . . . . . 27 79 3.2. Local Extensions . . . . . . . . . . . . . . . . . . . . 28 80 3.2.1. Registration Semantics . . . . . . . . . . . . . . . 28 81 3.2.2. Completion Semantics . . . . . . . . . . . . . . . . 28 82 3.2.3. Platform Semantics . . . . . . . . . . . . . . . . . 29 83 4. Ordering and Completions Table . . . . . . . . . . . . . . . 29 84 5. Error Processing . . . . . . . . . . . . . . . . . . . . . . 30 85 5.1. Errors Detected at the Local Peer . . . . . . . . . . . . 30 86 5.2. Errors Detected at the Remote Peer . . . . . . . . . . . 31 87 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 88 7. Security Considerations . . . . . . . . . . . . . . . . . . . 31 89 8. To Be Added or Considered . . . . . . . . . . . . . . . . . . 32 90 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 33 91 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 92 10.1. Normative References . . . . . . . . . . . . . . . . . . 33 93 10.2. Informative References . . . . . . . . . . . . . . . . . 33 94 10.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 35 95 Appendix A. DDP Segment Formats for RDMA Extensions . . . . . . 35 96 A.1. DDP Segment for RDMA Flush Request . . . . . . . . . . . 35 97 A.2. DDP Segment for RDMA Flush Response . . . . . . . . . . . 35 98 A.3. DDP Segment for RDMA Verify Request . . . . . . . . . . . 36 99 A.4. 
DDP Segment for RDMA Verify Response . . . . . . . . . . 36 100 A.5. DDP Segment for Atomic Write Request . . . . . . . . . . 37 101 A.6. DDP Segment for Atomic Write Response . . . . . . . . . . 38 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 38 104 1. Introduction 106 The RDMA Protocol (RDMAP) [RFC5040] and RDMA Protocol Extensions 107 (RDMAPEXT) [RFC7306] provide capabilities for secure, zero copy data 108 communications that preserve memory protection semantics, enabling 109 more efficient network protocol implementations. The RDMA Protocol 110 is part of the iWARP family of specifications which also include the 111 Direct Data Placement Protocol (DDP) [RFC5041], and others as 112 described in the relevant documents. For additional background on 113 RDMA Protocol applicability, see "Applicability of Remote Direct 114 Memory Access Protocol (RDMA) and Direct Data Placement Protocol 115 (DDP)" RFC5045 [RFC5045]. 117 RDMA protocols are enjoying good success in improving the performance 118 of remote storage access, and have been well-suited to semantics and 119 latencies of existing storage solutions. However, new storage 120 solutions are emerging with much lower latencies, driving new 121 workloads and new performance requirements. Also, storage 122 programming paradigms SNIANVMP [SNIANVMP] are driving new 123 requirements of the remote storage layers, in addition to driving 124 down latency tolerances. Overcoming these latencies, and providing 125 the means to achieve persistence and/or visibility without invoking 126 upper layers and remote CPUs for each such request, are the 127 motivators for the extensions in this document. 129 This document specifies the following extensions to the RDMA Protocol 130 (RDMAP) and its local memory ecosystem: 132 o Flush - support for RDMA requests and responses with enhanced 133 placement semantics. 135 o Atomic Write - support for writing certain data elements into 136 memory in an atomically visible fashion. 138 o Verify - support for validating the contents of remote memory, 139 through use of integrity signatures. 141 o Enhanced memory registration semantics in support of persistence 142 and visibility. 144 The extensions defined in this document do not require the RDMAP 145 version to change. 147 1.1. Glossary 149 This document is an extension of RFC 5040 and RFC7306, and key words 150 are additionally defined in the glossaries of the referenced 151 documents. 153 The following additional terms are used in this document as defined. 155 Flush: The submitting of previously written data from volatile 156 intermediate locations for subsequent placement, in a persistent 157 and/or globally visible fashion. 159 Invalidate: The removal of data from volatile intermediate 160 locations. 162 Commit: Obsolescent previous synonym for Flush. Term to be deleted. 164 Persistent: The property that data is present, readable and remains 165 stable after recovery from a power failure or other fatal error in 166 an upper layer or hardware. , , [SCSI]. 170 Globally Visible: The property of data being available for reading 171 consistently by all processing elements on a system. Global 172 visibility and persistence are not necessarily causally related; 173 either one may precede the other, or they may take effect 174 simultaneously, depending on the architecture of the platform. 176 2. Problem Statement 178 RDMA is widely deployed in support of storage and shared memory over 179 increasingly low-latency and high-bandwidth networks. 
The state of 180 the art today yields end-to-end network latencies on the order of one 181 to two microseconds for message transfer, and bandwidths exceeding 182 100 gigabit/s. These bandwidths are expected to increase over time, 183 with latencies decreasing as a direct result. 185 In storage, another trend is emerging - greatly reduced latency of 186 persistently storing data blocks. While best-of-class Hard Disk 187 Drives (HDDs) have delivered average latencies of several 188 milliseconds for many years, Solid State Disks (SSDs) have improved 189 this by one to two orders of magnitude. Technologies such as NVM 190 Express NVMe [1] yield even higher-performing results by eliminating 191 the traditional storage interconnect. The latest technologies 192 providing memory-based persistence, such as Nonvolatile Memory DIMM 193 NVDIMM [2], places storage-like semantics directly on the memory bus, 194 reducing latency to less than a microsecond and increasing bandwidth 195 to potentially many tens of gigabyte/s. [supporting data to be added] 197 RDMA protocols, in turn, are used for many storage protocols, 198 including NFS/RDMA RFC5661 [RFC5661] RFC8166 [RFC8166] RFC8267 199 [RFC8267], SMB Direct MS-SMB2 [SMB3] MS-SMBD [SMBDirect] and iSER 200 RFC7145 [RFC7145], to name just a few. These protocols allow storage 201 and computing peers to take full advantage of these highly performant 202 networks and storage technologies to achieve remarkable throughput, 203 while minimizing the CPU overhead needed to drive their workloads. 204 This leaves more computing resources available for the applications, 205 which in turn can scale to even greater levels. Within the context 206 of Cloud-based environments, and through scale-out approaches, this 207 can directly reduce the number of servers that need to be deployed, 208 making such attributes highly compelling. 210 However, limiting factors come into play when deploying ultra-low 211 latency storage in such environments: 213 o The latency of the fabric, and of the necessary RDMA message 214 exchanges to ensure reliable transfer is now higher than that of 215 the storage itself. 217 o The requirement that storage be resilient to failure requires that 218 multiple copies be committed in multiple locations across the 219 fabric, adding extra hops which increase the latency and computing 220 demand placed on implementing the resiliency. 222 o Processing is required at the receiver in order to ensure that the 223 storage data has reached a persistent state, and acknowledge the 224 transfer so that the sender can proceed. 226 o Typical latency optimizations, such as polling a receive memory 227 location for a key that determines when the data arrives, can 228 create both correctness and security issues because this approach 229 requires the memory remain open to writes and therefore the buffer 230 may not remain stable after the application determines that the IO 231 has completed. This is of particular concern in security 232 conscious environments. 234 The first issue is fundamental, and due to the nature of serial, 235 shared communication channels, presents challenges that are not 236 easily bypassed. Communication cannot exceed the speed of light, for 237 example, and serialization/deserialization plus packet processing 238 adds further delay. Therefore, an RDMA solution which offloads and 239 reduces the overhead of exchanges which encounter such latencies is 240 highly desirable. 
The second issue requires that outbound transfers be made as efficient as possible, so that replication of data can be done with minimal overhead and delay (latency). A reliable "push" RDMA transfer method is highly suited to this.

The third issue requires that the transfer be performed without requiring an upper-layer exchange. Within security constraints, RDMA transfers, arbitrated only by lower layers into well-defined and pre-advertised buffers, present an ideal solution.

The fourth issue requires significant CPU activity, consuming power and valuable resources, and may not be guaranteed by the RDMA protocols, which make no requirement of the order in which certain received data is placed or becomes visible; such guarantees are made only after signaling a completion to upper layers.

The RDMAP and DDP protocols, together, provide data transfer semantics with certain consistency guarantees to both the sender and receiver. Delivery of data transferred by these protocols is said to have been Placed in destination buffers upon Completion of specific operations. In general, these guarantees are limited to the visibility of the transferred data within the hardware domain of the receiver (data sink). Significantly, the guarantees do not necessarily extend to the actual storage of the data in memory cells, nor do they convey any guarantee that the data integrity is intact, nor that it remains present after a catastrophic failure. These guarantees may be provided by upper layers, such as the ones mentioned, after processing the Completions, and performing the necessary operations.

The NFSv4.1, SMB3 and iSER protocols are, respectively, file and block oriented, and have been used extensively for providing access to hard disk and solid state flash drive media. Such devices incur certain latencies in their operation, from the millisecond-order rotational and seek delays of rotating disk hardware, or the 100-microsecond-order erase/write and translation layers of solid state flash. These file and block protocols have benefited from the increased bandwidth, lower latency, and markedly lower CPU overhead of RDMA to provide excellent performance for such media, approximately 30-50 microseconds for 4KB writes in leading implementations.

These protocols employ a "pull" model for write: the client, or initiator, sends an upper layer write request which contains an RDMA reference to the data to be written. The upper layer protocols encode this as one or more memory regions. The server, or target, then prepares the request for local write execution, and "pulls" the data with an RDMA Read. After processing the write, a response is returned. There are therefore two or more roundtrips on the RDMA network in processing the request. This is desirable for several reasons, as described in the relevant specifications, but it incurs latency. However, since, as mentioned, the network latency has been so much lower than that of the storage processing, this has been a sound approach.

Today, a new class of Storage Class Memory is emerging, in the form of Non-Volatile DIMM and NVM Express devices, among others. These devices are characterized by further reduced latencies, in the 10-microsecond-order range for NVMe, and sub-microsecond for NVDIMM.

The 30-50 microsecond write latencies of the above file and block protocols are therefore from one to two orders of magnitude larger than the storage media! The client/server processing model of traditional storage protocols is no longer amortizable at an acceptable level into the overall latency of storage access, due to the need for request/response communication, CPU processing by both the server and client (or target and initiator), and interrupts to signal such requests.

Another important property of certain such devices is the requirement for explicitly requesting that the data written to them be made persistent. Because persistence requires that data be committed to memory cells, it is a relatively expensive operation in time (and power), and in order to maintain the highest device throughput and most efficient operation, the device "commit" operation is explicit. When the data is written by an application on the local platform, this responsibility naturally falls to that application (and the CPU on which it runs). However, when data is written by current RDMA protocols, no such semantic is provided. As a result, upper layer stacks, and the target CPU, must be invoked to perform it, adding overhead and latency that is now highly undesirable.

When such devices are deployed as the remote server, or target, storage, and when such a persistence can be requested and guaranteed remotely, a new transfer model can be considered. Instead of relying on the server, or target, to perform requested processing and to reply after the data is persistently stored, it becomes desirable for the client, or initiator, to perform these operations itself. By altering the transfer models to support a "push mode", that is, by allowing the requestor to push data with RDMA Write and subsequently make it persistent, a full round trip can be eliminated from the operation. Additionally, the signaling and processing overheads at the remote peer (server or target) can be eliminated. This becomes an extremely compelling latency advantage.

In DDP (RFC5041), data is considered "placed" when it is submitted by the RNIC to the system. This operation is commonly an i/o bus write, e.g. via PCI. The submission is ordered, but there is no confirmation or necessary guarantee that the data has yet reached its destination, nor become visible to other devices in the system. The data will eventually do so, but possibly at a later time. The act of "delivery", on the other hand, offers a stronger semantic, guaranteeing not only that prior operations have been executed, but also that any data is in a consistent and visible state. Generally, however, such "delivery" requires raising a completion event, necessarily involving the host CPU. This is a relatively expensive, and latency-bound, operation. Some systems perform "DMA snooping" to provide a somewhat higher guarantee of visibility after delivery and without CPU intervention, but others do not. The RDMA requirements remain the same in either case; therefore, upper layers may make no broad assumption. Such platform behaviors, in any case, do not address persistence.

The extensions in this document primarily address a new "flush to persistence" RDMA operation.
This operation, when invoked by a 356 connected remote RDMA peer, can be used to request that previously- 357 written data be moved into the persistent storage domain. This may 358 be a simple flush to a memory cell, or it may require movement across 359 one or more busses within the target platform, followed by an 360 explicit persistence operation. Such matters are beyond the scope of 361 this specification, which provides only the mechanism to request the 362 operation, and to signal its successful completion. 364 In a similar vein, many applications desire to achieve visibility of 365 remotely-provided data, and to do so with minimum latency. One 366 example of such applications is "network shared memory", where 367 publish-subscribe access to network-accessible buffers is shared by 368 multiple peers, possibly from applications on the platform hosting 369 the buffers, and others via network connection. There may therefore 370 be multiple local devices accessing the buffer - for example, CPUs, 371 and other RNICs. The topology of the hosting platform may be 372 complex, with multiple i/o, memory, and interconnect busses, 373 requiring multiple intervening steps to process arriving data. 375 To address this, the extension additionally provides a "flush to 376 global visibility", which requires the RNIC to perform platform- 377 dependent processing in order to guarantee that the contents of a 378 specific range are visible for all devices that access them. On 379 certain highly-consistent platforms, this may be provided natively. 380 On others, it may require platform-specific processing, to flush data 381 from volatile caches, invalidate stale cached data from others, and 382 to empty queued pending operations. Ideally, but not universally, 383 this processing will take place without CPU intervention. With a 384 global visibility guarantee, network shared memory and similar 385 applications will be assured of broader compatibility and lower 386 latency across all hardware platforms. 388 Subsequently, many applications will seek to obtain a guarantee that 389 the integrity of the data has been preserved after it has been 390 flushed to a persistent or globally visible state. This may be 391 enforced at any time. Unlike traditional block-based storage, the 392 data provided by RDMA is neither structured nor segmented, and is 393 therefore not self-describing with respect to integrity. Only the 394 originator of the data, or an upper layer, is in possession of that. 395 Applications requiring such guarantees may include filesystem or 396 database logwriters, replication agents, etc. 398 To provide an additional integrity guarantee, a new operation is 399 provided by the extension, which will calculate, and optionally 400 compare an integrity value for an arbitrary region. The operation is 401 ordered with respect to preceding and subsequent operations, allowing 402 for a request pipeline without "bubbles" - roundtrip delays to 403 ascertain success or failure. 405 Finally, once data has been transmitted and directly placed by RDMA, 406 flushed to its final state, and its integrity verified, applications 407 will seek to commit the result with a transaction semantic. The 408 previous application examples apply here, logwriters and replication 409 are key, and both are highly latency- and integrity-sensitive. They 410 desire a pipelined transaction marker which is placed atomically to 411 indicate the validity of the preceding operations. 
They may require 412 that the data be in a persistent and/or globally visibile state, 413 before placing this marker. 415 Together the above discussion argues for a new "one sided" transfer 416 model supporting extended remote placement guarantees, provided by 417 the RDMA transport, and used directly by upper layers on a data 418 source, to control persistent storage of data on a remote data sink 419 without requiring its remote interaction. Existing, or new, upper 420 layers can use such a model in several ways, and evolutionary steps 421 to support persistence guarantees without required protocol changes 422 are explored in the remainder of this document. 424 Note that is intended that the requirements and concept of these 425 extensions can be applied to any similar RDMA protocol, and that a 426 compatible model can be applied broadly. 428 2.1. Requirements for RDMA Flush 430 The fundamental new requirement for extending RDMA protocols is to 431 define the property of _persistence_. This new property is to be 432 expressed by new operations to extend Placement as defined in 433 existing RDMA protocols. The RFC5040 protocols specify that 434 Placement means that the data is visible consistently within a 435 platform-defined domain on which the buffer resides, and to remote 436 peers across the network via RDMA to an adapter within the domain. 437 In modern hardware designs, this buffer can reside in memory, or also 438 in cache, if that cache is part of the hardware consistency domain. 439 Many designs use such caches extensively to improve performance of 440 local access. 442 Persistence, by contrast, requires that the buffer contents be 443 preserved across catastrophic failures. While it is possible for 444 caches to be persistent, they are typically not, or they provide the 445 persistence guarantee for a limited period of time, for example, 446 while backup power is applied. Efficient designs, in fact, lead most 447 implementations to simply make them volatile. In these designs, an 448 explicit flush operation (writing dirty data from caches), often 449 followed by an explicit commit (ensuring the data has reached its 450 destination and is in a persistent state), is required to provide 451 this guarantee. In some platforms, these operations may be combined. 453 For the RDMA protocol to remotely provide such guarantees, an 454 extension is required. Note that this does not imply support for 455 persistence or global visibility by the RDMA hardware implementation 456 itself; it is entirely acceptable for the RDMA implementation to 457 request these from another subsystem, for example, by requesting that 458 the CPU perform the flush and commit, or that the destination memory 459 device do so. But, in an ideal implementation, the RDMA 460 implementation will be able to act as a master and provide these 461 services without further work requests local to the data sink. Note, 462 it is possible that different buffers will require different 463 processing, for example one buffer may reside in persistent memory, 464 while another may place its blocks in a storage device. Many such 465 memory-addressable designs are entering the market, from NVDIMM to 466 NVMe and even to SSDs and hard drives. 468 Therefore, additionally any local memory registration primitive will 469 be enhanced to specify new optional placement attributes, along with 470 any local information required to achieve them. 
These attributes do 471 not explicitly traverse the network - like existing local memory 472 registration, the region is fully described by a { STag, Tagged 473 offset, length } descriptor, and such aspects of the local physical 474 address, memory type, protection (remote read, remote write, 475 protection key), etc are not instantiated in the protocol. Indeed, 476 each local RDMA implementation maintains these, and strictly performs 477 processing based on them, and they are not exposed to the peer. Such 478 considerations are discussed in the security model [RDMAP Security 479 [RFC5042]]. 481 Note, additionally, that by describing such attributes only through 482 the presence of an optional property of each region, it is possible 483 to describe regions referring to the same physical segment as a 484 combination of attributes, in order to enable efficient processing. 485 Processing of writes to regions marked as persistent, globally 486 visible, or neither ("ordinary" memory) may be optimized 487 appropriately. For example, such memory can be registered multiple 488 times, yielding multiple different Steering Tags which nonetheless 489 merge data in the underlying memory. This can be used by upper 490 layers to enable bulk-type processing with low overhead, by assigning 491 specific attributes through use of the Steering Tag. 493 When the underlying region is marked as persistent, that the 494 placement of data into persistence is guaranteed only after a 495 successful RDMA Flush directed to the Steering Tag which holds the 496 persistent attribute (i.e. any volatile buffering between the network 497 and the underlying storage has been flushed, and the appropriate 498 platform- and device-specific steps have been performed). 500 To enable the maximum generality, the RDMA Flush operation is 501 specified to act on a set of bytes in a region, specified by a 502 standard RDMA { STag, Tagged offset, length } descriptor. It is 503 required that each byte of the specified segment be in the requested 504 state before the response to the Flush is generated. However, 505 depending on the implementation, other bytes in the region, or in 506 other regions, may be acted upon as part of processing any RDMA 507 Flush. In fact, any data in any buffer destined for persistent 508 storage, may become persistent at any time, even if not requested 509 explicitly. For example, the host system may flush cache entries due 510 to cache pressure, or as part of platform housekeeping activities. 511 Or, a simple and stateless approach to flushing a specific range 512 might be for all data be flushed and made persistent, system-wide. A 513 possibly more efficient implementation might track previously written 514 bytes, or blocks with "dirty" bytes, and flush only those to 515 persistence. Either result provides the required guarantee. 517 The RDMA Flush operation provides a response but does not return a 518 status, or can result in an RDMA Terminate event upon failure. A 519 region permission check is performed first, and may fail prior to any 520 attempt to process data. The RDMA Flush operation may fail to make 521 the data persistent, perhaps due to a hardware failure, or a change 522 in device capability (device read-only, device wear, etc). The 523 device itself may support an integrity check, similar to modern error 524 checking and corection (ECC) memory or media error detection on hard 525 drive surfaces, which may signal failure. 
Or, the request may exceed 526 device limits in size or even transient attribute such as temporary 527 media failure. The behavior of the device itself is beyond the scope 528 of this specification. 530 Because the RDMA Flush involves processing on the local platform and 531 the actual storage device, in addition to being ordered with certain 532 other RDMA operations, it is expected to take a certain time to be 533 performed. For this reason, the operation is required to be defined 534 as a "queued" operation on the RDMA device, and therefore also the 535 protocol. The RDMA protocol supports RDMA Read (RFC5040) and Atomic 536 (RFC7306) in such a fashion. The iWARP family defines a "queue 537 number" with queue-specific processing that is naturally suited for 538 this. Queuing provides a convenient means for supporting ordering 539 among other operations, and for flow control. Flow control for RDMA 540 Reads and Atomics on any given Queue Pair share incoming and outgoing 541 crediting depths ("IRD/ORD"); operations in this specification share 542 these values and do not define their own separate values. 544 2.1.1. Non-Requirements 546 The extension does not include a "RDMA Write to persistence", that 547 is, a modifier on the existing RDMA Write operation. While it might 548 seem a logical approach, several issues become apparent: 550 The existing RDMA Write operation is a tagged DDP request which is 551 unacknowledged at the DDP layer (RFC5042). Requiring it to 552 provide an indication of remote persistence would require it to 553 have an acknowledgement, which would be an undesirable extension 554 to the existing defined operation. 556 Such an operation would require flow control and therefore also 557 buffering on the responding peer. Existing RDMA Write semantics 558 are not flow controlled and as tagged transfers are by design 559 zero-copy i.e. unbuffered. Requiring these would introduce 560 potential pipeline stalls and increase implementation complexity 561 in a critical performance path. 563 The operation at the requesting peer would stall until the 564 acknowledgement of completion, significantly changing the semantic 565 of the existing operation, and complicating software by blocking 566 the send work queue, a significant new semantic for RDMA Write 567 work requests. As each operation would be self-describing with 568 respect to persistence, individual operations would therefore 569 block with differing semantics and complicate the situation even 570 further. 572 Even for the possibly-common case of flushing after every write, 573 it is highly undesirable to impose new optional semantics on an 574 existing operation, and therefore also on the upper layer protocol 575 implementation. And, the same result can be achieved by sending 576 the Flush merged in the same network packet, and since the RDMA 577 Write is unacknowledged while the RDMA Flush is always replied-to, 578 no additional overhead is imposed on the combined exchange. 580 For these reasons, it is deemed a non-requirement to extend the 581 existing RDMA Write operation. 583 Similarly, the extension does not consider the use of RDMA Read to 584 implement Flush. Historically, an RDMA Read has been used by 585 applications to ensure that previously written data has been 586 processed by the responding RNIC and has been submitted for ordered 587 Placement. 
However, this is inadequate for implementing the required RDMA Flush:

RDMA Read guarantees only that previously written data has been Placed; it provides no guarantee that the data has reached its destination buffer. In practice, an RNIC satisfies the RDMA Read requirement by simply issuing all PCIe Writes prior to issuing any PCIe Reads.

Such PCIe Reads must be issued by the RNIC after all such PCIe Writes; therefore, flushing a large region requires the RNIC and its attached bus to strictly order (and not cache) its writes, to "scoreboard" its writes, or to perform PCIe Reads to the entire region. The former approach is significantly complex and expensive, and the latter approach requires a large amount of PCIe and network read bandwidth, which is often unnecessary and expensive. The Reads, in any event, may be satisfied by platform-specific caches, never actually reaching the destination memory or other device.

The RDMA Read may begin execution at any time once the request is fully received, queued, and the prior RDMA Write requirement has been satisfied. This means that the RDMA Read operation may not be ordered with respect to other queued operations, such as Verify and Atomic Write, in addition to other RDMA Flush operations.

The RDMA Read has no specific error semantic to detect failure, and the response may be generated from any cached data in a consistently Placed state, regardless of where it may reside. For this reason, an RDMA Read may proceed without necessarily verifying that a previously ordered "flush" has succeeded or failed.

RDMA Read is heavily used by existing RDMA consumers, and the semantics are therefore implemented by the existing specification. For new applications to further expect an extended RDMA Read behavior would require an upper layer negotiation to determine if the data sink platform and RNIC appropriately implemented it, or to silently ignore the requirement, with the resulting failure to meet the requirement. An explicit extension, rather than depending on an overloaded side effect, ensures this will not occur.

Again, for these reasons, it is deemed a non-requirement to reuse or extend the existing RDMA Read operation.

Therefore, no changes to existing specified RDMA operations are proposed, and the protocol is unchanged if the extensions are not invoked.

2.2. Requirements for Atomic Write

The persistence of data is a key property by which applications implement transactional behavior. Transactional applications, such as databases and log-based filesystems, among many others, implement a "two phase commit" wherein a write is made durable, and *only upon success*, a validity indicator for the written data is set. Such semantics are challenging to provide over an RDMA fabric, as it exists today. The RDMA Write operation does not generate an acknowledgement at the RDMA layers. And, even when an RDMA Write is delivered, if the destination region is persistent, its data can be made persistent at any time, even before a Flush is requested. Out-of-order DDP processing, packet fragmentation, and other matters of scheduling transfers can introduce partial delivery and ordering differences. If a region is made persistent, or even globally visible, before such sequences are complete, significant application-layer inconsistencies can result. Therefore, applications may require fine-grained control over the placement of bytes. In current RDMA storage solutions, these semantics are implemented in upper layers, potentially with additional upper layer message signaling, and corresponding roundtrips and blocking behaviors.

In addition to controlling placement of bytes, the ordering of such placement can be important. By providing an ordered relationship among write and flush operations, a basic transaction scenario can be constructed, in a way which can function with equal semantics both locally and remotely. In a "log-based" scenario, for example, a relatively large segment (log "record") is placed, and made durable. Once persistence of the segment is assured, a second small segment (log "pointer") is written, and optionally also made persistent. The visibility of the second segment is used to imply the validity, and persistence, of the first. Any sequence of such log-operation pairs can thereby always have a single valid state. In case of failure, the resulting string (log) of transactions can therefore be recovered up to and including the final state.

Such semantics are typically a challenge to implement on general purpose hardware platforms, and a variety of application approaches have become common. Generally, they employ a small, well-aligned atom of storage for the second segment (the one used for validity). For example, an integer or pointer, aligned to natural memory address boundaries and CPU and other cache attributes, is stored using instructions which provide for atomic placement. Existing RDMA protocols, however, provide no such capability.

This document specifies an Atomic Write extension, which, appropriately constrained, can serve to provide similar semantics. A small (64 bit) payload, sent in a request which is ordered with respect to prior RDMA Flush operations on the same stream and targeted at a segment which is aligned such that it can be placed in a single hardware operation, can be used to satisfy the previously described scenario. Note that the visibility of this payload can also serve as an indication that all prior operations have succeeded, enabling a highly efficient application-visible memory semaphore.

2.3. Requirements for RDMA Verify

An additional matter remains with persistence - the integrity of the persistent data. Typically, storage stacks such as filesystems, media approaches such as SCSI T10 DIF, or filesystem integrity checks such as ZFS provide for block- or file-level protection of data at rest on storage devices. With RDMA protocols and physical memory, no such stacks are present. And, to add such support would introduce CPU processing and its inherent latency, counter to the goals of the remote storage approach. Requiring the peer to verify by remotely reading the data is prohibitive in both bandwidth and latency, and, without additional mechanisms to ensure the actual stored data is read (and not a copy in some volatile cache), cannot provide the necessary result.

To address this, an integrity operation is required. The integrity check is initiated by the upper layer or application, which optionally computes the expected hash of a given segment of arbitrary size and sends the hash via an RDMA Verify operation targeting the RDMA segment on the responder; the responder then calculates, and optionally verifies, the hash on the indicated data, bypassing any volatile copies remaining in caches. The responder responds with its computed hash value, or, optionally, terminates the connection with an appropriate error status upon mismatch. Specifying this optional termination behavior enables a transaction to be sent as WRITE-FLUSH-VERIFY-ATOMICWRITE, without any pipeline bubble. The result (carried by the subsequently ordered ATOMIC_WRITE) will not be committed as valid if any prior operation is terminated, and in this case, recovery can be initiated by the requestor immediately from the point of failure. On the other hand, an errorless "scrub" can be implemented without the optional termination behavior, by providing no value for the expected hash. The responder will return the computed hash of the contents.

The hash algorithm is not specified by the RDMA protocol; instead, it is left to the upper layer to select an appropriate choice based upon the strength, security, length, support by the RNIC, and other criteria. The size of the resulting hash is therefore also not specified by the RDMA protocol, but is dictated by the hash algorithm. The RDMA protocol becomes simply a transport for exchanging the values.

It should be noted that the design of the operation, passing the hash value from requestor to responder (instead of, for example, computing it at the responder and simply returning it), allows both peers to determine immediately whether the segment is considered valid, permitting local processing by both peers if that is not the case. For example, a known-bad segment can be immediately marked as such ("poisoned") by the responder platform, requiring recovery before permitting access. [cf ACPI, JEDEC, SNIA NVMP specifications]

2.4. Local Semantics

The new operations imply new access methods ("verbs") to local persistent memory which backs registrations. Registrations of memory which support persistence will follow all existing practices to ensure permission-based remote access. The RDMA protocols do not expose these permissions on the wire; instead, they are contained in local memory registration semantics. Existing attributes are Remote Read and Remote Write, which are granted individually through local registration on the machine. If an RDMA Read or RDMA Write operation arrives which targets a segment without the appropriate attribute, the connection is terminated.

In support of the new operations, new memory attributes are needed. For RDMA Flush, two "Flushable" attributes provide permission to invoke the operation on memory in the region for persistence and/or global visibility. When registering, along with the attribute, additional local information can be provided to the RDMA layer, such as the type of memory, the necessary processing to make its contents persistent, etc. If the attribute is requested for memory which cannot be persisted, it also allows the local provider to return an error to the upper layer, obviating the need for the upper layer to provide the region to the remote peer.
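As a non-normative illustration of the registration semantics described above (and of the "Verifiable" attribute described next), a local verbs-style interface might expose the new placement attributes as registration flags. The flag, type, and function names below are invented for this sketch; this document does not define a local API.

<CODE BEGINS>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration flags; the "Flushable" and "Verifiable"
 * permissions are purely local and never appear on the wire. */
enum rdma_access_flags {
    RDMA_ACCESS_REMOTE_READ      = 1 << 0,
    RDMA_ACCESS_REMOTE_WRITE     = 1 << 1,
    RDMA_ACCESS_FLUSH_PERSISTENT = 1 << 2,  /* RDMA Flush to persistence       */
    RDMA_ACCESS_FLUSH_GLOBAL_VIS = 1 << 3,  /* RDMA Flush to global visibility */
    RDMA_ACCESS_VERIFIABLE       = 1 << 4,  /* RDMA Verify permitted           */
};

/* What is actually advertised to the peer: only the region descriptor. */
struct rdma_region_desc {
    uint32_t stag;            /* Steering Tag             */
    uint64_t tagged_offset;   /* base Tagged Offset       */
    uint64_t length;          /* region length, in octets */
};

/* Hypothetical registration call.  A local provider would fail this
 * request (rather than a later remote operation) if the memory cannot
 * honor the requested attributes, e.g. a Flushable registration of
 * memory that cannot be persisted, or a Verifiable registration with
 * an unsupported hash algorithm. */
int rdma_register_region(void *addr, size_t len, unsigned access_flags,
                         const char *hash_algorithm /* for Verifiable */,
                         struct rdma_region_desc *out_desc);
<CODE ENDS>

Registering the same memory several times with different attribute combinations would yield distinct Steering Tags over the same underlying bytes, as described in Section 2.1.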
For RDMA Verify, the "Verifiable" attribute provides permission to compute the hash of memory in the region. Again, along with the attribute, additional information such as the hash algorithm for the region is provided to the local operation. If the attribute is requested for non-persistent memory, or if the hash algorithm is not available, the local provider can return an error to the upper layer. In the case of success, the upper layer can exchange the necessary information with the remote peer. Note that the algorithm is not identified by the on-the-wire operation as a result. Establishing the choice of hash for each region is done by the local consumer, and each hash result is merely transported by the RDMA protocol. Memory can be registered under multiple regions if differing hashes are required; for example, unique keys may be provisioned to implement secure hashing. Also note that, for certain "reversible" hash algorithms, this may allow peers to effectively read the memory; therefore, the local platform may require additional read permissions to be associated with the Verifiable permission when such algorithms are selected.

The Atomic Write operation requires no new attributes; however, it does require the "Remote Write" attribute on the target region, as is true for any remotely requested write. If the Atomic Write additionally targets a Flushable region, the RDMA Flush is performed separately. It is generally not possible to achieve persistence atomically with placement, even locally.

3. RDMA Protocol Extensions

The extensions in this document fall into two categories:

o Protocol extensions

o Local behavior extensions

These categories are described, and may be implemented, separately.

3.1. RDMAP Extensions

The wire-related aspects of the extensions are discussed in this section. This document defines the following new RDMA operations.

For reference, Figure 1 depicts the format of the DDP Control and RDMAP Control Fields, in the style and convention of RFC5040 and RFC7306:

The DDP Control Field consists of the T (Tagged), L (Last), Resrv, and DV (DDP protocol Version) fields, as defined in RFC5041. The RDMAP Control Field consists of the RV (RDMA Version), Rsv, and Opcode fields, as defined in RFC5040. No change or extension is made to these fields by this specification.

This specification adds values for the RDMA Opcode field to those specified in RFC5040. Table 1 defines the new values of the RDMA Opcode field that are used for the RDMA Messages defined in this specification.

As shown in Table 1, STag (Steering Tag) and Tagged Offset are valid only for certain RDMA Messages defined in this specification. Table 1 also shows the appropriate Queue Number for each Opcode.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|T|L| Resrv | DV| RV|R| Opcode  |
| | |       |   |   |s|         |
| | |       |   |   |v|         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Invalidate STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 DDP Control and RDMAP Control Fields

All RDMA Messages defined in this specification MUST carry the following values:

o The RDMA Version (RV) field: 01b.

o Opcode field: Set to one of the values in Table 1.

o Invalidate STag: Set to zero, or optionally to non-zero by the sender, processed by the receiver.

Note: N/A in the table below means Not Applicable

-------+------------+-------+------+-------+-----------+-------------
 RDMA  | Message    | Tagged| STag | Queue | Invalidate| Message
 Opcode| Type       | Flag  | and  | Number| STag      | Length
       |            |       | TO   |       |           | Communicated
       |            |       |      |       |           | between DDP
       |            |       |      |       |           | and RDMAP
-------+------------+-------+------+-------+-----------+-------------
01100b | RDMA Flush |   0   | N/A  |   1   |    opt    | Yes
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01101b | RDMA Flush |   0   | N/A  |   3   |    N/A    | No
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01110b | RDMA Verify|   0   | N/A  |   1   |    opt    | Yes
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01111b | RDMA Verify|   0   | N/A  |   3   |    N/A    | Yes
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
10000b | Atomic     |   0   | N/A  |   1   |    opt    | Yes
       | Write      |       |      |       |           |
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
10001b | Atomic     |   0   | N/A  |   3   |    N/A    | No
       | Write      |       |      |       |           |
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------

                  Additional RDMA Usage of DDP Fields

This extension adds RDMAP use of Queue Number 1 for Untagged Buffers for issuing RDMA Flush, RDMA Verify and Atomic Write Requests, and use of Queue Number 3 for Untagged Buffers for tracking the respective Responses.

All other DDP and RDMAP Control Fields are set as described in RFC5040 and RFC7306.

Table 3 defines which RDMA Headers are used on each new RDMA Message and which new RDMA Messages are allowed to carry ULP payload.

-------+------------+-------------------+-------------------------
 RDMA  | Message    | RDMA Header Used  | ULP Message allowed in
 Message| Type      |                   | the RDMA Message
 OpCode|            |                   |
-------+------------+-------------------+-------------------------
01100b | RDMA Flush | None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01101b | RDMA Flush | None              | No
       | Response   |                   |
-------+------------+-------------------+-------------------------
01110b | RDMA Verify| None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01111b | RDMA Verify| None              | No
       | Response   |                   |
-------+------------+-------------------+-------------------------
10000b | Atomic     | None              | No
       | Write      |                   |
       | Request    |                   |
-------+------------+-------------------+-------------------------
10001b | Atomic     | None              | No
       | Write      |                   |
       | Response   |                   |
-------+------------+-------------------+-------------------------

                       RDMA Message Definitions

3.1.1. RDMA Flush

The RDMA Flush operation requests that all bytes in a specified region are to be made persistent and/or globally visible, under control of specified flags. As specified in section 4, its operation is ordered after the successful completion of any previously requested RDMA Write or certain other operations. The response is generated after the region has reached its specified state. The implementation MUST fail the operation and send a terminate message if the RDMA Flush cannot be performed or has encountered an error.

The RDMA Flush operation MUST NOT be completed by the data sink until all data has attained the requested state. Achieving persistence may require programming and/or flushing of device buffers, while achieving global visibility may require flushing of cached buffers across the entire platform interconnect. In no event are persistence and global visibility achieved atomically; one may precede the other, and either may complete at any time. The Atomic Write operation may be used by an upper layer consumer to indicate that either or both dispositions are available after completion of the RDMA Flush, in addition to other approaches.

3.1.1.1. RDMA Flush Request Format

The RDMA Flush Request Message makes use of the DDP Untagged Buffer Model. RDMA Flush Request messages MUST use the same Queue Number as RDMA Read Requests and RDMA Extensions Atomic Operation Requests (QN=1). Reusing the same queue number for RDMA Flush Requests allows the operations to reuse the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read Requests.

The RDMA Flush Request Message carries a payload that describes the ULP Buffer address in the Responder's memory. The following figure depicts the Flush Request that is used for all RDMA Flush Request Messages:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Data Sink STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Data Sink Length                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Data Sink Tagged Offset                    |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Flush Disposition Flags                  |G|P|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Flush Request

Data Sink STag: 32 bits The Data Sink STag identifies the Remote Peer's Tagged Buffer targeted by the RDMA Flush Request. The Data Sink STag is associated with the RDMAP Stream through a mechanism that is outside the scope of the RDMAP specification.

Data Sink Length: The Data Sink Length is the length, in octets, of the bytes targeted by the RDMA Flush Request.

Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset specifies the starting offset, in octets, from the base of the Remote Peer's Tagged Buffer targeted by the RDMA Flush Request.

Flags: Flags specifying the disposition of the flushed data: 0x01 Flush to Persistence, 0x02 Flush to Global Visibility.

3.1.1.2. RDMA Flush Response

The RDMA Flush Response Message makes use of the DDP Untagged Buffer Model. RDMA Flush Response messages MUST use the same Queue Number as RDMA Extensions Atomic Operation Responses (QN=3). No payload is passed to the DDP layer on Queue Number 3.

Upon successful completion of RDMA Flush processing, an RDMA Flush Response MUST be generated.

If, during RDMA Flush processing on the Responder, an error is detected which would result in the requested region not achieving the requested disposition, the Responder MUST generate a Terminate message. The contents of the Terminate message are defined in Section 5.2.
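For illustration only, the Flush Request payload described in Section 3.1.1.1 can be rendered as the following C structure. The struct and macro names are invented for this sketch; all fields are carried on the wire in big-endian order, and a real implementation would apply the appropriate byte-order conversion and packing.

<CODE BEGINS>
#include <stdint.h>

/* Non-normative sketch of the RDMA Flush Request payload (Section
 * 3.1.1.1).  Field names follow the figure; all fields appear on the
 * wire in network (big-endian) byte order. */

#define RDMA_FLUSH_TO_PERSISTENCE        0x01  /* "P" flag */
#define RDMA_FLUSH_TO_GLOBAL_VISIBILITY  0x02  /* "G" flag */

struct rdma_flush_request {
    uint32_t data_sink_stag;           /* Remote Peer's Tagged Buffer STag  */
    uint32_t data_sink_length;         /* length, in octets, to be flushed  */
    uint64_t data_sink_tagged_offset;  /* starting offset within the buffer */
    uint32_t flush_disposition_flags;  /* OR of the flag values above       */
};

/* The RDMA Flush Response (Section 3.1.1.2) carries no payload; success
 * is conveyed by the Response itself, and failure by a Terminate
 * message. */
<CODE ENDS>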
1003 3.1.1.3. RDMA Flush Ordering and Atomicity 1005 Ordering and completion rules for RDMA Flush Request are similar to 1006 those for an Atomic operation as described in section 5 of RFC7306. 1007 The queue number field of the RDMA Flush Request for the DDP layer 1008 MUST be 1, and the RDMA Flush Response for the DDP layer MUST be 3. 1010 There are no ordering requirements for the placement of the data, nor 1011 are there any requirements for the order in which the data is made 1012 globally visible and/or persistent. Data received by prior 1013 operations (e.g. RDMA Write) MAY be submitted for placement at any 1014 time, and persistence or global visibility MAY occur before the flush 1015 is requested. After placement, data MAY become persistent or 1016 globally visible at any time, in the course of operation of the 1017 persistency management of the storage device, or by other actions 1018 resulting in persistence or global visibility. 1020 Any region segment specified by the RDMA Flush operation MUST be made 1021 persistent and/or globally visible before successful return of the 1022 operation. If RDMA Flush processing is successful on the Responder, 1023 meaning the requested bytes of the region are, or have been made 1024 persistent and/or globally visible, as requested, the RDMA Flush 1025 Response MUST be generated. 1027 There are no atomicity guarantees provided on the Responder's node by 1028 the RDMA Flush Operation with respect to any other operations. While 1029 the Completion of the RDMA Flush Operation ensures that the requested 1030 data was placed into, and flushed from the target Tagged Buffer, 1031 other operations might have also placed or fetched overlapping data. 1032 The upper layer is responsible for arbitrating any shared access. 1034 (Sidebar) It would be useful to make a statement about other RDMA 1035 Flush to the target buffer and RDMA Read from the target buffer on 1036 the same connection. Use of QN 1 for these operations provides 1037 ordering possibilities which imply that they will "work" (see #7 1038 below). NOTE: this does not, however, extend to RDMA Write, which is 1039 not queued nor sequenced and therefore does not employ a DDP QN. 1041 3.1.2. RDMA Verify 1043 The RDMA Verify operation requests that all bytes in a specified 1044 region are to be read from the underlying storage and that an 1045 integrity hash be calculated. As specified in section 4 its 1046 operation is ordered after the successful completion of any previous 1047 requested RDMA Write and RDMA Flush, or certain other operations. 1048 The implementation MUST fail the operation and send a terminate 1049 message if the RDMA Verify cannot be performed, has encountered an 1050 error, or if a hash value was provided in the request and the 1051 calculated hash does not match. If no condition for a Terminate 1052 message is encountered, the response is generated containing the 1053 result calculated hash value. 1055 3.1.2.1. RDMA Verify Request Format 1057 The RDMA Verify Request Message makes use of the DDP Untagged Buffer 1058 Model. RDMA Verify Request messages MUST use the same Queue Number 1059 as RDMA Read Requests and RDMA Extensions Atomic Operation Requests 1060 (QN=1). Reusing the same queue number for RDMA Read and RDMA Flush 1061 Requests allows the operations to reuse the same RDMA infrastructure 1062 (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow 1063 control) as that defined for those requests. 
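Before the request format is described in detail, the following non-normative sketch shows how a requester might use RDMA Verify: the expected value is computed locally with the upper-layer-selected algorithm and carried in the request, so a mismatch is reported by connection termination rather than by an additional round trip. The ulp_hash() and post_rdma_verify() names are invented for this sketch and are not defined by this specification.

<CODE BEGINS>
#include <stddef.h>
#include <stdint.h>

struct qp;   /* an established RDMAP Stream (placeholder type) */

/* Hypothetical helpers: the hash algorithm is selected by the upper
 * layer at registration time; RDMAP merely transports the octet string. */
size_t ulp_hash(const void *buf, size_t len, uint8_t *out, size_t out_max);
int post_rdma_verify(struct qp *qp, uint32_t stag, uint64_t to, uint32_t len,
                     const uint8_t *expected_hash, size_t hash_len);

int verify_remote_segment(struct qp *qp, uint32_t stag, uint64_t to,
                          const void *local_copy, uint32_t len)
{
    uint8_t expected[64];

    /* Compute the expected value over the requester's local copy of the
     * data previously written (and flushed) to the remote region. */
    size_t hash_len = ulp_hash(local_copy, len, expected, sizeof(expected));

    /* Supplying the expected hash asks the responder to compare and to
     * terminate the connection on mismatch.  Passing NULL/0 instead
     * requests an errorless "scrub": the responder simply returns the
     * hash it computed over the region. */
    return post_rdma_verify(qp, stag, to, len, expected, hash_len);
}
<CODE ENDS>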
3.1.2.1.  RDMA Verify Request Format

The RDMA Verify Request Message makes use of the DDP Untagged Buffer
Model.  RDMA Verify Request messages MUST use the same Queue Number
as RDMA Read Requests and RDMA Extensions Atomic Operation Requests
(QN=1).  Reusing the same queue number allows RDMA Verify to reuse
the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read
Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read and
RDMA Flush Requests.

The RDMA Verify Request Message carries a payload that describes the
ULP Buffer address in the Responder's memory.  The following figure
depicts the Verify Request that is used for all RDMA Verify Request
Messages:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data Sink STag                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Data Sink Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Data Sink Tagged Offset                    |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Hash Value (optional, variable)                |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Verify Request

Data Sink STag: 32 bits
   The Data Sink STag identifies the Remote Peer's Tagged Buffer
   targeted by the Verify Request.  The Data Sink STag is associated
   with the RDMAP Stream through a mechanism that is outside the
   scope of the RDMAP specification.

Data Sink Length: 32 bits
   The Data Sink Length is the length, in octets, of the region
   targeted by the Verify Request.

Data Sink Tagged Offset: 64 bits
   The Data Sink Tagged Offset specifies the starting offset, in
   octets, from the base of the Remote Peer's Tagged Buffer targeted
   by the Verify Request.

Hash Value:
   The Hash Value is an optional octet string representing the
   expected result of the hash algorithm over the Remote Peer's
   Tagged Buffer.  The length of the Hash Value is variable, and
   dependent on the selected algorithm.  When provided, any mismatch
   with the calculated value causes the Responder to generate a
   Terminate message and close the connection.  The contents of the
   Terminate message are defined in Section 5.2.

3.1.2.2.  Verify Response Format

The Verify Response Message makes use of the DDP Untagged Buffer
Model.  Verify Response messages MUST use the same Queue Number as
RDMA Flush Responses (QN=3).  The RDMAP layer passes the following
payload to the DDP layer on Queue Number 3.  The RDMA Verify Response
is not sent when a Terminate message is generated, that is, when a
Hash Value was provided in the Request and a mismatch with the
calculated value occurred.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Hash Value (variable)                     |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            Verify Response

Hash Value:
   The Hash Value is an octet string representing the result of the
   hash algorithm on the Remote Peer's Tagged Buffer.  The length of
   the Hash Value is variable, and dependent on the algorithm
   selected by the upper layer consumer, among those supported by the
   RNIC.

3.1.2.3.  RDMA Verify Ordering

Ordering and completion rules for RDMA Verify Request are similar to
those for an Atomic operation as described in section 5 of RFC7306.
The DDP Queue Number field of the RDMA Verify Request MUST be 1, and
that of the RDMA Verify Response MUST be 3.

As specified in Section 4, RDMA Verify and RDMA Flush are executed by
the Data Sink in strict order.  Because the RDMA Flush MUST ensure
that all bytes are in the specified state before responding, an RDMA
Verify that follows an RDMA Flush can be assured that it is operating
on flushed data.
If unflushed data has been sent to the region segment between the
operations, and since data may be made persistent and/or globally
visible by the Data Sink at any time, the result of any such RDMA
Verify is undefined.

3.1.3.  Atomic Write

The Atomic Write operation provides a block of data which is Placed
into a specified region atomically.  As specified in Section 4, its
placement is ordered after the successful completion of any
previously requested RDMA Flush or RDMA Verify.  The specified region
is constrained to a size of 64 bits at 64-bit alignment, and the
implementation MUST fail the operation and send a terminate message
if the placement cannot be performed atomically.

The Atomic Write Operation requires the Responder to write a 64-bit
value at a 64-bit-aligned ULP Buffer address in the Responder's
memory, in a manner such that the value is Placed in the Responder's
memory atomically.

3.1.3.1.  Atomic Write Request

The Atomic Write Request Message makes use of the DDP Untagged Buffer
Model.  Atomic Write Request messages MUST use the same Queue Number
as RDMA Read Requests and RDMA Extensions Atomic Operation Requests
(QN=1).  Reusing the same queue number allows Atomic Write to reuse
the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read
Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read,
RDMA Flush and RDMA Verify Requests.

The Atomic Write Request Message carries an Atomic Write Request
payload that describes the ULP Buffer address in the Responder's
memory, as well as the data to be written.  The following figure
depicts the Atomic Write Request that is used for all Atomic Write
Request Messages:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data Sink STag                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Data Sink Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Data Sink Tagged Offset                    |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Data                              |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Atomic Write Request

Data Sink STag: 32 bits
   The Data Sink STag identifies the Remote Peer's Tagged Buffer
   targeted by the Atomic Write Request.  The Data Sink STag is
   associated with the RDMAP Stream through a mechanism that is
   outside the scope of the RDMAP specification.

Data Sink Length: 32 bits
   The Data Sink Length is the length, in octets, of the data to be
   placed, and MUST be 8.

Data Sink Tagged Offset: 64 bits
   The Data Sink Tagged Offset specifies the starting offset, in
   octets, from the base of the Remote Peer's Tagged Buffer targeted
   by the Atomic Write Request.  This offset can be any value, but
   the destination ULP Buffer address MUST be aligned as specified
   above.  Ensuring that the STag and Data Sink Tagged Offset values
   appropriately meet such a requirement is an upper layer consumer
   responsibility, and is out of scope for this specification.

Data: 64 bits
   The 64-bit data value to be written, in big-endian format.

Atomic Write Operations MUST target ULP Buffer addresses that are
64-bit aligned, and conform to any other platform restrictions on the
Responder system.  The write MUST NOT be Placed until all prior RDMA
Flush operations, and therefore all other prior operations, have
completed successfully.

If an Atomic Write Operation is attempted on a target ULP Buffer
address that is not 64-bit aligned, or due to alignment, size, or
other platform restrictions cannot be performed atomically:

   The operation MUST NOT be performed

   The Responder's memory MUST NOT be modified

   A terminate message MUST be generated.  (See Section 5.2 for the
   contents of the terminate message.)
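
The following sketch is purely illustrative and is not part of this
specification.  It outlines, in C, one possible Responder-side
handling of an Atomic Write; the helper functions (resolve_region,
send_terminate, send_atomic_write_response) are hypothetical, and a
C11 atomic store stands in for whatever platform mechanism the
Responder uses to guarantee a single, atomic 64-bit Placement.

   /* Illustrative only: Responder-side Atomic Write processing.
    * Assumes ordering behind prior RDMA Flush operations has already
    * been enforced by the Responder's request processing. */
   #include <stdint.h>
   #include <stdatomic.h>

   extern void *resolve_region(uint32_t stag, uint64_t to,
                               uint32_t len);
   extern int send_terminate(int error_code);
   extern int send_atomic_write_response(void);

   int atomic_write_responder(uint32_t stag, uint64_t to,
                              uint32_t length, uint64_t value)
   {
       void *dst;

       if (length != 8)            /* Data Sink Length MUST be 8 */
           return send_terminate(-1);

       dst = resolve_region(stag, to, length);
       if (dst == NULL || ((uintptr_t)dst & 0x7) != 0) {
           /* Unmapped, or not 64-bit aligned: the operation is not
            * performed, Responder memory is not modified, and a
            * Terminate message is generated. */
           return send_terminate(-1);
       }

       /* Place the 64-bit value (byte order as required by the ULP)
        * in a single atomic store. */
       atomic_store_explicit((_Atomic uint64_t *)dst, value,
                             memory_order_release);

       return send_atomic_write_response();
   }
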
3.1.3.2.  Atomic Write Response

The Atomic Write Response Message makes use of the DDP Untagged
Buffer Model.  Atomic Write Response messages MUST use the same Queue
Number as RDMA Flush Responses (QN=3).  The RDMAP layer passes no
payload to the DDP layer on Queue Number 3.

3.1.4.  Discovery of RDMAP Extensions

As with RFC7306, explicit negotiation by the RDMAP peers of the
extensions covered by this document is not required.  Instead, it is
RECOMMENDED that RDMA applications and/or ULPs negotiate any use of
these extensions at the application or ULP level.  The definition of
such application-specific mechanisms is outside the scope of this
specification.  For backward compatibility, existing applications
and/or ULPs should not assume that these extensions are supported.

In the absence of application-specific negotiation of the features
defined within this specification, the new operations can be
attempted, and reported errors can be used to determine a remote
peer's capabilities.  In the case of RDMA Flush and Atomic Write, an
operation to a previously Advertised buffer with remote write
permission can be used to determine the peer's support.  A Remote
Operation Error or Unexpected OpCode error will be reported if the
Operation is not supported by the remote peer.  For RDMA Verify, such
an operation may target a buffer with remote read permission.

3.2.  Local Extensions

This section discusses memory registration, new memory and protection
attributes, and their applicability to both remote operations and
"local" operations (receives).  Because this section does not specify
any wire-visible semantics, it is entirely informative.

3.2.1.  Registration Semantics

New platform-specific attributes to RDMA registration allow them to
be processed at the server *only*, without client knowledge or
protocol exposure.  Requiring no client knowledge is a robust design
choice which helps to ensure future interoperability.

New local PMEM memory registration example (an illustrative sketch
follows at the end of this section):

   Register(region[], MemPerm, MemType, MemMode) -> STag

   Region describes the memory segment[s] to be registered by the
   returned STag.  The local RNIC may limit the size and number of
   these segments.

   MemPerm indicates permitted operations in addition to remote read
   and remote write: "remote flush to persistence", "remote flush to
   global visibility", selectivity, etc.

   MemType indicates the type of storage described by the Region,
   e.g. plain RAM, "flush required" (flushable), PCIe-resident via
   peer-to-peer, or any other local platform-specific processing.

   MemMode indicates the disposition of data read and/or written,
   e.g. cacheable after the operation (indicating whether the data
   is needed by the CPU on the data sink, to allow or avoid
   writethrough as an optimization).

   None of the above attributes are relevant to, or exposed by, the
   protocol.

   The STag is processed in the receiving RNIC during RDMA operations
   to the specified region, under control of the original Perm, Type
   and Mode.
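
The following sketch is purely illustrative and is not part of this
specification.  It renders the Register() example above as a
hypothetical C-style local interface; all of the names are invented
here, none of them appear on the wire, and an actual RNIC interface
would define its own equivalents.

   /* Illustrative only: a local, verbs-style registration interface
    * carrying the platform-specific attributes described above. */
   #include <stdint.h>
   #include <stddef.h>

   enum mem_perm {                 /* permitted remote operations   */
       MEM_PERM_REMOTE_READ              = 1 << 0,
       MEM_PERM_REMOTE_WRITE             = 1 << 1,
       MEM_PERM_REMOTE_FLUSH_PERSISTENCE = 1 << 2,
       MEM_PERM_REMOTE_FLUSH_GLOBAL_VIS  = 1 << 3
   };

   enum mem_type {                 /* storage behind the region     */
       MEM_TYPE_PLAIN_RAM,
       MEM_TYPE_FLUSH_REQUIRED,    /* e.g. persistent memory        */
       MEM_TYPE_PCIE_PEER_TO_PEER
   };

   enum mem_mode {                 /* local disposition of the data */
       MEM_MODE_CACHEABLE,
       MEM_MODE_WRITETHROUGH
   };

   struct mem_segment {
       void   *addr;
       size_t  length;
   };

   /* Returns an STag.  The attributes are retained in local RNIC
    * state (e.g. its TPT) and consulted when remote operations
    * arrive; they are never exposed to the peer. */
   uint32_t rdma_register_pmem(const struct mem_segment *segments,
                               int nsegments, unsigned int perm,
                               enum mem_type type, enum mem_mode mode);
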
3.2.2.  Completion Semantics

To be discussed: the interactions with the new operations when the
upper layer provides Completions to the responder (e.g. messages via
receive, or immediate data via RDMA Write).  This is a natural
conclusion of the ordering rules, but is made explicit here.

Ordering of operations is critical: such RDMA Writes cannot be
allowed to "pass" persistence or global visibility, and RDMA Flush
may not begin until prior RDMA Writes to the flushed region are
accounted for.  Therefore, ULP protocol implications may also exist.

3.2.3.  Platform Semantics

To be discussed: writethrough behavior on persistent regions, and the
reasons for it.  Consider recommending a local writethrough behavior
on any persistent region, to support a nonblocking "hurry-up" that
avoids future stalls on a subsequent cache flush, prior to a flush.
It would also enhance storage integrity.  Selection of this behavior
would be driven from memory registration, so that the RNIC may "look
up" the desired behavior in its TPT.

A PCI extension to support Flush would allow the RNIC to provide
persistence and/or global visibility directly and efficiently to
memory, the CPU, the PCI Root, a PM device, a PCIe device, etc.  This
avoids CPU interaction and supports a strong data consistency model,
performing the equivalent of CLFLUSHOPT over a region list, or some
other flow tag.  Alternatively, if the RNIC participates in the
platform consistency domain on the memory bus or within the CPU
complex, other possibilities exist.

Also consider additional "integrity check" behavior (hash algorithm)
specified per-region.  If so, providing it as a registration
parameter enables fine-grained control, and allows storing it in
per-region RNIC state, making its processing optional and
straightforward.

A similar approach is applicable to providing a security key for
encrypting/decrypting access on a per-region basis, without protocol
exposure.  [SDC2017 presentation]

Any other per-region processing is to be explored.

4.  Ordering and Completions Table

The table in this section specifies the ordering relationships for
the operations in this specification and in those it extends, from
the standpoint of the Requester.  Note that in the table, Send
Operation includes Send, Send with Invalidate, Send with Solicited
Event, and Send with Solicited Event and Invalidate.  Also note that
Immediate Operation includes Immediate Data and Immediate Data with
Solicited Event.

Note: N/A in the table below means Not Applicable

----------+------------+-------------+-------------+-----------------
First     | Second     | Placement   | Placement   | Ordering
Operation | Operation  | Guarantee at| Guarantee at| Guarantee at
          |            | Remote Peer | Local Peer  | Remote Peer
----------+------------+-------------+-------------+-----------------
RDMA Flush| TODO       | No Placement| N/A         | Completed in
          |            | Guarantee   |             | Order
          |            | between Foo |             |
          |            | and Bar     |             |
----------+------------+-------------+-------------+-----------------
TODO      | RDMA Flush | Placement   | N/A         | TODO
          |            | Guarantee   |             |
          |            | between Foo |             |
          |            | and Bar     |             |
----------+------------+-------------+-------------+-----------------
TODO      | TODO       | Etc         | Etc         | Etc
----------+------------+-------------+-------------+-----------------

                        Ordering of Operations

5.  Error Processing

In addition to the error processing described in section 7 of RFC5040
and section 8 of RFC7306, the following rules apply to the new RDMA
Messages defined in this specification.

5.1.  Errors Detected at the Local Peer

The Local Peer MUST send a Terminate Message for each of the
following cases:

1. For errors detected while creating an RDMA Flush, RDMA Verify or
   Atomic Write Request, or other reasons not directly associated
   with an incoming Message, the Terminate Message and Error code are
   sent instead of the Message.  In this case, the Error Type and
   Error Code fields are included in the Terminate Message, but the
   Terminated DDP Header and Terminated RDMA Header fields are set to
   zero.

2. For errors detected on an incoming RDMA Flush, RDMA Verify or
   Atomic Write Request or Response, the Terminate Message is sent at
   the earliest possible opportunity, preferably in the next outgoing
   RDMA Message.  In this case, the Error Type, Error Code, and
   Terminated DDP Header fields are included in the Terminate
   Message, but the Terminated RDMA Header field is set to zero.

3. For errors detected in the processing of the RDMA Flush or RDMA
   Verify itself, that is, the act of flushing or verifying the data,
   the Terminate Message is generated as per the referenced
   specifications.  Even though data is not lost, the upper layer
   MUST be notified of the failure by informing the requester of the
   status, terminating any queued operations, and allowing the
   requester to perform further action, for instance recovery.

5.2.  Errors Detected at the Remote Peer

On incoming RDMA Flush and RDMA Verify Requests, the following MUST
be validated:

o  The DDP layer MUST validate all DDP Segment fields.

The following additional validation MUST be performed:

o  If the RDMA Flush, RDMA Verify or Atomic Write operation cannot be
   satisfied due to transient or permanent errors detected in the
   processing by the Responder, a Terminate message MUST be returned
   to the Requestor.

6.  IANA Considerations

This document requests that IANA assign the following new operation
codes in the "RDMAP Message Operation Codes" registry defined in
section 3.4 of [RFC6580].

0xC   RDMA Flush Request, this specification

0xD   RDMA Flush Response, this specification

0xE   RDMA Verify Request, this specification

0xF   RDMA Verify Response, this specification

0x10  Atomic Write Request, this specification

0x11  Atomic Write Response, this specification

Note to RFC Editor: this section may be edited and updated prior to
publication as an RFC.

7.  Security Considerations

This document specifies extensions to the RDMA Protocol specification
in RFC5040 and the RDMA Protocol Extensions in RFC7306, and as such
the Security Considerations discussed in Section 8 of RFC5040 and
Section 9 of RFC7306 apply.  In particular, all operations use ULP
Buffer addresses for the Remote Peer Buffer addressing used in
RFC5040, as required by the security model described in RDMAP
Security [RFC5042].

If the "push mode" transfer model discussed in section 2 is
implemented by upper layers, new security considerations will
potentially be introduced in those protocols, particularly on the
server (or target) if the new memory regions are not carefully
protected.  Therefore, for them to take full advantage of the
extensions defined in this document, additional security design is
required in the implementation of those upper layers.  The facilities
of RDMAP Security [RFC5042] can provide the basis for any such
design.

In addition to protection, in "push mode" the server or target will
expose memory resources to the peer for potentially extended periods,
and will allow the peer to perform remote requests which will
necessarily consume shared resources, e.g. memory bandwidth, power,
and memory itself.  It is recommended that the upper layers provide a
means to gracefully adjust such resources, for example using upper
layer callbacks, without resorting to revoking RDMA permissions,
which would summarily close connections.  Because initiator
applications rely on the protocol extension itself for managing their
required persistence and/or global visibility, the lack of such an
approach could lead to frequent recovery in low-resource situations,
potentially opening a new threat to such applications.

8.  To Be Added or Considered

This section will be deleted in a future document revision.

Complete the discussion in Section 3.2 and its subsections (Local
Extension semantics).

Complete the Ordering table in Section 4.  Carefully include
discussion of the order of "start of execution" as well as
completion, which are somewhat more involved than prior RDMA
operation ordering.

RDMA Flush "selectivity", to provide default flush semantics with a
broader scope than region-based: if specified, a flag would request
that all prior write operations on the issuing Queue Pair be flushed
with the requested disposition(s).  This flag may simplify upper
layer processing, and would allow regions larger than 4GB-1 bytes to
be flushed in a single operation.  The STag, Offset and Length would
be ignored in this case.  It is to be determined how to extend the
RDMA security model to protect other regions associated with the
Queue Pair from unintentional or unauthorized flush.

9.  Acknowledgements

The authors wish to thank Jim Pinkerton, who contributed to an
earlier version of the specification, and Brian Hausauer and Kobby
Carmona, who have provided significant review and valuable comments.

10.  References

10.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
           Garcia, "A Remote Direct Memory Access Protocol
           Specification", RFC 5040, DOI 10.17487/RFC5040,
           October 2007, <https://www.rfc-editor.org/info/rfc5040>.

[RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
           Data Placement over Reliable Transports", RFC 5041,
           DOI 10.17487/RFC5041, October 2007,
           <https://www.rfc-editor.org/info/rfc5041>.

[RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
           Protocol (DDP) / Remote Direct Memory Access Protocol
           (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042,
           October 2007, <https://www.rfc-editor.org/info/rfc5042>.

[RFC6580]  Ko, M. and D. Black, "IANA Registries for the Remote
           Direct Data Placement (RDDP) Protocols", RFC 6580,
           DOI 10.17487/RFC6580, April 2012,
           <https://www.rfc-editor.org/info/rfc6580>.

[RFC7306]  Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R.
           Sharp, "Remote Direct Memory Access (RDMA) Protocol
           Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014,
           <https://www.rfc-editor.org/info/rfc7306>.

10.2.  Informative References

[RFC5045]  Bestler, C., Ed. and L. Coene, "Applicability of Remote
           Direct Memory Access Protocol (RDMA) and Direct Data
           Placement (DDP)", RFC 5045, DOI 10.17487/RFC5045,
           October 2007, <https://www.rfc-editor.org/info/rfc5045>.

[RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
           "Network File System (NFS) Version 4 Minor Version 1
           Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
           <https://www.rfc-editor.org/info/rfc5661>.

[RFC7145]  Ko, M. and A. Nezhinsky, "Internet Small Computer System
           Interface (iSCSI) Extensions for the Remote Direct Memory
           Access (RDMA) Specification", RFC 7145,
           DOI 10.17487/RFC7145, April 2014,
           <https://www.rfc-editor.org/info/rfc7145>.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
           Memory Access Transport for Remote Procedure Call
           Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
           <https://www.rfc-editor.org/info/rfc8166>.

[RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
           to RPC-over-RDMA Version 1", RFC 8267,
           DOI 10.17487/RFC8267, October 2017,
           <https://www.rfc-editor.org/info/rfc8267>.

[SCSI]     ANSI, "SCSI Primary Commands - 3 (SPC-3) (INCITS
           408-2005)", May 2005.

[SMB3]     Microsoft Corporation, "Server Message Block (SMB)
           Protocol Versions 2 and 3 (MS-SMB2)", March 2020,
           <https://docs.microsoft.com/en-us/openspecs/
           windows_protocols/ms-smb2/5606ad47-5ee0-437a-817e-
           70c366052962>.

[SMBDirect]
           Microsoft Corporation, "SMB2 Remote Direct Memory Access
           (RDMA) Transport Protocol (MS-SMBD)", September 2018,
           <https://docs.microsoft.com/en-us/openspecs/
           windows_protocols/ms-smbd/1ca5f4ae-e5b1-493d-b87d-
           f4464325e6e3>.

[SNIANVMP] SNIA NVM Programming TWG, "SNIA NVM Programming Model
           v1.2", June 2017,
           <https://www.snia.org/sites/default/files/technical_work/
           final/NVMProgrammingModel_v1.2.pdf>.

10.3.  URIs

[1] http://www.nvmexpress.org

[2] http://www.jedec.org

Appendix A.  DDP Segment Formats for RDMA Extensions

This appendix is for information only and is NOT part of the
standard.  It simply depicts the DDP Segment format for each of the
RDMA Messages defined in this specification.

A.1.
DDP Segment for RDMA Flush Request 1597 0 1 2 3 1598 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1599 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1600 | DDP Control | RDMA Control | 1601 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1602 | Reserved (Not Used) | 1603 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1604 | DDP (Flush Request) Queue Number (1) | 1605 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1606 | DDP (Flush Request) Message Sequence Number | 1607 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1608 | Data Sink STag | 1609 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1610 | Data Sink Length | 1611 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1612 | Data Sink Tagged Offset | 1613 + + 1614 | | 1615 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1616 | Disposition Flags +G+P| 1617 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1619 RDMA Flush Request, DDP Segment 1621 A.2. DDP Segment for RDMA Flush Response 1622 0 1 2 3 1623 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1624 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1625 | DDP Control | RDMA Control | 1626 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1627 | Reserved (Not Used) | 1628 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1629 | DDP (Flush Response) Queue Number (3) | 1630 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1631 | DDP (Flush Response) Message Sequence Number | 1632 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1634 RDMA Flush Response, DDP Segment 1636 A.3. DDP Segment for RDMA Verify Request 1638 0 1 2 3 1639 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1640 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1641 | DDP Control | RDMA Control | 1642 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1643 | Reserved (Not Used) | 1644 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1645 | DDP (Verify Request) Queue Number (1) | 1646 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1647 | DDP (Verify Request) Message Sequence Number | 1648 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1649 | Data Sink STag | 1650 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1651 | Data Sink Length | 1652 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1653 | Data Sink Tagged Offset | 1654 + + 1655 | | 1656 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1657 | Hash Value (optional, variable) | 1658 | ... | 1659 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1661 RDMA Verify Request, DDP Segment 1663 A.4. DDP Segment for RDMA Verify Response 1664 0 1 2 3 1665 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1666 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1667 | DDP Control | RDMA Control | 1668 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1669 | Reserved (Not Used) | 1670 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1671 | DDP (Verify Response) Queue Number (3) | 1672 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1673 | DDP (Verify Response) Message Sequence Number | 1674 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1675 | Hash Value (variable) | 1676 | ... 
| 1677 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1679 RDMA Verify Response, DDP Segment 1681 A.5. DDP Segment for Atomic Write Request 1683 0 1 2 3 1684 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1685 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1686 | DDP Control | RDMA Control | 1687 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1688 | Reserved (Not Used) | 1689 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1690 | DDP (Atomic Write Request) Queue Number (1) | 1691 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1692 | DDP (Atomic Write Request) Message Sequence Number | 1693 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1694 | Data Sink STag | 1695 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1696 | Data Sink Length (value=8) | 1697 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1698 | Data Sink Tagged Offset | 1699 + + 1700 | | 1701 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1702 | Data (64 bits) | 1703 + + 1704 | | 1705 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1707 Atomic Write Request, DDP Segment 1709 A.6. DDP Segment for Atomic Write Response 1711 0 1 2 3 1712 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1713 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1714 | DDP Control | RDMA Control | 1715 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1716 | Reserved (Not Used) | 1717 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1718 | DDP (Atomic Write Response) Queue Number (3) | 1719 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1720 | DDP (Atomic Write Response) Message Sequence Number | 1721 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1723 Atomic Write Response, DDP Segment 1725 Authors' Addresses 1727 Tom Talpey 1728 Microsoft 1729 One Microsoft Way 1730 Redmond, WA 98052 1731 US 1733 Email: ttalpey@microsoft.com 1735 Tony Hurson 1736 Intel 1737 Austin, TX 1738 US 1740 Email: tony.hurson@intel.com 1742 Gaurav Agarwal 1743 Marvell 1744 CA 1745 US 1747 Email: gagarwal@marvell.com 1748 Tom Reu 1749 Chelsio 1750 NJ 1751 US 1753 Email: tomreu@chelsio.com