| < draft-talpey-rdma-commit-00.txt | draft-talpey-rdma-commit-01.txt > | |||
|---|---|---|---|---|
| Internet-Draft T. Talpey | NFSv4 (provisionally) T. Talpey | |||
| Internet-Draft J. Pinkerton | Internet-Draft Microsoft | |||
| Updates: 5040, 7306 (if approved) Microsoft | Updates: 5040, 7306 (if approved) T. Hurson | |||
| Intended status: Standards Track | Intended status: Standards Track Intel | |||
| Expires: August 22, 2016 February 19, 2016 | Expires: September 10, 2020 G. Agarwal | |||
| Marvell | ||||
| T. Reu | ||||
| Chelsio | ||||
| March 9, 2020 | ||||
| RDMA Durable Write Commit | RDMA Extensions for Enhanced Memory Placement | |||
| draft-talpey-rdma-commit-00 | draft-talpey-rdma-commit-01 | |||
| Abstract | Abstract | |||
| This document specifies extensions to RDMA protocols to provide | This document specifies extensions to RDMA (Remote Direct Memory | |||
| capabilities in support of enhanced remotely-directed data | Access) protocols to provide capabilities in support of enhanced | |||
| consistency. The extensions include a new operation supporting | remotely-directed data placement on persistent memory-addressable | |||
| remote commitment to durability of remotely-managed buffers, which | devices. The extensions include new operations supporting remote | |||
| can provide enhanced guarantees and improve performance for low- | commitment to persistence of remotely-managed buffers, which can | |||
| latency storage applications. In addition to, and in support of | provide enhanced guarantees and improve performance for low-latency | |||
| these, extensions to local behaviors are described, which may be used | storage applications. In addition to, and in support of these, | |||
| to guide implementation, and to ease adoption. This document would | extensions to local behaviors are described, which may be used to | |||
| extend the IETF Remote Direct Memory Access Protocol (RDMAP), | guide implementation, and to ease adoption. This document updates | |||
| RFC5040, and RDMA Protocol Extensions, RFC7306. | RFC5040 (Remote Direct Memory Access Protocol (RDMAP)) and updates | |||
| RFC7306 (RDMA Protocol Extensions). | ||||
| Requirements Language | Requirements Language | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [RFC2119]. | document are to be interpreted as described in RFC 2119 [RFC2119]. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on August 22, 2016. | This Internet-Draft will expire on September 10, 2020. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2020 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. | |||
| include Simplified BSD License text as described in Section 4.e of | ||||
| the Trust Legal Provisions and are provided without warranty as | ||||
| described in the Simplified BSD License. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.1. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 | 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 2.1. Requirements . . . . . . . . . . . . . . . . . . . . . . 7 | 2.1. Requirements for RDMA Flush . . . . . . . . . . . . . . . 10 | |||
| 2.1.1. Non-Requirements . . . . . . . . . . . . . . . . . . 9 | 2.1.1. Non-Requirements . . . . . . . . . . . . . . . . . . 12 | |||
| 2.2. Additional Semantics . . . . . . . . . . . . . . . . . . 10 | 2.2. Requirements for Atomic Write . . . . . . . . . . . . . . 14 | |||
| 3. Proposed Extensions . . . . . . . . . . . . . . . . . . . . . 11 | 2.3. Requirements for RDMA Verify . . . . . . . . . . . . . . 15 | |||
| 3.1. Local Extensions . . . . . . . . . . . . . . . . . . . . 11 | 2.4. Local Semantics . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.1.1. Registration Semantics . . . . . . . . . . . . . . . 11 | 3. RDMA Protocol Extensions . . . . . . . . . . . . . . . . . . 17 | |||
| 3.1.2. Completion Semantics . . . . . . . . . . . . . . . . 12 | 3.1. RDMAP Extensions . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.1.3. Platform Semantics . . . . . . . . . . . . . . . . . 12 | 3.1.1. RDMA Flush . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 3.2. RDMAP Extensions . . . . . . . . . . . . . . . . . . . . 12 | 3.1.2. RDMA Verify . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 3.2.1. RDMA Commit Request Header Format . . . . . . . . . . 15 | 3.1.3. Atomic Write . . . . . . . . . . . . . . . . . . . . 25 | |||
| 3.2.2. RDMA Commit Response Header Format . . . . . . . . . 16 | 3.1.4. Discovery of RDMAP Extensions . . . . . . . . . . . . 27 | |||
| 3.2.3. Ordering . . . . . . . . . . . . . . . . . . . . . . 16 | 3.2. Local Extensions . . . . . . . . . . . . . . . . . . . . 28 | |||
| 3.2.4. Atomicity . . . . . . . . . . . . . . . . . . . . . . 17 | 3.2.1. Registration Semantics . . . . . . . . . . . . . . . 28 | |||
| 3.2.5. Discovery of RDMAP Extensions . . . . . . . . . . . . 17 | 3.2.2. Completion Semantics . . . . . . . . . . . . . . . . 28 | |||
| 4. Ordering and Completions Table . . . . . . . . . . . . . . . 18 | 3.2.3. Platform Semantics . . . . . . . . . . . . . . . . . 29 | |||
| 5. Error Processing . . . . . . . . . . . . . . . . . . . . . . 18 | 4. Ordering and Completions Table . . . . . . . . . . . . . . . 29 | |||
| 5.1. Errors Detected at the Local Peer . . . . . . . . . . . . 18 | 5. Error Processing . . . . . . . . . . . . . . . . . . . . . . 30 | |||
| 5.2. Errors Detected at the Remote Peer . . . . . . . . . . . 19 | 5.1. Errors Detected at the Local Peer . . . . . . . . . . . . 30 | |||
| 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 | 5.2. Errors Detected at the Remote Peer . . . . . . . . . . . 31 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 20 | 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 | |||
| 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 31 | |||
| 8.1. Normative References . . . . . . . . . . . . . . . . . . 20 | 8. To Be Added or Considered . . . . . . . . . . . . . . . . . . 32 | |||
| 8.2. Informative References . . . . . . . . . . . . . . . . . 21 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 8.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| Appendix A. DDP Segment Formats for RDMA Extensions . . . . . . 22 | 10.1. Normative References . . . . . . . . . . . . . . . . . . 33 | |||
| A.1. DDP Segment for RDMA Commit Request . . . . . . . . . . . 22 | 10.2. Informative References . . . . . . . . . . . . . . . . . 33 | |||
| A.2. DDP Segment for RDMA Commit Response . . . . . . . . . . 23 | 10.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 35 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 | Appendix A. DDP Segment Formats for RDMA Extensions . . . . . . 35 | |||
| A.1. DDP Segment for RDMA Flush Request . . . . . . . . . . . 35 | ||||
| A.2. DDP Segment for RDMA Flush Response . . . . . . . . . . . 35 | ||||
| A.3. DDP Segment for RDMA Verify Request . . . . . . . . . . . 36 | ||||
| A.4. DDP Segment for RDMA Verify Response . . . . . . . . . . 36 | ||||
| A.5. DDP Segment for Atomic Write Request . . . . . . . . . . 37 | ||||
| A.6. DDP Segment for Atomic Write Response . . . . . . . . . . 38 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 38 | ||||
| 1. Introduction | 1. Introduction | |||
| The RDMA Protocol (RDMAP) [RFC5040] and RDMA Protocol Extensions | The RDMA Protocol (RDMAP) [RFC5040] and RDMA Protocol Extensions | |||
| (RDMAPEXT) [RFC7306] provide capabilities for secure, zero copy data | (RDMAPEXT) [RFC7306] provide capabilities for secure, zero copy data | |||
| communications that preserve memory protection semantics, enabling | communications that preserve memory protection semantics, enabling | |||
| more efficient network protocol implementations. The RDMA Protocol | more efficient network protocol implementations. The RDMA Protocol | |||
| is part of the iWARP family of specifications which also include the | is part of the iWARP family of specifications which also include the | |||
| Direct Data Placement Protocol (DDP) [RFC5041], and others as | Direct Data Placement Protocol (DDP) [RFC5041], and others as | |||
| described in the relevant documents. For additional background on | described in the relevant documents. For additional background on | |||
| RDMA Protocol applicability, see "Applicability of Remote Direct | RDMA Protocol applicability, see "Applicability of Remote Direct | |||
| Memory Access Protocol (RDMA) and Direct Data Placement Protocol | Memory Access Protocol (RDMA) and Direct Data Placement Protocol | |||
| (DDP)" RFC5045 [RFC5045]. | (DDP)" RFC5045 [RFC5045]. | |||
| RDMA protocols are enjoying good success in improving the performance | RDMA protocols are enjoying good success in improving the performance | |||
| of remote storage access, and have been well-suited to semantics and | of remote storage access, and have been well-suited to semantics and | |||
| latencies of existing storage solutions. However, new storage | latencies of existing storage solutions. However, new storage | |||
| solutions are emerging with much lower latencies, driving new | solutions are emerging with much lower latencies, driving new | |||
| workloads and new performance requirements. Also, storage | workloads and new performance requirements. Also, storage | |||
| programming paradigms SNIANVM [SNIANVM] are driving new requirements | programming paradigms SNIANVMP [SNIANVMP] are driving new | |||
| of the remote storage layers, in addition to driving down latency | requirements of the remote storage layers, in addition to driving | |||
| tolerances. Overcoming these latencies, and providing the means to | down latency tolerances. Overcoming these latencies, and providing | |||
| achieve durability without invoking upper layers and remote CPUs for | the means to achieve persistence and/or visibility without invoking | |||
| each such request, are the motivators for the extensions proposed by | upper layers and remote CPUs for each such request, are the | |||
| this document. | motivators for the extensions in this document. | |||
| This document specifies the following extensions to the RDMA Protocol | This document specifies the following extensions to the RDMA Protocol | |||
| (RDMAP) and its local memory ecosystem: | (RDMAP) and its local memory ecosystem: | |||
| o RDMA Commit - support for RDMA requests and responses with | o Flush - support for RDMA requests and responses with enhanced | |||
| enhanced placement semantics. | placement semantics. | |||
| o Enhanced memory registration semantics in support of durability. | o Atomic Write - support for writing certain data elements into | |||
| memory in an atomically visible fashion. | ||||
| o Verify - support for validating the contents of remote memory, | ||||
| through use of integrity signatures. | ||||
| o Enhanced memory registration semantics in support of persistence | ||||
| and visibility. | ||||
| The extensions defined in this document do not require the RDMAP | The extensions defined in this document do not require the RDMAP | |||
| version to change. | version to change. | |||
| 1.1. Glossary | 1.1. Glossary | |||
| This document is an extension of RFC 5040 and RFC 7306, and key words | This document is an extension of RFC 5040 and RFC 7306, and key words |
| are additionally defined in the glossaries of the referenced | are additionally defined in the glossaries of the referenced | |||
| documents. | documents. | |||
| The following additional terms are defined in this document. | The following additional terms are used in this document as defined below. | |||
| Commit: The placement of data into storage referenced by a target | Flush: The submitting of previously written data from volatile | |||
| Tagged Buffer in a durable fashion. | intermediate locations for subsequent placement, in a persistent | |||
| and/or globally visible fashion. | ||||
| Durability: The property that data is present and remains stable | Invalidate: The removal of data from volatile intermediate | |||
| after recovery from a power failure or other fatal error in an | locations. | |||
| upper layer or hardware. <https://en.wikipedia.org/wiki/ | ||||
| Commit: Obsolescent previous synonym for Flush. Term to be deleted. | ||||
| Persistent: The property that data is present, readable and remains | ||||
| stable after recovery from a power failure or other fatal error in | ||||
| an upper layer or hardware. <https://en.wikipedia.org/wiki/ | ||||
| Durability_(database_systems)>, <https://en.wikipedia.org/wiki/ | Durability_(database_systems)>, <https://en.wikipedia.org/wiki/ | |||
| Disk_buffer#Cache_control_from_the_host> [SCSI]. | Disk_buffer#Cache_control_from_the_host>, [SCSI]. | |||
| Globally Visible: The property of data being available for reading | ||||
| consistently by all processing elements on a system. Global | ||||
| visibility and persistence are not necessarily causally related; | ||||
| either one may precede the other, or they may take effect | ||||
| simultaneously, depending on the architecture of the platform. | ||||
| 2. Problem Statement | 2. Problem Statement | |||
| RDMA is widely deployed in support of storage and shared memory over | RDMA is widely deployed in support of storage and shared memory over | |||
| increasingly low-latency and high-bandwidth networks. The state of | increasingly low-latency and high-bandwidth networks. The state of | |||
| the art today yields end-to-end network latencies on the order of one | the art today yields end-to-end network latencies on the order of one | |||
| to two microseconds for message transfer, and bandwidths exceeding 40 | to two microseconds for message transfer, and bandwidths exceeding | |||
| gigabit/s. These bandwidths are expected to increase over time, with | 100 gigabit/s. These bandwidths are expected to increase over time, | |||
| latencies decreasing as a direct result. | with latencies decreasing as a direct result. | |||
| In storage, another trend is emerging - greatly reduced latency of | In storage, another trend is emerging - greatly reduced latency of | |||
| persistently storing data blocks. While best-of-class Hard Disk | persistently storing data blocks. While best-of-class Hard Disk | |||
| Drives (HDDs) have delivered latencies of several milliseconds for | Drives (HDDs) have delivered average latencies of several | |||
| many years, Solid State Disks (SSDs) have improved this by one to two | milliseconds for many years, Solid State Disks (SSDs) have improved | |||
| orders of magnitude. Technologies such as NVM Express NVMe [1] yield | this by one to two orders of magnitude. Technologies such as NVM | |||
| even higher-performing results by eliminating the traditional storage | Express NVMe [1] yield even higher-performing results by eliminating | |||
| interconnect. The latest technologies providing memory-based | the traditional storage interconnect. The latest technologies | |||
| persistence, such as Nonvolatile Memory DIMM NVDIMM [2], place | providing memory-based persistence, such as Nonvolatile Memory DIMM |
| storage-like semantics directly on the memory bus, reducing latency | NVDIMM [2], place storage-like semantics directly on the memory bus, |
| to less than a microsecond and bandwidth to potentially many tens of | reducing latency to less than a microsecond and increasing bandwidth | |||
| gigabyte/s. [supporting data to be added] | to potentially many tens of gigabyte/s. [supporting data to be added] | |||
| RDMA protocols, in turn, are used for many storage protocols, | RDMA protocols, in turn, are used for many storage protocols, | |||
| including NFS/RDMA RFC5661 [RFC5661] RFC5666 [RFC5666] RFC5667 | including NFS/RDMA RFC5661 [RFC5661] RFC8166 [RFC8166] RFC8267 | |||
| [RFC5667], SMB Direct MS-SMB2 [SMB3] MS-SMBD [SMBDirect] and iSER | [RFC8267], SMB Direct MS-SMB2 [SMB3] MS-SMBD [SMBDirect] and iSER | |||
| RFC7145 [RFC7145], to name just a few. These protocols allow storage | RFC7145 [RFC7145], to name just a few. These protocols allow storage | |||
| and computing peers to take full advantage of these highly performant | and computing peers to take full advantage of these highly performant | |||
| networks and storage technologies to achieve remarkable throughput, | networks and storage technologies to achieve remarkable throughput, | |||
| while minimizing the CPU overhead needed to drive their workloads. | while minimizing the CPU overhead needed to drive their workloads. | |||
| This leaves more computing resources available for the applications, | This leaves more computing resources available for the applications, | |||
| which in turn can scale to even greater levels. Within the context | which in turn can scale to even greater levels. Within the context | |||
| of Cloud-based environments, and through scale-out approaches, this | of Cloud-based environments, and through scale-out approaches, this | |||
| can directly reduce the number of servers that need to be deployed, | can directly reduce the number of servers that need to be deployed, | |||
| making such attributes compelling. | making such attributes highly compelling. | |||
| However, limiting factors come into play when deploying ultra-low | However, limiting factors come into play when deploying ultra-low | |||
| latency storage in such environments: | latency storage in such environments: | |||
| o The latency of the fabric, and of the necessary RDMA message | o The latency of the fabric, and of the necessary RDMA message | |||
| exchanges to ensure reliable transfer is now higher than that of | exchanges to ensure reliable transfer is now higher than that of | |||
| the storage itself. | the storage itself. | |||
| o The requirement that storage be resilient to failure requires that | o The requirement that storage be resilient to failure requires that | |||
| multiple copies be committed in multiple locations across the | multiple copies be committed in multiple locations across the | |||
| fabric, adding extra hops which increase the latency and computing | fabric, adding extra hops which increase the latency and computing | |||
| demand placed on implementing the resiliency. | demand placed on implementing the resiliency. | |||
| o Processing is required at the receiver in order to ensure that the | o Processing is required at the receiver in order to ensure that the | |||
| storage data has reached a persistent state, and acknowledge the | storage data has reached a persistent state, and acknowledge the | |||
| transfer so that the sender can proceed. | transfer so that the sender can proceed. | |||
| o Typical latency optimizations, such as polling a receive memory | o Typical latency optimizations, such as polling a receive memory | |||
| location for a key that determines when the data arrives, can | location for a key that determines when the data arrives, can | |||
| create both correctness and security issues because the buffer may | create both correctness and security issues because this approach | |||
| not remain stable after the application determines that the IO has | requires the memory remain open to writes and therefore the buffer | |||
| completed. This is of particular concern in security conscious | may not remain stable after the application determines that the IO | |||
| environments. | has completed. This is of particular concern in security | |||
| conscious environments. | ||||
| The first issue is fundamental, and due to the nature of serial, | The first issue is fundamental, and due to the nature of serial, | |||
| shared communication channels, presents challenges that are not | shared communication channels, presents challenges that are not | |||
| easily bypassed. Therefore, an RDMA solution which reduces the | easily bypassed. Communication cannot exceed the speed of light, for | |||
| exchanges which encounter such latencies is highly desirable. | example, and serialization/deserialization plus packet processing | |||
| adds further delay. Therefore, an RDMA solution which offloads and | ||||
| reduces the overhead of exchanges which encounter such latencies is | ||||
| highly desirable. | ||||
| The second issue requires that outbound transfers be made as | The second issue requires that outbound transfers be made as | |||
| efficient as possible, so that replication of data can be done with | efficient as possible, so that replication of data can be done with | |||
| minimal overhead and delay (latency). A reliable "push" RDMA | minimal overhead and delay (latency). A reliable "push" RDMA | |||
| transfer method is highly suited to this. | transfer method is highly suited to this. | |||
| The third issue requires that the transfer be performed without an | The third issue requires that the transfer be performed without an | |||
| upper-layer exchange required. Within security constraints, RDMA | upper-layer exchange required. Within security constraints, RDMA |
| transfers arbitrated only by lower layers into well-defined and pre- | transfers, arbitrated only by lower layers into well-defined and pre- | |||
| advertised buffers present an ideal solution. | advertised buffers, present an ideal solution. | |||
| The fourth issue requires significant CPU activity, consuming power | The fourth issue requires significant CPU activity, consuming power | |||
| and valuable resources, and additionally is not guaranteed by the | and valuable resources, and may not be guaranteed by the RDMA | |||
| RDMA protocols, which make no guarantee of the order in which | protocols, which make no requirement of the order in which certain | |||
| received data is placed or becomes visible; such guarantees are made | received data is placed or becomes visible; such guarantees are made | |||
| only after signaling a completion to upper layers. | only after signaling a completion to upper layers. | |||
| The RDMAP and DDP protocols, together, provide data transfer | The RDMAP and DDP protocols, together, provide data transfer | |||
| semantics with certain consistency guarantees to both the sender and | semantics with certain consistency guarantees to both the sender and | |||
| receiver. Delivery of data transferred by these protocols is said to | receiver. Delivery of data transferred by these protocols is said to | |||
| have been Placed in destination buffers upon Completion of specific | have been Placed in destination buffers upon Completion of specific | |||
| operations. In general, these guarantees are limited to the | operations. In general, these guarantees are limited to the | |||
| visibility of the transferred data within the hardware domain of the | visibility of the transferred data within the hardware domain of the | |||
| receiver (data sink). Significantly, the guarantees do not | receiver (data sink). Significantly, the guarantees do not | |||
| necessarily extend to the actual storage of the data in memory cells, | necessarily extend to the actual storage of the data in memory cells, | |||
| nor do they convey any guarantee of durability, that is, that the | nor do they convey any guarantee that the data integrity is intact, | |||
| data may not be present after a catastrophic failure such as power | nor that it remains present after a catastrophic failure. These | |||
| loss. These guarantees may be provided by upper layers, such as the | guarantees may be provided by upper layers, such as the ones | |||
| ones mentioned. | mentioned, after processing the Completions, and performing the | |||
| necessary operations. | ||||
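The gap described above, between data being Placed (visible in the receiver's hardware domain) and data being persistent, can be illustrated with a toy model. This is purely an illustration under assumed names (`Device`, `place`, `flush` are hypothetical, not an API defined by RFC 5040, RFC 7306, or this draft):

```python
# Toy model of the placement-vs-persistence gap: placed data is readable
# immediately, but only survives a power failure after an explicit flush.
# All names here are illustrative assumptions, not a real RDMA API.

class Device:
    def __init__(self):
        self.cache = {}   # volatile intermediate location
        self.media = {}   # persistent memory cells

    def place(self, addr, data):
        # Analogous to RDMA Write: data is Placed in the hardware domain.
        self.cache[addr] = data

    def read(self, addr):
        # Placed data is visible to readers immediately.
        return self.cache.get(addr, self.media.get(addr))

    def flush(self):
        # Explicit commit: only now do the memory cells hold the data.
        self.media.update(self.cache)

    def power_failure(self):
        # Volatile contents are lost; only flushed data survives.
        self.cache = {}

d = Device()
d.place(0x10, b"hello")
assert d.read(0x10) == b"hello"   # placed and visible
d.power_failure()
assert d.read(0x10) is None       # but it was never persistent

d.place(0x10, b"hello")
d.flush()                          # explicit persistence request
d.power_failure()
assert d.read(0x10) == b"hello"   # survives the failure
```

Without an on-the-wire flush operation, the explicit commit step above can only be triggered by upper layers running on the remote CPU, which is the overhead the extensions aim to remove.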
| The NFSv4.1 and iSER protocols are, respectively, file and block | The NFSv4.1, SMB3 and iSER protocols are, respectively, file, file, |
| oriented, and have been used extensively for providing access to hard | and block oriented, and have been used extensively for providing access |
| disk and solid state flash drive media. Such devices incur certain | to hard disk and solid state flash drive media. Such devices incur | |||
| latencies in their operation, from the millisecond-order rotational | certain latencies in their operation, from the millisecond-order | |||
| and seek delays of rotating disk hardware, or the 100-microsecond- | rotational and seek delays of rotating disk hardware, or the 100- | |||
| order erase/write and translation layers of solid state flash. These | microsecond-order erase/write and translation layers of solid state | |||
| file and block protocols have benefited from the increased bandwidth, | flash. These file and block protocols have benefited from the | |||
| lower latency, and markedly lower CPU overhead of RDMA to provide | increased bandwidth, lower latency, and markedly lower CPU overhead | |||
| excellent performance for such media, approximately 30-50 | of RDMA to provide excellent performance for such media, | |||
| microseconds for 4KB writes in leading implementations. | approximately 30-50 microseconds for 4KB writes in leading | |||
| implementations. | ||||
| These protocols employ a "pull" model for write: the client, or | These protocols employ a "pull" model for write: the client, or | |||
| initiator, sends an upper layer write request which contains a | initiator, sends an upper layer write request which contains an RDMA | |||
| reference to the data to be written. The upper layer protocols | reference to the data to be written. The upper layer protocols | |||
| encode this as one or more memory regions. The server, or target, | encode this as one or more memory regions. The server, or target, | |||
| then prepares the request for local write execution, and "pulls" the | then prepares the request for local write execution, and "pulls" the | |||
| data with an RDMA Read. After processing the write, a response is | data with an RDMA Read. After processing the write, a response is | |||
| returned. There are therefore two or more roundtrips on the RDMA | returned. There are therefore two or more roundtrips on the RDMA | |||
| network in processing the request. This is desirable for several | network in processing the request. This is desirable for several | |||
| reasons, as described in the relevant specifications, but it incurs | reasons, as described in the relevant specifications, but it incurs | |||
| latency. However, since as mentioned the network latency has been so | latency. However, since as mentioned the network latency has been so | |||
| much less than the storage processing, this has been a sound | much less than the storage processing, this has been a sound | |||
| approach. | approach. | |||
| skipping to change at page 6, line 46 ¶ | skipping to change at page 7, line 29 ¶ | |||
| protocols are therefore from one to two orders of magnitude larger | protocols are therefore from one to two orders of magnitude larger | |||
| than the storage media! The client/server processing model of | than the storage media! The client/server processing model of | |||
| traditional storage protocols are no longer amortizable at an | traditional storage protocols are no longer amortizable at an | |||
| acceptable level into the overall latency of storage access, due to | acceptable level into the overall latency of storage access, due to | |||
| their requiring request/response communication, CPU processing by | their requiring request/response communication, CPU processing by |
| both the server and client (or target and initiator), and the interrupts | both the server and client (or target and initiator), and the interrupts |
| to signal such requests. | to signal such requests. | |||
| Another important property of certain such devices is the requirement | Another important property of certain such devices is the requirement | |||
| for explicitly requesting that the data written to them be made | for explicitly requesting that the data written to them be made | |||
| durable. Because durability requires that data be committed to | persistent. Because persistence requires that data be committed to | |||
| memory cells, it is a relatively expensive operation in time (and | memory cells, it is a relatively expensive operation in time (and | |||
| power), and in order to maintain the highest device throughput and | power), and in order to maintain the highest device throughput and | |||
| most efficient operation, the "commit" operation is explicit. When | most efficient operation, the device "commit" operation is explicit. | |||
| the data is written by an application on the local platform, this | When the data is written by an application on the local platform, | |||
| responsibility naturally falls to that application (and the CPU on | this responsibility naturally falls to that application (and the CPU | |||
| which it runs). However, when data is written by current RDMA | on which it runs). However, when data is written by current RDMA | |||
| protocols, no such semantic is provided. As a result, upper layer | protocols, no such semantic is provided. As a result, upper layer | |||
| stacks, and the target CPU, must be invoked to perform it, adding | stacks, and the target CPU, must be invoked to perform it, adding | |||
| overhead and latency that is now highly undesirable. | overhead and latency that is now highly undesirable. | |||
| When such devices are deployed as the remote server, or target, | When such devices are deployed as the remote server, or target, | |||
| storage, and when such a durability can be requested and guaranteed | storage, and when such a persistence can be requested and guaranteed | |||
| remotely, a new transfer model can be considered. Instead of relying | remotely, a new transfer model can be considered. Instead of relying | |||
| on the server, or target, to perform requested processing and to | on the server, or target, to perform requested processing and to | |||
| reply after the data is durably stored, it becomes desirable for the | reply after the data is persistently stored, it becomes desirable for | |||
| client, or initiator, to perform these operations itself. By | the client, or initiator, to perform these operations itself. By | |||
| altering the transfer models to support a "push mode", that is, by | altering the transfer models to support a "push mode", that is, by | |||
| allowing the requestor to push data with RDMA Write and subsequently | allowing the requestor to push data with RDMA Write and subsequently | |||
| make it durable, a full round trip can be eliminated from the | make it persistent, a full round trip can be eliminated from the | |||
| operation. Additionally, the signaling, and processing overheads at | operation. Additionally, the signaling, and processing overheads at | |||
| the remote peer (server or target) can be eliminated. This becomes | the remote peer (server or target) can be eliminated. This becomes | |||
| an extremely compelling latency advantage. | an extremely compelling latency advantage. | |||
| Together the above discussion argues for a new transfer model | In DDP (RFC5041), data is considered "placed" when it is submitted by | |||
| supporting remote durability guarantees, provided by the RDMA | the RNIC to the system. This operation is commonly an i/o bus write, | |||
| transport, and used directly by upper layers on a data source, to | e.g. via PCI. The submission is ordered, but there is no | |||
| control durable storage of data on a remote data sink without | confirmation or necessary guarantee that the data has yet reached its | |||
| requiring its remote interaction. Existing, or new, upper layers can | destination, nor become visible to other devices in the system. The | |||
| use such a model in several ways, and evolutionary steps to support | data will eventually do so, but possibly at a later time. The act of | |||
| durability guarantees without required protocol changes are explored | "delivery", on the other hand, offers a stronger semantic, | |||
| in the remainder of this document. | guaranteeing that not only have prior operations been executed, but | |||
| also guaranteeing any data is in a consistent and visible state. | ||||
| Generally however, such "delivery" requires raising a completion | ||||
| event, necessarily involving the host CPU. This is a relatively | ||||
| expensive, and latency-bound operation. Some systems perform "DMA | ||||
| snooping" to provide a somewhat higher guarantee of visibility after | ||||
| delivery and without CPU intervention, but others do not. The RDMA | ||||
| requirements remain the same, therefore, upper layers may make no | ||||
| broad assumption. Such platform behaviors, in any case, do not | ||||
| address persistence. | ||||
| The extensions in this document primarily address a new "flush to | ||||
| persistence" RDMA operation. This operation, when invoked by a | ||||
| connected remote RDMA peer, can be used to request that previously- | ||||
| written data be moved into the persistent storage domain. This may | ||||
| be a simple flush to a memory cell, or it may require movement across | ||||
| one or more busses within the target platform, followed by an | ||||
| explicit persistence operation. Such matters are beyond the scope of | ||||
| this specification, which provides only the mechanism to request the | ||||
| operation, and to signal its successful completion. | ||||
| In a similar vein, many applications desire to achieve visibility of | ||||
| remotely-provided data, and to do so with minimum latency. One | ||||
| example of such applications is "network shared memory", where | ||||
| publish-subscribe access to network-accessible buffers is shared by | ||||
| multiple peers, possibly from applications on the platform hosting | ||||
| the buffers, and others via network connection. There may therefore | ||||
| be multiple local devices accessing the buffer - for example, CPUs, | ||||
| and other RNICs. The topology of the hosting platform may be | ||||
| complex, with multiple i/o, memory, and interconnect busses, | ||||
| requiring multiple intervening steps to process arriving data. | ||||
| To address this, the extension additionally provides a "flush to | ||||
| global visibility", which requires the RNIC to perform platform- | ||||
| dependent processing in order to guarantee that the contents of a | ||||
| specific range are visible for all devices that access them. On | ||||
| certain highly-consistent platforms, this may be provided natively. | ||||
| On others, it may require platform-specific processing, to flush data | ||||
| from volatile caches, invalidate stale cached data from others, and | ||||
| to empty queued pending operations. Ideally, but not universally, | ||||
| this processing will take place without CPU intervention. With a | ||||
| global visibility guarantee, network shared memory and similar | ||||
| applications will be assured of broader compatibility and lower | ||||
| latency across all hardware platforms. | ||||
| Subsequently, many applications will seek to obtain a guarantee that | ||||
| the integrity of the data has been preserved after it has been | ||||
| flushed to a persistent or globally visible state. This may be | ||||
| enforced at any time. Unlike traditional block-based storage, the | ||||
| data provided by RDMA is neither structured nor segmented, and is | ||||
| therefore not self-describing with respect to integrity. Only the | ||||
| originator of the data, or an upper layer, is in possession of that. | ||||
| Applications requiring such guarantees may include filesystem or | ||||
| database logwriters, replication agents, etc. | ||||
| To provide an additional integrity guarantee, a new operation is | ||||
| provided by the extension, which will calculate, and optionally | ||||
| compare an integrity value for an arbitrary region. The operation is | ||||
| ordered with respect to preceding and subsequent operations, allowing | ||||
| for a request pipeline without "bubbles" - roundtrip delays to | ||||
| ascertain success or failure. | ||||
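The compute-versus-compare behavior of such an integrity operation can be sketched as below. The choice of CRC32 here is purely an assumption for illustration; as the text notes, the actual integrity algorithm (or its negotiation by an upper layer) remains to be specified.

```python
# Sketch of a region integrity ("verify") operation.  The algorithm
# (CRC32) and the function name are illustrative assumptions only.
import zlib

def verify(region, offset, length, expected=None):
    """Compute an integrity value over [offset, offset + length).

    If 'expected' is given, compare and report success/failure (the
    remote-compare form); otherwise return the computed value for
    the requester to check (the compute-and-return form)."""
    value = zlib.crc32(bytes(region[offset:offset + length]))
    if expected is None:
        return value
    return value == expected

region = bytearray(b"committed log data")
digest = verify(region, 0, len(region))        # compute-and-return
assert verify(region, 0, len(region), digest)  # remote compare succeeds
region[3] ^= 0xFF                              # corrupt one byte
assert not verify(region, 0, len(region), digest)
```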
| Finally, once data has been transmitted and directly placed by RDMA, | ||||
| flushed to its final state, and its integrity verified, applications | ||||
| will seek to commit the result with a transaction semantic. The | ||||
| previous application examples apply here, logwriters and replication | ||||
| are key, and both are highly latency- and integrity-sensitive. They | ||||
| desire a pipelined transaction marker which is placed atomically to | ||||
| indicate the validity of the preceding operations. They may require | ||||
| that the data be in a persistent and/or globally visible state, | |||
| before placing this marker. | ||||
| Together the above discussion argues for a new "one sided" transfer | ||||
| model supporting extended remote placement guarantees, provided by | ||||
| the RDMA transport, and used directly by upper layers on a data | ||||
| source, to control persistent storage of data on a remote data sink | ||||
| without requiring its remote interaction. Existing, or new, upper | ||||
| layers can use such a model in several ways, and evolutionary steps | ||||
| to support persistence guarantees without required protocol changes | ||||
| are explored in the remainder of this document. | ||||
| Note that it is intended that the requirements and concept of these | Note that it is intended that the requirements and concept of these | |||
| extensions can be applied to any similar RDMA protocol, and that a | extensions can be applied to any similar RDMA protocol, and that a | |||
| compatible remote durability model can be applied broadly. | compatible model can be applied broadly. | |||
| 2.1. Requirements | 2.1. Requirements for RDMA Flush | |||
| The fundamental new requirement for extending RDMA protocols is to | The fundamental new requirement for extending RDMA protocols is to | |||
| define the property of _durability_. This new property drives the | define the property of _persistence_. This new property is to be | |||
| operations to extend Placement as defined in existing RDMA protocols. | expressed by new operations to extend Placement as defined in | |||
| When Placed, these protocols require only that the data be visible | existing RDMA protocols. The RFC5040 protocols specify that | |||
| consistently to both the platform on which the buffer resides, and to | Placement means that the data is visible consistently within a | |||
| remote peers across the network via RDMA. In modern hardware | platform-defined domain on which the buffer resides, and to remote | |||
| designs, this buffer can reside in memory, or also in cache, if that | peers across the network via RDMA to an adapter within the domain. | |||
| cache is part of the hardware consistency domain. Many designs use | In modern hardware designs, this buffer can reside in memory, or also | |||
| such caches extensively to improve performance of local access. | in cache, if that cache is part of the hardware consistency domain. | |||
| Many designs use such caches extensively to improve performance of | ||||
| local access. | ||||
| Durability, by contrast, requires that the data not only be | Persistence, by contrast, requires that the buffer contents be | |||
| consistently visible, it further requires that the buffer contents be | ||||
| preserved across catastrophic failures. While it is possible for | preserved across catastrophic failures. While it is possible for | |||
| caches to be durable, they are typically not. Efficient designs, in | caches to be persistent, they are typically not, or they provide the | |||
| fact, lead many implementations to make them volatile. In these | persistence guarantee for a limited period of time, for example, | |||
| designs, an explicit flush operation, often followed by an explicit | while backup power is applied. Efficient designs, in fact, lead most | |||
| commit, is required to provide this guarantee. | implementations to simply make them volatile. In these designs, an | |||
| explicit flush operation (writing dirty data from caches), often | ||||
| followed by an explicit commit (ensuring the data has reached its | ||||
| destination and is in a persistent state), is required to provide | ||||
| this guarantee. In some platforms, these operations may be combined. | ||||
| For the RDMA protocol to remotely provide durability guarantees, the | For the RDMA protocol to remotely provide such guarantees, an | |||
| new requirement is mandatory. Note that this does not imply support | extension is required. Note that this does not imply support for | |||
| for durability by the RDMA hardware implementation itself; it is | persistence or global visibility by the RDMA hardware implementation | |||
| entirely acceptable for the RDMA implementation to request durability | itself; it is entirely acceptable for the RDMA implementation to | |||
| from another subsystem, for example, by requesting that the CPU | request these from another subsystem, for example, by requesting that | |||
| perform the flush and commit. But, in an ideal implementation, the | the CPU perform the flush and commit, or that the destination memory | |||
| RDMA implementation will be able to act as a master and provide these | device do so. But, in an ideal implementation, the RDMA | |||
| services without further work requests. Note, it is possible that | implementation will be able to act as a master and provide these | |||
| different buffers will require different durability processing, for | services without further work requests local to the data sink. Note, | |||
| example one buffer may reside in persistent memory, while another may | it is possible that different buffers will require different | |||
| place its durable blocks in a persistent storage device. Many such | processing, for example one buffer may reside in persistent memory, | |||
| while another may place its blocks in a storage device. Many such | ||||
| memory-addressable designs are entering the market, from NVDIMM to | memory-addressable designs are entering the market, from NVDIMM to | |||
| NVMe and even to SSDs and hard drives. | NVMe and even to SSDs and hard drives. | |||
| Therefore, any local memory registration primitive will be enhanced | Therefore, additionally any local memory registration primitive will | |||
| to specify an optional durability attribute, along with any local | be enhanced to specify new optional placement attributes, along with | |||
| information required to achieve it. These attributes remain local - | any local information required to achieve them. These attributes do | |||
| like existing local memory registration, the region is fully | not explicitly traverse the network - like existing local memory | |||
| described by a { handle, offset, length } descriptor, and such | registration, the region is fully described by a { STag, Tagged | |||
| aspects of the local physical address, memory type, protection | offset, length } descriptor, and such aspects of the local physical | |||
| (remote read, remote write, protection key), etc are not instantiated | address, memory type, protection (remote read, remote write, | |||
| in the protocol. The RDMA implementation maintains these, and | protection key), etc are not instantiated in the protocol. Indeed, | |||
| strictly performs processing based on them, but they are not known to | processing based on them, but they are not exposed to the peer. Such | |||
| the peer, and therefore are not a matter for the protocol. | processing based on them, and they are not exposed to the peer. Such | |||
| considerations are discussed in the security model [RDMAP Security | ||||
| [RFC5042]]. | ||||
| Note, additionally, that by describing durability only through the | Note, additionally, that by describing such attributes only through | |||
| presence of an optional durability attribute, it is possible to | the presence of an optional property of each region, it is possible | |||
| describe regions as both durable and non-durable, in order to enable | to describe regions referring to the same physical segment as a | |||
| efficient processing. When commit is remotely requested of a non- | combination of attributes, in order to enable efficient processing. | |||
| durable region, the result is not required to be that the data is | Processing of writes to regions marked as persistent, globally | |||
| durable. This can be used by upper layers to enable bulk-type | visible, or neither ("ordinary" memory) may be optimized | |||
| processing with low overhead, by assigning specific durability | appropriately. For example, such memory can be registered multiple | |||
| through use of the Steering Tag. | times, yielding multiple different Steering Tags which nonetheless | |||
| merge data in the underlying memory. This can be used by upper | ||||
| layers to enable bulk-type processing with low overhead, by assigning | ||||
| specific attributes through use of the Steering Tag. | ||||
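The multiple-registration technique described above can be sketched as follows: one physical segment registered twice, yielding two Steering Tags that merge data in the same underlying memory but carry different placement attributes. The registry structure and names are illustrative assumptions, not protocol elements.

```python
# Illustrative sketch: one physical segment, two Steering Tags with
# different placement attributes.  Names are assumptions only.
memory = bytearray(64)             # the underlying physical segment

registrations = {}                 # STag -> (segment, attributes)
def register(stag, segment, persistent=False):
    registrations[stag] = (segment, {"persistent": persistent})

register(0x100, memory, persistent=True)   # flushes must persist
register(0x101, memory, persistent=False)  # "ordinary" bulk access

# Both STags merge data in the same underlying memory ...
seg, _ = registrations[0x101]
seg[0:5] = b"hello"
assert registrations[0x100][0][0:5] == b"hello"

# ... but only the persistent STag obliges a flush to persistence.
assert registrations[0x100][1]["persistent"]
assert not registrations[0x101][1]["persistent"]
```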
| The intention is that if the underlying region is marked as non- | When the underlying region is marked as persistent, the | |||
| volatile, the placement of data into it is also non-volatile (i.e. | placement of data into persistence is guaranteed only after a | |||
| any volatile buffering between the network and the underlying storage | successful RDMA Flush directed to the Steering Tag which holds the | |||
| has been flushed). | persistent attribute (i.e. any volatile buffering between the network | |||
| and the underlying storage has been flushed, and the appropriate | ||||
| platform- and device-specific steps have been performed). | ||||
| To enable the maximum generality, the commit operation is specified | To enable the maximum generality, the RDMA Flush operation is | |||
| to act on a list of { handle, offset, length } regions. The | specified to act on a set of bytes in a region, specified by a | |||
| requirement is that each byte of each specified region be made | standard RDMA { STag, Tagged offset, length } descriptor. It is | |||
| durable before the response to the commit is generated. However, | required that each byte of the specified segment be in the requested | |||
| depending on the implementation, other bytes in other regions may be | state before the response to the Flush is generated. However, | |||
| made durable as part of processing any commit. Any data in any | depending on the implementation, other bytes in the region, or in | |||
| buffer destined for persistent, durable storage, may become durable | other regions, may be acted upon as part of processing any RDMA | |||
| at any time, even if not requested explicitly. For example, a simple | Flush. In fact, any data in any buffer destined for persistent | |||
| and stateless approach would be for all data to be flushed and | to cache pressure, or as part of platform housekeeping activities. | |||
| committed, system-wide. A possibly more efficient implementation | explicitly. For example, the host system may flush cache entries due | |||
| might track previously written bytes, or blocks with "dirty" bytes, | might be for all data to be flushed and made persistent, system-wide. A | |||
| and commit only those. Either result provides the required | Or, a simple and stateless approach to flushing a specific range | |||
| guarantee. The length of the region list, and the maximum amount of | might be for all data be flushed and made persistent, system-wide. A | |||
| data that can be made durable in a single request, are implementation | possibly more efficient implementation might track previously written | |||
| dependent and its protocol expression is to be described. | bytes, or blocks with "dirty" bytes, and flush only those to | |||
| persistence. Either result provides the required guarantee. | ||||
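The latitude described above, that an implementation may persist more than the requested bytes, can be sketched with a dirty-byte tracker. The class and method names are illustrative assumptions; either this strategy or a stateless flush-everything approach provides the required guarantee.

```python
# Sketch of a conforming flush strategy that tracks "dirty" bytes.
# Flushing more than was requested (even everything) also conforms.

class TrackingSink:
    def __init__(self, length):
        self.volatile = bytearray(length)
        self.persistent = bytearray(length)
        self.dirty = set()         # offsets written since last flush

    def write(self, offset, data):
        self.volatile[offset:offset + len(data)] = data
        self.dirty.update(range(offset, offset + len(data)))

    def flush(self, offset, length):
        # This implementation persists all tracked dirty bytes, a
        # superset of the requested range, which is permitted.
        for i in sorted(self.dirty):
            self.persistent[i] = self.volatile[i]
        self.dirty.clear()

sink = TrackingSink(32)
sink.write(0, b"abc")
sink.write(10, b"xyz")
sink.flush(0, 3)               # request covers only "abc" ...
assert sink.persistent[0:3] == b"abc"
# ... but other dirty bytes may have become persistent as well.
assert sink.persistent[10:13] in (b"xyz", b"\x00\x00\x00")
```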
| The commit operation is specified to return a status, which may be | The RDMA Flush operation provides a response but does not return a | |||
| zero on success but may take other values to be determined. Several | status; upon failure, it instead results in an RDMA Terminate event. A | |||
| possibilities present themselves. The commit operation may fail to | region permission check is performed first, and may fail prior to any | |||
| make the data durable, perhaps due to a hardware failure, or a change | attempt to process data. The RDMA Flush operation may fail to make | |||
| in device capability (device read-only, device wear, etc). The data, | the data persistent, perhaps due to a hardware failure, or a change | |||
| however, may not have been lost and is still present in the buffer. | in device capability (device read-only, device wear, etc). The | |||
| Or, the device may support an integrity check, similar to modern | device itself may support an integrity check, similar to modern error | |||
| error checking memory or media error detection on hard drive | checking and correction (ECC) memory or media error detection on hard | |||
| surfaces, and its status is returned. Or, the request may exceed | drive surfaces, which may signal failure. Or, the request may exceed | |||
| device limits in size or even transient attributes such as temporary | device limits in size or even transient attributes such as temporary | |||
| media failure. The behavior of the device itself is beyond the scope | media failure. The behavior of the device itself is beyond the scope | |||
| of this specification. | of this specification. | |||
| Because the commit involves processing on the local platform and the | Because the RDMA Flush involves processing on the local platform and | |||
| actual device, it is expected to take a certain time to be performed. | the actual storage device, in addition to being ordered with certain | |||
| For this reason, the commit operation is required to be defined as a | other RDMA operations, it is expected to take a certain time to be | |||
| "queued" operation on the RDMA device, and therefore also the | performed. For this reason, the operation is required to be defined | |||
| protocol. The RDMA protocol supports RDMA Read and Atomic in such a | as a "queued" operation on the RDMA device, and therefore also the | |||
| fashion. The iWARP family defines a "queue number" with queue- | protocol. The RDMA protocol supports RDMA Read (RFC5040) and Atomic | |||
| specific processing that is naturally suited for this. Queuing | (RFC7306) in such a fashion. The iWARP family defines a "queue | |||
| provides a convenient means for supporting ordering among other | number" with queue-specific processing that is naturally suited for | |||
| operations, and for flow control. Flow control for RDMA Reads and | this. Queuing provides a convenient means for supporting ordering | |||
| Atomics share incoming and outgoing crediting depths ("IRD/ORD"); | among other operations, and for flow control. Flow control for RDMA | |||
| commit will either share these, or define their own separate values. | Reads and Atomics on any given Queue Pair share incoming and outgoing | |||
| crediting depths ("IRD/ORD"); operations in this specification share | ||||
| these values and do not define their own separate values. | ||||
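The shared crediting behavior described above can be sketched as follows: queued operations (RDMA Read, Atomic, and the flush operation) draw from one inbound depth and execute in order. The class and credit names are illustrative assumptions, not normative protocol state.

```python
# Sketch of queued-operation flow control: the flush operation shares
# the responder's inbound depth (IRD) with RDMA Read and Atomic
# rather than defining a separate credit pool.  Illustrative only.
from collections import deque

class QueuePair:
    def __init__(self, ird):
        self.ird = ird             # shared inbound credit depth
        self.inbound = deque()     # queued Read/Atomic/Flush requests

    def submit(self, op):
        if len(self.inbound) >= self.ird:
            raise RuntimeError("IRD exceeded: requester must flow-control")
        self.inbound.append(op)

    def complete_one(self):
        # Queued operations execute (and respond) in order.
        return self.inbound.popleft()

qp = QueuePair(ird=2)
qp.submit("RDMA Read")
qp.submit("RDMA Flush")        # Flush consumes the same credits
try:
    qp.submit("Atomic")        # a third queued op exceeds the depth
    overflow = False
except RuntimeError:
    overflow = True
assert overflow
assert qp.complete_one() == "RDMA Read"   # in-order execution
```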
| 2.1.1. Non-Requirements | 2.1.1. Non-Requirements | |||
| The protocol does not include a "RDMA Write with durability", that | The extension does not include a "RDMA Write to persistence", that | |||
| is, a modifier on the existing RDMA Write operation. While it might | is, a modifier on the existing RDMA Write operation. While it might | |||
| seem a logical approach, several issues become apparent: | seem a logical approach, several issues become apparent: | |||
| The existing RDMA Write operation is unacknowledged at the RDMA | The existing RDMA Write operation is a tagged DDP request which is | |||
| layer. Requiring it to provide an indication of remote durability | unacknowledged at the DDP layer (RFC5042). Requiring it to | |||
| would require it to have an acknowledgement, which would be an | provide an indication of remote persistence would require it to | |||
| undesirable extension to the operation. | have an acknowledgement, which would be an undesirable extension | |||
| to the existing defined operation. | ||||
| Such an operation would require flow control and therefore also | Such an operation would require flow control and therefore also | |||
| buffering on the responding peer. Existing RDMA Write semantics | buffering on the responding peer. Existing RDMA Write semantics | |||
| are not flow controlled and as tagged transfers are by design | are not flow controlled and as tagged transfers are by design | |||
| zero-copy i.e. unbuffered. Requiring these would introduce | zero-copy i.e. unbuffered. Requiring these would introduce | |||
| potential pipeline stalls and increase implementation complexity | potential pipeline stalls and increase implementation complexity | |||
| in a critical performance path. | in a critical performance path. | |||
| The operation at the initiator would stall until the | The operation at the requesting peer would stall until the | |||
| acknowledgement of completion, significantly changing the semantic | acknowledgement of completion, significantly changing the semantic | |||
| of the existing operation, and complicating software by blocking | of the existing operation, and complicating software by blocking | |||
| the send work queue. As each operation would be self-describing | the send work queue, a significant new semantic for RDMA Write | |||
| with respect to durability, individual operations would therefore | work requests. As each operation would be self-describing with | |||
| block with differing semantics. | respect to persistence, individual operations would therefore | |||
| block with differing semantics and complicate the situation even | ||||
| further. | ||||
| Even for the possibly-common case of committing after every write, | Even for the possibly-common case of flushing after every write, | |||
| it is highly undesirable to impose new optional semantics on an | it is highly undesirable to impose new optional semantics on an | |||
| existing operation. And, the same result can be achieved by | existing operation, and therefore also on the upper layer protocol | |||
| sending the commit in the same network packet, and since the RDMA | implementation. And, the same result can be achieved by sending | |||
| Write is unacknowledged while the commit is always replied-to, no | the Flush merged in the same network packet, and since the RDMA | |||
| additional overhead is imposed on the combined exchange. | Write is unacknowledged while the RDMA Flush is always replied-to, | |||
| no additional overhead is imposed on the combined exchange. | ||||
| [Further expand on the undesirable nature of such a change.] | ||||
| 2.2. Additional Semantics | ||||
| Ordering w.r.t. RDMA Write, receives, RDMA Read, other commits. | ||||
| Also, ensure ordering ensures similar remote semantics to local | ||||
| The commit operation is ordered with respect to certain other | ||||
| operations, and it may be advantageous to combine certain actions | ||||
| into the same request, or requests with specific ordering to the | ||||
| commit. Examples to be discussed include: | ||||
| Additional optional payload to be placed and made durable in an | ||||
| atomic fashion after the requested commit. A small (64 bit) | ||||
| payload, sent in the same, or other single, request, and aligned | ||||
| such that it can be made durable in a single hardware operation, | ||||
| can be used to satisfy the "log update" scenario (describe this in | ||||
| more detail). | ||||
| Immediate data to be optionally provided in a completion to an | ||||
| upper layer on the remote peer. Such an indication can be used to | ||||
| signal the upper layer that certain data has been placed in the | ||||
| peer's buffer, and has been made available durably. | ||||
| Remote invalidation, as optionally performed by existing RDMA | For these reasons, it is deemed a non-requirement to extend the | |||
| protocols for other operations. | existing RDMA Write operation. | |||
| Upper Layer message, an optional full message to be provided in a | Similarly, the extension does not consider the use of RDMA Read to | |||
| completion after the commit. | implement Flush. Historically, an RDMA Read has been used by | |||
| applications to ensure that previously written data has been | ||||
| processed by the responding RNIC and has been submitted for ordered | ||||
| Placement. However, this is inadequate for implementing the required | ||||
| RDMA Flush: | ||||
| Integrity check for committed data, which could take the form of a | RDMA Read guarantees only that previously written data has been | |||
| value to be verified before returning, or a value computed and | Placed; it provides no guarantee that the data has reached | |||
| returned which the initiator can use to verify. Specification of | its destination buffer. In practice, an RNIC satisfies the RDMA | |||
| the checksum or hash algorithm, or its negotiation by an upper | Read requirement by simply issuing all PCIe Writes prior to | |||
| layer, will be necessary if adopted. | issuing any PCIe Reads. | |||
| 3. Proposed Extensions | Such PCIe Reads must be issued by the RNIC after all such PCIe | |||
| Writes, therefore flushing a large region requires the RNIC and | ||||
| its attached bus to strictly order (and not cache) its writes, to | ||||
| "scoreboard" its writes, or to perform PCIe Reads to the entire | ||||
| region. The former approach is significantly complex and | ||||
| expensive, and the latter approach requires a large amount of PCIe | ||||
| and network read bandwidth, which are often unnecessary and | ||||
| expensive. The Reads, in any event, may be satisfied by platform- | ||||
| specfic caches, never actually reaching the destination memory or | ||||
| other device. | ||||
| The extensions in this document fall into two categories: | The RDMA Read may begin execution at any time once the request is | |||
| fully received, queued, and the prior RDMA Write requirement has | ||||
| been satisfied. This means that the RDMA Read operation may not | ||||
| be ordered with respect to other queued operations, such as Verify | ||||
| and Atomic Write, in addition to other RDMA Flush operations. | ||||
| o Local behavior extensions | The RDMA Read has no specific error semantic to detect failure, | |||
| and the response may be generated from any cached data in a | ||||
| consistently Placed state, regardless of where it may reside. For | ||||
| this reason, an RDMA Read may proceed without necessarily | ||||
| verifying that a previously ordered "flush" has succeeded or | ||||
| failed. | ||||
| o Protocol extensions | RDMA Read is heavily used by existing RDMA consumers, and the | |||
| semantics are therefore implemented by the existing specification. | ||||
| For new applications to further expect an extended RDMA Read | ||||
| behavior would require an upper layer negotiation to determine if | ||||
| the data sink platform and RNIC appropriately implemented them, or | ||||
| to silently ignore the requirement, with the resulting failure to | ||||
| meet the requirement. An explicit extension, rather than | ||||
| depending on an overloaded side effect, ensures this will not | ||||
| occur. | ||||
| These categories are described, and may be implemented, separately. | Again, for these reasons, it is deemed a non-requirement to reuse or | |||
| extend the existing RDMA Read operation. | ||||
| 3.1. Local Extensions | Therefore, no changes to existing specified RDMA operations are | |||
| proposed, and the protocol is unchanged if the extensions are not | ||||
| invoked. | ||||
| Here discuss memory registration, new memory and protection | 2.2. Requirements for Atomic Write | |||
| attributes, and applicability to both remote and "local" (receives). | ||||
| 3.1.1. Registration Semantics | The persistence of data is a key property by which applications | |||
| implement transactional behavior. Transactional applications, such | ||||
| as databases and log-based filesystems, among many others, implement | ||||
| a "two phase commit" wherein a write is made durable, and *only upon | ||||
| success*, a validity indicator for the written data is set. Such | ||||
| semantics are challenging to provide over an RDMA fabric, as it | ||||
| exists today. The RDMA Write operation does not generate an | ||||
| acknowledgement at the RDMA layers. And, even when an RDMA Write is | ||||
| delivered, if the destination region is persistent, its data can be | ||||
| made persistent at any time, even before a Flush is requested. Out- | ||||
| of-order DDP processing, packet fragmentation, and other matters of | ||||
| scheduling transfers can introduce partial delivery and ordering | ||||
| differences. If a region is made persistent, or even globally | ||||
| visible, before such sequences are complete, significant application- | ||||
| layer inconsistencies can result. Therefore, applications may | ||||
| require fine-grained control over the placement of bytes. In current | ||||
| RDMA storage solutions, these semantics are implemented in upper | ||||
| layers, potentially with additional upper layer message signaling, | ||||
| and corresponding roundtrips and blocking behaviors. | ||||
| New platform-specific attributes to RDMA registration, allows them to | In addition to controlling placement of bytes, the ordering of such | |||
| be processed at the server *only* without client knowledge, or | placement can be important. By providing an ordered relationship | |||
| protocol exposure. No client knowledge - ensures future interop | among write and flush operations, a basic transaction scenario can be | |||
| constructed, in a way which can function with equal semantics both | ||||
| locally and remotely. In a "log-based" scenario, for example, a | ||||
| relatively large segment (log "record") is placed, and made durable. | ||||
| Once persistence of the segment is assured, a second small segment | ||||
| (log "pointer") is written, and optionally also made persistent. The | ||||
| visibility of the second segment is used to imply the validity, and | ||||
| persistence, of the first. Any sequence of such log-operation pairs | ||||
| can thereby always have a single valid state. In case of failure, | ||||
| the resulting string (log) of transactions can therefore be recovered | ||||
| up to and including the final state. | ||||
| New local PM memory registration example: | Such semantics are typically a challenge to implement on general | |||
| purpose hardware platforms, and a variety of application approaches | ||||
| have become common. Generally, they employ a small, well-aligned | ||||
| atom of storage for the second segment (the one used for validity). | ||||
| For example, an integer or pointer, aligned to natural memory address | ||||
| boundaries and CPU and other cache attributes, is stored using | ||||
| instructions which provide for atomic placement. Existing RDMA | ||||
| protocols, however, provide no such capability. | ||||
| Register(region[], PMType, mode) -> Handle | This document specifies an Atomic Write extension, which, | |||
| appropriately constrained, can serve to provide similar semantics. A | ||||
| small (64 bit) payload, sent in a request which is ordered with | ||||
| respect to prior RDMA Flush operations on the same stream and | ||||
| targeted at a segment which is aligned such that it can be placed in | ||||
| a single hardware operation, can be used to satisfy the previously | ||||
| described scenario. Note that the visibility of this payload can | ||||
| also serve as an indication that all prior operations have succeeded, | ||||
| enabling a highly efficient application-visible memory semaphore. | ||||
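[Illustrative sketch, not part of the draft text: the log-record/log-pointer transaction above, modeled in plain Python. The buffer model and helper names (`rdma_write`, `rdma_flush`, `atomic_write`) are hypothetical stand-ins for RNIC operations, used only to show the required ordering of placement, persistence, and the 64-bit validity pointer.]

```python
import struct

# Model of persistent memory: 'pmem' is the durable copy; 'cpu_cache'
# holds bytes that have been placed but not yet flushed to persistence.
pmem = bytearray(4096)
cpu_cache = {}  # offset -> bytes placed but not yet persistent

def rdma_write(offset, data):
    # Placement only: data is visible in cache, not yet persistent.
    cpu_cache[offset] = bytes(data)

def rdma_flush(offset, length):
    # Drain any cached bytes overlapping [offset, offset+length) to pmem.
    for off, data in list(cpu_cache.items()):
        if off < offset + length and off + len(data) > offset:
            pmem[off:off + len(data)] = data
            del cpu_cache[off]

def atomic_write(offset, value64):
    # 64-bit payload, placed in a single operation, and (on the wire)
    # ordered after prior RDMA Flush operations on the same stream --
    # modeled here by calling it only after rdma_flush returns.
    pmem[offset:offset + 8] = struct.pack('>Q', value64)

# Two-phase log commit: record first, then the validity pointer.
record = b'log record payload'
REC_OFF, PTR_OFF = 64, 0
rdma_write(REC_OFF, record)          # 1. place the log record
rdma_flush(REC_OFF, len(record))     # 2. make it persistent
atomic_write(PTR_OFF, REC_OFF)       # 3. pointer implies record validity

assert pmem[REC_OFF:REC_OFF + len(record)] == record
assert struct.unpack('>Q', pmem[PTR_OFF:PTR_OFF + 8])[0] == REC_OFF
```

On failure before step 3, the pointer is never set, so recovery sees only the last fully committed record, giving the single-valid-state property described above.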
| PMType includes type of PM i.e. plain RAM, or "commit | 2.3. Requirements for RDMA Verify | |||
| required", or PCIe-resident, or any other local platform- | ||||
| specific processing | ||||
| Mode includes disposition of data Read and/or write e.g. | An additional matter remains with persistence - the integrity of the | |||
| Cacheable after operation (needed by CPU on data sink) | persistent data. Typically, storage stacks such as filesystems and | |||
 | media approaches such as SCSI T10 DIF or filesystem integrity checks | ||||
 | such as ZFS provide for block- or file-level protection of data at | ||||
| rest on storage devices. With RDMA protocols and physical memory, no | ||||
| such stacks are present. And, to add such support would introduce | ||||
| CPU processing and its inherent latency, counter to the goals of the | ||||
| remote storage approach. Requiring the peer to verify by remotely | ||||
| reading the data is prohibitive in both bandwidth and latency, and | ||||
| without additional mechanisms to ensure the actual stored data is | ||||
 | read (and not a copy in some volatile cache), cannot provide the | ||||
| necessary result. | ||||
| Handle is processed in receiving NIC during RDMA operation to | To address this, an integrity operation is required. The integrity | |||
| specified region, under control of original Mode. | check is initiated by the upper layer or application, which | |||
| optionally computes the expected hash of a given segment of arbitrary | ||||
| size, sending the hash via an RDMA Verify operation targeting the | ||||
| RDMA segment on the responder, and the responder calculating and | ||||
| optionally verifying the hash on the indicated data, bypassing any | ||||
| volatile copies remaining in caches. The responder responds with its | ||||
| computed hash value, or optionally, terminates the connection with an | ||||
| appropriate error status upon mismatch. Specifying this optional | ||||
| termination behavior enables a transaction to be sent as WRITE-FLUSH- | ||||
| VERIFY-ATOMICWRITE, without any pipeline bubble. The result (carried | ||||
| by the subsequently ordered ATOMIC_WRITE) will not not be committed | ||||
| as valid if any prior operation is terminated, and in this case, | ||||
| recovery can be initiated by the requestor immediately from the point | ||||
| of failure. On the other hand, an errorless "scrub" can be | ||||
| implemented without the optional termination behavior, by providing | ||||
| no value for the expected hash. The responder will return the | ||||
| computed hash of the contents. | ||||
| Also consider whether potential "integrity check" behavior can be | The hash algorithm is not specified by the RDMA protocol, instead it | |||
| made per-region. If so, piggybacking it on the registration enables | is left to the upper layer to select an appropriate choice based upon | |||
| selecting the integrity hash, and making its processing optional and | the strength, security, length, support by the RNIC, and other | |||
| straightforward. | criteria. The size of the resulting hash is therefore also not | |||
| specified by the RDMA protocol, but is dictated by the hash | ||||
| algorithm. The RDMA protocol becomes simply a transport for | ||||
| exchanging the values. | ||||
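[Illustrative sketch, not part of the draft text: the hash exchange described above, with SHA-256 as an arbitrary stand-in for whatever algorithm the upper layer selects. The protocol itself only transports the opaque value; algorithm and hash length are agreed out of band, as the paragraph above states.]

```python
import hashlib

def compute_segment_hash(memory, offset, length, algorithm='sha256'):
    # The RDMA protocol does not name the algorithm; both peers must
    # have agreed on it (and thus on the hash length) out of band.
    return hashlib.new(algorithm, memory[offset:offset + length]).digest()

responder_memory = bytearray(b'persistent segment contents')

# Requestor side: compute the expected hash from its own copy of the data.
expected = compute_segment_hash(b'persistent segment contents', 0, 27)

# Responder side: recompute over the stored bytes and compare. On
# mismatch the responder would terminate the connection; on an
# errorless "scrub" (no expected value supplied) it simply returns
# the computed hash to the requestor.
actual = compute_segment_hash(responder_memory, 0, 27)
assert actual == expected   # verify succeeds
assert len(actual) == 32    # length dictated by the chosen algorithm
```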
| Any other per-region durability processing to be explored. | It should be noted that the design of the operation, passing of the | |||
| hash value from requestor to responder (instead of, for example, | ||||
| computing it at the responder and simply returning it), allows both | ||||
| peers to determine immediately whether the segment is considered | ||||
| valid, permitting local processing by both peers if that is not the | ||||
| case. For example, a known-bad segment can be immediately marked as | ||||
| such ("poisoned") by the responder platform, requiring recovery | ||||
| before permitting access. [cf ACPI, JEDEC, SNIA NVMP specifications] | ||||
| 3.1.2. Completion Semantics | 2.4. Local Semantics | |||
| Transparency is possible when upper layer provides Completions (e.g. | The new operations imply new access methods ("verbs") to local | |||
| messages or immediate data) | persistent memory which backs registrations. Registrations of memory | |||
| which support persistence will follow all existing practices to | ||||
| ensure permission-based remote access. The RDMA protocols do not | ||||
| expose these permissions on the wire, instead they are contained in | ||||
| local memory registration semantics. Existing attributes are Remote | ||||
| Read and Remote Write, which are granted individually through local | ||||
| registration on the machine. If an RDMA Read or RDMA Write operation | ||||
| arrives which targets a segment without the appropriate attribute, | ||||
| the connection is terminated. | ||||
| Commit to durability can be piggybacked by data sink upon signaling. | In support of the new operations, new memory attributes are needed. | |||
| Upper layer may not need to explicitly Commit in this case, which of | For RDMA Flush, two "Flushable" attributes provide permission to | |||
| course are dependent on upper layer and workload. | invoke the operation on memory in the region for persistence and/or | |||
| global visibility. When registering, along with the attribute, | ||||
| additional local information can be provided to the RDMA layer such | ||||
| as the type of memory, the necessary processing to make its contents | ||||
| persistent, etc. If the attribute is requested for memory which | ||||
| cannot be persisted, it also allows the local provider to return an | ||||
| error to the upper layer, obviating the upper layer from providing | ||||
| the region to the remote peer. | ||||
| Can apply this concept to RDMA Write with Immediate Or ...ordinary | For RDMA Verify, the "Verifiable" attribute provides permission to | |||
| receives. Strong possibilities exist - explore here. | compute the hash of memory in the region. Again, along with the | |||
| attribute, additional information such as the hash algorithm for the | ||||
| region is provided to the local operation. If the attribute is | ||||
| requested for non-persistent memory, or if the hash algorithm is not | ||||
| available, the local provider can return an error to the upper layer. | ||||
| In the case of success, the upper layer can exchange the necessary | ||||
| information with the remote peer. Note that the algorithm is not | ||||
| identified by the on-the-wire operation as a result. Establishing | ||||
| the choice of hash for each region is done by the local consumer, and | ||||
| each hash result is merely transported by the RDMA protocol. Memory | ||||
| can be registered under multiple regions, if differing hashes are | ||||
 | required, for example unique keys may be provisioned to implement | ||||
| secure hashing. Also note that, for certain "reversible" hash | ||||
| algorithms, this may allow peers to effectively read the memory, | ||||
| therefore, the local platform may require additional read permissions | ||||
| to be associated with the Verifiable permission, when such algorithms | ||||
| are selected. | ||||
| Ordering of operations is critical: Such RDMA Writes cannot be | The Atomic Write operation requires no new attributes, however it | |||
| allowed to "pass" durability. Therefore, protocol implications may | does require the "Remote Write" attribute on the target region, as is | |||
| also exist. | true for any remotely requested write. If the Atomic Write | |||
| additionally targets a Flushable region, the RDMA Flush is performed | ||||
 | separately. It is not generally possible to achieve persistence | ||||
| atomically with placement, even locally. | ||||
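[Illustrative sketch, not part of the draft text: the local registration and permission checks of Section 2.4, modeled in Python. The attribute names and the `register`/`check_access` helpers are hypothetical; real verbs interfaces define their own. It shows the two behaviors the text requires: registration fails locally when the memory cannot support the requested attribute, and an arriving operation lacking the needed attribute terminates the connection.]

```python
from dataclasses import dataclass

# Illustrative attribute bits; real verbs interfaces define their own.
REMOTE_WRITE, FLUSHABLE_PERSIST, FLUSHABLE_GLOBAL, VERIFIABLE = 1, 2, 4, 8

@dataclass
class Region:
    attrs: int
    persistent_capable: bool
    hash_algorithm: str = None

def register(attrs, persistent_capable, hash_algorithm=None):
    # Fail registration locally rather than exposing an unusable region.
    if attrs & FLUSHABLE_PERSIST and not persistent_capable:
        raise ValueError("memory cannot be persisted")
    if attrs & VERIFIABLE and hash_algorithm is None:
        raise ValueError("no hash algorithm available for region")
    return Region(attrs, persistent_capable, hash_algorithm)

def check_access(region, required_attr):
    # An arriving operation targeting a segment without the needed
    # attribute terminates the connection (modeled as an exception).
    if not region.attrs & required_attr:
        raise PermissionError("terminate: missing attribute")

r = register(REMOTE_WRITE | FLUSHABLE_PERSIST | VERIFIABLE,
             persistent_capable=True, hash_algorithm='sha256')
check_access(r, FLUSHABLE_PERSIST)   # RDMA Flush permitted
check_access(r, REMOTE_WRITE)        # Atomic Write needs Remote Write
try:
    register(FLUSHABLE_PERSIST, persistent_capable=False)
except ValueError:
    pass  # local provider rejects; upper layer never advertises region
```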
| Discuss optional behaviors explored in prior section, and whether/how | 3. RDMA Protocol Extensions | |||
| they generate completions. | ||||
| 3.1.3. Platform Semantics | The extensions in this document fall into two categories: | |||
| Writethrough behavior on durable regions and reasons for same. | o Protocol extensions | |||
| Consider requiring/recommending a local writethrough behavior on any | ||||
| durable region, to support a nonblocking hurry-up to avoid future | ||||
| stalls on a subsequent cache flush, prior to a commit. Also, it | ||||
| would enhance durability. | ||||
| PCI extension to support Commit Allow NIC to provide durability | o Local behavior extensions | |||
| directly and efficiently To Memory, CPU, PCI Root, PM device, PCIe | ||||
| device, ... Avoids CPU interaction Supports strong data consistency | ||||
| model Performs equivalent of: CLFLUSHOPT (region list) PCOMMIT Or if | ||||
| NIC is on memory bus or within CPU complex... Other possibilities | ||||
| exist | ||||
| 3.2. RDMAP Extensions | These categories are described, and may be implemented, separately. | |||
| This document defines a new RDMA operation, "RDMA Commit". The wire- | 3.1. RDMAP Extensions | |||
| related aspects of the proposed protocol are discussed in this | ||||
| section. | ||||
| This section and the ones following present one possible approach | The wire-related aspects of the extensions are discussed in this | |||
 | toward defining the wire protocol defined by the above discussion. | section. This document defines the following new RDMA operations. | ||||
| The definitions are included for initial discussion and do not | ||||
| comprise a complete specification. Certain additional protocol | ||||
| features of any potential new extension, such as any associated | ||||
| Immediate Data, Solicited Events, Remote Invalidation, ULP Message | ||||
| inclusion, etc, are left to a later version. | ||||
| For reference, Figure 1 depicts the format of the DDP Control and | For reference, Figure 1 depicts the format of the DDP Control and | |||
| RDMAP Control Fields, in the style and convention of RFC 5040 and | RDMAP Control Fields, in the style and convention of RFC5040 and | |||
| RFC7306: | RFC7306: | |||
| The DDP Control Field consists of the T (Tagged), L (Last), Resrv, | The DDP Control Field consists of the T (Tagged), L (Last), Resrv, | |||
 | and DV (DDP protocol Version) fields RFC 5041. The RDMAP Control | and DV (DDP protocol Version) fields, as defined in RFC5041. The | ||||
 | Field consists of the RV (RDMA Version), Rsv, and Opcode fields RFC | RDMAP Control Field consists of the RV (RDMA Version), Rsv, and | ||||
 | 5040. | Opcode fields, as defined in RFC5040. No change or extension is made | ||||
| to these fields by this specification. | ||||
| This specification adds values for the RDMA Opcode field to those | This specification adds values for the RDMA Opcode field to those | |||
| specified in RFC 5040. Table 1 defines the new values of the RDMA | specified in RFC5040. Table 1 defines the new values of the RDMA | |||
| Opcode field that are used for the RDMA Messages defined in this | Opcode field that are used for the RDMA Messages defined in this | |||
| specification. | specification. | |||
| As shown in Table 1, STag (Steering Tag) and Tagged Offset are valid | As shown in Table 1, STag (Steering Tag) and Tagged Offset are valid | |||
| only for certain RDMA Messages defined in this specification. | only for certain RDMA Messages defined in this specification. | |||
| Table 1 also shows the appropriate Queue Number for each Opcode. | Table 1 also shows the appropriate Queue Number for each Opcode. | |||
| 0 1 2 3 | 0 1 2 3 | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| |T|L| Resrv | DV| RV|R| Opcode | | |T|L| Resrv | DV| RV|R| Opcode | | |||
| | | | | | |s| | | | | | | | |s| | | |||
| | | | | | |v| | | | | | | | |v| | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Invalidate STag | | | Invalidate STag | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Figure 1: DDP Control and RDMAP Control Fields | DDP Control and RDMAP Control Fields | |||
| All RDMA Messages defined in this specification MUST carry the | All RDMA Messages defined in this specification MUST carry the | |||
| following values: | following values: | |||
| o The RDMA Version (RV) field: 01b. | o The RDMA Version (RV) field: 01b. | |||
| o Opcode field: Set to one of the values in Table 1. | o Opcode field: Set to one of the values in Table 2. | |||
| o Invalidate STag: Set to zero, or optionally to non-zero by the | o Invalidate STag: Set to zero, or optionally to non-zero by the | |||
| sender, processed by the receiver. | sender, processed by the receiver. | |||
| Note: N/A in the table below means Not Applicable | Note: N/A in the table below means Not Applicable | |||
| -------+-----------+-------+------+-------+-----------+-------------- | -------+------------+-------+------+-------+-----------+------------- | |||
| RDMA | Message | Tagged| STag | Queue | Invalidate| Message | RDMA | Message | Tagged| STag | Queue | Invalidate| Message | |||
| Opcode | Type | Flag | and | Number| STag | Length | Opcode | Type | Flag | and | Number| STag | Length | |||
| | | | TO | | | Communicated | | | | TO | | | Communicated | |||
| | | | | | | between DDP | | | | | | | between DDP | |||
| | | | | | | and RDMAP | | | | | | | and RDMAP | |||
| -------+-----------+-------+------+-------+-----------+-------------- | -------+------------+-------+------+-------+-----------+------------- | |||
| -------+-----------+------------------------------------------------- | -------+------------+------------------------------------------------ | |||
| 01100b | RDMA | 0 | N/A | 1 | opt | Yes | 01100b | RDMA Flush | 0 | N/A | 1 | opt | Yes | |||
| | Commit | | | | | | | Request | | | | | | |||
| | Request | | | | | | -------+------------+------------------------------------------------ | |||
| -------+-----------+------------------------------------------------- | 01101b | RDMA Flush | 0 | N/A | 3 | N/A | No | |||
| 01101b | RDMA | 0 | N/A | 3 | N/A | Yes | | Response | | | | | | |||
| | Commit | | | | | | -------+------------+------------------------------------------------ | |||
| | Response | | | | | | 01110b | RDMA Verify| 0 | N/A | 1 | opt | Yes | |||
| -------+-----------+------------------------------------------------- | | Request | | | | | | |||
| -------+------------+------------------------------------------------ | ||||
| 01111b | RDMA Verify| 0 | N/A | 3 | N/A | Yes | ||||
| | Response | | | | | | ||||
| -------+------------+------------------------------------------------ | ||||
| 10000b | Atomic | 0 | N/A | 1 | opt | Yes | ||||
| | Write | | | | | | ||||
| | Request | | | | | | ||||
| -------+------------+------------------------------------------------ | ||||
| 10001b | Atomic | 0 | N/A | 3 | N/A | No | ||||
| | Write | | | | | | ||||
| | Response | | | | | | ||||
| -------+------------+------------------------------------------------ | ||||
| Table 1: Additional RDMA Usage of DDP Fields | Additional RDMA Usage of DDP Fields | |||
| This extension adds RDMAP use of Queue Number 1 for Untagged Buffers | This extension adds RDMAP use of Queue Number 1 for Untagged Buffers | |||
| for issuing RDMA Commit Requests, and use of Queue Number 3 for | for issuing RDMA Flush, RDMA Verify and Atomic Write Requests, and | |||
| Untagged Buffers for tracking RDMA Commit Responses. | use of Queue Number 3 for Untagged Buffers for tracking the | |||
| respective Responses. | ||||
| All other DDP and RDMAP Control Fields are set as described in RFC | All other DDP and RDMAP Control Fields are set as described in | |||
| 5040 and RFC 7306. | RFC5040 and RFC7306. | |||
| Table 2 defines which RDMA Headers are used on each new RDMA Message | Table 3 defines which RDMA Headers are used on each new RDMA Message | |||
| and which new RDMA Messages are allowed to carry ULP payload. | and which new RDMA Messages are allowed to carry ULP payload. | |||
| -------+-----------+-------------------+------------------------- | -------+------------+-------------------+------------------------- | |||
| RDMA | Message | RDMA Header Used | ULP Message allowed in | RDMA | Message | RDMA Header Used | ULP Message allowed in | |||
| Message| Type | | the RDMA Message | Message| Type | | the RDMA Message | |||
| OpCode | | | | OpCode | | | | |||
| | | | | -------+------------+-------------------+------------------------- | |||
| -------+-----------+-------------------+------------------------- | -------+------------+-------------------+------------------------- | |||
| -------+-----------+-------------------+------------------------- | 01100b | RDMA Flush | None | No | |||
| 01100b | RDMA | None | TBD | | Request | | | |||
| | Commit | | | -------+------------+-------------------+------------------------- | |||
| | Request | | | 01101b | RDMA Flush | None | No | |||
| -------+-----------+-------------------+------------------------- | | Response | | | |||
| 01101b | RDMA | None | No | -------+------------+--------------------------------------------- | |||
| | Commit | | | 01110b | RDMA Verify| None | No | |||
| | Response | | | | Request | | | |||
| -------+-----------+--------------------------------------------- | -------+------------+-------------------+------------------------- | |||
| 01111b | RDMA Verify| None | No | ||||
| | Response | | | ||||
| -------+------------+--------------------------------------------- | ||||
| 10000b | Atomic | None | No | ||||
| | Write | | | ||||
| | Request | | | ||||
| -------+------------+--------------------------------------------- | ||||
 | 10001b | Atomic | None | No | ||||
| | Write | | | ||||
| | Response | | | ||||
| -------+------------+--------------------------------------------- | ||||
| Table 2: RDMA Message Definitions | RDMA Message Definitions | |||
| Further discussion. | 3.1.1. RDMA Flush | |||
| 3.2.1. RDMA Commit Request Header Format | The RDMA Flush operation requests that all bytes in a specified | |||
| region are to be made persistent and/or globally visible, under | ||||
| control of specified flags. As specified in section 4 its operation | ||||
| is ordered after the successful completion of any previous requested | ||||
| RDMA Write or certain other operations. The response is generated | ||||
| after the region has reached its specified state. The implementation | ||||
| MUST fail the operation and send a terminate message if the RDMA | ||||
| Flush cannot be performed, or has encountered an error. | ||||
| The RDMA Commit Request Message makes use of the DDP Untagged Buffer | The RDMA Flush operation MUST NOT be completed by the data sink until | |||
| Model. RDMA Commit Request messages MUST use the same Queue Number | all data has attained the requested state. Achieving persistence may | |||
| as RDMA Read Requests and RDMA Extensions Atomic Operation Requests | require programming and/or flushing of device buffers, while | |||
| (QN=1). Reusing the same queue number for RMDA Commit Requests | achieving global visibility may require flushing of cached buffers | |||
| allows the operations to reuse the same infrastructure (e.g. | across the entire platform interconnect. In no event are persistence | |||
| Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as | and global visibility achieved atomically, one may precede the other | |||
 | that defined for RDMA Read Requests. | and either may complete at any time. The Atomic Write operation may be | ||||
| used by an upper layer consumer to indicate that either or both | ||||
| dispositions are available after completion of the RDMA Flush, in | ||||
| addition to other approaches. | ||||
| The RDMA Commit Request Message carries an RDMA Commit header that | 3.1.1.1. RDMA Flush Request Format | |||
| describes the ULP Buffer address in the Responder's memory. The RDMA | ||||
| Write Request header immediately follows the DDP header. The RDMAP | The RDMA Flush Request Message makes use of the DDP Untagged Buffer | |||
| layer passes an RDMAP Control Field to the DDP layer. Figure 2 | Model. RDMA Flush Request messages MUST use the same Queue Number as | |||
| depicts the RDMA Commit Request Header that is used for all RDMA | RDMA Read Requests and RDMA Extensions Atomic Operation Requests | |||
| Commit Request Messages: | (QN=1). Reusing the same queue number for RDMA Flush Requests allows | |||
| the operations to reuse the same RDMA infrastructure (e.g. Outbound | ||||
| and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that | ||||
| defined for RDMA Read Requests. | ||||
| The RDMA Flush Request Message carries a payload that describes the | ||||
| ULP Buffer address in the Responder's memory. The following figure | ||||
| depicts the Flush Request that is used for all RDMA Flush Request | ||||
| Messages: | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Request Identifier | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink STag | | | Data Sink STag | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink Length | | | Data Sink Length | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink Tagged Offset | | | Data Sink Tagged Offset | | |||
| + + | + + | |||
| | | | | | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | ... | | | Flush Disposition Flags +G+P| | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Figure 2: RDMA Commit Request Header | Flush Request | |||
| Request Identifier: 32 bits. The Request Identifier specifies a | Data Sink STag: 32 bits The Data Sink STag identifies the Remote | |||
| number that is used to identify the RDMA Commit Request Message. | Peer's Tagged Buffer targeted by the RDMA Flush Request. The Data | |||
| The value used in this field is selected by the RNIC that sends | Sink STag is associated with the RDMAP Stream through a mechanism | |||
| the message, and it is reflected back to the Local Peer in the | that is outside the scope of the RDMAP specification. | |||
| RDMA Commit Response message. N.B. Is this field really useful | ||||
| to the RNIC, or does ordering suffice??? | Data Sink Length: The Data Sink Length is the length, in octets, of | |||
| the bytes targeted by the RDMA Flush Request. | ||||
| Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset | ||||
| specifies the starting offset, in octets, from the base of the | ||||
| Remote Peer's Tagged Buffer targeted by the RDMA Flush Request. | ||||
| Flags: Flags specifying the disposition of the flushed data: 0x01 | ||||
| Flush to Persistence, 0x02 Flush to Global Visibility. | ||||
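[Illustrative sketch, not part of the draft text: packing the Flush Request payload drawn in the figure above. Assumes the field widths as drawn (32-bit STag, 32-bit Length, 64-bit Tagged Offset, 32-bit flags word with P=0x01 and G=0x02 in the low bits) and network byte order; the DDP/RDMAP control fields and untagged-buffer (QN=1) delivery are handled by the lower layers and are omitted here.]

```python
import struct

# Flush disposition flags, from the request format above.
FLUSH_PERSISTENCE       = 0x01  # P bit
FLUSH_GLOBAL_VISIBILITY = 0x02  # G bit

def pack_flush_request(stag, length, tagged_offset, flags):
    # Payload only: 32-bit STag, 32-bit Length, 64-bit Tagged Offset,
    # 32-bit flags word, all big-endian (network byte order).
    return struct.pack('>IIQI', stag, length, tagged_offset, flags)

req = pack_flush_request(stag=0x1234, length=4096,
                         tagged_offset=0x10_0000_0000,
                         flags=FLUSH_PERSISTENCE | FLUSH_GLOBAL_VISIBILITY)
assert len(req) == 20
assert struct.unpack('>IIQI', req)[3] == 0x03
```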
| 3.1.1.2. RDMA Flush Response | ||||
| The RDMA Flush Response Message makes use of the DDP Untagged Buffer | ||||
| Model. RDMA Flush Response messages MUST use the same Queue Number | ||||
| as RDMA Extensions Atomic Operation Responses (QN=3). No payload is | ||||
| passed to the DDP layer on Queue Number 3. | ||||
| Upon successful completion of RDMA Flush processing, an RDMA Flush | ||||
| Response MUST be generated. | ||||
 | If, during RDMA Flush processing on the Responder, an error is | ||||
 | detected which would result in the requested region not achieving | ||||
| the requested disposition, the Responder MUST generate a Terminate | ||||
| message. The contents of the Terminate message are defined in | ||||
| Section 5.2. | ||||
| 3.1.1.3. RDMA Flush Ordering and Atomicity | ||||
| Ordering and completion rules for RDMA Flush Request are similar to | ||||
| those for an Atomic operation as described in section 5 of RFC7306. | ||||
| The queue number field of the RDMA Flush Request for the DDP layer | ||||
 | MUST be 1, and that of the RDMA Flush Response MUST be 3. | ||||
| There are no ordering requirements for the placement of the data, nor | ||||
| are there any requirements for the order in which the data is made | ||||
| globally visible and/or persistent. Data received by prior | ||||
| operations (e.g. RDMA Write) MAY be submitted for placement at any | ||||
| time, and persistence or global visibility MAY occur before the flush | ||||
| is requested. After placement, data MAY become persistent or | ||||
| globally visible at any time, in the course of operation of the | ||||
| persistency management of the storage device, or by other actions | ||||
| resulting in persistence or global visibility. | ||||
| Any region segment specified by the RDMA Flush operation MUST be made | ||||
| persistent and/or globally visible before successful return of the | ||||
| operation. If RDMA Flush processing is successful on the Responder, | ||||
| meaning the requested bytes of the region are, or have been made | ||||
| persistent and/or globally visible, as requested, the RDMA Flush | ||||
| Response MUST be generated. | ||||
| There are no atomicity guarantees provided on the Responder's node by | ||||
| the RDMA Flush Operation with respect to any other operations. While | ||||
| the Completion of the RDMA Flush Operation ensures that the requested | ||||
| data was placed into, and flushed from the target Tagged Buffer, | ||||
| other operations might have also placed or fetched overlapping data. | ||||
| The upper layer is responsible for arbitrating any shared access. | ||||
| (Sidebar) It would be useful to make a statement about other RDMA | ||||
| Flush to the target buffer and RDMA Read from the target buffer on | ||||
| the same connection. Use of QN 1 for these operations provides | ||||
| ordering possibilities which imply that they will "work" (see #7 | ||||
| below). NOTE: this does not, however, extend to RDMA Write, which is | ||||
| not queued nor sequenced and therefore does not employ a DDP QN. | ||||
| 3.1.2. RDMA Verify | ||||
| The RDMA Verify operation requests that all bytes in a specified | ||||
| region are to be read from the underlying storage and that an | ||||
| integrity hash be calculated. As specified in section 4 its | ||||
| operation is ordered after the successful completion of any previous | ||||
| requested RDMA Write and RDMA Flush, or certain other operations. | ||||
| The implementation MUST fail the operation and send a terminate | ||||
| message if the RDMA Verify cannot be performed, has encountered an | ||||
| error, or if a hash value was provided in the request and the | ||||
| calculated hash does not match. If no condition for a Terminate | ||||
| message is encountered, the response is generated containing the | ||||
 | resulting calculated hash value. | ||||
| 3.1.2.1. RDMA Verify Request Format | ||||
| The RDMA Verify Request Message makes use of the DDP Untagged Buffer | ||||
| Model. RDMA Verify Request messages MUST use the same Queue Number | ||||
| as RDMA Read Requests and RDMA Extensions Atomic Operation Requests | ||||
| (QN=1). Reusing the same queue number for RDMA Read and RDMA Flush | ||||
| Requests allows the operations to reuse the same RDMA infrastructure | ||||
| (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow | ||||
| control) as that defined for those requests. | ||||
| The RDMA Verify Request Message carries a payload that describes the | ||||
| ULP Buffer address in the Responder's memory. The following figure | ||||
| depicts the Verify Request that is used for all RDMA Verify Request | ||||
| Messages: | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink STag | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Length | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Tagged Offset | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Hash Value (optional, variable) | | ||||
| | ... | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Verify Request | ||||
| Data Sink STag: 32 bits The Data Sink STag identifies the Remote | Data Sink STag: 32 bits The Data Sink STag identifies the Remote | |||
| Peer's Tagged Buffer targeted by the RDMA Commit Request. The | Peer's Tagged Buffer targeted by the Verify Request. The Data | |||
| Data Sink STag is associated with the RDMAP Stream through a | Sink STag is associated with the RDMAP Stream through a mechanism | |||
| mechanism that is outside the scope of the RDMAP specification. | that is outside the scope of the RDMAP specification. | |||
| Data Sink Length: The Data Sink Length is the length, in octets, of | ||||
| the bytes targeted by the Verify Request. | ||||
| Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset | Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset | |||
| specifies the starting offset, in octets, from the base of the | specifies the starting offset, in octets, from the base of the | |||
| Remote Peer's Tagged Buffer targeted by the RDMA Commit Request. | Remote Peer's Tagged Buffer targeted by the Verify Request. | |||
| ... Additional region identifiers to be committed in processing the | Hash Value: The Hash Value is optionally an octet string | |||
| RDMA Commit Request, and/or upper layer message to be passed to | representing the expected result, if any, of the hash algorithm on | |||
| upper layer after commit completion (TBD). | the Remote Peer's Tagged Buffer. The length of the Hash Value is | |||
| variable, and dependent on the selected algorithm. When provided, | ||||
| any mismatch with the calculated value causes the Responder to | ||||
| generate a Terminate message, and close the connection. The | ||||
| contents of the Terminate message are defined in section 5.2. | ||||
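As an informal illustration only (not part of the wire specification), the Verify Request payload depicted above can be packed as follows. The 32-bit width of the Data Sink Length field is assumed from the figure, network byte order is assumed throughout, and the optional Hash Value is appended verbatim since its width depends on the selected algorithm:

```python
import struct

def pack_verify_request(stag: int, length: int, offset: int,
                        hash_value: bytes = b"") -> bytes:
    """Pack a Verify Request payload: 32-bit STag, 32-bit Length
    (width assumed from the figure), 64-bit Tagged Offset, then the
    optional variable-length Hash Value."""
    payload = struct.pack("!IIQ", stag, length, offset)
    return payload + hash_value

# Fixed-size portion: 4 (STag) + 4 (Length) + 8 (Tagged Offset) = 16 octets
req = pack_verify_request(stag=0x1234, length=4096, offset=0x10)
```

The function name and its parameters are illustrative; an RNIC would construct this payload internally before passing it to the DDP layer on Queue Number 1.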
| 3.2.2. RDMA Commit Response Header Format | 3.1.2.2. Verify Response Format | |||
| The RDMA Commit Response Message makes use of the DDP Untagged Buffer | The Verify Response Message makes use of the DDP Untagged Buffer | |||
| Model. RDMA Commit Response messages MUST use the same Queue Number | Model. Verify Response messages MUST use the same Queue Number as | |||
| as RDMA Extensions Atomic Operation Responses (QN=3). The RDMAP | RDMA Flush Responses (QN=3). The RDMAP layer passes the following | |||
| layer passes the following payload to the DDP layer on Queue Number | payload to the DDP layer on Queue Number 3. The RDMA Verify Response | |||
| 3. | is not sent when a Terminate message is generated through specifying | |||
| the Compare Flag as 1, and a mismatch occurs. | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Original Request Identifier | | | Hash Value (variable) | | |||
| | ... | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Status | | ||||
| Verify Response | ||||
| Hash Value: The Hash Value is an octet string representing the | ||||
| result of the hash algorithm on the Remote Peer's Tagged Buffer. | ||||
| The length of the Hash Value is variable, and dependent on the | ||||
| algorithm selected by the upper layer consumer, among those | ||||
| supported by the RNIC. | ||||
| 3.1.2.3. RDMA Verify Ordering | ||||
| Ordering and completion rules for RDMA Verify Request are similar to | ||||
| those for an Atomic operation as described in section 5 of RFC7306. | ||||
| The queue number field of the RDMA Verify Request for the DDP layer | ||||
| MUST be 1, and the RDMA Verify Response for the DDP layer MUST be 3. | ||||
| As specified in section 4, RDMA Verify and RDMA Flush are executed by | ||||
| the Data Sink in strict order. When an RDMA Verify follows an RDMA | ||||
| Flush, and because the RDMA Flush MUST ensure that all bytes are in | ||||
| the specified state before responding, any RDMA Verify that follows | ||||
| can be assured that it is operating on flushed data. If unflushed | ||||
| data has been sent to the region segment between the operations, | ||||
| since data may be made persistent and/or globally visible by the Data | ||||
| Sink at any time, the result of any such RDMA Verify is undefined. | ||||
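The responder-side Verify processing described above can be sketched informally as follows. SHA-256 stands in for whichever hash algorithm the upper layer has selected (algorithm negotiation is out of scope here), and exceptions model the generation of a Terminate message:

```python
import hashlib
from typing import Optional

def process_verify(buffer: bytes, offset: int, length: int,
                   expected_hash: Optional[bytes]) -> bytes:
    """Illustrative responder-side Verify processing.

    Returns the calculated hash, to be carried in the Verify Response;
    raises to model a Terminate message on a bounds error or, when an
    expected hash was provided in the Request, on a mismatch."""
    if offset + length > len(buffer):
        raise ValueError("region out of bounds: Terminate message")
    digest = hashlib.sha256(buffer[offset:offset + length]).digest()
    if expected_hash is not None and digest != expected_hash:
        raise ValueError("hash mismatch: Terminate, close connection")
    return digest  # carried in the Verify Response

data = bytes(64)
h = process_verify(data, 0, 64, None)
```

Per the ordering rules above, a real implementation would run this only after any preceding RDMA Flush on the same stream has completed, so the hash is computed over flushed data.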
| 3.1.3. Atomic Write | ||||
| The Atomic Write operation provides a block of data which is placed | ||||
| in a specified region atomically, and as specified in section 4 its | ||||
| placement is ordered after the successful completion of any previous | ||||
| requested RDMA Flush or RDMA Verify. This specified region is | ||||
| constrained in size and alignment to 64-bits at 64-bit alignment, and | ||||
| the implementation MUST fail the operation and send a terminate | ||||
| message if the placement cannot be performed atomically. | ||||
| The Atomic Write Operation requires the Responder to write a 64-bit | ||||
| value at a ULP Buffer address that is 64-bit aligned in the | ||||
| Responder's memory, in a manner which ensures the value is | ||||
| Placed atomically. | ||||
| 3.1.3.1. Atomic Write Request | ||||
| The Atomic Write Request Message makes use of the DDP Untagged Buffer | ||||
| Model. Atomic Write Request messages MUST use the same Queue Number | ||||
| as RDMA Read Requests and RDMA Extensions Atomic Operation Requests | ||||
| (QN=1). Reusing the same queue number for RDMA Flush and RDMA Verify | ||||
| Requests allows the operations to reuse the same RDMA infrastructure | ||||
| (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow | ||||
| control) as that defined for those Requests. | ||||
| The Atomic Write Request Message carries an Atomic Write Request | ||||
| payload that describes the ULP Buffer address in the Responder's | ||||
| memory, as well as the data to be written. The following figure | ||||
| depicts the Atomic Write Request that is used for all Atomic Write | ||||
| Request Messages: | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink STag | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Length | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Tagged Offset | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Figure 3: RDMA Commit Response Header | Atomic Write Request | |||
| Original Request Identifier: 32 bits. The Original Request | Data Sink STag: 32 bits The Data Sink STag identifies the Remote | |||
| Identifier is set to the value specified in the Request Identifier | Peer's Tagged Buffer targeted by the Atomic Write Request. The | |||
| field that was originally provided in the corresponding RDMA | Data Sink STag is associated with the RDMAP Stream through a | |||
| Commit Request Message. N.B. ditto previous question. | mechanism that is outside the scope of the RDMAP specification. | |||
| Status: 32 bits. Zero if the RDMA Commit was successfully processed, | Data Sink Length: The Data Sink Length is the length of data to be | |||
| or any other value if not. | placed, and MUST be 8. | |||
| 3.2.3. Ordering | Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset | |||
| specifies the starting offset, in octets, from the base of the | ||||
| Remote Peer's Tagged Buffer targeted by the Atomic Write Request. | ||||
| This offset can be any value, but the destination ULP buffer | ||||
| address MUST be aligned as specified above. Ensuring that the | ||||
| STag and Data Sink Tagged Offset values appropriately meet such a | ||||
| requirement is an upper layer consumer responsibility, and is out | ||||
| of scope for this specification. | ||||
| Ordering and completion rules for RDMA Commit Request are similar to | Data: The 64-bit data value to be written, in big-endian format. | |||
| those for an Atomic operation as described in section 5 of RFC 7306. | ||||
| The queue number field of the RDMA Commit Request for the DDP layer | ||||
| MUST be 1, and the RDMA Commit Response for the DDP layer MUST be 3. | ||||
| There are no ordering requirements for the placement of the data to | Atomic Write Operations MUST target ULP Buffer addresses that are | |||
| be committed, nor are there any requirements for the order in which | 64-bit aligned, and conform to any other platform restrictions on the | |||
| the data is made durable. Data received by prior operations (e.g. | Responder system. The write MUST NOT be Placed prior to all prior | |||
| RDMA Write) MAY be submitted for placement at any time, and | RDMA Flush operations, and therefore all other prior operations, | |||
| durability MAY occur before the commit is requested. Data committed | completing successfully. | |||
| after placement MAY become durable at any time, in the course of | ||||
| operation of the persistency management of the storage device, or by | ||||
| other actions resulting in durability. Any data specified by the | ||||
| commit operation, in any case, MUST be made durable before successful | ||||
| return of the commit. | ||||
| 3.2.4. Atomicity | If an Atomic Write Operation is attempted on a target ULP Buffer | |||
| address that is not 64-bit aligned, or due to alignment, size, or | ||||
| other platform restrictions cannot be performed atomically: | ||||
| There are no atomicity guarantees provided on the Responder's node by | The operation MUST NOT be performed | |||
| the RDMA Commit Operation with respect to any other operations. | ||||
| While the Completion of the RDMA Commit Operation ensures that the | ||||
| requested data was placed and committed to the target Tagged Buffer, | ||||
| other operations might have also placed or fetched overlapping data. | ||||
| The upper layer is responsible for arbitrating any shared access. | ||||
| (To discuss) The commit operation provides an optional block of data | The Responder's memory MUST NOT be modified | |||
| which is committed to a specified region after the successful | ||||
| completion of the requested commit. This specified region MAY be | ||||
| constrained in size and alignment by the implementation, and the | ||||
| implementation MUST fail the operation and send a terminate message | ||||
| if the subsequent commit cannot be performed atomically. The | ||||
| implementation MUST NOT perform the subsequent commit if an error | ||||
| occurred on the requested commit, and SHOULD return a non-zero status | ||||
| indicating the error. | ||||
| (Sidebar) It would be useful to make a statement about other RDMA | A terminate message MUST be generated. (See Section 5.2 for the | |||
| Commit to the target buffer and RDMA Read from the target buffer on | contents of the terminate message.) | |||
| the same connection. Use of QN 1 for these operations provides | ||||
| ordering guarantees which imply that they will "work" (see #7 below). | ||||
| NOTE: this does not, however, extend to RDMA Write, which is not | ||||
| sequenced nor does it employ a DDP QN. | ||||
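A minimal, non-normative model of the responder-side placement rules above: the 8-octet, big-endian value must land at a 64-bit-aligned ULP buffer address, and on any alignment or size violation no placement occurs (the exception stands in for the required Terminate message):

```python
import struct

def atomic_write(memory: bytearray, offset: int, data: bytes) -> None:
    """Model of responder-side Atomic Write placement.

    Rejects any request that cannot be Placed atomically: the data
    MUST be 8 octets and the offset 64-bit aligned; otherwise the
    Responder's memory is left unmodified."""
    if len(data) != 8 or offset % 8 != 0:
        raise ValueError("cannot place atomically: Terminate message")
    memory[offset:offset + 8] = data  # modeled as a single 64-bit store

mem = bytearray(64)
atomic_write(mem, 16, struct.pack("!Q", 0xDEADBEEF))
```

In hardware the store would additionally be ordered after all prior RDMA Flush operations on the stream, which this sketch does not attempt to model.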
| 3.2.5. Discovery of RDMAP Extensions | 3.1.3.2. Atomic Write Response | |||
| As for RFC 7306, explicit negotiation by the RDMAP peers of the | The Atomic Write Response Message makes use of the DDP Untagged | |||
| Buffer Model. Atomic Write Response messages MUST use the | ||||
| same Queue Number as RDMA Flush Responses (QN=3). The RDMAP layer | ||||
| passes no payload to the DDP layer on Queue Number 3. | ||||
| 3.1.4. Discovery of RDMAP Extensions | ||||
| As for RFC7306, explicit negotiation by the RDMAP peers of the | ||||
| extensions covered by this document is not required. Instead, it is | extensions covered by this document is not required. Instead, it is | |||
| RECOMMENDED that RDMA applications and/or ULPs negotiate any use of | RECOMMENDED that RDMA applications and/or ULPs negotiate any use of | |||
| these extensions at the application or ULP level. The definition of | these extensions at the application or ULP level. The definition of | |||
| such application-specific mechanisms is outside the scope of this | such application-specific mechanisms is outside the scope of this | |||
| specification. For backward compatibility, existing applications | specification. For backward compatibility, existing applications | |||
| and/or ULPs should not assume that these extensions are supported. | and/or ULPs should not assume that these extensions are supported. | |||
| In the absence of application-specific negotiation of the features | In the absence of application-specific negotiation of the features | |||
| defined within this specification, the new operations can be | defined within this specification, the new operations can be | |||
| attempted, and reported errors can be used to determine a remote | attempted, and reported errors can be used to determine a remote | |||
| peer's capabilities. In the case of RDMA Commit, an operation to a | peer's capabilities. In the case of RDMA Flush and Atomic Write, an | |||
| previously Advertised buffer with remote write permission can be used | operation to a previously Advertised buffer with remote write | |||
| to determine the peer's support. A Remote Operation Error or | permission can be used to determine the peer's support. A Remote | |||
| Unexpected OpCode error will be reported by the remote peer if the | Operation Error or Unexpected OpCode error will be reported by the | |||
| Operation is not supported by the remote peer. | remote peer if the Operation is not supported by the remote peer. | |||
| For RDMA Verify, such an operation may target a buffer with remote | ||||
| read permission. | ||||
| 4. Ordering and Completions Table | 3.2. Local Extensions | |||
| Table 3 summarizes the ordering relationships for the RDMA Commit | This section discusses memory registration, new memory and protection | |||
| operation from the standpoint of the Requester. Note that in the | attributes, and applicability to both remote and "local" (receives). | |||
| table, Send Operation includes Send, Send with Invalidate, Send with | Because this section does not specify any wire-visible semantic, it | |||
| Solicited Event, and Send with Solicited Event and Invalidate. Also | is entirely informative. | |||
| note that Immediate Operation includes Immediate Data and Immediate | ||||
| Data with Solicited Event. | ||||
| As for the prior section, the text below presents one possible | 3.2.1. Registration Semantics | |||
| approach, and is included in skeletal form to be filled-in when | ||||
| appropriate. | ||||
| Note: N/A in the table below means Not Applicable | New platform-specific attributes to RDMA registration, allows them to | |||
| be processed at the server *only* without client knowledge, or | ||||
| protocol exposure. No client knowledge - robust design ensuring | ||||
| future interop | ||||
| New local PMEM memory registration example: | ||||
| Register(region[], MemPerm, MemType, MemMode) -> STag | ||||
| Region describes the memory segment[s] to be registered by the | ||||
| returned STag. The local RNIC may limit the size and number of | ||||
| these segments. | ||||
| MemPerm to indicate permitted operations in addition to remote | ||||
| read and remote write: "remote flush to persistence", "remote | ||||
| flush to global visibility", selectivity, etc. | ||||
| MemType includes type of storage described by the Region, e.g. | ||||
| plain RAM, "flush required" (flushable), or PCIe-resident via | ||||
| peer-to-peer, or any other local platform-specific processing | ||||
| MemMode includes disposition of data Read and/or written e.g. | ||||
| Cacheable after operation (indicate if needed by CPU on data | ||||
| sink, to allow or avoid writethrough as optimization) | ||||
| None of the above attributes are relevant to, or exposed by, | ||||
| the protocol. | ||||
| STag is processed in receiving RNIC during RDMA operation to | ||||
| specified region, under control of original Perm, Type and Mode. | ||||
| 3.2.2. Completion Semantics | ||||
| Discuss the interactions with new operations when upper layer | ||||
| provides Completions to responder (e.g. messages via receive or | ||||
| immediate data via RDMA Write). Natural conclusion of ordering | ||||
| rules, but made explicit. | ||||
| Ordering of operations is critical: Such RDMA Writes cannot be | ||||
| allowed to "pass" persistence or global visibility, and RDMA Flush | ||||
| may not begin until prior RDMA Writes to flush region are accounted | ||||
| for. Therefore, ULP protocol implications may also exist. | ||||
| 3.2.3. Platform Semantics | ||||
| Writethrough behavior on persistent regions and reasons for same. | ||||
| Consider recommending a local writethrough behavior on any persistent | ||||
| region, to support a nonblocking hurry-up to avoid future stalls on a | ||||
| subsequent cache flush, prior to a flush. Also, it would enhance | ||||
| storage integrity. Drive selection of this behavior from memory | ||||
| registration, so RNIC may "look up" the desired behavior in its TPT. | ||||
| A PCI extension to support Flush would allow the RNIC to provide | ||||
| persistence and/or global visibility directly and efficiently to | ||||
| memory, CPU, PCI Root, PM device, PCIe device, etc. This avoids | ||||
| CPU interaction and supports a strong data consistency model. It | ||||
| performs the equivalent of CLFLUSHOPT (region list) or some other | ||||
| flow tag. Alternatively, if the RNIC participates in the platform | ||||
| consistency domain on the memory bus or within the CPU complex, | ||||
| other possibilities exist. | ||||
| Also consider additional "integrity check" behavior (hash algorithm) | ||||
| specified per-region. If so, providing it as a registration | ||||
| parameter enables fine-grained control, and allows storing it in per- | ||||
| region RNIC state, making its processing optional and | ||||
| straightforward. | ||||
| A similar approach is applicable to providing a security key for | ||||
| encrypting/decrypting access on a per-region basis, without protocol | ||||
| exposure. [SDC2017 presentation] | ||||
| Any other per-region processing to be explored. | ||||
| 4. Ordering and Completions Table | ||||
| The table in this section specifies the ordering relationships for | ||||
| the operations in this specification and in those it extends, from | ||||
| the standpoint of the Requester. Note that in the table, Send | ||||
| Operation includes Send, Send with Invalidate, Send with Solicited | ||||
| Event, and Send with Solicited Event and Invalidate. Also note that | ||||
| Immediate Operation includes Immediate Data and Immediate Data with | ||||
| Solicited Event. | ||||
| Note: N/A in the table below means Not Applicable | ||||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| First | Second | Placement | Placement | Ordering | First | Second | Placement | Placement | Ordering | |||
| Operation | Operation | Guarantee at| Guarantee at| Guarantee at | Operation | Operation | Guarantee at| Guarantee at| Guarantee at | |||
| | | Remote Peer | Local Peer | Remote Peer | | | Remote Peer | Local Peer | Remote Peer | |||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| RDMA | TODO | No Placement| N/A | Completed in | RDMA Flush| TODO | No Placement| N/A | Completed in | |||
| Commit | | Guarantee | | Order | | | Guarantee | | Order | |||
| | | between Foo | | | | | between Foo | | | |||
| | | and Bar | | | | | and Bar | | | |||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| TODO | RDMA | No Placement| N/A | TODO | TODO | RDMA Flush | Placement | N/A | TODO | |||
| | Commit | Guarantee | | | | | Guarantee | | | |||
| | | between Foo | | | | | between Foo | | | |||
| | | and Bar | | | | | and Bar | | | |||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| TODO | TODO | Etc | Etc | Etc | TODO | TODO | Etc | Etc | Etc | |||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| ----------+------------+-------------+-------------+----------------- | ----------+------------+-------------+-------------+----------------- | |||
| Table 3: Ordering of Operations | Ordering of Operations | |||
| 5. Error Processing | 5. Error Processing | |||
| In addition to error processing described in section 7 of RFC 5040 | In addition to error processing described in section 7 of RFC5040 and | |||
| and section 8 of RFC 7306, the following rules apply for the new RDMA | section 8 of RFC7306, the following rules apply for the new RDMA | |||
| Messages defined in this specification. | Messages defined in this specification. | |||
| 5.1. Errors Detected at the Local Peer | 5.1. Errors Detected at the Local Peer | |||
| The Local Peer MUST send a Terminate Message for each of the | The Local Peer MUST send a Terminate Message for each of the | |||
| following cases: | following cases: | |||
| 1. For errors detected while creating a RDMA Commit Request or other | 1. For errors detected while creating an RDMA Flush, RDMA Verify or | |||
| reasons not directly associated with an incoming Message, the | Atomic Write Request, or other reasons not directly associated | |||
| Terminate Message and Error code are sent instead of the Message. | with an incoming Message, the Terminate Message and Error code | |||
| In this case, the Error Type and Error Code fields are included | are sent instead of the Message. In this case, the Error Type | |||
| in the Terminate Message, but the Terminated DDP Header and | and Error Code fields are included in the Terminate Message, but | |||
| Terminated RDMA Header fields are set to zero. | the Terminated DDP Header and Terminated RDMA Header fields are | |||
| set to zero. | ||||
| 2. For errors detected on an incoming RDMA Commit Request or RDMA | 2. For errors detected on an incoming RDMA Flush, RDMA Verify or | |||
| Commit Response, the Terminate Message is sent at the earliest | Atomic Write Request or Response, the Terminate Message is sent | |||
| possible opportunity, preferably in the next outgoing RDMA | at the earliest possible opportunity, preferably in the next | |||
| Message. In this case, the Error Type, Error Code, and | outgoing RDMA Message. In this case, the Error Type, Error Code, | |||
| Terminated DDP Header fields are included in the Terminate | and Terminated DDP Header fields are included in the Terminate | |||
| Message, but the Terminated RDMA Header field is set to zero. | Message, but the Terminated RDMA Header field is set to zero. | |||
| 3. For errors detected in the processing of the RDMA Commit itself, | 3. For errors detected in the processing of the RDMA Flush or RDMA | |||
| that is, the act of making the data durable, no Terminate Message | Verify itself, that is, the act of flushing or verifying the | |||
| is generated. Because the data is not lost, the connection MUST | data, the Terminate Message is generated as per the referenced | |||
| NOT terminate and the peer MUST inform the requester of the | specifications. Even though data is not lost, the upper layer | |||
| status, and allow the requester to perform further action, for | MUST be notified of the failure by informing the requester of the | |||
| instance, recovery. | status, terminating any queued operations, and allow the | |||
| requester to perform further action, for instance, recovery. | ||||
| 5.2. Errors Detected at the Remote Peer | 5.2. Errors Detected at the Remote Peer | |||
| On incoming RDMA Commit Requests, the following MUST be validated: | On incoming RDMA Flush and RDMA Verify Requests, the following MUST | |||
| be validated: | ||||
| o The DDP layer MUST validate all DDP Segment fields. | o The DDP layer MUST validate all DDP Segment fields. | |||
| The following additional validation MUST be performed: | The following additional validation MUST be performed: | |||
| o If the RDMA Commit cannot be satisfied, due to transient or | o If the RDMA Flush, RDMA Verify or Atomic Write operation cannot be | |||
| permanent errors detected in the processing by the Responder, a | satisfied, due to transient or permanent errors detected in the | |||
| status MUST be returned to the Requestor. Valid status values are | processing by the Responder, a Terminate message MUST be returned | |||
| to be specified. | to the Requestor. | |||
| 6. IANA Considerations | 6. IANA Considerations | |||
| This document requests that IANA assign the following new operation | This document requests that IANA assign the following new operation | |||
| codes in the "RDMAP Message Operation Codes" registry defined in | codes in the "RDMAP Message Operation Codes" registry defined in | |||
| section 3.4 of [RFC6580]. | section 3.4 of [RFC6580]. | |||
| 0xC RDMA Commit Request, this specification | 0xC RDMA Flush Request, this specification | |||
| 0xD RDMA Commit Response, this specification | 0xD RDMA Flush Response, this specification | |||
| Additionally, the name of the listed entry in "RDMAP DDP Untagged | 0xE RDMA Verify Request, this specification | |||
| Queue Numbers" as defined in section 10.2 of [RFC7306] is requested | ||||
| to be updated as follows: | ||||
| 0x00000003 Queue 3 Modify name to "Atomic Response and RDMA Commit | 0xF RDMA Verify Response, this specification | |||
| Response operations" and add reference to this specification | ||||
| 0x10 Atomic Write Request, this specification | ||||
| 0x11 Atomic Write Response, this specification | ||||
| Note to RFC Editor: this section may be edited and updated prior to | Note to RFC Editor: this section may be edited and updated prior to | |||
| publication as an RFC. | publication as an RFC. | |||
| 7. Security Considerations | 7. Security Considerations | |||
| This document specifies extensions to the RDMA Protocol specification | This document specifies extensions to the RDMA Protocol specification | |||
| in RFC 5040 and RDMA Protocol Extensions in RFC 7306, and as such the | in RFC5040 and RDMA Protocol Extensions in RFC7306, and as such the | |||
| Security Considerations discussed in Section 8 of RFC 5040 and | Security Considerations discussed in Section 8 of RFC5040 and | |||
| Section 9 of RFC 7306 apply. In particular, RDMA Commit Operations | Section 9 of RFC7306 apply. In particular, all operations use ULP | |||
| use ULP Buffer addresses for the Remote Peer Buffer addressing used | Buffer addresses for the Remote Peer Buffer addressing used in | |||
| in RFC 5040 as required by the security model described in [RDMAP | RFC5040 as required by the security model described in [RDMAP | |||
| Security [RFC5042]]. | Security [RFC5042]]. | |||
| If the "push mode" transfer model discussed in section 2 is | If the "push mode" transfer model discussed in section 2 is | |||
| implemented by upper layers, new security considerations will be | implemented by upper layers, new security considerations will be | |||
| potentially introduced in those protocols, particularly on the | potentially introduced in those protocols, particularly on the | |||
| server, or target, if the new memory regions are not carefully | server, or target, if the new memory regions are not carefully | |||
| protected. Therefore, for them to take full advantage of the | protected. Therefore, for them to take full advantage of the | |||
| extension defined in this document, additional security design is | extension defined in this document, additional security design is | |||
| required in the implementation of those upper layers. The facilities | required in the implementation of those upper layers. The facilities | |||
| of RFC5042 [RFC5042] can provide the basis for any such design. | of RFC5042 [RFC5042] can provide the basis for any such design. | |||
| In addition to protection, in "push mode" the server or target will | In addition to protection, in "push mode" the server or target will | |||
| expose memory resources to the peer for potentially extended periods, | expose memory resources to the peer for potentially extended periods, | |||
| and will allow the peer to perform remote durability requests which | and will allow the peer to perform remote requests which will | |||
| will necessarily consume shared resources, e.g. memory bandwidth, | necessarily consume shared resources, e.g. memory bandwidth, power, | |||
| power, and memory itself. It is recommended that the upper layers | and memory itself. It is recommended that the upper layers provide a | |||
| provide a means to gracefully adjust such resources, for example | means to gracefully adjust such resources, for example using upper | |||
| using upper layer callbacks, without resorting to revoking RDMA | layer callbacks, without resorting to revoking RDMA permissions, | |||
| permissions, which would summarily close connections. With the | which would summarily close connections. With the initiator | |||
| initiator applications relying on the protocol extension itself for | applications relying on the protocol extension itself for managing | |||
| managing their required durability, the lack of such an approach | their required persistence and/or global visibility, the lack of such | |||
| would lead to frequent recovery in low-resource situations, | an approach would lead to frequent recovery in low-resource | |||
| potentially opening a new threat to such applications. | situations, potentially opening a new threat to such applications. | |||
| 8. References | 8. To Be Added or Considered | |||
| 8.1. Normative References | This section will be deleted in a future document revision. | |||
| Complete the discussion in section 3.2 and its subsections, Local | ||||
| Extension semantics. | ||||
| Complete the Ordering table in section 4. Carefully include | ||||
| discussion of the order of "start of execution" as well as | ||||
| completion, which are somewhat more involved than prior RDMA | ||||
| operation ordering. | ||||
| RDMA Flush "selectivity", to provide default flush semantics with | ||||
| broader scope than region-based. If specified, a flag to request | ||||
| that all prior write operations on the issuing Queue Pair be flushed | ||||
| with the requested disposition(s). This flag may simplify upper | ||||
| layer processing, and would allow regions larger than 4GB-1 byte to | ||||
| be flushed in a single operation. The STag, Offset and Length will | ||||
| be ignored in this case. It is to-be-determined how to extend the | ||||
| RDMA security model to protect other regions associated with this | ||||
| Queue Pair from unintentional or unauthorized flush. | ||||
| 9. Acknowledgements | ||||
| The authors wish to thank Jim Pinkerton, who contributed to an | ||||
| earlier version of the specification, and Brian Hausauer and Kobby | ||||
| Carmona, who have provided significant review and valuable comments. | ||||
| 10. References | ||||
| 10.1. Normative References | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, | Requirement Levels", BCP 14, RFC 2119, | |||
| DOI 10.17487/RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
| <http://www.rfc-editor.org/info/rfc2119>. | <https://www.rfc-editor.org/info/rfc2119>. | |||
| [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | |||
| Garcia, "A Remote Direct Memory Access Protocol | Garcia, "A Remote Direct Memory Access Protocol | |||
| Specification", RFC 5040, DOI 10.17487/RFC5040, October | Specification", RFC 5040, DOI 10.17487/RFC5040, October | |||
| 2007, <http://www.rfc-editor.org/info/rfc5040>. | 2007, <https://www.rfc-editor.org/info/rfc5040>. | |||
| [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | |||
| Data Placement over Reliable Transports", RFC 5041, | Data Placement over Reliable Transports", RFC 5041, | |||
| DOI 10.17487/RFC5041, October 2007, | DOI 10.17487/RFC5041, October 2007, | |||
| <http://www.rfc-editor.org/info/rfc5041>. | <https://www.rfc-editor.org/info/rfc5041>. | |||
| [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement | [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement | |||
| Protocol (DDP) / Remote Direct Memory Access Protocol | Protocol (DDP) / Remote Direct Memory Access Protocol | |||
| (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October | (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October | |||
| 2007, <http://www.rfc-editor.org/info/rfc5042>. | 2007, <https://www.rfc-editor.org/info/rfc5042>. | |||
| [RFC6580] Ko, M. and D. Black, "IANA Registries for the Remote | [RFC6580] Ko, M. and D. Black, "IANA Registries for the Remote | |||
| Direct Data Placement (RDDP) Protocols", RFC 6580, | Direct Data Placement (RDDP) Protocols", RFC 6580, | |||
| DOI 10.17487/RFC6580, April 2012, | DOI 10.17487/RFC6580, April 2012, | |||
| <http://www.rfc-editor.org/info/rfc6580>. | <https://www.rfc-editor.org/info/rfc6580>. | |||
| [RFC7306] Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R. | [RFC7306] Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R. | |||
| Sharp, "Remote Direct Memory Access (RDMA) Protocol | Sharp, "Remote Direct Memory Access (RDMA) Protocol | |||
| Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014, | Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014, | |||
| <http://www.rfc-editor.org/info/rfc7306>. | <https://www.rfc-editor.org/info/rfc7306>. | |||
| 8.2. Informative References | 10.2. Informative References | |||
| [RFC5045] Bestler, C., Ed. and L. Coene, "Applicability of Remote | [RFC5045] Bestler, C., Ed. and L. Coene, "Applicability of Remote | |||
| Direct Memory Access Protocol (RDMA) and Direct Data | Direct Memory Access Protocol (RDMA) and Direct Data | |||
| Placement (DDP)", RFC 5045, DOI 10.17487/RFC5045, October | Placement (DDP)", RFC 5045, DOI 10.17487/RFC5045, October | |||
| 2007, <http://www.rfc-editor.org/info/rfc5045>. | 2007, <https://www.rfc-editor.org/info/rfc5045>. | |||
| [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
| "Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
| Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | |||
| <http://www.rfc-editor.org/info/rfc5661>. | <https://www.rfc-editor.org/info/rfc5661>. | |||
| [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access | ||||
| Transport for Remote Procedure Call", RFC 5666, | ||||
| DOI 10.17487/RFC5666, January 2010, | ||||
| <http://www.rfc-editor.org/info/rfc5666>. | ||||
| [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) | ||||
| Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, | ||||
| January 2010, <http://www.rfc-editor.org/info/rfc5667>. | ||||
| [RFC7145] Ko, M. and A. Nezhinsky, "Internet Small Computer System | [RFC7145] Ko, M. and A. Nezhinsky, "Internet Small Computer System | |||
| Interface (iSCSI) Extensions for the Remote Direct Memory | Interface (iSCSI) Extensions for the Remote Direct Memory | |||
| Access (RDMA) Specification", RFC 7145, | Access (RDMA) Specification", RFC 7145, | |||
| DOI 10.17487/RFC7145, April 2014, | DOI 10.17487/RFC7145, April 2014, | |||
| <http://www.rfc-editor.org/info/rfc7145>. | <https://www.rfc-editor.org/info/rfc7145>. | |||
| [SCSI] American National Standards Institute, "SCSI Primary | [RFC8166] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct | |||
| Commands - 3 (SPC-3) (INCITS 408-2005)", May 2005. | Memory Access Transport for Remote Procedure Call Version | |||
| 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, | ||||
| <https://www.rfc-editor.org/info/rfc8166>. | ||||
| [RFC8267] Lever, C., "Network File System (NFS) Upper-Layer Binding | ||||
| to RPC-over-RDMA Version 1", RFC 8267, | ||||
| DOI 10.17487/RFC8267, October 2017, | ||||
| <https://www.rfc-editor.org/info/rfc8267>. | ||||
| [SCSI] ANSI, "SCSI Primary Commands - 3 (SPC-3) (INCITS | ||||
| 408-2005)", May 2005. | ||||
| [SMB3] Microsoft Corporation, "Server Message Block (SMB) | [SMB3] Microsoft Corporation, "Server Message Block (SMB) | |||
| Protocol Versions 2 and 3 (MS-SMB2)", October 2015. | Protocol Versions 2 and 3 (MS-SMB2)", March 2020. | |||
| https://docs.microsoft.com/en- | ||||
| us/openspecs/windows_protocols/ms-smb2/5606ad47-5ee0-437a- | ||||
| 817e-70c366052962 | ||||
| [SMBDirect] | [SMBDirect] | |||
| Microsoft Corporation, "SMB2 Remote Direct Memory Access | Microsoft Corporation, "SMB2 Remote Direct Memory Access | |||
| (RDMA) Transport Protocol (MS-SMBD)", October 2015. | (RDMA) Transport Protocol (MS-SMBD)", September 2018. | |||
| [SNIANVM] Storage Networking Industry Association NVM TWG, "SNIA NVM | https://docs.microsoft.com/en- | |||
| Programming Model v1.0", 2014. | us/openspecs/windows_protocols/ms-smbd/1ca5f4ae-e5b1-493d- | |||
| b87d-f4464325e6e3 | ||||
| 8.3. URIs | [SNIANVMP] | |||
| SNIA NVM Programming TWG, "SNIA NVM Programming Model | ||||
| v1.2", June 2017. | ||||
| https://www.snia.org/sites/default/files/technical_work/ | ||||
| final/NVMProgrammingModel_v1.2.pdf | ||||
| 10.3. URIs | ||||
| [1] http://www.nvmexpress.org | [1] http://www.nvmexpress.org | |||
| [2] http://www.jedec.org | [2] http://www.jedec.org | |||
| Appendix A. DDP Segment Formats for RDMA Extensions | Appendix A. DDP Segment Formats for RDMA Extensions | |||
| This appendix is for information only and is NOT part of the | This appendix is for information only and is NOT part of the | |||
| standard. It simply depicts the DDP Segment format for each of the | standard. It simply depicts the DDP Segment format for each of the | |||
| RDMA Messages defined in this specification. | RDMA Messages defined in this specification. | |||
| A.1. DDP Segment for RDMA Commit Request | A.1. DDP Segment for RDMA Flush Request | |||
| Figure 3 depicts an RDMA Commit Request, DDP Segment: | 0 1 2 3 | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP Control | RDMA Control | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved (Not Used) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Flush Request) Queue Number (1) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Flush Request) Message Sequence Number | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink STag | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Length | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Tagged Offset | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Disposition Flags +G+P| | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| RDMA Flush Request, DDP Segment | ||||
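For implementers, the Flush Request fields that follow the DDP Queue Number and Message Sequence Number can be modeled as a packed structure. This is an informative sketch in the spirit of the appendix: the field and macro names are assumptions, and only the field widths and ordering are taken from the figure above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative packed layout of the RDMA Flush Request payload after
 * the DDP untagged-buffer headers, per the preceding diagram.  Names
 * are assumptions; widths and order come from the figure. */
#pragma pack(push, 1)
struct flush_req_wire {
    uint32_t sink_stag;   /* Data Sink STag */
    uint32_t sink_length; /* Data Sink Length */
    uint64_t sink_offset; /* Data Sink Tagged Offset (64 bits) */
    uint32_t disposition; /* Disposition Flags, low-order bits G and P */
};
#pragma pack(pop)

/* Low-order disposition bits, assuming the rightmost bits of the
 * "+G+P" field are the least significant, as in RFC-style diagrams. */
#define FLUSH_DISP_P 0x1u /* flush to persistence */
#define FLUSH_DISP_G 0x2u /* flush to global visibility */
```

A conformance check such as a static assertion on the 20-byte payload size can catch accidental compiler padding before any interoperability testing.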
| A.2. DDP Segment for RDMA Flush Response | ||||
| 0 1 2 3 | 0 1 2 3 | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP Control | RDMA Control | | | DDP Control | RDMA Control | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Reserved (Not Used) | | | Reserved (Not Used) | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP (RDMA Commit Request) Queue Number | | | DDP (Flush Response) Queue Number (3) | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP (RDMA Commit Request) Message Sequence Number | | | DDP (Flush Response) Message Sequence Number | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Request Identifier | | ||||
| RDMA Flush Response, DDP Segment | ||||
| A.3. DDP Segment for RDMA Verify Request | ||||
| 0 1 2 3 | ||||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP Control | RDMA Control | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved (Not Used) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Verify Request) Queue Number (1) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Verify Request) Message Sequence Number | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink STag | | | Data Sink STag | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink Length | | | Data Sink Length | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Data Sink Tagged Offset | | | Data Sink Tagged Offset | | |||
| + + | + + | |||
| | | | | | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | ... | | | Hash Value (optional, variable) | | |||
| | ... | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Figure 3 | RDMA Verify Request, DDP Segment | |||
| A.2. DDP Segment for RDMA Commit Response | A.4. DDP Segment for RDMA Verify Response | |||
| 0 1 2 3 | ||||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP Control | RDMA Control | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved (Not Used) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Verify Response) Queue Number (3) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Verify Response) Message Sequence Number | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Hash Value (variable) | | ||||
| | ... | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Figure 4 depicts an RDMA Commit Response, DDP Segment: | RDMA Verify Response, DDP Segment | |||
| A.5. DDP Segment for Atomic Write Request | ||||
| 0 1 2 3 | 0 1 2 3 | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP Control | RDMA Control | | | DDP Control | RDMA Control | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Reserved (Not Used) | | | Reserved (Not Used) | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP (RDMA Commit Response) Queue Number | | | DDP (Atomic Write Request) Queue Number (1) | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | DDP (RDMA Commit Response) Message Sequence Number | | | DDP (Atomic Write Request) Message Sequence Number | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Original Request Identifier | | | Data Sink STag | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Status | | | Data Sink Length (value=8) | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data Sink Tagged Offset | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Data (64 bits) | | ||||
| + + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Figure 4 | Atomic Write Request, DDP Segment | |||
| A.6. DDP Segment for Atomic Write Response | ||||
| 0 1 2 3 | ||||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP Control | RDMA Control | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved (Not Used) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Atomic Write Response) Queue Number (3) | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | DDP (Atomic Write Response) Message Sequence Number | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Atomic Write Response, DDP Segment | ||||
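The Atomic Write Request above carries a fixed 64-bit payload (Data Sink Length is always 8). A hedged responder-side sketch follows; the structure and function names are illustrative assumptions, and the natural-alignment requirement on the Tagged Offset is also an assumption (it is typical for single-quadword atomics but is not stated in this appendix).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative packed layout of the Atomic Write Request payload after
 * the DDP Queue Number / Message Sequence Number, per the diagram
 * above.  Field names are assumptions, not taken from the draft. */
#pragma pack(push, 1)
struct atomic_write_req {
    uint32_t sink_stag;   /* Data Sink STag */
    uint32_t sink_length; /* Data Sink Length; always 8 */
    uint64_t sink_offset; /* Data Sink Tagged Offset */
    uint64_t data;        /* the 64 bits to be written atomically */
};
#pragma pack(pop)

/* Responder-side sanity check: the length must be exactly 8 and the
 * target offset naturally aligned, so the single quadword can be
 * placed atomically.  (The alignment rule is an assumption here.) */
static bool atomic_write_valid(const struct atomic_write_req *r)
{
    return r->sink_length == 8 && (r->sink_offset % 8) == 0;
}
```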
| Authors' Addresses | Authors' Addresses | |||
| Tom Talpey | Tom Talpey | |||
| Microsoft | Microsoft | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| US | US | |||
| Email: ttalpey@microsoft.com | Email: ttalpey@microsoft.com | |||
| Jim Pinkerton | Tony Hurson | |||
| Microsoft | Intel | |||
| One Microsoft Way | Austin, TX | |||
| Redmond, WA 98052 | ||||
| US | US | |||
| Email: jpink@microsoft.com | Email: tony.hurson@intel.com | |||
| Gaurav Agarwal | ||||
| Marvell | ||||
| CA | ||||
| US | ||||
| Email: gagarwal@marvell.com | ||||
| Tom Reu | ||||
| Chelsio | ||||
| NJ | ||||
| US | ||||
| Email: tomreu@chelsio.com | ||||