NFSv4 (provisionally)                                          T. Talpey
Internet-Draft                                                 Microsoft
Updates: 5040 7306 (if approved)                               T. Hurson
Intended status: Standards Track                                   Intel
Expires: September 10, 2020                                   G. Agarwal
                                                                 Marvell
                                                                  T. Reu
                                                                 Chelsio
                                                           March 9, 2020

             RDMA Extensions for Enhanced Memory Placement
                      draft-talpey-rdma-commit-01

Abstract

This document specifies extensions to RDMA (Remote Direct Memory Access) protocols to provide capabilities in support of enhanced remotely-directed data placement on persistent memory-addressable devices. The extensions include new operations supporting remote commitment to persistence of remotely-managed buffers, which can provide enhanced guarantees and improve performance for low-latency storage applications. In addition to, and in support of these, extensions to local behaviors are described, which may be used to guide implementation, and to ease adoption. This document updates RFC5040 (Remote Direct Memory Access Protocol (RDMAP)) and updates RFC7306 (RDMA Protocol Extensions).

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at https://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on September 10, 2020. 51 Copyright Notice 53 Copyright (c) 2020 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (https://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 4 67 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 68 2.1. Requirements for RDMA Flush . . . . . . . . . . . . . . . 10 69 2.1.1. Non-Requirements . . . . . . . . . . . . . . . . . . 12 70 2.2. Requirements for Atomic Write . . . . . . . . . . . . . . 14 71 2.3. Requirements for RDMA Verify . . . . . . . . . . . . . . 15 72 2.4. Local Semantics . . . . . . . . . . . . . . . . . . . . . 16 73 3. RDMA Protocol Extensions . . . . . . . . . . . . . . . . . . 17 74 3.1. RDMAP Extensions . . . . . . . . . . . . . . . . . . . . 17 75 3.1.1. RDMA Flush . . . . . . . . . . . . . . . . . . . . . 20 76 3.1.2. RDMA Verify . . . . . . . . . . . . . . . . . . . . . 23 77 3.1.3. Atomic Write . . . . . . . . . . . . . . . . . . . . 25 78 3.1.4. Discovery of RDMAP Extensions . . . . . . . . . . . . 27 79 3.2. Local Extensions . . . . . . . . . . . . . . . . . . . . 28 80 3.2.1. Registration Semantics . . . . . . . . . . . . . . . 28 81 3.2.2. Completion Semantics . . . . . . . . . . . . . . . . 28 82 3.2.3. Platform Semantics . . . . . . . . . . . . . . . . . 29 83 4. Ordering and Completions Table . . . . . . . . . . . . . . . 29 84 5. Error Processing . . . . . . . . . . . . . . . . . . . . . . 30 85 5.1. Errors Detected at the Local Peer . . . . . . . . . . . . 30 86 5.2. Errors Detected at the Remote Peer . . . . . . . . . . . 31 87 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 88 7. Security Considerations . . . . . . . . . . . . . . . . . . . 31 89 8. To Be Added or Considered . . . . . . . . . . . . . . . . . . 32 90 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 33 91 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 92 10.1. Normative References . . . . . . . . . . . . . . . . . . 33 93 10.2. Informative References . . . . . . . . . . . . . . . . . 33 94 10.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 35 95 Appendix A. DDP Segment Formats for RDMA Extensions . . . . . . 35 96 A.1. DDP Segment for RDMA Flush Request . . . . . . . . . . . 35 97 A.2. DDP Segment for RDMA Flush Response . . . . . . . . . . . 35 98 A.3. DDP Segment for RDMA Verify Request . . . . . . . . . . . 36 99 A.4. 
DDP Segment for RDMA Verify Response . . . . . . . . . . 36 100 A.5. DDP Segment for Atomic Write Request . . . . . . . . . . 37 101 A.6. DDP Segment for Atomic Write Response . . . . . . . . . . 38 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 38 104 1. Introduction 106 The RDMA Protocol (RDMAP) [RFC5040] and RDMA Protocol Extensions 107 (RDMAPEXT) [RFC7306] provide capabilities for secure, zero copy data 108 communications that preserve memory protection semantics, enabling 109 more efficient network protocol implementations. The RDMA Protocol 110 is part of the iWARP family of specifications which also include the 111 Direct Data Placement Protocol (DDP) [RFC5041], and others as 112 described in the relevant documents. For additional background on 113 RDMA Protocol applicability, see "Applicability of Remote Direct 114 Memory Access Protocol (RDMA) and Direct Data Placement Protocol 115 (DDP)" RFC5045 [RFC5045]. 117 RDMA protocols are enjoying good success in improving the performance 118 of remote storage access, and have been well-suited to semantics and 119 latencies of existing storage solutions. However, new storage 120 solutions are emerging with much lower latencies, driving new 121 workloads and new performance requirements. Also, storage 122 programming paradigms SNIANVMP [SNIANVMP] are driving new 123 requirements of the remote storage layers, in addition to driving 124 down latency tolerances. Overcoming these latencies, and providing 125 the means to achieve persistence and/or visibility without invoking 126 upper layers and remote CPUs for each such request, are the 127 motivators for the extensions in this document. 129 This document specifies the following extensions to the RDMA Protocol 130 (RDMAP) and its local memory ecosystem: 132 o Flush - support for RDMA requests and responses with enhanced 133 placement semantics. 135 o Atomic Write - support for writing certain data elements into 136 memory in an atomically visible fashion. 138 o Verify - support for validating the contents of remote memory, 139 through use of integrity signatures. 141 o Enhanced memory registration semantics in support of persistence 142 and visibility. 144 The extensions defined in this document do not require the RDMAP 145 version to change. 147 1.1. Glossary 149 This document is an extension of RFC 5040 and RFC7306, and key words 150 are additionally defined in the glossaries of the referenced 151 documents. 153 The following additional terms are used in this document as defined. 155 Flush: The submitting of previously written data from volatile 156 intermediate locations for subsequent placement, in a persistent 157 and/or globally visible fashion. 159 Invalidate: The removal of data from volatile intermediate 160 locations. 162 Commit: Obsolescent previous synonym for Flush. Term to be deleted. 164 Persistent: The property that data is present, readable and remains 165 stable after recovery from a power failure or other fatal error in 166 an upper layer or hardware. , , [SCSI]. 170 Globally Visible: The property of data being available for reading 171 consistently by all processing elements on a system. Global 172 visibility and persistence are not necessarily causally related; 173 either one may precede the other, or they may take effect 174 simultaneously, depending on the architecture of the platform. 176 2. Problem Statement 178 RDMA is widely deployed in support of storage and shared memory over 179 increasingly low-latency and high-bandwidth networks. 
The state of 180 the art today yields end-to-end network latencies on the order of one 181 to two microseconds for message transfer, and bandwidths exceeding 182 100 gigabit/s. These bandwidths are expected to increase over time, 183 with latencies decreasing as a direct result. 185 In storage, another trend is emerging - greatly reduced latency of 186 persistently storing data blocks. While best-of-class Hard Disk 187 Drives (HDDs) have delivered average latencies of several 188 milliseconds for many years, Solid State Disks (SSDs) have improved 189 this by one to two orders of magnitude. Technologies such as NVM 190 Express NVMe [1] yield even higher-performing results by eliminating 191 the traditional storage interconnect. The latest technologies 192 providing memory-based persistence, such as Nonvolatile Memory DIMM 193 NVDIMM [2], places storage-like semantics directly on the memory bus, 194 reducing latency to less than a microsecond and increasing bandwidth 195 to potentially many tens of gigabyte/s. [supporting data to be added] 197 RDMA protocols, in turn, are used for many storage protocols, 198 including NFS/RDMA RFC5661 [RFC5661] RFC8166 [RFC8166] RFC8267 199 [RFC8267], SMB Direct MS-SMB2 [SMB3] MS-SMBD [SMBDirect] and iSER 200 RFC7145 [RFC7145], to name just a few. These protocols allow storage 201 and computing peers to take full advantage of these highly performant 202 networks and storage technologies to achieve remarkable throughput, 203 while minimizing the CPU overhead needed to drive their workloads. 204 This leaves more computing resources available for the applications, 205 which in turn can scale to even greater levels. Within the context 206 of Cloud-based environments, and through scale-out approaches, this 207 can directly reduce the number of servers that need to be deployed, 208 making such attributes highly compelling. 210 However, limiting factors come into play when deploying ultra-low 211 latency storage in such environments: 213 o The latency of the fabric, and of the necessary RDMA message 214 exchanges to ensure reliable transfer is now higher than that of 215 the storage itself. 217 o The requirement that storage be resilient to failure requires that 218 multiple copies be committed in multiple locations across the 219 fabric, adding extra hops which increase the latency and computing 220 demand placed on implementing the resiliency. 222 o Processing is required at the receiver in order to ensure that the 223 storage data has reached a persistent state, and acknowledge the 224 transfer so that the sender can proceed. 226 o Typical latency optimizations, such as polling a receive memory 227 location for a key that determines when the data arrives, can 228 create both correctness and security issues because this approach 229 requires the memory remain open to writes and therefore the buffer 230 may not remain stable after the application determines that the IO 231 has completed. This is of particular concern in security 232 conscious environments. 234 The first issue is fundamental, and due to the nature of serial, 235 shared communication channels, presents challenges that are not 236 easily bypassed. Communication cannot exceed the speed of light, for 237 example, and serialization/deserialization plus packet processing 238 adds further delay. Therefore, an RDMA solution which offloads and 239 reduces the overhead of exchanges which encounter such latencies is 240 highly desirable. 
The second issue requires that outbound transfers be made as efficient as possible, so that replication of data can be done with minimal overhead and delay (latency). A reliable "push" RDMA transfer method is highly suited to this.

The third issue requires that the transfer be performed without requiring an upper-layer exchange. Within security constraints, RDMA transfers, arbitrated only by lower layers into well-defined and pre-advertised buffers, present an ideal solution.

The fourth issue requires significant CPU activity, consuming power and valuable resources, and may not be guaranteed by the RDMA protocols, which make no requirement of the order in which certain received data is placed or becomes visible; such guarantees are made only after signaling a completion to upper layers.

The RDMAP and DDP protocols, together, provide data transfer semantics with certain consistency guarantees to both the sender and receiver. Delivery of data transferred by these protocols is said to have been Placed in destination buffers upon Completion of specific operations. In general, these guarantees are limited to the visibility of the transferred data within the hardware domain of the receiver (data sink). Significantly, the guarantees do not necessarily extend to the actual storage of the data in memory cells, nor do they convey any guarantee that the data integrity is intact, nor that it remains present after a catastrophic failure. These guarantees may be provided by upper layers, such as the ones mentioned, after processing the Completions, and performing the necessary operations.

The NFSv4.1, SMB3 and iSER protocols are, respectively, file and block oriented, and have been used extensively for providing access to hard disk and solid state flash drive media. Such devices incur certain latencies in their operation, from the millisecond-order rotational and seek delays of rotating disk hardware, or the 100-microsecond-order erase/write and translation layers of solid state flash. These file and block protocols have benefited from the increased bandwidth, lower latency, and markedly lower CPU overhead of RDMA to provide excellent performance for such media, approximately 30-50 microseconds for 4KB writes in leading implementations.

These protocols employ a "pull" model for write: the client, or initiator, sends an upper layer write request which contains an RDMA reference to the data to be written. The upper layer protocols encode this as one or more memory regions. The server, or target, then prepares the request for local write execution, and "pulls" the data with an RDMA Read. After processing the write, a response is returned. There are therefore two or more roundtrips on the RDMA network in processing the request. This is desirable for several reasons, as described in the relevant specifications, but it incurs latency. However, since, as mentioned, the network latency has been so much lower than that of the storage processing, this has been a sound approach.

Today, a new class of Storage Class Memory is emerging, in the form of Non-Volatile DIMM and NVM Express devices, among others. These devices are characterized by further reduced latencies, in the 10-microsecond-order range for NVMe, and sub-microsecond for NVDIMM.

The 30-50 microsecond write latencies of the above file and block protocols are therefore from one to two orders of magnitude larger than the storage media! The client/server processing model of traditional storage protocols is no longer amortizable at an acceptable level into the overall latency of storage access, due to the need for request/response communication, CPU processing by both the server and client (or target and initiator), and interrupts to signal such requests.

Another important property of certain such devices is the requirement for explicitly requesting that the data written to them be made persistent. Because persistence requires that data be committed to memory cells, it is a relatively expensive operation in time (and power), and in order to maintain the highest device throughput and most efficient operation, the device "commit" operation is explicit. When the data is written by an application on the local platform, this responsibility naturally falls to that application (and the CPU on which it runs). However, when data is written by current RDMA protocols, no such semantic is provided. As a result, upper layer stacks, and the target CPU, must be invoked to perform it, adding overhead and latency that is now highly undesirable.

When such devices are deployed as the remote server, or target, storage, and when such a persistence can be requested and guaranteed remotely, a new transfer model can be considered. Instead of relying on the server, or target, to perform requested processing and to reply after the data is persistently stored, it becomes desirable for the client, or initiator, to perform these operations itself. By altering the transfer models to support a "push mode", that is, by allowing the requestor to push data with RDMA Write and subsequently make it persistent, a full round trip can be eliminated from the operation. Additionally, the signaling and processing overheads at the remote peer (server or target) can be eliminated. This becomes an extremely compelling latency advantage.

In DDP (RFC5041), data is considered "placed" when it is submitted by the RNIC to the system. This operation is commonly an i/o bus write, e.g. via PCI. The submission is ordered, but there is no confirmation or necessary guarantee that the data has yet reached its destination, nor become visible to other devices in the system. The data will eventually do so, but possibly at a later time. The act of "delivery", on the other hand, offers a stronger semantic, guaranteeing not only that prior operations have been executed, but also that any data is in a consistent and visible state. Generally, however, such "delivery" requires raising a completion event, necessarily involving the host CPU. This is a relatively expensive, and latency-bound, operation. Some systems perform "DMA snooping" to provide a somewhat higher guarantee of visibility after delivery and without CPU intervention, but others do not. The RDMA requirements remain the same in either case; therefore, upper layers may make no broad assumption. Such platform behaviors, in any case, do not address persistence.

The extensions in this document primarily address a new "flush to persistence" RDMA operation.
This operation, when invoked by a 356 connected remote RDMA peer, can be used to request that previously- 357 written data be moved into the persistent storage domain. This may 358 be a simple flush to a memory cell, or it may require movement across 359 one or more busses within the target platform, followed by an 360 explicit persistence operation. Such matters are beyond the scope of 361 this specification, which provides only the mechanism to request the 362 operation, and to signal its successful completion. 364 In a similar vein, many applications desire to achieve visibility of 365 remotely-provided data, and to do so with minimum latency. One 366 example of such applications is "network shared memory", where 367 publish-subscribe access to network-accessible buffers is shared by 368 multiple peers, possibly from applications on the platform hosting 369 the buffers, and others via network connection. There may therefore 370 be multiple local devices accessing the buffer - for example, CPUs, 371 and other RNICs. The topology of the hosting platform may be 372 complex, with multiple i/o, memory, and interconnect busses, 373 requiring multiple intervening steps to process arriving data. 375 To address this, the extension additionally provides a "flush to 376 global visibility", which requires the RNIC to perform platform- 377 dependent processing in order to guarantee that the contents of a 378 specific range are visible for all devices that access them. On 379 certain highly-consistent platforms, this may be provided natively. 380 On others, it may require platform-specific processing, to flush data 381 from volatile caches, invalidate stale cached data from others, and 382 to empty queued pending operations. Ideally, but not universally, 383 this processing will take place without CPU intervention. With a 384 global visibility guarantee, network shared memory and similar 385 applications will be assured of broader compatibility and lower 386 latency across all hardware platforms. 388 Subsequently, many applications will seek to obtain a guarantee that 389 the integrity of the data has been preserved after it has been 390 flushed to a persistent or globally visible state. This may be 391 enforced at any time. Unlike traditional block-based storage, the 392 data provided by RDMA is neither structured nor segmented, and is 393 therefore not self-describing with respect to integrity. Only the 394 originator of the data, or an upper layer, is in possession of that. 395 Applications requiring such guarantees may include filesystem or 396 database logwriters, replication agents, etc. 398 To provide an additional integrity guarantee, a new operation is 399 provided by the extension, which will calculate, and optionally 400 compare an integrity value for an arbitrary region. The operation is 401 ordered with respect to preceding and subsequent operations, allowing 402 for a request pipeline without "bubbles" - roundtrip delays to 403 ascertain success or failure. 405 Finally, once data has been transmitted and directly placed by RDMA, 406 flushed to its final state, and its integrity verified, applications 407 will seek to commit the result with a transaction semantic. The 408 previous application examples apply here, logwriters and replication 409 are key, and both are highly latency- and integrity-sensitive. They 410 desire a pipelined transaction marker which is placed atomically to 411 indicate the validity of the preceding operations. 
They may require 412 that the data be in a persistent and/or globally visibile state, 413 before placing this marker. 415 Together the above discussion argues for a new "one sided" transfer 416 model supporting extended remote placement guarantees, provided by 417 the RDMA transport, and used directly by upper layers on a data 418 source, to control persistent storage of data on a remote data sink 419 without requiring its remote interaction. Existing, or new, upper 420 layers can use such a model in several ways, and evolutionary steps 421 to support persistence guarantees without required protocol changes 422 are explored in the remainder of this document. 424 Note that is intended that the requirements and concept of these 425 extensions can be applied to any similar RDMA protocol, and that a 426 compatible model can be applied broadly. 428 2.1. Requirements for RDMA Flush 430 The fundamental new requirement for extending RDMA protocols is to 431 define the property of _persistence_. This new property is to be 432 expressed by new operations to extend Placement as defined in 433 existing RDMA protocols. The RFC5040 protocols specify that 434 Placement means that the data is visible consistently within a 435 platform-defined domain on which the buffer resides, and to remote 436 peers across the network via RDMA to an adapter within the domain. 437 In modern hardware designs, this buffer can reside in memory, or also 438 in cache, if that cache is part of the hardware consistency domain. 439 Many designs use such caches extensively to improve performance of 440 local access. 442 Persistence, by contrast, requires that the buffer contents be 443 preserved across catastrophic failures. While it is possible for 444 caches to be persistent, they are typically not, or they provide the 445 persistence guarantee for a limited period of time, for example, 446 while backup power is applied. Efficient designs, in fact, lead most 447 implementations to simply make them volatile. In these designs, an 448 explicit flush operation (writing dirty data from caches), often 449 followed by an explicit commit (ensuring the data has reached its 450 destination and is in a persistent state), is required to provide 451 this guarantee. In some platforms, these operations may be combined. 453 For the RDMA protocol to remotely provide such guarantees, an 454 extension is required. Note that this does not imply support for 455 persistence or global visibility by the RDMA hardware implementation 456 itself; it is entirely acceptable for the RDMA implementation to 457 request these from another subsystem, for example, by requesting that 458 the CPU perform the flush and commit, or that the destination memory 459 device do so. But, in an ideal implementation, the RDMA 460 implementation will be able to act as a master and provide these 461 services without further work requests local to the data sink. Note, 462 it is possible that different buffers will require different 463 processing, for example one buffer may reside in persistent memory, 464 while another may place its blocks in a storage device. Many such 465 memory-addressable designs are entering the market, from NVDIMM to 466 NVMe and even to SSDs and hard drives. 468 Therefore, additionally any local memory registration primitive will 469 be enhanced to specify new optional placement attributes, along with 470 any local information required to achieve them. 
These attributes do 471 not explicitly traverse the network - like existing local memory 472 registration, the region is fully described by a { STag, Tagged 473 offset, length } descriptor, and such aspects of the local physical 474 address, memory type, protection (remote read, remote write, 475 protection key), etc are not instantiated in the protocol. Indeed, 476 each local RDMA implementation maintains these, and strictly performs 477 processing based on them, and they are not exposed to the peer. Such 478 considerations are discussed in the security model [RDMAP Security 479 [RFC5042]]. 481 Note, additionally, that by describing such attributes only through 482 the presence of an optional property of each region, it is possible 483 to describe regions referring to the same physical segment as a 484 combination of attributes, in order to enable efficient processing. 485 Processing of writes to regions marked as persistent, globally 486 visible, or neither ("ordinary" memory) may be optimized 487 appropriately. For example, such memory can be registered multiple 488 times, yielding multiple different Steering Tags which nonetheless 489 merge data in the underlying memory. This can be used by upper 490 layers to enable bulk-type processing with low overhead, by assigning 491 specific attributes through use of the Steering Tag. 493 When the underlying region is marked as persistent, that the 494 placement of data into persistence is guaranteed only after a 495 successful RDMA Flush directed to the Steering Tag which holds the 496 persistent attribute (i.e. any volatile buffering between the network 497 and the underlying storage has been flushed, and the appropriate 498 platform- and device-specific steps have been performed). 500 To enable the maximum generality, the RDMA Flush operation is 501 specified to act on a set of bytes in a region, specified by a 502 standard RDMA { STag, Tagged offset, length } descriptor. It is 503 required that each byte of the specified segment be in the requested 504 state before the response to the Flush is generated. However, 505 depending on the implementation, other bytes in the region, or in 506 other regions, may be acted upon as part of processing any RDMA 507 Flush. In fact, any data in any buffer destined for persistent 508 storage, may become persistent at any time, even if not requested 509 explicitly. For example, the host system may flush cache entries due 510 to cache pressure, or as part of platform housekeeping activities. 511 Or, a simple and stateless approach to flushing a specific range 512 might be for all data be flushed and made persistent, system-wide. A 513 possibly more efficient implementation might track previously written 514 bytes, or blocks with "dirty" bytes, and flush only those to 515 persistence. Either result provides the required guarantee. 517 The RDMA Flush operation provides a response but does not return a 518 status, or can result in an RDMA Terminate event upon failure. A 519 region permission check is performed first, and may fail prior to any 520 attempt to process data. The RDMA Flush operation may fail to make 521 the data persistent, perhaps due to a hardware failure, or a change 522 in device capability (device read-only, device wear, etc). The 523 device itself may support an integrity check, similar to modern error 524 checking and corection (ECC) memory or media error detection on hard 525 drive surfaces, which may signal failure. 
Or, the request may exceed 526 device limits in size or even transient attribute such as temporary 527 media failure. The behavior of the device itself is beyond the scope 528 of this specification. 530 Because the RDMA Flush involves processing on the local platform and 531 the actual storage device, in addition to being ordered with certain 532 other RDMA operations, it is expected to take a certain time to be 533 performed. For this reason, the operation is required to be defined 534 as a "queued" operation on the RDMA device, and therefore also the 535 protocol. The RDMA protocol supports RDMA Read (RFC5040) and Atomic 536 (RFC7306) in such a fashion. The iWARP family defines a "queue 537 number" with queue-specific processing that is naturally suited for 538 this. Queuing provides a convenient means for supporting ordering 539 among other operations, and for flow control. Flow control for RDMA 540 Reads and Atomics on any given Queue Pair share incoming and outgoing 541 crediting depths ("IRD/ORD"); operations in this specification share 542 these values and do not define their own separate values. 544 2.1.1. Non-Requirements 546 The extension does not include a "RDMA Write to persistence", that 547 is, a modifier on the existing RDMA Write operation. While it might 548 seem a logical approach, several issues become apparent: 550 The existing RDMA Write operation is a tagged DDP request which is 551 unacknowledged at the DDP layer (RFC5042). Requiring it to 552 provide an indication of remote persistence would require it to 553 have an acknowledgement, which would be an undesirable extension 554 to the existing defined operation. 556 Such an operation would require flow control and therefore also 557 buffering on the responding peer. Existing RDMA Write semantics 558 are not flow controlled and as tagged transfers are by design 559 zero-copy i.e. unbuffered. Requiring these would introduce 560 potential pipeline stalls and increase implementation complexity 561 in a critical performance path. 563 The operation at the requesting peer would stall until the 564 acknowledgement of completion, significantly changing the semantic 565 of the existing operation, and complicating software by blocking 566 the send work queue, a significant new semantic for RDMA Write 567 work requests. As each operation would be self-describing with 568 respect to persistence, individual operations would therefore 569 block with differing semantics and complicate the situation even 570 further. 572 Even for the possibly-common case of flushing after every write, 573 it is highly undesirable to impose new optional semantics on an 574 existing operation, and therefore also on the upper layer protocol 575 implementation. And, the same result can be achieved by sending 576 the Flush merged in the same network packet, and since the RDMA 577 Write is unacknowledged while the RDMA Flush is always replied-to, 578 no additional overhead is imposed on the combined exchange. 580 For these reasons, it is deemed a non-requirement to extend the 581 existing RDMA Write operation. 583 Similarly, the extension does not consider the use of RDMA Read to 584 implement Flush. Historically, an RDMA Read has been used by 585 applications to ensure that previously written data has been 586 processed by the responding RNIC and has been submitted for ordered 587 Placement. 
However, this is inadequate for implementing the required RDMA Flush:

RDMA Read guarantees only that previously written data has been Placed; it provides no guarantee that the data has reached its destination buffer. In practice, an RNIC satisfies the RDMA Read requirement by simply issuing all PCIe Writes prior to issuing any PCIe Reads.

Such PCIe Reads must be issued by the RNIC after all such PCIe Writes; therefore, flushing a large region requires the RNIC and its attached bus to strictly order (and not cache) its writes, to "scoreboard" its writes, or to perform PCIe Reads to the entire region. The former approach is significantly complex and expensive, and the latter approach requires a large amount of PCIe and network read bandwidth, which is often unnecessary and expensive. The Reads, in any event, may be satisfied by platform-specific caches, never actually reaching the destination memory or other device.

The RDMA Read may begin execution at any time once the request is fully received, queued, and the prior RDMA Write requirement has been satisfied. This means that the RDMA Read operation may not be ordered with respect to other queued operations, such as Verify and Atomic Write, in addition to other RDMA Flush operations.

The RDMA Read has no specific error semantic to detect failure, and the response may be generated from any cached data in a consistently Placed state, regardless of where it may reside. For this reason, an RDMA Read may proceed without necessarily verifying that a previously ordered "flush" has succeeded or failed.

RDMA Read is heavily used by existing RDMA consumers, and the semantics are therefore implemented by the existing specification. For new applications to further expect an extended RDMA Read behavior would require an upper layer negotiation to determine if the data sink platform and RNIC appropriately implemented it, or to silently ignore the requirement, with the resulting failure to meet the requirement. An explicit extension, rather than depending on an overloaded side effect, ensures this will not occur.

Again, for these reasons, it is deemed a non-requirement to reuse or extend the existing RDMA Read operation.

Therefore, no changes to existing specified RDMA operations are proposed, and the protocol is unchanged if the extensions are not invoked.

2.2. Requirements for Atomic Write

The persistence of data is a key property by which applications implement transactional behavior. Transactional applications, such as databases and log-based filesystems, among many others, implement a "two phase commit" wherein a write is made durable, and *only upon success*, a validity indicator for the written data is set. Such semantics are challenging to provide over an RDMA fabric, as it exists today. The RDMA Write operation does not generate an acknowledgement at the RDMA layers. And, even when an RDMA Write is delivered, if the destination region is persistent, its data can be made persistent at any time, even before a Flush is requested. Out-of-order DDP processing, packet fragmentation, and other matters of scheduling transfers can introduce partial delivery and ordering differences. If a region is made persistent, or even globally visible, before such sequences are complete, significant application-layer inconsistencies can result. Therefore, applications may require fine-grained control over the placement of bytes. In current RDMA storage solutions, these semantics are implemented in upper layers, potentially with additional upper layer message signaling, and corresponding roundtrips and blocking behaviors.

In addition to controlling placement of bytes, the ordering of such placement can be important. By providing an ordered relationship among write and flush operations, a basic transaction scenario can be constructed, in a way which can function with equal semantics both locally and remotely. In a "log-based" scenario, for example, a relatively large segment (log "record") is placed, and made durable. Once persistence of the segment is assured, a second small segment (log "pointer") is written, and optionally also made persistent. The visibility of the second segment is used to imply the validity, and persistence, of the first. Any sequence of such log-operation pairs can thereby always have a single valid state. In case of failure, the resulting string (log) of transactions can therefore be recovered up to and including the final state.

Such semantics are typically a challenge to implement on general purpose hardware platforms, and a variety of application approaches have become common. Generally, they employ a small, well-aligned atom of storage for the second segment (the one used for validity). For example, an integer or pointer, aligned to natural memory address boundaries and CPU and other cache attributes, is stored using instructions which provide for atomic placement. Existing RDMA protocols, however, provide no such capability.

This document specifies an Atomic Write extension, which, appropriately constrained, can serve to provide similar semantics. A small (64 bit) payload, sent in a request which is ordered with respect to prior RDMA Flush operations on the same stream and targeted at a segment which is aligned such that it can be placed in a single hardware operation, can be used to satisfy the previously described scenario. Note that the visibility of this payload can also serve as an indication that all prior operations have succeeded, enabling a highly efficient application-visible memory semaphore.

2.3. Requirements for RDMA Verify

An additional matter remains with persistence - the integrity of the persistent data. Typically, storage stacks such as filesystems, media approaches such as SCSI T10 DIF, or filesystem integrity checks such as ZFS provide for block- or file-level protection of data at rest on storage devices. With RDMA protocols and physical memory, no such stacks are present. And, to add such support would introduce CPU processing and its inherent latency, counter to the goals of the remote storage approach. Requiring the peer to verify by remotely reading the data is prohibitive in both bandwidth and latency, and, without additional mechanisms to ensure the actual stored data is read (and not a copy in some volatile cache), cannot provide the necessary result.

To address this, an integrity operation is required. The integrity check is initiated by the upper layer or application, which optionally computes the expected hash of a given segment of arbitrary size and sends the hash via an RDMA Verify operation targeting the RDMA segment on the responder; the responder then calculates, and optionally verifies, the hash on the indicated data, bypassing any volatile copies remaining in caches. The responder responds with its computed hash value, or, optionally, terminates the connection with an appropriate error status upon mismatch. Specifying this optional termination behavior enables a transaction to be sent as WRITE-FLUSH-VERIFY-ATOMICWRITE, without any pipeline bubble. The result (carried by the subsequently ordered ATOMIC_WRITE) will not be committed as valid if any prior operation is terminated, and in this case, recovery can be initiated by the requestor immediately from the point of failure. On the other hand, an errorless "scrub" can be implemented without the optional termination behavior, by providing no value for the expected hash. The responder will return the computed hash of the contents.

The hash algorithm is not specified by the RDMA protocol; instead, it is left to the upper layer to select an appropriate choice based upon the strength, security, length, support by the RNIC, and other criteria. The size of the resulting hash is therefore also not specified by the RDMA protocol, but is dictated by the hash algorithm. The RDMA protocol becomes simply a transport for exchanging the values.

It should be noted that the design of the operation, passing the hash value from requestor to responder (instead of, for example, computing it at the responder and simply returning it), allows both peers to determine immediately whether the segment is considered valid, permitting local processing by both peers if that is not the case. For example, a known-bad segment can be immediately marked as such ("poisoned") by the responder platform, requiring recovery before permitting access. [cf ACPI, JEDEC, SNIA NVMP specifications]

2.4. Local Semantics

The new operations imply new access methods ("verbs") to local persistent memory which backs registrations. Registrations of memory which support persistence will follow all existing practices to ensure permission-based remote access. The RDMA protocols do not expose these permissions on the wire; instead, they are contained in local memory registration semantics. Existing attributes are Remote Read and Remote Write, which are granted individually through local registration on the machine. If an RDMA Read or RDMA Write operation arrives which targets a segment without the appropriate attribute, the connection is terminated.

In support of the new operations, new memory attributes are needed. For RDMA Flush, two "Flushable" attributes provide permission to invoke the operation on memory in the region for persistence and/or global visibility. When registering, along with the attribute, additional local information can be provided to the RDMA layer, such as the type of memory, the necessary processing to make its contents persistent, etc. If the attribute is requested for memory which cannot be persisted, it also allows the local provider to return an error to the upper layer, obviating the need for the upper layer to provide the region to the remote peer.
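As a non-normative illustration of the registration semantics described above (and of the "Verifiable" attribute described next), a local verbs-style interface might expose the new placement attributes as registration flags. The flag, type, and function names below are invented for this sketch; this document does not define a local API.

<CODE BEGINS>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration flags; the "Flushable" and "Verifiable"
 * permissions are purely local and never appear on the wire. */
enum rdma_access_flags {
    RDMA_ACCESS_REMOTE_READ      = 1 << 0,
    RDMA_ACCESS_REMOTE_WRITE     = 1 << 1,
    RDMA_ACCESS_FLUSH_PERSISTENT = 1 << 2,  /* RDMA Flush to persistence       */
    RDMA_ACCESS_FLUSH_GLOBAL_VIS = 1 << 3,  /* RDMA Flush to global visibility */
    RDMA_ACCESS_VERIFIABLE       = 1 << 4,  /* RDMA Verify permitted           */
};

/* What is actually advertised to the peer: only the region descriptor. */
struct rdma_region_desc {
    uint32_t stag;            /* Steering Tag             */
    uint64_t tagged_offset;   /* base Tagged Offset       */
    uint64_t length;          /* region length, in octets */
};

/* Hypothetical registration call.  A local provider would fail this
 * request (rather than a later remote operation) if the memory cannot
 * honor the requested attributes, e.g. a Flushable registration of
 * memory that cannot be persisted, or a Verifiable registration with
 * an unsupported hash algorithm. */
int rdma_register_region(void *addr, size_t len, unsigned access_flags,
                         const char *hash_algorithm /* for Verifiable */,
                         struct rdma_region_desc *out_desc);
<CODE ENDS>

Registering the same memory several times with different attribute combinations would yield distinct Steering Tags over the same underlying bytes, as described in Section 2.1.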
For RDMA Verify, the "Verifiable" attribute provides permission to compute the hash of memory in the region. Again, along with the attribute, additional information such as the hash algorithm for the region is provided to the local operation. If the attribute is requested for non-persistent memory, or if the hash algorithm is not available, the local provider can return an error to the upper layer. In the case of success, the upper layer can exchange the necessary information with the remote peer. Note that the algorithm is not identified by the on-the-wire operation as a result. Establishing the choice of hash for each region is done by the local consumer, and each hash result is merely transported by the RDMA protocol. Memory can be registered under multiple regions if differing hashes are required; for example, unique keys may be provisioned to implement secure hashing. Also note that, for certain "reversible" hash algorithms, this may allow peers to effectively read the memory; therefore, the local platform may require additional read permissions to be associated with the Verifiable permission when such algorithms are selected.

The Atomic Write operation requires no new attributes; however, it does require the "Remote Write" attribute on the target region, as is true for any remotely requested write. If the Atomic Write additionally targets a Flushable region, the RDMA Flush is performed separately. It is generally not possible to achieve persistence atomically with placement, even locally.

3. RDMA Protocol Extensions

The extensions in this document fall into two categories:

o Protocol extensions

o Local behavior extensions

These categories are described, and may be implemented, separately.

3.1. RDMAP Extensions

The wire-related aspects of the extensions are discussed in this section. This document defines the following new RDMA operations.

For reference, Figure 1 depicts the format of the DDP Control and RDMAP Control Fields, in the style and convention of RFC5040 and RFC7306:

The DDP Control Field consists of the T (Tagged), L (Last), Resrv, and DV (DDP protocol Version) fields, as defined in RFC5041. The RDMAP Control Field consists of the RV (RDMA Version), Rsv, and Opcode fields, as defined in RFC5040. No change or extension is made to these fields by this specification.

This specification adds values for the RDMA Opcode field to those specified in RFC5040. Table 1 defines the new values of the RDMA Opcode field that are used for the RDMA Messages defined in this specification.

As shown in Table 1, STag (Steering Tag) and Tagged Offset are valid only for certain RDMA Messages defined in this specification. Table 1 also shows the appropriate Queue Number for each Opcode.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|T|L| Resrv | DV| RV|R| Opcode  |
| | |       |   |   |s|         |
| | |       |   |   |v|         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Invalidate STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 DDP Control and RDMAP Control Fields

All RDMA Messages defined in this specification MUST carry the following values:

o The RDMA Version (RV) field: 01b.

o Opcode field: Set to one of the values in Table 1.

o Invalidate STag: Set to zero, or optionally to non-zero by the sender, processed by the receiver.

Note: N/A in the table below means Not Applicable

-------+------------+-------+------+-------+-----------+-------------
 RDMA  | Message    | Tagged| STag | Queue | Invalidate| Message
 Opcode| Type       | Flag  | and  | Number| STag      | Length
       |            |       | TO   |       |           | Communicated
       |            |       |      |       |           | between DDP
       |            |       |      |       |           | and RDMAP
-------+------------+-------+------+-------+-----------+-------------
01100b | RDMA Flush |   0   | N/A  |   1   |    opt    | Yes
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01101b | RDMA Flush |   0   | N/A  |   3   |    N/A    | No
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01110b | RDMA Verify|   0   | N/A  |   1   |    opt    | Yes
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
01111b | RDMA Verify|   0   | N/A  |   3   |    N/A    | Yes
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
10000b | Atomic     |   0   | N/A  |   1   |    opt    | Yes
       | Write      |       |      |       |           |
       | Request    |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------
10001b | Atomic     |   0   | N/A  |   3   |    N/A    | No
       | Write      |       |      |       |           |
       | Response   |       |      |       |           |
-------+------------+-------+------+-------+-----------+-------------

                  Additional RDMA Usage of DDP Fields

This extension adds RDMAP use of Queue Number 1 for Untagged Buffers for issuing RDMA Flush, RDMA Verify and Atomic Write Requests, and use of Queue Number 3 for Untagged Buffers for tracking the respective Responses.

All other DDP and RDMAP Control Fields are set as described in RFC5040 and RFC7306.

Table 3 defines which RDMA Headers are used on each new RDMA Message and which new RDMA Messages are allowed to carry ULP payload.

-------+------------+-------------------+-------------------------
 RDMA  | Message    | RDMA Header Used  | ULP Message allowed in
 Message| Type      |                   | the RDMA Message
 OpCode|            |                   |
-------+------------+-------------------+-------------------------
01100b | RDMA Flush | None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01101b | RDMA Flush | None              | No
       | Response   |                   |
-------+------------+-------------------+-------------------------
01110b | RDMA Verify| None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01111b | RDMA Verify| None              | No
       | Response   |                   |
-------+------------+-------------------+-------------------------
10000b | Atomic     | None              | No
       | Write      |                   |
       | Request    |                   |
-------+------------+-------------------+-------------------------
10001b | Atomic     | None              | No
       | Write      |                   |
       | Response   |                   |
-------+------------+-------------------+-------------------------

                       RDMA Message Definitions

3.1.1. RDMA Flush

The RDMA Flush operation requests that all bytes in a specified region are to be made persistent and/or globally visible, under control of specified flags. As specified in section 4, its operation is ordered after the successful completion of any previously requested RDMA Write or certain other operations. The response is generated after the region has reached its specified state. The implementation MUST fail the operation and send a terminate message if the RDMA Flush cannot be performed or has encountered an error.

The RDMA Flush operation MUST NOT be completed by the data sink until all data has attained the requested state. Achieving persistence may require programming and/or flushing of device buffers, while achieving global visibility may require flushing of cached buffers across the entire platform interconnect. In no event are persistence and global visibility achieved atomically; one may precede the other, and either may complete at any time. The Atomic Write operation may be used by an upper layer consumer to indicate that either or both dispositions are available after completion of the RDMA Flush, in addition to other approaches.

3.1.1.1. RDMA Flush Request Format

The RDMA Flush Request Message makes use of the DDP Untagged Buffer Model. RDMA Flush Request messages MUST use the same Queue Number as RDMA Read Requests and RDMA Extensions Atomic Operation Requests (QN=1). Reusing the same queue number for RDMA Flush Requests allows the operations to reuse the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read Requests.

The RDMA Flush Request Message carries a payload that describes the ULP Buffer address in the Responder's memory. The following figure depicts the Flush Request that is used for all RDMA Flush Request Messages:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Data Sink STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Data Sink Length                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Data Sink Tagged Offset                    |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Flush Disposition Flags                  |G|P|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Flush Request

Data Sink STag: 32 bits The Data Sink STag identifies the Remote Peer's Tagged Buffer targeted by the RDMA Flush Request. The Data Sink STag is associated with the RDMAP Stream through a mechanism that is outside the scope of the RDMAP specification.

Data Sink Length: The Data Sink Length is the length, in octets, of the bytes targeted by the RDMA Flush Request.

Data Sink Tagged Offset: 64 bits The Data Sink Tagged Offset specifies the starting offset, in octets, from the base of the Remote Peer's Tagged Buffer targeted by the RDMA Flush Request.

Flags: Flags specifying the disposition of the flushed data: 0x01 Flush to Persistence, 0x02 Flush to Global Visibility.

3.1.1.2. RDMA Flush Response

The RDMA Flush Response Message makes use of the DDP Untagged Buffer Model. RDMA Flush Response messages MUST use the same Queue Number as RDMA Extensions Atomic Operation Responses (QN=3). No payload is passed to the DDP layer on Queue Number 3.

Upon successful completion of RDMA Flush processing, an RDMA Flush Response MUST be generated.

If, during RDMA Flush processing on the Responder, an error is detected which would result in the requested region not achieving the requested disposition, the Responder MUST generate a Terminate message. The contents of the Terminate message are defined in Section 5.2.
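For illustration only, the Flush Request payload described in Section 3.1.1.1 can be rendered as the following C structure. The struct and macro names are invented for this sketch; all fields are carried on the wire in big-endian order, and a real implementation would apply the appropriate byte-order conversion and packing.

<CODE BEGINS>
#include <stdint.h>

/* Non-normative sketch of the RDMA Flush Request payload (Section
 * 3.1.1.1).  Field names follow the figure; all fields appear on the
 * wire in network (big-endian) byte order. */

#define RDMA_FLUSH_TO_PERSISTENCE        0x01  /* "P" flag */
#define RDMA_FLUSH_TO_GLOBAL_VISIBILITY  0x02  /* "G" flag */

struct rdma_flush_request {
    uint32_t data_sink_stag;           /* Remote Peer's Tagged Buffer STag  */
    uint32_t data_sink_length;         /* length, in octets, to be flushed  */
    uint64_t data_sink_tagged_offset;  /* starting offset within the buffer */
    uint32_t flush_disposition_flags;  /* OR of the flag values above       */
};

/* The RDMA Flush Response (Section 3.1.1.2) carries no payload; success
 * is conveyed by the Response itself, and failure by a Terminate
 * message. */
<CODE ENDS>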
1003 3.1.1.3. RDMA Flush Ordering and Atomicity 1005 Ordering and completion rules for RDMA Flush Request are similar to 1006 those for an Atomic operation as described in section 5 of RFC7306. 1007 The queue number field of the RDMA Flush Request for the DDP layer 1008 MUST be 1, and the RDMA Flush Response for the DDP layer MUST be 3. 1010 There are no ordering requirements for the placement of the data, nor 1011 are there any requirements for the order in which the data is made 1012 globally visible and/or persistent. Data received by prior 1013 operations (e.g. RDMA Write) MAY be submitted for placement at any 1014 time, and persistence or global visibility MAY occur before the flush 1015 is requested. After placement, data MAY become persistent or 1016 globally visible at any time, in the course of operation of the 1017 persistency management of the storage device, or by other actions 1018 resulting in persistence or global visibility. 1020 Any region segment specified by the RDMA Flush operation MUST be made 1021 persistent and/or globally visible before successful return of the 1022 operation. If RDMA Flush processing is successful on the Responder, 1023 meaning the requested bytes of the region are, or have been made 1024 persistent and/or globally visible, as requested, the RDMA Flush 1025 Response MUST be generated. 1027 There are no atomicity guarantees provided on the Responder's node by 1028 the RDMA Flush Operation with respect to any other operations. While 1029 the Completion of the RDMA Flush Operation ensures that the requested 1030 data was placed into, and flushed from the target Tagged Buffer, 1031 other operations might have also placed or fetched overlapping data. 1032 The upper layer is responsible for arbitrating any shared access. 1034 (Sidebar) It would be useful to make a statement about other RDMA 1035 Flush to the target buffer and RDMA Read from the target buffer on 1036 the same connection. Use of QN 1 for these operations provides 1037 ordering possibilities which imply that they will "work" (see #7 1038 below). NOTE: this does not, however, extend to RDMA Write, which is 1039 not queued nor sequenced and therefore does not employ a DDP QN. 1041 3.1.2. RDMA Verify 1043 The RDMA Verify operation requests that all bytes in a specified 1044 region are to be read from the underlying storage and that an 1045 integrity hash be calculated. As specified in section 4 its 1046 operation is ordered after the successful completion of any previous 1047 requested RDMA Write and RDMA Flush, or certain other operations. 1048 The implementation MUST fail the operation and send a terminate 1049 message if the RDMA Verify cannot be performed, has encountered an 1050 error, or if a hash value was provided in the request and the 1051 calculated hash does not match. If no condition for a Terminate 1052 message is encountered, the response is generated containing the 1053 result calculated hash value. 1055 3.1.2.1. RDMA Verify Request Format 1057 The RDMA Verify Request Message makes use of the DDP Untagged Buffer 1058 Model. RDMA Verify Request messages MUST use the same Queue Number 1059 as RDMA Read Requests and RDMA Extensions Atomic Operation Requests 1060 (QN=1). Reusing the same queue number for RDMA Read and RDMA Flush 1061 Requests allows the operations to reuse the same RDMA infrastructure 1062 (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow 1063 control) as that defined for those requests. 
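Before the request format is described in detail, the following non-normative sketch shows how a requester might use RDMA Verify: the expected value is computed locally with the upper-layer-selected algorithm and carried in the request, so a mismatch is reported by connection termination rather than by an additional round trip. The ulp_hash() and post_rdma_verify() names are invented for this sketch and are not defined by this specification.

<CODE BEGINS>
#include <stddef.h>
#include <stdint.h>

struct qp;   /* an established RDMAP Stream (placeholder type) */

/* Hypothetical helpers: the hash algorithm is selected by the upper
 * layer at registration time; RDMAP merely transports the octet string. */
size_t ulp_hash(const void *buf, size_t len, uint8_t *out, size_t out_max);
int post_rdma_verify(struct qp *qp, uint32_t stag, uint64_t to, uint32_t len,
                     const uint8_t *expected_hash, size_t hash_len);

int verify_remote_segment(struct qp *qp, uint32_t stag, uint64_t to,
                          const void *local_copy, uint32_t len)
{
    uint8_t expected[64];

    /* Compute the expected value over the requester's local copy of the
     * data previously written (and flushed) to the remote region. */
    size_t hash_len = ulp_hash(local_copy, len, expected, sizeof(expected));

    /* Supplying the expected hash asks the responder to compare and to
     * terminate the connection on mismatch.  Passing NULL/0 instead
     * requests an errorless "scrub": the responder simply returns the
     * hash it computed over the region. */
    return post_rdma_verify(qp, stag, to, len, expected, hash_len);
}
<CODE ENDS>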
3.1.2.1.  RDMA Verify Request Format

The RDMA Verify Request Message makes use of the DDP Untagged Buffer
Model.  RDMA Verify Request messages MUST use the same Queue Number
as RDMA Read Requests and RDMA Extensions Atomic Operation Requests
(QN=1).  Reusing the same queue number allows RDMA Verify to reuse
the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read
Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read and
RDMA Flush Requests.

The RDMA Verify Request Message carries a payload that describes the
ULP Buffer address in the Responder's memory.  The following figure
depicts the Verify Request that is used for all RDMA Verify Request
Messages:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data Sink STag                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Data Sink Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Data Sink Tagged Offset                    |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Hash Value (optional, variable)                |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Verify Request

Data Sink STag: 32 bits
   The Data Sink STag identifies the Remote Peer's Tagged Buffer
   targeted by the Verify Request.  The Data Sink STag is associated
   with the RDMAP Stream through a mechanism that is outside the
   scope of the RDMAP specification.

Data Sink Length: 32 bits
   The Data Sink Length is the length, in octets, of the region
   targeted by the Verify Request.

Data Sink Tagged Offset: 64 bits
   The Data Sink Tagged Offset specifies the starting offset, in
   octets, from the base of the Remote Peer's Tagged Buffer targeted
   by the Verify Request.

Hash Value:
   The Hash Value is an optional octet string representing the
   expected result of the hash algorithm over the Remote Peer's
   Tagged Buffer.  The length of the Hash Value is variable, and
   dependent on the selected algorithm.  When provided, any mismatch
   with the calculated value causes the Responder to generate a
   Terminate message and close the connection.  The contents of the
   Terminate message are defined in Section 5.2.

3.1.2.2.  Verify Response Format

The Verify Response Message makes use of the DDP Untagged Buffer
Model.  Verify Response messages MUST use the same Queue Number as
RDMA Flush Responses (QN=3).  The RDMAP layer passes the following
payload to the DDP layer on Queue Number 3.  The RDMA Verify Response
is not sent when a Terminate message is generated, that is, when a
Hash Value was provided in the Request and a mismatch with the
calculated value occurred.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Hash Value (variable)                     |
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            Verify Response

Hash Value:
   The Hash Value is an octet string representing the result of the
   hash algorithm on the Remote Peer's Tagged Buffer.  The length of
   the Hash Value is variable, and dependent on the algorithm
   selected by the upper layer consumer, among those supported by the
   RNIC.

3.1.2.3.  RDMA Verify Ordering

Ordering and completion rules for RDMA Verify Request are similar to
those for an Atomic operation as described in section 5 of RFC7306.
The DDP Queue Number field of the RDMA Verify Request MUST be 1, and
that of the RDMA Verify Response MUST be 3.

As specified in Section 4, RDMA Verify and RDMA Flush are executed by
the Data Sink in strict order.  Because the RDMA Flush MUST ensure
that all bytes are in the specified state before responding, an RDMA
Verify that follows an RDMA Flush can be assured that it is operating
on flushed data.
If unflushed data has been sent to the region segment between the
operations, and since data may be made persistent and/or globally
visible by the Data Sink at any time, the result of any such RDMA
Verify is undefined.

3.1.3.  Atomic Write

The Atomic Write operation provides a block of data which is Placed
into a specified region atomically.  As specified in Section 4, its
placement is ordered after the successful completion of any
previously requested RDMA Flush or RDMA Verify.  The specified region
is constrained to a size of 64 bits at 64-bit alignment, and the
implementation MUST fail the operation and send a terminate message
if the placement cannot be performed atomically.

The Atomic Write Operation requires the Responder to write a 64-bit
value at a 64-bit-aligned ULP Buffer address in the Responder's
memory, in a manner such that the value is Placed in the Responder's
memory atomically.

3.1.3.1.  Atomic Write Request

The Atomic Write Request Message makes use of the DDP Untagged Buffer
Model.  Atomic Write Request messages MUST use the same Queue Number
as RDMA Read Requests and RDMA Extensions Atomic Operation Requests
(QN=1).  Reusing the same queue number allows Atomic Write to reuse
the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read
Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read,
RDMA Flush and RDMA Verify Requests.

The Atomic Write Request Message carries an Atomic Write Request
payload that describes the ULP Buffer address in the Responder's
memory, as well as the data to be written.  The following figure
depicts the Atomic Write Request that is used for all Atomic Write
Request Messages:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Data Sink STag                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Data Sink Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Data Sink Tagged Offset                    |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             Data                              |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Atomic Write Request

Data Sink STag: 32 bits
   The Data Sink STag identifies the Remote Peer's Tagged Buffer
   targeted by the Atomic Write Request.  The Data Sink STag is
   associated with the RDMAP Stream through a mechanism that is
   outside the scope of the RDMAP specification.

Data Sink Length: 32 bits
   The Data Sink Length is the length, in octets, of the data to be
   placed, and MUST be 8.

Data Sink Tagged Offset: 64 bits
   The Data Sink Tagged Offset specifies the starting offset, in
   octets, from the base of the Remote Peer's Tagged Buffer targeted
   by the Atomic Write Request.  This offset can be any value, but
   the destination ULP Buffer address MUST be aligned as specified
   above.  Ensuring that the STag and Data Sink Tagged Offset values
   appropriately meet such a requirement is an upper layer consumer
   responsibility, and is out of scope for this specification.

Data: 64 bits
   The 64-bit data value to be written, in big-endian format.

Atomic Write Operations MUST target ULP Buffer addresses that are
64-bit aligned, and conform to any other platform restrictions on the
Responder system.  The write MUST NOT be Placed until all prior RDMA
Flush operations, and therefore all other prior operations, have
completed successfully.

If an Atomic Write Operation is attempted on a target ULP Buffer
address that is not 64-bit aligned, or due to alignment, size, or
other platform restrictions cannot be performed atomically:

   The operation MUST NOT be performed

   The Responder's memory MUST NOT be modified

   A terminate message MUST be generated.  (See Section 5.2 for the
   contents of the terminate message.)
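
The following sketch is purely illustrative and is not part of this
specification.  It outlines, in C, one possible Responder-side
handling of an Atomic Write; the helper functions (resolve_region,
send_terminate, send_atomic_write_response) are hypothetical, and a
C11 atomic store stands in for whatever platform mechanism the
Responder uses to guarantee a single, atomic 64-bit Placement.

   /* Illustrative only: Responder-side Atomic Write processing.
    * Assumes ordering behind prior RDMA Flush operations has already
    * been enforced by the Responder's request processing. */
   #include <stdint.h>
   #include <stdatomic.h>

   extern void *resolve_region(uint32_t stag, uint64_t to,
                               uint32_t len);
   extern int send_terminate(int error_code);
   extern int send_atomic_write_response(void);

   int atomic_write_responder(uint32_t stag, uint64_t to,
                              uint32_t length, uint64_t value)
   {
       void *dst;

       if (length != 8)            /* Data Sink Length MUST be 8 */
           return send_terminate(-1);

       dst = resolve_region(stag, to, length);
       if (dst == NULL || ((uintptr_t)dst & 0x7) != 0) {
           /* Unmapped, or not 64-bit aligned: the operation is not
            * performed, Responder memory is not modified, and a
            * Terminate message is generated. */
           return send_terminate(-1);
       }

       /* Place the 64-bit value (byte order as required by the ULP)
        * in a single atomic store. */
       atomic_store_explicit((_Atomic uint64_t *)dst, value,
                             memory_order_release);

       return send_atomic_write_response();
   }
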
3.1.3.2.  Atomic Write Response

The Atomic Write Response Message makes use of the DDP Untagged
Buffer Model.  Atomic Write Response messages MUST use the same Queue
Number as RDMA Flush Responses (QN=3).  The RDMAP layer passes no
payload to the DDP layer on Queue Number 3.

3.1.4.  Discovery of RDMAP Extensions

As with RFC7306, explicit negotiation by the RDMAP peers of the
extensions covered by this document is not required.  Instead, it is
RECOMMENDED that RDMA applications and/or ULPs negotiate any use of
these extensions at the application or ULP level.  The definition of
such application-specific mechanisms is outside the scope of this
specification.  For backward compatibility, existing applications
and/or ULPs should not assume that these extensions are supported.

In the absence of application-specific negotiation of the features
defined within this specification, the new operations can be
attempted, and reported errors can be used to determine a remote
peer's capabilities.  In the case of RDMA Flush and Atomic Write, an
operation to a previously Advertised buffer with remote write
permission can be used to determine the peer's support.  A Remote
Operation Error or Unexpected OpCode error will be reported if the
Operation is not supported by the remote peer.  For RDMA Verify, such
an operation may target a buffer with remote read permission.

3.2.  Local Extensions

This section discusses memory registration, new memory and protection
attributes, and their applicability to both remote operations and
"local" operations (receives).  Because this section does not specify
any wire-visible semantics, it is entirely informative.

3.2.1.  Registration Semantics

New platform-specific attributes to RDMA registration allow them to
be processed at the server *only*, without client knowledge or
protocol exposure.  Requiring no client knowledge is a robust design
choice which helps to ensure future interoperability.

New local PMEM memory registration example (an illustrative sketch
follows at the end of this section):

   Register(region[], MemPerm, MemType, MemMode) -> STag

   Region describes the memory segment[s] to be registered by the
   returned STag.  The local RNIC may limit the size and number of
   these segments.

   MemPerm indicates permitted operations in addition to remote read
   and remote write: "remote flush to persistence", "remote flush to
   global visibility", selectivity, etc.

   MemType indicates the type of storage described by the Region,
   e.g. plain RAM, "flush required" (flushable), PCIe-resident via
   peer-to-peer, or any other local platform-specific processing.

   MemMode indicates the disposition of data read and/or written,
   e.g. cacheable after the operation (indicating whether the data
   is needed by the CPU on the data sink, to allow or avoid
   writethrough as an optimization).

   None of the above attributes are relevant to, or exposed by, the
   protocol.

   The STag is processed in the receiving RNIC during RDMA operations
   to the specified region, under control of the original Perm, Type
   and Mode.
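
The following sketch is purely illustrative and is not part of this
specification.  It renders the Register() example above as a
hypothetical C-style local interface; all of the names are invented
here, none of them appear on the wire, and an actual RNIC interface
would define its own equivalents.

   /* Illustrative only: a local, verbs-style registration interface
    * carrying the platform-specific attributes described above. */
   #include <stdint.h>
   #include <stddef.h>

   enum mem_perm {                 /* permitted remote operations   */
       MEM_PERM_REMOTE_READ              = 1 << 0,
       MEM_PERM_REMOTE_WRITE             = 1 << 1,
       MEM_PERM_REMOTE_FLUSH_PERSISTENCE = 1 << 2,
       MEM_PERM_REMOTE_FLUSH_GLOBAL_VIS  = 1 << 3
   };

   enum mem_type {                 /* storage behind the region     */
       MEM_TYPE_PLAIN_RAM,
       MEM_TYPE_FLUSH_REQUIRED,    /* e.g. persistent memory        */
       MEM_TYPE_PCIE_PEER_TO_PEER
   };

   enum mem_mode {                 /* local disposition of the data */
       MEM_MODE_CACHEABLE,
       MEM_MODE_WRITETHROUGH
   };

   struct mem_segment {
       void   *addr;
       size_t  length;
   };

   /* Returns an STag.  The attributes are retained in local RNIC
    * state (e.g. its TPT) and consulted when remote operations
    * arrive; they are never exposed to the peer. */
   uint32_t rdma_register_pmem(const struct mem_segment *segments,
                               int nsegments, unsigned int perm,
                               enum mem_type type, enum mem_mode mode);
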
3.2.2.  Completion Semantics

To be discussed: the interactions with the new operations when the
upper layer provides Completions to the responder (e.g. messages via
receive, or immediate data via RDMA Write).  This is a natural
conclusion of the ordering rules, but is made explicit here.

Ordering of operations is critical: such RDMA Writes cannot be
allowed to "pass" persistence or global visibility, and RDMA Flush
may not begin until prior RDMA Writes to the flushed region are
accounted for.  Therefore, ULP protocol implications may also exist.

3.2.3.  Platform Semantics

To be discussed: writethrough behavior on persistent regions, and the
reasons for it.  Consider recommending a local writethrough behavior
on any persistent region, to support a nonblocking "hurry-up" that
avoids future stalls on a subsequent cache flush, prior to a flush.
It would also enhance storage integrity.  Selection of this behavior
would be driven from memory registration, so that the RNIC may "look
up" the desired behavior in its TPT.

A PCI extension to support Flush would allow the RNIC to provide
persistence and/or global visibility directly and efficiently to
memory, the CPU, the PCI Root, a PM device, a PCIe device, etc.  This
avoids CPU interaction and supports a strong data consistency model,
performing the equivalent of CLFLUSHOPT over a region list, or some
other flow tag.  Alternatively, if the RNIC participates in the
platform consistency domain on the memory bus or within the CPU
complex, other possibilities exist.

Also consider additional "integrity check" behavior (hash algorithm)
specified per-region.  If so, providing it as a registration
parameter enables fine-grained control, and allows storing it in
per-region RNIC state, making its processing optional and
straightforward.

A similar approach is applicable to providing a security key for
encrypting/decrypting access on a per-region basis, without protocol
exposure.  [SDC2017 presentation]

Any other per-region processing is to be explored.

4.  Ordering and Completions Table

The table in this section specifies the ordering relationships for
the operations in this specification and in those it extends, from
the standpoint of the Requester.  Note that in the table, Send
Operation includes Send, Send with Invalidate, Send with Solicited
Event, and Send with Solicited Event and Invalidate.  Also note that
Immediate Operation includes Immediate Data and Immediate Data with
Solicited Event.

Note: N/A in the table below means Not Applicable

----------+------------+-------------+-------------+-----------------
First     | Second     | Placement   | Placement   | Ordering
Operation | Operation  | Guarantee at| Guarantee at| Guarantee at
          |            | Remote Peer | Local Peer  | Remote Peer
----------+------------+-------------+-------------+-----------------
RDMA Flush| TODO       | No Placement| N/A         | Completed in
          |            | Guarantee   |             | Order
          |            | between Foo |             |
          |            | and Bar     |             |
----------+------------+-------------+-------------+-----------------
TODO      | RDMA Flush | Placement   | N/A         | TODO
          |            | Guarantee   |             |
          |            | between Foo |             |
          |            | and Bar     |             |
----------+------------+-------------+-------------+-----------------
TODO      | TODO       | Etc         | Etc         | Etc
----------+------------+-------------+-------------+-----------------

                        Ordering of Operations

5.  Error Processing

In addition to the error processing described in section 7 of RFC5040
and section 8 of RFC7306, the following rules apply to the new RDMA
Messages defined in this specification.

5.1.  Errors Detected at the Local Peer

The Local Peer MUST send a Terminate Message for each of the
following cases:

1. For errors detected while creating an RDMA Flush, RDMA Verify or
   Atomic Write Request, or other reasons not directly associated
   with an incoming Message, the Terminate Message and Error code are
   sent instead of the Message.  In this case, the Error Type and
   Error Code fields are included in the Terminate Message, but the
   Terminated DDP Header and Terminated RDMA Header fields are set to
   zero.

2. For errors detected on an incoming RDMA Flush, RDMA Verify or
   Atomic Write Request or Response, the Terminate Message is sent at
   the earliest possible opportunity, preferably in the next outgoing
   RDMA Message.  In this case, the Error Type, Error Code, and
   Terminated DDP Header fields are included in the Terminate
   Message, but the Terminated RDMA Header field is set to zero.

3. For errors detected in the processing of the RDMA Flush or RDMA
   Verify itself, that is, the act of flushing or verifying the data,
   the Terminate Message is generated as per the referenced
   specifications.  Even though data is not lost, the upper layer
   MUST be notified of the failure by informing the requester of the
   status, terminating any queued operations, and allowing the
   requester to perform further action, for instance recovery.

5.2.  Errors Detected at the Remote Peer

On incoming RDMA Flush and RDMA Verify Requests, the following MUST
be validated:

o  The DDP layer MUST validate all DDP Segment fields.

The following additional validation MUST be performed:

o  If the RDMA Flush, RDMA Verify or Atomic Write operation cannot be
   satisfied due to transient or permanent errors detected in the
   processing by the Responder, a Terminate message MUST be returned
   to the Requestor.

6.  IANA Considerations

This document requests that IANA assign the following new operation
codes in the "RDMAP Message Operation Codes" registry defined in
section 3.4 of [RFC6580].

0xC   RDMA Flush Request, this specification

0xD   RDMA Flush Response, this specification

0xE   RDMA Verify Request, this specification

0xF   RDMA Verify Response, this specification

0x10  Atomic Write Request, this specification

0x11  Atomic Write Response, this specification

Note to RFC Editor: this section may be edited and updated prior to
publication as an RFC.

7.  Security Considerations

This document specifies extensions to the RDMA Protocol specification
in RFC5040 and the RDMA Protocol Extensions in RFC7306, and as such
the Security Considerations discussed in Section 8 of RFC5040 and
Section 9 of RFC7306 apply.  In particular, all operations use ULP
Buffer addresses for the Remote Peer Buffer addressing used in
RFC5040, as required by the security model described in RDMAP
Security [RFC5042].

If the "push mode" transfer model discussed in section 2 is
implemented by upper layers, new security considerations will
potentially be introduced in those protocols, particularly on the
server (or target) if the new memory regions are not carefully
protected.  Therefore, for them to take full advantage of the
extensions defined in this document, additional security design is
required in the implementation of those upper layers.  The facilities
of RDMAP Security [RFC5042] can provide the basis for any such
design.

In addition to protection, in "push mode" the server or target will
expose memory resources to the peer for potentially extended periods,
and will allow the peer to perform remote requests which will
necessarily consume shared resources, e.g. memory bandwidth, power,
and memory itself.  It is recommended that the upper layers provide a
means to gracefully adjust such resources, for example using upper
layer callbacks, without resorting to revoking RDMA permissions,
which would summarily close connections.  Because initiator
applications rely on the protocol extension itself for managing their
required persistence and/or global visibility, the lack of such an
approach could lead to frequent recovery in low-resource situations,
potentially opening a new threat to such applications.

8.  To Be Added or Considered

This section will be deleted in a future document revision.

Complete the discussion in Section 3.2 and its subsections (Local
Extension semantics).

Complete the Ordering table in Section 4.  Carefully include
discussion of the order of "start of execution" as well as
completion, which are somewhat more involved than prior RDMA
operation ordering.

RDMA Flush "selectivity", to provide default flush semantics with a
broader scope than region-based: if specified, a flag would request
that all prior write operations on the issuing Queue Pair be flushed
with the requested disposition(s).  This flag may simplify upper
layer processing, and would allow regions larger than 4GB-1 bytes to
be flushed in a single operation.  The STag, Offset and Length would
be ignored in this case.  It is to be determined how to extend the
RDMA security model to protect other regions associated with the
Queue Pair from unintentional or unauthorized flush.

9.  Acknowledgements

The authors wish to thank Jim Pinkerton, who contributed to an
earlier version of the specification, and Brian Hausauer and Kobby
Carmona, who have provided significant review and valuable comments.

10.  References

10.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
           Garcia, "A Remote Direct Memory Access Protocol
           Specification", RFC 5040, DOI 10.17487/RFC5040,
           October 2007, <https://www.rfc-editor.org/info/rfc5040>.

[RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
           Data Placement over Reliable Transports", RFC 5041,
           DOI 10.17487/RFC5041, October 2007,
           <https://www.rfc-editor.org/info/rfc5041>.

[RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
           Protocol (DDP) / Remote Direct Memory Access Protocol
           (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042,
           October 2007, <https://www.rfc-editor.org/info/rfc5042>.

[RFC6580]  Ko, M. and D. Black, "IANA Registries for the Remote
           Direct Data Placement (RDDP) Protocols", RFC 6580,
           DOI 10.17487/RFC6580, April 2012,
           <https://www.rfc-editor.org/info/rfc6580>.

[RFC7306]  Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R.
           Sharp, "Remote Direct Memory Access (RDMA) Protocol
           Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014,
           <https://www.rfc-editor.org/info/rfc7306>.

10.2.  Informative References

[RFC5045]  Bestler, C., Ed. and L. Coene, "Applicability of Remote
           Direct Memory Access Protocol (RDMA) and Direct Data
           Placement (DDP)", RFC 5045, DOI 10.17487/RFC5045,
           October 2007, <https://www.rfc-editor.org/info/rfc5045>.

[RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
           "Network File System (NFS) Version 4 Minor Version 1
           Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
           <https://www.rfc-editor.org/info/rfc5661>.

[RFC7145]  Ko, M. and A. Nezhinsky, "Internet Small Computer System
           Interface (iSCSI) Extensions for the Remote Direct Memory
           Access (RDMA) Specification", RFC 7145,
           DOI 10.17487/RFC7145, April 2014,
           <https://www.rfc-editor.org/info/rfc7145>.

[RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
           Memory Access Transport for Remote Procedure Call
           Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
           <https://www.rfc-editor.org/info/rfc8166>.

[RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
           to RPC-over-RDMA Version 1", RFC 8267,
           DOI 10.17487/RFC8267, October 2017,
           <https://www.rfc-editor.org/info/rfc8267>.

[SCSI]     ANSI, "SCSI Primary Commands - 3 (SPC-3) (INCITS
           408-2005)", May 2005.

[SMB3]     Microsoft Corporation, "Server Message Block (SMB)
           Protocol Versions 2 and 3 (MS-SMB2)", March 2020,
           <https://docs.microsoft.com/en-us/openspecs/
           windows_protocols/ms-smb2/5606ad47-5ee0-437a-817e-
           70c366052962>.

[SMBDirect]
           Microsoft Corporation, "SMB2 Remote Direct Memory Access
           (RDMA) Transport Protocol (MS-SMBD)", September 2018,
           <https://docs.microsoft.com/en-us/openspecs/
           windows_protocols/ms-smbd/1ca5f4ae-e5b1-493d-b87d-
           f4464325e6e3>.

[SNIANVMP] SNIA NVM Programming TWG, "SNIA NVM Programming Model
           v1.2", June 2017,
           <https://www.snia.org/sites/default/files/technical_work/
           final/NVMProgrammingModel_v1.2.pdf>.

10.3.  URIs

[1] http://www.nvmexpress.org

[2] http://www.jedec.org

Appendix A.  DDP Segment Formats for RDMA Extensions

This appendix is for information only and is NOT part of the
standard.  It simply depicts the DDP Segment format for each of the
RDMA Messages defined in this specification.

A.1.
DDP Segment for RDMA Flush Request 1597 0 1 2 3 1598 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1599 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1600 | DDP Control | RDMA Control | 1601 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1602 | Reserved (Not Used) | 1603 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1604 | DDP (Flush Request) Queue Number (1) | 1605 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1606 | DDP (Flush Request) Message Sequence Number | 1607 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1608 | Data Sink STag | 1609 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1610 | Data Sink Length | 1611 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1612 | Data Sink Tagged Offset | 1613 + + 1614 | | 1615 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1616 | Disposition Flags +G+P| 1617 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1619 RDMA Flush Request, DDP Segment 1621 A.2. DDP Segment for RDMA Flush Response 1622 0 1 2 3 1623 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1624 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1625 | DDP Control | RDMA Control | 1626 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1627 | Reserved (Not Used) | 1628 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1629 | DDP (Flush Response) Queue Number (3) | 1630 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1631 | DDP (Flush Response) Message Sequence Number | 1632 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1634 RDMA Flush Response, DDP Segment 1636 A.3. DDP Segment for RDMA Verify Request 1638 0 1 2 3 1639 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1640 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1641 | DDP Control | RDMA Control | 1642 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1643 | Reserved (Not Used) | 1644 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1645 | DDP (Verify Request) Queue Number (1) | 1646 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1647 | DDP (Verify Request) Message Sequence Number | 1648 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1649 | Data Sink STag | 1650 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1651 | Data Sink Length | 1652 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1653 | Data Sink Tagged Offset | 1654 + + 1655 | | 1656 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1657 | Hash Value (optional, variable) | 1658 | ... | 1659 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1661 RDMA Verify Request, DDP Segment 1663 A.4. DDP Segment for RDMA Verify Response 1664 0 1 2 3 1665 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1666 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1667 | DDP Control | RDMA Control | 1668 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1669 | Reserved (Not Used) | 1670 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1671 | DDP (Verify Response) Queue Number (3) | 1672 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1673 | DDP (Verify Response) Message Sequence Number | 1674 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1675 | Hash Value (variable) | 1676 | ... 
| 1677 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1679 RDMA Verify Response, DDP Segment 1681 A.5. DDP Segment for Atomic Write Request 1683 0 1 2 3 1684 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1685 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1686 | DDP Control | RDMA Control | 1687 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1688 | Reserved (Not Used) | 1689 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1690 | DDP (Atomic Write Request) Queue Number (1) | 1691 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1692 | DDP (Atomic Write Request) Message Sequence Number | 1693 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1694 | Data Sink STag | 1695 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1696 | Data Sink Length (value=8) | 1697 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1698 | Data Sink Tagged Offset | 1699 + + 1700 | | 1701 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1702 | Data (64 bits) | 1703 + + 1704 | | 1705 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1707 Atomic Write Request, DDP Segment 1709 A.6. DDP Segment for Atomic Write Response 1711 0 1 2 3 1712 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1713 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1714 | DDP Control | RDMA Control | 1715 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1716 | Reserved (Not Used) | 1717 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1718 | DDP (Atomic Write Response) Queue Number (3) | 1719 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1720 | DDP (Atomic Write Response) Message Sequence Number | 1721 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1723 Atomic Write Response, DDP Segment 1725 Authors' Addresses 1727 Tom Talpey 1728 Microsoft 1729 One Microsoft Way 1730 Redmond, WA 98052 1731 US 1733 Email: ttalpey@microsoft.com 1735 Tony Hurson 1736 Intel 1737 Austin, TX 1738 US 1740 Email: tony.hurson@intel.com 1742 Gaurav Agarwal 1743 Marvell 1744 CA 1745 US 1747 Email: gagarwal@marvell.com 1748 Tom Reu 1749 Chelsio 1750 NJ 1751 US 1753 Email: tomreu@chelsio.com