INTERNET-DRAFT                                          C. Sapuntzakis
                                                          Cisco Systems
                                                             A. Romanow
                                                          Cisco Systems
                                                               J. Chase
                                                         Duke University

draft-csapuntz-caserdma-00.txt                            December 2000

                           The Case for RDMA

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) Cisco Systems (2000).  All Rights Reserved.

Abstract

The end-to-end performance of IP networks for bulk data transfer is
often limited by data copying overhead in the end systems.  Even when
end systems can sustain the bandwidth of high-speed networks, copying
overheads often limit their ability to carry out other processing
tasks.

Remote Direct Memory Access (RDMA) is a facility for avoiding copying
for network communication in a general and comprehensive way.  RDMA
is particularly useful for protocols that transmit bulk data mixed
with control information, such as NFS, CIFS, and HTTP, or
encapsulated device protocols such as iSCSI.  While networking
architectures such as the Virtual Interface (VI) architecture support
RDMA, there is no standard for RDMA over IP networks.  Such a
standard would allow vendors of IP-aware network hardware (such as
TCP-capable network adapters) to incorporate support for RDMA into
their products.

This document reviews the I/O performance issues addressed by RDMA
and considers issues for supporting the key elements of RDMA in an IP
networking context.
Glossary

   header/payload splitting - any technique that enables a NIC to
      deposit incoming protocol headers and payloads into separate
      host buffers

   headers - control information used by the protocol

   HBA - host bus adapter, a network adapter (see NIC)

   I/O operation - a request to a device, then a transfer to/from
      that device, and a status response

   MTU - maximum transmission unit, the largest packet size that a
      given network device or path can carry

   NIC - network interface card/controller (see HBA)

   payload - in general, uninterpreted data transported by a protocol

   payload steering - any technique that enables a NIC to deposit an
      incoming protocol payload into a buffer designated for that
      specific payload

   protocol stack - the layers of software, firmware, or hardware
      that implement communication between applications across a
      network

   region and region identifier (RID) - a memory buffer region
      reserved and registered for use with RDMA requests, and its
      unique identifier

   solicited data - data that was sent in response to some control
      message

   unsolicited data - data that was sent without being requested

   upper-layer protocol (ULP) - an application-layer protocol like
      NFS, CIFS, HTTP, or iSCSI

1. Introduction

The principal use of the Internet and IP networks today is
buffer-to-buffer transfer, often in the form of file or block
transfers.  This is done using a variety of protocols: HTTP, FTP,
NFS, and CIFS.  Soon, iSCSI will be added to this list.

These upper-layer protocols (ULPs) all have one thing in common: the
majority of the bytes they send on the network are data "payloads"
that are uninterpreted by the protocol or the network.

Each ULP has its own way of requesting and initiating data transfers.
The ULPs differ in the kinds of control information or meta-data
(e.g., cache coherence information) they specify and send across the
wire.  However, all of these protocols eventually come down to
transporting large blocks of uninterpreted data from a local buffer
to a remote buffer.  Transferring a payload from one host to another
is essentially a buffer-to-buffer data transfer (like the C memcpy
function) carried out over the network.  For example, one use of HTTP
is to transfer JPEG images from a web server into a web browser's
address space.

Today, gigabit-speed buffer-to-buffer network transfers consume
significant memory bandwidth and CPU time on receivers.  With the
advent of IP checksum hardware, the end-system overhead for network
transfers is dominated by the cost of copying to place incoming data
correctly in the receiver's memory buffers.  Although CPUs are
rapidly becoming more powerful, network bandwidth has kept pace with
and even exceeded Moore's Law in recent years.  Moreover, copying is
limited by memory system performance, which is not improving as fast
as CPU speeds.

One solution to this problem is to place the data in the correct
memory buffer directly as it arrives from the network, avoiding the
need to copy it into the correct buffer after it has arrived.  If the
network interface (NIC) could place data correctly in memory, it
would free up the memory bandwidth and CPU cycles consumed by
copying.

A number of mechanisms already exist to reduce copying overhead in
the IP stack.
Some of these mechanisms depend on fragile assumptions
about the hardware and application buffers, others involve ad hoc
support for specific protocols and communication scenarios, and all
of them impose costs that may be prohibitive in some scenarios.

However, a mechanism called Remote Direct Memory Access (RDMA) offers
a solution that is simple, general, complete, and robust.  RDMA
introduces new control information into the communication stream that
directs data movement for buffer-to-buffer transfers.  Incorporating
support for RDMA into network protocols can significantly reduce the
cost of network buffer-to-buffer transfers.

RDMA accomplishes exact data placement via a generalized abstraction
at the boundary between the ULP and its transport (e.g., TCP),
allowing an RDMA-capable NIC to recognize and steer payloads
independently of the specific ULP.  Using RDMA, ULPs gain efficient
data placement without the need to program ULP-specific details into
the NIC.  RDMA thus speeds deployment of new protocols, because the
firmware or hardware on the NIC need not be rewritten to accelerate
each new protocol.

To be effective, the receiving NIC must recognize the RDMA control
information, and ULP implementations or applications must be modified
to generate the RDMA control information.  In addition, support for
framing in the transport protocols would allow an RDMA-capable NIC to
locate RDMA control information in the stream when packets arrive out
of order.

Historically, network protocols and implementations have addressed
the issue of demultiplexing multiple streams arriving at an
interface.  However, there are still no accepted solutions for
demultiplexing control and data arriving on a single stream.  Much
current network traffic is characterized by a small amount of control
accompanying a large amount of data.  RDMA enables efficient data
payload steering for this common case, which is especially important
as data rates increase.

This document is somewhat tutorial: it seeks to set out clearly the
I/O performance issues addressed by RDMA and the design alternatives
for an RDMA facility.  It considers proposed approaches to these
problems, clarifying the benefits and costs of deploying and using
RDMA.

The document is organized as follows.  Section 2 describes the copy
overhead problem in detail.  Section 3 discusses various alternatives
to a general RDMA facility.  Section 4 describes the RDMA approach in
detail, including its handling of unsolicited data.  Section 5
discusses RDMA APIs, and Section 6 considers RDMA implementation
issues.

2. The I/O Performance Problem

Figure 1 shows a block diagram illustrating the layers involved in
transferring data in and out of a host system.  We will call these
layers the network I/O stack.  Each boundary in the diagram
corresponds to an I/O interface.  In general, we assume that all the
modules represented in Figure 1 (except for the NIC) run on the host
CPU, although RDMA is equally useful if portions of the I/O stack run
on the NIC.
   |-----------------------|

         Application

   |-----------+-----------|
    File       |
    System     |   Block
    Interface  |   Interface
   |-----------+-----------|
    Upper-Layer Protocol
    Stack (NFS, CIFS,
    SCSI/iSCSI, HTTP)
   |-----------------------|

    Network Stack (IP, TCP)

   |-----------------------|

            NIC

   |-----------------------|

   Figure 1: The network I/O stack

In IP networks, end-system CPUs may incur substantial overhead from
copying data in memory as part of I/O operations.  Copying is
necessary in order to align data, place data contiguously in memory,
or place data in specific buffers supplied by the application or ULP
module.  These properties may be important to applications for
several reasons.

Alignment is important because most CPU architectures impose
alignment constraints on data accessed in units larger than a byte,
e.g., for incoming data interpreted as integers.

Contiguity of data in memory simplifies the book-keeping data
structures that describe the data and improves memory utilization by
reducing fragmentation of free space.  Data contiguity may also
simplify algorithms that traverse the data, reducing execution time;
for example, contiguity enables sequential memory access.

Common network APIs such as sockets [Stevens] allow applications to
designate specific buffers for incoming data, requiring a copy to
place the incoming data correctly.  It may be possible to avoid the
copy by page remapping (see Section 3.2), but only if the data is
contiguous, occupies complete memory pages, and is page-aligned
relative to the application's buffer.  Similarly, storage protocols
such as NFS and iSCSI may require contiguous, page-aligned data for
buffering in the system I/O cache.

This document concentrates on how to eliminate unnecessary data
copies used to assure correct placement of incoming data.

Some have argued that the expense of these data copies can be partly
masked if some other data scanning operation, such as checksumming or
decryption, runs over the data simultaneously (see [ALF]).  However,
such optimizations are highly processor-dependent and may not yield
the expected benefits [Chase].  Moreover, this approach is not useful
unless other data scanning operations are handled in software;
hardware support for checksumming and decryption is increasingly
common.

In recent years, valuable progress has been made in minimizing other
sources of networking overhead.  Examples include checksum
offloading, extended Ethernet frames, and interrupt suppression.  For
a review and evaluation of various solutions see [Chase].  These
issues are not discussed in this document.

2.1 Copy on receive

The primary issue addressed here is how application data is received
from the network.  In many I/O interfaces, when an application reads
data, it specifies the buffer into which the data should be received.
However, today's generic NICs are incapable of placing data directly
into the supplied buffer, largely because direct placement requires
more complexity and intelligence than generic NICs provide.  For
example, to accomplish this task a NIC would need to separate
payloads from ULP and transport headers, parse headers, and
demultiplex multiple incoming packet streams.

Most NICs today are not this sophisticated in their handling of
incoming data streams.  Instead, they deposit incoming packets into
generic host buffers supplied by the network stack software.  Both
the network and ULP stacks sift through the packets, looking
successively at headers from the link layer (e.g., Ethernet), IP, the
transport, and the ULP.  Eventually, the data payload is recognized
and copied from the network buffers to the correct application
buffer.
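As a concrete illustration of this conventional path, the sketch
below uses the standard sockets API.  The application designates a
specific buffer (function and variable names are illustrative), and
because a generic NIC has already deposited the payload into kernel
network buffers, the kernel must copy the bytes to satisfy each call.

   /* A minimal sketch of the conventional receive path described
    * above, using the standard sockets API.  The application
    * supplies a specific buffer, but a generic NIC has already
    * deposited the payload into kernel network buffers, so each
    * recv() implies a kernel-to-user copy.  Names are
    * illustrative. */
   #include <stddef.h>
   #include <sys/types.h>
   #include <sys/socket.h>

   ssize_t receive_block(int sock, void *buf, size_t len)
   {
       size_t done = 0;
       while (done < len) {
           /* The kernel copies payload bytes from its network
            * buffers into the application buffer here. */
           ssize_t n = recv(sock, (char *)buf + done, len - done, 0);
           if (n <= 0)
               return n;           /* error or connection closed */
           done += n;
       }
       return (ssize_t)done;
   }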
2.2 Copy on transmit

For the most part, sending data from applications to the network
should not require copies in the I/O stack.  Today's network adapters
can gather data from anywhere in memory to form a packet, so no copy
is necessary to align outgoing packet data for the NIC.

Copying can be used as a technique to ensure that the data is not
modified between the time it is passed from the application to the
I/O interface and the time the data transfer completes.  Other
well-known solutions exist that do not involve copying [Brustoloni].

Copy on transmit is not discussed further.

3. Non-RDMA solutions

A range of ad hoc solutions avoid copying of incoming data without
requiring RDMA.  These include:

   - scatter-gather buffers
   - header/payload separation
   - parsing the ULP on the NIC

3.1 Scatter-gather buffers

Once the NIC has written the application data to memory, a copy can
be avoided if we tell the application where to find its data in
memory.  The application data may be scattered in memory because it
may have arrived in multiple packets.  A data structure called a
scatter-gather buffer is used to tell the application the location of
the data.  Scatter-gather buffering is the only known copy avoidance
technique that does not require direct support on the NIC.

This solution is not compatible with existing I/O interfaces, such as
the sockets interface.  Also, in this approach, data is not
necessarily contiguous in memory or page-aligned.  For example, the
data cannot in general be delivered securely to a user-level process
without copying, since mapping the pages containing the received data
into a user process's address space exposes those pages in their
entirety, not just the portions occupied by the received data.

However, scatter-gather buffering is a viable copy avoidance
technique for kernel-based applications where few data
transformations are needed.  For file system protocols, effective use
of scatter-gather buffering may require a redesign of the file buffer
cache and/or virtual memory page cache.
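A scatter-gather buffer is typically an array of (address, length)
descriptors, as in the POSIX struct iovec.  The sketch below
(function names are illustrative) shows a consumer walking such a
list instead of expecting one contiguous, aligned buffer.

   /* A sketch of consuming a scatter-gather buffer: an array of
    * (address, length) descriptors, here the POSIX struct iovec.
    * The consumer visits each segment where it lies in memory
    * rather than requiring one contiguous buffer.  Names are
    * illustrative. */
   #include <stddef.h>
   #include <sys/uio.h>

   void consume_scattered(const struct iovec *iov, int iovcnt,
                          void (*consume)(const void *seg, size_t n))
   {
       /* Segments are visited in arrival order, wherever the
        * packets happened to land in memory. */
       for (int i = 0; i < iovcnt; i++)
           consume(iov[i].iov_base, iov[i].iov_len);
   }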
3.2 Ad hoc header/payload separation

A more sophisticated NIC might recognize transport and/or ULP headers
in order to separate the headers from the payloads.  Each payload is
then "split" from its header and placed in a separate buffer.
Header/payload splitting is useful for copy avoidance because a
virtual memory system may then map the payload to an application
buffer by manipulating virtual memory translations to point to the
payload.  This approach, called "page flipping" or "page remapping",
is an alternative to copying for delivering data into application
buffers.  A prerequisite for page flipping is that the application
buffer must be page-aligned and contiguous in virtual memory.

Header/payload splitting adds significant complexity to the NIC.  If
the network MTU is smaller than the hardware page size, then the
transfer of a page of data is spread across multiple packets.  These
packets can arrive at the receiver out of order and/or interspersed
with packets from other flows.  In order to pack the data
contiguously into pages, the NIC must do intelligent processing of
the transport and ULP.  This approach is "ad hoc" because the NIC
must include support for each transport and ULP that benefits from
page flipping.  The NIC processing may be unnecessarily complex for
ULPs, such as NFS, that use variable-length headers or that require
ULP-level state to decode the incoming headers.

A key disadvantage is that page flipping requires TLB invalidations,
which can be prohibitively expensive on shared-memory
multiprocessors.

3.3 Explicit header/payload separation

The previous section discussed header/payload separation implemented
in an ad hoc fashion.  It is also possible to implement a more
general method of header/payload splitting that does not require the
NIC to decode ULP headers.  A generic framing mechanism implemented
at the transport layer, or just above it, could include frame header
fields that distinguish the ULP payload from the ULP header.  This
would enable a receiving NIC to separate received data payloads from
control information and deposit the payload data in contiguous,
page-aligned target buffer locations.  Under most conditions this is
sufficient to allow low-copy implementations of ULPs such as NFS.

The RDMA approach explored in this document is a more general
extension of this approach.
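As a hypothetical illustration of such generic framing, a shim above
the transport might prefix each ULP message with a small header
giving the lengths of the control and payload portions; the NIC then
needs only these fields, not ULP knowledge, to split the two.  The
layout below is illustrative, not a defined wire format.

   /* A hypothetical frame header for generic header/payload
    * separation.  A shim above the transport prefixes each ULP
    * message with this header; a NIC can split control bytes from
    * payload using only these fields, with no knowledge of the
    * ULP.  The layout is illustrative, not a defined wire
    * format. */
   #include <stdint.h>

   struct frame_hdr {
       uint32_t frame_len;    /* total bytes in this frame,
                               * including this header           */
       uint16_t control_len;  /* bytes of ULP header that follow */
       uint16_t flags;        /* e.g., payload wants page-aligned
                               * placement                       */
   };
   /* payload_len = frame_len - sizeof(struct frame_hdr)
    *               - control_len */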
3.4 Terminate the ULP in the NIC

If the NIC terminates the ULP, the memory copy is eliminated because
the application communicates I/O requests directly to the NIC.  The
NIC uses the information in the ULP headers to steer ULP payloads to
the correct application buffers.  This is commonly done in the
FibreChannel arena, where FibreChannel NICs (or host bus adapters)
implement an I/O block transport (e.g., SCSI) on the NIC.  This
approach effectively migrates all modules of the network stack in
Figure 1 onto the NIC.  FibreChannel implementations use this
technique to deliver high performance with low host overhead.

In such a scheme, the NIC needs to be informed of specific
application buffers.  The NIC also needs to be capable of
header/payload splitting.

While this approach may be useful for single-function devices, it is
inappropriate for general-purpose NICs: the NIC must be reprogrammed
or extended to accelerate each ULP.  RDMA offers a general mechanism
that allows RDMA-capable NICs to avoid copies for any ULP that uses
RDMA.

4. Remote Direct Memory Access (RDMA)

This section outlines how RDMA works.

Direct memory access (DMA) is a fundamental technique that is widely
used in high-performance I/O systems.  DMA allows a device to
directly read or write host memory across an I/O interconnect (such
as PCI) by sending DMA commands to the memory controller.  No CPU
intervention or copying is required.  For example, when a host
requests an I/O read operation from a DMA-capable storage device, the
device uses a DMA write to place the incoming data directly into
memory buffers that the host provides for that specific operation.
Similarly, when the host requests an I/O write operation, the device
uses a DMA read to fetch outgoing data from host memory buffers
specified by the host for that operation.

Remote DMA can provide similar functionality in IP networks.  It is
particularly useful when an IP network is used as an I/O interconnect
for IP-capable devices, such as storage devices and their servers.
Conceptually, RDMA allows a network-attached device to read or write
remote memory, e.g., by adding control information that specifies the
buffers to receive transmitted payloads.  The remote NIC decodes this
control information and uses DMA to read or write memory, effectively
translating between the RDMA protocol and the local memory access
protocol.  In an IP network, the RDMA protocol appears at the
transport layer (e.g., as a "shim" above an existing transport
protocol such as TCP) so that a wide variety of upper-layer protocols
can make use of it with minimal changes.

The idea of RDMA has been around under various names for many years.
RDMA is an important component of the VI architecture for user-level
networking, and is also a key element of the InfiniBand effort.  VI
illustrates one alternative for a networking API that accommodates
RDMA (see Section 5.1).  However, RDMA generalizes to other network
architectures.  This document addresses issues for incorporating RDMA
into conventional IP protocol stacks.  Note that VI can run over an
IP transport such as TCP, but only if the NIC implements the full
transport.

Since TCP is the most widely used transport for upper-layer
protocols, using RDMA with TCP is the first case to consider.
However, RDMA can be used with other transport protocols, notably
SCTP.

4.1 How RDMA works

An RDMA facility embeds new RDMA control commands into the byte
stream or packet stream.  A full RDMA protocol includes two key
commands: RDMA READ and RDMA WRITE.  The receiving NIC translates
these commands into local memory reads and writes.

For security reasons, it is undesirable to allow transmitters to read
or write arbitrary memory on the receiver.  Any RDMA scheme must
prevent unauthorized memory accesses.  Most RDMA schemes protect
memory by allowing RDMA reads and writes only to buffers that the
receiver has explicitly identified to the NIC as valid RDMA targets.
The process of informing the NIC about a buffer is called
"registration".

The following steps illustrate the common case of a data transfer
using RDMA WRITE in the context of a request/response storage
protocol such as NFS or iSCSI:

   1. The client application calls an I/O interface, requesting that
      the result of the I/O be put into a buffer B.

   2. The client implementation registers buffer B with the NIC.

   3. The client sends the I/O READ request to the server.

   4. The server issues one or more RDMA WRITEs to write the I/O
      data into the client's buffer B.

   5. The server sends the file system READ response for the I/O.

Of course, on each I/O operation the server must know which client
addresses to write.  One alternative is for the client to pass a
token identifying the target buffer in the request; the server
returns the token in its response.  This is the approach used in VI
implementations.  Another alternative is for the client and server to
each synthesize the token from other unique identifiers present in
the request [TCPRDMA].
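The structures below sketch what the client's token (steps 2-3) and
the server's RDMA WRITE control header (step 4) might carry.  The
field names and layout are illustrative, not taken from any
specification.

   /* Illustrative structures for the exchange above; field names
    * and layout are hypothetical, not from any RDMA
    * specification. */
   #include <stdint.h>

   /* Token the client obtains by registering buffer B (step 2) and
    * embeds in its I/O READ request (step 3). */
   struct rdma_token {
       uint32_t rid;       /* region identifier from registration */
       uint32_t offset;    /* starting offset within the region   */
       uint32_t length;    /* bytes the region can accept         */
   };

   /* Control header the server emits ahead of each chunk of
    * payload in step 4.  The client's NIC uses (rid, offset) to
    * place the following 'length' bytes directly into buffer B. */
   struct rdma_write_hdr {
       uint32_t opcode;    /* RDMA WRITE                          */
       uint32_t rid;       /* copied from the client's token      */
       uint32_t offset;    /* placement offset for this chunk     */
       uint32_t length;    /* payload bytes that follow           */
   };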
Most RDMA schemes use a region identifier (RID) and an offset to
identify the target buffer in a token.  The (RID, offset) pair
amounts to a form of virtual address; the receiving NIC translates
these virtual addresses to physical addresses using a table lookup.
Consequently, if a mapping to a physical page does not appear in the
table, there is no way a transmitter can refer to that page.

Once an entry is in the table, the NIC can potentially access the
physical memory of the buffer at any time, so the buffer must not be
reused for other purposes.  One alternative is for the OS to "pin"
the buffer in physical memory, allowing the NIC to safely hold the
physical addresses corresponding to the buffer.  Once the region
mapping is removed, the OS can "unpin" the physical memory.
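The sketch below shows the NIC-side lookup in its simplest form.
The data structures are illustrative, and for brevity each region is
assumed physically contiguous; a real table would translate page by
page.  Any (RID, offset) not covered by a registered region simply
has no translation.

   /* A minimal sketch of the NIC-side (RID, offset) -> physical
    * address lookup described above.  Data structures are
    * illustrative; regions are assumed physically contiguous. */
   #include <stdint.h>

   struct region_entry {
       int      valid;      /* set at registration, cleared when
                             * the region mapping is removed      */
       uint64_t phys_base;  /* physical base of the pinned buffer */
       uint32_t length;     /* size of the registered region      */
   };

   #define MAX_REGIONS 1024
   static struct region_entry region_table[MAX_REGIONS];

   /* Return the physical address for an access of 'len' bytes at
    * (rid, offset), or 0 if no registered region covers it. */
   uint64_t rdma_translate(uint32_t rid, uint32_t offset,
                           uint32_t len)
   {
       if (rid >= MAX_REGIONS || !region_table[rid].valid)
           return 0;                   /* unknown or removed RID */
       if (offset > region_table[rid].length ||
           len > region_table[rid].length - offset)
           return 0;                   /* access out of bounds   */
       return region_table[rid].phys_base + offset;
   }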
4.2 Unsolicited payloads

NFS, CIFS, and HTTP all support sending data along with a WRITE (or
POST) request.  This is optimistic: it assumes the receiving
application has space (other than the TCP window) to buffer the WRITE
payload.  The payload and transfer are called "unsolicited" in that
they were not requested by the receiver.  RDMA WRITE is
straightforward for solicited data, since the sender can receive the
RID and buffer address in the message that solicits the data, as in
the preceding example.  In the case of unsolicited data, it is not
clear how the sender obtains the RID necessary for an RDMA WRITE.

RDMA may be used for unsolicited data in the following way.  The
receiver may expose a memory region for unsolicited data from each
sender.  The sender, when it wishes to do an unsolicited WRITE, can
RDMA its data into that region.  Then, along with the WRITE request,
the sender may pass a pointer (e.g., a region offset) to the data it
wrote.  This requires that the receiver (server) pass an RID for
unsolicited data at connection open and supply a new region if the
unsolicited region fills.  Alternatively, the receiver may handle
unsolicited data by responding to the WRITE request with an RDMA READ
(if supported) to fetch the data, as described in Section 4.3.

4.3 Reading remote memory

Some RDMA protocols allow one party to read another's memory with an
RDMA READ operation.  As with RDMA WRITE, the NICs, not the CPUs,
process RDMA READs.

The receiving NIC may complete the RDMA READ from the receiver's
memory without interrupting the CPU.  This is potentially useful
because CPU interrupts are expensive in general-purpose systems:
switching between the currently executing task and the interrupt
handler involves flushing pipelines, saving and restoring context,
and other overheads.

Although any RDMA READ may be emulated using an RDMA WRITE in the
opposite direction, RDMA READ has potential advantages.  First, an
RDMA READ requester does not need to export a region RID to receive
the incoming data, as it would for an RDMA WRITE.  This is useful
because it allows servers to avoid reserving and exposing memory
regions for large numbers of clients.  Second, RDMA READ allows the
requester to control the order and rate of data transmitted by the
sender (the RDMA READ target).

For example, a network storage device or server may implement write
operations by issuing RDMA READs to its client, rather than allowing
the client to use RDMA WRITE to transfer the data to the server.
This allows the server to control use of the buffer space it
allocates for the transfers, and to pull the data from the client in
an order that is convenient for the server, e.g., to optimize disk
performance.  The emerging VI-based Direct Access File System uses
RDMA READ for file write operations, in part for these reasons.

RDMA READ is more complex than RDMA WRITE because it implies that the
target NIC autonomously transmits data back to the requester, e.g.,
without involving a host CPU.  This implies that the NIC implements
the complete transport protocol necessary to send such data without
involving or interfering with the protocol stack in host software.

Use of RDMA READ requires ULPs designed to take advantage of it, as
well as more powerful NICs.  While it offers several benefits, there
may be alternative means to achieve many of the same benefits, such
as simple interrupt-suppressing NICs and ULP protocol features to
control the rate and order of data flow, as provided in the iSCSI
draft specification [iSCSI].

In contrast to RDMA READ, RDMA WRITE is simple and general, does not
require a full implementation of the transport on the NIC, and is
easily incorporated into existing request/response protocols with
minimal impact.  The remainder of this document focuses on RDMA
WRITE.

4.4 Security

The principal mechanism for RDMA security is region addressing using
RID-based virtual addresses, as described in Section 4.1.  Under no
circumstances may a transmitter access memory that has not been
explicitly registered for RDMA use by the receiver.  Thus RDMA does
not introduce fundamental new security issues beyond the standard
concerns of interception and corruption of data and commands on an
insecure connection.  In this case, the concern is whether RIDs for
registered RDMA regions may be misused.

To further improve safety, each RID may include a sparse (hard to
guess) key value; only transmitters who know the key can read or
write the memory region.  RIDs protected in this way are essentially
weak capabilities.  NICs may also place access-control lists or
permissions on pages, or limit region access to specific connections.

For real security on untrusted networks, the RDMA protocol may be
protected in transit using security and endpoint authentication
features at the transport layer or below, such as TLS or IPsec.

5. RDMA APIs

Direct I/O to application buffers requires an interface for
registering buffers with the NIC and receiving notification that RDMA
transfers have completed.  It is straightforward to devise internal
kernel interfaces to enable use of RDMA for kernel-based ULPs.
However, use of RDMA by user-space applications may require
extensions to existing kernel networking APIs.  For example, the
Berkeley Unix sockets interface [Stevens], as currently specified,
does not directly support RDMA.

5.1 The VI interface

The VI programming interface [VI] supports both message passing and
RDMA.  The VI interface has calls for registering and pinning
buffers.  The interface supports both polling and asynchronous
notification of events, e.g., RDMA completions.  The VI interface
does not specify the wire protocol and allows a variety of protocols,
including IP protocols.

The VI interface assumes that user-space programs may directly access
the NIC without transitioning to kernel mode.  This precludes use of
the full VI API in conjunction with conventional TCP/IP protocol
stacks.  However, one option is to supplement the sockets interface
with RDMA-related elements of the VI interface.
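As a hypothetical sketch of what such a supplemented sockets
interface might look like, the declarations below add registration,
completion notification, and deregistration calls.  These names and
signatures are invented for illustration; no such standard API
exists.

   /* A hypothetical extension of the sockets interface with
    * RDMA-related calls, in the spirit of the VI registration and
    * completion operations described above.  These declarations
    * are invented for illustration; no such standard API exists. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t rdma_rid_t;

   /* Register (and pin) 'len' bytes at 'buf' as an RDMA target for
    * connection 'sock'; returns an RID to advertise to the peer. */
   int rdma_register(int sock, void *buf, size_t len,
                     rdma_rid_t *rid);

   /* Block until an RDMA WRITE into region 'rid' completes;
    * reports the offset and length of the data that arrived. */
   int rdma_wait(int sock, rdma_rid_t rid, size_t *off,
                 size_t *len);

   /* Remove the region mapping so the OS may unpin the buffer. */
   int rdma_deregister(int sock, rdma_rid_t rid);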
5.2 Winsock Direct

The Winsock Direct API, available on Windows 2000, is an extension of
the sockets interface that supports reliable messages and RDMA
[Winsock Direct].

6. Implementing RDMA

Conceptually, the RDMA abstraction belongs at the transport layer, so
that it generalizes to multiple ULPs.  The sending side of the RDMA
protocol is straightforward to implement at the boundary between the
ULP and the underlying transport, i.e., as a "shim" above TCP.
However, key aspects of the receiving side of an RDMA protocol are
implemented within the NIC, a link-level device that is logically
below the transport layer.  This is the crux of the problem for
implementing RDMA.

Transport-level support for enhanced framing (e.g., in TCP) would be
useful for implementing RDMA.  For RDMA to be effective, the
receiving NIC must be able to read and decode the control information
necessary for it to implement RDMA.  At a minimum, this requires it
to recognize transport-layer headers and identify RDMA control
headers embedded in the incoming data.  It is trivial to locate these
headers within an ordered byte stream using a simple byte-counting
method (length field) for framing, as the sketch below illustrates.
The difficulty is that packets may arrive at the RDMA receiver (NIC)
out of order, and some or all of the transport-layer facility to
reorder data may be implemented above the NIC, e.g., in host
software, as shown in Figure 1.  Thus there must be some mechanism
that enables the receiving NIC to retain or recover its ability to
locate RDMA headers in the presence of sequence holes, i.e., when
packets arrive out of order.
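The sketch below (frame layout illustrative) walks an ordered,
gap-free byte stream from one RDMA header to the next by counting
bytes.  Because each header's position is derived from the lengths
in the headers before it, a single sequence hole leaves the NIC
unable to locate any later header.

   /* A sketch of locating RDMA control headers in an ordered byte
    * stream by byte counting: each frame's length field gives the
    * position of the next header.  A sequence hole breaks the
    * chain, since every later header position depends on the
    * headers before it.  The frame layout is illustrative. */
   #include <stdint.h>
   #include <string.h>

   struct rdma_frame_hdr {
       uint32_t frame_len;   /* header plus payload bytes */
       uint32_t opcode;      /* e.g., RDMA WRITE          */
   };

   /* Visit each header in a contiguous, in-order stream of 'len'
    * bytes. */
   void walk_frames(const uint8_t *stream, size_t len,
                    void (*on_hdr)(const struct rdma_frame_hdr *))
   {
       size_t pos = 0;
       while (pos + sizeof(struct rdma_frame_hdr) <= len) {
           struct rdma_frame_hdr h;
           memcpy(&h, stream + pos, sizeof h);  /* alignment-safe */
           if (h.frame_len < sizeof h || h.frame_len > len - pos)
               break;           /* malformed or incomplete frame */
           on_hdr(&h);
           pos += h.frame_len;  /* next header follows this frame */
       }
   }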
One option is for the NIC to buffer out-of-order data until any late
packets arrive, allowing the NIC to recover any lost framing
information.  Note that this does not preclude delivering the
out-of-order data to the host along a slow path that does not benefit
from RDMA.  Keeping a copy of the data until all sequence holes are
filled allows the NIC to traverse the RDMA headers in the data
stream, positioning it to locate subsequent RDMA headers and
re-establish the RDMA fast path.  If the NIC does not have sufficient
memory to buffer the data, it may discard it, forcing the sender to
retransmit more of the data after a sequence hole.

A second option is to integrate framing support into the transport,
allowing the receiver to locate RDMA headers even when packets arrive
out of order.  Note that every packet must contain an RDMA header for
this approach to be fully general.  For example, consider a packet
carrying an RDMA header that applies to data in subsequent packets.
Even with enhanced framing, if the packet containing the RDMA header
is lost, the NIC cannot correctly apply the RDMA operation to the
arriving data until it receives the RDMA header.

Several alternatives have been proposed for integrating framing into
TCP.  These include introducing a new TCP option [TCPRDMA] and
constraining the TCP sender's selection of segment boundaries to
correspond with framing boundaries [VITCP].  Each of these approaches
would have some impact on TCP implementations and APIs, and some of
them also extend the wire protocol.

The TCP options approach requires a minor extension of the TCP wire
protocol and modification of both the sender and the receiver, which
is especially painful given today's inflexible in-kernel TCP
implementations.  The TCP options approach does not break backward
compatibility, since unmodified endpoints will not negotiate the
option.  Also, the options information is regarded only as an
optimization; it is not required for the application to parse the TCP
stream.

7. Conclusion

Remote DMA provides for efficient placement of data in memory.  The
NIC writes data into memory with the proper alignment.  Furthermore,
the NIC can often place data directly into application buffers.

The Remote DMA abstraction provides a generalized mechanism useful
with many higher-level protocols such as NFS, without the need for
ULP support in the NIC, and with only minor extensions to ULP
protocol implementations.

Authors' Addresses

   Constantine Sapuntzakis
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Phone: +1 408 525 5497
   Email: csapuntz@cisco.com

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Jeff Chase
   Department of Computer Science
   Duke University
   Durham, NC 27708-0129
   USA

   Phone: +1 919 660 6559
   Email: chase@cs.duke.edu

References

   [ALF] D. D. Clark and D. L. Tennenhouse, "Architectural
      considerations for a new generation of protocols," in SIGCOMM
      Symposium on Communications Architectures and Protocols
      (Philadelphia, Pennsylvania), pp. 200-208, IEEE, Sept. 1990.
      Computer Communications Review, Vol. 20(4), Sept. 1990.

   [Brustoloni] J. Brustoloni and P. Steenkiste, "Effects of
      buffering semantics on I/O performance," in Operating Systems
      Design and Implementation (OSDI), Seattle, WA, Oct. 1996.

   [Chase] J. Chase, A. Gallatin, and K. Yocum, "End-System
      Optimizations for High-Speed TCP," IEEE Communications special
      issue on high-speed TCP, 2001.
      http://www.cs.duke.edu/ari/publications/end-system.ps (or .pdf)

   [CIFS] P. Leach, "A Common Internet File System (CIFS/1.0)
      Protocol, Preliminary Draft,"
      http://www.cifs.com/specs/draft-leach-cifs-v1-spec-01.txt,
      December 1997.

   [HTTP] J. Gettys et al., "Hypertext Transfer Protocol - HTTP/1.1,"
      RFC 2616, June 1999.

   [NFSv3] B. Callaghan, "NFS Version 3 Protocol Specification,"
      RFC 1813, June 1995.

   [RPC] R. Srinivasan, "RPC: Remote Procedure Call Protocol
      Specification Version 2," RFC 1831, August 1995.

   [iSCSI] J. Satran et al., "iSCSI," draft-ietf-ips-iscsi-01.txt.

   [Stevens] W. R. Stevens, "Unix Network Programming, Volume 1,"
      Prentice Hall, 1998, ISBN 0-13-490012-X.

   [TCP] J. Postel, "Transmission Control Protocol - DARPA Internet
      Program Protocol Specification," RFC 793, September 1981.

   [TCPRDMA] C. Sapuntzakis and D. Cheriton, "TCP RDMA option,"
      http://www.ietf.org/internet-drafts/draft-csapuntz-tcprdma-00.txt

   [Winsock Direct] "Winsock Direct Specification," Windows 2000 DDK,
      http://www.microsoft.com/ddk/ddkdocs/win2K/wsdspec_1h66.htm

   [VI] Virtual Interface Architecture Specification, Version 1.0,
      http://www.viarch.org/

   [VITCP] S. DiCecco et al., "VI/TCP (Internet VI)",