INTERNET-DRAFT                                          C. Sapuntzakis
                                                          Cisco Systems
                                                             A. Romanow
                                                          Cisco Systems
                                                               J. Chase
                                                         Duke University

draft-csapuntz-caserdma-00.txt                            December 2000

                           The Case for RDMA

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) Cisco Systems (2000).  All Rights Reserved.

Abstract

The end-to-end performance of IP networks for bulk data transfer is
often limited by data copying overhead in the end systems.  Even when
end systems can sustain the bandwidth of high-speed networks, copying
overheads often limit their ability to carry out other processing
tasks.

Remote Direct Memory Access (RDMA) is a facility for avoiding copying
for network communication in a general and comprehensive way.  RDMA
is particularly useful for protocols that transmit bulk data mixed
with control information, such as NFS, CIFS, and HTTP, or
encapsulated device protocols such as iSCSI.  While networking
architectures such as the Virtual Interface (VI) architecture support
RDMA, there is no standard for RDMA over IP networks.  Such a
standard would allow vendors of IP-aware network hardware (such as
TCP-capable network adapters) to incorporate support for RDMA into
their products.

This document reviews the I/O performance issues addressed by RDMA
and considers issues for supporting the key elements of RDMA in an IP
networking context.
Glossary

   header/payload splitting - any technique that enables a NIC to
      deposit incoming protocol headers and payloads into separate
      host buffers

   headers - control information used by the protocol

   HBA - host bus adapter, a network adapter (see NIC)

   I/O operation - a request to a device, then a transfer to/from
      that device, and a status response

   MTU - maximum transmission unit, the largest packet size that a
      given network device or path can carry

   NIC - network interface card/controller (see HBA)

   payload - in general, uninterpreted data transported by a protocol

   payload steering - any technique that enables a NIC to deposit an
      incoming protocol payload into a buffer designated for that
      specific payload

   protocol stack - the layers of software, firmware, or hardware
      that implement communication between applications across a
      network

   region and region identifier (RID) - a memory buffer region
      reserved and registered for use with RDMA requests, and its
      unique identifier

   solicited data - data that was sent in response to some control
      message

   unsolicited data - data that was sent without being requested

   upper-layer protocol (ULP) - an application-layer protocol like
      NFS, CIFS, HTTP, or iSCSI

1. Introduction

The principal use of the Internet and IP networks today is
buffer-to-buffer transfer, often in the form of file or block
transfers.  This is done using a variety of protocols: HTTP, FTP,
NFS, and CIFS.  Soon, iSCSI will be added to this list.

These upper-layer protocols (ULPs) all have one thing in common: the
majority of the bytes they send on the network are data "payloads"
that are uninterpreted by the protocol or the network.

Each ULP has its own way of requesting and initiating data transfers.
The ULPs differ in the kinds of control information or meta-data
(e.g., cache coherence information) they specify and send across the
wire.  However, all of these protocols eventually come down to
transporting large blocks of uninterpreted data from a local buffer
to a remote buffer.  Transferring a payload from one host to another
is essentially a buffer-to-buffer data transfer (like the C memcpy
function) carried out over the network.  For example, one use of HTTP
is to transfer JPEG images from a web server into a web browser's
address space.

Today, gigabit-speed buffer-to-buffer network transfers consume
significant memory bandwidth and CPU time on receivers.  With the
advent of IP checksum hardware, the end-system overhead for network
transfers is dominated by the cost of copying to place incoming data
correctly in the receiver's memory buffers.  Although CPUs are
rapidly becoming more powerful, network bandwidth has kept pace with
and even exceeded Moore's Law in recent years.  Moreover, copying is
limited by memory system performance, which is not improving as fast
as CPU speeds.

One solution to this problem is to place the data in the correct
memory buffer directly as it arrives from the network, avoiding the
need to copy it into the correct buffer after it has arrived.  If the
network interface (NIC) could place data correctly in memory, it
would free up the memory bandwidth and CPU cycles consumed by
copying.

A number of mechanisms already exist to reduce copying overhead in
the IP stack.
Some of these mechanisms depend on fragile assumptions
about the hardware and application buffers, others involve ad hoc
support for specific protocols and communication scenarios, and all
of them impose costs that may be prohibitive in some scenarios.

However, a mechanism called Remote Direct Memory Access (RDMA) offers
a solution that is simple, general, complete, and robust.  RDMA
introduces new control information into the communication stream that
directs data movement for buffer-to-buffer transfers.  Incorporating
support for RDMA into network protocols can significantly reduce the
cost of network buffer-to-buffer transfers.

RDMA accomplishes exact data placement via a generalized abstraction
at the boundary between the ULP and its transport (e.g., TCP),
allowing an RDMA-capable NIC to recognize and steer payloads
independently of the specific ULP.  Using RDMA, ULPs gain efficient
data placement without the need to program ULP-specific details into
the NIC.  RDMA thus speeds deployment of new protocols, because the
firmware or hardware on the NIC need not be rewritten to accelerate
each new protocol.

To be effective, the receiving NIC must recognize the RDMA control
information, and ULP implementations or applications must be modified
to generate the RDMA control information.  In addition, support for
framing in the transport protocols would allow an RDMA-capable NIC to
locate RDMA control information in the stream when packets arrive out
of order.

Historically, network protocols and implementations have addressed
the issue of demultiplexing multiple streams arriving at an
interface.  However, there are still no accepted solutions for
demultiplexing control and data arriving on a single stream.  Much
current network traffic is characterized by a small amount of control
accompanying a large amount of data.  RDMA enables efficient data
payload steering for this common case, which is especially important
as data rates increase.

This document is somewhat tutorial: it seeks to set out clearly the
I/O performance issues addressed by RDMA and the design alternatives
for an RDMA facility.  It considers proposed approaches to these
problems, clarifying the benefits and costs of deploying and using
RDMA.

The document is organized as follows.  Section 2 describes the copy
overhead problem in detail.  Section 3 discusses various alternatives
to a general RDMA facility.  Section 4 describes the RDMA approach in
detail, including its handling of unsolicited data.  Section 5
discusses RDMA APIs, and Section 6 considers RDMA implementation
issues.

2. The I/O Performance Problem

Figure 1 shows a block diagram illustrating the layers involved in
transferring data in and out of a host system.  We will call these
layers the network I/O stack.  Each boundary in the diagram
corresponds to an I/O interface.  In general, we assume that all the
modules represented in Figure 1 (except for the NIC) run on the host
CPU, although RDMA is equally useful if portions of the I/O stack run
on the NIC.
   |-----------------------|

         Application

   |-----------+-----------|
    File       |
    System     |   Block
    Interface  |   Interface
   |-----------+-----------|
    Upper-Layer Protocol
    Stack (NFS, CIFS,
    SCSI/iSCSI, HTTP)
   |-----------------------|

    Network Stack (IP, TCP)

   |-----------------------|

            NIC

   |-----------------------|

   Figure 1: The network I/O stack

In IP networks, end-system CPUs may incur substantial overhead from
copying data in memory as part of I/O operations.  Copying is
necessary in order to align data, place data contiguously in memory,
or place data in specific buffers supplied by the application or ULP
module.  These properties may be important to applications for
several reasons.

Alignment is important because most CPU architectures impose
alignment constraints on data accessed in units larger than a byte,
e.g., for incoming data interpreted as integers.

Contiguity of data in memory simplifies the book-keeping data
structures that describe the data and improves memory utilization by
reducing fragmentation of free space.  Data contiguity may also
simplify algorithms that traverse the data, reducing execution time;
for example, contiguity enables sequential memory access.

Common network APIs such as sockets [Stevens] allow applications to
designate specific buffers for incoming data, requiring a copy to
place the incoming data correctly.  It may be possible to avoid the
copy by page remapping (see Section 3.2), but only if the data is
contiguous, occupies complete memory pages, and is page-aligned
relative to the application's buffer.  Similarly, storage protocols
such as NFS and iSCSI may require contiguous, page-aligned data for
buffering in the system I/O cache.

This document concentrates on how to eliminate unnecessary data
copies used to assure correct placement of incoming data.

Some have argued that the expense of these data copies can be partly
masked if some other data scanning operation, such as checksumming or
decryption, runs over the data simultaneously (see [ALF]).  However,
such optimizations are highly processor-dependent and may not yield
the expected benefits [Chase].  Moreover, this approach is not useful
unless other data scanning operations are handled in software;
hardware support for checksumming and decryption is increasingly
common.

In recent years, valuable progress has been made in minimizing other
sources of networking overhead.  Examples include checksum
offloading, extended Ethernet frames, and interrupt suppression.  For
a review and evaluation of various solutions see [Chase].  These
issues are not discussed in this document.

2.1 Copy on receive

The primary issue addressed here is how application data is received
from the network.  In many I/O interfaces, when an application reads
data, it specifies the buffer into which the data should be received.
However, today's generic NICs are incapable of placing data directly
into the supplied buffer, largely because direct placement requires
more complexity and intelligence than generic NICs provide.  For
example, to accomplish this task a NIC would need to separate
payloads from ULP and transport headers, parse headers, and
demultiplex multiple incoming packet streams.

Most NICs today are not this sophisticated in their handling of
incoming data streams.  Instead, they deposit incoming packets into
generic host buffers supplied by the network stack software.  Both
the network and ULP stacks sift through the packets, looking
successively at headers from the link layer (e.g., Ethernet), IP, the
transport, and the ULP.  Eventually, the data payload is recognized
and copied from the network buffers to the correct application
buffer.
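As a concrete illustration of this conventional path, the sketch
below uses the standard sockets API.  The application designates a
specific buffer (function and variable names are illustrative), and
because a generic NIC has already deposited the payload into kernel
network buffers, the kernel must copy the bytes to satisfy each call.

   /* A minimal sketch of the conventional receive path described
    * above, using the standard sockets API.  The application
    * supplies a specific buffer, but a generic NIC has already
    * deposited the payload into kernel network buffers, so each
    * recv() implies a kernel-to-user copy.  Names are
    * illustrative. */
   #include <stddef.h>
   #include <sys/types.h>
   #include <sys/socket.h>

   ssize_t receive_block(int sock, void *buf, size_t len)
   {
       size_t done = 0;
       while (done < len) {
           /* The kernel copies payload bytes from its network
            * buffers into the application buffer here. */
           ssize_t n = recv(sock, (char *)buf + done, len - done, 0);
           if (n <= 0)
               return n;           /* error or connection closed */
           done += n;
       }
       return (ssize_t)done;
   }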
2.2 Copy on transmit

For the most part, sending data from applications to the network
should not require copies in the I/O stack.  Today's network adapters
can gather data from anywhere in memory to form a packet, so no copy
is necessary to align outgoing packet data for the NIC.

Copying can be used as a technique to ensure that the data is not
modified between the time it is passed from the application to the
I/O interface and the time the data transfer completes.  Other
well-known solutions exist that do not involve copying [Brustoloni].

Copy on transmit is not discussed further.

3. Non-RDMA solutions

A range of ad hoc solutions avoid copying of incoming data without
requiring RDMA.  These include:

   - scatter-gather buffers
   - header/payload separation
   - parsing the ULP on the NIC

3.1 Scatter-gather buffers

Once the NIC has written the application data to memory, a copy can
be avoided if we tell the application where to find its data in
memory.  The application data may be scattered in memory because it
may have arrived in multiple packets.  A data structure called a
scatter-gather buffer is used to tell the application the location of
the data.  Scatter-gather buffering is the only known copy avoidance
technique that does not require direct support on the NIC.

This solution is not compatible with existing I/O interfaces, such as
the sockets interface.  Also, in this approach, data is not
necessarily contiguous in memory or page-aligned.  For example, the
data cannot in general be delivered securely to a user-level process
without copying, since mapping the pages containing the received data
into a user process's address space exposes those pages in their
entirety, not just the portions occupied by the received data.

However, scatter-gather buffering is a viable copy avoidance
technique for kernel-based applications where few data
transformations are needed.  For file system protocols, effective use
of scatter-gather buffering may require a redesign of the file buffer
cache and/or virtual memory page cache.
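A scatter-gather buffer is typically an array of (address, length)
descriptors, as in the POSIX struct iovec.  The sketch below
(function names are illustrative) shows a consumer walking such a
list instead of expecting one contiguous, aligned buffer.

   /* A sketch of consuming a scatter-gather buffer: an array of
    * (address, length) descriptors, here the POSIX struct iovec.
    * The consumer visits each segment where it lies in memory
    * rather than requiring one contiguous buffer.  Names are
    * illustrative. */
   #include <stddef.h>
   #include <sys/uio.h>

   void consume_scattered(const struct iovec *iov, int iovcnt,
                          void (*consume)(const void *seg, size_t n))
   {
       /* Segments are visited in arrival order, wherever the
        * packets happened to land in memory. */
       for (int i = 0; i < iovcnt; i++)
           consume(iov[i].iov_base, iov[i].iov_len);
   }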
3.2 Ad hoc header/payload separation

A more sophisticated NIC might recognize transport and/or ULP headers
in order to separate the headers from the payloads.  Each payload is
then "split" from its header and placed in a separate buffer.
Header/payload splitting is useful for copy avoidance because a
virtual memory system may then map the payload to an application
buffer by manipulating virtual memory translations to point to the
payload.  This approach, called "page flipping" or "page remapping",
is an alternative to copying for delivering data into application
buffers.  A prerequisite for page flipping is that the application
buffer must be page-aligned and contiguous in virtual memory.

Header/payload splitting adds significant complexity to the NIC.  If
the network MTU is smaller than the hardware page size, then the
transfer of a page of data is spread across multiple packets.  These
packets can arrive at the receiver out of order and/or interspersed
with packets from other flows.  In order to pack the data
contiguously into pages, the NIC must do intelligent processing of
the transport and ULP.  This approach is "ad hoc" because the NIC
must include support for each transport and ULP that benefits from
page flipping.  The NIC processing may be unnecessarily complex for
ULPs, such as NFS, that use variable-length headers or that require
ULP-level state to decode the incoming headers.

A key disadvantage is that page flipping requires TLB invalidations,
which can be prohibitively expensive on shared-memory
multiprocessors.

3.3 Explicit header/payload separation

The previous section discussed header/payload separation implemented
in an ad hoc fashion.  It is also possible to implement a more
general method of header/payload splitting that does not require the
NIC to decode ULP headers.  A generic framing mechanism implemented
at the transport layer, or just above it, could include frame header
fields that distinguish the ULP payload from the ULP header.  This
would enable a receiving NIC to separate received data payloads from
control information and deposit the payload data in contiguous,
page-aligned target buffer locations.  Under most conditions this is
sufficient to allow low-copy implementations of ULPs such as NFS.

The RDMA approach explored in this document is a more general
extension of this approach.
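As a hypothetical illustration of such generic framing, a shim above
the transport might prefix each ULP message with a small header
giving the lengths of the control and payload portions; the NIC then
needs only these fields, not ULP knowledge, to split the two.  The
layout below is illustrative, not a defined wire format.

   /* A hypothetical frame header for generic header/payload
    * separation.  A shim above the transport prefixes each ULP
    * message with this header; a NIC can split control bytes from
    * payload using only these fields, with no knowledge of the
    * ULP.  The layout is illustrative, not a defined wire
    * format. */
   #include <stdint.h>

   struct frame_hdr {
       uint32_t frame_len;    /* total bytes in this frame,
                               * including this header           */
       uint16_t control_len;  /* bytes of ULP header that follow */
       uint16_t flags;        /* e.g., payload wants page-aligned
                               * placement                       */
   };
   /* payload_len = frame_len - sizeof(struct frame_hdr)
    *               - control_len */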
3.4 Terminate the ULP in the NIC

If the NIC terminates the ULP, the memory copy is eliminated because
the application communicates I/O requests directly to the NIC.  The
NIC uses the information in the ULP headers to steer ULP payloads to
the correct application buffers.  This is commonly done in the
FibreChannel arena, where FibreChannel NICs (or host bus adapters)
implement an I/O block transport (e.g., SCSI) on the NIC.  This
approach effectively migrates all modules of the network stack in
Figure 1 onto the NIC.  FibreChannel implementations use this
technique to deliver high performance with low host overhead.

In such a scheme, the NIC needs to be informed of specific
application buffers.  The NIC also needs to be capable of
header/payload splitting.

While this approach may be useful for single-function devices, it is
inappropriate for general-purpose NICs: the NIC must be reprogrammed
or extended to accelerate each ULP.  RDMA offers a general mechanism
that allows RDMA-capable NICs to avoid copies for any ULP that uses
RDMA.

4. Remote Direct Memory Access (RDMA)

This section outlines how RDMA works.

Direct memory access (DMA) is a fundamental technique that is widely
used in high-performance I/O systems.  DMA allows a device to
directly read or write host memory across an I/O interconnect (such
as PCI) by sending DMA commands to the memory controller.  No CPU
intervention or copying is required.  For example, when a host
requests an I/O read operation from a DMA-capable storage device, the
device uses a DMA write to place the incoming data directly into
memory buffers that the host provides for that specific operation.
Similarly, when the host requests an I/O write operation, the device
uses a DMA read to fetch outgoing data from host memory buffers
specified by the host for that operation.

Remote DMA can provide similar functionality in IP networks.  It is
particularly useful when an IP network is used as an I/O interconnect
for IP-capable devices, such as storage devices and their servers.
Conceptually, RDMA allows a network-attached device to read or write
remote memory, e.g., by adding control information that specifies the
buffers to receive transmitted payloads.  The remote NIC decodes this
control information and uses DMA to read or write memory, effectively
translating between the RDMA protocol and the local memory access
protocol.  In an IP network, the RDMA protocol appears at the
transport layer (e.g., as a "shim" above an existing transport
protocol such as TCP) so that a wide variety of upper-layer protocols
can make use of it with minimal changes.

The idea of RDMA has been around under various names for many years.
RDMA is an important component of the VI architecture for user-level
networking, and is also a key element of the InfiniBand effort.  VI
illustrates one alternative for a networking API that accommodates
RDMA (see Section 5.1).  However, RDMA generalizes to other network
architectures.  This document addresses issues for incorporating RDMA
into conventional IP protocol stacks.  Note that VI can run over an
IP transport such as TCP, but only if the NIC implements the full
transport.

Since TCP is the most widely used transport for upper-layer
protocols, using RDMA with TCP is the first case to consider.
However, RDMA can be used with other transport protocols, notably
SCTP.

4.1 How RDMA works

An RDMA facility embeds new RDMA control commands into the byte
stream or packet stream.  A full RDMA protocol includes two key
commands: RDMA READ and RDMA WRITE.  The receiving NIC translates
these commands into local memory reads and writes.

For security reasons, it is undesirable to allow transmitters to read
or write arbitrary memory on the receiver.  Any RDMA scheme must
prevent unauthorized memory accesses.  Most RDMA schemes protect
memory by allowing RDMA reads and writes only to buffers that the
receiver has explicitly identified to the NIC as valid RDMA targets.
The process of informing the NIC about a buffer is called
"registration".

The following steps illustrate the common case of a data transfer
using RDMA WRITE in the context of a request/response storage
protocol such as NFS or iSCSI:

   1. The client application calls an I/O interface, requesting that
      the result of the I/O be put into a buffer B.

   2. The client implementation registers buffer B with the NIC.

   3. The client sends the I/O READ request to the server.

   4. The server issues one or more RDMA WRITEs to write the I/O
      data into the client's buffer B.

   5. The server sends the file system READ response for the I/O.

Of course, on each I/O operation the server must know which client
addresses to write.  One alternative is for the client to pass a
token identifying the target buffer in the request; the server
returns the token in its response.  This is the approach used in VI
implementations.  Another alternative is for the client and server to
each synthesize the token from other unique identifiers present in
the request [TCPRDMA].
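The structures below sketch what the client's token (steps 2-3) and
the server's RDMA WRITE control header (step 4) might carry.  The
field names and layout are illustrative, not taken from any
specification.

   /* Illustrative structures for the exchange above; field names
    * and layout are hypothetical, not from any RDMA
    * specification. */
   #include <stdint.h>

   /* Token the client obtains by registering buffer B (step 2) and
    * embeds in its I/O READ request (step 3). */
   struct rdma_token {
       uint32_t rid;       /* region identifier from registration */
       uint32_t offset;    /* starting offset within the region   */
       uint32_t length;    /* bytes the region can accept         */
   };

   /* Control header the server emits ahead of each chunk of
    * payload in step 4.  The client's NIC uses (rid, offset) to
    * place the following 'length' bytes directly into buffer B. */
   struct rdma_write_hdr {
       uint32_t opcode;    /* RDMA WRITE                          */
       uint32_t rid;       /* copied from the client's token      */
       uint32_t offset;    /* placement offset for this chunk     */
       uint32_t length;    /* payload bytes that follow           */
   };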
Most RDMA schemes use a region identifier (RID) and an offset to
identify the target buffer in a token.  The (RID, offset) pair
amounts to a form of virtual address; the receiving NIC translates
these virtual addresses to physical addresses using a table lookup.
Consequently, if a mapping to a physical page does not appear in the
table, there is no way a transmitter can refer to that page.

Once an entry is in the table, the NIC can potentially access the
physical memory of the buffer at any time, so the buffer must not be
reused for other purposes.  One alternative is for the OS to "pin"
the buffer in physical memory, allowing the NIC to safely hold the
physical addresses corresponding to the buffer.  Once the region
mapping is removed, the OS can "unpin" the physical memory.
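The sketch below shows the NIC-side lookup in its simplest form.
The data structures are illustrative, and for brevity each region is
assumed physically contiguous; a real table would translate page by
page.  Any (RID, offset) not covered by a registered region simply
has no translation.

   /* A minimal sketch of the NIC-side (RID, offset) -> physical
    * address lookup described above.  Data structures are
    * illustrative; regions are assumed physically contiguous. */
   #include <stdint.h>

   struct region_entry {
       int      valid;      /* set at registration, cleared when
                             * the region mapping is removed      */
       uint64_t phys_base;  /* physical base of the pinned buffer */
       uint32_t length;     /* size of the registered region      */
   };

   #define MAX_REGIONS 1024
   static struct region_entry region_table[MAX_REGIONS];

   /* Return the physical address for an access of 'len' bytes at
    * (rid, offset), or 0 if no registered region covers it. */
   uint64_t rdma_translate(uint32_t rid, uint32_t offset,
                           uint32_t len)
   {
       if (rid >= MAX_REGIONS || !region_table[rid].valid)
           return 0;                   /* unknown or removed RID */
       if (offset > region_table[rid].length ||
           len > region_table[rid].length - offset)
           return 0;                   /* access out of bounds   */
       return region_table[rid].phys_base + offset;
   }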
4.2 Unsolicited payloads

NFS, CIFS, and HTTP all support sending data along with a WRITE (or
POST) request.  This is optimistic: it assumes the receiving
application has space (other than the TCP window) to buffer the WRITE
payload.  The payload and transfer are called "unsolicited" in that
they were not requested by the receiver.  RDMA WRITE is
straightforward for solicited data, since the sender can receive the
RID and buffer address in the message that solicits the data, as in
the preceding example.  In the case of unsolicited data, it is not
clear how the sender obtains the RID necessary for an RDMA WRITE.

RDMA may be used for unsolicited data in the following way.  The
receiver may expose a memory region for unsolicited data from each
sender.  The sender, when it wishes to do an unsolicited WRITE, can
RDMA its data into that region.  Then, along with the WRITE request,
the sender may pass a pointer (e.g., a region offset) to the data it
wrote.  This requires that the receiver (server) pass an RID for
unsolicited data at connection open and supply a new region if the
unsolicited region fills.  Alternatively, the receiver may handle
unsolicited data by responding to the WRITE request with an RDMA READ
(if supported) to fetch the data, as described in Section 4.3.

4.3 Reading remote memory

Some RDMA protocols allow one party to read another's memory with an
RDMA READ operation.  As with RDMA WRITE, the NICs, not the CPUs,
process RDMA READs.

The receiving NIC may complete the RDMA READ from the receiver's
memory without interrupting the CPU.  This is potentially useful
because CPU interrupts are expensive in general-purpose systems:
switching between the currently executing task and the interrupt
handler involves flushing pipelines, saving and restoring context,
and other overheads.

Although any RDMA READ may be emulated using an RDMA WRITE in the
opposite direction, RDMA READ has potential advantages.  First, an
RDMA READ requester does not need to export a region RID to receive
the incoming data, as it would for an RDMA WRITE.  This is useful
because it allows servers to avoid reserving and exposing memory
regions for large numbers of clients.  Second, RDMA READ allows the
requester to control the order and rate of data transmitted by the
sender (the RDMA READ target).

For example, a network storage device or server may implement write
operations by issuing RDMA READs to its client, rather than allowing
the client to use RDMA WRITE to transfer the data to the server.
This allows the server to control use of the buffer space it
allocates for the transfers, and to pull the data from the client in
an order that is convenient for the server, e.g., to optimize disk
performance.  The emerging VI-based Direct Access File System uses
RDMA READ for file write operations, in part for these reasons.

RDMA READ is more complex than RDMA WRITE because it implies that the
target NIC autonomously transmits data back to the requester, e.g.,
without involving a host CPU.  This implies that the NIC implements
the complete transport protocol necessary to send such data without
involving or interfering with the protocol stack in host software.

Use of RDMA READ requires ULPs designed to take advantage of it, as
well as more powerful NICs.  While it offers several benefits, there
may be alternative means to achieve many of the same benefits, such
as simple interrupt-suppressing NICs and ULP protocol features to
control the rate and order of data flow, as provided in the iSCSI
draft specification [iSCSI].

In contrast to RDMA READ, RDMA WRITE is simple and general, does not
require a full implementation of the transport on the NIC, and is
easily incorporated into existing request/response protocols with
minimal impact.  The remainder of this document focuses on RDMA
WRITE.

4.4 Security

The principal mechanism for RDMA security is region addressing using
RID-based virtual addresses, as described in Section 4.1.  Under no
circumstances may a transmitter access memory that has not been
explicitly registered for RDMA use by the receiver.  Thus RDMA does
not introduce fundamental new security issues beyond the standard
concerns of interception and corruption of data and commands on an
insecure connection.  In this case, the concern is whether RIDs for
registered RDMA regions may be misused.

To further improve safety, each RID may include a sparse (hard to
guess) key value; only transmitters who know the key can read or
write the memory region.  RIDs protected in this way are essentially
weak capabilities.  NICs may also place access-control lists or
permissions on pages, or limit region access to specific connections.

For real security on untrusted networks, the RDMA protocol may be
protected in transit using security and endpoint authentication
features at the transport layer or below, such as TLS or IPsec.

5. RDMA APIs

Direct I/O to application buffers requires an interface for
registering buffers with the NIC and receiving notification that RDMA
transfers have completed.  It is straightforward to devise internal
kernel interfaces to enable use of RDMA for kernel-based ULPs.
However, use of RDMA by user-space applications may require
extensions to existing kernel networking APIs.  For example, the
Berkeley Unix sockets interface [Stevens], as currently specified,
does not directly support RDMA.

5.1 The VI interface

The VI programming interface [VI] supports both message passing and
RDMA.  The VI interface has calls for registering and pinning
buffers.  The interface supports both polling and asynchronous
notification of events, e.g., RDMA completions.  The VI interface
does not specify the wire protocol and allows a variety of protocols,
including IP protocols.

The VI interface assumes that user-space programs may directly access
the NIC without transitioning to kernel mode.  This precludes use of
the full VI API in conjunction with conventional TCP/IP protocol
stacks.  However, one option is to supplement the sockets interface
with RDMA-related elements of the VI interface.
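As a hypothetical sketch of what such a supplemented sockets
interface might look like, the declarations below add registration,
completion notification, and deregistration calls.  These names and
signatures are invented for illustration; no such standard API
exists.

   /* A hypothetical extension of the sockets interface with
    * RDMA-related calls, in the spirit of the VI registration and
    * completion operations described above.  These declarations
    * are invented for illustration; no such standard API exists. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t rdma_rid_t;

   /* Register (and pin) 'len' bytes at 'buf' as an RDMA target for
    * connection 'sock'; returns an RID to advertise to the peer. */
   int rdma_register(int sock, void *buf, size_t len,
                     rdma_rid_t *rid);

   /* Block until an RDMA WRITE into region 'rid' completes;
    * reports the offset and length of the data that arrived. */
   int rdma_wait(int sock, rdma_rid_t rid, size_t *off,
                 size_t *len);

   /* Remove the region mapping so the OS may unpin the buffer. */
   int rdma_deregister(int sock, rdma_rid_t rid);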
5.2 Winsock Direct

The Winsock Direct API, available on Windows 2000, is an extension of
the sockets interface that supports reliable messages and RDMA
[Winsock Direct].

6. Implementing RDMA

Conceptually, the RDMA abstraction belongs at the transport layer, so
that it generalizes to multiple ULPs.  The sending side of the RDMA
protocol is straightforward to implement at the boundary between the
ULP and the underlying transport, i.e., as a "shim" above TCP.
However, key aspects of the receiving side of an RDMA protocol are
implemented within the NIC, a link-level device that is logically
below the transport layer.  This is the crux of the problem for
implementing RDMA.

Transport-level support for enhanced framing (e.g., in TCP) would be
useful for implementing RDMA.  For RDMA to be effective, the
receiving NIC must be able to read and decode the control information
necessary for it to implement RDMA.  At a minimum, this requires it
to recognize transport-layer headers and identify RDMA control
headers embedded in the incoming data.  It is trivial to locate these
headers within an ordered byte stream using a simple byte-counting
method (length field) for framing, as the sketch below illustrates.
The difficulty is that packets may arrive at the RDMA receiver (NIC)
out of order, and some or all of the transport-layer facility to
reorder data may be implemented above the NIC, e.g., in host
software, as shown in Figure 1.  Thus there must be some mechanism
that enables the receiving NIC to retain or recover its ability to
locate RDMA headers in the presence of sequence holes, i.e., when
packets arrive out of order.
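The sketch below (frame layout illustrative) walks an ordered,
gap-free byte stream from one RDMA header to the next by counting
bytes.  Because each header's position is derived from the lengths
in the headers before it, a single sequence hole leaves the NIC
unable to locate any later header.

   /* A sketch of locating RDMA control headers in an ordered byte
    * stream by byte counting: each frame's length field gives the
    * position of the next header.  A sequence hole breaks the
    * chain, since every later header position depends on the
    * headers before it.  The frame layout is illustrative. */
   #include <stdint.h>
   #include <string.h>

   struct rdma_frame_hdr {
       uint32_t frame_len;   /* header plus payload bytes */
       uint32_t opcode;      /* e.g., RDMA WRITE          */
   };

   /* Visit each header in a contiguous, in-order stream of 'len'
    * bytes. */
   void walk_frames(const uint8_t *stream, size_t len,
                    void (*on_hdr)(const struct rdma_frame_hdr *))
   {
       size_t pos = 0;
       while (pos + sizeof(struct rdma_frame_hdr) <= len) {
           struct rdma_frame_hdr h;
           memcpy(&h, stream + pos, sizeof h);  /* alignment-safe */
           if (h.frame_len < sizeof h || h.frame_len > len - pos)
               break;           /* malformed or incomplete frame */
           on_hdr(&h);
           pos += h.frame_len;  /* next header follows this frame */
       }
   }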
One option is for the NIC to buffer out-of-order data until any late
packets arrive, allowing the NIC to recover any lost framing
information.  Note that this does not preclude delivering the
out-of-order data to the host along a slow path that does not benefit
from RDMA.  Keeping a copy of the data until all sequence holes are
filled allows the NIC to traverse the RDMA headers in the data
stream, positioning it to locate subsequent RDMA headers and
re-establish the RDMA fast path.  If the NIC does not have sufficient
memory to buffer the data, it may discard it, forcing the sender to
retransmit more of the data after a sequence hole.

A second option is to integrate framing support into the transport,
allowing the receiver to locate RDMA headers even when packets arrive
out of order.  Note that every packet must contain an RDMA header for
this approach to be fully general.  For example, consider a packet
carrying an RDMA header that applies to data in subsequent packets.
Even with enhanced framing, if the packet containing the RDMA header
is lost, the NIC cannot correctly apply the RDMA operation to the
arriving data until it receives the RDMA header.

Several alternatives have been proposed for integrating framing into
TCP.  These include introducing a new TCP option [TCPRDMA] and
constraining the TCP sender's selection of segment boundaries to
correspond with framing boundaries [VITCP].  Each of these approaches
would have some impact on TCP implementations and APIs, and some of
them also extend the wire protocol.

The TCP options approach requires a minor extension of the TCP wire
protocol and modification of both the sender and the receiver, which
is especially painful given today's inflexible in-kernel TCP
implementations.  The TCP options approach does not break backward
compatibility, since unmodified endpoints will not negotiate the
option.  Also, the options information is regarded only as an
optimization; it is not required for the application to parse the TCP
stream.

7. Conclusion

Remote DMA provides for efficient placement of data in memory.  The
NIC writes data into memory with the proper alignment.  Furthermore,
the NIC can often place data directly into application buffers.

The Remote DMA abstraction provides a generalized mechanism useful
with many higher-level protocols such as NFS, without the need for
ULP support in the NIC, and with only minor extensions to ULP
protocol implementations.

Authors' Addresses

   Constantine Sapuntzakis
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Phone: +1 408 525 5497
   Email: csapuntz@cisco.com

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Jeff Chase
   Department of Computer Science
   Duke University
   Durham, NC 27708-0129
   USA

   Phone: +1 919 660 6559
   Email: chase@cs.duke.edu

References

   [ALF] D. D. Clark and D. L. Tennenhouse, "Architectural
      considerations for a new generation of protocols," in SIGCOMM
      Symposium on Communications Architectures and Protocols
      (Philadelphia, Pennsylvania), pp. 200-208, IEEE, Sept. 1990.
      Computer Communications Review, Vol. 20(4), Sept. 1990.

   [Brustoloni] J. Brustoloni and P. Steenkiste, "Effects of
      buffering semantics on I/O performance," in Operating Systems
      Design and Implementation (OSDI), Seattle, WA, Oct. 1996.

   [Chase] J. Chase, A. Gallatin, and K. Yocum, "End-System
      Optimizations for High-Speed TCP," IEEE Communications special
      issue on high-speed TCP, 2001.
      http://www.cs.duke.edu/ari/publications/end-system.ps (or .pdf)

   [CIFS] P. Leach, "A Common Internet File System (CIFS/1.0)
      Protocol, Preliminary Draft,"
      http://www.cifs.com/specs/draft-leach-cifs-v1-spec-01.txt,
      December 1997.

   [HTTP] J. Gettys et al., "Hypertext Transfer Protocol - HTTP/1.1,"
      RFC 2616, June 1999.

   [NFSv3] B. Callaghan, "NFS Version 3 Protocol Specification,"
      RFC 1813, June 1995.

   [RPC] R. Srinivasan, "RPC: Remote Procedure Call Protocol
      Specification Version 2," RFC 1831, August 1995.

   [iSCSI] J. Satran et al., "iSCSI," draft-ietf-ips-iscsi-01.txt.

   [Stevens] W. R. Stevens, "Unix Network Programming, Volume 1,"
      Prentice Hall, 1998, ISBN 0-13-490012-X.

   [TCP] J. Postel, "Transmission Control Protocol - DARPA Internet
      Program Protocol Specification," RFC 793, September 1981.

   [TCPRDMA] C. Sapuntzakis and D. Cheriton, "TCP RDMA option,"
      http://www.ietf.org/internet-drafts/draft-csapuntz-tcprdma-00.txt

   [Winsock Direct] "Winsock Direct Specification," Windows 2000 DDK,
      http://www.microsoft.com/ddk/ddkdocs/win2K/wsdspec_1h66.htm

   [VI] Virtual Interface Architecture Specification, Version 1.0,
      http://www.viarch.org/

   [VITCP] S. DiCecco et al., "VI/TCP (Internet VI)",