Internet-Draft                                     Allyn Romanow (Cisco)
                                                          Jeff Mogul (HP)
Expires: December 2003                               Tom Talpey (NetApp)
                                                Stephen Bailey (Sandburst)

                      RDMA over IP Problem Statement
                   draft-ietf-rddp-problem-statement-02

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

This draft addresses an IP-based solution to the problem of high end-host processing costs caused by network I/O copying at high speeds.  The problem stems from the high cost of memory bandwidth, and it can be substantially mitigated by "copy avoidance."  The high overhead has limited the use of TCP/IP in interconnection networks, especially where the hosted application requires high bandwidth, low latency, and/or low overhead of end-system data movement.

Table Of Contents

   1.   Introduction
   2.   The high cost of data movement operations in network I/O
   2.1. Copy avoidance improves processing overhead
   3.   Memory bandwidth is the root cause of the problem
   4.   High copy overhead is problematic for many key Internet applications
   5.   Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
   6.   Security Considerations
   7.   Acknowledgements
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

This draft considers the problem of high host processing overhead associated with network I/O under high-speed conditions.  This problem is often referred to as the "I/O bottleneck" [CT90].  More specifically, the source of high overhead of interest here is data movement operations, i.e., copying.  This issue is not to be confused with TCP offload, which is not addressed here.  High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory.  With today's computer systems, 1 Gbits/s and above is considered high speed.
High costs associated with copying are an issue primarily for large-scale systems.  Although smaller systems such as rack-mounted PCs and small workstations would also benefit from a reduction in copying overhead, they will do so mainly over the next few years, as the amount of bandwidth they handle grows.  Today it is large machines with high-bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead.  Examples of such machines include all varieties of servers: database servers, storage servers, and application servers for transaction processing, e-commerce, web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.  Nonetheless, the cost of copying overhead for a particular load is the same whether it comes from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over roughly the last 14 years, and we draw freely on these results.  Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates: 100 Mbits/s FDDI and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet.  In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred.  Now, however, this is no longer the case.  While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high-speed transfer.  Instead, special-purpose network fabrics, using a technology generally known as remote direct memory access (RDMA), have been developed and are widely used.  RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers.  Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and InfiniBand [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks.  These link-level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying, and that copy avoidance significantly decreases this processing overhead.  It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT02].  It also discusses how the proposed technology may introduce security concerns and how they should be addressed.
2. The high cost of data movement operations in network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead.  It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

Clark et al. [CJRS89] showed in 1989 that TCP [Po81] processing overhead is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to costs associated with processing individual bytes, specifically computing the checksum and moving data in memory.  They found that moving data in memory is the more important of these costs, and their experiments show that memory bandwidth is the greatest source of limitation.  In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data-touching operations, and 48% was accounted for by copying.  The system measured was Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which entails two memory bus crossings - for read and write.  Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study.  A number of studies report that data-touching operations - checksumming and data movement - dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96].  For smaller messages, per-packet overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96].  For example, Chu [Ch96] reported substantial per-byte latency costs, as a percentage of total networking software costs, for MTU-size packets on a SPARCstation/20 running memory-to-memory TCP tests over networks with three different MTU sizes.  The percentage of total software costs attributable to per-byte operations was:

   1500 Byte Ethernet   18-25%
   4352 Byte FDDI       35-50%
   9180 Byte ATM        55-65%

Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95].  For example, [KP96] reports results that separate the processing times for checksum from those for data movement operations.  For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying.  The study used two DECstations 5000/200 connected by an FDDI network.  (In this study, checksum accounts for 30% of the processing time.)
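To make the conventional receive path concrete, the following sketch caricatures a non-copy-avoiding receive as described above.  It is an illustrative sketch only: the NIC deposit routine (nic_dma_deposit) and the statically allocated buffers are hypothetical stand-ins, not taken from any particular operating system.

   /*
    * Schematic sketch of a conventional receive path, showing where
    * the memory bus crossings occur.  Illustrative only; names such
    * as nic_dma_deposit() and kernel_buf are hypothetical.
    */
   #include <stdio.h>
   #include <string.h>

   #define PKT_MAX 9180                 /* largest MTU discussed above */

   static unsigned char kernel_buf[PKT_MAX];  /* kernel network buffer */
   static unsigned char app_buf[PKT_MAX];     /* application buffer    */

   /* Stand-in for the NIC DMA engine depositing a packet in kernel
    * memory.  This is crossing 1: one memory write per byte. */
   static size_t nic_dma_deposit(unsigned char *dst, size_t max)
   {
       size_t len = (max < 1460) ? max : 1460;   /* one Ethernet frame */
       memset(dst, 0xab, len);
       return len;
   }

   /* Crossings 2 and 3: the kernel copies the payload into the
    * application's buffer (one read plus one write per byte).  A
    * "2-copy" stack performs such a copy twice, for five crossings
    * per received byte; direct placement into app_buf would leave
    * only the NIC's single DMA write. */
   static size_t receive_one_copy(void)
   {
       size_t len = nic_dma_deposit(kernel_buf, sizeof(kernel_buf));
       memcpy(app_buf, kernel_buf, len);         /* the copy to avoid */
       return len;
   }

   int main(void)
   {
       printf("received %zu bytes with one copy\n", receive_one_copy());
       return 0;
   }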
2.1. Copy avoidance improves processing overhead

A number of studies show that eliminating copies substantially reduces overhead.  For example, results from copy avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.  The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333 MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where the elimination of copying, using a variety of different approaches, showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%.  Thus the relative importance of reducing copies is 26%.  At 370 Mbits/s, the system is not very heavily loaded.  The relative improvement in achievable bandwidth is 34%.  This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost.  This is what happens if other sources of overhead, such as checksumming, are eliminated.  In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%.  This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.

In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead.  These interfaces also coalesce interrupts to reduce per-packet costs.  Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and relatively more benefit is therefore to be gained by reducing them.  (Of course, this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial.)
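The relative-improvement figures quoted above from [CGY01] follow from simple arithmetic.  The short program below reproduces them, under the assumption (made here only for illustration) that achievable bandwidth at saturation is inversely proportional to CPU cost per unit of data transferred.

   /* Reproduces the arithmetic behind the [CGY01] figures quoted
    * above.  Illustrative only. */
   #include <stdio.h>

   static void improvement(double total, double data_access, double reduced)
   {
       double saved   = data_access - reduced;         /* absolute CPU saved   */
       double after   = total - saved;                 /* CPU after avoidance  */
       double rel_cpu = 100.0 * saved / total;         /* relative CPU saving  */
       double rel_bw  = 100.0 * (total / after - 1.0); /* bandwidth gain       */

       printf("CPU %4.1f%% -> %4.1f%%: %4.1f%% absolute, %4.1f%% relative, "
              "%5.1f%% more achievable bandwidth\n",
              total, after, saved, rel_cpu, rel_bw);
   }

   int main(void)
   {
       /* With checksumming: 35% total CPU, data access 24% -> 15%.
        * Yields 9% absolute, ~26% relative, ~34% bandwidth. */
       improvement(35.0, 24.0, 15.0);

       /* Checksum offloaded: 26% total CPU, the 16% spent copying
        * eliminated entirely (26% -> 10%).  Yields 16% absolute,
        * ~61% relative, ~160% bandwidth. */
       improvement(26.0, 16.0, 0.0);
       return 0;
   }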
3. Memory bandwidth is the root cause of the problem

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97].  This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With each copy crossing the bus twice, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths.  Generally, with today's end-systems, the effects are observable at network speeds above 1 Gbits/s.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O.  The answer is no; it is the memory bandwidth that is the issue.  Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97].  Hennessy [HP97] shows that microprocessor performance grew at 60% per year from 1980 to 1998, while DRAM access time improved at only 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM].  The benchmark is a simple synthetic program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels, measured in MFLOPS.  The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

The results show measured system performance statistics.  Processing performance from 1985 to 2001 increased at 50% per year on average, while sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average over all the systems measured.  A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth, (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that memory bandwidth compares unfavorably with link speed.  In a receive operation, a modern uniprocessor PC - for example, the 1.2 GHz Athlon in 2001, whose sustainable memory bandwidth is characterized by the STREAM benchmark - moves the data three times: once for the network interface to deposit the data in memory, and twice for the CPU to copy the data.  With 1 GBytes/s of memory bandwidth, counting each read or write separately, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third of the copy bandwidth.  But this assumes 100% utilization, which is not possible, and, more importantly, the machine would be totally consumed!  (A rule of thumb for databases is that no more than 20% of the machine should be required to service I/O, leaving 80% for the database application - and the less, the better.)

In 2001, 1 Gbits/s links were common.  An application server might typically have two 1 Gbits/s connections: one back-end connection to a storage server and one front-end connection, say for serving HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else.  This means that the machine basically could not keep up with the communication demands in 2001; given the relative growth trends, the situation only gets worse.
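The arithmetic behind this example can be stated explicitly: if each received byte crosses the memory bus three times, the sustainable network rate is one third of the memory bandwidth, converted from bytes to bits.  The sketch below simply recomputes the figures quoted above; it is a worked illustration, not part of the original analysis.

   /* Maximum sustainable network rate given memory bandwidth and the
    * number of memory bus crossings per received byte.  The figures
    * are the 2001-era values quoted in the text. */
   #include <stdio.h>

   int main(void)
   {
       double mem_gbytes_per_s = 1.0;  /* each read or write counted once */
       int    crossings        = 3;    /* NIC DMA write + one copy (r+w)  */

       /* Each received byte consumes 'crossings' bytes of memory
        * bandwidth; multiply by 8 to convert GBytes/s to Gbits/s. */
       double max_gbits = 8.0 * mem_gbytes_per_s / crossings;

       printf("theoretical maximum: %.2f Gbits/s\n", max_gbits); /* 2.67 */
       printf("typical 2001 demand: 2 x 1 Gbits/s = 2 Gbits/s\n");
       return 0;
   }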
4. High copy overhead is problematic for many key Internet applications

If a significant portion of the resources on an application machine is consumed by network I/O rather than by application processing, it becomes difficult for the application to scale - to handle more clients, to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96].  Today, in addition, the applications that suffer from copying overhead are more central to Internet computing - they store, manage, and distribute the information of the Internet and the enterprise.  They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups.  Clusters are typically used for this category of application, since they have the advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers.  One tier is typically a set of web servers connecting to the WAN.  The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is a set of back-end databases.  Physically, the first two tiers - web server and application server - are usually combined [Pi01].  For example, an e-commerce server communicates with a database server and with a customer site, a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on the network paths between the tiers can suffer.  (There might also be performance issues on SAN paths used either by the database tier or the application tier.)  The high overhead from network-related memory copies diverts system resources from other application processing.  It can also create bottlenecks that limit total system performance.

There is a large and growing number of these application servers distributed throughout the Internet.  In 1999, approximately 3.4 million server units were shipped; in 2000, 3.9 million units; and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

There is high motivation to maximize the processing capacity of each CPU, because scaling by adding CPUs, in one way or another, has drawbacks.  For example, adding CPUs to a multiprocessor will not necessarily help, because a multiprocessor improves performance only when the memory bus has additional bandwidth to spare.  Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth.  Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this bandwidth (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput.  Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overhead on the network paths of a "scalable" system can both require the use of more processors than optimal and reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upward by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and at least 80% of the bandwidth overhead from 2-copy implementations.
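These fractions follow from counting memory bus crossings per received byte: a 1-copy receive costs three crossings (the NIC's DMA write plus a read and a write for the copy), a 2-copy receive costs five, and direct data placement leaves only the single DMA write.  The short program below is merely a worked restatement of that arithmetic.

   /* Bus-load reduction fractions quoted above, derived from the
    * crossing counts for each receive path.  Illustrative only. */
   #include <stdio.h>

   int main(void)
   {
       int one_copy = 3;   /* NIC DMA write + one copy  */
       int two_copy = 5;   /* NIC DMA write + two copies */
       int direct   = 1;   /* NIC DMA write only         */

       printf("savings vs 1-copy: %.0f%%\n",          /* two-thirds */
              100.0 * (one_copy - direct) / one_copy);
       printf("savings vs 2-copy: %.0f%%\n",          /* 80%        */
              100.0 * (two_copy - direct) / two_copy);
       return 0;
   }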
An example illustrates the poor performance with copies and the improved scaling with copy avoidance.  The IO-Lite work [PDZ99] shows that a zero-copy system yields higher server throughput while servicing more clients.  In an experiment designed to mimic real-world web conditions by simulating the effect of TCP WAN connections on the server, the performance of three servers was compared: Apache, an optimized server called Flash, and the Flash server running IO-Lite with zero copy, called Flash-Lite.  The measurement was of throughput, in requests per second, as a function of the number of slow background clients that could be served.  As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

   #Clients   Apache      Flash       Flash-Lite
              (reqs/s)    (reqs/s)    (reqs/s)
   --------   --------    --------    ----------
        0        520         610          890
       16        390         490          890
       32        360         490          850
       64        360         490          890
      128        310         450          880
      256        310         440          820

Traditional web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead.  Web proxies (which often receive as much data as they send) and complex web servers based on SANs or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There has been extensive research investigation of, and industry experience with, two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs.  In one approach, hardware and/or software changes within a single host reduce processing costs.  In the other approach, memory-to-memory networking [MAF+02], hosts exchange information that allows them to reduce processing costs.

The single-host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement.  This approach is commonly called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience have shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic.  The research special-purpose host adapter systems had good performance and can be seen as precursors of the commercial RDMA-based NICs [KSZ95, DWB+93].  In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive.  Those that have done so impose strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

In contrast, experience with memory-to-memory systems that permit RDMA has proven satisfactory - performance has been good, and there have not been system or networking difficulties.  RDMA is a single solution.  Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.
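To make the memory-to-memory model concrete, the sketch below shows the general shape of an RDMA-style exchange: the data sink registers an application buffer and advertises it, and the data source places data directly into it.  The API names (ddp_region, ddp_register, rdma_write) are hypothetical and belong to no real interface, and the "network" is simulated here by a local copy; real systems such as VI or InfiniBand differ considerably in detail.

   /* Purely illustrative sketch of memory-to-memory data placement.
    * All names are hypothetical; the single data movement below
    * stands in for the NIC's DMA write into the application buffer. */
   #include <stdio.h>
   #include <string.h>

   typedef struct {
       void   *base;      /* start of the exposed application buffer */
       size_t  length;    /* length of the exposed region            */
       int     writable;  /* peer write access granted?              */
   } ddp_region;

   /* Sink side: expose an application buffer for direct placement.
    * A real system would also pin and map the memory here. */
   static ddp_region ddp_register(void *buf, size_t len, int writable)
   {
       ddp_region r = { buf, len, writable };
       return r;
   }

   /* Source side: place data directly into the advertised region.
    * Note that there is no intermediate kernel buffer and no
    * receive-side copy. */
   static int rdma_write(ddp_region *r, size_t offset,
                         const void *src, size_t len)
   {
       if (!r->writable || offset + len > r->length)
           return -1;                        /* access/bounds check */
       memcpy((char *)r->base + offset, src, len);
       return 0;
   }

   int main(void)
   {
       char app_buf[64] = { 0 };

       /* Sink registers and (conceptually) advertises the region. */
       ddp_region region = ddp_register(app_buf, sizeof(app_buf), 1);

       /* Source writes payload directly into the sink's buffer. */
       const char payload[] = "directly placed payload";
       if (rdma_write(&region, 0, payload, sizeof(payload)) == 0)
           printf("sink sees: %s\n", app_buf);
       return 0;
   }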
In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94].  While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, the specific system optimizations are complex, fragile, interdependent with other system parameters in intricate ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].  The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other subsystems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that the optimizations must be re-crafted for each variation in architecture.

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance.  These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], InfiniBand [IB], and Winsock Direct [Pi01].  Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement.  These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers.  A large and growing class of applications that take advantage of such capabilities has already emerged, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as comprising two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics".  They are distinct in purpose and also in practice - they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility.  This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance for reading and writing.

The RDMA control functions are semantically layered atop direct data placement.  Included are operations that provide "control" features, such as connection establishment and termination, and the ordering of operations and signaling of their completions.  A "send" facility is provided.

While the functions (and potentially the protocols) are distinct, historically both aspects taken together have been referred to as "RDMA".  The facilities of direct data placement are useful in and of themselves, and may be employed by other upper-layer protocols to facilitate data transfer.  Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT02] develops an architecture for DDP and RDMA, and is a companion draft to this problem statement.
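One way to picture the division of labor described in this section is as two small sets of operations, with the RDMA control functions layered above the DDP placement functions.  The enumeration below is a conceptual sketch only, not a protocol or API definition; the names are invented for illustration.

   /* Conceptual split between DDP and RDMA, as described above.
    * Illustrative names only. */

   /* Direct data placement: exposing memory and accessing it. */
   enum ddp_operation {
       DDP_EXPOSE_MEMORY,   /* expose a local buffer to the remote peer  */
       DDP_PEER_WRITE,      /* peer places data into the exposed buffer  */
       DDP_PEER_READ        /* peer fetches data from the exposed buffer */
   };

   /* RDMA semantics: control functions layered atop placement. */
   enum rdma_control {
       RDMA_CONNECT,        /* connection establishment                  */
       RDMA_TERMINATE,      /* connection termination                    */
       RDMA_ORDERING,       /* ordering of operations                    */
       RDMA_COMPLETION,     /* signaling of completions                  */
       RDMA_SEND            /* the "send" facility                       */
   };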
6. Security Considerations

Solutions to the problem of reducing copying overhead in high-bandwidth transfers via one or more protocols may introduce new security concerns.  Any proposed solution must be analyzed for security threats, and any such threats must be addressed.  Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified threats - need to be examined and described, and an adequate solution to them found.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from, and must permit integration with, Internet security standards such as IPsec and TLS.  A thorough analysis of the degree to which these protocols address potential threats is required.

Security for an RDMA design requires more than just securing the communication channel.  While it is necessary to be able to guarantee channel properties such as privacy, integrity, and authentication, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy.  For example, an RDMA peer should not be able to read or write memory regions without prior consent.

Further, it must not be possible to evade consistency checks at the recipient.  The RDMA design must allow the recipient to rely on its consistent memory contents by controlling peer access to memory regions explicitly, and must disallow peer access to regions when not authorized.

The RDMA protocols must ensure that regions addressable by RDMA peers are under strict application control.  Remote access to local memory by a network peer introduces a number of potential security concerns.  This becomes particularly important in the Internet context, where such access can be exported globally.

The RDMA protocols carry, in part, what is essentially user information, explicitly including addressing information and operation type (read or write), and implicitly including protection and attributes.  As such, the protocols require checking of these higher-level aspects in addition to the basic well-formedness of messages.  The semantics associated with each class of error must be clearly defined, and the expected action to be taken on a mismatch must be specified.  In some cases, this will result in a catastrophic error on the RDMA association; in others, a local or remote error may be signalled.  Certain of these errors may require consideration of abstract local semantics, which must be carefully specified so as to provide useful behavior while not constraining the implementation.
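As an illustration of the kind of explicit access control discussed in this section, the sketch below gates every peer access to a local region on an explicit, revocable grant.  All structure, field, and function names are hypothetical; the point is only that a peer operation is validated against the application's grant before any data is placed or returned.

   /* Illustrative access check for peer access to a local memory
    * region.  Hypothetical names; not a protocol definition. */
   #include <stdint.h>
   #include <stddef.h>

   #define ACCESS_REMOTE_READ   0x1u
   #define ACCESS_REMOTE_WRITE  0x2u

   typedef struct {
       uintptr_t base;      /* start of the advertised region        */
       size_t    length;    /* its length                            */
       unsigned  access;    /* rights granted to the peer, or 0      */
       uint32_t  peer_id;   /* the one peer the grant was issued to  */
   } region_grant;

   /* Returns nonzero only if this peer, this operation, and this
    * byte range were all explicitly authorized.  The grant can be
    * revoked by clearing 'access', after which all further peer
    * accesses fail. */
   static int peer_access_allowed(const region_grant *g, uint32_t peer_id,
                                  unsigned op, uintptr_t addr, size_t len)
   {
       return g->access != 0
           && g->peer_id == peer_id
           && (g->access & op) == op
           && addr >= g->base
           && len <= g->length
           && addr - g->base <= g->length - len;
   }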
7. Acknowledgements

Jeff Chase generously provided many useful insights and much information.  Thanks to Jim Pinkerton for many helpful discussions.

8. Informative References

[BCF+95]
     N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995

[BJM+96]
     G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996

[BLA+94]
     M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153

[Br99]
     J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

[BS96]
     J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings of OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291

RFC Editor note: Replace the following architecture draft name, status, and date with the appropriate reference when it is assigned.

[BT02]
     S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-02, June 2003

[CFF+94]
     C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42

[CGY01]
     J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74, http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[Ch96]
     H. K. Chu, "Zero-copy TCP in Solaris", Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996

[Ch02]
     Jeffrey Chase, Personal communication

[CJRS89]
     D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume 27, Issue 6, June 1989, pp. 23-29

[CT90]
     D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990

[DAFS]
     DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

[DAPP93]
     P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17

[DP93]
     P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993

[DWB+93]
     C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993

[EBBV95]
     T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proceedings of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995

[FGM+99]
     R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999

[FIBRE]
     ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

[HP97]
     J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997

[IB]
     InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[KP96]
     J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996
[KSZ95]
     K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95

[Ma02]
     K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of the USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002

[MAF+02]
     K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002

[Mc95]
     J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995

[Ne00]
     A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000, http://serverwatch.internet.com/news/2000_07_24_a.html

[Pa01]
     M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001, http://serverwatch.internet.com/news/2001_02_07_a.html

[PAC+97]
     D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997

[PDZ99]
     V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999

[Pi01]
     J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

[Po81]
     J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981

[QUAD]
     Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

[SDP]
     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[SRVNET]
     R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995

[STREAM]
     J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

[TK95]
     M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

[VI]
     Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

[Wa97]
     J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
Email: steph@sandburst.com
Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, CA 94304 USA

Phone: +1 650 857 2206 (email preferred)
Email: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2003).  All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.  However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.