                                                   Allyn Romanow (Cisco)
Internet-Draft                                           Jeff Mogul (HP)
Expires: July 2004                                   Tom Talpey (NetApp)
                                               Stephen Bailey (Sandburst)

                      RDMA over IP Problem Statement
                   draft-ietf-rddp-problem-statement-03

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

This draft addresses an IP-based solution to the problem of high system overhead due to end-host copying of user data in the network I/O path at high speeds. The problem is due to the high cost of memory bandwidth, and it can be substantially improved using "copy avoidance." The overhead has limited the use of TCP/IP in interconnection networks especially where high bandwidth, low latency and/or low overhead of end-system data movement are required by the hosted application.

Table Of Contents

   1.   Introduction
   2.   The high cost of data movement operations in network I/O
   2.1. Copy avoidance improves processing overhead
   3.   Memory bandwidth is the root cause of the problem
   4.   High copy overhead is problematic for many key Internet applications
   5.   Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
   6.   Conclusions
   7.   Security Considerations
   8.   Acknowledgements
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

This draft considers the problem of high host processing overhead associated with the movement of user data to and from the network interface under high speed conditions. This problem is often referred to as the "I/O bottleneck" [CT90]. More specifically, the source of high overhead that is of interest here is data movement operations - copying. The throughput of a system may therefore be limited by the overhead of this copying. This issue is not to be confused with TCP offload, which is not addressed here. High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory. With today's computer systems, one Gbits/s and over is considered high speed.

High costs associated with copying are an issue primarily for large scale systems.
Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will come primarily in the next few years as they scale in the amount of bandwidth they handle. Today it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication. Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over approximately the last 14 years, and we draw freely on these results. Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates - 100 Mbits/s FDDI and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet. In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred. Now, however, this is no longer the case. While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet which is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer. Instead, special purpose network fabrics, using a technology generally known as remote direct memory access (RDMA), have been developed and are widely used. RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers. Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks. These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases the processing overhead. It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT04]. It also discusses how the proposed technology may introduce security concerns and how they should be addressed.
2. The high cost of data movement operations in network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead. It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

In 1989, Clark et al. [CJRS89] showed that TCP [Po81] overhead processing is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to the costs associated with processing individual bytes, specifically computing the checksum and moving data in memory. They found that moving data in memory is the more important of these costs, and their experiments show that memory bandwidth is the greatest source of limitation. In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying. The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which is two memory bus crossings - for read and write. Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study. A number of studies report results showing that data-touching operations, checksumming and data movement, dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96]. For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes. The percentages of total software costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95]. For example, [KP96] reports results that separate processing times for checksum from data movement operations. For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying. The study used 2 DECstations 5000/200 connected by an FDDI network. (In this study, checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead

A number of studies show that eliminating copies substantially reduces overhead. For example, results from copy avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.
The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95]. This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%. Thus the relative importance of reducing copies is 26%. At 370 Mbits/s, the system is not very heavily loaded. The relative improvement in achievable bandwidth is 34%. This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost. This is what happens if other sources of overhead, such as checksumming, are eliminated. In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%. This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.
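As an illustration, the quoted figures can be reproduced with simple arithmetic, under the assumption (made here only for illustration) that achievable bandwidth at saturation scales inversely with CPU cost per byte transferred:

   \begin{align*}
   \text{absolute saving}                  &= 24\% - 15\% = 9\%\\
   \text{utilization with copy avoidance}  &= 35\% - 9\% = 26\%\\
   \text{relative reduction}               &= 9\% / 35\% \approx 26\%\\
   \text{bandwidth improvement}            &= 35\% / 26\% - 1 \approx 34\%
   \end{align*}

With checksum overhead also removed, the same arithmetic gives:

   \begin{align*}
   \text{absolute saving}       &= 26\% - 10\% = 16\%\\
   \text{relative reduction}    &= 16\% / 26\% \approx 61\%\\
   \text{bandwidth improvement} &= 26\% / 10\% - 1 = 160\%
   \end{align*}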
In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead. Such interfaces also coalesce interrupts to reduce per-packet costs. Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them. (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial.)

3. Memory bandwidth is the root cause of the problem

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97]. This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally, with today's end-systems the effects are observable at network speeds over 1 Gbits/s. In fact, with multiple bus crossings it is possible for the bus bandwidth to become the limiting factor for throughput. This prevents such an end-system from simultaneously achieving full network bandwidth and full application performance.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no; it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97]. Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM]. The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS. The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

Results show measured system performance statistics. Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average over all the systems measured. A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth, (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that the memory bandwidth compares unfavorably with link speed. The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will move the data 3 times in doing a receive operation - 1 for the network interface to deposit the data in memory, and 2 for the CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth. But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed! (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application. And, the less the better.)

In 2001, 1 Gbits/s links were common. An application server may typically have two 1 Gbits/s connections - one connection backend to a storage server and one front-end, say for serving HTTP [FGM+99]. Thus the communications could use 2 Gbits/s. In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else. This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.
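The arithmetic behind this example, restated for clarity (an illustration of the numbers above; the 75% figure is an inference of this restatement, not a measurement):

   \begin{align*}
   \text{memory bandwidth}            &= 1\ \text{GByte/s} = 8\ \text{Gbits/s}\\
   \text{crossings per received byte} &= 1\ (\text{NIC deposit}) + 2\ (\text{CPU copy: read + write}) = 3\\
   \text{maximum receive rate}        &= 8 / 3 \approx 2.67\ \text{Gbits/s}\\
   \text{two 1 Gbits/s links}         &= 2\ \text{Gbits/s} \approx 75\%\ \text{of that theoretical maximum}
   \end{align*}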
4. High copy overhead is problematic for many key Internet applications

If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it becomes difficult for the application to scale, that is, to handle more clients and to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96]. In addition, today the applications that suffer from copying overhead are more central in Internet computing - they store, manage, and distribute the information of the Internet and the enterprise. They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups. Clusters are typically used for this category of application, since they have advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers. One tier is typically a set of web servers connecting to the WAN. The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is backend databases. Physically, the first two tiers - web server and application server - are usually combined [Pi01]. For example, an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer. (There might also be performance issues on SAN paths used either by the database tier or the application tier.) The high overhead from network-related memory copies diverts system resources from other application processing. It also can create bottlenecks that limit total system performance.

There are a large and growing number of these application servers distributed throughout the Internet. In 1999 approximately 3.4 million server units were shipped, in 2000, 3.9 million units, and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

There is high motivation to maximize the processing capacity of each CPU, as scaling by adding CPUs, one way or another, has drawbacks. For example, adding CPUs to a multiprocessor will not necessarily help, as a multiprocessor improves performance only when the memory bus has additional bandwidth to spare. Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth. Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput. Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overheads on network paths in a "scalable" system both can require the use of more processors than optimal, and can reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and removes at least 80% of the bandwidth overhead from the 2-copy implementations.
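These fractions follow from counting memory bus crossings per received byte, assuming (as in the example of Section 3) one crossing for the adapter's deposit of the data and two crossings, a read plus a write, for each CPU copy:

   \begin{align*}
   \text{2-copy receive}            &: 1 + 2 + 2 = 5 \text{ crossings per byte}\\
   \text{1-copy receive}            &: 1 + 2 = 3 \text{ crossings per byte}\\
   \text{copy avoidance}            &: 1 \text{ crossing per byte}\\
   \text{load removed vs. 1-copy}   &= (3 - 1)/3 = 2/3\\
   \text{load removed vs. 2-copy}   &= (5 - 1)/5 = 80\%
   \end{align*}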
The removal of this bus bandwidth requirement, in turn, removes bottlenecks from the network processing path and increases the throughput of the machine. On a machine with limited bus bandwidth, the advantages of removing this load are immediately evident, as the host can attain full network bandwidth. Even on a machine with bus bandwidth adequate to sustain full network bandwidth, removal of the bus bandwidth load serves to increase the availability of the machine for the processing of user applications, in some cases dramatically.

An example showing poor performance with copies and improved scaling with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system. In an experiment designed to mimic real world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared. One server was Apache, another an optimized server called Flash, and the third the Flash server running IO-Lite, called Flash-Lite, with zero copy. The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served. As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

      #Clients      Apache       Flash   Flash-Lite
                  (reqs/s)    (reqs/s)     (reqs/s)

             0         520         610          890
            16         390         490          890
            32         360         490          850
            64         360         490          890
           128         310         450          880
           256         310         440          820

Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead. Web proxies (which often receive as much data as they send) and complex Web servers based on SANs or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There has been extensive research investigation and industry experience with two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs. In one approach, hardware and/or software changes within a single host reduce processing costs. In another approach, memory-to-memory networking [MAF+02], the exchange of explicit data placement information between hosts allows them to reduce processing costs.

The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BP96, Ch96, TK95, DP93, PDZ99]. In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement. Commonly this approach is called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience has shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic. The research special-purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based NICs [KSZ95, DWB+93]. In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive. And those that have done so make strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.
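As an illustration of why such requirements reduce portability, the sketch below shows the kind of buffer discipline a page-remapping ("page flipping") zero-copy receive scheme typically imposes on an application. The helper and its rules are generic assumptions made for this illustration; they are not drawn from any specific system cited in this document.

   /* Illustrative only: buffer constraints typical of page-remapping
    * zero-copy receive schemes - page alignment, page-multiple length,
    * and no touching of the buffer while a receive is outstanding.
    */
   #define _POSIX_C_SOURCE 200112L
   #include <stdlib.h>
   #include <unistd.h>

   void *alloc_zerocopy_recv_buffer(size_t len, size_t *alloc_len)
   {
       long page = sysconf(_SC_PAGESIZE);
       void *buf = NULL;

       /* Round the requested length up to a whole number of pages;
        * partial pages cannot be remapped into the application. */
       *alloc_len = ((len + (size_t)page - 1) / (size_t)page) * (size_t)page;

       /* The buffer itself must start on a page boundary. */
       if (posix_memalign(&buf, (size_t)page, *alloc_len) != 0)
           return NULL;

       /* The application must not write into the buffer while a receive
        * is outstanding; schemes of this kind typically fall back to
        * copying if the buffer is touched or misaligned. */
       return buf;
   }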
In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA - performance has been good and there have not been system or networking difficulties. RDMA is a single solution. Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.

In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94]. While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, specific system optimizations have been complex, fragile, extremely interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99]. The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that for each variation in architecture, optimizations must be re-crafted [Ch96].

With RDMA, application I/O buffers are mapped directly, and the authorized peer may access them without incurring additional processing overhead. When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the host CPU at all.

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance. These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], and Winsock Direct [Pi01]. Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement. These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET]. Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers. A large and growing class of applications has already emerged which takes advantage of such capabilities, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as being composed of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics". They are distinct in purpose and also in practice - they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility. This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance reading and writing.

The RDMA control functions are semantically layered atop direct data placement. Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling their completions. A "send" facility is provided.
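To make the separation concrete, the following sketch outlines a hypothetical programming interface with a placement layer and a control layer. All names, types, and signatures here are invented for this illustration; they are not the protocol elements defined in [BT04], nor the API of any product or system cited in this document.

   /* Hypothetical, illustrative interface sketching the DDP/RDMA split. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t stag_t;   /* steering tag naming an exposed memory region */
   struct ddp_ep;             /* an established DDP/RDMA endpoint (opaque)    */
   struct completion { uint64_t work_id; int status; };

   /* Data placement: expose (register) a local buffer so the adapter may
    * steer incoming data directly into it, and so the peer may address it
    * by steering tag and offset. */
   stag_t ddp_register(struct ddp_ep *ep, void *buf, size_t len, int remote_access);
   void   ddp_deregister(struct ddp_ep *ep, stag_t stag);

   /* RDMA semantics layered on placement: one-sided read/write against a
    * peer's advertised region, plus a two-sided send and completion
    * signaling handled by the control layer. */
   int rdma_write(struct ddp_ep *ep, stag_t peer_stag, size_t peer_offset,
                  const void *src, size_t len, uint64_t work_id);
   int rdma_read (struct ddp_ep *ep, stag_t peer_stag, size_t peer_offset,
                  void *dst, size_t len, uint64_t work_id);
   int rdma_send (struct ddp_ep *ep, const void *msg, size_t len, uint64_t work_id);
   int rdma_poll_completion(struct ddp_ep *ep, struct completion *out);

Note that the one-sided operations carry peer addressing information (a steering tag and offset), which is one reason Section 7 requires that the regions addressable by peers remain under strict application control.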
While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA". The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer. Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT04] develops an architecture for DDP and RDMA atop the Internet Protocol Suite, and is a companion draft to this problem statement.

6. Conclusions

This Problem Statement concludes that an IP-based, general solution for reducing processing overhead in end-hosts is desirable.

It has shown that the high overhead of the processing of network data leads to end-host bottlenecks. These bottlenecks are in large part attributable to the copying of data. The bus bandwidth of machines has historically been limited, and the bandwidth of high-speed interconnects taxes it heavily.

An architectural solution that alleviates these bottlenecks best addresses the issue. Further, the high speed of today's interconnects and the deployment of these hosts on Internet Protocol-based networks lead to the desirability of layering such a solution on the Internet Protocol Suite. The architecture described in [BT04] is such a proposal.

7. Security Considerations

Solutions to the problem of reducing copying overhead in high bandwidth transfers via one or more protocols may introduce new security concerns. Any proposed solution must be analyzed for security threats, and any such threats must be addressed. Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified threats - need to be examined and described, and an adequate solution to them found.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from and must permit integration with Internet security standards, such as IPsec and TLS [IPSEC, TLS]. A thorough analysis of the degree to which these protocols address potential threats is required.

Security for an RDMA design requires more than just securing the communication channel. While it is necessary to be able to guarantee channel properties such as privacy, integrity, and authentication, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy. For example, an RDMA peer should not be able to read or write memory regions without prior consent.

Further, it must not be possible to evade consistency checks at the recipient. The RDMA design must allow the recipient to rely on its consistent memory contents by controlling peer access to memory regions explicitly, and must disallow peer access to regions when not authorized.

The RDMA protocols must ensure that regions addressable by RDMA peers be under strict application control. Remote access to local memory by a network peer introduces a number of potential security concerns.
This becomes particularly important in the Internet context, where such access can be exported globally.

The RDMA protocols carry in part what is essentially user information, explicitly including addressing information and operation type (read or write), and implicitly including protection and attributes. As such, the protocol requires checking of these higher level aspects in addition to the basic formation of messages. The semantics associated with each class of error must be clearly defined, and the expected action to be taken on mismatch must be specified. In some cases, this will result in a catastrophic error on the RDMA association; in others, a local or remote error may be signalled. Certain of these errors may require consideration of abstract local semantics, which must be carefully specified so as to provide useful behavior while not constraining the implementation.

8. Acknowledgements

Jeff Chase generously provided many useful insights and information. Thanks to Jim Pinkerton for many helpful discussions.

9. Informative References

[BCF+95]  N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995.

[BJM+96]  G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996.

[BLA+94]  M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153.

[Br99]    J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542.

[BS96]    J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291.

RFC Editor note: Replace the following architecture draft-ietf- name, status and date with the appropriate reference when assigned.

[BT04]    S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-04, January 2004.

[CFF+94]  C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42.

[CGY01]   J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[Ch96]    H. K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996.

[Ch02]    Jeffrey Chase, Personal communication.

[CJRS89]  D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume 27, Issue 6, June 1989, pp. 23-29.
[CT90]    D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990.

[DAFS]    DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

[DAPP93]  P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17.

[DP93]    P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993.

[DWB+93]  C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993.

[EBBV95]  T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995.

[FGM+99]  R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999.

[FIBRE]   ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

[HP97]    J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997.

[IB]      InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[KP96]    J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996.

[KSZ95]   K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95.

[Ma02]    K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[Mc95]    J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995.

[Ne00]    A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000. http://serverwatch.internet.com/news/2000_07_24_a.html

[Pa01]    M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001. http://serverwatch.internet.com/news/2001_02_07_a.html

[PAC+97]  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997.
[PDZ99]   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999.

[Pi01]    J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

[Po81]    J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[QUAD]    Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

[SDP]     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[SRVNET]  R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995.

[STREAM]  J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

[TK95]    M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995.

[VI]      Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

[Wa97]    J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38.

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
Email: steph@sandburst.com

Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, CA 94304 USA

Phone: +1 650 857 2206 (email preferred)
Email: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2004). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.