Internet-Draft                                    Allyn Romanow (Cisco)
Expires: April 2005                                     Jeff Mogul (HP)
                                                     Tom Talpey (NetApp)
                                               Stephen Bailey (Sandburst)

      Remote Direct Memory Access (RDMA) over IP Problem Statement
                  draft-ietf-rddp-problem-statement-05

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   Overhead due to the movement of user data in the end-system network I/O processing path at high speeds is significant, and has limited the use of Internet protocols in interconnection networks and the Internet itself - especially where high bandwidth, low latency, and/or low overhead are required by the hosted application.

   This draft examines this overhead and describes an architectural, IP-based "copy avoidance" solution for its elimination, by enabling Remote Direct Memory Access (RDMA).

Table of Contents

   1.    Introduction
   2.    The high cost of data movement operations in network I/O
   2.1.  Copy avoidance improves processing overhead
   3.    Memory bandwidth is the root cause of the problem
   4.    High copy overhead is problematic for many key Internet applications
   5.    Copy Avoidance Techniques
   5.1.  A Conceptual Framework: DDP and RDMA
   6.    Conclusions
   7.    Security Considerations
   8.    Terminology
   9.    Acknowledgements
         Informative References
         Authors' Addresses
         Full Copyright Statement

1. Introduction

   This draft considers the problem of high host processing overhead associated with the movement of user data to and from the network interface under high speed conditions.  This problem is often referred to as the "I/O bottleneck" [CT90].  More specifically, the source of high overhead that is of interest here is data movement operations - copying.  The throughput of a system may therefore be limited by the overhead of this copying.  This issue is not to be confused with TCP offload, which is not addressed here.  High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory.  With today's computer systems, one Gigabit per second (Gbits/s) and over is considered high speed.

   High costs associated with copying are an issue primarily for large scale systems.  Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will come primarily over the next few years, as the bandwidth they handle grows.  Today it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead.  Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, e-commerce, and web serving, as well as content distribution, video distribution, backups, data mining and decision support, and scientific computing.

   Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.  Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

   The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over approximately the last 14 years, and we draw freely on these results.  Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates: 100 Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], and 1 Gbits/s Ethernet.  In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred.  Now, however, this is no longer the case.
While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

   Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer.  Instead, special purpose network fabrics, using a technology generally known as Remote Direct Memory Access (RDMA), have been developed and are widely used.  RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers.  Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks.  These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

   This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases this processing overhead.  It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

   The document goes on to discuss why the problem is relevant to the Internet and to Internet-based applications.  Applications that store, manage, and distribute the information of the Internet are well suited to applying the copy avoidance solution.  They will benefit by avoiding high processing overheads, which removes limits to the available scaling of tiered end-systems.  Copy avoidance also reduces latency for these systems, which can further benefit effective distributed processing.

   In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT04].  It also discusses how the proposed technology may introduce security concerns and how they should be addressed.

   Finally, this document includes a Terminology section to serve as a reference for several new terms introduced by RDMA.

2. The high cost of data movement operations in network I/O

   A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead.  It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

   Clark et al. [CJRS89] in 1989 show that TCP [Po81] processing overhead is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to the costs associated with processing individual bytes, specifically computing the checksum and moving data in memory.  They found that moving data in memory is the more important of the costs, and their experiments show that memory bandwidth is the greatest source of limitation.  In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying.  The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

   In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which requires two memory bus crossings, one for the read and one for the write.  Although in certain circumstances it is possible to do better, usually two copies are required on receive.
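   As an illustration only - it is not drawn from any of the studies cited in this document - the following self-contained sketch models in ordinary C where the two receive-side copies arise in a conventional sockets stack.  The buffer names are hypothetical simplifications; real kernels differ considerably.

      /* Illustrative model of the conventional two-copy receive path;
       * the arrays stand in for the NIC DMA area, the kernel socket
       * buffer, and the application buffer.  Not code from any cited
       * system. */
      #include <stdio.h>
      #include <string.h>

      #define FRAME_LEN 1460               /* one Ethernet payload    */

      static char nic_dma_area[FRAME_LEN]; /* written by NIC DMA      */
      static char kernel_skbuf[FRAME_LEN]; /* kernel socket buffer    */
      static char app_buffer[FRAME_LEN];   /* application's buffer    */

      int main(void)
      {
          /* The NIC has already deposited a frame into nic_dma_area. */

          /* Copy 1: driver/stack moves the payload into a socket
           * buffer (one bus read plus one bus write per byte). */
          memcpy(kernel_skbuf, nic_dma_area, FRAME_LEN);

          /* Copy 2: the receive call moves the payload into the
           * application buffer (one more read plus one more write). */
          memcpy(app_buffer, kernel_skbuf, FRAME_LEN);

          printf("bytes delivered: %d, bus crossings after DMA: %d\n",
                 FRAME_LEN, 4 * FRAME_LEN);
          return 0;
      }

   In this model, each received byte crosses the memory bus four times after the initial DMA write; this is the per-byte cost that copy avoidance aims to remove.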
   Subsequent work has consistently shown the same phenomenon as the earlier Clark study.  A number of studies report results showing that data-touching operations - checksumming and data movement - dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96].  For smaller-sized messages, per-packet overheads dominate [KP96, CGY01].

   The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96].  For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes.  The percentages of total software costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

   Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95].  For example, [KP96] reports results that separate the processing times for checksum from data movement operations.  For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying.  The study used 2 DECstations 5000/200 connected by an FDDI network.  (In this study, checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead

   A number of studies show that eliminating copies substantially reduces overhead.  For example, results from copy-avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.  The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

   There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

   Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KByte MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an absolute improvement of 9% due to copy avoidance.

   The total CPU utilization was 35%, with data access accounting for 24%.  Thus the relative importance of reducing copies is 26%.  At 370 Mbits/s, the system is not very heavily loaded.  The relative improvement in achievable bandwidth is 34%.  This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.
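   The way these reported percentages relate to one another can be restated as follows; this worked reading is illustrative and is not taken from [CGY01] itself:

      Total CPU utilization:                35%
      CPU time spent on data access:        24%, reduced to 15%
      Absolute improvement:                 24% - 15% = 9%
      Relative importance of copies:        9 / 35 = approximately 26%
      CPU cost remaining for the same load: 35% - 9% = 26%
      Achievable bandwidth improvement:     35 / 26 - 1 = approximately 34%

   That is, at saturation the same CPU could move roughly one third more data if the copy were removed.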
   Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost.  This is what happens if other sources of overhead, such as checksumming, are eliminated.  In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%.  This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.

   In fact, today's network interface adapters commonly offload the checksum, which removes the other source of per-byte overhead.  They also coalesce interrupts to reduce per-packet costs.  Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them.  (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial [BS96, Br99, Ch96, KP96, TK95].)

3. Memory bandwidth is the root cause of the problem

   Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97].  This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

   With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths.  Generally, with today's end-systems, the effects are observable at network speeds over 1 Gbits/s.  In fact, with multiple bus crossings it is possible to see the bus bandwidth become the limiting factor for throughput.  This prevents such an end-system from simultaneously achieving full network bandwidth and full application performance.

   A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O.  The answer is no; it is the memory bandwidth that is the issue.  Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

   The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97].  Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

   Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM].  The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS.  The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

   Results show measured system performance statistics.  Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average, over all the systems measured.  A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

   Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

   A typical example illustrates that memory bandwidth compares unfavorably with link speed.  The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, moves the data 3 times in doing a receive operation: once for the network interface to deposit the data in memory, and twice for the CPU to copy the data.  With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth.  But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed!  (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application.  And the less, the better.)

   In 2001, 1 Gbits/s links were common.  An application server may typically have two 1 Gbits/s connections: one back-end connection to a storage server and one front-end connection, say for serving HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else.  This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.
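   Following the accounting used in the example above, in which each of the three data movements is counted as one pass over the memory interface, the budget works out as shown below.  This restatement is illustrative only:

      Sustainable memory bandwidth:       1 GByte/s = 8 Gbits/s
      Data movements per received byte:   3 (one NIC deposit + two copies)
      Maximum receive rate:               8 / 3 = approximately 2.67 Gbits/s

   Against the 2 Gbits/s of offered load in the two-interface server example, network I/O alone would consume nearly the entire memory system, leaving little capacity for the application itself.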
4. High copy overhead is problematic for many key Internet applications

   If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it becomes difficult for the application to scale, that is, to handle more clients and to offer more services.

   Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96].  In addition, today the applications that suffer from copying overhead are more central in Internet computing: they store, manage, and distribute the information of the Internet and the enterprise.  They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups.  Clusters are typically used for this category of application, since they have advantages of availability and scalability.

   Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers.  One tier is typically a set of web servers connecting to the WAN.  The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is backend databases.  Physically, the first two tiers - web server and application server - are usually combined [Pi01].  For example, an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.
   When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer.  (There might also be performance issues on Storage Area Network paths used either by the database tier or the application tier.)  The high overhead from network-related memory copies diverts system resources from other application processing.  It also can create bottlenecks that limit total system performance.

   There is a large and growing number of these application servers distributed throughout the Internet.  In 1999, approximately 3.4 million server units were shipped; in 2000, 3.9 million units; and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

   There is high motivation to maximize the processing capacity of each CPU, as scaling by adding CPUs, one way or another, has drawbacks.  For example, adding CPUs to a multiprocessor will not necessarily help, as a multiprocessor improves performance only when the memory bus has additional bandwidth to spare.  Clustering can add additional complexity to handling the applications.

   In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth.  Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput.  Interconnect latency can also limit the performance of applications that frequently share data between processors.

   So, excessive overheads on network paths in a "scalable" system can both require the use of more processors than optimal and reduce the marginal utility of those additional processors.

   Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and at least 80% of the bandwidth overhead from 2-copy implementations.

   The removal of this bus bandwidth requirement, in turn, removes bottlenecks from the network processing path and increases the throughput of the machine.  On a machine with limited bus bandwidth, the advantages of removing this load are immediately evident, as the host can attain full network bandwidth.  Even on a machine with bus bandwidth adequate to sustain full network bandwidth, removal of bus bandwidth load serves to increase the availability of the machine for the processing of user applications, in some cases dramatically.

   An example showing poor performance with copies and improved scaling with copy avoidance is illustrative.  The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system.  In an experiment designed to mimic real-world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared.  One server was Apache, another was an optimized server called Flash, and the third was the Flash server running IO-Lite, called Flash-Lite, with zero copy.  The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served.  As the table below shows, Flash-Lite has better throughput, especially as the number of clients increases.
                    Throughput (requests/second)
      #Clients      Apache      Flash      Flash-Lite
      --------      ------      -----      ----------
          0           520         610          890
         16           390         490          890
         32           360         490          850
         64           360         490          890
        128           310         450          880
        256           310         440          820

   Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead.  Web proxies (which often receive as much data as they send) and complex Web servers based on System Area Networks or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

   There has been extensive research investigation of, and industry experience with, two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs.  In one approach, hardware and/or software changes within a single host reduce processing costs.  In another approach, memory-to-memory networking [MAF+02], the exchange of explicit data placement information between hosts allows them to reduce processing costs.

   The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement.  Commonly this approach is called RDMA, Remote Direct Memory Access.

   As discussed below, research and industry experience have shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic.  The research special-purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based adapters [KSZ95, DWB+93].  In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive.  Those that have done so impose strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

   In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA; performance has been good, and there have not been system or networking difficulties.  RDMA is a single solution.  Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.

   In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94].  While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, specific system optimizations have been complex, fragile, interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].  The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].
   For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that the optimizations must be re-crafted for each variation in architecture.

   With RDMA, application I/O buffers are mapped directly, and the authorized peer may access them without incurring additional processing overhead.  When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the host CPU at all.

   A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance.  These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], and Winsock Direct [Pi01].  Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement.  These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

   By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers.  A large and growing class of applications that take advantage of such capabilities has already emerged, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

   An RDMA solution can be usefully viewed as being composed of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics".  They are distinct in purpose and also in practice - they may be implemented as separate protocols.

   The more fundamental of the two is the direct data placement facility.  This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance for reading and writing.

   The RDMA control functions are semantically layered atop direct data placement.  Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling of their completions.  A "send" facility is provided.

   While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA".  The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer.  Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

   [BT04] develops an architecture for DDP and RDMA atop the Internet Protocol Suite, and is a companion draft to this problem statement.
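   To make the distinction between placement and control concrete, the following self-contained sketch models the two components in ordinary C.  It is illustrative only: the names, types, and semantics are invented for this example and are not taken from [BT04] or from any existing RDMA specification or API, and the memcpy() merely stands in for the adapter's direct placement of arriving data.

      /* Hypothetical model of the DDP/RDMA split; not a real API. */
      #include <stdio.h>
      #include <string.h>

      /* DDP: a memory region the application has chosen to expose. */
      struct ddp_region {
          char   *base;
          size_t  length;
      };

      /* DDP "placement": deliver payload directly into the exposed
       * region at a given offset, with no intermediate buffering. */
      static int ddp_place(struct ddp_region *r, size_t offset,
                           const char *payload, size_t len)
      {
          if (offset + len > r->length)
              return -1;                /* outside the exposed region */
          memcpy(r->base + offset, payload, len); /* stands in for DMA */
          return 0;
      }

      /* RDMA semantics: control layered on placement - here, an
       * "RDMA Write" that places data and then signals completion. */
      static int rdma_write_op(struct ddp_region *r, size_t offset,
                               const char *payload, size_t len)
      {
          int rc = ddp_place(r, offset, payload, len);
          if (rc == 0)
              printf("completion: %zu bytes placed at offset %zu\n",
                     len, offset);
          return rc;
      }

      int main(void)
      {
          char app_buffer[64] = {0};    /* application receive buffer */
          struct ddp_region region = { app_buffer, sizeof(app_buffer) };

          /* The peer's write lands directly in the application buffer. */
          rdma_write_op(&region, 0, "direct placement", 17);
          printf("buffer now holds: %s\n", app_buffer);
          return 0;
      }

   In this model, ddp_place() is the placement facility that other upper layer protocols could also employ, while rdma_write_op() adds the control behavior (here, completion signaling) that the text above attributes to the RDMA layer.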
6. Conclusions

   This Problem Statement concludes that an IP-based, general solution for reducing processing overhead in end-hosts is desirable.

   It has shown that the high overhead of processing network data leads to end-host bottlenecks.  These bottlenecks are in large part attributable to the copying of data.  The bus bandwidth of machines has historically been limited, and the bandwidth of high-speed interconnects taxes it heavily.

   An architectural solution that alleviates these bottlenecks best addresses the issue.  Further, the high speed of today's interconnects and the deployment of these hosts on Internet Protocol-based networks lead to the desirability of layering such a solution on the Internet Protocol Suite.  The architecture described in [BT04] is such a proposal.

7. Security Considerations

   Solutions to the problem of reducing copying overhead in high bandwidth transfers may introduce new security concerns.  Any proposed solution must be analyzed for security vulnerabilities, and any such vulnerabilities addressed.  Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified vulnerabilities - need to be examined and described, and an adequate resolution to them found.

   Layered atop Internet transport protocols, the RDMA protocols will gain leverage from, and must permit integration with, Internet security standards such as IPsec and TLS [IPSEC, TLS].  However, there may be implementation ramifications for certain security approaches with respect to RDMA, due to its copy avoidance.

   IPsec, operating to secure the connection on a packet-by-packet basis, seems to be a natural fit to securing RDMA placement, which operates in conjunction with transport.  Because RDMA enables an implementation to avoid buffering, it is preferable to perform all applicable security protection prior to processing of each segment by the transport and RDMA layers.  Such a layering enables the most efficient secure RDMA implementation.

   The TLS record protocol, on the other hand, is layered on top of reliable transports and cannot provide such security assurance until an entire record is available, which may require the buffering and/or assembly of several distinct messages prior to TLS processing.  This defers RDMA processing and introduces overheads that RDMA is designed to avoid.  TLS is therefore viewed as potentially a less natural fit for protecting the RDMA protocols.

   It is necessary to guarantee properties such as confidentiality, integrity, and authentication on an RDMA communications channel.  However, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy.  Therefore the RDMA design must address protection against such attacks.  For example, an RDMA peer should not be able to read or write memory regions without prior consent.

   Further, it must not be possible to evade memory consistency checks at the recipient.  The RDMA design must allow the recipient to rely on its consistent memory contents by explicitly controlling peer access to memory regions at appropriate times.

   Peer connections that do not pass authentication and authorization checks by upper layers must not be permitted to begin processing in RDMA mode with an inappropriate endpoint.
Once associated, peer accesses to memory regions must be authenticated and made subject to authorization checks in the context of the association and connection on which they are to be performed, prior to any transfer operation or data being accessed.

   The RDMA protocols must ensure that these region protections be under strict application control.  Remote access to local memory by a network peer is particularly important in the Internet context, where such access can be exported globally.

8. Terminology

   This section contains general terminology definitions for this document and for Remote Direct Memory Access in general.

   Remote Direct Memory Access (RDMA)
      A method of accessing memory on a remote system in which the local system specifies the location of the data to be transferred.

   RDMA Protocol
      A protocol that supports RDMA Operations to transfer data between systems.

   Fabric
      The collection of links, switches, and routers that connect a set of systems.

   Storage Area Network (SAN)
      A network where disks, tapes, and other storage devices are made available to one or more end-systems via a fabric.

   System Area Network
      A network where clustered systems share services, such as storage and interprocess communication, via a fabric.

   Fibre Channel (FC)
      An ANSI standard link layer with associated protocols, typically used to implement Storage Area Networks.  [FIBRE]

   Virtual Interface Architecture (VI, VIA)
      An RDMA interface definition developed by an industry group and implemented with a variety of differing wire protocols.  [VI]

   Infiniband (IB)
      An RDMA interface, protocol suite, and link layer specification defined by an industry trade association.  [IB]

9. Acknowledgements

   Jeff Chase generously provided many useful insights and information.  Thanks to Jim Pinkerton for many helpful discussions.

10. Informative References

   [ATM]     The ATM Forum, "Asynchronous Transfer Mode Physical Layer Specification", af-phy-0015.000, etc., drafts available from http://www.atmforum.com/standards/approved.html

   [BCF+95]  N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995

   [BJM+96]  G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996

   [BLA+94]  M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153

   [Br99]    J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

   [BS96]    J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291

   [BT04]    S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-06, October 2004

   [CFF+94]  C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42

   [CGY01]   J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume: 39, Issue: 4, April 2001, pp. 68-74, http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

   [Ch96]    H. K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996

   [Ch02]    Jeffrey Chase, Personal communication

   [CJRS89]  D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume: 27, Issue: 6, June 1989, pp. 23-29

   [CT90]    D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990

   [DAFS]    DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

   [DAPP93]  P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17

   [DP93]    P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993

   [DWB+93]  C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993

   [EBBV95]  T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995

   [FDDI]    International Standards Organization, "Fibre Distributed Data Interface", ISO/IEC 9314, committee drafts available from http://www.iso.org

   [FGM+99]  R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999

   [FIBRE]   ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

   [HP97]    J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997

   [IB]      InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

   [KP96]    J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996

   [KSZ95]   K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95

   [Ma02]    K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002

   [MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D.
Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002

   [Mc95]    J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995

   [Ne00]    A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000, http://serverwatch.internet.com/news/2000_07_24_a.html

   [Pa01]    M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001, http://serverwatch.internet.com/news/2001_02_07_a.html

   [PAC+97]  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997

   [PDZ99]   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999

   [Pi01]    J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

   [Po81]    J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981

   [QUAD]    Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

   [SDP]     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

   [SRVNET]  R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995

   [STREAM]  J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

   [TK95]    M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

   [VI]      Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

   [Wa97]    J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38

Authors' Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA  01810 USA

   Phone: +1 978 689 1614
   Email: steph@sandburst.com

   Jeffrey C. Mogul
   Western Research Laboratory
   Hewlett-Packard Company
   1501 Page Mill Road, MS 1251
   Palo Alto, CA  94304 USA

   Phone: +1 650 857 2206 (email preferred)
   Email: JeffMogul@acm.org

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA  95134 USA

   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Tom Talpey
   Network Appliance
   375 Totten Pond Road
   Waltham, MA  02451 USA

   Phone: +1 781 768 5329
   Email: thomas.talpey@netapp.com

Full Copyright Statement

   Copyright (C) The Internet Society (2004).  This document is subject to the rights, licenses and restrictions contained in BCP 78 and except as set forth therein, the authors retain all their rights.
   This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights.  Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard.  Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the Internet Society.