                                                  Allyn Romanow (Cisco)
Internet-Draft                                          Jeff Mogul (HP)
Expires: August 2003                                 Tom Talpey (NetApp)
                                              Stephen Bailey (Sandburst)

                     RDMA over IP Problem Statement
                 draft-ietf-rddp-problem-statement-01.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This draft addresses an IP-based solution to the problem of high
   system costs due to network I/O copying in end-hosts at high
   speeds.  The problem is due to the high cost of memory bandwidth,
   and it can be substantially improved using "copy avoidance."
   The high overhead has prevented TCP/IP from being used as an
   interconnection network.

Table of Contents

   1.    Introduction
   2.    The high cost of data movement operations in network I/O
   2.1.  Copy avoidance reduces processing overhead
   3.    Memory bandwidth is the root cause of the problem
   4.    High copy overhead is problematic for many key Internet
         applications
   5.    Copy Avoidance Techniques
   5.1.  A Conceptual Framework: DDP and RDMA
   6.    Security Considerations
   7.    Acknowledgements
   8.    References
         Authors' Addresses
         Full Copyright Statement

1.  Introduction

   This draft considers the problem of high host processing overhead
   associated with network I/O under high speed conditions.  This
   problem is often referred to as the "I/O bottleneck" [CT90].  More
   specifically, the source of high overhead of interest here is data
   movement operations - copying.  This issue is not to be confused
   with TCP offload, which is not addressed here.  "High speed" refers
   to conditions where the network link speed is high relative to the
   bandwidths of the host CPU and memory.  With today's computer
   systems, 1 Gbits/s and over is considered high speed.

   High costs associated with copying are an issue primarily for large
   scale systems.  Although smaller systems such as rack-mounted PCs
   and small workstations would benefit from a reduction in copying
   overhead, the benefit to smaller machines will come primarily over
   the next few years, as they scale up the amount of bandwidth they
   handle.  Today it is large system machines with high bandwidth
   feeds, usually multiprocessors and clusters, that are adversely
   affected by copying overhead.  Examples of such machines include
   all varieties of servers: database servers, storage servers,
   application servers for transaction processing, e-commerce, and web
   serving, as well as servers for content distribution, video
   distribution, backups, data mining and decision support, and
   scientific computing.

   Note that such servers almost exclusively service many concurrent
   sessions (transport connections), which, in aggregate, are
   responsible for > 1 Gbits/s of communication.  Nonetheless, the
   copying overhead for a given load is the same whether it comes from
   few or from many sessions.

   The I/O bottleneck, and the role of data movement operations, have
   been widely studied in research and industry over approximately the
   last 14 years, and we draw freely on these results.  Historically,
   the I/O bottleneck has received attention whenever new networking
   technology has substantially increased line rates: 100 Mbits/s FDDI
   and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet.  In earlier
   speed transitions, the availability of memory bandwidth allowed the
   I/O bottleneck issue to be deferred.  Now, however, this is no
   longer the case.  While the I/O problem is significant at 1
   Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is
   motivating an upsurge of activity in industry and research [DAFS,
   IB, VI, CGY01, Ma02, MAF+02].

   Because of the high overhead of end-host processing in current
   implementations, the TCP/IP protocol stack is not used for high
   speed transfer.  Instead, special purpose network fabrics, using a
   technology generally known as remote direct memory access (RDMA),
   have been developed and are widely used.  RDMA is a set of
   mechanisms that allow the network adapter, under control of the
   application, to steer data directly into and out of application
   buffers.  Examples of such interconnection fabrics include Fibre
   Channel [FIBRE] for block storage transfer, Virtual Interface
   Architecture [VI] for database clusters, and InfiniBand [IB],
   Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area
   Networks.  These link level technologies limit application scaling
   in both distance and size, meaning that the number of nodes cannot
   be arbitrarily large.

   This problem statement substantiates the claim that in network I/O
   processing, high overhead results from data movement operations,
   specifically copying; and that copy avoidance significantly
   decreases the processing overhead.  It describes when and why the
   high processing overheads occur, explains why the overhead is
   problematic, and points out which applications are most affected.

   In addition, this document introduces an architectural approach to
   solving the problem, which is developed in detail in [BT02].  It
   also discusses how the proposed technology may introduce security
   concerns and how they should be addressed.

2.  The high cost of data movement operations in network I/O

   A wealth of data from research and industry shows that copying is
   responsible for substantial amounts of processing overhead.  It
   further shows that even in carefully implemented systems,
   eliminating copies significantly reduces the overhead, as
   referenced below.

   Clark et al. [CJRS89] showed in 1989 that TCP [Po81] processing
   overhead is attributable both to operating system costs, such as
   interrupts, context switches, process management, buffer
   management, and timer management, and to the costs associated with
   processing individual bytes, specifically computing the checksum
   and moving data in memory.  They found that moving data in memory
   is the more important of the two costs, and their experiments
   showed that memory bandwidth is the greatest source of limitation.
   In the data presented [CJRS89], 64% of the measured microsecond
   overhead was attributable to data touching operations, and 48% was
   accounted for by copying.  The system measured was Berkeley TCP on
   a Sun-3/60, using 1460 Byte Ethernet packets.

   In a well-implemented system, copying can occur between the network
   interface and the kernel, and between the kernel and application
   buffers - two copies, each of which costs two memory bus crossings
   (one read and one write) - on both the read and write paths.
   Although in certain circumstances it is possible to do better,
   usually two copies are required on receive.

   Subsequent work has consistently shown the same phenomenon as the
   earlier Clark study.  A number of studies report that data-touching
   operations, checksumming and data movement, dominate the processing
   costs for messages longer than 128 Bytes [BS96, CGY01, Ch96,
   CJRS89, DAPP93, KP96].  For smaller sized messages, per-packet
   overheads dominate [KP96, CGY01].
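
   The bus crossing arithmetic above can be made concrete with a small
   sketch.  The following Python fragment is purely illustrative (the
   crossing counts follow the receive path just described, counting
   the adapter's DMA deposit of the data into memory as one crossing);
   it tallies crossings per received byte and the share of the bus
   load that copy avoidance would remove:

      # Memory bus crossings per received byte for different receive
      # paths (illustrative model; counts follow the text above).
      crossings = {
          "2-copy (NIC -> kernel -> application)": 1 + 2 + 2,
          "1-copy ('very best' receive path)":     1 + 2,
          "0-copy (copy avoidance)":               1,  # NIC DMA only
      }

      zero_copy = crossings["0-copy (copy avoidance)"]
      for path, n in crossings.items():
          removed = 1 - zero_copy / n
          print(f"{path}: {n} crossings/byte; "
                f"copy avoidance removes {removed:.0%} of the bus load")

   Under these assumptions, copy avoidance removes two-thirds (67%) of
   the bus load relative to a 1-copy path and 80% relative to a 2-copy
   path - the same figures cited in Section 4 below.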

   The percentage of overhead due to data-touching operations
   increases with packet size, since time spent on per-byte operations
   scales linearly with message size [KP96].  For example, Chu [Ch96]
   reported substantial per-byte latency costs, as a percentage of
   total networking software costs, for an MTU size packet on a
   SPARCstation/20 running memory-to-memory TCP tests over networks
   with 3 different MTU sizes.  The percentages of total software
   costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

   Although many studies report results for data-touching operations,
   including checksumming and data movement together, much work has
   focused just on copying [BS96, Br99, Ch96, TK95].  For example,
   [KP96] reports results that separate the processing times for
   checksumming from those for data movement operations.  For the 1500
   Byte Ethernet size, 20% of the total processing overhead time was
   attributable to copying.  The study used two DECstation 5000/200
   machines connected by an FDDI network.  (In this study, checksumming
   accounted for 30% of the processing time.)

2.1.  Copy avoidance reduces processing overhead

   A number of studies show that eliminating copies substantially
   reduces overhead.  For example, results from copy-avoidance in the
   IO-Lite system [PDZ99], which aimed at improving web server
   performance, show a throughput increase of 43% over an optimized
   web server, and a 137% improvement over an Apache server.  The
   system was implemented in a 4.4BSD-derived UNIX kernel, and the
   experiments used a server system based on a 333 MHz Pentium II PC
   connected to a switched 100 Mbits/s Fast Ethernet.

   There are many other examples where the elimination of copying,
   using a variety of different approaches, showed significant
   improvement in system performance [CFF+94, DP93, EBBV95, KSZ95,
   TK95, Wa97].  We discuss the results of one of these studies in
   detail in order to clarify the significant degree of improvement
   produced by copy avoidance [Ch02].

   Recent work by Chase et al. [CGY01], measuring CPU utilization,
   shows that avoiding copies reduces CPU time spent on data access
   from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU, using an
   AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an
   absolute improvement of 9% due to copy avoidance.

   The total CPU utilization was 35%, with data access accounting for
   24%.  Thus the relative importance of reducing copies is 26%.  At
   370 Mbits/s, the system is not very heavily loaded.  The relative
   improvement in achievable bandwidth is 34%.  This is the
   improvement we would see if copy avoidance were added when the
   machine was saturated by network I/O.

   Note that the improvement from an optimization becomes more
   important if the overhead it targets is a larger share of the total
   cost.  This is what happens if other sources of overhead, such as
   checksumming, are eliminated.  In [CGY01], after removing checksum
   overhead, copy avoidance reduces CPU utilization from 26% to 10%.
   This is a 16% absolute reduction, a 61% relative reduction, and a
   160% relative improvement in achievable bandwidth.

   In fact, today's network interface hardware commonly offloads the
   checksum, which removes that other source of per-byte overhead, and
   coalesces interrupts to reduce per-packet costs.
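
   The achievable-bandwidth figures quoted above follow from the
   utilization numbers by simple proportionality.  A minimal sketch of
   the arithmetic, assuming only that bandwidth at CPU saturation
   scales as offered load divided by CPU utilization:

      # Deriving the relative bandwidth improvements in [CGY01] from
      # the reported CPU utilizations (arithmetic sketch, not a
      # measurement).
      load_mbps = 370.0

      def saturated_bw(cpu_util):
          # Bandwidth achievable if the machine were driven to 100% CPU.
          return load_mbps / cpu_util

      base = saturated_bw(0.35)            # 35% total, incl. 24% data access
      no_copy = saturated_bw(0.35 - 0.09)  # copy avoidance saves 9% absolute
      print(f"{no_copy / base - 1:.1%}")   # ~34.6%, quoted as 34%

      # With checksumming also removed, utilization falls from 26% to 10%:
      print(f"{0.26 / 0.10 - 1:.0%}")      # 160%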

   With checksumming thus offloaded, copying today accounts for a
   relatively larger part of CPU utilization than previously, and
   relatively more benefit is therefore to be gained by reducing it.
   (Of course this argument would be specious if the amount of
   overhead were insignificant, but it has been shown to be
   substantial.)

3.  Memory bandwidth is the root cause of the problem

   Data movement operations are expensive because memory bandwidth is
   scarce relative to network bandwidth and CPU bandwidth [PAC+97].
   This trend existed in the past and is expected to continue into the
   future [HP97, STREAM], especially in large multiprocessor systems.

   With each copy crossing the bus twice, network processing overhead
   is high whenever network bandwidth is large in comparison to CPU
   and memory bandwidths.  Generally, with today's end-systems, the
   effects are observable at network speeds over 1 Gbits/s.

   A common question is whether an increase in CPU processing power
   alleviates the problem of high processing costs of network I/O.
   The answer is no; it is memory bandwidth that is the issue.  Faster
   CPUs do not help if the CPU spends most of its time waiting for
   memory [CGY01].

   The widening gap between microprocessor performance and memory
   performance has long been a widely recognized and well-understood
   problem [PAC+97].  Hennessy [HP97] shows microprocessor performance
   growing at 60% per year from 1980-1998, while DRAM access time
   improved at only 10% per year, giving rise to an increasing
   "processor-memory performance gap".

   Another source of relevant data is the STREAM Benchmark Reference
   Information website, which provides information on the STREAM
   benchmark [STREAM].  The benchmark is a simple synthetic program
   that measures sustainable memory bandwidth (in MBytes/s) and the
   corresponding computation rate for simple vector kernels (measured
   in MFLOPS).  The website tracks sustainable memory bandwidth for
   hundreds of machines and all major vendors.

   Over all the systems measured, processing performance increased at
   an average of 50% per year from 1985-2001, while sustainable memory
   bandwidth increased at an average of 35% per year from 1975-2001.
   A similar 15% per year lead of processing bandwidth over memory
   bandwidth shows up in another statistic, machine balance [Mc95], a
   measure of the relative rate of CPU to memory bandwidth:
   (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

   Network bandwidth has been increasing about 10-fold roughly every 8
   years, a growth rate of about 33% per year.

   A typical example illustrates how unfavorably memory bandwidth
   compares with link speed.  Consider a modern uniprocessor PC, for
   example a 1.2 GHz Athlon in 2001, whose sustainable memory
   bandwidth is reported by the STREAM benchmark.  Such a machine
   moves the data 3 times in doing a receive operation - once for the
   network interface to deposit the data in memory, and twice for the
   CPU to copy the data.  With 1 GBytes/s of memory bandwidth, where
   each read or each write counts against that budget, the machine
   could handle approximately 2.67 Gbits/s of network bandwidth, one
   third of the copy bandwidth.
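
   Spelled out, the arithmetic of this example is (an illustrative
   sketch; the figures are those quoted in the text):

      # Receive-path capacity of the example machine: 1 GBytes/s
      # memory bandwidth, 3 bus crossings per received byte.
      mem_bw_gbytes_s = 1.0   # each read or write counts against this
      crossings = 3           # NIC deposit + one CPU copy (read + write)

      max_rx_gbits_s = mem_bw_gbytes_s / crossings * 8
      print(f"{max_rx_gbits_s:.2f} Gbits/s")   # ~2.67 Gbits/s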

   But this assumes 100% utilization, which is not achievable, and
   more importantly the machine would be totally consumed!  (A rule of
   thumb for databases is that at most 20% of the machine should be
   required to service I/O, leaving 80% for the database application.
   And the less, the better.)

   In 2001, 1 Gbits/s links were common.  An application server might
   typically have two 1 Gbits/s connections - one back-end connection
   to a storage server and one front-end connection, say for serving
   HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In
   our typical example, the machine could handle 2.7 Gbits/s at its
   theoretical maximum while doing nothing else.  This means that the
   machine basically could not keep up with the communication demands
   in 2001; given the relative growth trends, the situation only gets
   worse.

4.  High copy overhead is problematic for many key Internet
    applications

   If a significant portion of the resources on an application machine
   is consumed in network I/O rather than in application processing,
   it becomes difficult for the application to scale, i.e., to handle
   more clients and to offer more services.

   Several years ago the applications most affected were streaming
   multimedia, parallel file systems, and supercomputing on clusters
   [BS96].  Today, in addition, the applications that suffer from
   copying overhead are more central to Internet computing - they
   store, manage, and distribute the information of the Internet and
   of the enterprise.  They include database applications doing
   transaction processing, e-commerce, web serving, decision support,
   content distribution, video distribution, and backups.  Clusters
   are typically used for this category of application, since they
   have advantages in availability and scalability.

   Today these applications, which provide and manage Internet and
   corporate information, are typically run in data centers that are
   organized into three logical tiers.  One tier is typically a set of
   web servers connecting to the WAN.  The second tier is a set of
   application servers that run the specific applications, usually on
   more powerful machines, and the third tier is back-end databases.
   Physically, the first two tiers - web server and application server
   - are usually combined [Pi01].  For example, an e-commerce server
   communicates with a database server and with a customer site, a
   content distribution server connects to a server farm, and an OLTP
   server connects to a database and a customer site.

   When network I/O uses too much memory bandwidth, performance on
   network paths between tiers can suffer.  (There might also be
   performance issues on SAN paths used either by the database tier or
   the application tier.)  The high overhead from network-related
   memory copies diverts system resources from other application
   processing.  It can also create bottlenecks that limit total system
   performance.

   There is a large and growing number of these application servers
   distributed throughout the Internet.  In 1999 approximately 3.4
   million server units were shipped; in 2000, 3.9 million units; and
   the estimated annual growth rate for 2000-2004 was 17 percent
   [Ne00, Pa01].

   There is high motivation to maximize the processing capacity of
   each CPU, because scaling by adding CPUs, one way or another, has
   drawbacks.  For example, adding CPUs to a multiprocessor will not
   necessarily help, because a multiprocessor improves performance
   only when the memory bus has additional bandwidth to spare.
   Clustering can add additional complexity to handling the
   applications.

   In order to scale a cluster or multiprocessor system, one must
   proportionately scale the interconnect bandwidth.  Interconnect
   bandwidth governs the performance of communication-intensive
   parallel applications; if this bandwidth (often expressed in terms
   of "bisection bandwidth") is too low, adding additional processors
   cannot improve system throughput.  Interconnect latency can also
   limit the performance of applications that frequently share data
   between processors.

   So, excessive overheads on network paths in a "scalable" system can
   both require the use of more processors than optimal and reduce the
   marginal utility of those additional processors.

   Copy avoidance scales a machine upward by removing at least two-
   thirds of the bus bandwidth load relative to the "very best" 1-copy
   (on receive) implementations, and at least 80% of the bandwidth
   overhead relative to 2-copy implementations.

   An example showing poor performance with copies and improved
   scaling with copy avoidance is illustrative.  The IO-Lite work
   [PDZ99] shows higher server throughput, servicing more clients,
   using a zero-copy system.  In an experiment designed to mimic real
   world web conditions by simulating the effect of TCP WAN
   connections on the server, the performance of 3 servers was
   compared.  One server was Apache, another was an optimized server
   called Flash, and the third was Flash running IO-Lite with zero
   copy, called Flash-Lite.  The measurement was of throughput, in
   requests per second, as a function of the number of slow background
   clients being served.  As the table shows, Flash-Lite has better
   throughput, especially as the number of clients increases.

      #Clients    Apache    Flash    Flash-Lite
                     (throughput, requests/s)
          0          520      610       890
         16          390      490       890
         32          360      490       850
         64          360      490       890
        128          310      450       880
        256          310      440       820

   Traditional web servers (which mostly send data and can keep most
   of their content in the file cache) are not the worst case for copy
   overhead.  Web proxies (which often receive as much data as they
   send) and complex web servers based on SANs or multi-tier systems
   will suffer more from copy overheads than in the example above.

5.  Copy Avoidance Techniques

   There has been extensive research investigation of, and industry
   experience with, two main alternative approaches to eliminating
   data movement overhead, often along with improvements to other
   Operating System processing costs.  In one approach, hardware
   and/or software changes within a single host reduce processing
   costs.  In the other approach, memory-to-memory networking
   [MAF+02], the hosts use a networking protocol to exchange
   information that allows them to reduce processing costs.

   The single host approaches range from new hardware and software
   architectures [KSZ95, Wa97, DWB+93] to new or modified software
   systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on
   using a networking protocol to exchange information, the network
   adapter, under control of the application, places data directly
   into and out of application buffers, reducing the need for data
   movement.  Commonly this approach is called RDMA, Remote Direct
   Memory Access.

   As discussed below, research and industry experience have shown
   that copy avoidance techniques confined to the receiver processing
   path alone are problematic.
   The special-purpose host adapter systems developed in research had
   good performance and can be seen as precursors of the commercial
   RDMA-based NICs [KSZ95, DWB+93].  In software, many implementations
   have successfully achieved zero-copy transmit, but few have
   accomplished zero-copy receive.  Those that have done so impose
   strict alignment and no-touch requirements on the application,
   greatly reducing the portability and usefulness of the
   implementation.

   In contrast, experience with memory-to-memory systems that permit
   RDMA has been satisfactory - performance has been good, and there
   have not been system or networking difficulties.  RDMA is a single
   solution: once implemented, it can be used with any OS and machine
   architecture, and it does not need to be revised when either of
   these changes.

   In early work, one goal of the software approaches was to show that
   TCP could go faster with appropriate OS support [CJRS89, CFF+94].
   While this goal was achieved, further investigation and experience
   showed that, though it is possible to craft software solutions, the
   specific system optimizations are complex, fragile, interdependent
   with other system parameters in intricate ways, and often of only
   marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].
   The network I/O system interacts with other aspects of the
   Operating System, such as the machine architecture, file I/O, and
   disk I/O [Br99, Ch96, DP93].

   For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
   page remapping, shows that the results are highly interdependent
   with other systems, such as the file system, and that the
   particular optimizations are specific to particular architectures,
   meaning that the optimizations must be re-crafted for each
   architectural variation.

   A number of research projects and industry products have been based
   on the memory-to-memory approach to copy avoidance.  These include
   U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], InfiniBand [IB],
   and Winsock Direct [Pi01].  Several memory-to-memory systems have
   been widely used and have generally been found to be robust, to
   have good performance, and to be relatively simple to implement.
   These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and
   Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-
   to-memory architectures have been used widely in scientific
   applications and in data centers for block storage, file system
   access, and transaction processing.

   By exporting direct memory access "across the wire", applications
   may direct the network stack to manage all data directly from
   application buffers.  A large and growing class of applications has
   already emerged that takes advantage of such capabilities,
   including all the major databases, as well as file systems such as
   DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1.  A Conceptual Framework: DDP and RDMA

   An RDMA solution can be usefully viewed as comprising two distinct
   components: "direct data placement (DDP)" and "remote direct memory
   access (RDMA) semantics".  They are distinct in purpose and also in
   practice - they may be implemented as separate protocols.

   The more fundamental of the two is the direct data placement
   facility.
   This is the means by which memory is exposed to the remote peer in
   an appropriate fashion, and the means by which the peer may access
   it, for instance for reading and writing.

   The RDMA control functions are semantically layered atop direct
   data placement.  Included are operations that provide "control"
   features, such as connection establishment and termination, the
   ordering of operations, and the signaling of their completions.  A
   "send" facility is also provided.

   While the functions (and potentially the protocols) are distinct,
   historically both aspects taken together have been referred to as
   "RDMA".  The facilities of direct data placement are useful in and
   of themselves, and may be employed by other upper layer protocols
   to facilitate data transfer.  Therefore, it is often useful to
   refer to DDP as the data placement functionality and RDMA as the
   control aspect.

   [BT02] develops an architecture for DDP and RDMA, and is a
   companion draft to this problem statement.
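
   To make the division of labor concrete, the following hypothetical
   sketch separates a DDP-like placement layer, which exposes tagged
   buffers and places incoming data directly into them, from RDMA-like
   control semantics layered above it.  All names here are invented
   for illustration; the actual protocols are specified in [BT02].

      # Hypothetical sketch of the DDP/RDMA split; all names invented.

      class DDPLayer:
          """Direct data placement: expose memory, place peer data in it."""
          def __init__(self):
              self.regions = {}                # steering tag -> bytearray

          def expose(self, stag, length):
              self.regions[stag] = bytearray(length)

          def place(self, stag, offset, data):
              # Data lands directly in the exposed buffer - no
              # intermediate copy into a kernel or reassembly buffer.
              buf = self.regions[stag]
              buf[offset:offset + len(data)] = data

      class RDMALayer:
          """Control semantics layered on DDP: ordering, completions, send."""
          def __init__(self, ddp):
              self.ddp = ddp
              self.completions = []            # signalled in order

          def rdma_write(self, stag, offset, data):
              self.ddp.place(stag, offset, data)
              self.completions.append(("write", stag))

          def send(self, data):
              # Untagged "send" facility: delivered to a receive buffer
              # chosen by the data sink, not named by the sender.
              self.completions.append(("send", bytes(data)))

      ddp = DDPLayer()
      ddp.expose(stag=7, length=4096)
      rdma = RDMALayer(ddp)
      rdma.rdma_write(stag=7, offset=0, data=b"application payload")

   The point of the split is visible even in this toy: the placement
   layer knows only how to put bytes where the application said they
   may go, while ordering and completion semantics live entirely in
   the layer above it.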

6.  Security Considerations

   Solutions to the problem of reducing copying overhead in high
   bandwidth transfers via one or more protocols may introduce new
   security concerns.  Any proposed solution must be analyzed for
   security threats, and any such threats must be addressed.  [BSW02]
   raises potential security weaknesses due to resource issues that
   might lead to denial-of-service attacks, overwrites and other
   concurrent operations, the ordering of completions as required by
   the RDMA protocol, and the granularity of transfer.  Each of these
   concerns, plus any other identified threats, needs to be examined
   and described, and an adequate solution found.

   Layered atop Internet transport protocols, the RDMA protocols will
   gain leverage from, and must permit integration with, Internet
   security standards such as IPsec and TLS [IPSEC, TLS].  A thorough
   analysis of the degree to which these protocols mitigate the
   identified threats is required.

   Security for an RDMA design requires more than just securing the
   communication channel.  While it is necessary to be able to
   guarantee channel properties such as privacy, integrity, and
   authentication, these properties cannot defend against all attacks
   from properly authenticated peers, which might be malicious,
   compromised, or buggy.  For example, an RDMA peer should not be
   able to read or write memory regions without prior consent.

   Further, it must not be possible to evade consistency checks at the
   recipient.  For example, the RDMA design should not allow a peer to
   update a region after the completion of an authorized update.

   The RDMA protocols must ensure that regions addressable by RDMA
   peers are kept under strict application control.  Remote access to
   local memory by a network peer introduces a number of potential
   security concerns.  This becomes particularly important in the
   Internet context, where such access can be exported globally.

   The RDMA protocols carry, in part, what is essentially user
   information, explicitly including addressing information and
   operation type (read or write), and implicitly including protection
   and attributes.  As such, the protocols require checking of these
   higher level aspects in addition to the basic formation of
   messages.  The semantics associated with each class of error must
   be clearly defined, and the expected action to be taken on a
   mismatch must be specified.  In some cases, this will result in a
   catastrophic error on the RDMA association; in others, a local or
   remote error may be signalled.  Certain of these errors may require
   consideration of abstract local semantics, which must be carefully
   specified so as to provide useful behavior while not constraining
   the implementation.
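
   As an illustration of the kind of higher level checking called for
   above, a data sink might validate every tagged operation against
   application-registered state before placing any data.  The sketch
   below is not a specification; the field names, check order, and
   error classes are invented for illustration only.

      # Illustrative checks a data sink might apply to an inbound RDMA
      # operation (a sketch; names and error classes are invented).

      class Region:
          def __init__(self, length, writable, valid=True):
              self.length = length
              self.writable = writable
              self.valid = valid      # application can withdraw consent

      def validate(regions, stag, offset, length, op):
          region = regions.get(stag)
          if region is None or not region.valid:
              return "catastrophic"   # no consent: tear down association
          if op == "write" and not region.writable:
              return "remote_error"   # operation type not permitted
          if offset + length > region.length:
              return "remote_error"   # out-of-bounds access attempt
          return "ok"

      regions = {7: Region(length=4096, writable=True)}
      assert validate(regions, 7, 0, 100, "write") == "ok"
      regions[7].valid = False        # e.g., after an authorized update
      assert validate(regions, 7, 0, 100, "write") == "catastrophic"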
Peterson, "Fbufs: a high-bandwidth cross- 646 domain transfer facility", Proceedings of the 14th ACM 647 Symposium of Operating Systems Principles, December 1993 649 [DWB+93] 650 C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. 651 Lumley, "Afterburner: architectural support for high- 652 performance protocols", Technical Report, HP Laboratories 653 Bristol, HPL-93-46, July 1993 655 [EBBV95] 656 T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A 657 user-level network interface for parallel and distributed 658 computing", Proc. of the 15th ACM Symposium on Operating 659 Systems Principles, Copper Mountain, Colorado, December 3-6, 660 1995 662 [FGM+99] 663 R. Fielding, J. Gettys, J. Mogul, F. Frystyk, L. Masinter, P. 664 Leach, T. Berners-Lee, "Hypertext Transfer Protocol - 665 HTTP/1.1", RFC 2616, June 1999 667 [FIBRE] 668 Fibre Channel Standard 669 http://www.fibrechannel.com/technology/index.master.html 671 [HP97] 672 J. L. Hennessy, D. A. Patterson, Computer Organization and 673 Design, 2nd Edition, San Francisco: Morgan Kaufmann 674 Publishers, 1997 676 [IB] InfiniBand Architecture Specification, Volumes 1 and 2, 677 Release 1.0.a. http://www.infinibandta.org 679 [KP96] 680 J. Kay, J. Pasquale, "Profiling and reducing processing 681 overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol 682 4, No. 6, pp.817-828, December 1996 684 [KSZ95] 685 K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for 686 outboard buffering and checksumming", SIGCOMM'95 688 [Ma02] 689 K. Magoutis, "Design and Implementation of a Direct Access 690 File System (DAFS) Kernel Server for FreeBSD", in Proceedings 691 of USENIX BSDCon 2002 Conference, San Francisco, CA, February 692 11-14, 2002. 694 [MAF+02] 695 K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. 696 Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, 697 "Structure and Performance of the Direct Access File System 698 (DAFS)", accepted for publication at the 2002 USENIX Annual 699 Technical Conference, Monterey, CA, June 9-14, 2002. 701 [Mc95] 702 J. D. McCalpin, "A Survey of memory bandwidth and machine 703 balance in current high performance computers", IEEE TCCA 704 Newsletter, December 1995 706 [MYR] 707 Myrinet, http://www.myricom.com 709 [Ne00] 710 A. Newman, "IDC report paints conflicted picture of server 711 market circa 2004", ServerWatch, July 24, 2000 712 http://serverwatch.internet.com/news/2000_07_24_a.html 714 [Pa01] 715 M. Pastore, "Server shipments for 2000 surpass those in 1999", 716 ServerWatch, February 7, 2001 717 http://serverwatch.internet.com/news/2001_02_07_a.html 719 [PAC+97] 720 D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, 721 C. Kozyrakis, R. Thomas, K. Yelick , "A case for intelligient 722 RAM: IRAM", IEEE Micro, April 1997 724 [PDZ99] 725 V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O 726 buffering and caching system", Proc. of the 3rd Symposium on 727 Operating Systems Design and Implementation, New Orleans, LA, 728 February 1999 730 [Pi01] 731 J. Pinkerton, "Winsock Direct: the value of System Area 732 Networks". http://www.microsoft.com/windows2000/techinfo/ 733 howitworks/communications/winsock.asp 735 [Po81] 736 J. 
Postel, "Transmission Control Protocol - DARPA Internet 737 Program Protocol Specification", RFC 793, September 1981 739 [QUAD] 740 Quadrics Ltd., http://www.quadrics.com 742 [SDP] 743 Sockets Direct Protocol v1.0 745 [SRVNET] 746 Compaq Servernet, 747 http://nonstop.compaq.com/view.asp?PAGE=ServerNet 749 [STREAM] 750 The STREAM Benchmark Reference Information, 751 http://www.cs.virginia.edu/stream/ 753 [TK95] 754 M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O 755 framework for UNIX", Technical Report, SMLI TR-95-39, May 1995 757 [VI] Virtual Interface Architecture Specification Version 1.0. 758 http://www.vidf.org/info/04standards.html 760 [Wa97] 761 J. R. Walsh, "DART: Fast application-level networking via 762 data-copy avoidance", IEEE Network, July/August 1997, pp. 763 28-38 765 Authors' Addresses 767 Stephen Bailey 768 Sandburst Corporation 769 600 Federal Street 770 Andover, MA 01810 USA 772 Phone: +1 978 689 1614 773 Email: steph@sandburst.com 775 Jeffrey C. Mogul 776 Western Research Laboratory 777 Hewlett-Packard Company 778 1501 Page Mill Road, MS 1251 779 Palo Alto, CA 94304 USA 781 Phone: +1 650 857 2206 (email preferred) 782 Email: JeffMogul@acm.org 784 Allyn Romanow 785 Cisco Systems, Inc. 786 170 W. Tasman Drive 787 San Jose, CA 95134 USA 789 Phone: +1 408 525 8836 790 Email: allyn@cisco.com 792 Tom Talpey 793 Network Appliance 794 375 Totten Pond Road 795 Waltham, MA 02451 USA 797 Phone: +1 781 768 5329 798 Email: thomas.talpey@netapp.com 800 Full Copyright Statement 802 Copyright (C) The Internet Society (2003). All Rights Reserved. 804 This document and translations of it may be copied and furnished to 805 others, and derivative works that comment on or otherwise explain 806 it or assist in its implementation may be prepared, copied, 807 published and distributed, in whole or in part, without restriction 808 of any kind, provided that the above copyright notice and this 809 paragraph are included on all such copies and derivative works. 810 However, this document itself may not be modified in any way, such 811 as by removing the copyright notice or references to the Internet 812 Society or other Internet organizations, except as needed for the 813 purpose of developing Internet standards in which case the 814 procedures for copyrights defined in the Internet Standards process 815 must be followed, or as required to translate it into languages 816 other than English. 818 The limited permissions granted above are perpetual and will not be 819 revoked by the Internet Society or its successors or assigns. 821 This document and the information contained herein is provided on 822 an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 823 ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 824 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 825 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 826 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.