                                                    S. Bailey (Sandburst)
Internet-draft                                         D. Garcia (Compaq)
Expires: May 2002                                     J. Hilland (Compaq)
                                                       A. Romanow (Cisco)

                     Direct Access Problem Statement
                 draft-garcia-direct-access-problem-00

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This problem statement describes barriers to the use of Internet Protocols for highly scalable, high bandwidth, low latency transfers necessary in some of today's important applications, particularly applications found within data centers. In addition to describing technical reasons for the problems, it gives an overview of common non-IP solutions to these problems which have been deployed over the years.

The perspective of this draft is that it would be very beneficial to have an IP-based solution for these problems so IP can be used for high speed data transfers within data centers, in addition to IP's many other uses.

Table Of Contents

1.      Introduction
1.1.    High Bandwidth Transfer Overhead
1.2.    Proliferation Of Fabrics in Data Centers
1.3.    Potential Solutions
2.      High Bandwidth Data Transfer In The Data Center
2.1.    Scalable Data Center Applications
2.2.    Client/Server Communication
2.3.    Block Storage
2.4.    File Storage
2.5.    Backup
2.6.    The Common Thread
3.      Non-IP Solutions
3.1.    Proprietary Solutions
3.2.    Standards-based Solutions
3.2.1.  The Virtual Interface Architecture (VIA)
3.2.2.  InfiniBand
4.      Conclusion
5.      Security Considerations
6.      References
        Authors' Addresses
A.      RDMA Technology Overview
A.1     Use of Memory Access Transfers
A.2     Use Of Push Transfers
A.3     RDMA-based I/O Example
        Full Copyright Statement

1. Introduction

Protocols in the IP family offer a huge, ever increasing range of functions, including mail, messaging, telephony, media and hypertext content delivery, block and file storage, and network control. IP has been so successful that applications only use other forms of communication when there is a very compelling reason.
Currently, it is often not acceptable to use IP protocols for high-speed communication within a data center. In these cases, copying data to application buffers consumes CPU cycles that are otherwise needed to perform application functions.

This limitation of IP protocols has not been particularly important until now because the domain of high performance transfers was limited to a relatively specialized niche of low volume applications, such as scientific supercomputing. Applications that needed more efficient transfer than IP could offer simply used other, purpose-built solutions.

As the use of the Internet has become pervasive and critical, the growth in number and importance of data centers has matched the growth of the Internet. The role of the data center is similarly critical. The high-end environment of the data center makes up the core and nexus of today's Internet. Everything goes in and out of data centers.

Applications running within data centers frequently require high bandwidth data transfer. Due to the high host processing overhead of high bandwidth communication in IP, the industry has developed non-IP technology to serve data center traffic. That said, the obstacles to lowering host processing overhead in IP are well understood and straightforward to address. Simple techniques could allow the penetration of existing IP protocols into data centers where non-IP technology is currently used.

Technology advances have made feasible specially designed network interfaces that place IP protocol data directly in application buffers. While it is certainly possible to use control information directly from existing IP protocol messages to place data in application buffers, the sheer number and diversity of current and future IP protocols calls for a generic solution instead. Therefore, the goal is to investigate a generic data placement solution for IP protocols that would allow a single network interface to perform direct data placement for a wide variety of mature, evolving and completely new protocols.

There is a great desire to develop lower overhead, more scalable data transfer technology based on IP. This desire comes from the advantages of using one protocol technology rather than several, and from the many efficiencies of technology based upon a single, widely adopted, open standard.

This document describes the problems that IP faces in delivering highly scalable, high bandwidth data transfer. The first section describes the issues in general. The second section describes several specific scenarios, discussing particular application domains and specific problems that arise. The third section describes approaches that have historically been used to address low overhead, high bandwidth data transfer needs. The appendix gives an overview of how a particular class of non-IP technologies addresses this problem with Remote Direct Memory Access (RDMA).

1.1. High Bandwidth Transfer Overhead

Transport protocols such as TCP [TCP] and SCTP [SCTP] have successfully shielded upper layers from the complexities of moving data between two computers. This has been very successful in making TCP/IP ubiquitous.
However, with current IP implementations, Upper Layer Protocols (ULPs), such as NFS [NFSv3] and HTTP [HTTP], require incoming data packets to be buffered and copied before the data is used.

It is this data copying that is a primary source of overhead in IP data transfers. Copying received data for high bandwidth transfers consumes significant processing time and memory bandwidth. If data is buffered and then copied, the data moves across the memory bus at least three times during the data transfer. By comparison, if the incoming data is placed directly where the application requires it, the data moves across the memory bus only once. This copying overhead currently means that additional processing resources, such as additional processors in a multiprocessor machine, are needed to reach faster and faster wire speeds.

A wide range of ad hoc solutions has been explored to eliminate data copying overhead within the framework of current IP protocols, but despite extensive study, no adequate or general solution yet exists [Chase].
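As a purely illustrative sketch of the two receive paths just described, the following C fragment contrasts the conventional copy path with direct placement. nic_post_receive_buffer() is a hypothetical interface invented for this illustration and does not correspond to any existing API:

   #include <sys/types.h>
   #include <sys/socket.h>

   /* Conventional path: the NIC DMAs the arriving packet into a kernel
    * socket buffer (memory bus crossing 1); the kernel then reads that
    * buffer (crossing 2) and writes the payload into the application
    * buffer supplied here (crossing 3). */
   ssize_t recv_with_copy(int sock, void *app_buf, size_t len)
   {
       return recv(sock, app_buf, len, 0);  /* the copy happens inside the kernel */
   }

   /* Direct placement: the application describes its buffer to a
    * suitably capable network interface before the data arrives, and
    * the NIC writes the payload straight into it, so the payload
    * crosses the memory bus only once.  This declaration is
    * hypothetical, standing in for the kind of interface discussed in
    * Section 1.3 and Appendix A. */
   int nic_post_receive_buffer(int sock, void *app_buf, size_t len);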
1.2. Proliferation Of Fabrics in Data Centers

The current alternative to paying the high costs due to data transfer overhead in data centers is the use of several different communication technologies at once. Data centers are likely to have separate IP (Ethernet), Fibre Channel storage, and InfiniBand, VIA or proprietary interprocess communication (IPC) networks. Special purpose networks are used for storage and IPC to reduce the processor overhead associated with data communications and, in the case of IPC, to reduce latency as well.

Using such proprietary and special purpose solutions runs counter to the requirements of data center computing. Data center designers and operators do not want the expense and complexity of building and maintaining three separate communications networks. Three NICs and three fabric ports are expensive, and they consume valuable I/O card slots, power and machine room space.

A single IP fabric would be far preferable. IP networks are best positioned to fill the role of all three of these existing networks. At 1 to 10 gigabit speeds, current IP interconnects could offer comparable or superior performance characteristics to special purpose interconnects, if it were not for the high overhead and latency of IP data transfers. An IP-based alternative to the IPC and storage fabrics would be less costly and much more easily manageable than maintaining separate communication fabrics.

1.3. Potential Solutions

One frequently proposed solution to the problem of overhead in IP data transfers is to wait for the next generation of faster processors and speedier memories to render the problem irrelevant. However, in the evolution of the Internet, processor and memory speeds are not the only variables that have increased exponentially over time. Data link speeds have grown exponentially as well. Recently, spurred by the demand for core network bandwidth, data link speeds have grown faster than both processor computation rates and processor memory transfer rates. Whatever speed increases occur in processors and memories, it is clear that link speeds will continue to grow aggressively as well.

Rather than relying on increasing CPU performance, non-IP solutions use network interface hardware to attack several distinct sources of overhead. For a small, one-way IP data transfer, typically both the sender and receiver must make several context switches, process several interrupts, and send and receive a network packet. In addition, the receiver must perform at least one data copy. This single transfer could require 10,000 instructions of execution and total time measured in hundreds of microseconds if not milliseconds. The sources of overhead in this transfer are:

o  context switches and interrupts,

o  execution of protocol code,

o  copying the data on the receiver.

Copying competes with DMA and other processor accesses for memory system bandwidth, and all these sources of overhead can also have significant secondary effects on the efficiency of application execution by interfering with system caches.

Depending on the application, each of these sources of overhead may be a small or a large factor in total overhead, but the cumulative effect of all of them is nearly always substantial for high bandwidth transfers. If data transfers are very small, data copying is only a small cost, but context switching and protocol stack execution become performance limiting factors. For large transfers, the most common high bandwidth data transfers, context switching and protocol stack execution can be amortized away, within certain limits, but data copying becomes costly.

Non-IP solutions address these sources of overhead with network interface hardware that:

o  reduces context switches and interrupts with kernel-bypass capability, where the application communicates directly with the network interface without kernel intervention (see the sketch at the end of this section),

o  reduces protocol stack processing with protocol offload hardware that performs some or all protocol processing (e.g. ACK processing),

o  reduces data copying overhead by placing data directly in application buffers.

The application of these techniques reduces both data transfer overhead and data transfer latency. Context switches and data copying are substantial sources of end-to-end latency that are eliminated by kernel-bypass and direct data placement. Offloaded protocol processing can also typically be performed an order of magnitude faster than a comparable, general purpose protocol stack, due to the ability to exploit extensive parallelism in hardware. While protocol offload does reduce overhead, for the vast majority of current high bandwidth data transfer applications, eliminating data copies is much more important.

These techniques, and others, may be equally applicable to reducing the overhead of IP data transfers.
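To make the kernel-bypass technique above concrete, the following rough C sketch shows how an application might post a transmit descriptor to a user-mapped queue and reap its completion by polling, with no system call or interrupt on the fast path. The structures and fields are invented for illustration; real interfaces such as VIA or the InfiniBand verbs differ in detail:

   #include <stdint.h>

   /* Hypothetical user-mapped queue pair, for illustration only. */
   struct descriptor { uint64_t buf_addr; uint32_t length; uint32_t flags; };
   struct completion { uint32_t status;   uint32_t bytes; };

   struct queue_pair {
       volatile struct descriptor *send_queue;   /* mapped NIC send ring   */
       volatile struct completion *comp_queue;   /* mapped completion ring */
       volatile uint32_t          *doorbell;     /* mapped NIC doorbell    */
       uint32_t sq_head, cq_tail, ring_size;
   };

   /* Post a transmit without entering the kernel: write a descriptor
    * into the mapped ring and ring the doorbell. */
   static void post_send(struct queue_pair *qp, void *buf, uint32_t len)
   {
       volatile struct descriptor *d =
           &qp->send_queue[qp->sq_head++ % qp->ring_size];
       d->buf_addr = (uint64_t)(uintptr_t)buf;
       d->length   = len;
       d->flags    = 1;                 /* e.g. a "valid" bit */
       *qp->doorbell = qp->sq_head;     /* tell the NIC; no system call */
   }

   /* Reap the completion by polling the mapped completion queue, so no
    * interrupt or context switch is needed on the fast path. */
   static int poll_completion(struct queue_pair *qp)
   {
       volatile struct completion *c =
           &qp->comp_queue[qp->cq_tail % qp->ring_size];
       while (c->status == 0)
           ;                            /* spin; a real ULP might block eventually */
       qp->cq_tail++;
       return (int)c->bytes;
   }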
2. High Bandwidth Data Transfer In The Data Center

There are numerous uses of high bandwidth data transfers in today's data centers. While these applications are found in the data center, they have implications for the desktop as well. This problem statement focuses on data center scenarios below, but it would be beneficial to find a solution that meets data center needs while possibly remaining affordable for the desktop.

Why is high bandwidth data transfer in the data center important for IP networking? Performance on the Internet, as well as intranets, is dependent on the performance of the data center. Every request, be it a web page, database query or file and print service, goes to or through data center servers. Often a multi-tiered computing solution is used, where multiple machines in the data center satisfy these requests. Despite the explosive growth of the server market, data centers are running into critical limitations that impact every client directly or indirectly. Unlike servers, clients are largely limited in performance by the human at the interface. In contrast, data center performance is limited by the speeds and feeds of the network and I/O devices as well as hardware and software components.

With new protocols such as iSCSI, IP networks are increasingly taking on the functions of special purpose interconnects, such as Fibre Channel. However, the limitations created by the high data transfer overhead described here have not yet been addressed for IP protocols in general.

First and foremost, all the problems illustrated in the scenarios below occur on IP protocol based networks. It is imperative to understand the pervasiveness of IP networks within the data center, and that all of the problems described below occur in IP-based data transfer solutions. Therefore, a solution to these problems will naturally also be a part of the IP protocol suite.

Although the problems discussed below manifest themselves in different ways, investigation into the source of these problems shows a common thread running through them. These scenarios are not an exhaustive list, but rather describe the wide range of problems exhibited in the scalability and performance of the applications and infrastructures encountered in data center computing as a result of high communication overhead.

2.1. Scalable Data Center Applications

A key characteristic of any data center application is its ability to scale as demands increase. For many Internet services, applications must scale in response to the success of the service and the increased demand which results. In other cases, applications must be scaled as capabilities are added to a service, again in response to the success of the service, changes in the competitive environment or the goals of the provider.

Virtually all data center applications require intermachine communication, and therefore, application scalability may be directly limited by communication overhead. From the application viewpoint, every CPU cycle spent performing data transfer is a wasted cycle that affects scalability. For high bandwidth data transfers using IP, this overhead can be 30-40% of available CPU. If an application is running on a single server, and it is scaled by adding a second server, communication overhead of 40% means that the CPU available to the application from two servers is only 120% of that of the single server. The problem is even worse with many servers, because most servers are communicating with more than one other server. If three servers are connected in a pipeline where 40% CPU is required for data transfers to or from another server, the total available CPU power would still be only 120% of the power of a single server! Not all data center applications require this level of communication, but many do. The high overhead of data transfers in IP severely impacts the viability of IP for scalable data center applications.
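The two-server case above can be restated as a simple back-of-the-envelope model (a simplification, not a measurement): usable capacity is the number of servers times the fraction of each CPU left after communication overhead.

   #include <stdio.h>

   /* Simplified model of Section 2.1: each server loses a fixed
    * fraction of its CPU to data transfer overhead. */
   static double usable_capacity(int servers, double comm_overhead)
   {
       return servers * (1.0 - comm_overhead);
   }

   int main(void)
   {
       /* Two servers, each spending 40% of its CPU on communication,
        * deliver only 120% of the capacity of a single server. */
       printf("%.0f%%\n", 100.0 * usable_capacity(2, 0.40));
       return 0;
   }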
2.2. Client/Server Communication

Client/server communication in the data center is a variation of the scalable data center application scenario, but it applies to standalone servers as well as parallel applications. The overhead of high bandwidth data communication weighs heavily on the server. The server's ability to respond is limited by any communication overhead it incurs.

In addition, client/server application performance is often dominated by data transfer latency characteristics. Reducing latency can greatly improve application performance. Techniques commonly employed in IP network interfaces, such as TCP checksum calculation offload, reduce transfer overhead somewhat, but they typically do not reduce latency at all. Another technique used to reduce latency in IP communication is to dedicate multiple threads of execution, each running on a separate processor, to processing requests concurrently. However, this multithreading solution has limits, as the number of outstanding requests can vastly exceed the number of processors. Furthermore, the effect of multithreading concurrency is additive with any other latency reduction in the data transfers themselves.

To address the problems of high bandwidth IP client/server communication, a solution would ideally reduce both end-to-end communication latency and communication overhead.

2.3. Block Storage

Block storage, in the form of iSCSI [iSCSI] and the IP Fibre Channel protocols [FCIP, iFCP], is a new IP application area of great interest to the storage and data center communities. Just as data centers eagerly desire to replace special-purpose interprocess communication fabrics with IP, there is parallel and equal interest in migrating block storage traffic from special-purpose storage fabrics to IP.

As with other forms of high bandwidth communication, the data transfer overhead in traditional IP implementations, particularly the three bus crossings required for receiving data, may substantially limit data center storage transfer performance compared to what is commonplace with special-purpose storage fabrics. In addition, data copying, even if it is performed within a specialized IP-storage adapter, will substantially increase transfer latency, which can noticeably degrade the performance of both file systems and applications.

Protocol offload and direct data placement comparable to what is provided by existing storage fabric interfaces (Fibre Channel, SCSI, FireWire, etc.) are possible pieces of a solution to the problems created by IP data transfer overhead for block storage. It has been claimed that block storage is such an important application that IP block storage protocols should be directly offloaded by network interface hardware, rather than through use of a generic, application-independent offload solution. However, even the block storage community recognizes the benefits of more general-purpose ways to reduce IP transfer overhead, and most expect to eventually use such general-purpose capabilities for block storage when they become available, if for no other reason than that it reduces the risks and impact of changing and evolving the block storage protocols themselves.

2.4. File Storage

The file storage application exhibits a compound problem within the data center.
File servers and clients are subject to the communication characteristics of both block storage and client/server applications. The problems created by high transfer overhead are particularly acute for file storage implementations that are built with a substantial amount of user-mode code. In any form of file storage application, many CPU cycles are spent traversing the kernel mode file system, disk storage subsystems, and protocol stacks, and driving network hardware, similar to the block storage scenario. In addition, file systems must address the communication problems of a distributed client/server application. There may be substantial shared state distributed among servers and clients, creating the need for extensive communication to maintain this shared state.

A solution to the communication overhead problems of IP data transfer for file storage involves a union of the approaches for efficient disk storage and efficient client/server communication, as discussed above. In other words, both low overhead and low latency communication are goals.

2.5. Backup

One of the problems with IP-based storage backup is that it consumes a great deal of the host CPU's time and resources. Unfortunately, the high overhead required for IP-based backup is typically not acceptable in an active data center.

The challenge of backup is that it is usually performed on machines which are also actively participating in the services the data center is providing. At a minimum, a machine performing backup must maintain some synchronization with other machines modifying the state being backed up, so that the backup is coherent. As discussed in the section above on Scalable Data Center Applications, any overhead placed on active machines can substantially affect scalability and solution cost.

Backup solutions on specialized storage fabrics allow systems to back up the data without the host processor ever touching the data. Data is transferred to the backup device from disk storage through host memory, or sometimes even directly without passing through the host, as a so-called third party transfer.

Storage backup in the data center could be done with IP if data transfer overhead were substantially reduced.

2.6. The Common Thread

There is a common thread running through the problems of using IP communication in all of these scenarios. The union of the solutions to these problems is a high bandwidth, low latency, low CPU overhead data transfer solution. Non-IP solutions offer technical solutions to these problems, but they lack the ubiquity and price/performance characteristics necessary for a viable, general solution.

3. Non-IP Solutions

The most refined non-IP solution to reducing communication overhead has a rich history reaching back almost 20 years. This solution uses a data transfer metaphor called Remote Direct Memory Access (RDMA). See Appendix A for an introduction to RDMA. In spite of the technical advantages of the various non-IP solutions, all have ultimately lacked the ubiquity and price/performance characteristics necessary to gain widespread usage. This lack of widespread adoption has also resulted in various shortcomings of particular incarnations, such as incomplete integration with native platform capabilities or other software implementation limitations.
In addition, no non-IP solutions offer the massive range of network scalability that IP protocols support. Non-IP solutions typically only scale to tens or hundreds of nodes in a single network, and have no story to tell about interconnection of multiple networks.

Several non-IP solutions will be briefly described here to show the state of experience with this set of problems.

3.1. Proprietary Solutions

Low overhead communication technologies have traditionally been developed as proprietary value-added products by computer platform vendors. Such solutions were tightly integrated with platform operating systems and did provide powerful, well integrated communication capabilities. However, applications written for one solution were not portable to others. Also, the solutions were expensive, as is typically the case with value-added technologies.

The earliest example of a low overhead communication technology was Digital's VAX Cluster Interconnect (CI), first released in 1983. The CI allowed computers and storage to be connected as peers on a small multipoint network used for both IPC and I/O. The CI made VAX/VMS Clusters the only alternative to mainframes for large commercial applications for many years.

Tandem ServerNet was another proprietary block transfer technology, developed in the mid 1990s. It has been used to perform disk I/O, IPC and network I/O in the Himalaya product line. This architecture allows the Himalaya platform to be inherently scalable because the software has been designed to take advantage of the offload capability and zero copy techniques. Tandem attempted to take this product into the Industry Standard Server market, but its price/performance characteristics and the fact that it was a proprietary solution prevented wide adoption.

Silicon Graphics used a standards-based network fabric, HiPPI-800, but built a proprietary low overhead communication mechanism on top. Other platform vendors such as IBM, HP and Sun have also offered a variety of proprietary low overhead communication solutions over the years.

3.2. Standards-based Solutions

Increasing fluidity in the landscape of major platform vendors has drastically increased the desire for all applications to be portable. Platforms which were here yesterday might be gone tomorrow. This has killed the willingness of application and data center designers and maintainers to use proprietary features of any platform.

Unwillingness to continue to use proprietary interconnects forced platform vendors to collaborate on standards-based low overhead communication technologies to replace the proprietary ones which had become critical to building data center applications. Two of these standards-based solutions, considered to be roughly parent and child, are described below.

3.2.1. The Virtual Interface Architecture (VIA)

VIA [VI] was a technology jointly developed by Compaq, Intel and Microsoft. VIA helped prove the feasibility of doing IPC offload, user mode I/O and traditional kernel mode I/O as well.

While VIA implementations met with some limited success, VIA turned out to fill only a small market niche, for several reasons. First, commercially available operating systems lacked a pervasive interface.
Second, because the standard did not define a wire protocol, no two implementations of the VIA standard were interoperable on the wire. Third, different implementations were not interoperable at the software layer either, since the API definition was an appendix to the specification and not part of the specification itself.

Yet with parallel applications, VIA proved itself time and again. It was used to set a new benchmark record in the terabyte data sort at Sandia Labs. It set new TPC-C records for distributed databases, and it was used to set new TPC-C records as the client-server communication link. VIA also set the foundation for work such as the Sockets Direct Protocol through the implementation of the Winsock Direct Protocol in Windows 2000 [WSD]. And it gave the DAFS Collaborative a rallying point for a common programming interface [DAFSAPI].

3.2.2. InfiniBand

InfiniBand [IB] was developed by the InfiniBand Trade Association (IBTA) as a low overhead communication technology that provides remote direct memory access transfers, including interlocked atomic operations, as well as traditional datagram-style transfers.

InfiniBand defines a new electromechanical interface, card and cable form factors, a physical interface, a link layer, a transport layer and an upper layer software transport interface. The IBTA has also described a fabric management infrastructure to initialize and maintain the fabric.

While all of the specialized technology of InfiniBand does provide impressive performance characteristics, IB lacks the ubiquity and price/performance of IP. In addition, management of InfiniBand fabrics will require new tools and training, and InfiniBand lacks the huge base of applications, protocols, and thoroughly engineered security and routing technology available with IP.

4. Conclusion

This document has described the set of problems that hinder the widespread use of IP for high speed data transfers in data centers. There have been a variety of non-IP solutions available, but these have met with only limited success, for different reasons. After many years of experience in both the IP and non-IP domains, the problems appear to be reasonably well understood, and a direction to a solution is suggested by this study. However, additional investigation, and subsequent work on an architecture and the necessary protocol(s) for reducing overhead in high bandwidth IP data transfers, are still required.

5. Security Considerations

This draft states a problem and, therefore, does not require particular security considerations other than those dedicated to squelching the free spread of ideas, should the problem discussion itself be considered seditious or otherwise unsafe.

6. References

[Chase] Jeff S. Chase, et al., "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[DAFSAPI] "Direct Access File System Application Programming Interface", version 0.9.5, 09/21/2001. http://www.dafscollaborative.org/tools/dafs_api.pdf

[FCIP] Raj Bhagwat, et al., "Fibre Channel Over TCP/IP (FCIP)", 09/20/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-06.txt
[HTTP] J. Gettys, et al., "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[IB] InfiniBand Architecture Specification, Volumes 1 and 2, release 1.0.a. http://www.infinibandta.org

[iFCP] Charles Monia, et al., "iFCP - A Protocol for Internet Fibre Channel Storage Networking", 10/19/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-ifcp-06.txt

[iSCSI] J. Satran, et al., "iSCSI", 10/01/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-08.txt

[NFSv3] B. Callaghan, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[SCTP] R. R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. J. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[TCP] Postel, J., "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[VI] Virtual Interface Architecture Specification, version 1.0. http://www.viarch.org/html/collateral/san_10.pdf

[WSD] "Winsock Direct and Protocol Offload On SANs", version 1.0, 3/3/2001, from "Designing Hardware for the Microsoft Windows Family of Operating Systems". http://www.microsoft.com/hwdev/network/san

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810
USA

Phone: +1 978 689 1614
Email: steph@sandburst.com

Dave Garcia
Compaq Computer Corp.
19333 Valco Parkway
Cupertino, CA 95014
USA

Phone: +1 408 285 6116
Email: dave.garcia@compaq.com

Jeff Hilland
Compaq Computer Corp.
20555 SH 249
Houston, TX 77070
USA

Phone: +1 281 514 9489
Email: jeff.hilland@compaq.com

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Appendix A. RDMA Technology Overview

This section describes how Remote Direct Memory Access (RDMA) technology, such as the Virtual Interface Architecture (VIA) and InfiniBand (IB), provides for low overhead data transfer. VIA and IB are examples of the RDMA technology also used by many proprietary low overhead data transfer solutions.

The IB and VIA protocols both provide memory access and push transfer semantics. With memory access transfers, data from the local computer is written/read directly to/from an address space of the remote computer. How, when and why buffers are accessed is defined by the ULP layer above IB or VIA.

With push transfers, the data source pushes data to an anonymous receive buffer at the destination. TCP and UDP transfers are both examples of push transfers. VIA and IB both call their push transfer a Send operation, which is a datagram-style push transfer. The data receiver chooses where to place the data; the receive buffer is anonymous with respect to the sender of the data.

A.1 Use of Memory Access Transfers

In the memory access transfer model, the initiator of the data transfer explicitly indicates where data is extracted from or placed on the remote computer. VI and InfiniBand both define memory access read (called RDMA Read) and memory access write (called RDMA Write) transfers. The buffer address is carried in each PDU, allowing the network interface to directly place the data in application buffers.
Placing the data directly into the application's buffer has three significant benefits:

o  CPU and memory bus utilization are lowered by not having to copy the data. Since memory access transfers use buffer addresses supplied by the application, data can be directly placed at its final location.

o  Memory access transfers incur no CPU overhead during transfers if the network interface offloads RDMA (and lower layer) protocol processing. There is enough information in RDMA PDUs for the target network interface to complete RDMA Reads or RDMA Writes without any local CPU action.

o  Memory access transfers allow splitting of ULP headers and data. With memory access transfers, the ULP can control the exact placement of all received data, including ULP headers and ULP data. ULP headers and other control information can be placed in separate buffers from ULP data. This is frequently a distinct advantage compared to having ULP headers and data in the same buffers, as an additional data copy may otherwise be required to separate them.

Providing memory access transfers does not mean a processor's entire memory space is open for unprotected transfers. The remote computer controls which of its buffers can be accessed by memory access transfers. Incoming RDMA Read and RDMA Write operations can only access buffers to which the receiving host has explicitly permitted RDMA accesses. When the ULP allows RDMA access to a buffer, the extent and address characteristics of the buffer can be chosen by the ULP. A buffer could use the virtual address space of the process, it could use a physical address (if allowed), or it could use a new virtual address space created for the individual buffer.

In both IB and VIA, the RDMA buffer is registered with the receiving network interface before RDMA operations can occur. For a typical hardware offload network interface, this is enough information to build an address translation table and associate appropriate security information with the buffer. The address translation table lets the NIC convert the incoming buffer target address into a local physical address.

A.2 Use Of Push Transfers

Memory access transfers contrast with the push transfers typically used by IP applications. With push transfers, the source has no visibility or control over where data will be delivered on the destination machine. While most protocols use some form of push transfer, IB and VIA define a datagram-style push transfer that allows a form of direct data placement on the receive side.

IB and VIA both require the application to pre-post receive buffers. The application pre-posts receive buffers for a connection and they are filled by subsequent incoming Send operations. Since the receive buffer is pre-posted, the network interface can place the data from the incoming Send operation directly into the application's buffer. IB and VIA allow the use of scattered receive buffers to support splitting the ULP header from data within a single Send.

Neither memory access nor push transfers are inherently superior -- each has its merits. Furthermore, memory access transfers can be built atop push transfers or vice versa. However, direct support of memory access transfers allows much lower transfer overhead than if memory access transfers are emulated.
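The following C sketch is offered as a simplified illustration of the two transfer styles described in A.1 and A.2. The rdma_* types and functions are invented for this illustration; they do not correspond to the VIA, InfiniBand or any other real programming interface:

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical interface, loosely modeled on the concepts in this
    * appendix; none of these declarations exist in a real library. */
   typedef struct rdma_conn   rdma_conn_t;    /* an established connection  */
   typedef struct rdma_region rdma_region_t;  /* a registered memory region */

   rdma_region_t *rdma_register(rdma_conn_t *c, void *buf, size_t len);
   int rdma_write(rdma_conn_t *c, rdma_region_t *local,
                  uint64_t remote_addr, uint32_t remote_key, size_t len);
   int rdma_post_recv(rdma_conn_t *c, void *buf, size_t len);
   int rdma_send(rdma_conn_t *c, const void *buf, size_t len); /* push (Send), used by the peer */
   int rdma_wait_completion(rdma_conn_t *c);

   /* Memory access style (A.1): the sender names the remote buffer,
    * using an address and protection key learned out of band (for
    * example, in an earlier Send), and the remote NIC places the data
    * without any remote CPU involvement. */
   static int write_block(rdma_conn_t *c, void *block, size_t len,
                          uint64_t remote_addr, uint32_t remote_key)
   {
       rdma_region_t *mr = rdma_register(c, block, len); /* builds NIC translation and protection state */
       if (mr == NULL)
           return -1;
       if (rdma_write(c, mr, remote_addr, remote_key, len) != 0)
           return -1;
       return rdma_wait_completion(c);   /* data is now in the remote buffer */
   }

   /* Push style (A.2): the receiver pre-posts an anonymous buffer; an
    * incoming Send is placed directly into it, but the sender never
    * learns where the data landed. */
   static int receive_message(rdma_conn_t *c, void *buf, size_t len)
   {
       if (rdma_post_recv(c, buf, len) != 0)  /* must precede the incoming Send */
           return -1;
       return rdma_wait_completion(c);        /* buf now holds the ULP message */
   }

The I/O example in A.3 below follows the same pattern: register a buffer, let the peer transfer the data, then use a completion or a small Send to signal that the buffer may be deregistered.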
A.3 RDMA-based I/O Example

If the RDMA protocol is offloaded to the network interface, the RDMA Read operation allows an I/O subsystem, such as a storage array, to fully control all aspects of data transfer for outstanding I/O operations. An example of a simple I/O operation shows several benefits of using memory access transfers.

Consider an I/O block Write operation where the host processor wishes to move a block of data (the data source) to an I/O subsystem. The host first registers the data source with its network interface as an RDMA address block. Next the host pushes a small Send operation to the I/O subsystem. The message describes the I/O write request and tells the I/O subsystem where it can find the data in the virtual address space presented through the communication connection by the network interface. After receiving this message, the I/O subsystem can pull the data from the host's buffer as needed. This gives the I/O subsystem the ability to both schedule and pace its data transfer, thereby requiring less buffering on the I/O subsystem. When the I/O subsystem completes the data pull, it pushes a completion message back to the host with a small Send operation. The completion message tells the host the I/O operation is complete and that it can deregister its RDMA block.

In this example the host processor spent very few CPU cycles doing the I/O block Write operation. The processor sent out a small message and the I/O subsystem did all the data movement. After the I/O operation was completed the host processor received a single completion message.

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.