INTERNET-DRAFT                                             Tom Talpey
Expires: August 2005                                    Chet Juszczak

                                                        February 2005

                      NFS RDMA Problem Statement
          draft-ietf-nfsv4-nfs-rdma-problem-statement-02.txt

Status of this Memo

By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.
Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2005). All Rights Reserved.

Abstract

This draft addresses applying Remote Direct Memory Access (RDMA) to the NFS protocols. NFS implementations historically incur significant overhead due to data copies on end-host systems, as well as other sources. The potential benefits of RDMA to these implementations are explored, and the reasons why RDMA is especially well-suited to NFS and network file protocols in general are evaluated.

Table Of Contents

   1. Introduction
   2. Problem Statement
   3. File Protocol Architecture
   4. Sources of Overhead
   4.1. Savings from TOE
   4.2. Savings from RDMA
   5. Application of RDMA to NFS
   6. Improved Semantics
   7. Conclusions
   Acknowledgements
   Normative References
   Informative References
   Authors' Addresses
   Full Copyright Statement
1. Introduction

The Network File System (NFS) protocol (as described in [RFC1094], [RFC1813], and [RFC3530]) is one of several remote file access protocols used in the class of processing architecture sometimes called Network Attached Storage (NAS).

Historically, remote file access has proved to be a convenient, cost-effective way to share information over a network, a concept proven over time by the popularity of the NFS protocol. However, there are issues in such a deployment.

As compared to a local (direct-attached) file access architecture, NFS removes the overhead of managing the local on-disk filesystem state and its metadata, but interposes at least a transport network and two network endpoints between an application process and the files it is accessing. This tradeoff has to date usually resulted in a net performance loss, due to reduced bandwidth, increased application server CPU utilization, and other overheads.

Several classes of applications, including those directly supporting enterprise activities in high performance domains such as database applications and shared clusters, have therefore encountered issues with moving to NFS architectures. While this has been due principally to the performance costs of NFS versus direct-attached files, other reasons are relevant, such as the lack of strong consistency guarantees provided by NFS implementations.

Replication of local file access performance on NAS using traditional network protocol stacks has proven difficult, not because of protocol processing overheads, but because of data copy costs in the network endpoints. This is especially true since host buses are now often the main bottleneck in NAS architectures [MOG03] [CHA+01].

The External Data Representation [RFC1832] employed beneath NFS and RPC [RFC1831] can add more data copies, exacerbating the problem.
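To illustrate where one such copy arises, the following sketch (ours, not taken from any NFS implementation) encodes and decodes an XDR variable-length opaque. The payload is copied out of the receive buffer as a side effect of decoding, because XDR gives the decoder no way to place the bytes anywhere else:

```python
import struct

def xdr_encode_opaque(data: bytes) -> bytes:
    """Encode a variable-length opaque per XDR: a 4-byte big-endian
    length, the payload, then zero padding to a 4-byte boundary."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def xdr_decode_opaque(buf: bytes, offset: int = 0):
    """Decode an opaque from a receive buffer.  The slice below is
    where a typical implementation incurs its data copy: the payload's
    final destination is unknown until decode time, so the bytes are
    copied out of the transport buffer."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    payload = buf[start:start + length]      # the copy out of the buffer
    pad = (4 - length % 4) % 4
    return payload, start + length + pad     # data, offset of next item

payload, next_off = xdr_decode_opaque(xdr_encode_opaque(b"hello"))
```

Note that the padding rule also means a payload of arbitrary length lands at arbitrary alignment relative to the items that follow it.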
Data copy-avoidance designs have not been widely adopted for a variety of reasons. [BRU99] points out that "many copy avoidance techniques for network I/O are not applicable or may even backfire if applied to file I/O." Other designs that eliminate unnecessary copies, such as [PAI+00], are incompatible with existing APIs and therefore force application changes.

Over the past year, an effort to standardize a set of protocols for Remote Direct Memory Access (RDMA) over the standard Internet Protocol Suite has been chartered [RDDP]. Several drafts have been proposed and are under discussion.

RDMA is a general solution to the problem of CPU overhead incurred due to data copies, primarily at the receiver. Substantial research has addressed this and has borne out the efficacy of the approach. An overview is provided by the RDDP Problem Statement document [RDDPPS].

In addition to the per-byte savings of offloading data copies, RDMA-enabled NICs (RNICs) offload the underlying protocol layers as well, e.g. TCP, further reducing CPU overhead due to NAS processing.

1.1. Background

The RDDP Problem Statement [RDDPPS] asserts:

   "High costs associated with copying are an issue primarily for large scale systems ... with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, for e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing. Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.
   Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions."

Note that each of the servers listed above could be accessing its file data as an NFS client, or serving the data to such clients over NFS, or acting as both.

The CPU overhead of the NFS and TCP/IP protocol stacks (including data copies or reduced-copy workarounds) becomes significant in these clients and servers. File access using locally attached disks imposes relatively low overhead due to the highly optimized I/O path and the direct memory access afforded to the storage controller. This is not the case with NFS, which must pass data to, and especially from, the network and network processing stack to the NFS stack. Frequently, data copies are imposed on this transfer; in some cases, several such copies in each direction.

Copies are potentially encountered in an NFS implementation exchanging data to and from user address spaces, within kernel buffer caches, in XDR marshalling and unmarshalling, and within network stacks and network drivers. Other overheads, such as serialization among multiple threads of execution sharing a single NFS mount point and transport connection, are additionally encountered.

Numerous upper layer protocols achieve extremely high bandwidth and low overhead through the use of RDMA. [MAF+02] shows that the RDMA-based Direct Access File System (with a user-level implementation of the file system client) can outperform even a zero-copy implementation of NFS [CHA+01] [CHA+99] [GAL+99]. Also, file data access implies the use of large ULP messages. These large messages tend to amortize any increase in per-message costs due to the offload of protocol processing incurred when using RNICs, while gaining the benefits of reduced per-byte costs.
Finally, the direct memory addressing afforded by RDMA avoids many sources of contention on network resources.

2. Problem Statement

The principal performance problem encountered by NFS implementations is the CPU overhead required to implement the protocol. Primary among the sources of this overhead is the movement of data from NFS protocol messages to its eventual destination in user buffers or aligned kernel buffers. Due to the nature of the RPC and XDR protocols, the NFS data payload arrives at arbitrary alignment and NFS requests are completed in an arbitrary sequence.

The data copies consume system bus bandwidth and CPU time, reducing the available system capacity for applications [RDDPPS]. Achieving zero-copy with NFS has, to date, required sophisticated, version-specific "header cracking" hardware and/or extensive platform-specific virtual memory mapping tricks. Such approaches become even more difficult for NFS version 4 due to the existence of the COMPOUND operation, which further reduces alignment and greatly complicates ULP offload.

Furthermore, NFS will soon be challenged by emerging high-speed network fabrics such as 10 Gbits/s Ethernet. Performing even raw network I/O such as TCP is an issue at such speeds with today's hardware. The problem is fundamental in nature and has led the IETF to explore RDMA [RDDPPS].

Zero-copy techniques benefit file protocols extensively, as they enable direct user I/O, reduce the overhead of protocol stacks, provide perfect alignment into caches, etc. Many studies have already shown the performance benefits of such techniques [SKE+01, DCK+03, FJNFS, FJDAFS, MAF+02].

RDMA implementations generally have other interesting properties, such as hardware-assisted protocol access, and support for user space access to I/O.
RDMA is compelling here for another reason: hardware-offloaded networking support in itself does not avoid data copies, without resorting to implementing part of the NFS protocol in the NIC. Support of RDMA by NFS enables the highest performance at the architecture level rather than by implementation; this enables ubiquitous and interoperable solutions.

By providing file access performance equivalent to that of local file systems, NFS over RDMA will enable applications running on a set of client machines to interact through an NFS file system, just as applications running on a single machine might interact through a local file system.

3. File Protocol Architecture

NFS runs as an ONC RPC [RFC1831] application. Being a file access protocol, NFS is very "rich" in data content (versus control information).

NFS messages can range from very small (under 100 bytes) to very large (from many kilobytes to a megabyte or more). They are all contained within an RPC message and follow a variable-length RPC header. This layout presents an alignment challenge for the data items contained in an NFS call (request) or reply (response) message.

In addition to the control information in each NFS call or reply message, there are sometimes large "chunks" of application file data, for example in read and write requests. With NFS version 4 (due to the existence of the COMPOUND operation) there can be several of these data chunks interspersed with control information.

ONC RPC is a remote procedure call protocol that has been run over a variety of transports. Most implementations today use UDP or TCP. RPC messages are defined in terms of an eXternal Data Representation (XDR) [RFC1832], which provides a canonical data representation across a variety of host architectures. An XDR data stream is conveyed differently on each type of transport.
On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte stream, RPC messages are delineated by a record marking protocol. An RDMA transport also conveys RPC messages in a unique fashion that must be fully described if client and server implementations are to interoperate.

The RPC transport is responsible for conveying an RPC message from a sender to a receiver. An RPC message is either an RPC call from a client to a server, or an RPC reply from the server back to the client. An RPC message contains an RPC call header followed by arguments if the message is an RPC call, or an RPC reply header followed by results if the message is an RPC reply. The call header contains a transaction ID (XID) followed by the program and procedure number as well as a security credential. An RPC reply header begins with an XID that matches that of the RPC call message, followed by a security verifier and results. All data in an RPC message is XDR encoded.

The encoding of XDR data into transport buffers is referred to as "marshalling", and the decoding of XDR data contained within transport buffers into destination RPC procedure result buffers is referred to as "unmarshalling". Marshalling therefore takes place at the sender of any particular message, be it an RPC request or an RPC response. Unmarshalling, of course, takes place at the receiver.

Normally, any bulk data is moved (copied) as a result of the unmarshalling process, because the destination address is not known until the RPC code receives control and subsequently invokes the XDR unmarshalling routine. In other words, XDR-encoded data is not self-describing, and it carries no placement information. This results in a data copy in most NFS implementations.
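The marshalling side of the call header described above can be sketched in a few lines (a simplified illustration, not a conforming implementation): each fixed field is a 4-byte big-endian XDR integer, and the AUTH_NONE credential and verifier shown are the simplest case. Because the credential is variable-length, the header as a whole is variable-length, which is one reason the offset of the data payload is unpredictable:

```python
import struct

RPC_CALL = 0       # msg_type for a call (a reply uses 1)
RPC_VERSION = 2

def marshal_call_header(xid: int, prog: int, vers: int, proc: int) -> bytes:
    """Marshal a simplified ONC RPC call header: each field is packed
    as a 4-byte big-endian XDR integer.  The credential and verifier
    here are AUTH_NONE (flavor 0) with zero-length bodies."""
    fixed = struct.pack(">IIIIII", xid, RPC_CALL, RPC_VERSION,
                        prog, vers, proc)
    auth_none = struct.pack(">II", 0, 0)   # flavor, body length
    return fixed + auth_none + auth_none   # credential, then verifier

# e.g. an NFSv3 READ call: program 100003, version 3, procedure 6
header = marshal_call_header(xid=0x12345678, prog=100003, vers=3, proc=6)
```

A real credential (e.g. AUTH_SYS) carries a variable-length body in place of the zero-length one here, shifting everything that follows it.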
One mechanism by which the RPC layer may overcome this is for each request to include placement information, to be used for direct placement during XDR encode. Such a "write chunk" can avoid sending bulk data inline in an RPC message and generally results in one or more RDMA Write operations.

Similarly, a "read chunk", carrying placement information that refers to bulk data which may be directly fetched via one or more RDMA Read operations during XDR decode, may be conveyed. The "read chunk" is therefore useful in both RPC calls and replies, while the "write chunk" is used solely in replies.

These "chunks" are the key concept in an existing proposal [RPCRDMA]. They convey what are effectively pointers to remote memory across the network. They allow cooperating peers to exchange data outside of XDR encodings but still use XDR for describing the data to be transferred. And, finally, through use of XDR they maintain a large degree of on-the-wire compatibility.

The central concept of the RDMA transport is to provide the additional encoding conventions to convey this placement information in transport-specific encoding, and to modify the XDR handling of bulk data.

Block Diagram

   +------------------------+-----------------------------------+
   |           NFS          |            NFS + RDMA             |
   +------------------------+----------------------+------------+
   |           Operations / Procedures             |            |
   +-----------------------------------------------+            |
   |                    RPC/XDR                    |            |
   +--------------------------------+--------------+            |
   |        Stream Transport        |        RDMA Transport     |
   +--------------------------------+---------------------------+

4. Sources of Overhead

Network and file protocol costs can be categorized as follows:

o  per-byte costs - data touching costs such as checksum or data copy.
   Today's network interface hardware commonly offloads the checksum, which leaves the other major source of per-byte overhead, data copy.

o  per-packet costs - interrupts and lower-layer processing. Today's network interface hardware also commonly coalesces interrupts to reduce per-packet costs.

o  per-message (request or response) costs - LLP and ULP processing.

Improvement from optimization becomes more important if the overhead it targets is a larger share of the total cost. As other sources of overhead, such as the checksumming and interrupt handling above, are eliminated, the remaining overheads (primarily data copy) loom larger.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally, with today's end-systems, the effects are observable at network speeds at or above 1 Gbits/s.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no; it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [RDDPPS].

TCP offload engine (TOE) technology aims to offload the CPU by moving TCP/IP protocol processing to the NIC. However, TOE technology by itself does nothing to avoid necessary data copies within upper layer protocols. [MOG03] provides a description of the role TOE can play in reducing per-packet and per-message costs. Beyond the offloads commonly provided by today's network interface hardware, TOE alone (without RDMA) helps in protocol header processing, but this has been shown to be a minority component of the total protocol processing overhead [CHA+01].

Numerous software approaches to the optimization of network throughput have been made.
Experience has shown that network I/O interacts with other aspects of system processing such as file I/O and disk I/O [BRU99] [CHU96]. Zero-copy optimizations based on page remapping [CHU96] can be dependent upon machine architecture and are not scalable to multi-processor architectures. Correct buffer alignment and sizing together are needed to optimize the performance of zero-copy movement mechanisms [SKE+01]. The NFS message layout described above does not facilitate the splitting of headers from data, nor does it facilitate providing correct data buffer alignment.

4.1. Savings from TOE

The expected improvement of TOE specifically for NFS protocol processing can be quantified and shown to be fundamentally limited. [SHI+03] presents a set of "LAWS" parameters which serve to illustrate the issues. In the TOE case, the copy cost can be viewed as part of the application processing "a". Application processing increases the LAWS "gamma", which is shown by the paper to result in a diminished benefit for TOE.

For example, if the overhead is 20% TCP/IP, 30% copy and 50% real application work, then gamma is 80/20 or 4, which means the maximum benefit of TOE is 1/gamma, or only 25%.

For RDMA (with embedded TOE) and the same example, the "overhead" (o) offloaded or eliminated is 50% (20% + 30%). Therefore in the RDMA case, gamma is 50/50 or 1, and the inverse gives the potential benefit of 1 (100%), a factor of two.

                CPU overhead reduction factor

       No Offload   TCP Offload   RDMA Offload
      ------------+-------------+--------------
          1.00x        1.25x         2.00x

The analysis in the paper shows that RDMA could improve throughput by the same factor of two, even when the host is (just) powerful enough to drive the full network bandwidth without RDMA.
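The arithmetic above can be captured in a few lines (our sketch of the relationship as used in this example; the function name is ours, not from [SHI+03]):

```python
def offload_speedup(offloaded: float) -> float:
    """Peak speedup from eliminating a fraction 'offloaded' of total
    CPU overhead, per the LAWS-style argument above: the remaining
    work fraction determines gamma, and the added benefit is 1/gamma."""
    remaining = 1.0 - offloaded
    gamma = remaining / offloaded
    return 1.0 + 1.0 / gamma   # e.g. a 25% benefit gives 1.25x

toe_speedup = offload_speedup(0.20)   # TOE removes only the 20% TCP/IP share
rdma_speedup = offload_speedup(0.50)  # RDMA also removes the 30% copy share
```

With the example's 20%/30%/50% split, this reproduces the 1.25x and 2.00x entries in the table above.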
It can also be shown that the speedup may be higher if network bandwidth grows faster than Moore's Law, although the higher benefits will apply to a narrow range of applications.

4.2. Savings from RDMA

Performance measurements directly comparing an NFS over RDMA prototype with conventional network-based NFS processing are described in [CAL+03]. Comparisons of Read throughput and CPU overhead were performed on two Gigabit Ethernet adapters, one conventional and one with RDMA capability. The prototype RDMA protocol performed all transfers via RDMA Read.

In these results, conventional network-based throughput was severely limited by the client's CPU being saturated at 100% for all transfers. Read throughput reached no more than 60 MBytes/s.

      I/O Type        Size   Read Throughput   CPU Utilization
      Conventional    2KB        20MB/s             100%
      Conventional   16KB        40MB/s             100%
      Conventional  256KB        60MB/s             100%

However, over RDMA, throughput rose to the theoretical maximum throughput of the platform, saturating the single-CPU system only at maximum throughput.

      I/O Type        Size   Read Throughput   CPU Utilization
      RDMA            2KB        10MB/s              45%
      RDMA           16KB        40MB/s              70%
      RDMA          256KB       100MB/s             100%

The lower relative throughput of the RDMA prototype at the small blocksize may be attributable to the RDMA Read imposed by the prototype protocol, which reduced the operation rate since it introduces additional latency. As well, it may reflect the relative increase of per-packet setup costs within the DMA portion of the transfer.

5. Application of RDMA to NFS

Efficient file protocols require efficient data positioning and movement. The client system knows the client memory address where the application has data to be written or wants read data deposited. The server system knows the server memory address where the local filesystem will accept write data or has data to be read.
Neither peer, however, is aware of the other's data destination in the current NFS, RPC or XDR protocols. Existing NFS implementations have struggled with the performance costs of data copies when using traditional Ethernet transports.

With the onset of faster networks, the network I/O bottleneck will worsen. Fortunately, new transports that support RDMA have emerged. RDMA excels at bulk transfer efficiency; it is an efficient way to deliver direct data placement and remove a major part of the problem: data copies. RDMA also addresses other overheads, e.g. underlying protocol offload, and offers separation of control information from data.

The current NFS message layout provides the performance-enhancing opportunity for an NFS over RDMA protocol that separates the control information from data chunks while meeting the alignment needs of both. The data chunks can be copied "directly" between the client and server memory addresses above (with a single occurrence on each memory bus) while the control information can be passed "inline". [RPCRDMA] describes such a protocol.

6. Improved Semantics

Network file protocols need to export the application programming interfaces and semantics that applications, especially mission-critical ones like databases and clusters, have been developed to expect. These APIs and semantics are historical in nature, and their successful deprecation is doubtful. NFS has not delivered all of these semantics (for example, reliable filesystem transactions) for the sake of acceptable performance.

The advanced properties of RDMA-capable transports allow improved semantics. [DAFS] is an example of a protocol which exports semantics similar to those of NFSv4, but improved in specific areas. Improved NFS semantics can also be delivered.
As an example, [NFSRDMA] describes an implementation of RPC for RDMA transport that is evolutionary in nature yet enables the provision of reliable and idempotent filesystem operation. This proposal shows that it is possible to deliver extended semantics with an RPC/XDR layer implementation with no changes required above the NFS layer, and few within.

7. Conclusions

NFS version 4 [RFC3530] has recently been granted "Proposed Standard" status. The NFSv4 protocol was developed along several design points, important among them: effective operation over wide-area networks, including the Internet itself; strong security integrated into the protocol; extensive cross-platform interoperability, including integrated locking semantics compatible with multiple operating systems; and (this is key) protocol extension.

NFS version 4 is an excellent base on which to add the needed performance enhancements and improved semantics described above. The minor versioning support defined in NFS version 4 was designed to support protocol improvements without disruption to the installed base. Evolutionary improvement of the protocol via minor versioning is a conservative and cautious approach to current and future problems and shortcomings.

Many arguments can be made as to the efficacy of the file abstraction in meeting the future needs of enterprise data service and the Internet. Fine-grained Quality of Service (QoS) policies (e.g. data delivery, retention, availability, security, ...) are high among them.

It is vital that the NFS protocol continue to provide these benefits to a wide range of applications, without its usefulness being compromised by concerns about performance and semantic inadequacies. This can reasonably be addressed in the existing NFS protocol framework.
A cautious evolutionary improvement of performance and semantics allows building on the value already present in the NFS protocol, while addressing new requirements that have arisen from the application of networking technology.

8. Acknowledgements

The authors wish to thank Jeff Chase, who provided many useful suggestions.

9. Normative References

[RFC3530]
   S. Shepler, et al., "NFS Version 4 Protocol", Standards Track RFC

[RFC1831]
   R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification Version 2", Standards Track RFC

[RFC1832]
   R. Srinivasan, "XDR: External Data Representation Standard", Standards Track RFC

[RFC1094]
   Sun Microsystems, Inc., "NFS: Network File System Protocol Specification", RFC 1094

[RFC1813]
   B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol Specification", Informational RFC

10. Informative References

[BRU99]
   J. Brustoloni, "Interoperation of copy avoidance in network and file I/O", in Proc. INFOCOM '99, pages 534-542, New York, NY, March 1999, IEEE. Also available from http://www.cs.pitt.edu/~jcb/publs.html

[CAL+03]
   B. Callaghan, T. Lingutla-Raj, A. Chiu, P. Staubach, O. Asad, "NFS over RDMA", in Proceedings of the ACM SIGCOMM Summer 2003 NICELI Workshop.

[CHA+01]
   J. S. Chase, A. J. Gallatin, K. G. Yocum, "Endsystem optimizations for high-speed TCP", IEEE Communications, 39(4):68-74, April 2001.

[CHA+99]
   J. S. Chase, D. C. Anderson, A. J. Gallatin, A. R. Lebeck, K. G. Yocum, "Network I/O with Trapeze", in 1999 Hot Interconnects Symposium, August 1999.

[CHU96]
   H. K. Chu, "Zero-copy TCP in Solaris", in Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996.

[DAFS]
   Direct Access File System Specification version 1.0, available from http://www.dafscollaborative.org, September 2001.

[DCK+03]
   M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T. Talpey, M.
   Wittle, "The Direct Access File System", in Proceedings of 2nd USENIX Conference on File and Storage Technologies (FAST '03), San Francisco, CA, March 31 - April 2, 2003.

[FJDAFS]
   Fujitsu Prime Software Technologies, "Meet the DAFS Performance with DAFS/VI Kernel Implementation using cLAN", available from http://www.pst.fujitsu.com/english/dafsdemo/index.html, 2001.

[FJNFS]
   Fujitsu Prime Software Technologies, "An Adaptation of VIA to NFS on Linux", available from http://www.pst.fujitsu.com/english/nfs/index.html, 2000.

[GAL+99]
   A. Gallatin, J. Chase, K. Yocum, "Trapeze/IP: TCP/IP at Near-Gigabit Speeds", 1999 USENIX Technical Conference (Freenix Track), June 1999.

[KM02]
   K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02]
   K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[MOG03]
   J. Mogul, "TCP offload is a dumb idea whose time has come", 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003. USENIX.

[NFSRDMA]
   T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions", Internet Draft Work in Progress, draft-ietf-nfsv4-session

[PAI+00]
   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", ACM Trans. Computer Systems, 18(1):37-66, Feb. 2000.

[RDDPPS]
   A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct Data Placement Working Group Problem Statement", Internet Draft Work in Progress, draft-ietf-rddp-problem-statement

[RPCRDMA]
   B. Callaghan, T.
   Talpey, "RDMA Transport for ONC RPC", Internet Draft Work in Progress, draft-ietf-nfsv4-rpcrdma

[SHI+03]
   P. Shivam, J. Chase, "On the Elusive Benefits of Protocol Offload", to be published in Proceedings of ACM SIGCOMM Summer 2003 NICELI Workshop, also available from http://issg.cs.duke.edu/publications/niceli03.pdf

[SKE+01]
   K.-A. Skevik, T. Plagemann, V. Goebel, P. Halvorsen, "Evaluation of a Zero-Copy Protocol Implementation", in Proceedings of the 27th Euromicro Conference - Multimedia and Telecommunications Track (MTT'2001), Warsaw, Poland, September 2001.

Authors' Addresses

Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Chet Juszczak
Chet's Boathouse Co.
P.O. Box 1467
Merrimack, NH 03054

Email: chetnh@earthlink.net

Full Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.