NFSv4 Working Group                                           Tom Talpey
Internet-Draft                                                    NetApp
Intended status: Informational                            Chet Juszczak
Expires: August 23, 2008                               February 21, 2008

                       NFS RDMA Problem Statement
            draft-ietf-nfsv4-nfs-rdma-problem-statement-08

Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware have
been or will be disclosed, and any of which he or she becomes aware
will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 23, 2008.

Copyright Notice

Copyright (C) The IETF Trust (2008).

Abstract

This document addresses enabling the use of Remote Direct Memory
Access (RDMA) by the Network File System (NFS) protocols.  NFS
implementations historically incur significant overhead due to data
copies on end-host systems, as well as other processing overhead.
The potential benefits of RDMA to these implementations are explored,
and the reasons why RDMA is especially well-suited to NFS and network
file protocols in general are evaluated.

Table of Contents

1.  Introduction
2.  Problem Statement
3.  File Protocol Architecture
4.  Sources of Overhead
4.1.  Savings from TOE
4.2.  Savings from RDMA
5.  Application of RDMA to NFS
6.  Conclusions
7.  Security Considerations
8.  IANA Considerations
9.  Acknowledgements
10.  Normative References
11.  Informative References
Authors' Addresses
Intellectual Property and Copyright Statements
Acknowledgment

1.  Introduction

The Network File System (NFS) protocol (as described in [RFC1094],
[RFC1813], and [RFC3530]) is one of several remote file access
protocols used in the class of processing architecture sometimes
called Network Attached Storage (NAS).

Historically, remote file access has proven to be a convenient,
cost-effective way to share information over a network, a concept
proven over time by the popularity of the NFS protocol.  However,
there are issues in such a deployment.

As compared to a local (direct-attached) file access architecture,
NFS removes the overhead of managing the local on-disk filesystem
state and its metadata, but interposes at least a transport network
and two network endpoints between an application process and the
files it is accessing.  This tradeoff has to date usually resulted in
a net performance loss, due to reduced bandwidth, increased
application server CPU utilization, and other overheads.

Several classes of applications, including those directly supporting
enterprise activities in high performance domains such as database
applications and shared clusters, have therefore encountered issues
with moving to NFS architectures.  While this has been due
principally to the performance costs of NFS versus direct-attached
files, other reasons are relevant, such as the lack of strong
consistency guarantees provided by NFS implementations.
Replication of local file access performance on NAS using traditional
network protocol stacks has proven difficult, not because of protocol
processing overheads, but because of data copy costs in the network
endpoints.  This is especially true since host buses are now often
the main bottleneck in NAS architectures [MOG03] [CHA+01].

The External Data Representation [RFC4506] employed beneath NFS and
RPC [RFC1831bis] can add more data copies, exacerbating the problem.

Data copy-avoidance designs have not been widely adopted for a
variety of reasons.  [BRU99] points out that "many copy avoidance
techniques for network I/O are not applicable or may even backfire if
applied to file I/O."  Other designs that eliminate unnecessary
copies, such as [PAI+00], are incompatible with existing APIs and
therefore force application changes.

In recent years, an effort to standardize a set of protocols for
Remote Direct Memory Access (RDMA) over the standard Internet
Protocol Suite has been chartered [RDDP].  A complete IP-based RDMA
protocol suite is available in the published Standards Track
specifications.

RDMA is a general solution to the problem of CPU overhead incurred
due to data copies, primarily at the receiver.  Substantial research
has addressed this and has borne out the efficacy of the approach.
An overview of this work is provided by the RDDP "Remote Direct
Memory Access (RDMA) over IP Problem Statement" [RFC4297].

In addition to the per-byte savings of offloading data copies,
RDMA-enabled NICs (RNICs) offload the underlying protocol layers as
well (e.g., TCP), further reducing CPU overhead due to NAS
processing.

1.1.  Background

The RDDP Problem Statement [RFC4297] asserts:

   "High costs associated with copying are an issue primarily for
   large scale systems ... with high bandwidth feeds, usually
   multiprocessors and clusters, that are adversely affected by
   copying overhead.  Examples of such machines include all varieties
   of servers: database servers, storage servers, application servers
   for transaction processing, for e-commerce, and web serving,
   content distribution, video distribution, backups, data mining and
   decision support, and scientific computing.

   Note that such servers almost exclusively service many concurrent
   sessions (transport connections), which, in aggregate, are
   responsible for > 1 Gbits/s of communication.  Nonetheless, the
   cost of copying overhead for a particular load is the same whether
   from few or many sessions."

Note that each of the servers listed above could be accessing its
file data as an NFS client, serving the data to such clients as an
NFS server, or acting as both.

The CPU overhead of the NFS and TCP/IP protocol stacks (including
data copies or reduced-copy workarounds) becomes a significant matter
in these clients and servers.  File access using locally attached
disks imposes relatively low overhead due to the highly optimized I/O
path and direct memory access afforded to the storage controller.
This is not the case with NFS, which must pass data to, and
especially from, the network and network processing stack to the NFS
stack.  Frequently, data copies are imposed on this transfer, in some
cases several such copies in each direction.
Copies are potentially encountered in an NFS implementation
exchanging data to and from user address spaces, within kernel buffer
caches, in XDR marshalling and unmarshalling, and within network
stacks and network drivers.  Additional overheads, such as
serialization among multiple threads of execution sharing a single
NFS mount point and transport connection, are encountered as well.

Numerous upper layer protocols achieve extremely high bandwidth and
low overhead through the use of RDMA.  [MAF+02] shows that the
RDMA-based Direct Access File System (with a user-level
implementation of the file system client) can outperform even a
zero-copy implementation of NFS [CHA+01] [CHA+99] [GAL+99] [KM02].
Also, file data access implies the use of large ULP messages.  These
large messages tend to amortize any increase in per-message costs due
to the offload of protocol processing incurred when using RNICs,
while gaining the benefits of reduced per-byte costs.  Finally, the
direct memory addressing afforded by RDMA avoids many sources of
contention on network resources.

2.  Problem Statement

The principal performance problem encountered by NFS implementations
is the CPU overhead required to implement the protocol.  Primary
among the sources of this overhead is the movement of data from NFS
protocol messages to their eventual destination in user buffers or
aligned kernel buffers.  Due to the nature of the RPC and XDR
protocols, the NFS data payload arrives at arbitrary alignment,
necessitating a copy at the receiver, and NFS requests complete in an
arbitrary sequence.

The data copies consume system bus bandwidth and CPU time, reducing
the available system capacity for applications [RFC4297].  Achieving
zero-copy with NFS has, to date, required sophisticated,
version-specific "header cracking" hardware and/or extensive
platform-specific virtual memory mapping tricks.  Such approaches
become even more difficult for NFS version 4, due to the existence of
the COMPOUND operation and the presence of Kerberos and other
security information, which further reduce alignment and greatly
complicate ULP offload.

Furthermore, NFS is challenged by high-speed network fabrics such as
10 Gbits/s Ethernet.  Performing even raw network I/O such as TCP is
an issue at such speeds with today's hardware.  The problem is
fundamental in nature and has led the IETF to explore RDMA [RFC4297].

Zero-copy techniques benefit file protocols extensively, as they
enable direct user I/O, reduce the overhead of protocol stacks,
provide perfect alignment into caches, etc.  Many studies have
already shown the performance benefits of such techniques [SKE+01]
[DCK+03] [FJNFS] [FJDAFS] [KM02] [MAF+02].

RDMA is compelling here for another reason: hardware-offloaded
networking support in itself does not avoid data copies without
resorting to implementing part of the NFS protocol in the NIC.
Support of RDMA by NFS enables the highest performance at the
architecture level rather than by implementation; this enables
ubiquitous and interoperable solutions.
By providing file access performance equivalent to that of local file
systems, NFS over RDMA will enable applications running on a set of
client machines to interact through an NFS file system, just as
applications running on a single machine might interact through a
local file system.

3.  File Protocol Architecture

NFS runs as an ONC RPC [RFC1831bis] application.  Being a file access
protocol, NFS is very "rich" in data content (versus control
information).

NFS messages can range from very small (under 100 bytes) to very
large (from many kilobytes to a megabyte or more).  They are all
contained within an RPC message and follow a variable-length RPC
header.  This layout presents an alignment challenge for the data
items contained in an NFS call (request) or reply (response) message.

In addition to the control information in each NFS call or reply
message, there are sometimes large "chunks" of application file data,
for example, in read and write requests.  With NFS version 4 (due to
the existence of the COMPOUND operation), there can be several of
these data chunks interspersed with control information.

ONC RPC is a remote procedure call protocol that has been run over a
variety of transports.  Most implementations today use UDP or TCP.
RPC messages are defined in terms of an eXternal Data Representation
(XDR) [RFC4506], which provides a canonical data representation
across a variety of host architectures.  An XDR data stream is
conveyed differently on each type of transport.  On UDP, RPC messages
are encapsulated inside datagrams, while on a TCP byte stream, RPC
messages are delineated by a record marking protocol.  An RDMA
transport also conveys RPC messages in a unique fashion that must be
fully described if client and server implementations are to
interoperate.

The RPC transport is responsible for conveying an RPC message from a
sender to a receiver.  An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client.  An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply.  The call header
contains a transaction ID (XID) followed by the program and procedure
number as well as a security credential.  An RPC reply header begins
with an XID that matches that of the RPC call message, followed by a
security verifier and results.  All data in an RPC message is XDR
encoded.

The encoding of XDR data into transport buffers is referred to as
"marshalling", and the decoding of XDR data contained within
transport buffers into destination RPC procedure result buffers is
referred to as "unmarshalling".  Marshalling therefore takes place at
the sender of any particular message, be it an RPC request or an RPC
response; unmarshalling takes place at the receiver.

Normally, any bulk data is moved (copied) as a result of the
unmarshalling process, because the destination address is not known
until the RPC code receives control and subsequently invokes the XDR
unmarshalling routine.  In other words, XDR-encoded data is not
self-describing, and it carries no placement information.  This
results in a data copy in most NFS implementations.
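To make this concrete, here is a minimal C sketch of the
receiver-side copy.  It is illustrative only: the structure, field
names, and function are invented for this document and do not come
from any NFS implementation.  The point is that the payload offset
depends on variable-length header fields and is discovered only
during XDR decode, after the NIC has already placed the message in a
transport buffer.

   /*
    * Illustrative sketch (invented names): why stream-based NFS
    * receive paths copy bulk data.
    */
   #include <stdint.h>
   #include <string.h>

   struct rpc_reply_buf {
       uint8_t *data;   /* raw bytes as received from the transport */
       size_t   len;    /* total length of the received message */
   };

   /*
    * Deliver the opaque file data from a READ reply to the buffer
    * the application supplied.  payload_off is arbitrary; it depends
    * on the variable-length reply header, verifier, and (for NFSv4)
    * the preceding COMPOUND results, so the NIC could not have
    * placed the data at app_buf directly.  A copy is unavoidable
    * here.
    */
   size_t deliver_read_data(const struct rpc_reply_buf *msg,
                            size_t payload_off, uint32_t payload_len,
                            void *app_buf)
   {
       memcpy(app_buf, msg->data + payload_off, payload_len);
       return payload_len;
   }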
One mechanism by which the RPC layer may overcome this is for each
request to include placement information, to be used for direct
placement during XDR encode.  This "write chunk" can avoid sending
bulk data inline in an RPC message and generally results in one or
more RDMA Write operations.

Similarly, a "read chunk", placement information referring to bulk
data that may be fetched directly via one or more RDMA Read
operations during XDR decode, may be conveyed.  The "read chunk" is
therefore useful in both RPC calls and replies, while the "write
chunk" is used solely in replies.

These "chunks" are the key concept in an existing proposal [RPCRDMA].
They convey what are effectively pointers to remote memory across the
network.  They allow cooperating peers to exchange data outside of
XDR encodings, but still use XDR for describing the data to be
transferred.  And, finally, through use of XDR they maintain a large
degree of on-the-wire compatibility.

The central concept of the RDMA transport is to provide the
additional encoding conventions to convey this placement information
in transport-specific encoding, and to modify the XDR handling of
bulk data.
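The chunk concept can be rendered as a short C sketch.  The fields
below (a steering tag, a length, and a remote address, plus an XDR
position for read chunks) follow the general shape of the encoding
proposed in [RPCRDMA], but the names and layout here are illustrative
assumptions, not the normative definition.

   /* Illustrative rendering of an [RPCRDMA]-style chunk. */
   #include <stdint.h>

   struct rdma_segment {
       uint32_t handle;   /* steering tag naming a registered region */
       uint32_t length;   /* length of the region in bytes */
       uint64_t offset;   /* remote virtual address of the region */
   };

   /*
    * A read chunk also carries the XDR stream position at which the
    * bulk data logically belongs, so the receiver can fetch it with
    * RDMA Read during decode instead of receiving it inline.
    */
   struct rdma_read_chunk {
       uint32_t position;           /* XDR offset of the elided data */
       struct rdma_segment target;  /* where to fetch it from */
   };

A write chunk, by contrast, would carry no position: the responder
places the bulk data into the advertised segments with RDMA Write,
and the reply describes what was placed.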
Block Diagram

   +------------------------+-----------------------------------+
   |          NFS           |            NFS + RDMA             |
   +------------------------+----------------------+------------+
   |            Operations / Procedures            |            |
   +-----------------------------------------------+            |
   |                    RPC/XDR                    |            |
   +--------------------------------+--------------+            |
   |        Stream Transport        |      RDMA Transport       |
   +--------------------------------+---------------------------+

4.  Sources of Overhead

Network and file protocol costs can be categorized as follows:

o  Per-byte costs: data-touching costs such as checksum or data copy.
   Today's network interface hardware commonly offloads the checksum,
   which leaves the other major source of per-byte overhead, data
   copy.

o  Per-packet costs: interrupts and lower-layer processing.  Today's
   network interface hardware also commonly coalesces interrupts to
   reduce per-packet costs.

o  Per-message (request or response) costs: LLP and ULP processing.

Improvement from optimization becomes more important as the overhead
it targets becomes a larger share of the total cost.  As other
sources of overhead, such as the checksumming and interrupt handling
above, are eliminated, the remaining overheads (primarily data copy)
loom larger.

With each copy crossing the memory bus twice, network processing
overhead is high whenever network bandwidth is large in comparison to
CPU and memory bandwidths.  Generally, with today's end-systems, the
effects are observable at network speeds at or above 1 Gbits/s.

A common question is whether an increase in CPU processing power
alleviates the problem of high processing costs of network I/O.  The
answer is no; it is the memory bandwidth that is the issue.  Faster
CPUs do not help if the CPU spends most of its time waiting for
memory [RFC4297].

TCP offload engine (TOE) technology aims to offload the CPU by moving
TCP/IP protocol processing to the NIC.  However, TOE technology by
itself does nothing to avoid necessary data copies within upper layer
protocols.  [MOG03] provides a description of the role TOE can play
in reducing per-packet and per-message costs.  Beyond the offloads
commonly provided by today's network interface hardware, TOE alone
(without RDMA) helps in protocol header processing, but this has been
shown to be a minority component of the total protocol processing
overhead [CHA+01].

Numerous software approaches to the optimization of network
throughput have been made.  Experience has shown that network I/O
interacts with other aspects of system processing such as file I/O
and disk I/O [BRU99] [CHU96].  Zero-copy optimizations based on page
remapping [CHU96] can be dependent upon machine architecture and are
not scalable to multi-processor architectures.  Correct buffer
alignment and sizing together are needed to optimize the performance
of zero-copy movement mechanisms [SKE+01].  The NFS message layout
described above does not facilitate the splitting of headers from
data, nor does it facilitate providing correct data buffer alignment.

4.1.  Savings from TOE

The expected improvement from TOE specifically for NFS protocol
processing can be quantified and shown to be fundamentally limited.
[SHI+03] presents a set of "LAWS" parameters which serve to
illustrate the issues.  In the TOE case, the copy cost can be viewed
as part of the application processing "a".  Application processing
increases the LAWS "gamma", which the paper shows results in a
diminished benefit for TOE.

For example, if the overhead is 20% TCP/IP, 30% copy, and 50% real
application work, then gamma is 80/20, or 4, which means the maximum
benefit of TOE is 1/gamma, or only 25%.

For RDMA (with embedded TOE) and the same example, the "overhead" (o)
offloaded or eliminated is 50% (20% + 30%).  Therefore, in the RDMA
case, gamma is 50/50, or 1, and the inverse gives a potential benefit
of 1 (100%), a factor of two.

   CPU overhead reduction factor

   No Offload     TCP Offload     RDMA Offload
   -----------+---------------+---------------
      1.00x         1.25x           2.00x
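The arithmetic in this table is easy to reproduce.  The C fragment
below models only the simple bound used in the example above:
offloading a fraction f of the host processing yields a maximum CPU
overhead reduction of 1/(1 - f), which is the same as 1 + 1/gamma.
This is a deliberate simplification of the full LAWS analysis in
[SHI+03].

   /* Reproduce the 1.25x and 2.00x figures from the example. */
   #include <stdio.h>

   /* f is the fraction of host processing offloaded or eliminated. */
   static double max_speedup(double f)
   {
       return 1.0 / (1.0 - f);
   }

   int main(void)
   {
       double tcpip = 0.20;   /* TCP/IP overhead in the example */
       double copy  = 0.30;   /* copy overhead in the example */

       /* TOE offloads only TCP/IP: gamma = 0.80/0.20 = 4,
        * benefit = 1/gamma = 25%, i.e., 1.25x. */
       printf("TOE:  %.2fx\n", max_speedup(tcpip));

       /* RDMA (with embedded TOE) also eliminates the copies:
        * gamma = 0.50/0.50 = 1, benefit = 100%, i.e., 2.00x. */
       printf("RDMA: %.2fx\n", max_speedup(tcpip + copy));
       return 0;
   }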
The analysis in the paper shows that RDMA could improve throughput by
the same factor of two, even when the host is (just) powerful enough
to drive the full network bandwidth without RDMA.  It can also be
shown that the speedup may be higher if network bandwidth grows
faster than Moore's Law, although the higher benefits will apply to a
narrow range of applications.

4.2.  Savings from RDMA

Performance measurements directly comparing an NFS over RDMA
prototype with conventional network-based NFS processing are
described in [CAL+03].  Comparisons of Read throughput and CPU
overhead were performed on two types of Gigabit Ethernet adapters,
one a conventional adapter and the other RDMA-capable.  The prototype
RDMA protocol performed all transfers via RDMA Read.  The NFS layer
in the study was measured while performing read transfers, varying
the transfer size and readahead depth across ranges used by typical
NFS deployments.

In these results, conventional network-based throughput was severely
limited by the client's CPU being saturated at 100% for all
transfers.  Read throughput reached no more than 60 MBytes/s.

   I/O Type        Size    Read Throughput    CPU Utilization
   Conventional     2KB        20MB/s              100%
   Conventional    16KB        40MB/s              100%
   Conventional   256KB        60MB/s              100%

However, over RDMA, throughput rose to the theoretical maximum
throughput of the platform, while saturating the single-CPU system
only at maximum throughput.

   I/O Type        Size    Read Throughput    CPU Utilization
   RDMA             2KB        10MB/s               45%
   RDMA            16KB        40MB/s               70%
   RDMA           256KB       100MB/s              100%

The lower relative throughput of the RDMA prototype at the small
blocksize may be attributable to the RDMA Read imposed by the
prototype protocol, which reduced the operation rate by introducing
additional latency.  It may also reflect the relative increase of
per-packet setup costs within the DMA portion of the transfer.

5.  Application of RDMA to NFS

Efficient file protocols require efficient data positioning and
movement.  The client system knows the client memory address where
the application has data to be written or wants read data deposited.
The server system knows the server memory address where the local
filesystem will accept write data or has data to be read.  Neither
peer, however, is aware of the other's data destination in the
current NFS, RPC, or XDR protocols.  Existing NFS implementations
have struggled with the performance costs of data copies when using
traditional Ethernet transports.

With the onset of faster networks, the network I/O bottleneck will
worsen.  Fortunately, new transports that support RDMA have emerged.
RDMA excels at bulk transfer efficiency; it is an efficient way to
deliver direct data placement and remove a major part of the problem:
data copies.  RDMA also addresses other overheads, e.g., underlying
protocol offload, and offers separation of control information from
data.

The current NFS message layout provides the performance-enhancing
opportunity for an NFS over RDMA protocol that separates the control
information from data chunks while meeting the alignment needs of
both.  The data chunks can be copied "directly" between the client
and server memory addresses above (with a single occurrence on each
memory bus) while the control information can be passed "inline".
[RPCRDMA] describes such a protocol.
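The separation of inline control information from directly placed
data can be illustrated end to end.  The C sketch below walks through
one plausible client-side READ exchange in the style of [RPCRDMA].
Every name in it is invented for illustration, and the RNIC and RPC
services are stubbed out; a real client would call its verbs and RPC
libraries instead.

   /* Hedged sketch of an NFS READ using a write chunk (invented API). */
   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   struct rdma_segment {
       uint32_t handle;   /* steering tag */
       uint32_t length;   /* region length */
       uint64_t offset;   /* remote address */
   };

   /* Stubs standing in for RNIC registration and RPC messaging. */
   static struct rdma_segment rdma_register(void *buf, size_t len)
   {
       struct rdma_segment s = { 0x1234u, (uint32_t)len,
                                 (uint64_t)(uintptr_t)buf };
       return s;
   }
   static void rdma_deregister(struct rdma_segment *s) { s->handle = 0; }
   static void rpc_send_read_call(const struct rdma_segment *s)
   {
       printf("READ call sent; write chunk {handle=%x, len=%u}\n",
              (unsigned)s->handle, (unsigned)s->length);
   }
   static void rpc_recv_reply(void) { /* small reply arrives inline */ }

   static void nfs_read_over_rdma(void *app_buf, size_t len)
   {
       /* 1. Register the application buffer, obtaining a steering
        *    tag the server may use as an RDMA Write target. */
       struct rdma_segment seg = rdma_register(app_buf, len);

       /* 2. Send the READ call inline, advertising app_buf as a
        *    write chunk for the bulk data. */
       rpc_send_read_call(&seg);

       /* 3. The server RDMA Writes the file data directly into
        *    app_buf: one pass over each memory bus, no receive-side
        *    copy, and no alignment problem. */

       /* 4. The small reply (status, attributes) arrives inline; the
        *    data is already in place when it is processed. */
       rpc_recv_reply();

       /* 5. Invalidate the registration so the remote peer can no
        *    longer access the buffer. */
       rdma_deregister(&seg);
   }

   int main(void)
   {
       char buf[8192];
       nfs_read_over_rdma(buf, sizeof buf);
       return 0;
   }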
6.  Conclusions

NFS version 4 [RFC3530] has been granted "Proposed Standard" status.
The NFSv4 protocol was developed along several design points,
important among them: effective operation over wide-area networks,
including the Internet itself; strong security integrated into the
protocol; extensive cross-platform interoperability, including
integrated locking semantics compatible with multiple operating
systems; and, crucially, protocol extension.

NFS version 4 is an excellent base on which to add the needed
performance enhancements and improved semantics described above.  The
minor versioning support defined in NFS version 4 was designed to
support protocol improvements without disruption to the installed
base.  Evolutionary improvement of the protocol via minor versioning
is a conservative and cautious approach to current and future
problems and shortcomings.

Many arguments can be made as to the efficacy of the file abstraction
in meeting the future needs of enterprise data service and the
Internet.  Fine-grained Quality of Service (QoS) policies (e.g., data
delivery, retention, availability, security, ...) are high among
them.

It is vital that the NFS protocol continue to provide these benefits
to a wide range of applications, without its usefulness being
compromised by concerns about performance and semantic inadequacies.
This can reasonably be addressed in the existing NFS protocol
framework.  A cautious evolutionary improvement of performance and
semantics allows building on the value already present in the NFS
protocol, while addressing new requirements that have arisen from the
application of networking technology.

7.  Security Considerations

The NFS protocol, in conjunction with its layering on RPC, provides a
rich and widely interoperable security model to applications and
systems.  Any layering of NFS over RDMA transports must address the
NFS security requirements, and additionally must ensure that no new
vulnerabilities are introduced.  For RDMA, the integrity, and any
privacy, of the data stream are of particular importance.

The core goals of an NFS-to-RDMA binding are to reduce overhead and
to enable high performance.  Supporting these goals while maintaining
the required NFS security protection presents a special challenge.
Historically, the provision of integrity and privacy has been
implemented within the RPC layer, and its operation requires local
processing of messages exchanged with the RPC peer.  This processing
imposes memory and processing overhead on a per-message basis,
exactly the overhead that RDMA is designed to avoid.

Therefore, it is a requirement that the RDMA transport binding
provide a means to delegate the integrity and privacy processing to
the RDMA hardware, in order to maintain the high level of performance
desired from the approach, while simultaneously providing the
existing highest levels of security required by the NFS protocol.
This in turn requires a means by which the RPC layer may invoke these
services from the RDMA provider, and for the NFS layer to negotiate
their use end-to-end.

The "Channel Binding" concept [RFC5056] provides a means by which the
RPC and NFS layers may delegate their session protection to the lower
RDMA layers.  An extension to the RPCSEC_GSS protocol [RPCSECGSSV2]
may then be specified to negotiate the use of these bindings, and to
establish the shared secrets necessary to protect the sessions.

The protocol described in [RPCRDMA] specifies the use of these
mechanisms, and they are required to implement the protocol.

An additional consideration is protection of the integrity and
privacy of local memory by the RDMA transport itself.  The use of
RDMA by NFS must not introduce any vulnerabilities to system memory
contents, or to memory owned by user processes.  These protections
are provided by the RDMA layer specifications, and specifically their
security models.  Any RDMA provider used for NFS transport is
required to conform to the requirements of [RFC5042] in order to
satisfy these protections.

8.  IANA Considerations

This document has no IANA considerations.

9.  Acknowledgements

The authors wish to thank Jeff Chase, who provided many useful
suggestions.

10.  Normative References

[RFC3530]  S. Shepler, et al., "NFS Version 4 Protocol", Standards
   Track RFC.

[RFC1831bis]  R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
   Specification Version 2", Standards Track RFC.

[RFC4506]  M. Eisler, Ed., "XDR: External Data Representation
   Standard", Standards Track RFC.

[RFC1813]  B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
   Protocol Specification", Informational RFC.
[RPCSECGSSV2]  M. Eisler, "RPCSEC_GSS Version 2", Internet-Draft
   (work in progress), draft-ietf-nfsv4-rpcsec-gss-v2.

[RFC5056]  N. Williams, "On the Use of Channel Bindings to Secure
   Channels", Standards Track RFC.

[RFC5042]  J. Pinkerton, E. Deleganes, "Direct Data Placement
   Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP)
   Security", Standards Track RFC.

11.  Informative References

[BRU99]  J. Brustoloni, "Interoperation of copy avoidance in network
   and file I/O", in Proc. INFOCOM '99, pages 534-542, New York, NY,
   March 1999, IEEE.  Also available from
   http://www.cs.pitt.edu/~jcb/publs.html.

[CAL+03]  B. Callaghan, T. Lingutla-Raj, A. Chiu, P. Staubach, O.
   Asad, "NFS over RDMA", in Proceedings of ACM SIGCOMM Summer 2003
   NICELI Workshop.

[CHA+01]  J. S. Chase, A. J. Gallatin, K. G. Yocum, "Endsystem
   optimizations for high-speed TCP", IEEE Communications,
   39(4):68-74, April 2001.

[CHA+99]  J. S. Chase, D. C. Anderson, A. J. Gallatin, A. R. Lebeck,
   K. G. Yocum, "Network I/O with Trapeze", in 1999 Hot Interconnects
   Symposium, August 1999.

[CHU96]  H. K. Chu, "Zero-copy TCP in Solaris", in Proc. of the
   USENIX 1996 Annual Technical Conference, San Diego, CA, January
   1996.

[DCK+03]  M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck,
   T. Talpey, M. Wittle, "The Direct Access File System", in
   Proceedings of the 2nd USENIX Conference on File and Storage
   Technologies (FAST '03), San Francisco, CA, March 31 - April 2,
   2003.

[FJDAFS]  Fujitsu Prime Software Technologies, "Meet the DAFS
   Performance with DAFS/VI Kernel Implementation using cLAN",
   available from http://www.pst.fujitsu.com/english/dafsdemo/index.html,
   2001.

[FJNFS]  Fujitsu Prime Software Technologies, "An Adaptation of VIA
   to NFS on Linux", available from
   http://www.pst.fujitsu.com/english/nfs/index.html, 2000.

[GAL+99]  A. Gallatin, J. Chase, K. Yocum, "Trapeze/IP: TCP/IP at
   Near-Gigabit Speeds", in 1999 USENIX Technical Conference (Freenix
   Track), June 1999.

[KM02]  K. Magoutis, "Design and Implementation of a Direct Access
   File System (DAFS) Kernel Server for FreeBSD", in Proceedings of
   the USENIX BSDCon 2002 Conference, San Francisco, CA, February
   11-14, 2002.

[MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase,
   D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure
   and Performance of the Direct Access File System (DAFS)", in
   Proceedings of the 2002 USENIX Annual Technical Conference,
   Monterey, CA, June 9-14, 2002.

[MOG03]  J. Mogul, "TCP offload is a dumb idea whose time has come",
   in 9th Workshop on Hot Topics in Operating Systems (HotOS IX),
   Lihue, HI, May 2003, USENIX.

[NFSv4.1]  S. Shepler, Ed., "NFSv4 Minor Version 1", Internet-Draft
   (work in progress), draft-ietf-nfsv4-minorversion1.

[PAI+00]  V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified
   I/O buffering and caching system", ACM Trans. Computer Systems,
   18(1):37-66, February 2000.

[RDDP]  RDDP Working Group charter,
   http://www.ietf.org/html.charters/rddp-charter.html.

[RFC4297]  A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct
   Memory Access (RDMA) over IP Problem Statement", Informational
   RFC.

[RFC1094]  Sun Microsystems, "NFS: Network File System Protocol
   Specification".

[RPCRDMA]  T. Talpey, B. Callaghan, "RDMA Transport for ONC RPC",
   Internet-Draft (work in progress), draft-ietf-nfsv4-rpcrdma.
[SHI+03]  P. Shivam, J. Chase, "On the Elusive Benefits of Protocol
   Offload", in Proceedings of ACM SIGCOMM Summer 2003 NICELI
   Workshop; also available from
   http://issg.cs.duke.edu/publications/niceli03.pdf.

[SKE+01]  K.-A. Skevik, T. Plagemann, V. Goebel, P. Halvorsen,
   "Evaluation of a Zero-Copy Protocol Implementation", in
   Proceedings of the 27th Euromicro Conference - Multimedia and
   Telecommunications Track (MTT'2001), Warsaw, Poland, September
   2001.

Authors' Addresses

Tom Talpey
Network Appliance, Inc.
1601 Trapelo Road, #16
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Chet Juszczak
Chet's Boathouse Co.
P.O. Box 1467
Merrimack, NH 03054

Email: chetnh@earthlink.net

Intellectual Property and Copyright Statements

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights.  Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).