INTERNET-DRAFT                                             Tom Talpey
Expires: August 2005                                    Chet Juszczak

                                                        February 2005

                      NFS RDMA Problem Statement
          draft-ietf-nfsv4-nfs-rdma-problem-statement-02.txt

Status of this Memo

By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.
Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2005). All Rights Reserved.

Abstract

This draft addresses applying Remote Direct Memory Access (RDMA) to the NFS protocols. NFS implementations historically incur significant overhead due to data copies on end-host systems, as well as other sources. The potential benefits of RDMA to these implementations are explored, and the reasons why RDMA is especially well-suited to NFS and network file protocols in general are evaluated.

Table Of Contents

   1. Introduction
   2. Problem Statement
   3. File Protocol Architecture
   4. Sources of Overhead
   4.1. Savings from TOE
   4.2. Savings from RDMA
   5. Application of RDMA to NFS
   6. Improved Semantics
   7. Conclusions
   Acknowledgements
   Normative References
   Informative References
   Authors' Addresses
   Full Copyright Statement
1. Introduction

The Network File System (NFS) protocol (as described in [RFC1094], [RFC1813], and [RFC3530]) is one of several remote file access protocols used in the class of processing architecture sometimes called Network Attached Storage (NAS).

Historically, remote file access has proved to be a convenient, cost-effective way to share information over a network, a concept proven over time by the popularity of the NFS protocol. However, there are issues in such a deployment.

As compared to a local (direct-attached) file access architecture, NFS removes the overhead of managing the local on-disk filesystem state and its metadata, but interposes at least a transport network and two network endpoints between an application process and the files it is accessing. This tradeoff has to date usually resulted in a net performance loss, due to reduced bandwidth, increased application server CPU utilization, and other overheads.

Several classes of applications, including those directly supporting enterprise activities in high performance domains such as database applications and shared clusters, have therefore encountered issues with moving to NFS architectures. While this has been due principally to the performance costs of NFS versus direct-attached files, other reasons are relevant, such as the lack of strong consistency guarantees provided by NFS implementations.

Replication of local file access performance on NAS using traditional network protocol stacks has proven difficult, not because of protocol processing overheads, but because of data copy costs in the network endpoints. This is especially true since host buses are now often the main bottleneck in NAS architectures [MOG03] [CHA+01].

The External Data Representation [RFC1832] employed beneath NFS and RPC [RFC1831] can add more data copies, exacerbating the problem.
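To illustrate where one such copy arises, the following sketch (ours, not taken from any NFS implementation) encodes and decodes an XDR variable-length opaque. The payload is copied out of the receive buffer as a side effect of decoding, because XDR gives the decoder no way to place the bytes anywhere else:

```python
import struct

def xdr_encode_opaque(data: bytes) -> bytes:
    """Encode a variable-length opaque per XDR: a 4-byte big-endian
    length, the payload, then zero padding to a 4-byte boundary."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def xdr_decode_opaque(buf: bytes, offset: int = 0):
    """Decode an opaque from a receive buffer.  The slice below is
    where a typical implementation incurs its data copy: the payload's
    final destination is unknown until decode time, so the bytes are
    copied out of the transport buffer."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    payload = buf[start:start + length]      # the copy out of the buffer
    pad = (4 - length % 4) % 4
    return payload, start + length + pad     # data, offset of next item

payload, next_off = xdr_decode_opaque(xdr_encode_opaque(b"hello"))
```

Note that the padding rule also means a payload of arbitrary length lands at arbitrary alignment relative to the items that follow it.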
Data copy-avoidance designs have not been widely adopted for a variety of reasons. [BRU99] points out that "many copy avoidance techniques for network I/O are not applicable or may even backfire if applied to file I/O." Other designs that eliminate unnecessary copies, such as [PAI+00], are incompatible with existing APIs and therefore force application changes.

Over the past year, an effort to standardize a set of protocols for Remote Direct Memory Access (RDMA) over the standard Internet Protocol Suite has been chartered [RDDP]. Several drafts have been proposed and are under discussion.

RDMA is a general solution to the problem of CPU overhead incurred due to data copies, primarily at the receiver. Substantial research has addressed this and has borne out the efficacy of the approach. An overview is provided by the RDDP Problem Statement document [RDDPPS].

In addition to the per-byte savings of offloading data copies, RDMA-enabled NICs (RNICs) offload the underlying protocol layers as well, e.g. TCP, further reducing CPU overhead due to NAS processing.

1.1. Background

The RDDP Problem Statement [RDDPPS] asserts:

   "High costs associated with copying are an issue primarily for large scale systems ... with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, for e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing. Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.
   Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions."

Note that each of the servers listed above could be accessing its file data as an NFS client, or serving the data to such clients over NFS, or acting as both.

The CPU overhead of the NFS and TCP/IP protocol stacks (including data copies or reduced-copy workarounds) becomes significant in these clients and servers. File access using locally attached disks imposes relatively low overhead due to the highly optimized I/O path and the direct memory access afforded to the storage controller. This is not the case with NFS, which must pass data to, and especially from, the network and network processing stack to the NFS stack. Frequently, data copies are imposed on this transfer; in some cases, several such copies in each direction.

Copies are potentially encountered in an NFS implementation exchanging data to and from user address spaces, within kernel buffer caches, in XDR marshalling and unmarshalling, and within network stacks and network drivers. Other overheads, such as serialization among multiple threads of execution sharing a single NFS mount point and transport connection, are additionally encountered.

Numerous upper layer protocols achieve extremely high bandwidth and low overhead through the use of RDMA. [MAF+02] shows that the RDMA-based Direct Access File System (with a user-level implementation of the file system client) can outperform even a zero-copy implementation of NFS [CHA+01] [CHA+99] [GAL+99]. Also, file data access implies the use of large ULP messages. These large messages tend to amortize any increase in per-message costs due to the offload of protocol processing incurred when using RNICs, while gaining the benefits of reduced per-byte costs.
Finally, the direct memory addressing afforded by RDMA avoids many sources of contention on network resources.

2. Problem Statement

The principal performance problem encountered by NFS implementations is the CPU overhead required to implement the protocol. Primary among the sources of this overhead is the movement of data from NFS protocol messages to its eventual destination in user buffers or aligned kernel buffers. Due to the nature of the RPC and XDR protocols, the NFS data payload arrives at arbitrary alignment and NFS requests are completed in an arbitrary sequence.

The data copies consume system bus bandwidth and CPU time, reducing the available system capacity for applications [RDDPPS]. Achieving zero-copy with NFS has, to date, required sophisticated, version-specific "header cracking" hardware and/or extensive platform-specific virtual memory mapping tricks. Such approaches become even more difficult for NFS version 4 due to the existence of the COMPOUND operation, which further reduces alignment and greatly complicates ULP offload.

Furthermore, NFS will soon be challenged by emerging high-speed network fabrics such as 10 Gbits/s Ethernet. Performing even raw network I/O such as TCP is an issue at such speeds with today's hardware. The problem is fundamental in nature and has led the IETF to explore RDMA [RDDPPS].

Zero-copy techniques benefit file protocols extensively, as they enable direct user I/O, reduce the overhead of protocol stacks, provide perfect alignment into caches, etc. Many studies have already shown the performance benefits of such techniques [SKE+01, DCK+03, FJNFS, FJDAFS, MAF+02].

RDMA implementations generally have other interesting properties, such as hardware-assisted protocol access, and support for user space access to I/O.
RDMA is compelling here for another reason: hardware-offloaded networking support in itself does not avoid data copies, without resorting to implementing part of the NFS protocol in the NIC. Support of RDMA by NFS enables the highest performance at the architecture level rather than by implementation; this enables ubiquitous and interoperable solutions.

By providing file access performance equivalent to that of local file systems, NFS over RDMA will enable applications running on a set of client machines to interact through an NFS file system, just as applications running on a single machine might interact through a local file system.

3. File Protocol Architecture

NFS runs as an ONC RPC [RFC1831] application. Being a file access protocol, NFS is very "rich" in data content (versus control information).

NFS messages can range from very small (under 100 bytes) to very large (from many kilobytes to a megabyte or more). They are all contained within an RPC message and follow a variable-length RPC header. This layout presents an alignment challenge for the data items contained in an NFS call (request) or reply (response) message.

In addition to the control information in each NFS call or reply message, there are sometimes large "chunks" of application file data, for example in read and write requests. With NFS version 4 (due to the existence of the COMPOUND operation) there can be several of these data chunks interspersed with control information.

ONC RPC is a remote procedure call protocol that has been run over a variety of transports. Most implementations today use UDP or TCP. RPC messages are defined in terms of an eXternal Data Representation (XDR) [RFC1832], which provides a canonical data representation across a variety of host architectures. An XDR data stream is conveyed differently on each type of transport.
On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte stream, RPC messages are delineated by a record marking protocol. An RDMA transport also conveys RPC messages in a unique fashion that must be fully described if client and server implementations are to interoperate.

The RPC transport is responsible for conveying an RPC message from a sender to a receiver. An RPC message is either an RPC call from a client to a server, or an RPC reply from the server back to the client. An RPC message contains an RPC call header followed by arguments if the message is an RPC call, or an RPC reply header followed by results if the message is an RPC reply. The call header contains a transaction ID (XID) followed by the program and procedure number as well as a security credential. An RPC reply header begins with an XID that matches that of the RPC call message, followed by a security verifier and results. All data in an RPC message is XDR encoded.

The encoding of XDR data into transport buffers is referred to as "marshalling", and the decoding of XDR data contained within transport buffers into destination RPC procedure result buffers is referred to as "unmarshalling". Marshalling therefore takes place at the sender of any particular message, be it an RPC request or an RPC response. Unmarshalling, of course, takes place at the receiver.

Normally, any bulk data is moved (copied) as a result of the unmarshalling process, because the destination address is not known until the RPC code receives control and subsequently invokes the XDR unmarshalling routine. In other words, XDR-encoded data is not self-describing, and it carries no placement information. This results in a data copy in most NFS implementations.
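The marshalling side of the call header described above can be sketched in a few lines (a simplified illustration, not a conforming implementation): each fixed field is a 4-byte big-endian XDR integer, and the AUTH_NONE credential and verifier shown are the simplest case. Because the credential is variable-length, the header as a whole is variable-length, which is one reason the offset of the data payload is unpredictable:

```python
import struct

RPC_CALL = 0       # msg_type for a call (a reply uses 1)
RPC_VERSION = 2

def marshal_call_header(xid: int, prog: int, vers: int, proc: int) -> bytes:
    """Marshal a simplified ONC RPC call header: each field is packed
    as a 4-byte big-endian XDR integer.  The credential and verifier
    here are AUTH_NONE (flavor 0) with zero-length bodies."""
    fixed = struct.pack(">IIIIII", xid, RPC_CALL, RPC_VERSION,
                        prog, vers, proc)
    auth_none = struct.pack(">II", 0, 0)   # flavor, body length
    return fixed + auth_none + auth_none   # credential, then verifier

# e.g. an NFSv3 READ call: program 100003, version 3, procedure 6
header = marshal_call_header(xid=0x12345678, prog=100003, vers=3, proc=6)
```

A real credential (e.g. AUTH_SYS) carries a variable-length body in place of the zero-length one here, shifting everything that follows it.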
One mechanism by which the RPC layer may overcome this is for each request to include placement information, to be used for direct placement during XDR encode. Such a "write chunk" can avoid sending bulk data inline in an RPC message and generally results in one or more RDMA Write operations.

Similarly, a "read chunk", carrying placement information that refers to bulk data which may be directly fetched via one or more RDMA Read operations during XDR decode, may be conveyed. The "read chunk" is therefore useful in both RPC calls and replies, while the "write chunk" is used solely in replies.

These "chunks" are the key concept in an existing proposal [RPCRDMA]. They convey what are effectively pointers to remote memory across the network. They allow cooperating peers to exchange data outside of XDR encodings but still use XDR for describing the data to be transferred. And, finally, through use of XDR they maintain a large degree of on-the-wire compatibility.

The central concept of the RDMA transport is to provide the additional encoding conventions to convey this placement information in transport-specific encoding, and to modify the XDR handling of bulk data.

Block Diagram

   +------------------------+-----------------------------------+
   |           NFS          |            NFS + RDMA             |
   +------------------------+----------------------+------------+
   |           Operations / Procedures             |            |
   +-----------------------------------------------+            |
   |                    RPC/XDR                    |            |
   +--------------------------------+--------------+            |
   |        Stream Transport        |        RDMA Transport     |
   +--------------------------------+---------------------------+

4. Sources of Overhead

Network and file protocol costs can be categorized as follows:

o  per-byte costs - data touching costs such as checksum or data copy.
   Today's network interface hardware commonly offloads the checksum, which leaves the other major source of per-byte overhead, data copy.

o  per-packet costs - interrupts and lower-layer processing. Today's network interface hardware also commonly coalesces interrupts to reduce per-packet costs.

o  per-message (request or response) costs - LLP and ULP processing.

Improvement from optimization becomes more important if the overhead it targets is a larger share of the total cost. As other sources of overhead, such as the checksumming and interrupt handling above, are eliminated, the remaining overheads (primarily data copy) loom larger.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally, with today's end-systems, the effects are observable at network speeds at or above 1 Gbits/s.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no; it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [RDDPPS].

TCP offload engine (TOE) technology aims to offload the CPU by moving TCP/IP protocol processing to the NIC. However, TOE technology by itself does nothing to avoid necessary data copies within upper layer protocols. [MOG03] provides a description of the role TOE can play in reducing per-packet and per-message costs. Beyond the offloads commonly provided by today's network interface hardware, TOE alone (without RDMA) helps in protocol header processing, but this has been shown to be a minority component of the total protocol processing overhead [CHA+01].

Numerous software approaches to the optimization of network throughput have been made.
Experience has shown that network I/O interacts with other aspects of system processing such as file I/O and disk I/O [BRU99] [CHU96]. Zero-copy optimizations based on page remapping [CHU96] can be dependent upon machine architecture and are not scalable to multi-processor architectures. Correct buffer alignment and sizing together are needed to optimize the performance of zero-copy movement mechanisms [SKE+01]. The NFS message layout described above does not facilitate the splitting of headers from data, nor does it facilitate providing correct data buffer alignment.

4.1. Savings from TOE

The expected improvement of TOE specifically for NFS protocol processing can be quantified and shown to be fundamentally limited. [SHI+03] presents a set of "LAWS" parameters which serve to illustrate the issues. In the TOE case, the copy cost can be viewed as part of the application processing "a". Application processing increases the LAWS "gamma", which is shown by the paper to result in a diminished benefit for TOE.

For example, if the overhead is 20% TCP/IP, 30% copy and 50% real application work, then gamma is 80/20 or 4, which means the maximum benefit of TOE is 1/gamma, or only 25%.

For RDMA (with embedded TOE) and the same example, the "overhead" (o) offloaded or eliminated is 50% (20% + 30%). Therefore in the RDMA case, gamma is 50/50 or 1, and the inverse gives the potential benefit of 1 (100%), a factor of two.

                CPU overhead reduction factor

       No Offload   TCP Offload   RDMA Offload
      ------------+-------------+--------------
          1.00x        1.25x         2.00x

The analysis in the paper shows that RDMA could improve throughput by the same factor of two, even when the host is (just) powerful enough to drive the full network bandwidth without RDMA.
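The arithmetic above can be captured in a few lines (our sketch of the relationship as used in this example; the function name is ours, not from [SHI+03]):

```python
def offload_speedup(offloaded: float) -> float:
    """Peak speedup from eliminating a fraction 'offloaded' of total
    CPU overhead, per the LAWS-style argument above: the remaining
    work fraction determines gamma, and the added benefit is 1/gamma."""
    remaining = 1.0 - offloaded
    gamma = remaining / offloaded
    return 1.0 + 1.0 / gamma   # e.g. a 25% benefit gives 1.25x

toe_speedup = offload_speedup(0.20)   # TOE removes only the 20% TCP/IP share
rdma_speedup = offload_speedup(0.50)  # RDMA also removes the 30% copy share
```

With the example's 20%/30%/50% split, this reproduces the 1.25x and 2.00x entries in the table above.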
It can also be shown that the speedup may be higher if network bandwidth grows faster than Moore's Law, although the higher benefits will apply to a narrow range of applications.

4.2. Savings from RDMA

Performance measurements directly comparing an NFS over RDMA prototype with conventional network-based NFS processing are described in [CAL+03]. Comparisons of Read throughput and CPU overhead were performed on two Gigabit Ethernet adapters, one conventional and one with RDMA capability. The prototype RDMA protocol performed all transfers via RDMA Read.

In these results, conventional network-based throughput was severely limited by the client's CPU being saturated at 100% for all transfers. Read throughput reached no more than 60 MBytes/s.

      I/O Type        Size   Read Throughput   CPU Utilization
      Conventional    2KB        20MB/s             100%
      Conventional   16KB        40MB/s             100%
      Conventional  256KB        60MB/s             100%

However, over RDMA, throughput rose to the theoretical maximum throughput of the platform, saturating the single-CPU system only at maximum throughput.

      I/O Type        Size   Read Throughput   CPU Utilization
      RDMA            2KB        10MB/s              45%
      RDMA           16KB        40MB/s              70%
      RDMA          256KB       100MB/s             100%

The lower relative throughput of the RDMA prototype at the small blocksize may be attributable to the RDMA Read imposed by the prototype protocol, which reduced the operation rate since it introduces additional latency. As well, it may reflect the relative increase of per-packet setup costs within the DMA portion of the transfer.

5. Application of RDMA to NFS

Efficient file protocols require efficient data positioning and movement. The client system knows the client memory address where the application has data to be written or wants read data deposited. The server system knows the server memory address where the local filesystem will accept write data or has data to be read.
Neither peer, however, is aware of the other's data destination in the current NFS, RPC or XDR protocols. Existing NFS implementations have struggled with the performance costs of data copies when using traditional Ethernet transports.

With the onset of faster networks, the network I/O bottleneck will worsen. Fortunately, new transports that support RDMA have emerged. RDMA excels at bulk transfer efficiency; it is an efficient way to deliver direct data placement and remove a major part of the problem: data copies. RDMA also addresses other overheads, e.g. underlying protocol offload, and offers separation of control information from data.

The current NFS message layout provides the performance-enhancing opportunity for an NFS over RDMA protocol that separates the control information from data chunks while meeting the alignment needs of both. The data chunks can be copied "directly" between the client and server memory addresses above (with a single occurrence on each memory bus) while the control information can be passed "inline". [RPCRDMA] describes such a protocol.

6. Improved Semantics

Network file protocols need to export the application programming interfaces and semantics that applications, especially mission-critical ones like databases and clusters, have been developed to expect. These APIs and semantics are historical in nature, and their successful deprecation is doubtful. NFS has not delivered all of these semantics (for example, reliable filesystem transactions) for the sake of acceptable performance.

The advanced properties of RDMA-capable transports allow improved semantics. [DAFS] is an example of a protocol which exports semantics similar to those of NFSv4, but improved in specific areas. Improved NFS semantics can also be delivered.
As an example, [NFSRDMA] describes an implementation of RPC for RDMA transport that is evolutionary in nature yet enables the provision of reliable and idempotent filesystem operation. This proposal shows that it is possible to deliver extended semantics with an RPC/XDR layer implementation with no changes required above the NFS layer, and few within.

7. Conclusions

NFS version 4 [RFC3530] has recently been granted "Proposed Standard" status. The NFSv4 protocol was developed along several design points, important among them: effective operation over wide-area networks, including the Internet itself; strong security integrated into the protocol; extensive cross-platform interoperability, including integrated locking semantics compatible with multiple operating systems; and (this is key) protocol extension.

NFS version 4 is an excellent base on which to add the needed performance enhancements and improved semantics described above. The minor versioning support defined in NFS version 4 was designed to support protocol improvements without disruption to the installed base. Evolutionary improvement of the protocol via minor versioning is a conservative and cautious approach to current and future problems and shortcomings.

Many arguments can be made as to the efficacy of the file abstraction in meeting the future needs of enterprise data service and the Internet. Fine-grained Quality of Service (QoS) policies (e.g. data delivery, retention, availability, security, ...) are high among them.

It is vital that the NFS protocol continue to provide these benefits to a wide range of applications, without its usefulness being compromised by concerns about performance and semantic inadequacies. This can reasonably be addressed in the existing NFS protocol framework.
A cautious evolutionary improvement of performance and semantics allows building on the value already present in the NFS protocol, while addressing new requirements that have arisen from the application of networking technology.

8. Acknowledgements

The authors wish to thank Jeff Chase, who provided many useful suggestions.

9. Normative References

[RFC3530]
   S. Shepler, et al., "NFS Version 4 Protocol", Standards Track RFC

[RFC1831]
   R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification Version 2", Standards Track RFC

[RFC1832]
   R. Srinivasan, "XDR: External Data Representation Standard", Standards Track RFC

[RFC1094]
   Sun Microsystems, Inc., "NFS: Network File System Protocol Specification", RFC 1094

[RFC1813]
   B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol Specification", Informational RFC

10. Informative References

[BRU99]
   J. Brustoloni, "Interoperation of copy avoidance in network and file I/O", in Proc. INFOCOM '99, pages 534-542, New York, NY, March 1999, IEEE. Also available from http://www.cs.pitt.edu/~jcb/publs.html

[CAL+03]
   B. Callaghan, T. Lingutla-Raj, A. Chiu, P. Staubach, O. Asad, "NFS over RDMA", in Proceedings of the ACM SIGCOMM Summer 2003 NICELI Workshop.

[CHA+01]
   J. S. Chase, A. J. Gallatin, K. G. Yocum, "Endsystem optimizations for high-speed TCP", IEEE Communications, 39(4):68-74, April 2001.

[CHA+99]
   J. S. Chase, D. C. Anderson, A. J. Gallatin, A. R. Lebeck, K. G. Yocum, "Network I/O with Trapeze", in 1999 Hot Interconnects Symposium, August 1999.

[CHU96]
   H. K. Chu, "Zero-copy TCP in Solaris", in Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996.

[DAFS]
   Direct Access File System Specification version 1.0, available from http://www.dafscollaborative.org, September 2001.

[DCK+03]
   M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T. Talpey, M.
   Wittle, "The Direct Access File System", in Proceedings of 2nd USENIX Conference on File and Storage Technologies (FAST '03), San Francisco, CA, March 31 - April 2, 2003.

[FJDAFS]
   Fujitsu Prime Software Technologies, "Meet the DAFS Performance with DAFS/VI Kernel Implementation using cLAN", available from http://www.pst.fujitsu.com/english/dafsdemo/index.html, 2001.

[FJNFS]
   Fujitsu Prime Software Technologies, "An Adaptation of VIA to NFS on Linux", available from http://www.pst.fujitsu.com/english/nfs/index.html, 2000.

[GAL+99]
   A. Gallatin, J. Chase, K. Yocum, "Trapeze/IP: TCP/IP at Near-Gigabit Speeds", 1999 USENIX Technical Conference (Freenix Track), June 1999.

[KM02]
   K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02]
   K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[MOG03]
   J. Mogul, "TCP offload is a dumb idea whose time has come", 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003. USENIX.

[NFSRDMA]
   T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions", Internet Draft Work in Progress, draft-ietf-nfsv4-session

[PAI+00]
   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", ACM Trans. Computer Systems, 18(1):37-66, Feb. 2000.

[RDDPPS]
   A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct Data Placement Working Group Problem Statement", Internet Draft Work in Progress, draft-ietf-rddp-problem-statement

[RPCRDMA]
   B. Callaghan, T.
   Talpey, "RDMA Transport for ONC RPC", Internet Draft Work in Progress, draft-ietf-nfsv4-rpcrdma

[SHI+03]
   P. Shivam, J. Chase, "On the Elusive Benefits of Protocol Offload", to be published in Proceedings of ACM SIGCOMM Summer 2003 NICELI Workshop, also available from http://issg.cs.duke.edu/publications/niceli03.pdf

[SKE+01]
   K.-A. Skevik, T. Plagemann, V. Goebel, P. Halvorsen, "Evaluation of a Zero-Copy Protocol Implementation", in Proceedings of the 27th Euromicro Conference - Multimedia and Telecommunications Track (MTT'2001), Warsaw, Poland, September 2001.

Authors' Addresses

Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Chet Juszczak
Chet's Boathouse Co.
P.O. Box 1467
Merrimack, NH 03054

Email: chetnh@earthlink.net

Full Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.