Internet-Draft                                               Tom Talpey
Expires: August 2004                            Network Appliance, Inc.

                                                          Chet Juszczak
                                                  Sun Microsystems, Inc.

                                                          February, 2004

                       NFS RDMA Problem Statement
           draft-ietf-nfsv4-nfs-rdma-problem-statement-00.txt

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   This draft addresses applying Remote Direct Memory Access (RDMA) to
   the NFS protocols.  NFS implementations historically incur
   significant overhead due to data copies on end-host systems, as
   well as other sources.  The potential benefits of RDMA to these
   implementations are explored, and the reasons why RDMA is
   especially well-suited to NFS and network file protocols in general
   are evaluated.

Table of Contents

   1.  Introduction
   1.1.  Background
   2.  Problem Statement
   3.  File Protocol Architecture
   4.  Sources of Overhead
   4.1.  Savings from TOE
   4.2.  Savings from RDMA
   5.  Application of RDMA to NFS
   6.  Improved Semantics
   7.  Conclusions
   8.  Acknowledgements
   9.  References
       Authors' Addresses
       Full Copyright Statement
1.  Introduction

   The Network File System (NFS) protocol (as described in [RFC1094],
   [RFC1813], and [RFC3530]) is one of several remote file access
   protocols used in the class of processing architecture sometimes
   called Network Attached Storage (NAS).

   Historically, remote file access has proved to be a convenient,
   cost-effective way to share information over a network, a concept
   proven over time by the popularity of the NFS protocol.  However,
   there are issues in such a deployment.

   As compared to a local (direct-attached) file access architecture,
   NFS removes the overhead of managing the local on-disk filesystem
   state and its metadata, but interposes at least a transport network
   and two network endpoints between an application process and the
   files it is accessing.  This tradeoff has to date usually resulted
   in a net performance loss, due to reduced bandwidth, increased
   application server CPU utilization, and other overheads.

   Several classes of applications, including those directly
   supporting enterprise activities in high performance domains such
   as database applications and shared clusters, have therefore
   encountered issues with moving to NFS architectures.  While this
   has been due principally to the performance costs of NFS versus
   direct-attached files, other reasons are relevant, such as the lack
   of strong consistency guarantees provided by NFS implementations.

   Replicating local file access performance on NAS using traditional
   network protocol stacks has proven difficult, not because of
   protocol processing overheads, but because of data copy costs in
   the network endpoints.  This is especially true since host buses
   are now often the main bottleneck in NAS architectures [MOG03]
   [CHA+01].

   The External Data Representation [RFC1832] employed beneath NFS and
   RPC [RFC1831] can add more data copies, exacerbating the problem.

   Data copy-avoidance designs have not been widely adopted for a
   variety of reasons.  [BRU99] points out that "many copy avoidance
   techniques for network I/O are not applicable or may even backfire
   if applied to file I/O."  Other designs that eliminate unnecessary
   copies, such as [PAI+00], are incompatible with existing APIs and
   therefore force application changes.

   Over the past year, an effort to standardize a set of protocols for
   Remote Direct Memory Access (RDMA) over the standard Internet
   Protocol Suite has been chartered in the IETF Remote Direct Data
   Placement (RDDP) working group.  Several drafts have been proposed
   and are under discussion.

   RDMA is a general solution to the problem of CPU overhead incurred
   due to data copies, primarily at the receiver.  Substantial
   research has addressed this and has borne out the efficacy of the
   approach.  An overview is provided in the RDDP Problem Statement
   [RDDPPS].

   In addition to the per-byte savings from offloading data copies,
   RDMA-enabled NICs (RNICs) offload the underlying protocol layers as
   well (e.g. TCP), further reducing CPU overhead due to NAS
   processing.

1.1.  Background

   The RDDP Problem Statement [RDDPPS] asserts:

      "High costs associated with copying are an issue primarily for
      large scale systems ... with high bandwidth feeds, usually
      multiprocessors and clusters, that are adversely affected by
      copying overhead.  Examples of such machines include all
      varieties of servers: database servers, storage servers,
      application servers for transaction processing, for e-commerce,
      and web serving, content distribution, video distribution,
      backups, data mining and decision support, and scientific
      computing.  Note that such servers almost exclusively service
      many concurrent sessions (transport connections), which, in
      aggregate, are responsible for > 1 Gbits/s of communication.
      Nonetheless, the cost of copying overhead for a particular load
      is the same whether from few or many sessions."
   Note that each of the servers listed above could be accessing its
   file data as an NFS client, or serving the data to such clients via
   NFS, or acting as both.

   The CPU overhead of the NFS and TCP/IP protocol stacks (including
   data copies or reduced-copy workarounds) becomes a significant
   matter in these clients and servers.  File access using locally
   attached disks imposes relatively low overhead due to the highly
   optimized I/O path and the direct memory access afforded to the
   storage controller.  This is not the case with NFS, which must pass
   data to, and especially from, the network and network processing
   stack to the NFS stack.  Frequently, data copies are imposed on
   this transfer, in some cases several such copies in each direction.

   Copies are potentially encountered in an NFS implementation
   exchanging data to and from user address spaces, within kernel
   buffer caches, in XDR marshalling and unmarshalling, and within
   network stacks and network drivers.  Other overheads, such as
   serialization among multiple threads of execution sharing a single
   NFS mount point and transport connection, are additionally
   encountered.

   Numerous upper layer protocols achieve extremely high bandwidth and
   low overhead through the use of RDMA.  [MAF+02] shows that the
   RDMA-based Direct Access File System (with a user-level
   implementation of the file system client) can outperform even a
   zero-copy implementation of NFS [CHA+01] [CHA+99] [GAL+99].  Also,
   file data access implies the use of large ULP messages.  These
   large messages tend to amortize any increase in per-message costs
   due to the offload of protocol processing incurred when using
   RNICs, while gaining the benefits of reduced per-byte costs.
   Finally, the direct memory addressing afforded by RDMA avoids many
   sources of contention on network resources.

2.  Problem Statement

   The principal performance problem encountered by NFS
   implementations is the CPU overhead required to implement the
   protocol.  Primary among the sources of this overhead is the
   movement of data from NFS protocol messages to their eventual
   destination in user buffers or aligned kernel buffers.  Due to the
   nature of the RPC and XDR protocols, the NFS data payload arrives
   at arbitrary alignment, and the NFS requests are completed in an
   arbitrary sequence.

   The data copies consume system bus bandwidth and CPU time, reducing
   the available system capacity for applications [RDDPPS].  Achieving
   zero-copy with NFS has, to date, required sophisticated, version-
   specific "header cracking" hardware and/or extensive platform-
   specific virtual memory mapping tricks.  Such approaches become
   even more difficult for NFS version 4 due to the existence of the
   COMPOUND operation, which further reduces alignment and greatly
   complicates ULP offload.
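   To make the copy concrete, the following sketch shows the receive
   path of a hypothetical, conventional NFS client.  The decoder
   function and header-length value are illustrative placeholders, not
   taken from any particular implementation; the point is only that
   the READ payload begins at a variable offset within an anonymous
   network buffer, so a copy into the user's buffer is forced after
   XDR decode.

      #include <string.h>

      /* Hypothetical decoder: parses the RPC reply and NFS READ
       * headers and returns their combined length, which varies
       * from reply to reply (verifier sizes, attributes, etc.).  */
      static size_t decode_headers(const unsigned char *buf, size_t len)
      {
          (void)buf; (void)len;
          return 128;            /* placeholder value for the sketch */
      }

      /* The payload's offset (and thus its alignment) is unknown
       * until the headers are decoded, so the data cannot have
       * been placed directly and must now be copied.             */
      static void nfs_read_reply(const unsigned char *netbuf, size_t len,
                                 unsigned char *userbuf)
      {
          size_t hdr = decode_headers(netbuf, len);
          memcpy(userbuf, netbuf + hdr, len - hdr);  /* the extra copy */
      }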
   Furthermore, NFS will soon be challenged by emerging high-speed
   network fabrics such as 10 Gbits/s Ethernet.  Performing even raw
   network I/O, such as TCP stream processing, is an issue at such
   speeds with today's hardware.  The problem is fundamental in nature
   and has led the IETF to explore RDMA [RDDPPS].

   Zero-copy techniques benefit file protocols extensively, as they
   enable direct user I/O, reduce the overhead of protocol stacks,
   provide perfect alignment into caches, etc.  Many studies have
   already shown the performance benefits of such techniques [SKE+01]
   [DCK+03] [FJNFS] [FJDAFS] [MAF+02].

   RDMA implementations generally have other interesting properties,
   such as hardware-assisted protocol access and support for user
   space access to I/O.  RDMA is compelling here for another reason:
   hardware-offloaded networking support in itself does not avoid data
   copies, without resorting to implementing part of the NFS protocol
   in the NIC.  Support of RDMA by NFS enables the highest performance
   at the architecture level rather than by implementation; this
   enables ubiquitous and interoperable solutions.

   By providing file access performance equivalent to that of local
   file systems, NFS over RDMA will enable applications running on a
   set of client machines to interact through an NFS file system, just
   as applications running on a single machine might interact through
   a local file system.

3.  File Protocol Architecture

   NFS runs as an ONC RPC [RFC1831] application.  Being a file access
   protocol, NFS is very "rich" in data content (versus control
   information).

   NFS messages can range from very small (under 100 bytes) to very
   large (from many kilobytes to a megabyte or more).  They are all
   contained within an RPC message and follow a variable-length RPC
   header.  This layout provides an alignment challenge for the data
   items contained in an NFS call (request) or reply (response)
   message.

   In addition to the control information in each NFS call or reply
   message, there are sometimes large "chunks" of application file
   data, for example in read and write requests.  With NFS version 4
   (due to the existence of the COMPOUND operation) there can be
   several of these data chunks interspersed with control information.

   ONC RPC is a remote procedure call protocol that has been run over
   a variety of transports.  Most implementations today use UDP or
   TCP.  RPC messages are defined in terms of an eXternal Data
   Representation (XDR) [RFC1832], which provides a canonical data
   representation across a variety of host architectures.  An XDR data
   stream is conveyed differently on each type of transport.  On UDP,
   RPC messages are encapsulated inside datagrams, while on a TCP byte
   stream, RPC messages are delineated by a record marking protocol.
   An RDMA transport also conveys RPC messages in a unique fashion
   that must be fully described if client and server implementations
   are to interoperate.

   The RPC transport is responsible for conveying an RPC message from
   a sender to a receiver.  An RPC message is either an RPC call from
   a client to a server, or an RPC reply from the server back to the
   client.  An RPC message contains an RPC call header followed by
   arguments if the message is an RPC call, or an RPC reply header
   followed by results if the message is an RPC reply.  The call
   header contains a transaction ID (XID) followed by the program and
   procedure number as well as a security credential.  An RPC reply
   header begins with an XID that matches that of the RPC call
   message, followed by a security verifier and results.  All data in
   an RPC message is XDR encoded.
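   For illustration, the headers just described can be sketched in
   C-like form as follows.  This is a simplified rendering only; the
   normative XDR definitions appear in [RFC1831], and the message-type
   and RPC-version discriminants, as well as the reply status arms,
   are omitted here for brevity.

      /* Simplified sketch of the ONC RPC message headers, after
       * [RFC1831]; not a normative or complete definition.      */
      struct opaque_auth {
          unsigned int   flavor;    /* credential/verifier flavor */
          unsigned int   body_len;  /* up to 400 opaque bytes     */
          unsigned char *body;
      };

      struct rpc_call_header {
          unsigned int       xid;   /* transaction ID             */
          unsigned int       prog;  /* program number (NFS)       */
          unsigned int       vers;  /* program version            */
          unsigned int       proc;  /* procedure number           */
          struct opaque_auth cred;  /* security credential        */
          struct opaque_auth verf;  /* security verifier          */
          /* XDR-encoded procedure arguments follow               */
      };

      struct rpc_reply_header {
          unsigned int       xid;   /* matches the call's XID     */
          struct opaque_auth verf;  /* security verifier          */
          /* reply status and XDR-encoded results follow          */
      };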
   The encoding of XDR data into transport buffers is referred to as
   "marshalling", and the decoding of XDR data contained within
   transport buffers into destination RPC procedure result buffers is
   referred to as "unmarshalling".  Marshalling therefore takes place
   at the sender of any particular message, be it an RPC request or an
   RPC response.  Unmarshalling, of course, takes place at the
   receiver.

   Normally, any bulk data is moved (copied) as a result of the
   unmarshalling process, because the destination address is not known
   until the RPC code receives control and subsequently invokes the
   XDR unmarshalling routine.  In other words, XDR-encoded data is not
   self-describing, and it carries no placement information.  This
   results in a data copy in most NFS implementations.

   One mechanism by which the RPC layer may overcome this is for each
   request to include placement information, to be used for direct
   placement during XDR encode.  This "write chunk" can avoid sending
   bulk data inline in an RPC message and generally results in one or
   more RDMA Write operations.

   Similarly, a "read chunk" conveys placement information referring
   to bulk data that may be fetched directly via one or more RDMA Read
   operations during XDR decode.  The "read chunk" is therefore useful
   in both RPC calls and replies, while the "write chunk" is used
   solely in replies.

   These "chunks" are the key concept in an existing proposal
   [RPCRDMA].  They convey what are effectively pointers to remote
   memory across the network.  They allow cooperating peers to
   exchange data outside of XDR encodings, but still use XDR for
   describing the data to be transferred.  And, finally, through the
   use of XDR they maintain a large degree of on-the-wire
   compatibility.
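   The remote memory pointer carried by such a chunk can be sketched
   as follows.  The field names and layout here are illustrative,
   loosely modeled on [RPCRDMA], which defines the actual encoding.

      /* Illustrative chunk segment: a pointer to remote memory,
       * loosely modeled on the proposal in [RPCRDMA].           */
      struct rdma_segment {
          unsigned int       handle;  /* steering tag identifying
                                       * the registered region   */
          unsigned int       length;  /* length of the region    */
          unsigned long long offset;  /* 64-bit remote address   */
      };

      /* A read chunk additionally carries the XDR stream position
       * of the bulk data it replaces, so the receiver knows where
       * the fetched data logically belongs in the message.       */
      struct rdma_read_chunk {
          unsigned int        position;
          struct rdma_segment target;
      };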
   The central concept of the RDMA transport is to provide the
   additional encoding conventions to convey this placement
   information in transport-specific encoding, and to modify the XDR
   handling of bulk data.

   Block Diagram

      +------------------------+-----------------------------------+
      |           NFS          |             NFS + RDMA            |
      +------------------------+----------------------+------------+
      |           Operations / Procedures             |            |
      +-----------------------------------------------+            |
      |                    RPC/XDR                    |            |
      +--------------------------------+--------------+            |
      |        Stream Transport        |      RDMA Transport       |
      +--------------------------------+---------------------------+

4.  Sources of Overhead

   Network and file protocol costs can be categorized as follows:

   o  per-byte costs - data-touching costs such as checksum or data
      copy.  Today's network interface hardware commonly offloads the
      checksum, which leaves the other major source of per-byte
      overhead, data copy.

   o  per-packet costs - interrupts and lower-layer processing.
      Today's network interface hardware also commonly coalesces
      interrupts to reduce per-packet costs.

   o  per-message (request or response) costs - LLP and ULP
      processing.

   Improvement from optimization becomes more important if the
   overhead it targets is a larger share of the total cost.  As other
   sources of overhead, such as the checksumming and interrupt
   handling above, are eliminated, the remaining overheads (primarily
   data copy) loom larger.

   With copies crossing the bus twice per copy, network processing
   overhead is high whenever network bandwidth is large in comparison
   to CPU and memory bandwidths.  Generally, with today's end-systems
   the effects are observable at network speeds at or above 1 Gbits/s.

   A common question is whether increases in CPU processing power
   alleviate the problem of high processing costs of network I/O.  The
   answer is no; memory bandwidth is the issue.  Faster CPUs do not
   help if the CPU spends most of its time waiting for memory
   [RDDPPS].

   TCP offload engine (TOE) technology aims to offload the CPU by
   moving TCP/IP protocol processing to the NIC.  However, TOE
   technology by itself does nothing to avoid data copies within upper
   layer protocols.  [MOG03] provides a description of the role TOE
   can play in reducing per-packet and per-message costs.  Beyond the
   offloads commonly provided by today's network interface hardware,
   TOE alone (without RDMA) helps in protocol header processing, but
   this has been shown to be a minority component of the total
   protocol processing overhead [CHA+01].

   Numerous software approaches to optimizing network throughput have
   been explored.  Experience has shown that network I/O interacts
   with other aspects of system processing such as file I/O and disk
   I/O [BRU99] [CHU96].  Zero-copy optimizations based on page
   remapping [CHU96] can be dependent upon machine architecture and
   are not scalable to multiprocessor architectures.  Correct buffer
   alignment and sizing together are needed to optimize the
   performance of zero-copy movement mechanisms [SKE+01].  The NFS
   message layout described above does not facilitate the splitting of
   headers from data, nor does it facilitate providing correct data
   buffer alignment.

4.1.  Savings from TOE

   The expected improvement of TOE specifically for NFS protocol
   processing can be quantified and shown to be fundamentally limited.
   [SHI+03] presents a set of "LAWS" parameters which serve to
   illustrate the issues.  In the TOE case, the copy cost can be
   viewed as part of the application processing "a".  Application
   processing increases the LAWS "gamma", which is shown by the paper
   to result in a diminished benefit for TOE.

   For example, if the overhead is 20% TCP/IP, 30% copy, and 50% real
   application work, then gamma is 80/20 or 4, which means the maximum
   benefit of TOE is 1/gamma, or only 25%.

   For RDMA (with embedded TOE) and the same example, the "overhead"
   (o) offloaded or eliminated is 50% (20% + 30%).  Therefore in the
   RDMA case, gamma is 50/50 or 1, and the inverse gives a potential
   benefit of 1 (100%), a factor of two.

      CPU overhead reduction factor

      No Offload   TCP Offload   RDMA Offload
      -----------+-------------+--------------
         1.00x        1.25x         2.00x

   The analysis in the paper shows that RDMA could improve throughput
   by the same factor of two, even when the host is (just) powerful
   enough to drive the full network bandwidth without RDMA.  It can
   also be shown that the speedup may be higher if network bandwidth
   grows faster than Moore's Law, although the higher benefits will
   apply to a narrow range of applications.
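   The arithmetic above generalizes directly.  The following small
   sketch restates it, under a simplified reading of the [SHI+03]
   parameters in which "o" is simply the fraction of total per-unit
   CPU work that the offload eliminates:

      /* Upper bound on CPU overhead reduction from offloading a
       * fraction o of the total work (simplified from [SHI+03]):
       * gamma = (1 - o) / o, and the bound is 1/gamma.           */
      double max_offload_benefit(double o)
      {
          return o / (1.0 - o);
      }

      /* TOE example:  o = 0.20 -> 0.25 (25% improvement, 1.25x)
       * RDMA example: o = 0.50 -> 1.00 (100% improvement, 2.00x) */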
4.2.  Savings from RDMA

   Performance measurements directly comparing an NFS over RDMA
   prototype with conventional network-based NFS processing are
   described in [CAL+03].  Comparisons of Read throughput and CPU
   overhead were performed on two Gigabit Ethernet adapters, one
   conventional and one with RDMA capability.  The prototype RDMA
   protocol performed all transfers via RDMA Read.

   In these results, conventional network-based throughput was
   severely limited by the client's CPU being saturated at 100% for
   all transfers.  Read throughput reached no more than 60 MB/s.

      I/O Type      Size    Read Throughput   CPU Utilization
      ------------  ------  ----------------  ---------------
      Conventional    2 KB       20 MB/s           100%
      Conventional   16 KB       40 MB/s           100%
      Conventional  256 KB       60 MB/s           100%

   However, over RDMA, throughput rose to the theoretical maximum of
   the platform, saturating the single-CPU system only at maximum
   throughput.

      I/O Type      Size    Read Throughput   CPU Utilization
      ------------  ------  ----------------  ---------------
      RDMA            2 KB       10 MB/s            45%
      RDMA           16 KB       40 MB/s            70%
      RDMA          256 KB      100 MB/s           100%

   The lower relative throughput of the RDMA prototype at the small
   blocksize may be attributable to the RDMA Read imposed by the
   prototype protocol, which reduced the operation rate since it
   introduces additional latency.  It may also reflect the relative
   increase of per-packet setup costs within the DMA portion of the
   transfer.

5.  Application of RDMA to NFS

   Efficient file protocols require efficient data positioning and
   movement.  The client system knows the client memory address where
   the application has data to be written or wants read data
   deposited.  The server system knows the server memory address where
   the local filesystem will accept write data or has data to be read.
   However, neither peer is aware of the other's data addresses in the
   current NFS, RPC, or XDR protocols.  Existing NFS implementations
   have struggled with the performance costs of data copies when using
   traditional Ethernet transports.

   With the onset of faster networks, the network I/O bottleneck will
   worsen.  Fortunately, new transports that support RDMA have
   emerged.  RDMA excels at bulk transfer efficiency; it is an
   efficient way to deliver direct data placement and remove a major
   part of the problem: data copies.  RDMA also addresses other
   overheads, e.g. underlying protocol offload, and offers separation
   of control information from data.

   The current NFS message layout provides the performance-enhancing
   opportunity for an NFS over RDMA protocol that separates the
   control information from data chunks while meeting the alignment
   needs of both.  The data chunks can be copied "directly" between
   the client and server memory addresses above (with a single
   crossing of each memory bus) while the control information is
   passed "inline".  [RPCRDMA] describes such a protocol.
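   One way such a protocol can use the chunks of Section 3 for an NFS
   READ is sketched below.  This is an illustrative sequence only; the
   verbs-style calls (rdma_register, rpc_send, rpc_wait_reply) are
   hypothetical placeholders, not the API of [RPCRDMA] or of any
   implementation, and rdma_segment is the illustrative structure from
   the Section 3 sketch.

      /* Illustrative client-side NFS READ using a write chunk.
       * All called functions are hypothetical placeholders.      */
      void nfs_read_direct(void *userbuf, size_t count, off_t offset)
      {
          /* Register the destination buffer and obtain a remote
           * memory pointer the server can use for RDMA Write.    */
          struct rdma_segment seg = rdma_register(userbuf, count);

          /* The small, inline RPC call carries only control
           * information plus the chunk; no bulk data is sent.    */
          rpc_send(NFSPROC_READ, offset, count, &seg);

          /* The server places the file data directly into userbuf
           * via RDMA Write (one crossing of each memory bus) and
           * then sends a small, inline reply; no copy occurs on
           * the client.                                          */
          rpc_wait_reply();
      }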
6.  Improved Semantics

   Network file protocols need to export the application programming
   interfaces and semantics that applications, especially mission-
   critical ones such as databases and clusters, have been developed
   to expect.  These APIs and semantics are historical in nature, and
   successful deprecation is doubtful.  NFS has not delivered all of
   these semantics (for example, reliable filesystem transactions) for
   the sake of acceptable performance.

   The advanced properties of RDMA-capable transports allow improved
   semantics.  [DAFS] is an example of a protocol which exports
   semantics similar to those of NFSv4, but improved in specific
   areas.  Improved NFS semantics can also be delivered.  As an
   example, [NFSRDMA] describes an implementation of RPC for RDMA
   transport that is evolutionary in nature yet enables the provision
   of reliable and idempotent filesystem operation.  This proposal
   shows that it is possible to deliver extended semantics with an
   RPC/XDR layer implementation with no changes required above the NFS
   layer, and few within.

7.  Conclusions

   NFS version 4 [RFC3530] has recently been granted "Proposed
   Standard" status.  The NFSv4 protocol was developed along several
   design points, important among them: effective operation over wide-
   area networks, including the Internet itself; strong security
   integrated into the protocol; extensive cross-platform
   interoperability, including integrated locking semantics compatible
   with multiple operating systems; and, key to this discussion,
   protocol extension.

   NFS version 4 is an excellent base on which to add the needed
   performance enhancements and improved semantics described above.
   The minor versioning support defined in NFS version 4 was designed
   to support protocol improvements without disruption to the
   installed base.  Evolutionary improvement of the protocol via minor
   versioning is a conservative and cautious approach to current and
   future problems and shortcomings.

   Many arguments can be made as to the efficacy of the file
   abstraction in meeting the future needs of enterprise data service
   and the Internet.  Fine-grained Quality of Service (QoS) policies
   (e.g. data delivery, retention, availability, security, ...) are
   high among them.

   It is vital that the NFS protocol continue to provide these
   benefits to a wide range of applications, without its usefulness
   being compromised by concerns about performance and semantic
   inadequacies.  This can reasonably be addressed in the existing NFS
   protocol framework.  A cautious evolutionary improvement of
   performance and semantics allows building on the value already
   present in the NFS protocol, while addressing new requirements that
   have arisen from the application of networking technology.

8.  Acknowledgements

   The authors wish to thank Jeff Chase, who provided many useful
   suggestions.

9.  References

   [BRU99]
      J. Brustoloni, "Interoperation of copy avoidance in network and
      file I/O", in Proc. INFOCOM '99, pages 534-542, New York, NY,
      March 1999, IEEE.  Also available from
      http://www.cs.pitt.edu/~jcb/publs.html

   [CAL+03]
      B. Callaghan, T. Lingutla-Raj, A. Chiu, P. Staubach, O. Asad,
      "NFS over RDMA", in Proceedings of ACM SIGCOMM Summer 2003
      NICELI Workshop.

   [CHA+01]
      J. S. Chase, A. J. Gallatin, K. G. Yocum, "Endsystem
      optimizations for high-speed TCP", IEEE Communications,
      39(4):68-74, April 2001.

   [CHA+99]
      J. S. Chase, D. C. Anderson, A. J. Gallatin, A. R. Lebeck, K. G.
      Yocum, "Network I/O with Trapeze", in 1999 Hot Interconnects
      Symposium, August 1999.

   [CHU96]
      H.K. Chu, "Zero-copy TCP in Solaris", in Proc. of the USENIX
      1996 Annual Technical Conference, San Diego, CA, January 1996.

   [DAFS]
      Direct Access File System Specification version 1.0, available
      from http://www.dafscollaborative.org, September 2001.

   [DCK+03]
      M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T.
      Talpey, M. Wittle, "The Direct Access File System", in
      Proceedings of the 2nd USENIX Conference on File and Storage
      Technologies (FAST '03), San Francisco, CA, March 31 - April 2,
      2003.
   [FJDAFS]
      Fujitsu Prime Software Technologies, "Meet the DAFS Performance
      with DAFS/VI Kernel Implementation using cLAN", available from
      http://www.pst.fujitsu.com/english/dafsdemo/index.html, 2001.

   [FJNFS]
      Fujitsu Prime Software Technologies, "An Adaptation of VIA to
      NFS on Linux", available from
      http://www.pst.fujitsu.com/english/nfs/index.html, 2000.

   [GAL+99]
      A. Gallatin, J. Chase, K. Yocum, "Trapeze/IP: TCP/IP at Near-
      Gigabit Speeds", in 1999 USENIX Technical Conference (Freenix
      Track), June 1999.

   [KM02]
      K. Magoutis, "Design and Implementation of a Direct Access File
      System (DAFS) Kernel Server for FreeBSD", in Proceedings of the
      USENIX BSDCon 2002 Conference, San Francisco, CA, February
      11-14, 2002.

   [MAF+02]
      K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D.
      Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure
      and Performance of the Direct Access File System (DAFS)", in
      Proceedings of the 2002 USENIX Annual Technical Conference,
      Monterey, CA, June 9-14, 2002.

   [MOG03]
      J. Mogul, "TCP offload is a dumb idea whose time has come", in
      9th Workshop on Hot Topics in Operating Systems (HotOS IX),
      Lihue, HI, May 2003, USENIX.

   [NFSRDMA]
      T. Talpey, S. Shepler, "NFSv4 RDMA and Session Extensions",
      Internet-Draft Work in Progress, draft-talpey-nfsv4-rdma-
      sess-01, February 2004.

   [PAI+00]
      V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O
      buffering and caching system", ACM Trans. Computer Systems,
      18(1):37-66, February 2000.

   [RDDPPS]
      A. Romanow, J. Mogul, T. Talpey, S. Bailey, "Remote Direct Data
      Placement Working Group Problem Statement", Internet-Draft Work
      in Progress, draft-ietf-rddp-problem-statement-03.

   [RFC1094]
      Sun Microsystems, Inc., "NFS: Network File System Protocol
      Specification", Informational RFC, March 1989.

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
      Specification", Informational RFC.

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol
      Specification Version 2", Standards Track RFC.

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC.

   [RFC3530]
      S. Shepler, et al., "NFS Version 4 Protocol", Standards Track
      RFC.

   [RPCRDMA]
      B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC",
      Internet-Draft Work in Progress, draft-callaghan-rpcrdma-01.

   [SHI+03]
      P. Shivam, J. Chase, "On the Elusive Benefits of Protocol
      Offload", to be published in Proceedings of ACM SIGCOMM Summer
      2003 NICELI Workshop, also available from
      http://issg.cs.duke.edu/publications/niceli03.pdf

   [SKE+01]
      K.-A. Skevik, T. Plagemann, V. Goebel, P. Halvorsen, "Evaluation
      of a Zero-Copy Protocol Implementation", in Proceedings of the
      27th Euromicro Conference - Multimedia and Telecommunications
      Track (MTT'2001), Warsaw, Poland, September 2001.

Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Chet Juszczak
   Sun Microsystems, Inc.
   43 Nagog Park
   Acton, MA 01720 USA

   Phone: +1 978 206 9148
   EMail: chet.juszczak@sun.com

Full Copyright Statement

   Copyright (C) The Internet Society (2004).  All Rights Reserved.
   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.