Internet-Draft                                    Allyn Romanow (Cisco)
Expires: April 2005                                     Jeff Mogul (HP)
                                                     Tom Talpey (NetApp)
                                               Stephen Bailey (Sandburst)

      Remote Direct Memory Access (RDMA) over IP Problem Statement
                  draft-ietf-rddp-problem-statement-05

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   Overhead due to the movement of user data in the end-system network I/O processing path at high speeds is significant, and has limited the use of Internet protocols in interconnection networks and the Internet itself - especially where high bandwidth, low latency, and/or low overhead are required by the hosted application.

   This draft examines this overhead and describes an architectural, IP-based "copy avoidance" solution for its elimination, by enabling Remote Direct Memory Access (RDMA).

Table of Contents

   1.    Introduction
   2.    The high cost of data movement operations in network I/O
   2.1.  Copy avoidance improves processing overhead
   3.    Memory bandwidth is the root cause of the problem
   4.    High copy overhead is problematic for many key Internet applications
   5.    Copy Avoidance Techniques
   5.1.  A Conceptual Framework: DDP and RDMA
   6.    Conclusions
   7.    Security Considerations
   8.    Terminology
   9.    Acknowledgements
         Informative References
         Authors' Addresses
         Full Copyright Statement

1. Introduction

   This draft considers the problem of high host processing overhead associated with the movement of user data to and from the network interface under high speed conditions.  This problem is often referred to as the "I/O bottleneck" [CT90].  More specifically, the source of high overhead that is of interest here is data movement operations - copying.  The throughput of a system may therefore be limited by the overhead of this copying.  This issue is not to be confused with TCP offload, which is not addressed here.  High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory.  With today's computer systems, one Gigabit per second (Gbits/s) and over is considered high speed.

   High costs associated with copying are an issue primarily for large scale systems.  Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will come primarily over the next few years, as the bandwidth they handle grows.  Today it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead.  Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, e-commerce, and web serving, as well as content distribution, video distribution, backups, data mining and decision support, and scientific computing.

   Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.  Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

   The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over approximately the last 14 years, and we draw freely on these results.  Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates: 100 Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], and 1 Gbits/s Ethernet.  In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred.  Now, however, this is no longer the case.
While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

   Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer.  Instead, special purpose network fabrics, using a technology generally known as Remote Direct Memory Access (RDMA), have been developed and are widely used.  RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers.  Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks.  These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

   This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases this processing overhead.  It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

   The document goes on to discuss why the problem is relevant to the Internet and to Internet-based applications.  Applications that store, manage, and distribute the information of the Internet are well suited to applying the copy avoidance solution.  They will benefit by avoiding high processing overheads, which removes limits to the available scaling of tiered end-systems.  Copy avoidance also reduces latency for these systems, which can further benefit effective distributed processing.

   In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT04].  It also discusses how the proposed technology may introduce security concerns and how they should be addressed.

   Finally, this document includes a Terminology section to serve as a reference for several new terms introduced by RDMA.

2. The high cost of data movement operations in network I/O

   A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead.  It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

   Clark et al. [CJRS89] in 1989 show that TCP [Po81] processing overhead is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to the costs associated with processing individual bytes, specifically computing the checksum and moving data in memory.  They found that moving data in memory is the more important of the costs, and their experiments show that memory bandwidth is the greatest source of limitation.  In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying.  The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

   In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which requires two memory bus crossings, one for the read and one for the write.  Although in certain circumstances it is possible to do better, usually two copies are required on receive.
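   As an illustration only - it is not drawn from any of the studies cited in this document - the following self-contained sketch models in ordinary C where the two receive-side copies arise in a conventional sockets stack.  The buffer names are hypothetical simplifications; real kernels differ considerably.

      /* Illustrative model of the conventional two-copy receive path;
       * the arrays stand in for the NIC DMA area, the kernel socket
       * buffer, and the application buffer.  Not code from any cited
       * system. */
      #include <stdio.h>
      #include <string.h>

      #define FRAME_LEN 1460               /* one Ethernet payload    */

      static char nic_dma_area[FRAME_LEN]; /* written by NIC DMA      */
      static char kernel_skbuf[FRAME_LEN]; /* kernel socket buffer    */
      static char app_buffer[FRAME_LEN];   /* application's buffer    */

      int main(void)
      {
          /* The NIC has already deposited a frame into nic_dma_area. */

          /* Copy 1: driver/stack moves the payload into a socket
           * buffer (one bus read plus one bus write per byte). */
          memcpy(kernel_skbuf, nic_dma_area, FRAME_LEN);

          /* Copy 2: the receive call moves the payload into the
           * application buffer (one more read plus one more write). */
          memcpy(app_buffer, kernel_skbuf, FRAME_LEN);

          printf("bytes delivered: %d, bus crossings after DMA: %d\n",
                 FRAME_LEN, 4 * FRAME_LEN);
          return 0;
      }

   In this model, each received byte crosses the memory bus four times after the initial DMA write; this is the per-byte cost that copy avoidance aims to remove.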
   Subsequent work has consistently shown the same phenomenon as the earlier Clark study.  A number of studies report results showing that data-touching operations - checksumming and data movement - dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96].  For smaller-sized messages, per-packet overheads dominate [KP96, CGY01].

   The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96].  For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes.  The percentages of total software costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

   Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95].  For example, [KP96] reports results that separate the processing times for checksum from data movement operations.  For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying.  The study used 2 DECstations 5000/200 connected by an FDDI network.  (In this study, checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead

   A number of studies show that eliminating copies substantially reduces overhead.  For example, results from copy-avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.  The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

   There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

   Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KByte MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an absolute improvement of 9% due to copy avoidance.

   The total CPU utilization was 35%, with data access accounting for 24%.  Thus the relative importance of reducing copies is 26%.  At 370 Mbits/s, the system is not very heavily loaded.  The relative improvement in achievable bandwidth is 34%.  This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.
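   The way these reported percentages relate to one another can be restated as follows; this worked reading is illustrative and is not taken from [CGY01] itself:

      Total CPU utilization:                35%
      CPU time spent on data access:        24%, reduced to 15%
      Absolute improvement:                 24% - 15% = 9%
      Relative importance of copies:        9 / 35 = approximately 26%
      CPU cost remaining for the same load: 35% - 9% = 26%
      Achievable bandwidth improvement:     35 / 26 - 1 = approximately 34%

   That is, at saturation the same CPU could move roughly one third more data if the copy were removed.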
   Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost.  This is what happens if other sources of overhead, such as checksumming, are eliminated.  In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%.  This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.

   In fact, today's network interface adapters commonly offload the checksum, which removes the other source of per-byte overhead.  They also coalesce interrupts to reduce per-packet costs.  Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them.  (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial [BS96, Br99, Ch96, KP96, TK95].)

3. Memory bandwidth is the root cause of the problem

   Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97].  This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

   With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths.  Generally, with today's end-systems, the effects are observable at network speeds over 1 Gbits/s.  In fact, with multiple bus crossings it is possible to see the bus bandwidth become the limiting factor for throughput.  This prevents such an end-system from simultaneously achieving full network bandwidth and full application performance.

   A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O.  The answer is no; it is the memory bandwidth that is the issue.  Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

   The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97].  Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

   Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM].  The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS.  The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

   Results show measured system performance statistics.  Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average, over all the systems measured.  A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

   Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

   A typical example illustrates that memory bandwidth compares unfavorably with link speed.  The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, moves the data 3 times in doing a receive operation: once for the network interface to deposit the data in memory, and twice for the CPU to copy the data.  With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth.  But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed!  (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application.  And the less, the better.)

   In 2001, 1 Gbits/s links were common.  An application server may typically have two 1 Gbits/s connections: one back-end connection to a storage server and one front-end connection, say for serving HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else.  This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.
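   Following the accounting used in the example above, in which each of the three data movements is counted as one pass over the memory interface, the budget works out as shown below.  This restatement is illustrative only:

      Sustainable memory bandwidth:       1 GByte/s = 8 Gbits/s
      Data movements per received byte:   3 (one NIC deposit + two copies)
      Maximum receive rate:               8 / 3 = approximately 2.67 Gbits/s

   Against the 2 Gbits/s of offered load in the two-interface server example, network I/O alone would consume nearly the entire memory system, leaving little capacity for the application itself.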
4. High copy overhead is problematic for many key Internet applications

   If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it becomes difficult for the application to scale, that is, to handle more clients and to offer more services.

   Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96].  In addition, today the applications that suffer from copying overhead are more central in Internet computing: they store, manage, and distribute the information of the Internet and the enterprise.  They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups.  Clusters are typically used for this category of application, since they have advantages of availability and scalability.

   Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers.  One tier is typically a set of web servers connecting to the WAN.  The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is backend databases.  Physically, the first two tiers - web server and application server - are usually combined [Pi01].  For example, an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.
   When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer.  (There might also be performance issues on Storage Area Network paths used either by the database tier or the application tier.)  The high overhead from network-related memory copies diverts system resources from other application processing.  It also can create bottlenecks that limit total system performance.

   There is a large and growing number of these application servers distributed throughout the Internet.  In 1999, approximately 3.4 million server units were shipped; in 2000, 3.9 million units; and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

   There is high motivation to maximize the processing capacity of each CPU, as scaling by adding CPUs, one way or another, has drawbacks.  For example, adding CPUs to a multiprocessor will not necessarily help, as a multiprocessor improves performance only when the memory bus has additional bandwidth to spare.  Clustering can add additional complexity to handling the applications.

   In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth.  Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput.  Interconnect latency can also limit the performance of applications that frequently share data between processors.

   So, excessive overheads on network paths in a "scalable" system can both require the use of more processors than optimal and reduce the marginal utility of those additional processors.

   Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and at least 80% of the bandwidth overhead from 2-copy implementations.

   The removal of this bus bandwidth requirement, in turn, removes bottlenecks from the network processing path and increases the throughput of the machine.  On a machine with limited bus bandwidth, the advantages of removing this load are immediately evident, as the host can attain full network bandwidth.  Even on a machine with bus bandwidth adequate to sustain full network bandwidth, removal of bus bandwidth load serves to increase the availability of the machine for the processing of user applications, in some cases dramatically.

   An example showing poor performance with copies and improved scaling with copy avoidance is illustrative.  The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system.  In an experiment designed to mimic real-world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared.  One server was Apache, another was an optimized server called Flash, and the third was the Flash server running IO-Lite, called Flash-Lite, with zero copy.  The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served.  As the table below shows, Flash-Lite has better throughput, especially as the number of clients increases.
                    Throughput (requests/second)
      #Clients      Apache      Flash      Flash-Lite
      --------      ------      -----      ----------
          0           520         610          890
         16           390         490          890
         32           360         490          850
         64           360         490          890
        128           310         450          880
        256           310         440          820

   Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead.  Web proxies (which often receive as much data as they send) and complex Web servers based on System Area Networks or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

   There has been extensive research investigation of, and industry experience with, two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs.  In one approach, hardware and/or software changes within a single host reduce processing costs.  In another approach, memory-to-memory networking [MAF+02], the exchange of explicit data placement information between hosts allows them to reduce processing costs.

   The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement.  Commonly this approach is called RDMA, Remote Direct Memory Access.

   As discussed below, research and industry experience have shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic.  The research special-purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based adapters [KSZ95, DWB+93].  In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive.  Those that have done so impose strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

   In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA; performance has been good, and there have not been system or networking difficulties.  RDMA is a single solution.  Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.

   In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94].  While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, specific system optimizations have been complex, fragile, interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].  The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].
   For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that the optimizations must be re-crafted for each variation in architecture.

   With RDMA, application I/O buffers are mapped directly, and the authorized peer may access them without incurring additional processing overhead.  When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the host CPU at all.

   A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance.  These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], and Winsock Direct [Pi01].  Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement.  These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

   By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers.  A large and growing class of applications that take advantage of such capabilities has already emerged, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

   An RDMA solution can be usefully viewed as being composed of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics".  They are distinct in purpose and also in practice - they may be implemented as separate protocols.

   The more fundamental of the two is the direct data placement facility.  This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance for reading and writing.

   The RDMA control functions are semantically layered atop direct data placement.  Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling of their completions.  A "send" facility is provided.

   While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA".  The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer.  Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

   [BT04] develops an architecture for DDP and RDMA atop the Internet Protocol Suite, and is a companion draft to this problem statement.
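   To make the distinction between placement and control concrete, the following self-contained sketch models the two components in ordinary C.  It is illustrative only: the names, types, and semantics are invented for this example and are not taken from [BT04] or from any existing RDMA specification or API, and the memcpy() merely stands in for the adapter's direct placement of arriving data.

      /* Hypothetical model of the DDP/RDMA split; not a real API. */
      #include <stdio.h>
      #include <string.h>

      /* DDP: a memory region the application has chosen to expose. */
      struct ddp_region {
          char   *base;
          size_t  length;
      };

      /* DDP "placement": deliver payload directly into the exposed
       * region at a given offset, with no intermediate buffering. */
      static int ddp_place(struct ddp_region *r, size_t offset,
                           const char *payload, size_t len)
      {
          if (offset + len > r->length)
              return -1;                /* outside the exposed region */
          memcpy(r->base + offset, payload, len); /* stands in for DMA */
          return 0;
      }

      /* RDMA semantics: control layered on placement - here, an
       * "RDMA Write" that places data and then signals completion. */
      static int rdma_write_op(struct ddp_region *r, size_t offset,
                               const char *payload, size_t len)
      {
          int rc = ddp_place(r, offset, payload, len);
          if (rc == 0)
              printf("completion: %zu bytes placed at offset %zu\n",
                     len, offset);
          return rc;
      }

      int main(void)
      {
          char app_buffer[64] = {0};    /* application receive buffer */
          struct ddp_region region = { app_buffer, sizeof(app_buffer) };

          /* The peer's write lands directly in the application buffer. */
          rdma_write_op(&region, 0, "direct placement", 17);
          printf("buffer now holds: %s\n", app_buffer);
          return 0;
      }

   In this model, ddp_place() is the placement facility that other upper layer protocols could also employ, while rdma_write_op() adds the control behavior (here, completion signaling) that the text above attributes to the RDMA layer.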
6. Conclusions

   This Problem Statement concludes that an IP-based, general solution for reducing processing overhead in end-hosts is desirable.

   It has shown that the high overhead of processing network data leads to end-host bottlenecks.  These bottlenecks are in large part attributable to the copying of data.  The bus bandwidth of machines has historically been limited, and the bandwidth of high-speed interconnects taxes it heavily.

   An architectural solution that alleviates these bottlenecks best addresses the issue.  Further, the high speed of today's interconnects and the deployment of these hosts on Internet Protocol-based networks lead to the desirability of layering such a solution on the Internet Protocol Suite.  The architecture described in [BT04] is such a proposal.

7. Security Considerations

   Solutions to the problem of reducing copying overhead in high bandwidth transfers may introduce new security concerns.  Any proposed solution must be analyzed for security vulnerabilities, and any such vulnerabilities addressed.  Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified vulnerabilities - need to be examined and described, and an adequate resolution to them found.

   Layered atop Internet transport protocols, the RDMA protocols will gain leverage from, and must permit integration with, Internet security standards such as IPsec and TLS [IPSEC, TLS].  However, there may be implementation ramifications for certain security approaches with respect to RDMA, due to its copy avoidance.

   IPsec, operating to secure the connection on a packet-by-packet basis, seems to be a natural fit to securing RDMA placement, which operates in conjunction with transport.  Because RDMA enables an implementation to avoid buffering, it is preferable to perform all applicable security protection prior to processing of each segment by the transport and RDMA layers.  Such a layering enables the most efficient secure RDMA implementation.

   The TLS record protocol, on the other hand, is layered on top of reliable transports and cannot provide such security assurance until an entire record is available, which may require the buffering and/or assembly of several distinct messages prior to TLS processing.  This defers RDMA processing and introduces overheads that RDMA is designed to avoid.  TLS is therefore viewed as potentially a less natural fit for protecting the RDMA protocols.

   It is necessary to guarantee properties such as confidentiality, integrity, and authentication on an RDMA communications channel.  However, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy.  Therefore the RDMA design must address protection against such attacks.  For example, an RDMA peer should not be able to read or write memory regions without prior consent.

   Further, it must not be possible to evade memory consistency checks at the recipient.  The RDMA design must allow the recipient to rely on its consistent memory contents by explicitly controlling peer access to memory regions at appropriate times.

   Peer connections that do not pass authentication and authorization checks by upper layers must not be permitted to begin processing in RDMA mode with an inappropriate endpoint.
Once associated, peer accesses to memory regions must be authenticated and made subject to authorization checks in the context of the association and connection on which they are to be performed, prior to any transfer operation or data being accessed.

   The RDMA protocols must ensure that these region protections be under strict application control.  Remote access to local memory by a network peer is particularly important in the Internet context, where such access can be exported globally.

8. Terminology

   This section contains general terminology definitions for this document and for Remote Direct Memory Access in general.

   Remote Direct Memory Access (RDMA)
      A method of accessing memory on a remote system in which the local system specifies the location of the data to be transferred.

   RDMA Protocol
      A protocol that supports RDMA Operations to transfer data between systems.

   Fabric
      The collection of links, switches, and routers that connect a set of systems.

   Storage Area Network (SAN)
      A network where disks, tapes, and other storage devices are made available to one or more end-systems via a fabric.

   System Area Network
      A network where clustered systems share services, such as storage and interprocess communication, via a fabric.

   Fibre Channel (FC)
      An ANSI standard link layer with associated protocols, typically used to implement Storage Area Networks.  [FIBRE]

   Virtual Interface Architecture (VI, VIA)
      An RDMA interface definition developed by an industry group and implemented with a variety of differing wire protocols.  [VI]

   Infiniband (IB)
      An RDMA interface, protocol suite, and link layer specification defined by an industry trade association.  [IB]

9. Acknowledgements

   Jeff Chase generously provided many useful insights and information.  Thanks to Jim Pinkerton for many helpful discussions.

10. Informative References

   [ATM]     The ATM Forum, "Asynchronous Transfer Mode Physical Layer Specification", af-phy-0015.000, etc., drafts available from http://www.atmforum.com/standards/approved.html

   [BCF+95]  N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995

   [BJM+96]  G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996

   [BLA+94]  M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153

   [Br99]    J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

   [BS96]    J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291

   [BT04]    S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-06, October 2004

   [CFF+94]  C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42

   [CGY01]   J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume: 39, Issue: 4, April 2001, pp. 68-74, http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

   [Ch96]    H. K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996

   [Ch02]    Jeffrey Chase, Personal communication

   [CJRS89]  D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume: 27, Issue: 6, June 1989, pp. 23-29

   [CT90]    D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990

   [DAFS]    DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

   [DAPP93]  P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17

   [DP93]    P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993

   [DWB+93]  C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993

   [EBBV95]  T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995

   [FDDI]    International Standards Organization, "Fibre Distributed Data Interface", ISO/IEC 9314, committee drafts available from http://www.iso.org

   [FGM+99]  R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999

   [FIBRE]   ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

   [HP97]    J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997

   [IB]      InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

   [KP96]    J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996

   [KSZ95]   K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95

   [Ma02]    K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002

   [MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D.
Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002

   [Mc95]    J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995

   [Ne00]    A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000, http://serverwatch.internet.com/news/2000_07_24_a.html

   [Pa01]    M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001, http://serverwatch.internet.com/news/2001_02_07_a.html

   [PAC+97]  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997

   [PDZ99]   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999

   [Pi01]    J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

   [Po81]    J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981

   [QUAD]    Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

   [SDP]     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

   [SRVNET]  R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995

   [STREAM]  J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

   [TK95]    M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

   [VI]      Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

   [Wa97]    J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38

Authors' Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA  01810 USA

   Phone: +1 978 689 1614
   Email: steph@sandburst.com

   Jeffrey C. Mogul
   Western Research Laboratory
   Hewlett-Packard Company
   1501 Page Mill Road, MS 1251
   Palo Alto, CA  94304 USA

   Phone: +1 650 857 2206 (email preferred)
   Email: JeffMogul@acm.org

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA  95134 USA

   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Tom Talpey
   Network Appliance
   375 Totten Pond Road
   Waltham, MA  02451 USA

   Phone: +1 781 768 5329
   Email: thomas.talpey@netapp.com

Full Copyright Statement

   Copyright (C) The Internet Society (2004).  This document is subject to the rights, licenses and restrictions contained in BCP 78 and except as set forth therein, the authors retain all their rights.
   This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights.  Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard.  Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the Internet Society.