                                                   Allyn Romanow (Cisco)
Internet-Draft                                           Jeff Mogul (HP)
Expires: July 2004                                   Tom Talpey (NetApp)
                                               Stephen Bailey (Sandburst)

                      RDMA over IP Problem Statement
                   draft-ietf-rddp-problem-statement-03

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

This draft addresses an IP-based solution to the problem of high system overhead due to end-host copying of user data in the network I/O path at high speeds. The problem is due to the high cost of memory bandwidth, and it can be substantially improved using "copy avoidance." The overhead has limited the use of TCP/IP in interconnection networks especially where high bandwidth, low latency and/or low overhead of end-system data movement are required by the hosted application.

Table Of Contents

   1.   Introduction
   2.   The high cost of data movement operations in network I/O
   2.1. Copy avoidance improves processing overhead
   3.   Memory bandwidth is the root cause of the problem
   4.   High copy overhead is problematic for many key Internet applications
   5.   Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
   6.   Conclusions
   7.   Security Considerations
   8.   Acknowledgements
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

This draft considers the problem of high host processing overhead associated with the movement of user data to and from the network interface under high speed conditions. This problem is often referred to as the "I/O bottleneck" [CT90]. More specifically, the source of high overhead that is of interest here is data movement operations - copying. The throughput of a system may therefore be limited by the overhead of this copying. This issue is not to be confused with TCP offload, which is not addressed here. High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory. With today's computer systems, one Gbits/s and over is considered high speed.

High costs associated with copying are an issue primarily for large scale systems.
Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will come primarily in the next few years as they scale in the amount of bandwidth they handle. Today it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication. Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over approximately the last 14 years, and we draw freely on these results. Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates - 100 Mbits/s FDDI and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet. In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred. Now, however, this is no longer the case. While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet which is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer. Instead, special purpose network fabrics, using a technology generally known as remote direct memory access (RDMA), have been developed and are widely used. RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers. Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks. These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases the processing overhead. It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT04]. It also discusses how the proposed technology may introduce security concerns and how they should be addressed.
2. The high cost of data movement operations in network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead. It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

In 1989, Clark et al. [CJRS89] showed that TCP [Po81] overhead processing is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to the costs associated with processing individual bytes, specifically computing the checksum and moving data in memory. They found that moving data in memory is the more important of these costs, and their experiments show that memory bandwidth is the greatest source of limitation. In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying. The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which is two memory bus crossings - for read and write. Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study. A number of studies report results showing that data-touching operations, checksumming and data movement, dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96]. For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes. The percentages of total software costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95]. For example, [KP96] reports results that separate processing times for checksum from data movement operations. For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying. The study used 2 DECstations 5000/200 connected by an FDDI network. (In this study, checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead

A number of studies show that eliminating copies substantially reduces overhead. For example, results from copy avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.
The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95]. This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%. Thus the relative importance of reducing copies is 26%. At 370 Mbits/s, the system is not very heavily loaded. The relative improvement in achievable bandwidth is 34%. This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost. This is what happens if other sources of overhead, such as checksumming, are eliminated. In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%. This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.
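As an illustration, the quoted figures can be reproduced with simple arithmetic, under the assumption (made here only for illustration) that achievable bandwidth at saturation scales inversely with CPU cost per byte transferred:

   \begin{align*}
   \text{absolute saving}                  &= 24\% - 15\% = 9\%\\
   \text{utilization with copy avoidance}  &= 35\% - 9\% = 26\%\\
   \text{relative reduction}               &= 9\% / 35\% \approx 26\%\\
   \text{bandwidth improvement}            &= 35\% / 26\% - 1 \approx 34\%
   \end{align*}

With checksum overhead also removed, the same arithmetic gives:

   \begin{align*}
   \text{absolute saving}       &= 26\% - 10\% = 16\%\\
   \text{relative reduction}    &= 16\% / 26\% \approx 61\%\\
   \text{bandwidth improvement} &= 26\% / 10\% - 1 = 160\%
   \end{align*}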
In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead. Such interfaces also coalesce interrupts to reduce per-packet costs. Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them. (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial.)

3. Memory bandwidth is the root cause of the problem

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97]. This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally, with today's end-systems the effects are observable at network speeds over 1 Gbits/s. In fact, with multiple bus crossings it is possible for the bus bandwidth to become the limiting factor for throughput. This prevents such an end-system from simultaneously achieving full network bandwidth and full application performance.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no; it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97]. Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM]. The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS. The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

Results show measured system performance statistics. Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average over all the systems measured. A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth, (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that the memory bandwidth compares unfavorably with link speed. The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will move the data 3 times in doing a receive operation - 1 for the network interface to deposit the data in memory, and 2 for the CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth. But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed! (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application. And, the less the better.)

In 2001, 1 Gbits/s links were common. An application server may typically have two 1 Gbits/s connections - one connection backend to a storage server and one front-end, say for serving HTTP [FGM+99]. Thus the communications could use 2 Gbits/s. In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else. This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.
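The arithmetic behind this example, restated for clarity (an illustration of the numbers above; the 75% figure is an inference of this restatement, not a measurement):

   \begin{align*}
   \text{memory bandwidth}            &= 1\ \text{GByte/s} = 8\ \text{Gbits/s}\\
   \text{crossings per received byte} &= 1\ (\text{NIC deposit}) + 2\ (\text{CPU copy: read + write}) = 3\\
   \text{maximum receive rate}        &= 8 / 3 \approx 2.67\ \text{Gbits/s}\\
   \text{two 1 Gbits/s links}         &= 2\ \text{Gbits/s} \approx 75\%\ \text{of that theoretical maximum}
   \end{align*}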
4. High copy overhead is problematic for many key Internet applications

If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it becomes difficult for the application to scale, that is, to handle more clients and to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96]. In addition, today the applications that suffer from copying overhead are more central in Internet computing - they store, manage, and distribute the information of the Internet and the enterprise. They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups. Clusters are typically used for this category of application, since they have advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers. One tier is typically a set of web servers connecting to the WAN. The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is backend databases. Physically, the first two tiers - web server and application server - are usually combined [Pi01]. For example, an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer. (There might also be performance issues on SAN paths used either by the database tier or the application tier.) The high overhead from network-related memory copies diverts system resources from other application processing. It also can create bottlenecks that limit total system performance.

There are a large and growing number of these application servers distributed throughout the Internet. In 1999 approximately 3.4 million server units were shipped, in 2000, 3.9 million units, and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

There is high motivation to maximize the processing capacity of each CPU, as scaling by adding CPUs, one way or another, has drawbacks. For example, adding CPUs to a multiprocessor will not necessarily help, as a multiprocessor improves performance only when the memory bus has additional bandwidth to spare. Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth. Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput. Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overheads on network paths in a "scalable" system both can require the use of more processors than optimal, and can reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and removes at least 80% of the bandwidth overhead from the 2-copy implementations.
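These fractions follow from counting memory bus crossings per received byte, assuming (as in the example of Section 3) one crossing for the adapter's deposit of the data and two crossings, a read plus a write, for each CPU copy:

   \begin{align*}
   \text{2-copy receive}            &: 1 + 2 + 2 = 5 \text{ crossings per byte}\\
   \text{1-copy receive}            &: 1 + 2 = 3 \text{ crossings per byte}\\
   \text{copy avoidance}            &: 1 \text{ crossing per byte}\\
   \text{load removed vs. 1-copy}   &= (3 - 1)/3 = 2/3\\
   \text{load removed vs. 2-copy}   &= (5 - 1)/5 = 80\%
   \end{align*}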
The removal of this bus bandwidth requirement, in turn, removes bottlenecks from the network processing path and increases the throughput of the machine. On a machine with limited bus bandwidth, the advantages of removing this load are immediately evident, as the host can attain full network bandwidth. Even on a machine with bus bandwidth adequate to sustain full network bandwidth, removal of the bus bandwidth load serves to increase the availability of the machine for the processing of user applications, in some cases dramatically.

An example showing poor performance with copies and improved scaling with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system. In an experiment designed to mimic real world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared. One server was Apache, another an optimized server called Flash, and the third the Flash server running IO-Lite, called Flash-Lite, with zero copy. The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served. As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

      #Clients      Apache       Flash   Flash-Lite
                  (reqs/s)    (reqs/s)     (reqs/s)

             0         520         610          890
            16         390         490          890
            32         360         490          850
            64         360         490          890
           128         310         450          880
           256         310         440          820

Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead. Web proxies (which often receive as much data as they send) and complex Web servers based on SANs or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There has been extensive research investigation and industry experience with two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs. In one approach, hardware and/or software changes within a single host reduce processing costs. In another approach, memory-to-memory networking [MAF+02], the exchange of explicit data placement information between hosts allows them to reduce processing costs.

The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BP96, Ch96, TK95, DP93, PDZ99]. In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement. Commonly this approach is called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience has shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic. The research special-purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based NICs [KSZ95, DWB+93]. In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive. And those that have done so make strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.
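As an illustration of why such requirements reduce portability, the sketch below shows the kind of buffer discipline a page-remapping ("page flipping") zero-copy receive scheme typically imposes on an application. The helper and its rules are generic assumptions made for this illustration; they are not drawn from any specific system cited in this document.

   /* Illustrative only: buffer constraints typical of page-remapping
    * zero-copy receive schemes - page alignment, page-multiple length,
    * and no touching of the buffer while a receive is outstanding.
    */
   #define _POSIX_C_SOURCE 200112L
   #include <stdlib.h>
   #include <unistd.h>

   void *alloc_zerocopy_recv_buffer(size_t len, size_t *alloc_len)
   {
       long page = sysconf(_SC_PAGESIZE);
       void *buf = NULL;

       /* Round the requested length up to a whole number of pages;
        * partial pages cannot be remapped into the application. */
       *alloc_len = ((len + (size_t)page - 1) / (size_t)page) * (size_t)page;

       /* The buffer itself must start on a page boundary. */
       if (posix_memalign(&buf, (size_t)page, *alloc_len) != 0)
           return NULL;

       /* The application must not write into the buffer while a receive
        * is outstanding; schemes of this kind typically fall back to
        * copying if the buffer is touched or misaligned. */
       return buf;
   }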
In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA - performance has been good and there have not been system or networking difficulties. RDMA is a single solution. Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.

In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94]. While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, specific system optimizations have been complex, fragile, extremely interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99]. The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that for each variation in architecture, optimizations must be re-crafted [Ch96].

With RDMA, application I/O buffers are mapped directly, and the authorized peer may access them without incurring additional processing overhead. When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the host CPU at all.

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance. These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], and Winsock Direct [Pi01]. Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement. These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET]. Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers. A large and growing class of applications has already emerged which takes advantage of such capabilities, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as being composed of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics". They are distinct in purpose and also in practice - they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility. This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance reading and writing.

The RDMA control functions are semantically layered atop direct data placement. Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling their completions. A "send" facility is provided.
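To make the separation concrete, the following sketch outlines a hypothetical programming interface with a placement layer and a control layer. All names, types, and signatures here are invented for this illustration; they are not the protocol elements defined in [BT04], nor the API of any product or system cited in this document.

   /* Hypothetical, illustrative interface sketching the DDP/RDMA split. */
   #include <stddef.h>
   #include <stdint.h>

   typedef uint32_t stag_t;   /* steering tag naming an exposed memory region */
   struct ddp_ep;             /* an established DDP/RDMA endpoint (opaque)    */
   struct completion { uint64_t work_id; int status; };

   /* Data placement: expose (register) a local buffer so the adapter may
    * steer incoming data directly into it, and so the peer may address it
    * by steering tag and offset. */
   stag_t ddp_register(struct ddp_ep *ep, void *buf, size_t len, int remote_access);
   void   ddp_deregister(struct ddp_ep *ep, stag_t stag);

   /* RDMA semantics layered on placement: one-sided read/write against a
    * peer's advertised region, plus a two-sided send and completion
    * signaling handled by the control layer. */
   int rdma_write(struct ddp_ep *ep, stag_t peer_stag, size_t peer_offset,
                  const void *src, size_t len, uint64_t work_id);
   int rdma_read (struct ddp_ep *ep, stag_t peer_stag, size_t peer_offset,
                  void *dst, size_t len, uint64_t work_id);
   int rdma_send (struct ddp_ep *ep, const void *msg, size_t len, uint64_t work_id);
   int rdma_poll_completion(struct ddp_ep *ep, struct completion *out);

Note that the one-sided operations carry peer addressing information (a steering tag and offset), which is one reason Section 7 requires that the regions addressable by peers remain under strict application control.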
While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA". The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer. Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT04] develops an architecture for DDP and RDMA atop the Internet Protocol Suite, and is a companion draft to this problem statement.

6. Conclusions

This Problem Statement concludes that an IP-based, general solution for reducing processing overhead in end-hosts is desirable.

It has shown that the high overhead of the processing of network data leads to end-host bottlenecks. These bottlenecks are in large part attributable to the copying of data. The bus bandwidth of machines has historically been limited, and the bandwidth of high-speed interconnects taxes it heavily.

An architectural solution that alleviates these bottlenecks best addresses the issue. Further, the high speed of today's interconnects and the deployment of these hosts on Internet Protocol-based networks lead to the desirability of layering such a solution on the Internet Protocol Suite. The architecture described in [BT04] is such a proposal.

7. Security Considerations

Solutions to the problem of reducing copying overhead in high bandwidth transfers via one or more protocols may introduce new security concerns. Any proposed solution must be analyzed for security threats, and any such threats must be addressed. Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified threats - need to be examined and described, and an adequate solution to them found.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from and must permit integration with Internet security standards, such as IPsec and TLS [IPSEC, TLS]. A thorough analysis of the degree to which these protocols address potential threats is required.

Security for an RDMA design requires more than just securing the communication channel. While it is necessary to be able to guarantee channel properties such as privacy, integrity, and authentication, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy. For example, an RDMA peer should not be able to read or write memory regions without prior consent.

Further, it must not be possible to evade consistency checks at the recipient. The RDMA design must allow the recipient to rely on its consistent memory contents by controlling peer access to memory regions explicitly, and must disallow peer access to regions when not authorized.

The RDMA protocols must ensure that regions addressable by RDMA peers be under strict application control. Remote access to local memory by a network peer introduces a number of potential security concerns.
This becomes particularly important in the Internet context, where such access can be exported globally.

The RDMA protocols carry in part what is essentially user information, explicitly including addressing information and operation type (read or write), and implicitly including protection and attributes. As such, the protocol requires checking of these higher level aspects in addition to the basic formation of messages. The semantics associated with each class of error must be clearly defined, and the expected action to be taken on mismatch must be specified. In some cases, this will result in a catastrophic error on the RDMA association; in others, a local or remote error may be signalled. Certain of these errors may require consideration of abstract local semantics, which must be carefully specified so as to provide useful behavior while not constraining the implementation.

8. Acknowledgements

Jeff Chase generously provided many useful insights and information. Thanks to Jim Pinkerton for many helpful discussions.

9. Informative References

[BCF+95]  N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995.

[BJM+96]  G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996.

[BLA+94]  M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153.

[Br99]    J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542.

[BS96]    J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291.

RFC Editor note: Replace the following architecture draft-ietf- name, status and date with the appropriate reference when assigned.

[BT04]    S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-04, January 2004.

[CFF+94]  C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42.

[CGY01]   J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[Ch96]    H. K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996.

[Ch02]    Jeffrey Chase, Personal communication.

[CJRS89]  D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume 27, Issue 6, June 1989, pp. 23-29.
[CT90]    D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990.

[DAFS]    DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

[DAPP93]  P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17.

[DP93]    P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993.

[DWB+93]  C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993.

[EBBV95]  T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995.

[FGM+99]  R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999.

[FIBRE]   ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

[HP97]    J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997.

[IB]      InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[KP96]    J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996.

[KSZ95]   K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95.

[Ma02]    K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[Mc95]    J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995.

[Ne00]    A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000. http://serverwatch.internet.com/news/2000_07_24_a.html

[Pa01]    M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001. http://serverwatch.internet.com/news/2001_02_07_a.html

[PAC+97]  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997.
[PDZ99]   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999.

[Pi01]    J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

[Po81]    J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[QUAD]    Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

[SDP]     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[SRVNET]  R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995.

[STREAM]  J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

[TK95]    M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995.

[VI]      Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

[Wa97]    J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38.

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
Email: steph@sandburst.com

Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, CA 94304 USA

Phone: +1 650 857 2206 (email preferred)
Email: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2004). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.