                                                  Allyn Romanow (Cisco)
Internet-Draft                                          Jeff Mogul (HP)
Expires: August 2003                                 Tom Talpey (NetApp)
                                              Stephen Bailey (Sandburst)

                     RDMA over IP Problem Statement
                 draft-ietf-rddp-problem-statement-01.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This draft addresses an IP-based solution to the problem of high
   system costs due to network I/O copying in end-hosts at high
   speeds.  The problem is due to the high cost of memory bandwidth,
   and it can be substantially improved using "copy avoidance."
   The high overhead has prevented TCP/IP from being used as an
   interconnection network.

Table of Contents

   1.    Introduction
   2.    The high cost of data movement operations in network I/O
   2.1.  Copy avoidance reduces processing overhead
   3.    Memory bandwidth is the root cause of the problem
   4.    High copy overhead is problematic for many key Internet
         applications
   5.    Copy Avoidance Techniques
   5.1.  A Conceptual Framework: DDP and RDMA
   6.    Security Considerations
   7.    Acknowledgements
   8.    References
         Authors' Addresses
         Full Copyright Statement

1.  Introduction

   This draft considers the problem of high host processing overhead
   associated with network I/O under high speed conditions.  This
   problem is often referred to as the "I/O bottleneck" [CT90].  More
   specifically, the source of high overhead of interest here is data
   movement operations - copying.  This issue is not to be confused
   with TCP offload, which is not addressed here.  "High speed" refers
   to conditions where the network link speed is high relative to the
   bandwidths of the host CPU and memory.  With today's computer
   systems, 1 Gbits/s and over is considered high speed.

   High costs associated with copying are an issue primarily for large
   scale systems.  Although smaller systems such as rack-mounted PCs
   and small workstations would benefit from a reduction in copying
   overhead, the benefit to smaller machines will come primarily over
   the next few years, as they scale up the amount of bandwidth they
   handle.  Today it is large system machines with high bandwidth
   feeds, usually multiprocessors and clusters, that are adversely
   affected by copying overhead.  Examples of such machines include
   all varieties of servers: database servers, storage servers,
   application servers for transaction processing, e-commerce, and web
   serving, as well as servers for content distribution, video
   distribution, backups, data mining and decision support, and
   scientific computing.

   Note that such servers almost exclusively service many concurrent
   sessions (transport connections), which, in aggregate, are
   responsible for > 1 Gbits/s of communication.  Nonetheless, the
   copying overhead for a given load is the same whether it comes from
   few or from many sessions.

   The I/O bottleneck, and the role of data movement operations, have
   been widely studied in research and industry over approximately the
   last 14 years, and we draw freely on these results.  Historically,
   the I/O bottleneck has received attention whenever new networking
   technology has substantially increased line rates: 100 Mbits/s FDDI
   and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet.  In earlier
   speed transitions, the availability of memory bandwidth allowed the
   I/O bottleneck issue to be deferred.  Now, however, this is no
   longer the case.  While the I/O problem is significant at 1
   Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is
   motivating an upsurge of activity in industry and research [DAFS,
   IB, VI, CGY01, Ma02, MAF+02].

   Because of the high overhead of end-host processing in current
   implementations, the TCP/IP protocol stack is not used for high
   speed transfer.  Instead, special purpose network fabrics, using a
   technology generally known as remote direct memory access (RDMA),
   have been developed and are widely used.  RDMA is a set of
   mechanisms that allow the network adapter, under control of the
   application, to steer data directly into and out of application
   buffers.  Examples of such interconnection fabrics include Fibre
   Channel [FIBRE] for block storage transfer, Virtual Interface
   Architecture [VI] for database clusters, and InfiniBand [IB],
   Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area
   Networks.  These link level technologies limit application scaling
   in both distance and size, meaning that the number of nodes cannot
   be arbitrarily large.

   This problem statement substantiates the claim that in network I/O
   processing, high overhead results from data movement operations,
   specifically copying; and that copy avoidance significantly
   decreases the processing overhead.  It describes when and why the
   high processing overheads occur, explains why the overhead is
   problematic, and points out which applications are most affected.

   In addition, this document introduces an architectural approach to
   solving the problem, which is developed in detail in [BT02].  It
   also discusses how the proposed technology may introduce security
   concerns and how they should be addressed.

2.  The high cost of data movement operations in network I/O

   A wealth of data from research and industry shows that copying is
   responsible for substantial amounts of processing overhead.  It
   further shows that even in carefully implemented systems,
   eliminating copies significantly reduces the overhead, as
   referenced below.

   Clark et al. [CJRS89] showed in 1989 that TCP [Po81] processing
   overhead is attributable both to operating system costs, such as
   interrupts, context switches, process management, buffer
   management, and timer management, and to the costs associated with
   processing individual bytes, specifically computing the checksum
   and moving data in memory.  They found that moving data in memory
   is the more important of the two costs, and their experiments
   showed that memory bandwidth is the greatest source of limitation.
   In the data presented [CJRS89], 64% of the measured microsecond
   overhead was attributable to data touching operations, and 48% was
   accounted for by copying.  The system measured was Berkeley TCP on
   a Sun-3/60, using 1460 Byte Ethernet packets.

   In a well-implemented system, copying can occur between the network
   interface and the kernel, and between the kernel and application
   buffers - two copies, each of which costs two memory bus crossings
   (one read and one write) - on both the read and write paths.
   Although in certain circumstances it is possible to do better,
   usually two copies are required on receive.

   Subsequent work has consistently shown the same phenomenon as the
   earlier Clark study.  A number of studies report that data-touching
   operations, checksumming and data movement, dominate the processing
   costs for messages longer than 128 Bytes [BS96, CGY01, Ch96,
   CJRS89, DAPP93, KP96].  For smaller sized messages, per-packet
   overheads dominate [KP96, CGY01].
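
   The bus crossing arithmetic above can be made concrete with a small
   sketch.  The following Python fragment is purely illustrative (the
   crossing counts follow the receive path just described, counting
   the adapter's DMA deposit of the data into memory as one crossing);
   it tallies crossings per received byte and the share of the bus
   load that copy avoidance would remove:

      # Memory bus crossings per received byte for different receive
      # paths (illustrative model; counts follow the text above).
      crossings = {
          "2-copy (NIC -> kernel -> application)": 1 + 2 + 2,
          "1-copy ('very best' receive path)":     1 + 2,
          "0-copy (copy avoidance)":               1,  # NIC DMA only
      }

      zero_copy = crossings["0-copy (copy avoidance)"]
      for path, n in crossings.items():
          removed = 1 - zero_copy / n
          print(f"{path}: {n} crossings/byte; "
                f"copy avoidance removes {removed:.0%} of the bus load")

   Under these assumptions, copy avoidance removes two-thirds (67%) of
   the bus load relative to a 1-copy path and 80% relative to a 2-copy
   path - the same figures cited in Section 4 below.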

   The percentage of overhead due to data-touching operations
   increases with packet size, since time spent on per-byte operations
   scales linearly with message size [KP96].  For example, Chu [Ch96]
   reported substantial per-byte latency costs, as a percentage of
   total networking software costs, for an MTU size packet on a
   SPARCstation/20 running memory-to-memory TCP tests over networks
   with 3 different MTU sizes.  The percentages of total software
   costs attributable to per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

   Although many studies report results for data-touching operations,
   including checksumming and data movement together, much work has
   focused just on copying [BS96, Br99, Ch96, TK95].  For example,
   [KP96] reports results that separate the processing times for
   checksumming from those for data movement operations.  For the 1500
   Byte Ethernet size, 20% of the total processing overhead time was
   attributable to copying.  The study used two DECstation 5000/200
   machines connected by an FDDI network.  (In this study, checksumming
   accounted for 30% of the processing time.)

2.1.  Copy avoidance reduces processing overhead

   A number of studies show that eliminating copies substantially
   reduces overhead.  For example, results from copy-avoidance in the
   IO-Lite system [PDZ99], which aimed at improving web server
   performance, show a throughput increase of 43% over an optimized
   web server, and a 137% improvement over an Apache server.  The
   system was implemented in a 4.4BSD-derived UNIX kernel, and the
   experiments used a server system based on a 333 MHz Pentium II PC
   connected to a switched 100 Mbits/s Fast Ethernet.

   There are many other examples where the elimination of copying,
   using a variety of different approaches, showed significant
   improvement in system performance [CFF+94, DP93, EBBV95, KSZ95,
   TK95, Wa97].  We discuss the results of one of these studies in
   detail in order to clarify the significant degree of improvement
   produced by copy avoidance [Ch02].

   Recent work by Chase et al. [CGY01], measuring CPU utilization,
   shows that avoiding copies reduces CPU time spent on data access
   from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU, using an
   AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an
   absolute improvement of 9% due to copy avoidance.

   The total CPU utilization was 35%, with data access accounting for
   24%.  Thus the relative importance of reducing copies is 26%.  At
   370 Mbits/s, the system is not very heavily loaded.  The relative
   improvement in achievable bandwidth is 34%.  This is the
   improvement we would see if copy avoidance were added when the
   machine was saturated by network I/O.

   Note that the improvement from an optimization becomes more
   important if the overhead it targets is a larger share of the total
   cost.  This is what happens if other sources of overhead, such as
   checksumming, are eliminated.  In [CGY01], after removing checksum
   overhead, copy avoidance reduces CPU utilization from 26% to 10%.
   This is a 16% absolute reduction, a 61% relative reduction, and a
   160% relative improvement in achievable bandwidth.

   In fact, today's network interface hardware commonly offloads the
   checksum, which removes that other source of per-byte overhead, and
   coalesces interrupts to reduce per-packet costs.
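
   The achievable-bandwidth figures quoted above follow from the
   utilization numbers by simple proportionality.  A minimal sketch of
   the arithmetic, assuming only that bandwidth at CPU saturation
   scales as offered load divided by CPU utilization:

      # Deriving the relative bandwidth improvements in [CGY01] from
      # the reported CPU utilizations (arithmetic sketch, not a
      # measurement).
      load_mbps = 370.0

      def saturated_bw(cpu_util):
          # Bandwidth achievable if the machine were driven to 100% CPU.
          return load_mbps / cpu_util

      base = saturated_bw(0.35)            # 35% total, incl. 24% data access
      no_copy = saturated_bw(0.35 - 0.09)  # copy avoidance saves 9% absolute
      print(f"{no_copy / base - 1:.1%}")   # ~34.6%, quoted as 34%

      # With checksumming also removed, utilization falls from 26% to 10%:
      print(f"{0.26 / 0.10 - 1:.0%}")      # 160%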

   With checksumming thus offloaded, copying today accounts for a
   relatively larger part of CPU utilization than previously, and
   relatively more benefit is therefore to be gained by reducing it.
   (Of course this argument would be specious if the amount of
   overhead were insignificant, but it has been shown to be
   substantial.)

3.  Memory bandwidth is the root cause of the problem

   Data movement operations are expensive because memory bandwidth is
   scarce relative to network bandwidth and CPU bandwidth [PAC+97].
   This trend existed in the past and is expected to continue into the
   future [HP97, STREAM], especially in large multiprocessor systems.

   With each copy crossing the bus twice, network processing overhead
   is high whenever network bandwidth is large in comparison to CPU
   and memory bandwidths.  Generally, with today's end-systems, the
   effects are observable at network speeds over 1 Gbits/s.

   A common question is whether an increase in CPU processing power
   alleviates the problem of high processing costs of network I/O.
   The answer is no; it is memory bandwidth that is the issue.  Faster
   CPUs do not help if the CPU spends most of its time waiting for
   memory [CGY01].

   The widening gap between microprocessor performance and memory
   performance has long been a widely recognized and well-understood
   problem [PAC+97].  Hennessy [HP97] shows microprocessor performance
   growing at 60% per year from 1980-1998, while DRAM access time
   improved at only 10% per year, giving rise to an increasing
   "processor-memory performance gap".

   Another source of relevant data is the STREAM Benchmark Reference
   Information website, which provides information on the STREAM
   benchmark [STREAM].  The benchmark is a simple synthetic program
   that measures sustainable memory bandwidth (in MBytes/s) and the
   corresponding computation rate for simple vector kernels (measured
   in MFLOPS).  The website tracks sustainable memory bandwidth for
   hundreds of machines and all major vendors.

   Over all the systems measured, processing performance increased at
   an average of 50% per year from 1985-2001, while sustainable memory
   bandwidth increased at an average of 35% per year from 1975-2001.
   A similar 15% per year lead of processing bandwidth over memory
   bandwidth shows up in another statistic, machine balance [Mc95], a
   measure of the relative rate of CPU to memory bandwidth:
   (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

   Network bandwidth has been increasing about 10-fold roughly every 8
   years, a growth rate of about 33% per year.

   A typical example illustrates how unfavorably memory bandwidth
   compares with link speed.  Consider a modern uniprocessor PC, for
   example a 1.2 GHz Athlon in 2001, whose sustainable memory
   bandwidth is reported by the STREAM benchmark.  Such a machine
   moves the data 3 times in doing a receive operation - once for the
   network interface to deposit the data in memory, and twice for the
   CPU to copy the data.  With 1 GBytes/s of memory bandwidth, where
   each read or each write counts against that budget, the machine
   could handle approximately 2.67 Gbits/s of network bandwidth, one
   third of the copy bandwidth.
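
   Spelled out, the arithmetic of this example is (an illustrative
   sketch; the figures are those quoted in the text):

      # Receive-path capacity of the example machine: 1 GBytes/s
      # memory bandwidth, 3 bus crossings per received byte.
      mem_bw_gbytes_s = 1.0   # each read or write counts against this
      crossings = 3           # NIC deposit + one CPU copy (read + write)

      max_rx_gbits_s = mem_bw_gbytes_s / crossings * 8
      print(f"{max_rx_gbits_s:.2f} Gbits/s")   # ~2.67 Gbits/s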

   But this assumes 100% utilization, which is not achievable, and
   more importantly the machine would be totally consumed!  (A rule of
   thumb for databases is that at most 20% of the machine should be
   required to service I/O, leaving 80% for the database application.
   And the less, the better.)

   In 2001, 1 Gbits/s links were common.  An application server might
   typically have two 1 Gbits/s connections - one back-end connection
   to a storage server and one front-end connection, say for serving
   HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In
   our typical example, the machine could handle 2.7 Gbits/s at its
   theoretical maximum while doing nothing else.  This means that the
   machine basically could not keep up with the communication demands
   in 2001; given the relative growth trends, the situation only gets
   worse.

4.  High copy overhead is problematic for many key Internet
    applications

   If a significant portion of the resources on an application machine
   is consumed in network I/O rather than in application processing,
   it becomes difficult for the application to scale, i.e., to handle
   more clients and to offer more services.

   Several years ago the applications most affected were streaming
   multimedia, parallel file systems, and supercomputing on clusters
   [BS96].  Today, in addition, the applications that suffer from
   copying overhead are more central to Internet computing - they
   store, manage, and distribute the information of the Internet and
   of the enterprise.  They include database applications doing
   transaction processing, e-commerce, web serving, decision support,
   content distribution, video distribution, and backups.  Clusters
   are typically used for this category of application, since they
   have advantages in availability and scalability.

   Today these applications, which provide and manage Internet and
   corporate information, are typically run in data centers that are
   organized into three logical tiers.  One tier is typically a set of
   web servers connecting to the WAN.  The second tier is a set of
   application servers that run the specific applications, usually on
   more powerful machines, and the third tier is back-end databases.
   Physically, the first two tiers - web server and application server
   - are usually combined [Pi01].  For example, an e-commerce server
   communicates with a database server and with a customer site, a
   content distribution server connects to a server farm, and an OLTP
   server connects to a database and a customer site.

   When network I/O uses too much memory bandwidth, performance on
   network paths between tiers can suffer.  (There might also be
   performance issues on SAN paths used either by the database tier or
   the application tier.)  The high overhead from network-related
   memory copies diverts system resources from other application
   processing.  It can also create bottlenecks that limit total system
   performance.

   There is a large and growing number of these application servers
   distributed throughout the Internet.  In 1999 approximately 3.4
   million server units were shipped; in 2000, 3.9 million units; and
   the estimated annual growth rate for 2000-2004 was 17 percent
   [Ne00, Pa01].

   There is high motivation to maximize the processing capacity of
   each CPU, because scaling by adding CPUs, one way or another, has
   drawbacks.  For example, adding CPUs to a multiprocessor will not
   necessarily help, because a multiprocessor improves performance
   only when the memory bus has additional bandwidth to spare.
   Clustering can add additional complexity to handling the
   applications.

   In order to scale a cluster or multiprocessor system, one must
   proportionately scale the interconnect bandwidth.  Interconnect
   bandwidth governs the performance of communication-intensive
   parallel applications; if this bandwidth (often expressed in terms
   of "bisection bandwidth") is too low, adding additional processors
   cannot improve system throughput.  Interconnect latency can also
   limit the performance of applications that frequently share data
   between processors.

   So, excessive overheads on network paths in a "scalable" system can
   both require the use of more processors than optimal and reduce the
   marginal utility of those additional processors.

   Copy avoidance scales a machine upward by removing at least two-
   thirds of the bus bandwidth load relative to the "very best" 1-copy
   (on receive) implementations, and at least 80% of the bandwidth
   overhead relative to 2-copy implementations.

   An example showing poor performance with copies and improved
   scaling with copy avoidance is illustrative.  The IO-Lite work
   [PDZ99] shows higher server throughput, servicing more clients,
   using a zero-copy system.  In an experiment designed to mimic real
   world web conditions by simulating the effect of TCP WAN
   connections on the server, the performance of 3 servers was
   compared.  One server was Apache, another was an optimized server
   called Flash, and the third was Flash running IO-Lite with zero
   copy, called Flash-Lite.  The measurement was of throughput, in
   requests per second, as a function of the number of slow background
   clients being served.  As the table shows, Flash-Lite has better
   throughput, especially as the number of clients increases.

      #Clients    Apache    Flash    Flash-Lite
                     (throughput, requests/s)
          0          520      610       890
         16          390      490       890
         32          360      490       850
         64          360      490       890
        128          310      450       880
        256          310      440       820

   Traditional web servers (which mostly send data and can keep most
   of their content in the file cache) are not the worst case for copy
   overhead.  Web proxies (which often receive as much data as they
   send) and complex web servers based on SANs or multi-tier systems
   will suffer more from copy overheads than in the example above.

5.  Copy Avoidance Techniques

   There has been extensive research investigation of, and industry
   experience with, two main alternative approaches to eliminating
   data movement overhead, often along with improvements to other
   Operating System processing costs.  In one approach, hardware
   and/or software changes within a single host reduce processing
   costs.  In the other approach, memory-to-memory networking
   [MAF+02], the hosts use a networking protocol to exchange
   information that allows them to reduce processing costs.

   The single host approaches range from new hardware and software
   architectures [KSZ95, Wa97, DWB+93] to new or modified software
   systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on
   using a networking protocol to exchange information, the network
   adapter, under control of the application, places data directly
   into and out of application buffers, reducing the need for data
   movement.  Commonly this approach is called RDMA, Remote Direct
   Memory Access.

   As discussed below, research and industry experience have shown
   that copy avoidance techniques confined to the receiver processing
   path alone are problematic.
   The special-purpose host adapter systems developed in research had
   good performance and can be seen as precursors of the commercial
   RDMA-based NICs [KSZ95, DWB+93].  In software, many implementations
   have successfully achieved zero-copy transmit, but few have
   accomplished zero-copy receive.  Those that have done so impose
   strict alignment and no-touch requirements on the application,
   greatly reducing the portability and usefulness of the
   implementation.

   In contrast, experience with memory-to-memory systems that permit
   RDMA has been satisfactory - performance has been good, and there
   have not been system or networking difficulties.  RDMA is a single
   solution: once implemented, it can be used with any OS and machine
   architecture, and it does not need to be revised when either of
   these changes.

   In early work, one goal of the software approaches was to show that
   TCP could go faster with appropriate OS support [CJRS89, CFF+94].
   While this goal was achieved, further investigation and experience
   showed that, though it is possible to craft software solutions, the
   specific system optimizations are complex, fragile, interdependent
   with other system parameters in intricate ways, and often of only
   marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].
   The network I/O system interacts with other aspects of the
   Operating System, such as the machine architecture, file I/O, and
   disk I/O [Br99, Ch96, DP93].

   For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
   page remapping, shows that the results are highly interdependent
   with other systems, such as the file system, and that the
   particular optimizations are specific to particular architectures,
   meaning that the optimizations must be re-crafted for each
   architectural variation.

   A number of research projects and industry products have been based
   on the memory-to-memory approach to copy avoidance.  These include
   U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], InfiniBand [IB],
   and Winsock Direct [Pi01].  Several memory-to-memory systems have
   been widely used and have generally been found to be robust, to
   have good performance, and to be relatively simple to implement.
   These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and
   Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-
   to-memory architectures have been used widely in scientific
   applications and in data centers for block storage, file system
   access, and transaction processing.

   By exporting direct memory access "across the wire", applications
   may direct the network stack to manage all data directly from
   application buffers.  A large and growing class of applications has
   already emerged that takes advantage of such capabilities,
   including all the major databases, as well as file systems such as
   DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1.  A Conceptual Framework: DDP and RDMA

   An RDMA solution can be usefully viewed as comprising two distinct
   components: "direct data placement (DDP)" and "remote direct memory
   access (RDMA) semantics".  They are distinct in purpose and also in
   practice - they may be implemented as separate protocols.

   The more fundamental of the two is the direct data placement
   facility.
   This is the means by which memory is exposed to the remote peer in
   an appropriate fashion, and the means by which the peer may access
   it, for instance for reading and writing.

   The RDMA control functions are semantically layered atop direct
   data placement.  Included are operations that provide "control"
   features, such as connection establishment and termination, the
   ordering of operations, and the signaling of their completions.  A
   "send" facility is also provided.

   While the functions (and potentially the protocols) are distinct,
   historically both aspects taken together have been referred to as
   "RDMA".  The facilities of direct data placement are useful in and
   of themselves, and may be employed by other upper layer protocols
   to facilitate data transfer.  Therefore, it is often useful to
   refer to DDP as the data placement functionality and RDMA as the
   control aspect.

   [BT02] develops an architecture for DDP and RDMA, and is a
   companion draft to this problem statement.
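
   To make the division of labor concrete, the following hypothetical
   sketch separates a DDP-like placement layer, which exposes tagged
   buffers and places incoming data directly into them, from RDMA-like
   control semantics layered above it.  All names here are invented
   for illustration; the actual protocols are specified in [BT02].

      # Hypothetical sketch of the DDP/RDMA split; all names invented.

      class DDPLayer:
          """Direct data placement: expose memory, place peer data in it."""
          def __init__(self):
              self.regions = {}                # steering tag -> bytearray

          def expose(self, stag, length):
              self.regions[stag] = bytearray(length)

          def place(self, stag, offset, data):
              # Data lands directly in the exposed buffer - no
              # intermediate copy into a kernel or reassembly buffer.
              buf = self.regions[stag]
              buf[offset:offset + len(data)] = data

      class RDMALayer:
          """Control semantics layered on DDP: ordering, completions, send."""
          def __init__(self, ddp):
              self.ddp = ddp
              self.completions = []            # signalled in order

          def rdma_write(self, stag, offset, data):
              self.ddp.place(stag, offset, data)
              self.completions.append(("write", stag))

          def send(self, data):
              # Untagged "send" facility: delivered to a receive buffer
              # chosen by the data sink, not named by the sender.
              self.completions.append(("send", bytes(data)))

      ddp = DDPLayer()
      ddp.expose(stag=7, length=4096)
      rdma = RDMALayer(ddp)
      rdma.rdma_write(stag=7, offset=0, data=b"application payload")

   The point of the split is visible even in this toy: the placement
   layer knows only how to put bytes where the application said they
   may go, while ordering and completion semantics live entirely in
   the layer above it.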

6.  Security Considerations

   Solutions to the problem of reducing copying overhead in high
   bandwidth transfers via one or more protocols may introduce new
   security concerns.  Any proposed solution must be analyzed for
   security threats, and any such threats must be addressed.  [BSW02]
   raises potential security weaknesses due to resource issues that
   might lead to denial-of-service attacks, overwrites and other
   concurrent operations, the ordering of completions as required by
   the RDMA protocol, and the granularity of transfer.  Each of these
   concerns, plus any other identified threats, needs to be examined
   and described, and an adequate solution found.

   Layered atop Internet transport protocols, the RDMA protocols will
   gain leverage from, and must permit integration with, Internet
   security standards such as IPsec and TLS [IPSEC, TLS].  A thorough
   analysis of the degree to which these protocols mitigate the
   identified threats is required.

   Security for an RDMA design requires more than just securing the
   communication channel.  While it is necessary to be able to
   guarantee channel properties such as privacy, integrity, and
   authentication, these properties cannot defend against all attacks
   from properly authenticated peers, which might be malicious,
   compromised, or buggy.  For example, an RDMA peer should not be
   able to read or write memory regions without prior consent.

   Further, it must not be possible to evade consistency checks at the
   recipient.  For example, the RDMA design should not allow a peer to
   update a region after the completion of an authorized update.

   The RDMA protocols must ensure that regions addressable by RDMA
   peers are kept under strict application control.  Remote access to
   local memory by a network peer introduces a number of potential
   security concerns.  This becomes particularly important in the
   Internet context, where such access can be exported globally.

   The RDMA protocols carry, in part, what is essentially user
   information, explicitly including addressing information and
   operation type (read or write), and implicitly including protection
   and attributes.  As such, the protocols require checking of these
   higher level aspects in addition to the basic formation of
   messages.  The semantics associated with each class of error must
   be clearly defined, and the expected action to be taken on a
   mismatch must be specified.  In some cases, this will result in a
   catastrophic error on the RDMA association; in others, a local or
   remote error may be signalled.  Certain of these errors may require
   consideration of abstract local semantics, which must be carefully
   specified so as to provide useful behavior while not constraining
   the implementation.
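
   As an illustration of the kind of higher level checking called for
   above, a data sink might validate every tagged operation against
   application-registered state before placing any data.  The sketch
   below is not a specification; the field names, check order, and
   error classes are invented for illustration only.

      # Illustrative checks a data sink might apply to an inbound RDMA
      # operation (a sketch; names and error classes are invented).

      class Region:
          def __init__(self, length, writable, valid=True):
              self.length = length
              self.writable = writable
              self.valid = valid      # application can withdraw consent

      def validate(regions, stag, offset, length, op):
          region = regions.get(stag)
          if region is None or not region.valid:
              return "catastrophic"   # no consent: tear down association
          if op == "write" and not region.writable:
              return "remote_error"   # operation type not permitted
          if offset + length > region.length:
              return "remote_error"   # out-of-bounds access attempt
          return "ok"

      regions = {7: Region(length=4096, writable=True)}
      assert validate(regions, 7, 0, 100, "write") == "ok"
      regions[7].valid = False        # e.g., after an authorized update
      assert validate(regions, 7, 0, 100, "write") == "catastrophic"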
Peterson, "Fbufs: a high-bandwidth cross- 646 domain transfer facility", Proceedings of the 14th ACM 647 Symposium of Operating Systems Principles, December 1993 649 [DWB+93] 650 C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. 651 Lumley, "Afterburner: architectural support for high- 652 performance protocols", Technical Report, HP Laboratories 653 Bristol, HPL-93-46, July 1993 655 [EBBV95] 656 T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A 657 user-level network interface for parallel and distributed 658 computing", Proc. of the 15th ACM Symposium on Operating 659 Systems Principles, Copper Mountain, Colorado, December 3-6, 660 1995 662 [FGM+99] 663 R. Fielding, J. Gettys, J. Mogul, F. Frystyk, L. Masinter, P. 664 Leach, T. Berners-Lee, "Hypertext Transfer Protocol - 665 HTTP/1.1", RFC 2616, June 1999 667 [FIBRE] 668 Fibre Channel Standard 669 http://www.fibrechannel.com/technology/index.master.html 671 [HP97] 672 J. L. Hennessy, D. A. Patterson, Computer Organization and 673 Design, 2nd Edition, San Francisco: Morgan Kaufmann 674 Publishers, 1997 676 [IB] InfiniBand Architecture Specification, Volumes 1 and 2, 677 Release 1.0.a. http://www.infinibandta.org 679 [KP96] 680 J. Kay, J. Pasquale, "Profiling and reducing processing 681 overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol 682 4, No. 6, pp.817-828, December 1996 684 [KSZ95] 685 K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for 686 outboard buffering and checksumming", SIGCOMM'95 688 [Ma02] 689 K. Magoutis, "Design and Implementation of a Direct Access 690 File System (DAFS) Kernel Server for FreeBSD", in Proceedings 691 of USENIX BSDCon 2002 Conference, San Francisco, CA, February 692 11-14, 2002. 694 [MAF+02] 695 K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. 696 Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, 697 "Structure and Performance of the Direct Access File System 698 (DAFS)", accepted for publication at the 2002 USENIX Annual 699 Technical Conference, Monterey, CA, June 9-14, 2002. 701 [Mc95] 702 J. D. McCalpin, "A Survey of memory bandwidth and machine 703 balance in current high performance computers", IEEE TCCA 704 Newsletter, December 1995 706 [MYR] 707 Myrinet, http://www.myricom.com 709 [Ne00] 710 A. Newman, "IDC report paints conflicted picture of server 711 market circa 2004", ServerWatch, July 24, 2000 712 http://serverwatch.internet.com/news/2000_07_24_a.html 714 [Pa01] 715 M. Pastore, "Server shipments for 2000 surpass those in 1999", 716 ServerWatch, February 7, 2001 717 http://serverwatch.internet.com/news/2001_02_07_a.html 719 [PAC+97] 720 D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, 721 C. Kozyrakis, R. Thomas, K. Yelick , "A case for intelligient 722 RAM: IRAM", IEEE Micro, April 1997 724 [PDZ99] 725 V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O 726 buffering and caching system", Proc. of the 3rd Symposium on 727 Operating Systems Design and Implementation, New Orleans, LA, 728 February 1999 730 [Pi01] 731 J. Pinkerton, "Winsock Direct: the value of System Area 732 Networks". http://www.microsoft.com/windows2000/techinfo/ 733 howitworks/communications/winsock.asp 735 [Po81] 736 J. 
Postel, "Transmission Control Protocol - DARPA Internet 737 Program Protocol Specification", RFC 793, September 1981 739 [QUAD] 740 Quadrics Ltd., http://www.quadrics.com 742 [SDP] 743 Sockets Direct Protocol v1.0 745 [SRVNET] 746 Compaq Servernet, 747 http://nonstop.compaq.com/view.asp?PAGE=ServerNet 749 [STREAM] 750 The STREAM Benchmark Reference Information, 751 http://www.cs.virginia.edu/stream/ 753 [TK95] 754 M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O 755 framework for UNIX", Technical Report, SMLI TR-95-39, May 1995 757 [VI] Virtual Interface Architecture Specification Version 1.0. 758 http://www.vidf.org/info/04standards.html 760 [Wa97] 761 J. R. Walsh, "DART: Fast application-level networking via 762 data-copy avoidance", IEEE Network, July/August 1997, pp. 763 28-38 765 Authors' Addresses 767 Stephen Bailey 768 Sandburst Corporation 769 600 Federal Street 770 Andover, MA 01810 USA 772 Phone: +1 978 689 1614 773 Email: steph@sandburst.com 775 Jeffrey C. Mogul 776 Western Research Laboratory 777 Hewlett-Packard Company 778 1501 Page Mill Road, MS 1251 779 Palo Alto, CA 94304 USA 781 Phone: +1 650 857 2206 (email preferred) 782 Email: JeffMogul@acm.org 784 Allyn Romanow 785 Cisco Systems, Inc. 786 170 W. Tasman Drive 787 San Jose, CA 95134 USA 789 Phone: +1 408 525 8836 790 Email: allyn@cisco.com 792 Tom Talpey 793 Network Appliance 794 375 Totten Pond Road 795 Waltham, MA 02451 USA 797 Phone: +1 781 768 5329 798 Email: thomas.talpey@netapp.com 800 Full Copyright Statement 802 Copyright (C) The Internet Society (2003). All Rights Reserved. 804 This document and translations of it may be copied and furnished to 805 others, and derivative works that comment on or otherwise explain 806 it or assist in its implementation may be prepared, copied, 807 published and distributed, in whole or in part, without restriction 808 of any kind, provided that the above copyright notice and this 809 paragraph are included on all such copies and derivative works. 810 However, this document itself may not be modified in any way, such 811 as by removing the copyright notice or references to the Internet 812 Society or other Internet organizations, except as needed for the 813 purpose of developing Internet standards in which case the 814 procedures for copyrights defined in the Internet Standards process 815 must be followed, or as required to translate it into languages 816 other than English. 818 The limited permissions granted above are perpetual and will not be 819 revoked by the Internet Society or its successors or assigns. 821 This document and the information contained herein is provided on 822 an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 823 ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 824 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 825 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 826 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.