Internet-Draft                                     Allyn Romanow (Cisco)
                                                          Jeff Mogul (HP)
Expires: December 2003                               Tom Talpey (NetApp)
                                                Stephen Bailey (Sandburst)

                      RDMA over IP Problem Statement
                   draft-ietf-rddp-problem-statement-02

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

This draft addresses an IP-based solution to the problem of high end-host processing costs caused by network I/O copying at high speeds.  The problem stems from the high cost of memory bandwidth, and it can be substantially mitigated by "copy avoidance."  The high overhead has limited the use of TCP/IP in interconnection networks, especially where the hosted application requires high bandwidth, low latency, and/or low overhead of end-system data movement.

Table Of Contents

   1.   Introduction
   2.   The high cost of data movement operations in network I/O
   2.1. Copy avoidance improves processing overhead
   3.   Memory bandwidth is the root cause of the problem
   4.   High copy overhead is problematic for many key Internet applications
   5.   Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
   6.   Security Considerations
   7.   Acknowledgements
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

This draft considers the problem of high host processing overhead associated with network I/O under high-speed conditions.  This problem is often referred to as the "I/O bottleneck" [CT90].  More specifically, the source of high overhead of interest here is data movement operations, i.e., copying.  This issue is not to be confused with TCP offload, which is not addressed here.  High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory.  With today's computer systems, 1 Gbits/s and above is considered high speed.
High costs associated with copying are an issue primarily for large-scale systems.  Although smaller systems such as rack-mounted PCs and small workstations would also benefit from a reduction in copying overhead, they will do so mainly over the next few years, as the amount of bandwidth they handle grows.  Today it is large machines with high-bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead.  Examples of such machines include all varieties of servers: database servers, storage servers, and application servers for transaction processing, e-commerce, web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.  Nonetheless, the cost of copying overhead for a particular load is the same whether it comes from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over roughly the last 14 years, and we draw freely on these results.  Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates: 100 Mbits/s FDDI and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet.  In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred.  Now, however, this is no longer the case.  While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet that is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGY01, Ma02, MAF+02].

Because of the high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high-speed transfer.  Instead, special-purpose network fabrics, using a technology generally known as remote direct memory access (RDMA), have been developed and are widely used.  RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers.  Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and InfiniBand [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks.  These link-level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying, and that copy avoidance significantly decreases this processing overhead.  It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT02].  It also discusses how the proposed technology may introduce security concerns and how they should be addressed.
2. The high cost of data movement operations in network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead.  It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

Clark et al. [CJRS89] showed in 1989 that TCP [Po81] processing overhead is attributable both to operating system costs, such as interrupts, context switches, process management, buffer management, and timer management, and to costs associated with processing individual bytes, specifically computing the checksum and moving data in memory.  They found that moving data in memory is the more important of these costs, and their experiments show that memory bandwidth is the greatest source of limitation.  In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data-touching operations, and 48% was accounted for by copying.  The system measured was Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which entails two memory bus crossings - for read and write.  Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study.  A number of studies report that data-touching operations - checksumming and data movement - dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96].  For smaller messages, per-packet overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96].  For example, Chu [Ch96] reported substantial per-byte latency costs, as a percentage of total networking software costs, for MTU-size packets on a SPARCstation/20 running memory-to-memory TCP tests over networks with three different MTU sizes.  The percentage of total software costs attributable to per-byte operations was:

   1500 Byte Ethernet   18-25%
   4352 Byte FDDI       35-50%
   9180 Byte ATM        55-65%

Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95].  For example, [KP96] reports results that separate the processing times for checksum from those for data movement operations.  For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying.  The study used two DECstations 5000/200 connected by an FDDI network.  (In this study, checksum accounts for 30% of the processing time.)
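To make the conventional receive path concrete, the following sketch caricatures a non-copy-avoiding receive as described above.  It is an illustrative sketch only: the NIC deposit routine (nic_dma_deposit) and the statically allocated buffers are hypothetical stand-ins, not taken from any particular operating system.

   /*
    * Schematic sketch of a conventional receive path, showing where
    * the memory bus crossings occur.  Illustrative only; names such
    * as nic_dma_deposit() and kernel_buf are hypothetical.
    */
   #include <stdio.h>
   #include <string.h>

   #define PKT_MAX 9180                 /* largest MTU discussed above */

   static unsigned char kernel_buf[PKT_MAX];  /* kernel network buffer */
   static unsigned char app_buf[PKT_MAX];     /* application buffer    */

   /* Stand-in for the NIC DMA engine depositing a packet in kernel
    * memory.  This is crossing 1: one memory write per byte. */
   static size_t nic_dma_deposit(unsigned char *dst, size_t max)
   {
       size_t len = (max < 1460) ? max : 1460;   /* one Ethernet frame */
       memset(dst, 0xab, len);
       return len;
   }

   /* Crossings 2 and 3: the kernel copies the payload into the
    * application's buffer (one read plus one write per byte).  A
    * "2-copy" stack performs such a copy twice, for five crossings
    * per received byte; direct placement into app_buf would leave
    * only the NIC's single DMA write. */
   static size_t receive_one_copy(void)
   {
       size_t len = nic_dma_deposit(kernel_buf, sizeof(kernel_buf));
       memcpy(app_buf, kernel_buf, len);         /* the copy to avoid */
       return len;
   }

   int main(void)
   {
       printf("received %zu bytes with one copy\n", receive_one_copy());
       return 0;
   }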
2.1. Copy avoidance improves processing overhead

A number of studies show that eliminating copies substantially reduces overhead.  For example, results from copy avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and a 137% improvement over an Apache server.  The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333 MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where the elimination of copying, using a variety of different approaches, showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95].  This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%.  Thus the relative importance of reducing copies is 26%.  At 370 Mbits/s, the system is not very heavily loaded.  The relative improvement in achievable bandwidth is 34%.  This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that the improvement from an optimization becomes more important if the overhead it targets is a larger share of the total cost.  This is what happens if other sources of overhead, such as checksumming, are eliminated.  In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%.  This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.

In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead.  These interfaces also coalesce interrupts to reduce per-packet costs.  Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and relatively more benefit is therefore to be gained by reducing them.  (Of course, this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial.)
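The relative-improvement figures quoted above from [CGY01] follow from simple arithmetic.  The short program below reproduces them, under the assumption (made here only for illustration) that achievable bandwidth at saturation is inversely proportional to CPU cost per unit of data transferred.

   /* Reproduces the arithmetic behind the [CGY01] figures quoted
    * above.  Illustrative only. */
   #include <stdio.h>

   static void improvement(double total, double data_access, double reduced)
   {
       double saved   = data_access - reduced;         /* absolute CPU saved   */
       double after   = total - saved;                 /* CPU after avoidance  */
       double rel_cpu = 100.0 * saved / total;         /* relative CPU saving  */
       double rel_bw  = 100.0 * (total / after - 1.0); /* bandwidth gain       */

       printf("CPU %4.1f%% -> %4.1f%%: %4.1f%% absolute, %4.1f%% relative, "
              "%5.1f%% more achievable bandwidth\n",
              total, after, saved, rel_cpu, rel_bw);
   }

   int main(void)
   {
       /* With checksumming: 35% total CPU, data access 24% -> 15%.
        * Yields 9% absolute, ~26% relative, ~34% bandwidth. */
       improvement(35.0, 24.0, 15.0);

       /* Checksum offloaded: 26% total CPU, the 16% spent copying
        * eliminated entirely (26% -> 10%).  Yields 16% absolute,
        * ~61% relative, ~160% bandwidth. */
       improvement(26.0, 16.0, 0.0);
       return 0;
   }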
3. Memory bandwidth is the root cause of the problem

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97].  This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With each copy crossing the bus twice, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths.  Generally, with today's end-systems, the effects are observable at network speeds above 1 Gbits/s.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O.  The answer is no; it is the memory bandwidth that is the issue.  Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97].  Hennessy [HP97] shows that microprocessor performance grew at 60% per year from 1980 to 1998, while DRAM access time improved at only 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM].  The benchmark is a simple synthetic program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels, measured in MFLOPS.  The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

The results show measured system performance statistics.  Processing performance from 1985 to 2001 increased at 50% per year on average, while sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average over all the systems measured.  A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth, (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that memory bandwidth compares unfavorably with link speed.  In a receive operation, a modern uniprocessor PC - for example, the 1.2 GHz Athlon in 2001, whose sustainable memory bandwidth is characterized by the STREAM benchmark - moves the data three times: once for the network interface to deposit the data in memory, and twice for the CPU to copy the data.  With 1 GBytes/s of memory bandwidth, counting each read or write separately, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third of the copy bandwidth.  But this assumes 100% utilization, which is not possible, and, more importantly, the machine would be totally consumed!  (A rule of thumb for databases is that no more than 20% of the machine should be required to service I/O, leaving 80% for the database application - and the less, the better.)

In 2001, 1 Gbits/s links were common.  An application server might typically have two 1 Gbits/s connections: one back-end connection to a storage server and one front-end connection, say for serving HTTP [FGM+99].  Thus the communications could use 2 Gbits/s.  In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else.  This means that the machine basically could not keep up with the communication demands in 2001; given the relative growth trends, the situation only gets worse.
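The arithmetic behind this example can be stated explicitly: if each received byte crosses the memory bus three times, the sustainable network rate is one third of the memory bandwidth, converted from bytes to bits.  The sketch below simply recomputes the figures quoted above; it is a worked illustration, not part of the original analysis.

   /* Maximum sustainable network rate given memory bandwidth and the
    * number of memory bus crossings per received byte.  The figures
    * are the 2001-era values quoted in the text. */
   #include <stdio.h>

   int main(void)
   {
       double mem_gbytes_per_s = 1.0;  /* each read or write counted once */
       int    crossings        = 3;    /* NIC DMA write + one copy (r+w)  */

       /* Each received byte consumes 'crossings' bytes of memory
        * bandwidth; multiply by 8 to convert GBytes/s to Gbits/s. */
       double max_gbits = 8.0 * mem_gbytes_per_s / crossings;

       printf("theoretical maximum: %.2f Gbits/s\n", max_gbits); /* 2.67 */
       printf("typical 2001 demand: 2 x 1 Gbits/s = 2 Gbits/s\n");
       return 0;
   }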
4. High copy overhead is problematic for many key Internet applications

If a significant portion of the resources on an application machine is consumed by network I/O rather than by application processing, it becomes difficult for the application to scale - to handle more clients, to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96].  Today, in addition, the applications that suffer from copying overhead are more central to Internet computing - they store, manage, and distribute the information of the Internet and the enterprise.  They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups.  Clusters are typically used for this category of application, since they have the advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers.  One tier is typically a set of web servers connecting to the WAN.  The second tier is a set of application servers that run the specific applications, usually on more powerful machines, and the third tier is a set of back-end databases.  Physically, the first two tiers - web server and application server - are usually combined [Pi01].  For example, an e-commerce server communicates with a database server and with a customer site, a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on the network paths between the tiers can suffer.  (There might also be performance issues on SAN paths used either by the database tier or the application tier.)  The high overhead from network-related memory copies diverts system resources from other application processing.  It can also create bottlenecks that limit total system performance.

There is a large and growing number of these application servers distributed throughout the Internet.  In 1999, approximately 3.4 million server units were shipped; in 2000, 3.9 million units; and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

There is high motivation to maximize the processing capacity of each CPU, because scaling by adding CPUs, in one way or another, has drawbacks.  For example, adding CPUs to a multiprocessor will not necessarily help, because a multiprocessor improves performance only when the memory bus has additional bandwidth to spare.  Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth.  Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this bandwidth (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput.  Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overhead on the network paths of a "scalable" system can both require the use of more processors than optimal and reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upward by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and at least 80% of the bandwidth overhead from 2-copy implementations.
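These fractions follow from counting memory bus crossings per received byte: a 1-copy receive costs three crossings (the NIC's DMA write plus a read and a write for the copy), a 2-copy receive costs five, and direct data placement leaves only the single DMA write.  The short program below is merely a worked restatement of that arithmetic.

   /* Bus-load reduction fractions quoted above, derived from the
    * crossing counts for each receive path.  Illustrative only. */
   #include <stdio.h>

   int main(void)
   {
       int one_copy = 3;   /* NIC DMA write + one copy  */
       int two_copy = 5;   /* NIC DMA write + two copies */
       int direct   = 1;   /* NIC DMA write only         */

       printf("savings vs 1-copy: %.0f%%\n",          /* two-thirds */
              100.0 * (one_copy - direct) / one_copy);
       printf("savings vs 2-copy: %.0f%%\n",          /* 80%        */
              100.0 * (two_copy - direct) / two_copy);
       return 0;
   }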
An example illustrates the poor performance with copies and the improved scaling with copy avoidance.  The IO-Lite work [PDZ99] shows that a zero-copy system yields higher server throughput while servicing more clients.  In an experiment designed to mimic real-world web conditions by simulating the effect of TCP WAN connections on the server, the performance of three servers was compared: Apache, an optimized server called Flash, and the Flash server running IO-Lite with zero copy, called Flash-Lite.  The measurement was of throughput, in requests per second, as a function of the number of slow background clients that could be served.  As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

   #Clients   Apache      Flash       Flash-Lite
              (reqs/s)    (reqs/s)    (reqs/s)
   --------   --------    --------    ----------
        0        520         610          890
       16        390         490          890
       32        360         490          850
       64        360         490          890
      128        310         450          880
      256        310         440          820

Traditional web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead.  Web proxies (which often receive as much data as they send) and complex web servers based on SANs or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There has been extensive research investigation of, and industry experience with, two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs.  In one approach, hardware and/or software changes within a single host reduce processing costs.  In the other approach, memory-to-memory networking [MAF+02], hosts exchange information that allows them to reduce processing costs.

The single-host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement.  This approach is commonly called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience have shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic.  The research special-purpose host adapter systems had good performance and can be seen as precursors of the commercial RDMA-based NICs [KSZ95, DWB+93].  In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive.  Those that have done so impose strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

In contrast, experience with memory-to-memory systems that permit RDMA has proven satisfactory - performance has been good, and there have not been system or networking difficulties.  RDMA is a single solution.  Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.
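To make the memory-to-memory model concrete, the sketch below shows the general shape of an RDMA-style exchange: the data sink registers an application buffer and advertises it, and the data source places data directly into it.  The API names (ddp_region, ddp_register, rdma_write) are hypothetical and belong to no real interface, and the "network" is simulated here by a local copy; real systems such as VI or InfiniBand differ considerably in detail.

   /* Purely illustrative sketch of memory-to-memory data placement.
    * All names are hypothetical; the single data movement below
    * stands in for the NIC's DMA write into the application buffer. */
   #include <stdio.h>
   #include <string.h>

   typedef struct {
       void   *base;      /* start of the exposed application buffer */
       size_t  length;    /* length of the exposed region            */
       int     writable;  /* peer write access granted?              */
   } ddp_region;

   /* Sink side: expose an application buffer for direct placement.
    * A real system would also pin and map the memory here. */
   static ddp_region ddp_register(void *buf, size_t len, int writable)
   {
       ddp_region r = { buf, len, writable };
       return r;
   }

   /* Source side: place data directly into the advertised region.
    * Note that there is no intermediate kernel buffer and no
    * receive-side copy. */
   static int rdma_write(ddp_region *r, size_t offset,
                         const void *src, size_t len)
   {
       if (!r->writable || offset + len > r->length)
           return -1;                        /* access/bounds check */
       memcpy((char *)r->base + offset, src, len);
       return 0;
   }

   int main(void)
   {
       char app_buf[64] = { 0 };

       /* Sink registers and (conceptually) advertises the region. */
       ddp_region region = ddp_register(app_buf, sizeof(app_buf), 1);

       /* Source writes payload directly into the sink's buffer. */
       const char payload[] = "directly placed payload";
       if (rdma_write(&region, 0, payload, sizeof(payload)) == 0)
           printf("sink sees: %s\n", app_buf);
       return 0;
   }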
In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94].  While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, the specific system optimizations are complex, fragile, interdependent with other system parameters in intricate ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].  The network I/O system interacts with other aspects of the Operating System, such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other subsystems, such as the file system, and that the particular optimizations are specific to particular architectures, meaning that the optimizations must be re-crafted for each variation in architecture.

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance.  These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], InfiniBand [IB], and Winsock Direct [Pi01].  Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement.  These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET].  Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers.  A large and growing class of applications that take advantage of such capabilities has already emerged, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as comprising two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics".  They are distinct in purpose and also in practice - they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility.  This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance for reading and writing.

The RDMA control functions are semantically layered atop direct data placement.  Included are operations that provide "control" features, such as connection establishment and termination, and the ordering of operations and signaling of their completions.  A "send" facility is provided.

While the functions (and potentially the protocols) are distinct, historically both aspects taken together have been referred to as "RDMA".  The facilities of direct data placement are useful in and of themselves, and may be employed by other upper-layer protocols to facilitate data transfer.  Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT02] develops an architecture for DDP and RDMA, and is a companion draft to this problem statement.
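One way to picture the division of labor described in this section is as two small sets of operations, with the RDMA control functions layered above the DDP placement functions.  The enumeration below is a conceptual sketch only, not a protocol or API definition; the names are invented for illustration.

   /* Conceptual split between DDP and RDMA, as described above.
    * Illustrative names only. */

   /* Direct data placement: exposing memory and accessing it. */
   enum ddp_operation {
       DDP_EXPOSE_MEMORY,   /* expose a local buffer to the remote peer  */
       DDP_PEER_WRITE,      /* peer places data into the exposed buffer  */
       DDP_PEER_READ        /* peer fetches data from the exposed buffer */
   };

   /* RDMA semantics: control functions layered atop placement. */
   enum rdma_control {
       RDMA_CONNECT,        /* connection establishment                  */
       RDMA_TERMINATE,      /* connection termination                    */
       RDMA_ORDERING,       /* ordering of operations                    */
       RDMA_COMPLETION,     /* signaling of completions                  */
       RDMA_SEND            /* the "send" facility                       */
   };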
6. Security Considerations

Solutions to the problem of reducing copying overhead in high-bandwidth transfers via one or more protocols may introduce new security concerns.  Any proposed solution must be analyzed for security threats, and any such threats must be addressed.  Potential security weaknesses - due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified threats - need to be examined and described, and an adequate solution to them found.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from, and must permit integration with, Internet security standards such as IPsec and TLS.  A thorough analysis of the degree to which these protocols address potential threats is required.

Security for an RDMA design requires more than just securing the communication channel.  While it is necessary to be able to guarantee channel properties such as privacy, integrity, and authentication, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy.  For example, an RDMA peer should not be able to read or write memory regions without prior consent.

Further, it must not be possible to evade consistency checks at the recipient.  The RDMA design must allow the recipient to rely on its consistent memory contents by controlling peer access to memory regions explicitly, and must disallow peer access to regions when not authorized.

The RDMA protocols must ensure that regions addressable by RDMA peers are under strict application control.  Remote access to local memory by a network peer introduces a number of potential security concerns.  This becomes particularly important in the Internet context, where such access can be exported globally.

The RDMA protocols carry, in part, what is essentially user information, explicitly including addressing information and operation type (read or write), and implicitly including protection and attributes.  As such, the protocols require checking of these higher-level aspects in addition to the basic well-formedness of messages.  The semantics associated with each class of error must be clearly defined, and the expected action to be taken on a mismatch must be specified.  In some cases, this will result in a catastrophic error on the RDMA association; in others, a local or remote error may be signalled.  Certain of these errors may require consideration of abstract local semantics, which must be carefully specified so as to provide useful behavior while not constraining the implementation.
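As an illustration of the kind of explicit access control discussed in this section, the sketch below gates every peer access to a local region on an explicit, revocable grant.  All structure, field, and function names are hypothetical; the point is only that a peer operation is validated against the application's grant before any data is placed or returned.

   /* Illustrative access check for peer access to a local memory
    * region.  Hypothetical names; not a protocol definition. */
   #include <stdint.h>
   #include <stddef.h>

   #define ACCESS_REMOTE_READ   0x1u
   #define ACCESS_REMOTE_WRITE  0x2u

   typedef struct {
       uintptr_t base;      /* start of the advertised region        */
       size_t    length;    /* its length                            */
       unsigned  access;    /* rights granted to the peer, or 0      */
       uint32_t  peer_id;   /* the one peer the grant was issued to  */
   } region_grant;

   /* Returns nonzero only if this peer, this operation, and this
    * byte range were all explicitly authorized.  The grant can be
    * revoked by clearing 'access', after which all further peer
    * accesses fail. */
   static int peer_access_allowed(const region_grant *g, uint32_t peer_id,
                                  unsigned op, uintptr_t addr, size_t len)
   {
       return g->access != 0
           && g->peer_id == peer_id
           && (g->access & op) == op
           && addr >= g->base
           && len <= g->length
           && addr - g->base <= g->length - len;
   }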
7. Acknowledgements

Jeff Chase generously provided many useful insights and much information.  Thanks to Jim Pinkerton for many helpful discussions.

8. Informative References

[BCF+95]
     N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995

[BJM+96]
     G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996

[BLA+94]
     M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153

[Br99]
     J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

[BS96]
     J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings of OSDI'96, USENIX, Seattle, WA, October 1996, pp. 277-291

RFC Editor note: Replace the following architecture draft name, status, and date with the appropriate reference when it is assigned.

[BT02]
     S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-02, June 2003

[CFF+94]
     C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42

[CGY01]
     J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74, http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[Ch96]
     H. K. Chu, "Zero-copy TCP in Solaris", Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996

[Ch02]
     Jeffrey Chase, Personal communication

[CJRS89]
     D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Volume 27, Issue 6, June 1989, pp. 23-29

[CT90]
     D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990

[DAFS]
     DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

[DAPP93]
     P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17

[DP93]
     P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993

[DWB+93]
     C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993

[EBBV95]
     T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proceedings of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995

[FGM+99]
     R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999

[FIBRE]
     ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

[HP97]
     J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997

[IB]
     InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[KP96]
     J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996
[KSZ95]
     K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95

[Ma02]
     K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of the USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002

[MAF+02]
     K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002

[Mc95]
     J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995

[Ne00]
     A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000, http://serverwatch.internet.com/news/2000_07_24_a.html

[Pa01]
     M. Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001, http://serverwatch.internet.com/news/2001_02_07_a.html

[PAC+97]
     D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997

[PDZ99]
     V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999

[Pi01]
     J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

[Po81]
     J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981

[QUAD]
     Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

[SDP]
     InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[SRVNET]
     R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995

[STREAM]
     J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

[TK95]
     M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

[VI]
     Compaq Computer Corp., Intel Corporation, and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

[Wa97]
     J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
Email: steph@sandburst.com
Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, CA 94304 USA

Phone: +1 650 857 2206 (email preferred)
Email: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2003).  All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.  However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.