                                                    S. Bailey (Sandburst)
Internet-draft                                         D. Garcia (Compaq)
Expires: May 2002                                     J. Hilland (Compaq)
                                                       A. Romanow (Cisco)

                     Direct Access Problem Statement
                 draft-garcia-direct-access-problem-00

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This problem statement describes barriers to the use of Internet Protocols for highly scalable, high bandwidth, low latency transfers necessary in some of today's important applications, particularly applications found within data centers. In addition to describing technical reasons for the problems, it gives an overview of common non-IP solutions to these problems which have been deployed over the years.

The perspective of this draft is that it would be very beneficial to have an IP-based solution for these problems so IP can be used for high speed data transfers within data centers, in addition to IP's many other uses.

Table Of Contents

1.      Introduction
1.1.    High Bandwidth Transfer Overhead
1.2.    Proliferation Of Fabrics in Data Centers
1.3.    Potential Solutions
2.      High Bandwidth Data Transfer In The Data Center
2.1.    Scalable Data Center Applications
2.2.    Client/Server Communication
2.3.    Block Storage
2.4.    File Storage
2.5.    Backup
2.6.    The Common Thread
3.      Non-IP Solutions
3.1.    Proprietary Solutions
3.2.    Standards-based Solutions
3.2.1.  The Virtual Interface Architecture (VIA)
3.2.2.  InfiniBand
4.      Conclusion
5.      Security Considerations
6.      References
        Authors' Addresses
A.      RDMA Technology Overview
A.1     Use of Memory Access Transfers
A.2     Use Of Push Transfers
A.3     RDMA-based I/O Example
        Full Copyright Statement

1. Introduction

Protocols in the IP family offer a huge, ever increasing range of functions, including mail, messaging, telephony, media and hypertext content delivery, block and file storage, and network control. IP has been so successful that applications only use other forms of communication when there is a very compelling reason.
Currently, it is often not acceptable to use IP protocols for high-speed communication within a data center. In these cases, copying data to application buffers consumes CPU cycles that are otherwise needed to perform application functions.

This limitation of IP protocols has not been particularly important until now because the domain of high performance transfers was limited to a relatively specialized niche of low volume applications, such as scientific supercomputing. Applications that needed more efficient transfer than IP could offer simply used other, purpose-built solutions.

As the use of the Internet has become pervasive and critical, the growth in number and importance of data centers has matched the growth of the Internet. The role of the data center is similarly critical. The high-end environment of the data center makes up the core and nexus of today's Internet. Everything goes in and out of data centers.

Applications running within data centers frequently require high bandwidth data transfer. Due to the high host processing overhead of high bandwidth communication in IP, the industry has developed non-IP technology to serve data center traffic. That said, the obstacles to lowering host processing overhead in IP are well understood and straightforward to address. Simple techniques could allow the penetration of existing IP protocols into data centers where non-IP technology is currently used.

Technology advances have made feasible specially designed network interfaces that place IP protocol data directly in application buffers. While it is certainly possible to use control information directly from existing IP protocol messages to place data in application buffers, the sheer number and diversity of current and future IP protocols calls for a generic solution instead. Therefore, the goal is to investigate a generic data placement solution for IP protocols that would allow a single network interface to perform direct data placement for a wide variety of mature, evolving and completely new protocols.

There is a great desire to develop lower overhead, more scalable data transfer technology based on IP. This desire comes from the advantages of using one protocol technology rather than several, and from the many efficiencies of technology based upon a single, widely adopted, open standard.

This document describes the problems that IP faces in delivering highly scalable, high bandwidth data transfer. The first section describes the issues in general. The second section describes several specific scenarios, discussing particular application domains and specific problems that arise. The third section describes approaches that have historically been used to address low overhead, high bandwidth data transfer needs. The appendix gives an overview of how a particular class of non-IP technologies addresses this problem with Remote Direct Memory Access (RDMA).

1.1. High Bandwidth Transfer Overhead

Transport protocols such as TCP [TCP] and SCTP [SCTP] have successfully shielded upper layers from the complexities of moving data between two computers. This has been very successful in making TCP/IP ubiquitous.
However, with current IP implementations, Upper Layer Protocols (ULPs), such as NFS [NFSv3] and HTTP [HTTP], require incoming data packets to be buffered and copied before the data is used.

It is this data copying that is a primary source of overhead in IP data transfers. Copying received data for high bandwidth transfers consumes significant processing time and memory bandwidth. If data is buffered and then copied, the data moves across the memory bus at least three times during the data transfer. By comparison, if the incoming data is placed directly where the application requires it, the data moves across the memory bus only once. This copying overhead currently means that additional processing resources, such as additional processors in a multiprocessor machine, are needed to reach faster and faster wire speeds.

A wide range of ad hoc solutions has been explored to eliminate data copying overhead within the framework of current IP protocols, but despite extensive study, no adequate or general solution yet exists [Chase].
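As a purely illustrative sketch of the two receive paths just described, the following C fragment contrasts the conventional copy path with direct placement. nic_post_receive_buffer() is a hypothetical interface invented for this illustration and does not correspond to any existing API:

   #include <sys/types.h>
   #include <sys/socket.h>

   /* Conventional path: the NIC DMAs the arriving packet into a kernel
    * socket buffer (memory bus crossing 1); the kernel then reads that
    * buffer (crossing 2) and writes the payload into the application
    * buffer supplied here (crossing 3). */
   ssize_t recv_with_copy(int sock, void *app_buf, size_t len)
   {
       return recv(sock, app_buf, len, 0);  /* the copy happens inside the kernel */
   }

   /* Direct placement: the application describes its buffer to a
    * suitably capable network interface before the data arrives, and
    * the NIC writes the payload straight into it, so the payload
    * crosses the memory bus only once.  This declaration is
    * hypothetical, standing in for the kind of interface discussed in
    * Section 1.3 and Appendix A. */
   int nic_post_receive_buffer(int sock, void *app_buf, size_t len);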
1.2. Proliferation Of Fabrics in Data Centers

The current alternative to paying the high costs due to data transfer overhead in data centers is the use of several different communication technologies at once. Data centers are likely to have separate IP (Ethernet), Fibre Channel storage, and InfiniBand, VIA or proprietary interprocess communication (IPC) networks. Special purpose networks are used for storage and IPC to reduce the processor overhead associated with data communications and, in the case of IPC, to reduce latency as well.

Using such proprietary and special purpose solutions runs counter to the requirements of data center computing. Data center designers and operators do not want the expense and complexity of building and maintaining three separate communications networks. Three NICs and three fabric ports are expensive, and they consume valuable I/O card slots, power and machine room space.

A single IP fabric would be far preferable. IP networks are best positioned to fill the role of all three of these existing networks. At 1 to 10 gigabit speeds, current IP interconnects could offer comparable or superior performance characteristics to special purpose interconnects, if it were not for the high overhead and latency of IP data transfers. An IP-based alternative to the IPC and storage fabrics would be less costly and much more easily manageable than maintaining separate communication fabrics.

1.3. Potential Solutions

One frequently proposed solution to the problem of overhead in IP data transfers is to wait for the next generation of faster processors and speedier memories to render the problem irrelevant. However, in the evolution of the Internet, processor and memory speeds are not the only variables that have increased exponentially over time. Data link speeds have grown exponentially as well. Recently, spurred by the demand for core network bandwidth, data link speeds have grown faster than both processor computation rates and processor memory transfer rates. Whatever speed increases occur in processors and memories, it is clear that link speeds will continue to grow aggressively as well.

Rather than relying on increasing CPU performance, non-IP solutions use network interface hardware to attack several distinct sources of overhead. For a small, one-way IP data transfer, typically both the sender and receiver must make several context switches, process several interrupts, and send and receive a network packet. In addition, the receiver must perform at least one data copy. This single transfer could require 10,000 instructions of execution and total time measured in hundreds of microseconds if not milliseconds. The sources of overhead in this transfer are:

o  context switches and interrupts,

o  execution of protocol code,

o  copying the data on the receiver.

Copying competes with DMA and other processor accesses for memory system bandwidth, and all these sources of overhead can also have significant secondary effects on the efficiency of application execution by interfering with system caches.

Depending on the application, each of these sources of overhead may be a small or a large factor in total overhead, but the cumulative effect of all of them is nearly always substantial for high bandwidth transfers. If data transfers are very small, data copying is only a small cost, but context switching and protocol stack execution become performance limiting factors. For large transfers, the most common high bandwidth data transfers, context switching and protocol stack execution can be amortized away, within certain limits, but data copying becomes costly.

Non-IP solutions address these sources of overhead with network interface hardware that:

o  reduces context switches and interrupts with kernel-bypass capability, where the application communicates directly with the network interface without kernel intervention (see the sketch at the end of this section),

o  reduces protocol stack processing with protocol offload hardware that performs some or all protocol processing (e.g. ACK processing),

o  reduces data copying overhead by placing data directly in application buffers.

The application of these techniques reduces both data transfer overhead and data transfer latency. Context switches and data copying are substantial sources of end-to-end latency that are eliminated by kernel-bypass and direct data placement. Offloaded protocol processing can also typically be performed an order of magnitude faster than a comparable, general purpose protocol stack, due to the ability to exploit extensive parallelism in hardware. While protocol offload does reduce overhead, for the vast majority of current high bandwidth data transfer applications, eliminating data copies is much more important.

These techniques, and others, may be equally applicable to reducing the overhead of IP data transfers.
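To make the kernel-bypass technique above concrete, the following rough C sketch shows how an application might post a transmit descriptor to a user-mapped queue and reap its completion by polling, with no system call or interrupt on the fast path. The structures and fields are invented for illustration; real interfaces such as VIA or the InfiniBand verbs differ in detail:

   #include <stdint.h>

   /* Hypothetical user-mapped queue pair, for illustration only. */
   struct descriptor { uint64_t buf_addr; uint32_t length; uint32_t flags; };
   struct completion { uint32_t status;   uint32_t bytes; };

   struct queue_pair {
       volatile struct descriptor *send_queue;   /* mapped NIC send ring   */
       volatile struct completion *comp_queue;   /* mapped completion ring */
       volatile uint32_t          *doorbell;     /* mapped NIC doorbell    */
       uint32_t sq_head, cq_tail, ring_size;
   };

   /* Post a transmit without entering the kernel: write a descriptor
    * into the mapped ring and ring the doorbell. */
   static void post_send(struct queue_pair *qp, void *buf, uint32_t len)
   {
       volatile struct descriptor *d =
           &qp->send_queue[qp->sq_head++ % qp->ring_size];
       d->buf_addr = (uint64_t)(uintptr_t)buf;
       d->length   = len;
       d->flags    = 1;                 /* e.g. a "valid" bit */
       *qp->doorbell = qp->sq_head;     /* tell the NIC; no system call */
   }

   /* Reap the completion by polling the mapped completion queue, so no
    * interrupt or context switch is needed on the fast path. */
   static int poll_completion(struct queue_pair *qp)
   {
       volatile struct completion *c =
           &qp->comp_queue[qp->cq_tail % qp->ring_size];
       while (c->status == 0)
           ;                            /* spin; a real ULP might block eventually */
       qp->cq_tail++;
       return (int)c->bytes;
   }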
2. High Bandwidth Data Transfer In The Data Center

There are numerous uses of high bandwidth data transfers in today's data centers. While these applications are found in the data center, they have implications for the desktop as well. This problem statement focuses on data center scenarios below, but it would be beneficial to find a solution that meets data center needs while possibly remaining affordable for the desktop.

Why is high bandwidth data transfer in the data center important for IP networking? Performance on the Internet, as well as intranets, is dependent on the performance of the data center. Every request, be it a web page, database query or file and print service, goes to or through data center servers. Often a multi-tiered computing solution is used, where multiple machines in the data center satisfy these requests. Despite the explosive growth of the server market, data centers are running into critical limitations that impact every client directly or indirectly. Unlike servers, clients are largely limited in performance by the human at the interface. In contrast, data center performance is limited by the speeds and feeds of the network and I/O devices as well as hardware and software components.

With new protocols such as iSCSI, IP networks are increasingly taking on the functions of special purpose interconnects, such as Fibre Channel. However, the limitations created by the high data transfer overhead described here have not yet been addressed for IP protocols in general.

First and foremost, all the problems illustrated in the scenarios below occur on IP protocol based networks. It is imperative to understand the pervasiveness of IP networks within the data center, and that all of the problems described below occur in IP-based data transfer solutions. Therefore, a solution to these problems will naturally also be a part of the IP protocol suite.

Although the problems discussed below manifest themselves in different ways, investigation into the source of these problems shows a common thread running through them. These scenarios are not an exhaustive list, but rather describe the wide range of problems exhibited in the scalability and performance of the applications and infrastructures encountered in data center computing as a result of high communication overhead.

2.1. Scalable Data Center Applications

A key characteristic of any data center application is its ability to scale as demands increase. For many Internet services, applications must scale in response to the success of the service and the increased demand which results. In other cases, applications must be scaled as capabilities are added to a service, again in response to the success of the service, changes in the competitive environment or the goals of the provider.

Virtually all data center applications require intermachine communication, and therefore, application scalability may be directly limited by communication overhead. From the application viewpoint, every CPU cycle spent performing data transfer is a wasted cycle that affects scalability. For high bandwidth data transfers using IP, this overhead can be 30-40% of available CPU. If an application is running on a single server, and it is scaled by adding a second server, communication overhead of 40% means that the CPU available to the application from two servers is only 120% of that of the single server. The problem is even worse with many servers, because most servers are communicating with more than one other server. If three servers are connected in a pipeline where 40% CPU is required for data transfers to or from another server, the total available CPU power would still be only 120% of the power of a single server! Not all data center applications require this level of communication, but many do. The high overhead of data transfers in IP severely impacts the viability of IP for scalable data center applications.
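The two-server case above can be restated as a simple back-of-the-envelope model (a simplification, not a measurement): usable capacity is the number of servers times the fraction of each CPU left after communication overhead.

   #include <stdio.h>

   /* Simplified model of Section 2.1: each server loses a fixed
    * fraction of its CPU to data transfer overhead. */
   static double usable_capacity(int servers, double comm_overhead)
   {
       return servers * (1.0 - comm_overhead);
   }

   int main(void)
   {
       /* Two servers, each spending 40% of its CPU on communication,
        * deliver only 120% of the capacity of a single server. */
       printf("%.0f%%\n", 100.0 * usable_capacity(2, 0.40));
       return 0;
   }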
2.2. Client/Server Communication

Client/server communication in the data center is a variation of the scalable data center application scenario, but it applies to standalone servers as well as parallel applications. The overhead of high bandwidth data communication weighs heavily on the server. The server's ability to respond is limited by any communication overhead it incurs.

In addition, client/server application performance is often dominated by data transfer latency characteristics. Reducing latency can greatly improve application performance. Techniques commonly employed in IP network interfaces, such as TCP checksum calculation offload, reduce transfer overhead somewhat, but they typically do not reduce latency at all. Another technique used to reduce latency in IP communication is to dedicate multiple threads of execution, each running on a separate processor, to processing requests concurrently. However, this multithreading solution has limits, as the number of outstanding requests can vastly exceed the number of processors. Furthermore, the effect of multithreading concurrency is additive with any other latency reduction in the data transfers themselves.

To address the problems of high bandwidth IP client/server communication, a solution would ideally reduce both end-to-end communication latency and communication overhead.

2.3. Block Storage

Block storage, in the form of iSCSI [iSCSI] and the IP Fibre Channel protocols [FCIP, iFCP], is a new IP application area of great interest to the storage and data center communities. Just as data centers eagerly desire to replace special-purpose interprocess communication fabrics with IP, there is parallel and equal interest in migrating block storage traffic from special-purpose storage fabrics to IP.

As with other forms of high bandwidth communication, the data transfer overhead in traditional IP implementations, particularly the three bus crossings required for receiving data, may substantially limit data center storage transfer performance compared to what is commonplace with special-purpose storage fabrics. In addition, data copying, even if it is performed within a specialized IP-storage adapter, will substantially increase transfer latency, which can noticeably degrade the performance of both file systems and applications.

Protocol offload and direct data placement comparable to what is provided by existing storage fabric interfaces (Fibre Channel, SCSI, FireWire, etc.) are possible pieces of a solution to the problems created by IP data transfer overhead for block storage. It has been claimed that block storage is such an important application that IP block storage protocols should be directly offloaded by network interface hardware, rather than through use of a generic, application-independent offload solution. However, even the block storage community recognizes the benefits of more general-purpose ways to reduce IP transfer overhead, and most expect to eventually use such general-purpose capabilities for block storage when they become available, if for no other reason than that it reduces the risks and impact of changing and evolving the block storage protocols themselves.

2.4. File Storage

The file storage application exhibits a compound problem within the data center.
File servers and clients are subject to the communication characteristics of both block storage and client/server applications. The problems created by high transfer overhead are particularly acute for file storage implementations that are built with a substantial amount of user-mode code. In any form of file storage application, many CPU cycles are spent traversing the kernel mode file system, disk storage subsystems, and protocol stacks, and driving network hardware, similar to the block storage scenario. In addition, file systems must address the communication problems of a distributed client/server application. There may be substantial shared state distributed among servers and clients, creating the need for extensive communication to maintain this shared state.

A solution to the communication overhead problems of IP data transfer for file storage involves a union of the approaches for efficient disk storage and efficient client/server communication, as discussed above. In other words, both low overhead and low latency communication are goals.

2.5. Backup

One of the problems with IP-based storage backup is that it consumes a great deal of the host CPU's time and resources. Unfortunately, the high overhead required for IP-based backup is typically not acceptable in an active data center.

The challenge of backup is that it is usually performed on machines which are also actively participating in the services the data center is providing. At a minimum, a machine performing backup must maintain some synchronization with other machines modifying the state being backed up, so that the backup is coherent. As discussed in the section above on Scalable Data Center Applications, any overhead placed on active machines can substantially affect scalability and solution cost.

Backup solutions on specialized storage fabrics allow systems to back up the data without the host processor ever touching the data. Data is transferred to the backup device from disk storage through host memory, or sometimes even directly without passing through the host, as a so-called third party transfer.

Storage backup in the data center could be done with IP if data transfer overhead were substantially reduced.

2.6. The Common Thread

There is a common thread running through the problems of using IP communication in all of these scenarios. The union of the solutions to these problems is a high bandwidth, low latency, low CPU overhead data transfer solution. Non-IP solutions offer technical solutions to these problems, but they lack the ubiquity and price/performance characteristics necessary for a viable, general solution.

3. Non-IP Solutions

The most refined non-IP solution to reducing communication overhead has a rich history reaching back almost 20 years. This solution uses a data transfer metaphor called Remote Direct Memory Access (RDMA). See Appendix A for an introduction to RDMA. In spite of the technical advantages of the various non-IP solutions, all have ultimately lacked the ubiquity and price/performance characteristics necessary to gain widespread usage. This lack of widespread adoption has also resulted in various shortcomings of particular incarnations, such as incomplete integration with native platform capabilities or other software implementation limitations.
In addition, no non-IP solutions offer the massive range of network scalability that IP protocols support. Non-IP solutions typically only scale to tens or hundreds of nodes in a single network, and have no story to tell about interconnection of multiple networks.

Several non-IP solutions will be briefly described here to show the state of experience with this set of problems.

3.1. Proprietary Solutions

Low overhead communication technologies have traditionally been developed as proprietary value-added products by computer platform vendors. Such solutions were tightly integrated with platform operating systems and did provide powerful, well integrated communication capabilities. However, applications written for one solution were not portable to others. Also, the solutions were expensive, as is typically the case with value-added technologies.

The earliest example of a low overhead communication technology was Digital's VAX Cluster Interconnect (CI), first released in 1983. The CI allowed computers and storage to be connected as peers on a small multipoint network used for both IPC and I/O. The CI made VAX/VMS Clusters the only alternative to mainframes for large commercial applications for many years.

Tandem ServerNet was another proprietary block transfer technology, developed in the mid 1990s. It has been used to perform disk I/O, IPC and network I/O in the Himalaya product line. This architecture allows the Himalaya platform to be inherently scalable because the software has been designed to take advantage of the offload capability and zero copy techniques. Tandem attempted to take this product into the Industry Standard Server market, but its price/performance characteristics and the fact that it was a proprietary solution prevented wide adoption.

Silicon Graphics used a standards-based network fabric, HiPPI-800, but built a proprietary low overhead communication mechanism on top. Other platform vendors such as IBM, HP and Sun have also offered a variety of proprietary low overhead communication solutions over the years.

3.2. Standards-based Solutions

Increasing fluidity in the landscape of major platform vendors has drastically increased the desire for all applications to be portable. Platforms which were here yesterday might be gone tomorrow. This has killed the willingness of application and data center designers and maintainers to use proprietary features of any platform.

Unwillingness to continue to use proprietary interconnects forced platform vendors to collaborate on standards-based low overhead communication technologies to replace the proprietary ones which had become critical to building data center applications. Two of these standards-based solutions, considered to be roughly parent and child, are described below.

3.2.1. The Virtual Interface Architecture (VIA)

VIA [VI] was a technology jointly developed by Compaq, Intel and Microsoft. VIA helped prove the feasibility of doing IPC offload, user mode I/O and traditional kernel mode I/O as well.

While VIA implementations met with some limited success, VIA turned out to fill only a small market niche, for several reasons. First, commercially available operating systems lacked a pervasive interface.
Second, because the standard did not define a wire protocol, no two implementations of the VIA standard were interoperable on the wire. Third, different implementations were not interoperable at the software layer either, since the API definition was an appendix to the specification and not part of the specification itself.

Yet with parallel applications, VIA proved itself time and again. It was used to set a new benchmark record in the terabyte data sort at Sandia Labs. It set new TPC-C records for distributed databases, and it was used to set new TPC-C records as the client-server communication link. VIA also set the foundation for work such as the Sockets Direct Protocol through the implementation of the Winsock Direct Protocol in Windows 2000 [WSD]. And it gave the DAFS Collaborative a rallying point for a common programming interface [DAFSAPI].

3.2.2. InfiniBand

InfiniBand [IB] was developed by the InfiniBand Trade Association (IBTA) as a low overhead communication technology that provides remote direct memory access transfers, including interlocked atomic operations, as well as traditional datagram-style transfers.

InfiniBand defines a new electromechanical interface, card and cable form factors, a physical interface, a link layer, a transport layer and an upper layer software transport interface. The IBTA has also described a fabric management infrastructure to initialize and maintain the fabric.

While all of the specialized technology of InfiniBand does provide impressive performance characteristics, IB lacks the ubiquity and price/performance of IP. In addition, management of InfiniBand fabrics will require new tools and training, and InfiniBand lacks the huge base of applications, protocols, and thoroughly engineered security and routing technology available with IP.

4. Conclusion

This document has described the set of problems that hinder the widespread use of IP for high speed data transfers in data centers. There have been a variety of non-IP solutions available, but these have met with only limited success, for different reasons. After many years of experience in both the IP and non-IP domains, the problems appear to be reasonably well understood, and a direction to a solution is suggested by this study. However, additional investigation, and subsequent work on an architecture and the necessary protocol(s) for reducing overhead in high bandwidth IP data transfers, are still required.

5. Security Considerations

This draft states a problem and, therefore, does not require particular security considerations other than those dedicated to squelching the free spread of ideas, should the problem discussion itself be considered seditious or otherwise unsafe.

6. References

[Chase] Jeff S. Chase, et al., "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume 39, Issue 4, April 2001, pp. 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[DAFSAPI] "Direct Access File System Application Programming Interface", version 0.9.5, 09/21/2001. http://www.dafscollaborative.org/tools/dafs_api.pdf

[FCIP] Raj Bhagwat, et al., "Fibre Channel Over TCP/IP (FCIP)", 09/20/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-06.txt
[HTTP] J. Gettys, et al., "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[IB] InfiniBand Architecture Specification, Volumes 1 and 2, release 1.0.a. http://www.infinibandta.org

[iFCP] Charles Monia, et al., "iFCP - A Protocol for Internet Fibre Channel Storage Networking", 10/19/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-ifcp-06.txt

[iSCSI] J. Satran, et al., "iSCSI", 10/01/2001. http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-08.txt

[NFSv3] B. Callaghan, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[SCTP] R. R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. J. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[TCP] Postel, J., "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[VI] Virtual Interface Architecture Specification, version 1.0. http://www.viarch.org/html/collateral/san_10.pdf

[WSD] "Winsock Direct and Protocol Offload On SANs", version 1.0, 3/3/2001, from "Designing Hardware for the Microsoft Windows Family of Operating Systems". http://www.microsoft.com/hwdev/network/san

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810
USA

Phone: +1 978 689 1614
Email: steph@sandburst.com

Dave Garcia
Compaq Computer Corp.
19333 Valco Parkway
Cupertino, CA 95014
USA

Phone: +1 408 285 6116
Email: dave.garcia@compaq.com

Jeff Hilland
Compaq Computer Corp.
20555 SH 249
Houston, TX 77070
USA

Phone: +1 281 514 9489
Email: jeff.hilland@compaq.com

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134
USA

Phone: +1 408 525 8836
Email: allyn@cisco.com

Appendix A. RDMA Technology Overview

This section describes how Remote Direct Memory Access (RDMA) technology, such as the Virtual Interface Architecture (VIA) and InfiniBand (IB), provides for low overhead data transfer. VIA and IB are examples of the RDMA technology also used by many proprietary low overhead data transfer solutions.

The IB and VIA protocols both provide memory access and push transfer semantics. With memory access transfers, data from the local computer is written/read directly to/from an address space of the remote computer. How, when and why buffers are accessed is defined by the ULP layer above IB or VIA.

With push transfers, the data source pushes data to an anonymous receive buffer at the destination. TCP and UDP transfers are both examples of push transfers. VIA and IB both call their push transfer a Send operation, which is a datagram-style push transfer. The data receiver chooses where to place the data; the receive buffer is anonymous with respect to the sender of the data.

A.1 Use of Memory Access Transfers

In the memory access transfer model, the initiator of the data transfer explicitly indicates where data is extracted from or placed on the remote computer. VI and InfiniBand both define memory access read (called RDMA Read) and memory access write (called RDMA Write) transfers. The buffer address is carried in each PDU, allowing the network interface to directly place the data in application buffers.
Placing the data directly into the application's buffer has three significant benefits:

o  CPU and memory bus utilization are lowered by not having to copy the data. Since memory access transfers use buffer addresses supplied by the application, data can be directly placed at its final location.

o  Memory access transfers incur no CPU overhead during transfers if the network interface offloads RDMA (and lower layer) protocol processing. There is enough information in RDMA PDUs for the target network interface to complete RDMA Reads or RDMA Writes without any local CPU action.

o  Memory access transfers allow splitting of ULP headers and data. With memory access transfers, the ULP can control the exact placement of all received data, including ULP headers and ULP data. ULP headers and other control information can be placed in separate buffers from ULP data. This is frequently a distinct advantage compared to having ULP headers and data in the same buffers, as an additional data copy may otherwise be required to separate them.

Providing memory access transfers does not mean a processor's entire memory space is open for unprotected transfers. The remote computer controls which of its buffers can be accessed by memory access transfers. Incoming RDMA Read and RDMA Write operations can only access buffers to which the receiving host has explicitly permitted RDMA accesses. When the ULP allows RDMA access to a buffer, the extent and address characteristics of the buffer can be chosen by the ULP. A buffer could use the virtual address space of the process, it could use a physical address (if allowed), or it could use a new virtual address space created for the individual buffer.

In both IB and VIA, the RDMA buffer is registered with the receiving network interface before RDMA operations can occur. For a typical hardware offload network interface, this is enough information to build an address translation table and associate appropriate security information with the buffer. The address translation table lets the NIC convert the incoming buffer target address into a local physical address.

A.2 Use Of Push Transfers

Memory access transfers contrast with the push transfers typically used by IP applications. With push transfers, the source has no visibility or control over where data will be delivered on the destination machine. While most protocols use some form of push transfer, IB and VIA define a datagram-style push transfer that allows a form of direct data placement on the receive side.

IB and VIA both require the application to pre-post receive buffers. The application pre-posts receive buffers for a connection and they are filled by subsequent incoming Send operations. Since the receive buffer is pre-posted, the network interface can place the data from the incoming Send operation directly into the application's buffer. IB and VIA allow the use of scattered receive buffers to support splitting the ULP header from data within a single Send.

Neither memory access nor push transfers are inherently superior -- each has its merits. Furthermore, memory access transfers can be built atop push transfers or vice versa. However, direct support of memory access transfers allows much lower transfer overhead than if memory access transfers are emulated.
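The following C sketch is offered as a simplified illustration of the two transfer styles described in A.1 and A.2. The rdma_* types and functions are invented for this illustration; they do not correspond to the VIA, InfiniBand or any other real programming interface:

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical interface, loosely modeled on the concepts in this
    * appendix; none of these declarations exist in a real library. */
   typedef struct rdma_conn   rdma_conn_t;    /* an established connection  */
   typedef struct rdma_region rdma_region_t;  /* a registered memory region */

   rdma_region_t *rdma_register(rdma_conn_t *c, void *buf, size_t len);
   int rdma_write(rdma_conn_t *c, rdma_region_t *local,
                  uint64_t remote_addr, uint32_t remote_key, size_t len);
   int rdma_post_recv(rdma_conn_t *c, void *buf, size_t len);
   int rdma_send(rdma_conn_t *c, const void *buf, size_t len); /* push (Send), used by the peer */
   int rdma_wait_completion(rdma_conn_t *c);

   /* Memory access style (A.1): the sender names the remote buffer,
    * using an address and protection key learned out of band (for
    * example, in an earlier Send), and the remote NIC places the data
    * without any remote CPU involvement. */
   static int write_block(rdma_conn_t *c, void *block, size_t len,
                          uint64_t remote_addr, uint32_t remote_key)
   {
       rdma_region_t *mr = rdma_register(c, block, len); /* builds NIC translation and protection state */
       if (mr == NULL)
           return -1;
       if (rdma_write(c, mr, remote_addr, remote_key, len) != 0)
           return -1;
       return rdma_wait_completion(c);   /* data is now in the remote buffer */
   }

   /* Push style (A.2): the receiver pre-posts an anonymous buffer; an
    * incoming Send is placed directly into it, but the sender never
    * learns where the data landed. */
   static int receive_message(rdma_conn_t *c, void *buf, size_t len)
   {
       if (rdma_post_recv(c, buf, len) != 0)  /* must precede the incoming Send */
           return -1;
       return rdma_wait_completion(c);        /* buf now holds the ULP message */
   }

The I/O example in A.3 below follows the same pattern: register a buffer, let the peer transfer the data, then use a completion or a small Send to signal that the buffer may be deregistered.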
A.3 RDMA-based I/O Example

If the RDMA protocol is offloaded to the network interface, the RDMA Read operation allows an I/O subsystem, such as a storage array, to fully control all aspects of data transfer for outstanding I/O operations. An example of a simple I/O operation shows several benefits of using memory access transfers.

Consider an I/O block Write operation where the host processor wishes to move a block of data (the data source) to an I/O subsystem. The host first registers the data source with its network interface as an RDMA address block. Next the host pushes a small Send operation to the I/O subsystem. The message describes the I/O write request and tells the I/O subsystem where it can find the data in the virtual address space presented through the communication connection by the network interface. After receiving this message, the I/O subsystem can pull the data from the host's buffer as needed. This gives the I/O subsystem the ability to both schedule and pace its data transfer, thereby requiring less buffering on the I/O subsystem. When the I/O subsystem completes the data pull, it pushes a completion message back to the host with a small Send operation. The completion message tells the host the I/O operation is complete and that it can deregister its RDMA block.

In this example the host processor spent very few CPU cycles doing the I/O block Write operation. The processor sent out a small message and the I/O subsystem did all the data movement. After the I/O operation was completed the host processor received a single completion message.

Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.