Remote Direct Data Placement                                  C. Bestler
Working group                                       Broadcom Corporation
Internet-Draft                                                  L. Coene
Expires: October 26, 2006                                        Siemens
                                                          April 24, 2006

Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
                          Data Placement (DDP)
                  draft-ietf-rddp-applicability-06.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on October 26, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

This document describes the applicability of Remote Direct Memory
Access Protocol (RDMAP) and the Direct Data Placement Protocol (DDP).
It compares and contrasts the different transport options over IP
that DDP can use, provides guidance to ULP developers on choosing
between available transports and/or how to be indifferent to the
specific transport layer used, compares use of DDP with direct use of
the supporting transports, and compares DDP over IP transports with
non-IP transports that support RDMA functionality.

Table of Contents

1.  Introduction
2.  Definitions
3.  Direct Placement
  3.1.  Fewer Required ULP Interactions
  3.2.  Direct Placement using only the LLP
4.  Tagged Messages
  4.1.  Order Independent Reception
  4.2.  Reduced ULP Notifications
  4.3.  Simplified ULP Exchanges
  4.4.  Order Independent Sending
  4.5.  Untagged Messages and Tagged Buffers as ULP Credits
5.  RDMA Read
6.  LLP Comparisons
  6.1.  Multistreaming Implications
  6.2.  Out of Order Reception Implications
  6.3.  Header and Marker Overhead
  6.4.  Middlebox Support
  6.5.  Processing Overhead
  6.6.  Data Integrity Implications
    6.6.1.  MPA/TCP Specifics
    6.6.2.  SCTP Specifics
  6.7.  Non-IP Transports
    6.7.1.  No RDMA Layer Ack
  6.8.  Other IP Transports
  6.9.  LLP Independent Session Establishment
    6.9.1.  RDMA-only Session Establishment
    6.9.2.  RDMA-Conditional Session Establishment
7.  Local Interface Implications
8.  IANA Considerations
9.  Security considerations
  9.1.  Connection/Association Setup
  9.2.  Tagged Buffer Exposure
  9.3.  Impact of Encrypted Transports
10. References
  10.1.  Normative references
  10.2.  Informative References
Authors' Addresses
Intellectual Property and Copyright Statements

1.  Introduction

Remote Direct Memory Access Protocol [6] and Direct Data Placement
[7] work together to provide application independent efficient
placement of application payload directly into buffers specified by
the Upper Layer Protocol (ULP).

The DDP protocol is responsible for direct placement of received
payload into ULP specified buffers. The RDMAP protocol provides
completion notifications to the ULP and support for Data Sink
initiated fetch of advertised buffers (RDMA Reads).

DDP and RDMAP are both application independent protocols which allow
the ULP to perform remote direct data placement. DDP can use
multiple standard IP transports including SCTP and TCP.

By clarifying the situations where the functionality of these
protocols is applicable, this document can guide implementers,
application designers and protocol designers in selecting which
protocols to use.

The applicability of RDMAP/DDP is driven by their unique
capabilities:

o  The existence of an application independent protocol allows common
   solutions to be implemented in hardware and/or the kernel. This
   document will discuss when common data placement procedures are of
   the greatest benefit to applications as contrasted with
   application specific solutions built on top of direct use of the
   underlying transport.

o  DDP supports both untagged and tagged buffers. Tagged buffers
   allow the Data Sink ULP to be indifferent to what order (or in
   what messages) the Data Source sent the data, or what order
   packets are received in. Typically tagged data can be used for
   payload transfer, while untagged is best used for control
   messages. However, each upper layer protocol can determine the
   optimal use of tagged and untagged messages for itself. This
   document will discuss when Data Source flexibility is of benefit
   to applications.

o  RDMAP consolidates ULP notifications, thereby minimizing the
   number of required ULP interactions.

o  RDMAP defines RDMA Reads, which allow remote access to advertised
   buffers. This document will review the advantages of using RDMA
   Reads as contrasted to alternate solutions.

Some non-IP transports, such as InfiniBand, directly integrate RDMA
features. This document will review the applicability of providing
RDMA services over ubiquitous IP transports as opposed to the use of
customized transport protocols.
Because DDP is defined cleanly as a layer over existing IP
transports, DDP has simpler ordering rules than some prior RDMA
protocols. This may have some implications for application
designers.

The capabilities of DDP and RDMAP can only be fully realized by
applications that are designed to exploit them. The co-existence of
RDMAP/DDP aware local interfaces with traditional socket interfaces
will also be explored.

Finally, DDP support is defined for at least two IP transports: SCTP
[8] and MPA over TCP [10]. The rationale for supporting both
transports is reviewed, as well as when each would be the appropriate
selection.

2.  Definitions

Advertisement - The act of informing a Remote Peer that a local RDMA
   Buffer is available to it. A Node makes available an RDMA Buffer
   for incoming RDMA Read or RDMA Write access by informing its
   RDMA/DDP peer of the Tagged Buffer identifiers (STag, base
   address, and buffer length). This advertisement of Tagged Buffer
   information is not defined by RDMA/DDP and is left to the ULP. A
   typical method would be for the Local Peer to embed the Tagged
   Buffer's Steering Tag, base address, and length in a Send Message
   destined for the Remote Peer.

Data Sink - The peer receiving a data payload. Note that the Data
   Sink can be required to both send and receive RDMA/DDP Messages to
   transfer a data payload.

Data Source - The peer sending a data payload. Note that the Data
   Source can be required to both send and receive RDMA/DDP Messages
   to transfer a data payload.

Lower Layer Protocol (LLP) - The transport protocol that provides
   services to DDP. This is an IP transport with any required
   adaptation layer. Adaptation layers are defined for SCTP and TCP.

Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
   valid as defined within a protocol specification.

Tagged Message - A DDP message that is directed to a ULP specified
   buffer based upon embedded addressing information. In the
   immediate sense, the destination buffer is specified by the
   message sender. The message receiver is given no independent
   indication that a tagged message has been received.

Untagged Message - A DDP message that is directed to a ULP specified
   buffer based upon a Message Sequence Number being matched with a
   receiver supplied buffer. The destination buffer is specified by
   the message receiver. The message receiver is notified by some
   mechanism that an untagged message has been received.

Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
   In addition to protocols such as iSER [11] and NFSv4 over RDMA
   [12], the ULP may be embedded in an application, or a middleware
   layer as is often the case for the Sockets Direct Protocol (SDP)
   and Remote Procedure Call (RPC) protocols.

3.  Direct Placement

Direct Data Placement optimizes the placement of ULP payload into the
correct destination buffers, typically eliminating intermediate
copying. Placement is enabled without regard to order of arrival or
order of transmission, and without requiring per-placement
interaction with the ULP.

RDMAP minimizes the required ULP interactions. This capability is
most valuable for applications that require multiple transport layer
packets for each required ULP interaction.

3.1.  Fewer Required ULP Interactions

While reducing the number of required ULP interactions is in itself
desirable, it is critical for high speed connections. The burst
packet rate for a high speed interface could easily exceed the host
system's ability to switch ULP contexts.

Content access applications are primary examples of applications with
both high bandwidth and high ratios of content to required ULP
interactions.
These applications include file access protocols (NAS),
storage access (SAN), database access and other application specific
forms of content access such as HTTP, XML and email.

3.2.  Direct Placement using only the LLP

Direct data placement can be achieved without RDMA. Pre-posting of
receive buffers could allow a non-RDMA network stack to place data
directly to user buffers.

The degree to which DDP optimizes depends on which transport it is
being compared with, and on the nature of the local interface.
Without RDMAP/DDP, pre-posting buffers requires the receiving side to
accurately predict the required buffers and their sizes. This is not
feasible for all ULPs. By contrast, DDP only requires the ULP to
predict the sequence and size of incoming untagged messages.

An application that could predict incoming messages and required
nothing more than direct placement into buffers might be able to do
so with a properly designed local interface to SCTP or TCP. Doing so
for TCP requires making predictions at a byte level rather than a
message level.

The main benefit of DDP for such an application would be that
pre-posting of receive buffers is a mandated local interface
capability, and that predictions can be made on a per-message basis
(not per byte).

The Lower Layer Protocol, LLP, can also be used directly if ULP
specific knowledge is built into the protocol stack to allow "parse
and place" handling of received packets. Such a solution either
requires interaction with the ULP, or that the protocol stack have
knowledge of ULP specific syntax rules.

DDP achieves the benefits of directly placing incoming payload
without requiring tight coupling between the ULP and the protocol
stack. However, "parse and place" capabilities can certainly provide
equivalent services to a limited number of ULPs.

4.  Tagged Messages

This section covers the major benefits from the use of Tagged
Messages.

A more critical advantage of DDP is the ability of the Data Source to
use tagged buffers. Tagging messages allows the Data Source to
choose the ordering and packetization of its payload deliveries.
With direct data placement based solely upon pre-posted receives, the
packetization and delivery of payload must be agreed upon by the ULP
peers in advance.

The Upper Layer Protocol can allocate content between untagged and/or
tagged messages to maximize the potential optimizations. Placing
content within an untagged message can deliver the content in the
same packet that signals completion to the receiver. This can
improve latency. It can even eliminate round trips. But it requires
larger anonymous buffers to be made available.

Examples of data that typically belongs in the untagged message
include:

o  Short fixed size control data that is inherently part of the
   control message.

o  Relatively short payload that is almost always needed, especially
   when including it would eliminate a round trip to fetch the data
   (for example, the initial data on a write request).

o  Advertisements of tagged buffers that specify the location of data
   not in the untagged message.

Tagged messages standardize direct placement of data without
per-packet interaction with the upper layers. Even if there is an
upper layer protocol encoding of what is being transferred, as is
common with middleware solutions, this information is not understood
at the application independent layers. The directions on where to
place the incoming data cannot be accessed without switching to the
ULP first. DDP provides a standardized 'packing list' which can be
interpreted without requiring ULP interaction. Indeed, it is designed
to be
Indeed, it is designed to be 300 implementable in hardware. 302 4.1. Order Independent Reception 304 Tagged messages are directed to a buffer based on an included 305 Steering Tag. Additionally, no notice is provided to the ULP for each 306 individual Tagged Message's arrival. Together these allow tagged 307 messages received out-of-order to be processed without intermediate 308 buffering or additional notifications to the ULP. 310 4.2. Reduced ULP Notifications 312 RDMAP offers both tagged and untagged messages. No receiving side 313 ULP interactions are required for tagged messages. By optimally 314 dividing traffic between tagged and untagged messages the ULP can 315 limit the number of events that must be dealt with at the ULP layer. 316 This typically reduces the number of context switches required and 317 improves performance. 319 RDMAP further reduces required ULP interactions consolidating 320 completion notifications of tagged messages with the completion 321 notification of a trailing untagged message. For most ULPs this 322 radically reduces the number of ULP required interactions even 323 further. 325 While RDMAP consolidation of notices is beneficial to most 326 applications, it may be detrimental to some applications that benefit 327 from streamed delivery to enable ULP processing of received data as 328 promptly as possible. A ULP that uses RDMAP cannot begin processing 329 any portion of an exchange until it receives notification that the 330 entire exchange has been placed. An "exchange" here is a set of zero 331 or more tagged messages and a single terminating untagged message. 332 An application that would prefer to begin work on the received 333 payload, no matter what order it arrived in, as soon as possible 334 might prefer to work directly with the LLP. RDMAP is optimized for 335 applications that are more concerned when the entire exchange is 336 complete. 

An application that benefits from being able to begin processing of
each received packet as quickly as possible may find RDMAP interferes
with that goal.

Such an application might be able to retain most of the benefits of
RDMAP by using the DDP layer directly. However, in addition to
taking on the responsibilities of the RDMAP layer, the application
would likely have more difficulty finding support for a DDP-only API.
Many hardware implementations may choose to tightly couple RDMAP and
DDP, and might not provide an API directly to DDP services.

These features minimize the required interactions with the ULP. This
can be extremely beneficial for applications that use multiple
transport layer packets to accomplish what is a single ULP
interaction.

4.3.  Simplified ULP Exchanges

The notification rules for Tagged Messages allow ULPs to create
multi-message "exchanges" consisting of zero or more tagged messages
and a single untagged message that together represent a single step
in the ULP interaction. The receiving ULP is notified that the
untagged message has arrived, and implicitly of any associated tagged
messages.

A ULP where all exchanges would naturally be only the untagged
message would derive virtually no benefit from the use of RDMAP/DDP
as opposed to SCTP. But while tagged buffers are the justification
for RDMAP/DDP, untagged buffers are still necessary. Without
untagged buffers the only method to exchange buffer advertisements
would involve out-of-band communications and/or sharing of compile
time constants. Most RDMA-aware ULPs use untagged buffers for
requests and responses. Buffer advertisements are typically done
within these untagged messages.

More importantly, there would be no reliable method for the upper
layer peers to synchronize.
The absence of any guarantees about
ordering within or between tagged messages is fundamental to allowing
the DDP layer to optimize transfer of tagged payload.

So no ULP can be defined entirely in terms of tagged messages.
Eventually a notification that confirms delivery must be generated
from the RDMAP/DDP layer.

Limiting use of untagged buffers to requests and responses by moving
all bulk data using tagged transfers can greatly reduce the amount
of prediction that the Data Sink must perform in pre-posting receive
buffers. For example, a typical RDMA enabled interaction would
consist of the following:

   The Client sends a transaction request to the Server as an
   untagged message.

   This message includes buffer advertisements for the buffers where
   the results are to be placed.

   The Server sends multiple tagged messages to the advertised
   buffers.

   The Server sends a transaction reply as an untagged message to the
   Client.

   The Client receives a single notification, indicating completion
   of the interaction.

With this type of exchange the pacing and required size of untagged
buffers is highly predictable. The variability of response sizes is
absorbed by tagged transfers.

4.4.  Order Independent Sending

Use of tagged messages is especially applicable when the Data Sink
does not know the actual size, structure or location of the content
it is requesting (or updating).

For example, suppose the Data Sink ULP needs to fetch four related
pieces of data into four separate buffers. With SCTP the Data Sink
ULP could receive four messages into four separate buffers, only
having to predict the maximum size of each. However, it would have to
dictate the order in which the Data Source supplied the separate
pieces.
If the Data Source found it advantageous to fetch them in a
different order it would have to use intermediate buffering to
re-order the pieces into the expected order even though the
application only required that all four be delivered and did not
truly have an ordering requirement.

Techniques such as RAID striping and mirroring represent this same
problem, but one step further. What appears to be a single resource
to the Data Sink is actually stored in separate locations by the Data
Source. Non-RDMA protocols would either require the Data Source to
fetch the material in the desired order or force the Data Source to
use its own holding buffers to assemble an image of the destination
buffer.

While sometimes referred to as a "buffer-to-buffer" solution, RDMA
more fundamentally enables remote buffer access. The ULP is free to
work with larger remote buffers than it has locally. This reduces
buffering requirements and the number of times the data must be
copied in an end-to-end transfer.

There are numerous reasons why the Data Sink would not know the true
order or location of the requested data. It could be different for
each client, different records selected and/or different sort orders,
RAID striping, file fragmentation, volume fragmentation, volume
mirroring and server-side dynamic compositing of content (such as
server side includes for HTTP).

In all of these cases the Data Source is free to assemble the desired
data in the Data Sink's buffer in whatever order the component data
becomes available to it. It is not constrained on ordering. It does
not have to assemble an image in its own memory before creating it in
the Data Sink's buffers.

Note that while DDP enables use of tagged messages for bulk transfer,
there are some application scenarios where untagged messages would
still be used for bulk transfer.
For example, a file server may not
expose its own memory to its clients. A client wishing to write may
advertise a buffer upon which the server will issue RDMA Reads.
However, when performing a small write it may be preferable to
include the data in the untagged message rather than incurring an
additional round trip with the RDMA Read and its response.

Generally, the best use of an untagged message is to synchronize and
to deliver data that is naturally tied to the same message as the
synchronization. For initial data transfers this has the additional
benefit of avoiding the need to advertise specific tagged buffers for
indefinite time periods. Instead anonymous buffers can be used for
initial data reception. Because anonymous buffers do not need to be
tied to specific messages in advance this can be a major benefit.

4.5.  Untagged Messages and Tagged Buffers as ULP Credits

The handling of end-to-end buffer credits differs considerably
between DDP and direct ULP use of either TCP or SCTP.

With both TCP and SCTP, buffer credits are based upon the receiver
granting transmit permission based on the total number of bytes.
These credits reflect system buffering resources and/or simple flow
control. They do not represent ULP resources.

DDP defines no standard flow control, but presumes the existence of a
ULP mechanism. The presumed mechanism is that the Data Sink ULP has
issued credits to the Data Source allowing the Data Source to send a
specific number of untagged messages.

The ULP peers must ensure that the sender is aware of the maximum
size that can be sent to any specific target buffer. One method of
doing so is to use a standard size for all untagged buffers within a
given connection.
For example, a ULP may specify an initial untagged
buffer size to be used immediately after session establishment, and
then optionally specify mechanisms for negotiating changes.

Tagged buffers are ULP resources advertised directly from ULP to ULP.
A DDP put to a known tagged buffer is constrained only by transport
level flow control, not by available system buffering.

Either tagged or untagged buffers allow bypassing of system buffer
resources. Use of tagged buffers additionally allows the Data Source
to choose what order to exercise the credits in.

To the extent allowed by the ULP, tagged buffers are also divisible
resources. The Data Sink can advertise a single 100 KB buffer, and
then receive notifications from its peer that it had written 50 KB,
20 KB and 30 KB to that buffer in three successive transactions.

ULP-management of tagged buffer resources, independent of transport
and DDP layer credits, is an additional benefit of RDMA protocols.
Large bulk transfers cannot be blocked by limited general purpose
buffering capacity. Applications can flow control based upon higher
level abstractions, such as number of outstanding requests,
independent of the amount of data that must be transferred.

However, use of system buffering, as offered by direct use of the
underlying transports, can be preferable under certain circumstances.

One example would be when the number of target ULP buffers is
sufficiently large, and the rate at which any writes arrive is
sufficiently low, that pinning all the target ULP buffers in memory
would be undesirable. The maximum transfer rate, and hence the
maximum amount of system buffering required, may be more stable and
predictable than the total ULP buffer exposure.

Another would be when the Data Sink wishes to receive a stream of
data at a predictable rate, but does not know in advance what the
size of each data packet will be.
This is common from streaming media that 522 has been encoded with a variable bit rate. With DDP the Data Sink 523 would either have to use untagged buffers large enough for the 524 largest packet, or advertise a circular buffer. If for security or 525 other reasons the Data Sink did not want the size of its buffer to be 526 publicly known, using the underlying SCTP transport directly may be 527 preferable because of their byte-oriented credits. 529 5. RDMA Read 531 RDMA Reads are a further service provided by RDMAP. RDMA Reads allow 532 the Data Sink to fetch exactly the portion of the peer ULP buffer 533 required on a "just in time" basis. This can be done without 534 requiring per-fetch support from the Data Source ULP. 536 Storage servers may wish to limit the maximum write buffer allocated 537 to any single session. The storage server may be a very minimal 538 layer between the client and the disk storage media, or the server 539 may merely wish to limit the total resources that would be required 540 if all clients could push the entire payload they wished written at 541 their own convenience. 543 In either case, there is little benefit in transferring data from the 544 Data Source far in advance of when it will be written to the 545 persistent storage media. RDMA Reads allow the Storage Server to 546 fetch the payload on a "just in time" basis. In this fashion a 547 relatively small number of block sized buffers can be used to execute 548 a single transaction that specified writing a large file, or a 549 Storage Server with numerous clients can fetch buffers from the 550 individual clients in the order that is most convenient to the 551 server. 553 This same capability can be used when the desired portion of the 554 advertised buffer is not known in advance. For example the 555 advertised buffer could contain performance statistics. 
The Data Sink could request the portions of the data it required,
without requiring an interaction with the Data Source ULP.

This is applicable for many applications that publish semi-volatile
data that does not require transactional validity checking (i.e.,
authorized users have read access to the entire set of data). It is
less applicable when there are ULP consistency checks that must be
performed upon the data. Such applications would be better served by
having the client send a request, and having the server use RDMA
Writes to publish the requested data. Neither RDMAP nor DDP provides
mechanisms for bundling multiple disjoint updates into an atomic
operation. Therefore use of an advertised buffer as a data resource
is subject to the same caveats as any randomly updated data resource,
such as a flat file, that does not enforce its own consistency.

6. LLP Comparisons

Normally the choice of underlying IP transport is irrelevant to the
ULP. RDMAP and DDP provide the same services over either. There
may be performance impacts of the choice, however. It is the
responsibility of the ULP to determine which IP transport is best
suited to its needs.

SCTP provides for preservation of message boundaries. Each DDP
segment will be delivered within a single SCTP packet. The
equivalent services are only available with TCP through the use of
the MPA (Marker PDU Aligned Framing) adaptation layer.

6.1. Multistreaming Implications

SCTP also provides multi-streaming. When the same pair of hosts
needs multiple DDP streams this can be a major advantage. A single
SCTP association carries multiple DDP streams, consolidating
connection setup, congestion control and acknowledgements.

Completions are controlled by the DDP Source Sequence Number
(DDP-SSN) on a per stream basis.
Therefore combining multiple DDP Streams into a single SCTP
association cannot result in a dropped packet carrying data for one
stream delaying completions on others.

6.2. Out of Order Reception Implications

The use of unordered Data Chunks with SCTP guarantees that the DDP
layer will be able to perform placements when IP datagrams are
received out of order.

Placement of out-of-order DDP Segments carried over MPA/TCP is not
guaranteed, but it is allowed. The ability of the MPA receiver to
process out-of-order DDP Segments may be impaired when alignment of
TCP segments and MPA FPDUs is lost. Using SCTP, each DDP Segment is
encoded in a single Data Chunk and never spread over multiple IP
datagrams.

6.3. Header and Marker Overhead

MPA and TCP headers together are smaller than the headers used by
SCTP and its adaptation layer. However, this advantage can be
reduced by the insertion of MPA markers. The difference in ULP
payload per IP Datagram is not likely to be a significant factor.

6.4. Middlebox Support

Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP
will appear to all network middleboxes as a normal TCP connection.
In many environments there may be a requirement to use only TCP
connections to satisfy existing network elements and/or to facilitate
monitoring and control of connections. While SCTP is certainly just
as monitorable and controllable as TCP, there is no guarantee that
the network management infrastructure has the required support for
both.

6.5. Processing Overhead

A DDP stream delivered via MPA/TCP will require more processing
effort than one delivered over SCTP. However, this extra work may be
justified for many deployments where full SCTP support is unavailable
in the endpoints of the network, or where middleboxes impair the
usability of SCTP.

6.6. Data Integrity Implications

Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c
protection, or its equivalent, against accidental data corruption.

A ULP that requires a greater degree of protection may add its own.
However, DDP and RDMAP headers will only be guaranteed to have the
equivalent of end-to-end CRC32c protection. A ULP that requires data
integrity checking more thorough than an end-to-end CRC32c should
first invalidate all STags that reference a buffer before applying
its own integrity check.

CRC32c only provides protection against random corruption. To
protect against unauthorized alteration or forging of data packets,
security methods must be applied. IPsec is supported for both SCTP
and MPA/TCP.

6.6.1. MPA/TCP Specifics

It is mandatory for MPA/TCP implementations to implement CRC32c, but
it is NOT mandatory to use the CRC32c during an RDMA connection. The
activating or deactivating of the CRC in MPA/TCP is an administrative
configuration operation at the local and remote end. The
administration of the CRC (ON/OFF) is invisible to the ULP.

Applications SHOULD trust that this administrative option will only
be used when the end-to-end protection is at least as effective as a
transport layer CRC32c. Applications SHOULD NOT apply additional
protection as a guard against this administrative option being turned
on inadvertently.

Administrators MUST NOT enable CRC32c suppression unless the
end-to-end protection is truly equivalent.

If the CRC is active/used for one direction/end, then the use of the
CRC is mandatory in both directions/ends.

If both ends have been configured NOT to use the CRC, then this is
allowed as long as an equivalent protection (comparable to or better
than the CRC) from undetected errors on the connection is provided.

6.6.2. SCTP Specifics

SCTP provides CRC32c protection automatically.
The adaptation to SCTP provides no option to suppress SCTP CRC32c
protection.

6.7. Non-IP Transports

DDP is defined to operate over ubiquitous IP transports such as SCTP
and TCP. This enables a new DDP-enabled node to be added anywhere to
an IP network. No DDP-specific support from middleboxes is required.

There are non-IP transport fabrics offering RDMA capabilities.
Because these capabilities are integrated with the transport protocol
they have some technical advantages when compared to RDMA over IP.
For example, fencing of RDMA operations can be based upon transport
level acks. Because DDP is cleanly layered over an IP transport, any
explicit RDMA layer ack must be separate from the transport layer
ack.

There may be deployments where the benefits of RDMA/transport
integration outweigh the benefits of being on an IP network.

6.7.1. No RDMA Layer Ack

DDP does not provide for its own acknowledgements. The only form of
ack provided at the RDMAP layer is an RDMA Read Response. DDP and
RDMAP rely almost entirely upon other layers for flow control and
pacing. The LLP is relied upon to guarantee delivery and avoid
network congestion, and ULP level acking is relied upon for ULP
pacing and to avoid ULP buffer overruns.

Previous RDMA protocols, such as InfiniBand, have been able to use
their integration with the transport layer to provide stronger
ordering guarantees. It is important that application designers who
require such guarantees provide them through ULP interaction.

Specifically:

   There is no ability for a local interface to "fence" outbound
   messages to guarantee that prior tagged messages have been placed
   prior to sending a tagged message. The only guarantees available
   from the other side would be an RDMA Read Response (coming from
   the RDMAP layer) or a response from the ULP layer.
   Remember that the normal ordering rules only guarantee when the
   Data Sink ULP will be notified of untagged messages; they do not
   control when data is placed into receive buffers.

   Re-use of tagged buffers must be done with extreme care. The fact
   that an untagged message indicates that all prior tagged messages
   have been placed does not guarantee that no later tagged message
   has. The best strategy is to only change the state of any given
   advertised buffer with untagged messages.

   As covered elsewhere in this document, flow control of untagged
   messages MUST be provided by the ULP itself.

6.8. Other IP Transports

Both TCP and SCTP provide DDP with reliable transport with
TCP-friendly rate control. As currently specified, DDP is defined to
work over reliable transports and implicitly relies upon some form of
rate control.

DDP is fully compatible with a non-reliable protocol. Out-of-order
placement is obviously not dependent on whether the other DDP
Segments ever actually arrive.

However, RDMAP requires the LLP to provide reliable service. An
alternate completion handling protocol would be required if DDP were
to be deployed over an unreliable IP transport.

As noted in the prior section on tagged buffers as ULP credits,
neither RDMAP nor DDP provides any flow control for tagged messages.
If no transport layer flow control is provided, an RDMAP/DDP
application would be limited only by the link layer rate, almost
inevitably resulting in severe network congestion.

RDMAP encourages applications to be ignorant of the underlying
transport PMTU. The ULP is only notified when all messages ending in
a single untagged message have completed. The ULP is not aware of
the granularity or ordering of the underlying messages. This approach
assumes that the ULP is only interested in the complete set of
messages, and has no use for a subset of them.

6.9. LLP Independent Session Establishment

For an RDMAP/DDP application, a pair of SCTP Streams and a TCP
connection both provide the same service (reliable delivery of DDP
Segments between two connected RDMAP/DDP endpoints).

6.9.1. RDMA-only Session Establishment

It is also possible to allow for transport neutral establishment of
RDMAP/DDP sessions between endpoints. Combined, these two features
would allow most applications to be unconcerned as to which LLP was
actually in use.

Specifically, the procedures for DDP Stream Session establishment
discussed in section 3 of the SCTP mapping, and section 13.3 of the
MPA/TCP mapping, both allow for the exchange of ULP specific data
("Private Data") before enabling the exchange of DDP Segments. This
delay can allow for proper selection and/or configuration of the
endpoints based upon the exchanged data. For example, each DDP
Stream Session associated with a single client session might be
assigned to the same DDP Protection Domain.

To be transport neutral, the applications should exchange Private
Data as part of session establishment messages to determine how the
RDMA endpoints are to be configured. One side must be the Initiator,
and the other the Responder.

With SCTP, a pair of SCTP streams can be used for sequential
sessions. With MPA/TCP each connection can be used for at most one
session. However, the same source/destination pair of ports can be
re-used sequentially subject to normal TCP rules.

Both SCTP and MPA limit the private data size to a maximum of 512
bytes.

MPA/TCP requires the end of the TCP connection that initiated the
conversion to MPA mode to send the first DDP Segment. SCTP does not
have this requirement. ULPs which wish to be transport neutral
should require the initiating end to send the first message.
A zero-length RDMA Write can be used for this purpose if the ULP
logic itself does not naturally support this restriction.

6.9.2. RDMA-Conditional Session Establishment

It is sometimes desirable for the active side of a session to connect
with the passive side before knowing whether the passive side
supports RDMA.

This style of session establishment can be supported with either TCP
or SCTP, but not as transparently as for RDMA-only sessions.
Pre-existing non-RDMA servers are also far more likely to be using
TCP than SCTP.

With TCP, a normal TCP connection is established. It is then used by
the ULP to determine whether or not to convert to MPA mode and use
RDMA. This will typically be integral with other session
establishment negotiations.

With SCTP, the establishment of an association tests whether RDMA is
supported. If not supported, the application simply requests the
association without the RDMA adaptation indication.

One key difference is that with SCTP the determination as to whether
the peer can support RDMA is made before the transport layer
association/connection is established, while with TCP the established
connection itself is used to determine whether RDMA is supported.

7. Local Interface Implications

Full utilization of DDP and RDMAP capabilities requires a local
interface that explicitly requests these services. Protocols such as
Sockets Direct Protocol (SDP) can allow applications to keep their
traditional byte-stream or message-stream interface and still enjoy
many of the benefits of the optimized wire level protocols.

8. IANA Considerations

There are no IANA considerations in this document.

9. Security Considerations

9.1. Connection/Association Setup

Both the SCTP and TCP adaptations allow for existing procedures to be
followed for the establishment of the SCTP association or TCP
connection.
Use of DDP does not impair the use of any security measures to
filter, validate and/or log the remote end of an
association/connection.

Authentication of peers and approval of connections is outside of the
scope of DDP. Connection authentication is the responsibility of the
ULP, which may be based upon information from the LLP. IPsec is
usable for both TCP and SCTP.

9.2. Tagged Buffer Exposure

DDP only exposes ULP memory to the extent explicitly allowed by ULP
actions. These include posting of receive operations and enabling of
Steering Tags.

DDP validates that STags are only used by the remote peer to the
extent authorized by the ULP. The STag selects from a pool of
buffers previously authorized by the ULP; an STag by itself does not
authorize access.

Use of randomization in generating STag values may be useful in
preventing 'off by one' and other programmatic errors, but is of
limited value in countering generation and misuse of STag values by
an active attacker. IPsec provides countermeasures that can prevent
such an unauthorized attacker from gaining access to buffers used by
DDP and RDMAP.

Neither RDMAP nor DDP places requirements on how ULPs advertise
buffers. A ULP may use a single Steering Tag for multiple buffer
advertisements. However, the ULP should be aware that enforcement on
STag usage is likely limited to the overall range that is enabled.
If the remote peer writes into the 'wrong' advertised buffer, neither
the DDP nor the RDMAP layer will be aware of this. Nor is there any
report to the ULP on how the remote peer specifically used tagged
buffers.
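
The range-only enforcement described above can be illustrated with a
small sketch. This is a toy model, not the DDP validation procedure:
the class name EnabledSTag, its permits() method, the STag value and
the buffer sizes are all hypothetical and chosen only for
illustration. It assumes a single STag that enables one contiguous
region which the ULP has subdivided into two advertised buffers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnabledSTag:
    """Hypothetical model of one enabled Steering Tag: a single byte range."""
    stag: int
    length: int  # size in bytes of the whole enabled region

    def permits(self, stag: int, offset: int, nbytes: int) -> bool:
        # Range check only: this layer knows the enabled region, not how
        # the ULP subdivided it into separately advertised buffers.
        return (stag == self.stag and nbytes >= 0
                and offset >= 0 and offset + nbytes <= self.length)


# Assumption: the ULP advertises two 4 KB buffers that share one STag.
region = EnabledSTag(stag=0x1234, length=8192)
buffer_a = (0, 4096)     # (offset, length) advertised for request A
buffer_b = (4096, 4096)  # (offset, length) advertised for request B

# A write intended for buffer A that lands in buffer B still passes.
wrong_buffer_write = region.permits(0x1234, offset=4096, nbytes=512)
# Only a write outside the overall enabled range is rejected.
out_of_range_write = region.permits(0x1234, offset=8000, nbytes=512)
```

In this model the misdirected write is indistinguishable from a
correct one, which is why a ULP that needs per-buffer enforcement has
to arrange for it itself, for example by not sharing an STag across
advertisements.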

Unless the ULP peers have an adequate basis for mutual trust, the
receiving ULP might be well advised to use a distinct STag for each
interaction, and to invalidate it after each use or to require its
peer to use the RDMAP option to invalidate the STag with its
responding untagged message.

9.3. Impact of Encrypted Transports

While DDP is cleanly layered over the LLP, its maximum benefit may be
limited when the LLP Stream is secured with a stream cipher, such
as Transport Layer Security (TLS). If the LLP must decrypt in order,
it cannot provide out-of-order DDP Segments to the DDP layer for
placement purposes. IPsec tunnel mode encrypts entire IP Datagrams.
IPsec transport mode encrypts TCP Segments or SCTP packets. In
neither case should IPsec preclude providing out-of-order DDP
Segments to the DDP layer for placement.

Note that end-to-end use of IPsec cryptographic integrity protection
may allow suppression of MPA CRC generation and checking under
certain circumstances. This is one example where the LLP may be
judged to have "or equivalent" protection to an end-to-end CRC32c.

10. References

10.1. Normative References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

[2] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
    RFC 2246, January 1999.

[3] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload
    (ESP)", RFC 2406, November 1998.

[4] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
    H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V.
    Paxson, "Stream Control Transmission Protocol", RFC 2960,
    October 2000.

[5] Coene, L., "Stream Control Transmission Protocol Applicability
    Statement", RFC 3257, April 2002.

[6] Recio, R., "An RDMA Protocol Specification",
    draft-ietf-rddp-rdmap-05 (work in progress), July 2005.

[7] Shah, H., "Direct Data Placement over Reliable Transports",
    draft-ietf-rddp-ddp-05 (work in progress), July 2005.

[8] Stewart, R., "Stream Control Transmission Protocol (SCTP)
    Remote Direct Memory Access (RDMA) Direct Data Placement (DDP)
    Adaptation", draft-ietf-rddp-sctp-02 (work in progress),
    August 2005.

[9] Pinkerton, J., "DDP/RDMAP Security",
    draft-ietf-rddp-security-08 (work in progress), March 2006.

[10] Culley, P., "Marker PDU Aligned Framing for TCP Specification",
     draft-ietf-rddp-mpa-02 (work in progress), February 2005.

10.2. Informative References

[11] Ko, M., "iSCSI Extensions for RDMA Specification",
     October 2005.

[12] Callaghan, B. and T. Talpey, "NFS Direct Data Placement",
     draft-ietf-nfsv4-nfsdirect-02 (work in progress), October 2005.

Authors' Addresses

Caitlin Bestler
Broadcom Corporation
16215 Alton Parkway
P.O. Box 57013
Irvine, CA 92619-7013
USA

Phone: 949-926-6383
Email: caitlinb@broadcom.com

Lode Coene
Siemens
Atealaan 26
Herentals, 2200
Belgium

Phone: +32-14-252081
Email: lode.coene@siemens.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the
Internet Society.