Remote Direct Data Placement                                  C. Bestler
Working group                                                   Broadcom
Internet-Draft                                                  L. Coene
Expires: June 8, 2006                                            Siemens
                                                        December 5, 2005


Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
                          Data Placement (DDP)
                  draft-ietf-rddp-applicability-05.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 8, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document describes the applicability of Remote Direct Memory
   Access Protocol (RDMAP) and the Direct Data Placement Protocol (DDP).
   It compares and contrasts the different transport options over IP
   that DDP can use, provides guidance to ULP developers on choosing
   between available transports and/or how to be indifferent to the
   specific transport layer used, compares use of DDP with direct use of
   the supporting transports, and compares DDP over IP transports with
   non-IP transports that support RDMA functionality.

Table of Contents

   1. Introduction
   2. Definitions
   3. Direct Placement
      3.1. Fewer Required ULP Interactions
      3.2. Direct Placement using only the LLP
   4. Tagged Messages
      4.1. Order Independent Reception
      4.2. Reduced ULP Notifications
      4.3. Simplified ULP Exchanges
      4.4. Order Independent Sending
      4.5. Untagged Messages and Tagged Buffers as ULP Credits
   5. RDMA Read
   6. LLP Comparisons
      6.1. Multistreaming Implications
      6.2. Out of Order Reception Implications
      6.3. Header and Marker Overhead
      6.4. Middlebox Support
      6.5. Processing Overhead
      6.6. Data Integrity Implications
         6.6.1. MPA/TCP Specifics
         6.6.2. SCTP Specifics
      6.7. Non-IP Transports
         6.7.1. No RDMA Layer Ack
      6.8. Other IP Transports
      6.9. LLP Independent Session Establishment
         6.9.1. RDMA-only Session Establishment
         6.9.2. RDMA-Conditional Session Establishment
   7. Local Interface Implications
   8. IANA Considerations
   9. Security considerations
      9.1. Connection/Association Setup
      9.2. Tagged Buffer Exposure
      9.3. Impact of Encrypted Transports
   10. References
      10.1. Normative references
      10.2. Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   Remote Direct Memory Access Protocol (RDMAP) and Direct Data
   Placement (DDP) work together to provide application independent
   efficient placement of application payload directly into buffers
   specified by the Upper Layer Protocol (ULP).

   The DDP protocol is responsible for direct placement of received
   payload into ULP specified buffers. The RDMAP protocol provides
   completion notifications to the ULP and support for Data Sink
   initiated fetch of advertised buffers (RDMA Reads).

   DDP and RDMAP are both application independent protocols which allow
   the ULP to perform remote direct data placement.
   DDP can use multiple standard IP transports including SCTP and TCP.

   By clarifying the situations where the functionality of these
   protocols is applicable, this document can guide implementers,
   application and protocol designers in selecting which protocols to
   use.

   The applicability of RDMAP/DDP is driven by their unique
   capabilities:

   o  The existence of an application independent protocol allows common
      solutions to be implemented in hardware and/or the kernel. This
      document will discuss when common data placement procedures are of
      the greatest benefit to applications as contrasted with
      application specific solutions built on top of direct use of the
      underlying transport.

   o  DDP supports both untagged and tagged buffers. Tagged buffers
      allow the Data Sink ULP to be indifferent to what order (or in
      what messages) the Data Source sent the data, or what order
      packets are received in. Typically tagged data can be used for
      payload transfer, while untagged is best used for control
      messages. However, each upper layer protocol can determine the
      optimal use of tagged and untagged messages for itself. This
      document will discuss when Data Source flexibility is of benefit
      to applications.

   o  RDMAP consolidates ULP notifications, thereby minimizing the
      number of required ULP interactions.

   o  RDMAP defines RDMA Reads, which allow remote access to advertised
      buffers. This document will review the advantages of using RDMA
      Reads as contrasted with alternative solutions.

   Some non-IP transports, such as InfiniBand, directly integrate RDMA
   features. This document will review the applicability of providing
   RDMA services over ubiquitous IP transports as opposed to the use of
   customized transport protocols. Because DDP is defined cleanly as a
   layer over existing IP transports, DDP has simpler ordering rules
   than some prior RDMA protocols.
   This may have some implications for application designers.

   The capabilities of DDP and RDMAP can be fully realized only by
   applications that are designed to exploit them. The co-existence of
   RDMAP/DDP aware local interfaces with traditional socket interfaces
   will also be explored.

   Finally, DDP support is defined for at least two IP transports: SCTP
   and TCP. The rationale for supporting both transports is reviewed,
   as well as when each would be the appropriate selection.

2. Definitions

   Advertisement - The act of informing a Remote Peer that a local RDMA
      Buffer is available to it. A Node makes available an RDMA Buffer
      for incoming RDMA Read or RDMA Write access by informing its
      RDMA/DDP peer of the Tagged Buffer identifiers (STag, base
      address, and buffer length). This advertisement of Tagged Buffer
      information is not defined by RDMA/DDP and is left to the ULP. A
      typical method would be for the Local Peer to embed the Tagged
      Buffer's Steering Tag, base address, and length in a Send Message
      destined for the Remote Peer.

   Data Sink - The peer receiving a data payload. Note that the Data
      Sink can be required to both send and receive RDMA/DDP Messages to
      transfer a data payload.

   Data Source - The peer sending a data payload. Note that the Data
      Source can be required to both send and receive RDMA/DDP Messages
      to transfer a data payload.

   Lower Layer Protocol (LLP) - The transport protocol that provides
      services to DDP. This is an IP transport with any required
      adaptation layer. Adaptation layers are defined for SCTP and TCP.

   Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
      valid as defined within a protocol specification.

   Tagged Message - A DDP message that is directed to a ULP specified
      buffer based upon embedded addressing information. In the
      immediate sense, the destination buffer is specified by the
      message sender.
      The message receiver is given no independent indication that a
      tagged message has been received.

   Untagged Message - A DDP message that is directed to a ULP specified
      buffer based upon a Message Sequence Number being matched with a
      receiver supplied buffer. The destination buffer is specified by
      the message receiver. The message receiver is notified by some
      mechanism that an untagged message has been received.

   Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
      In addition to protocols such as iSER [10] and NFSv4 over RDMA
      [11], the ULP may be embedded in an application, or a middleware
      layer as is often the case for the Sockets Direct Protocol (SDP)
      and Remote Procedure Call (RPC) protocols.

3. Direct Placement

   Direct Data Placement optimizes the placement of ULP payload into the
   correct destination buffers, typically eliminating intermediate
   copying. Placement is enabled without regard to order of arrival or
   order of transmission, and without requiring per-placement
   interaction with the ULP.

   RDMAP minimizes the required ULP interactions. This capability is
   most valuable for applications that require multiple transport layer
   packets for each required ULP interaction.

3.1. Fewer Required ULP Interactions

   While reducing the number of required ULP interactions is in itself
   desirable, it is critical for high speed connections. The burst
   packet rate for a high speed interface could easily exceed the host
   system's ability to switch ULP contexts.

   Content access applications are primary examples of applications with
   both high bandwidth and high content to required ULP interaction
   ratios. These applications include file access protocols (NAS),
   storage access (SAN), database access and other application specific
   forms of content access such as HTTP, XML and email.

3.2. Direct Placement using only the LLP

   Direct data placement can be achieved without RDMA. Pre-posting of
   receive buffers could allow a non-RDMA network stack to place data
   directly to user buffers.

   The degree to which DDP optimizes depends on which transport it is
   compared with, and on the nature of the local interface. Without
   RDMAP/DDP, pre-posting buffers requires the receiving side to
   accurately predict the required buffers and their sizes. This is not
   feasible for all ULPs. By contrast, DDP only requires the ULP to
   predict the sequence and size of incoming untagged messages.

   An application that could predict incoming messages and required
   nothing more than direct placement into buffers might be able to do
   so with a properly designed local interface to SCTP or TCP. Doing so
   for TCP requires making predictions at a byte level rather than a
   message level.

   The main benefit of DDP for such an application would be that
   pre-posting of receive buffers is a mandated local interface
   capability, and that predictions can be made on a per-message basis
   (not per byte).

   The Lower Layer Protocol, LLP, can also be used directly if ULP
   specific knowledge is built into the protocol stack to allow "parse
   and place" handling of received packets. Such a solution either
   requires interaction with the ULP, or that the protocol stack have
   knowledge of ULP specific syntax rules.

   DDP achieves the benefits of directly placing incoming payload
   without requiring tight coupling between the ULP and the protocol
   stack. However, "parse and place" capabilities can certainly provide
   equivalent services to a limited number of ULPs.

4. Tagged Messages

   This section covers the major benefits from the use of Tagged
   Messages.

   A more critical advantage of DDP is the ability of the Data Source to
   use tagged buffers.
   Tagging messages allows the Data Source to choose the ordering and
   packetization of its payload deliveries. With direct data placement
   based solely upon pre-posted receives, the packetization and delivery
   of payload must be agreed by the ULP peers in advance.

   The Upper Layer Protocol can allocate content between untagged and/or
   tagged messages to maximize the potential optimizations. Placing
   content within an untagged message can deliver the content in the
   same packet that signals completion to the receiver. This can
   improve latency. It can even eliminate round trips. But it requires
   making larger anonymous buffers available.

   Examples of data that typically belong in the untagged message
   include short fixed size control data that is inherently part of the
   control message; relatively short payload that is almost always
   needed, especially when including it would eliminate a round trip to
   fetch the data (for example, the initial data on a write request);
   and, of course, advertisements of tagged buffers that specify the
   location of data not in the untagged message.

   Tagged messages standardize direct placement of data without
   per-packet interaction with the upper layers. Even if there is an
   upper layer protocol encoding of what is being transferred, as is
   common with middleware solutions, this information is not understood
   at the application independent layers. The directions on where to
   place the incoming data cannot be accessed without switching to the
   ULP first. DDP provides a standardized 'packing list' which can be
   interpreted without requiring ULP interaction. Indeed, it is
   designed to be implementable in hardware.

4.1. Order Independent Reception

   Tagged messages are directed to a buffer based on an included
   Steering Tag.
   Additionally, no notice is provided to the ULP for each individual
   Tagged Message's arrival. Together these allow tagged messages
   received out-of-order to be processed without intermediate buffering
   or additional notifications to the ULP.

4.2. Reduced ULP Notifications

   RDMAP offers both tagged and untagged messages. No receiving side
   ULP interactions are required for tagged messages. By optimally
   dividing traffic between tagged and untagged messages the ULP can
   limit the number of events that must be dealt with at the ULP layer.
   This typically reduces the number of context switches required and
   improves performance.

   RDMAP further reduces required ULP interactions by consolidating
   completion notifications of tagged messages with the completion
   notification of a trailing untagged message. For most ULPs this
   radically reduces the number of required ULP interactions even
   further.

   While RDMAP consolidation of notices is beneficial to most
   applications, it may be detrimental to some applications that benefit
   from streamed delivery to enable ULP processing of received data as
   promptly as possible. A ULP that uses RDMAP cannot begin processing
   any portion of an exchange until it receives notification that the
   entire exchange has been placed. An "exchange" here is a set of zero
   or more tagged messages and a single terminating untagged message.
   An application that would prefer to begin work on the received
   payload, no matter what order it arrived in, as soon as possible
   might prefer to work directly with the LLP. RDMAP is optimized for
   applications that are more concerned with when the entire exchange is
   complete.

   An application that benefits from being able to begin processing of
   each received packet as quickly as possible may find RDMAP interferes
   with that goal.
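
   The consolidation rule can be made concrete with a toy model. The
   sketch below is not an RDMAP API; the class ToyReceiver and its
   methods are hypothetical names chosen for illustration. It assumes
   only what the text states: tagged messages are placed without
   notifying the ULP, and the terminating untagged message produces the
   single notification for the exchange.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReceiver:
    """Toy model of notification consolidation (hypothetical, not a real API)."""
    tagged_buffers: dict = field(default_factory=dict)  # STag -> bytearray
    notifications: list = field(default_factory=list)   # events the ULP sees

    def advertise(self, stag: int, size: int) -> None:
        # The ULP exposes a tagged buffer identified by an STag.
        self.tagged_buffers[stag] = bytearray(size)

    def on_tagged(self, stag: int, offset: int, payload: bytes) -> None:
        # Tagged payload is placed directly; the ULP is not notified.
        buf = self.tagged_buffers[stag]
        if offset < 0 or offset + len(payload) > len(buf):
            raise ValueError("write outside the advertised buffer")
        buf[offset:offset + len(payload)] = payload

    def on_untagged(self, payload: bytes) -> None:
        # The only event delivered to the ULP; it implicitly covers the
        # tagged messages that belong to the same exchange.
        self.notifications.append(payload)


rx = ToyReceiver()
rx.advertise(stag=7, size=12)
rx.on_tagged(7, 0, b"bulk")
rx.on_tagged(7, 4, b"data")
rx.on_tagged(7, 8, b"here")
rx.on_untagged(b"reply: done")
```

   Three tagged placements and one untagged message leave the ULP with a
   single entry in rx.notifications and a fully populated buffer. An
   application that wants to act on each placement as it arrives has no
   hook for doing so in this model.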

   Such an application might be able to retain most of the benefits of
   RDMAP by using the DDP layer directly. However, in addition to
   taking on the responsibilities of the RDMAP layer, the application
   would likely have more difficulty finding support for a DDP-only API.
   Many hardware implementations may choose to tightly couple RDMAP and
   DDP, and might not provide an API directly to DDP services.

   These features minimize the required interactions with the ULP. This
   can be extremely beneficial for applications that use multiple
   transport layer packets to accomplish what is a single ULP
   interaction.

4.3. Simplified ULP Exchanges

   The notification rules for Tagged Messages allow ULPs to create
   multi-message "exchanges", each consisting of zero or more tagged
   messages and a terminating untagged message, that represent a single
   step in the ULP interaction. The receiving ULP is notified that the
   untagged message has arrived, and implicitly of any associated tagged
   messages.

   A ULP whose exchanges would naturally consist of only the untagged
   message would derive virtually no benefit from the use of RDMAP/DDP
   as opposed to SCTP. But while tagged buffers are the justification
   for RDMAP/DDP, untagged buffers are still necessary. Without
   untagged buffers the only method to exchange buffer advertisements
   would involve out-of-band communications and/or sharing of compile
   time constants. Most RDMA-aware ULPs use untagged buffers for
   requests and responses. Buffer advertisements are typically done
   within these untagged messages.

   More importantly, there would be no reliable method for the upper
   layer peers to synchronize. The absence of any guarantees about
   ordering within or between tagged messages is fundamental to allowing
   the DDP layer to optimize transfer of tagged payload.

   So no ULP can be defined entirely in terms of tagged messages.
   Eventually a notification that confirms delivery must be generated
   from the RDMAP/DDP layer.

   Limiting use of untagged buffers to requests and responses by moving
   all bulk data using tagged transfers can greatly reduce the amount
   of prediction that the Data Sink must perform in pre-posting receive
   buffers. For example, a typical RDMA enabled interaction would
   consist of the following:

      The Client sends a transaction request to the Server as an
      untagged message.

      This message includes buffer advertisements for the buffers where
      the results are to be placed.

      The Server sends multiple tagged messages to the advertised
      buffers.

      The Server sends a transaction reply as an untagged message to the
      Client.

      The Client receives a single notification, indicating completion
      of the interaction.

   With this type of exchange the pacing and required size of untagged
   buffers is highly predictable. The variability of response sizes is
   absorbed by tagged transfers.

4.4. Order Independent Sending

   Use of tagged messages is especially applicable when the Data Sink
   does not know the actual size, structure or location of the content
   it is requesting (or updating).

   For example, suppose the Data Sink ULP needs to fetch four related
   pieces of data into four separate buffers. With SCTP the Data Sink
   ULP could receive four messages into four separate buffers, only
   having to predict the maximum size of each. However, it would have
   to dictate the order in which the Data Source supplied the separate
   pieces. If the Data Source found it advantageous to fetch them in a
   different order it would have to use intermediate buffering to
   re-order the pieces into the expected order even though the
   application only required that all four be delivered and did not
   truly have an ordering requirement.
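
   The four-buffer case can be sketched in a few lines. This is an
   illustrative model only: the STag values, the buffer contents, and
   the function run_transfer are assumptions, and a Python dictionary
   stands in for the Data Sink's advertised buffers. It shows that when
   each piece names its own destination, the final contents do not
   depend on the order in which the Data Source writes.

```python
def run_transfer(availability_order):
    """Write four pieces to four tagged buffers in the given order.

    Hypothetical model: keys are STags advertised by the Data Sink,
    and each write names its own destination buffer.
    """
    # Data Sink: four advertised buffers, one per requested piece.
    sink = {stag: bytearray(4) for stag in (1, 2, 3, 4)}
    # Data Source: the content destined for each advertised buffer.
    content = {1: b"name", 2: b"size", 3: b"date", 4: b"body"}
    for stag in availability_order:
        piece = content[stag]
        # No re-ordering buffer is needed on the Data Source.
        sink[stag][0:len(piece)] = piece
    return {stag: bytes(buf) for stag, buf in sink.items()}


requested_order = run_transfer([1, 2, 3, 4])
convenient_order = run_transfer([3, 1, 4, 2])
same_result = requested_order == convenient_order
```

   Both calls leave the four buffers in the same state. With in-order
   message delivery alone, the second ordering would force the Data
   Source to hold pieces back until their turn came.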

   Techniques such as RAID striping and mirroring represent this same
   problem, but one step further. What appears to be a single resource
   to the Data Sink is actually stored in separate locations by the Data
   Source. Non-RDMA protocols would either require the Data Source to
   fetch the material in the desired order or force the Data Source to
   use its own holding buffers to assemble an image of the destination
   buffer.

   While sometimes referred to as a "buffer-to-buffer" solution, RDMA
   more fundamentally enables remote buffer access. The ULP is free to
   work with larger remote buffers than it has locally. This reduces
   buffering requirements and the number of times the data must be
   copied in an end-to-end transfer.

   There are numerous reasons why the Data Sink would not know the true
   order or location of the requested data. It could be different for
   each client, different records selected and/or different sort orders,
   RAID striping, file fragmentation, volume fragmentation, volume
   mirroring and server-side dynamic compositing of content (such as
   server side includes for HTTP).

   In all of these cases the Data Source is free to assemble the desired
   data in the Data Sink's buffer in whatever order the component data
   becomes available to it. It is not constrained on ordering. It does
   not have to assemble an image in its own memory before creating it in
   the Data Sink's buffers.

   Note that while DDP enables use of tagged messages for bulk transfer,
   there are some application scenarios where untagged messages would
   still be used for bulk transfer. For example, a file server may not
   expose its own memory to its clients. A client wishing to write may
   advertise a buffer which the server will issue RDMA Reads upon.
   However, when performing a small write it may be preferable to
   include the data in the untagged message rather than incurring an
   additional round trip with the RDMA Read and its response.

   Generally, the best use of an untagged message is to synchronize and
   to deliver data that is naturally tied to the same message as the
   synchronization. For initial data transfers this has the additional
   benefit of avoiding the need to advertise specific tagged buffers for
   indefinite time periods. Instead anonymous buffers can be used for
   initial data reception. Because anonymous buffers do not need to be
   tied to specific messages in advance, this can be a major benefit.

4.5. Untagged Messages and Tagged Buffers as ULP Credits

   The handling of end-to-end buffer credits differs considerably
   between DDP and direct ULP use of either TCP or SCTP.

   With both TCP and SCTP, buffer credits are based upon the receiver
   granting transmit permission based on the total number of bytes.
   These credits reflect system buffering resources and/or simple flow
   control. They do not represent ULP resources.

   DDP defines no standard flow control, but presumes the existence of a
   ULP mechanism. The presumed mechanism is that the Data Sink ULP has
   issued credits to the Data Source allowing the Data Source to send a
   specific number of untagged messages.

   The ULP peers must ensure that the sender is aware of the maximum
   size that can be sent to any specific target buffer. One method of
   doing so is to use a standard size for all untagged buffers within a
   given connection. For example, a ULP may specify an initial untagged
   buffer size to be used immediately after session establishment, and
   then optionally specify mechanisms for negotiating changes.

   Tagged buffers are ULP resources advertised directly from ULP to ULP.
   A DDP put to a known tagged buffer is constrained only by transport
   level flow control, not by available system buffering.

   Either tagged or untagged buffers allow bypassing of system buffer
   resources. Use of tagged buffers additionally allows the Data Source
   to choose what order to exercise the credits in.

   To the extent allowed by the ULP, tagged buffers are also divisible
   resources. The Data Sink can advertise a single 100 KB buffer, and
   then receive notifications from its peer that it had written 50 KB,
   20 KB and 30 KB to that buffer in three successive transactions.

   ULP-management of tagged buffer resources, independent of transport
   and DDP layer credits, is an additional benefit of RDMA protocols.
   Large bulk transfers cannot be blocked by limited general purpose
   buffering capacity. Applications can flow control based upon higher
   level abstractions, such as number of outstanding requests,
   independent of the amount of data that must be transferred.

   However, use of system buffering, as offered by direct use of the
   underlying transports, can be preferable under certain circumstances.

   One example would be when the number of target ULP buffers is
   sufficiently large, and the rate at which any writes arrive is
   sufficiently low, that pinning all the target ULP buffers in memory
   would be undesirable. The maximum transfer rate, and hence the
   maximum amount of system buffering required, may be more stable and
   predictable than the total ULP buffer exposure.

   Another would be when the Data Sink wishes to receive a stream of
   data at a predictable rate, but does not know in advance what the
   size of each data packet will be. This is common for streaming media
   that has been encoded with a variable bit rate. With DDP the Data
   Sink would either have to use untagged buffers large enough for the
   largest packet, or advertise a circular buffer.
   If for security or other reasons the Data Sink did not want the size
   of its buffer to be publicly known, using the underlying SCTP
   transport directly may be preferable because of its byte-oriented
   credits.

5. RDMA Read

   RDMA Reads are a further service provided by RDMAP. RDMA Reads allow
   the Data Sink to fetch exactly the portion of the peer ULP buffer
   required on a "just in time" basis. This can be done without
   requiring per-fetch support from the Data Source ULP.

   Storage servers may wish to limit the maximum write buffer allocated
   to any single session. The storage server may be a very minimal
   layer between the client and the disk storage media, or the server
   may merely wish to limit the total resources that would be required
   if all clients could push the entire payload they wished written at
   their own convenience.

   In either case, there is little benefit in transferring data from the
   Data Source far in advance of when it will be written to the
   persistent storage media. RDMA Reads allow the Storage Server to
   fetch the payload on a "just in time" basis. In this fashion a
   relatively small number of block sized buffers can be used to execute
   a single transaction that specified writing a large file, or a
   Storage Server with numerous clients can fetch buffers from the
   individual clients in the order that is most convenient to the
   server.

   This same capability can be used when the desired portion of the
   advertised buffer is not known in advance. For example, the
   advertised buffer could contain performance statistics. The Data
   Sink could request the portions of the data it required, without
   requiring an interaction with the Data Source ULP.

   This is applicable for many applications that publish semi-volatile
   data that does not require transactional validity checking (i.e.,
   authorized users have read access to the entire set of data).
It is 561 less applicable when there are ULP consistency checks that must be 562 performed upon the data. Such applications would be better served by 563 having the client send a request, and having the server use RDMA 564 Writes to publish the requested data. Neither RDMAP nor DDP provides 565 mechanisms for bundling multiple disjoint updates into an atomic 566 operation. Therefore, use of an advertised buffer as a data resource 567 is subject to the same caveats as any randomly updated data resource, 568 such as flat files, that do not enforce their own consistency. 570 6. LLP Comparisons 572 Normally the choice of underlying IP transport is irrelevant to the 573 ULP. RDMAP and DDP provide the same services over either. There 574 may be performance impacts of the choice, however. It is the 575 responsibility of the ULP to determine which IP transport is best 576 suited to its needs. 578 SCTP provides for preservation of message boundaries. Each DDP 579 segment will be delivered within a single SCTP packet. The 580 equivalent services are only available with TCP through the use of 581 the MPA (Marker PDU Alignment) adaptation layer. 583 6.1. Multistreaming Implications 585 SCTP also provides multi-streaming. When the same pair of hosts has 586 need for multiple DDP streams, this can be a major advantage. A 587 single SCTP association carries multiple DDP streams, consolidating 588 connection setup, congestion control and acknowledgements. 590 Completions are controlled by the DDP Source Sequence Number (DDP- 591 SSN) on a per stream basis. Therefore combining multiple DDP Streams 592 into a single SCTP association cannot result in a dropped packet 593 carrying data for one stream delaying completions on others. 595 6.2. Out of Order Reception Implications 597 The use of unordered Data Chunks with SCTP guarantees that the DDP 598 layer will be able to perform placements when IP datagrams are 599 received out of order.
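Out-of-order placement is possible because each tagged DDP Segment names its own destination. The following is a minimal sketch of that idea, not an implementation of DDP: the `TaggedSegment` and `DataSink` classes, their fields, and the `place_all` helper are hypothetical names introduced here for illustration, and the segment carries only a Steering Tag, an offset, and a payload rather than the real header layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedSegment:
    """Hypothetical, simplified tagged segment: placement fields only."""
    stag: int       # Steering Tag naming the advertised buffer
    offset: int     # byte offset within that buffer
    payload: bytes


class DataSink:
    """Toy Data Sink holding advertised (tagged) buffers keyed by STag."""

    def __init__(self):
        self.buffers = {}

    def advertise(self, stag, size):
        # Enable an STag over a zero-filled buffer of the given size.
        self.buffers[stag] = bytearray(size)

    def place(self, seg):
        # Placement needs only the segment itself, never its neighbours.
        buf = self.buffers[seg.stag]  # KeyError models an invalid STag
        end = seg.offset + len(seg.payload)
        if seg.offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the advertised buffer")
        buf[seg.offset:end] = seg.payload


def place_all(segments, stag=0x1234, size=12):
    """Place the segments in the order given and return the buffer."""
    sink = DataSink()
    sink.advertise(stag, size)
    for seg in segments:
        sink.place(seg)
    return bytes(sink.buffers[stag])


in_order = [
    TaggedSegment(0x1234, 0, b"AAAA"),
    TaggedSegment(0x1234, 4, b"BBBB"),
    TaggedSegment(0x1234, 8, b"CCCC"),
]
# Simulate IP datagrams arriving out of order.
reordered = [in_order[2], in_order[0], in_order[1]]

print(place_all(in_order) == place_all(reordered))
```

The sketch covers placement only. Notifying the ULP that a message has completed is a separate step that still follows the ordering rules of the stream.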
601 Placement of out-of-order DDP Segments carried over MPA/TCP is not 602 guaranteed, but certainly allowed. The ability of the MPA receiver 603 to process out-of-order DDP Segments may be impaired when alignment 604 of TCP segments and MPA FPDUs is lost. Using SCTP, each DDP Segment 605 is encoded in a single Data Chunk and never spread over multiple IP 606 datagrams. 608 6.3. Header and Marker Overhead 610 MPA and TCP headers together are smaller than the headers used by 611 SCTP and its adaptation layer. However, this advantage can be 612 reduced by the insertion of MPA markers. The difference in ULP 613 payload per IP Datagram is not likely to be a significant factor. 615 6.4. Middlebox Support 617 Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP 618 will appear to all network middleboxes as a normal TCP connection. 619 In many environments there may be a requirement to use only TCP 620 connections to satisfy existing network elements and/or to facilitate 621 monitoring and control of connections. While SCTP is certainly just 622 as monitorable and controllable as TCP, there is no guarantee that 623 the network management infrastructure has the required support for 624 both. 626 6.5. Processing Overhead 628 A DDP stream delivered via MPA/TCP will require more processing 629 effort than one delivered over SCTP. However, this extra work may be 630 justified for many deployments where full SCTP support is unavailable 631 in the endpoints of the network, or where middleboxes impair the 632 usability of SCTP. 634 6.6. Data Integrity Implications 636 Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c 637 protection against data corruption, or its equivalent. 639 A ULP that requires a greater degree of protection may add its own. 640 However, DDP and RDMAP headers will only be guaranteed to have the 641 equivalent of end-to-end CRC32c protection.
A ULP that requires data 642 integrity checking more thorough than an end-to-end CRC32c should 643 first invalidate all STags that reference a buffer before applying 644 its own integrity check. 646 6.6.1. MPA/TCP Specifics 648 It is mandatory for MPA/TCP implementations to implement CRC32c, but 649 it is NOT mandatory to use the CRC32c during an RDMA connection. The 650 activating or deactivating of the CRC in MPA/TCP is an administrative 651 configuration operation at the local and remote end. The 652 administration of the CRC (ON/OFF) is invisible to the ULP. 654 Applications SHOULD trust that this administrative option will only 655 be used when the end-to-end protection is at least as effective as a 656 transport layer CRC32c. Applications SHOULD NOT apply additional 657 protection as a guard against this administrative option being turned 658 on inadvertently. 660 Administrators MUST NOT enable CRC32c suppression unless the end-to- 661 end protection is truly equivalent. 663 If the CRC is active/used for one direction/end, then the use of the 664 CRC is mandatory in both directions/ends. 666 If both ends have been configured NOT to use the CRC, then this is 667 allowed as long as an equivalent protection (comparable to or better 668 than CRC) from undetected errors on the connection is provided. 670 6.6.2. SCTP Specifics 672 SCTP provides CRC32c protection automatically. The adaptation to 673 SCTP provides for no option to suppress SCTP CRC32c protection. 675 6.7. Non-IP Transports 677 DDP is defined to operate over ubiquitous IP transports such as SCTP 678 and TCP. This enables a new DDP-enabled node to be added anywhere to 679 an IP network. No DDP-specific support from middle-boxes is 680 required. 682 There are non-IP transport fabrics offering RDMA capabilities. 683 Because these capabilities are integrated with the transport protocol 684 they have some technical advantages when compared to RDMA over IP.
685 For example, fencing of RDMA operations can be based upon transport 686 level acks. Because DDP is cleanly layered over an IP transport, any 687 explicit RDMA layer ack must be separate from the transport layer 688 ack. 690 There may be deployments where the benefits of RDMA/transport 691 integration outweigh the benefits of being on an IP network. 693 6.7.1. No RDMA Layer Ack 695 DDP does not provide for its own acknowledgements. The only form of 696 ack provided at the RDMAP layer is an RDMA Read Response. DDP and 697 RDMAP rely almost entirely upon other layers for flow control and 698 pacing. The LLP is relied upon to guarantee delivery and avoid 699 network congestion, and ULP level acking is relied upon for ULP 700 pacing and to avoid ULP buffer overruns. 702 Previous RDMA protocols, such as InfiniBand, have been able to use 703 their integration with the transport layer to provide stronger 704 ordering guarantees. It is important that application designers who 705 require such guarantees provide them through ULP interaction. 707 Specifically: 709 There is no ability for a local interface to "fence" outbound 710 messages to guarantee that prior tagged messages have been placed 711 prior to sending a tagged message. The only guarantees available 712 from the other side would be an RDMA Read Response (coming from 713 the RDMAP layer) or a response from the ULP layer. Remember that 714 the normal ordering rules only guarantee when the Data Sink ULP 715 will be notified of untagged messages; they do not control when 716 data is placed into receive buffers. 718 Re-use of tagged buffers must be done with extreme care. The fact 719 that an untagged message indicates that all prior tagged messages 720 have been placed does not guarantee that no later tagged messages 721 have. The best strategy is to only change the state of any given 722 advertised buffers with untagged messages.
724 As covered elsewhere in this document, flow control of untagged 725 messages MUST be provided by the ULP itself. 727 6.8. Other IP Transports 729 Both TCP and SCTP provide DDP with reliable transport with TCP 730 friendly rate control. As currently defined, DDP works over 731 reliable transports and implicitly relies upon some form of rate 732 control. 734 DDP is fully compatible with a non-reliable protocol. Out-of-order 735 placement is obviously not dependent on whether the other DDP 736 Segments ever actually arrive. 738 However, RDMAP requires the LLP to provide reliable service. An 739 alternate completion handling protocol would be required if DDP were 740 to be deployed over an unreliable IP transport. 742 As noted in the prior section on tagged buffers as ULP credits, 743 neither RDMAP nor DDP provides any flow control for tagged messages. 744 If no transport layer flow control is provided, an RDMAP/DDP 745 application would be limited only by the link layer rate, almost 746 inevitably resulting in severe network congestion. 748 RDMAP encourages applications to be ignorant of the underlying 749 transport PMTU. The ULP is only notified when all messages ending in 750 a single untagged message have completed. The ULP is not aware of 751 the granularity or ordering of the underlying message. This approach 752 assumes that the ULP is only interested in the complete set of 753 messages, and has no use for a subset of them. 755 6.9. LLP Independent Session Establishment 757 For an RDMAP/DDP application, the transport services provided by a 758 pair of SCTP Streams and by a TCP connection both provide the same 759 service (reliable delivery of DDP Segments between two connected 760 RDMAP/DDP endpoints). 762 6.9.1. RDMA-only Session Establishment 764 It is also possible to allow for transport neutral establishment of 765 RDMAP/DDP sessions between endpoints.
Combined, these two features 766 would allow most applications to be unconcerned as to which LLP was 767 actually in use. 769 Specifically, the procedures for DDP Stream Session establishment 770 discussed in section 3 of the SCTP mapping, and section 13.3 of the 771 MPA/TCP mapping, both allow for the exchange of ULP specific data 772 ("Private Data") before enabling the exchange of DDP Segments. This 773 delay can allow for proper selection and/or configuration of the 774 endpoints based upon the exchanged data. For example, each DDP 775 Stream Session associated with a single client session might be 776 assigned to the same DDP Protection Domain. 778 To be transport neutral, the applications should exchange Private 779 Data as part of session establishment messages to determine how the 780 RDMA endpoints are to be configured. One side must be the Initiator, 781 and the other the Responder. 783 With SCTP, a pair of SCTP streams can be used for sequential 784 sessions. With MPA/TCP each connection can be used for at most one 785 session. However, the same source/destination pair of ports can be 786 re-used sequentially subject to normal TCP rules. 788 Both SCTP and MPA limit the private data size to a maximum of 512 789 bytes. 791 MPA/TCP requires the end of the TCP connection that initiated the 792 conversion to MPA mode to send the first DDP Segment. SCTP does not 793 have this requirement. ULPs which wish to be transport neutral 794 should require the initiating end to send the first message. A zero- 795 length RDMA Write can be used for this purpose if the ULP logic 796 itself does not naturally support this restriction. 798 6.9.2. RDMA-Conditional Session Establishment 800 It is sometimes desirable for the active side of a session to connect 801 with the passive side before knowing whether the passive side 802 supports RDMA. 804 This style of session establishment can be supported with either TCP 805 or SCTP, but not as transparently as for RDMA-only sessions.
Pre- 806 existing non-RDMA servers are also far more likely to be using TCP 807 than SCTP. 809 With TCP, a normal TCP connection is established. It is then used by 810 the ULP to determine whether or not to convert to MPA mode and use 811 RDMA. This will typically be integral with other session 812 establishment negotiations. 814 With SCTP, the establishment of an association tests whether RDMA is 815 supported. If not supported, the application simply requests the 816 association without the RDMA adaptation indication. 818 One key difference is that with SCTP the determination as to whether 819 the peer can support RDMA is made before the transport layer 820 association/connection is established, while with TCP the established 821 connection itself is used to determine whether RDMA is supported. 823 7. Local Interface Implications 825 Full utilization of DDP and RDMAP capabilities requires a local 826 interface that explicitly requests these services. Protocols such as 827 Sockets Direct Protocol (SDP) can allow applications to keep their 828 traditional byte-stream or message-stream interface and still enjoy 829 many of the benefits of the optimized wire level protocols. 831 8. IANA Considerations 833 There are no IANA considerations in this document. 835 9. Security considerations 837 9.1. Connection/Association Setup 839 Both the SCTP and TCP adaptations allow for existing procedures to be 840 followed for the establishment of the SCTP association or TCP 841 connection. Use of DDP does not impair the use of any security 842 measures to filter, validate and/or log the remote end of an 843 association/connection. 845 9.2. Tagged Buffer Exposure 847 DDP only exposes ULP memory to the extent explicitly allowed by ULP 848 actions. These include posting of receive operations and enabling of 849 Steering Tags. 851 Neither RDMAP nor DDP places requirements on how ULPs advertise 852 buffers. A ULP may use a single Steering Tag for multiple buffer 853 advertisements.
However, the ULP should be aware that enforcement of 854 STag usage is likely limited to the overall range that is enabled. 855 If the remote peer writes into the 'wrong' advertised buffer, neither 856 the DDP nor the RDMAP layer will be aware of this. Nor is there any 857 report to the ULP on how the remote peer specifically used tagged 858 buffers. 860 Unless the ULP peers have an adequate basis for mutual trust, the 861 receiving ULP might be well advised to use a distinct STag for each 862 interaction, and to invalidate it after each use or to require its 863 peer to use the RDMAP option to invalidate the STag with its 864 responding untagged message. 866 9.3. Impact of Encrypted Transports 868 While DDP is cleanly layered over the LLP, its maximum benefit may be 869 limited when the LLP Stream is secured with a streaming cipher, such 870 as Transport Layer Security (TLS). If the LLP must decrypt in order, 871 it cannot provide out-of-order DDP Segments to the DDP layer for 872 placement purposes. IPsec tunnel mode encrypts entire IP Datagrams. 873 IPsec transport mode encrypts TCP Segments or SCTP packets. In 874 neither case should IPsec preclude providing out-of-order DDP 875 Segments to the DDP layer for placement. 877 Note that end-to-end use of IPsec cryptographic integrity protection 878 may allow suppression of MPA CRC generation and checking under 879 certain circumstances. This is one example where the LLP may be 880 judged to have "or equivalent" protection to an end-to-end CRC32c. 882 10. References 884 10.1. Normative references 886 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 887 Levels", BCP 14, RFC 2119, March 1997. 889 [2] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0", 890 RFC 2246, January 1999. 892 [3] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload 893 (ESP)", RFC 2406, November 1998.
895 [4] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, 896 H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, 897 "Stream Control Transmission Protocol", RFC 2960, October 2000. 899 [5] Coene, L., "Stream Control Transmission Protocol Applicability 900 Statement", RFC 3257, April 2002. 902 [6] Recio, R., "An RDMA Protocol Specification", 903 draft-ietf-rddp-rdmap-05 (work in progress), July 2005. 905 [7] Shah, H., "Direct Data Placement over Reliable Transports", 906 draft-ietf-rddp-ddp-05 (work in progress), July 2005. 908 [8] Stewart, R., "Stream Control Transmission Protocol (SCTP) Remote 909 Direct Memory Access (RDMA) Direct Data Placement (DDP) 910 Adaptation", draft-ietf-rddp-sctp-02 (work in progress), 911 August 2005. 913 [9] Culley, P., "Marker PDU Aligned Framing for TCP Specification", 914 draft-ietf-rddp-mpa-02 (work in progress), February 2005. 916 10.2. Informative References 918 [10] Ko, M., "iSCSI Extensions for RDMA Specification", 919 October 2005. 921 [11] Callaghan, B. and T. Talpey, "NFS Direct Data Placement", 922 draft-ietf-nfsv4-nfsdirect-02 (work in progress), October 2005. 924 Authors' Addresses 926 Caitlin Bestler 927 Broadcom 928 49 Discovery 929 Irvine, CA 92618 930 USA 932 Phone: 949-926-6383 933 Email: caitlinb@broadcom.com 935 Lode Coene 936 Siemens 937 Atealaan 26 938 Herentals, 2200 939 Belgium 941 Phone: +32-14-252081 942 Email: lode.coene@siemens.com 944 Intellectual Property Statement 946 The IETF takes no position regarding the validity or scope of any 947 Intellectual Property Rights or other rights that might be claimed to 948 pertain to the implementation or use of the technology described in 949 this document or the extent to which any license under such rights 950 might or might not be available; nor does it represent that it has 951 made any independent effort to identify any such rights.
Information 952 on the procedures with respect to rights in RFC documents can be 953 found in BCP 78 and BCP 79. 955 Copies of IPR disclosures made to the IETF Secretariat and any 956 assurances of licenses to be made available, or the result of an 957 attempt made to obtain a general license or permission for the use of 958 such proprietary rights by implementers or users of this 959 specification can be obtained from the IETF on-line IPR repository at 960 http://www.ietf.org/ipr. 962 The IETF invites any interested party to bring to its attention any 963 copyrights, patents or patent applications, or other proprietary 964 rights that may cover technology that may be required to implement 965 this standard. Please address the information to the IETF at 966 ietf-ipr@ietf.org. 968 Disclaimer of Validity 970 This document and the information contained herein are provided on an 971 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 972 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 973 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 974 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 975 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 976 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 978 Copyright Statement 980 Copyright (C) The Internet Society (2005). This document is subject 981 to the rights, licenses and restrictions contained in BCP 78, and 982 except as set forth therein, the authors retain all their rights. 984 Acknowledgment 986 Funding for the RFC Editor function is currently provided by the 987 Internet Society.