Remote Direct Data Placement                                  C. Bestler
Working group                                                   Broadcom
Internet-Draft                                                  L. Coene
Expires: April 14, 2006                                          Siemens
                                                        October 11, 2005


Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
                          Data Placement (DDP)
                  draft-ietf-rddp-applicability-04.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 14, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document describes the applicability of Remote Direct Memory
   Access Protocol (RDMAP) and the Direct Data Placement Protocol (DDP).
   It compares and contrasts the different transport options over IP
   that DDP can use, provides guidance to ULP developers on choosing
   between available transports and/or how to be indifferent to the
   specific transport layer used, compares use of DDP with direct use of
   the supporting transports, and compares DDP over IP transports with
   non-IP transports that support RDMA functionality.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Direct Placement
     3.1.  Fewer Required ULP Interactions
     3.2.  Direct Placement using only the LLP
   4.  Tagged Messages
     4.1.  Order Independent Reception
     4.2.  Reduced ULP Notifications
     4.3.  Simplified ULP Exchanges
     4.4.  Order Independent Sending
     4.5.  Tagged Buffers as ULP Credits
   5.  RDMA Read
   6.  LLP Comparisons
     6.1.  Multistreaming Implications
     6.2.  Out of Order Reception Implications
     6.3.  Header and Marker Overhead
     6.4.  Middlebox Support
     6.5.  Processing Overhead
     6.6.  Data Integrity Implications
       6.6.1.  MPA/TCP Specifics
       6.6.2.  SCTP Specifics
     6.7.  Non-IP Transports
       6.7.1.  No RDMA Layer Ack
     6.8.  Other IP Transports
     6.9.  LLP Independent Session Establishment
       6.9.1.  RDMA-only Session Establishment
       6.9.2.  RDMA-Conditional Session Establishment
   7.  Local Interface Implications
   8.  IANA Considerations
   9.  Security considerations
     9.1.  Connection/Association Setup
     9.2.  Tagged Buffer Exposure
     9.3.  Impact of Encrypted Transports
   10. Normative references
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Remote Direct Memory Access Protocol (RDMAP) and Direct Data
   Placement (DDP) work together to provide application-independent,
   efficient placement of application payload directly into buffers
   specified by the Upper Layer Protocol (ULP).

   The DDP protocol is responsible for direct placement of received
   payload into ULP specified buffers. The RDMAP protocol provides
   completion notifications to the ULP and support for Data Sink
   initiated fetch of advertised buffers (RDMA Reads).

   DDP and RDMAP are both application independent protocols which allow
   the ULP to perform remote direct data placement. DDP can use
   multiple standard IP transports including SCTP and TCP.

   By clarifying the situations where the functionality of these
   protocols is applicable, this document can guide implementers,
   application designers and protocol designers in selecting which
   protocols to use.

   The applicability of RDMAP/DDP is driven by their unique
   capabilities:

   o  The existence of an application independent protocol allows common
      solutions to be implemented in hardware and/or the kernel. This
      document will discuss when common data placement procedures are of
      the greatest benefit to applications as contrasted with
      application specific solutions built on top of direct use of the
      underlying transport.

   o  DDP supports both untagged and tagged buffers. Tagged buffers
      allow the Data Sink ULP to be indifferent to what order (or in
      what packets) the Data Source sent the data, or what order they
      are received in. This document will discuss when Data Source
      flexibility is of benefit to applications.

   o  RDMAP consolidates ULP notifications, thereby minimizing the
      number of required ULP interactions.

   o  RDMAP defines RDMA Reads, which allow remote access to advertised
      buffers. This document will review the advantages of using RDMA
      Reads as contrasted to alternate solutions.

   Some non-IP transports, such as InfiniBand, directly integrate RDMA
   features. This document will review the applicability of providing
   RDMA services over ubiquitous IP transports as opposed to the use of
   customized transport protocols. Because DDP is defined cleanly as a
   layer over existing IP transports, DDP has simpler ordering rules
   than some prior RDMA protocols. This may have some implications for
   application designers.

   The capabilities of DDP and RDMAP can only be fully realized by
   applications that are designed to exploit them. The co-existence of
   RDMAP/DDP aware local interfaces with traditional socket interfaces
   will also be explored.

   Finally, DDP support is defined for at least two IP transports: SCTP
   and TCP. The rationale for supporting both transports is reviewed,
   as well as when each would be the appropriate selection.

2.  Definitions

   Advertisement - The act of informing a Remote Peer that a local RDMA
      Buffer is available to it. A Node makes available an RDMA Buffer
      for incoming RDMA Read or RDMA Write access by informing its RDMA/
      DDP peer of the Tagged Buffer identifiers (STag, base address, and
      buffer length). This advertisement of Tagged Buffer information
      is not defined by RDMA/DDP and is left to the ULP. A typical
      method would be for the Local Peer to embed the Tagged Buffer's
      Steering Tag, base address, and length in a Send Message destined
      for the Remote Peer.

   Data Sink - The peer receiving a data payload. Note that the Data
      Sink can be required to both send and receive RDMA/DDP Messages to
      transfer a data payload.

   Data Source - The peer sending a data payload. Note that the Data
      Source can be required to both send and receive RDMA/DDP Messages
      to transfer a data payload.

   Lower Layer Protocol (LLP) - The transport protocol that provides
      services to DDP. This is an IP transport with any required
      adaptation layer. Adaptation layers are defined for SCTP and TCP.

   Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
      valid as defined within a protocol specification.

   Tagged Message - A DDP message that is directed to a ULP specified
      buffer based upon embedded addressing information. In the
      immediate sense, the destination buffer is specified by the
      message sender.

   Untagged Message - A DDP message that is directed to a ULP specified
      buffer based upon a Message Sequence Number being matched with a
      receiver supplied buffer. The destination buffer is specified by
      the message receiver.

   Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
      This may be an application, or a middleware layer such as Sockets
      Direct Protocol (SDP) or Remote Procedure Calls (RPC).

3.  Direct Placement

   Direct Data Placement optimizes the placement of ULP payload into the
   correct destination buffers, typically eliminating intermediate
   copying. Placement is enabled without regard to order of arrival or
   order of transmission, and without requiring per-placement
   interaction with the ULP.

   RDMAP minimizes the required ULP interactions. This capability is
   most valuable for applications that require multiple transport layer
   packets for each required ULP interaction.

3.1.  Fewer Required ULP Interactions

   While reducing the number of required ULP interactions is in itself
   desirable, it is critical for high speed connections. The burst
   packet rate for a high speed interface could easily exceed the host
   system's ability to switch ULP contexts.

   Content access applications are primary examples of applications with
   both high bandwidth and a high ratio of content to required ULP
   interactions. These applications include file access protocols
   (NAS), storage access (SAN), database access and other application
   specific forms of content access such as HTTP, XML and email.

3.2.  Direct Placement using only the LLP

   Direct data placement can be achieved without RDMA. Pre-posting of
   receive buffers could allow a non-RDMA network stack to place data
   directly to user buffers.

   The degree to which DDP optimizes depends on which transport it is
   being compared with, and on the nature of the local interface.
   Without RDMAP/DDP, pre-posting buffers requires the receiving side to
   accurately predict the required buffers and their sizes. This is not
   feasible for all ULPs. By contrast, DDP only requires the ULP to
   predict the sequence and size of incoming untagged messages.
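
   The per-message prediction described above can be sketched as
   follows. This is only an illustrative model, not an interface
   defined by DDP: the class UntaggedQueue and its methods are
   hypothetical names, and the sketch assumes that Message Sequence
   Numbers count pre-posted buffers starting at 1 and that each segment
   carries an offset within its message.

```python
class UntaggedQueue:
    """Hypothetical model of a receive queue for untagged messages."""

    def __init__(self):
        # Pre-posted receive buffers, kept in the order the ULP posted them.
        self.buffers = []

    def post_receive(self, size):
        """The ULP predicts only the size of the next untagged message."""
        self.buffers.append(bytearray(size))
        # Assumption: the Nth posted buffer matches MSN N (1-based).
        return len(self.buffers)

    def place(self, msn, offset, payload):
        """Place one segment of untagged message `msn` at `offset`."""
        if not 1 <= msn <= len(self.buffers):
            raise LookupError("no receive buffer posted for MSN %d" % msn)
        buf = self.buffers[msn - 1]
        end = offset + len(payload)
        if offset < 0 or end > len(buf):
            raise ValueError("segment does not fit the pre-posted buffer")
        buf[offset:end] = payload


# Two messages are predicted; their segments arrive out of order.
queue = UntaggedQueue()
queue.post_receive(8)            # will match MSN 1
queue.post_receive(4)            # will match MSN 2
queue.place(2, 0, b"pong")
queue.place(1, 4, b"tail")
queue.place(1, 0, b"head")
```

   The receiver never predicts byte positions in a stream. It supplies
   one buffer per expected message, and the sequence number selects the
   buffer regardless of the order in which segments arrive.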

   An application that could predict incoming messages and required
   nothing more than direct placement into buffers might be able to do
   so with a properly designed local interface to SCTP or TCP. Doing so
   for TCP requires making predictions at a byte level rather than a
   message level.

   The main benefit of DDP for such an application would be that
   pre-posting of receive buffers is a mandated local interface
   capability, and that predictions can be made on a per-message basis
   (not per byte).

   The LLP can also be used directly if ULP specific knowledge is built
   into the protocol stack to allow "parse and place" handling of
   received packets. Such a solution requires either interaction with
   the ULP, or that the protocol stack have knowledge of ULP specific
   syntax rules.

   DDP achieves the benefits of directly placing incoming payload
   without requiring tight coupling between the ULP and the protocol
   stack. However, "parse and place" capabilities can certainly provide
   equivalent services to a limited number of ULPs.

4.  Tagged Messages

   This section covers the major benefits from the use of Tagged
   Messages.

   A more critical advantage of DDP is the ability of the Data Source to
   use tagged buffers. Tagging messages allows the Data Source to
   choose the ordering and packetization of its payload deliveries.
   With direct data placement based solely upon pre-posted receives, the
   packetization and delivery of payload must be agreed upon by the ULP
   peers in advance. Even if there is an encoding of what is being
   transferred, as is common with middleware solutions, this information
   is not understood at the application independent layers. The
   directions on where to place the incoming data cannot be accessed
   without switching to the ULP first. DDP provides a standardized
   'packing list' which can be interpreted without requiring ULP
   interaction. Indeed, it is designed to be implementable in hardware.

4.1.  Order Independent Reception

   Tagged messages are directed to a buffer based on an included
   Steering Tag. Additionally, no notice is provided to the ULP for each
   individual Tagged Message's arrival. Together these allow tagged
   messages received out-of-order to be processed without intermediate
   buffering or additional notifications to the ULP.

4.2.  Reduced ULP Notifications

   RDMAP further reduces required ULP interactions by consolidating
   completion notifications of tagged messages with the completion
   notification of a trailing untagged message. For most ULPs this
   radically reduces the number of required ULP interactions even
   further.

   While RDMAP consolidation of notices is beneficial to most
   applications, it may be detrimental to some applications that benefit
   from streamed delivery to enable ULP processing of received data as
   promptly as possible. A ULP that uses RDMAP cannot begin processing
   any portion of an exchange until it receives notification that the
   entire exchange has been placed. An "exchange" here is a set of zero
   or more tagged messages and a single terminating untagged message.
   An application that would prefer to begin work on the received
   payload as soon as possible, no matter what order it arrived in,
   might prefer to work directly with the LLP. RDMAP is optimized for
   applications that are more concerned with when the entire exchange is
   complete.

   An application that benefits from being able to begin processing of
   each received packet as quickly as possible may find RDMAP interferes
   with that goal.

   Such an application might be able to retain most of the benefits of
   RDMAP by using the DDP layer directly. However, in addition to
   taking on the responsibilities of the RDMAP layer, the application
   would likely have more difficulty finding support for a DDP-only API.
   Many hardware implementations may choose to tightly couple RDMAP and
   DDP, and might not provide an API directly to DDP services.

   These features minimize the required interactions with the ULP. This
   can be extremely beneficial for applications that use multiple
   transport layer packets to accomplish what is a single ULP
   interaction.

4.3.  Simplified ULP Exchanges

   The notification rules for Tagged Messages allow ULPs to create
   multi-message "exchanges" consisting of zero or more tagged messages
   and a single terminating untagged message, which together represent a
   single step in the ULP interaction. The receiving ULP is notified
   that the untagged message has arrived, and implicitly of any
   associated tagged messages.

   A ULP where all exchanges would naturally be only the untagged
   message would derive virtually no benefit from the use of RDMAP/DDP
   as opposed to SCTP. But while tagged buffers are the justification
   for RDMAP/DDP, untagged buffers are still necessary. Without
   untagged buffers the only method to exchange buffer advertisements
   would involve out-of-band communications and/or sharing of compile
   time constants. Most RDMA-aware ULPs use untagged buffers for
   requests and responses. Buffer advertisements are typically done
   within these untagged messages.

   Limiting use of untagged buffers to requests and responses by moving
   all bulk data using tagged transfers can greatly simplify the amount
   of prediction that the Data Sink must perform in pre-posting receive
   buffers. For example, a typical RDMA enabled interaction would
   consist of the following:

      The Client sends a transaction request to the Server as an
      untagged message.

      This message includes buffer advertisements for the buffers where
      the results are to be placed.

      The Server sends multiple tagged messages to the advertised
      buffers.

      The Server sends a transaction reply as an untagged message to the
      Client.

      The Client receives a single notification, indicating completion
      of the interaction.

   With this type of exchange the pacing and required size of untagged
   buffers is highly predictable. The variability of response sizes is
   absorbed by tagged transfers.

4.4.  Order Independent Sending

   Use of tagged messages is especially applicable when the Data Sink
   does not know the actual size, structure or location of the content
   it is requesting (or updating).

   For example, suppose the Data Sink ULP needs to fetch four related
   pieces of data into four separate buffers. With SCTP the Data Sink
   ULP could receive four messages into four separate buffers, only
   having to predict the maximum size of each. However, it would have
   to dictate the order in which the Data Source supplied the separate
   pieces. If the Data Source found it advantageous to fetch them in a
   different order it would have to use intermediate buffering to
   re-order the pieces into the expected order, even though the
   application only required that all four be delivered and did not
   truly have an ordering requirement.

   Techniques such as RAID striping and mirroring represent this same
   problem, but one step further. What appears to be a single resource
   to the Data Sink is actually stored in separate locations by the Data
   Source. Non-RDMA protocols would either require the Data Source to
   fetch the material in the desired order or force the Data Source to
   use its own holding buffers to assemble an image of the destination
   buffer.

   While sometimes referred to as a "buffer-to-buffer" solution, RDMA
   more fundamentally enables remote buffer access. The ULP is free to
   work with larger remote buffers than it has locally. This reduces
   buffering requirements and the number of times the data must be
   copied in an end-to-end transfer.
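
   The order independence described in this section can be sketched as
   follows. This is an illustrative model under stated assumptions, not
   the DDP wire protocol: the DataSink class, its method names, the STag
   value and the piece contents are all hypothetical. The Data Sink
   advertises one tagged buffer, and the Data Source writes four pieces
   into it in whatever order they become available.

```python
class DataSink:
    """Hypothetical model of a Data Sink exposing tagged buffers."""

    def __init__(self):
        self.buffers = {}          # STag -> advertised buffer
        self.notifications = 0     # ULP notifications raised so far

    def advertise(self, stag, length):
        """Register a tagged buffer and return the advertisement tuple."""
        self.buffers[stag] = bytearray(length)
        return (stag, 0, length)   # (STag, base offset, length)

    def place_tagged(self, stag, offset, payload):
        """Place a tagged message; no ULP notification is generated."""
        buf = self.buffers[stag]
        end = offset + len(payload)
        if offset < 0 or end > len(buf):
            raise ValueError("write falls outside the advertised buffer")
        buf[offset:end] = payload

    def deliver_untagged(self, message):
        """The terminating untagged message raises the one notification."""
        self.notifications += 1
        return message


sink = DataSink()
stag, base, length = sink.advertise(stag=0x2A, length=16)

# Pieces keyed by their offset in the destination buffer.
pieces = {0: b"AAAA", 4: b"BBBB", 8: b"CCCC", 12: b"DDDD"}

# The Data Source chooses the order that is convenient for it.
for offset in (8, 0, 12, 4):
    sink.place_tagged(stag, base + offset, pieces[offset])

sink.deliver_untagged(b"reply: done")
result = bytes(sink.buffers[stag])
```

   The Data Source needs no holding buffer to re-order the pieces, and
   the Data Sink ULP is involved only once, when the untagged reply
   arrives.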

   There are numerous reasons why the Data Sink would not know the true
   order or location of the requested data. These include per-client
   differences, different record selections and/or sort orders, RAID
   striping, file fragmentation, volume fragmentation, volume mirroring
   and server-side dynamic compositing of content (such as server side
   includes for HTTP).

   In all of these cases the Data Source is free to assemble the desired
   data in the Data Sink's buffer in whatever order the component data
   becomes available to it. It is not constrained on ordering. It does
   not have to assemble an image in its own memory before creating it in
   the Data Sink's buffers.

   Note that while DDP enables use of tagged messages for bulk transfer,
   there are some application scenarios where untagged messages would
   still be used for bulk transfer. For example, under the Direct
   Access File Server (DAFS) protocol the file server does not expose
   its own memory to its clients. A client wishing to write may
   advertise a buffer which the server will issue RDMA Reads upon.
   However, when performing a small write it may be preferable to
   include the data in the untagged message rather than incurring an
   additional round trip with the RDMA Read and its response.

4.5.  Tagged Buffers as ULP Credits

   The handling of end-to-end buffer credits differs considerably
   between DDP and direct ULP use of either TCP or SCTP.

   With both TCP and SCTP, buffer credits are based upon the receiver
   granting transmit permission based on the total number of bytes.
   These credits reflect system buffering resources and/or simple flow
   control. They do not represent ULP resources.

   DDP defines no standard flow control, but presumes the existence of a
   ULP mechanism. The presumed mechanism is that the Data Sink ULP has
   issued credits to the Data Source allowing the Data Source to send a
   specific number of untagged messages.

   The ULP peers must ensure that the sender is aware of the maximum
   size that can be sent to any specific target buffer. One method of
   doing so is to use a standard size for all untagged buffers within a
   given connection. For example, DAFS specifies an initial size
   requirement for session establishment, during which the untagged
   buffer size for the remainder of the session is negotiated.

   Tagged buffers are ULP resources advertised directly from ULP to ULP.
   A DDP put to a known tagged buffer is constrained only by transport
   level flow control, not by available system buffering.

   Either tagged or untagged buffers allow bypassing of system buffer
   resources. Use of tagged buffers additionally allows the Data Source
   to choose what order to exercise the credits in.

   To the extent allowed by the ULP, tagged buffers are also divisible
   resources. The Data Sink can advertise a single 100 KB buffer, and
   then receive notifications from its peer that it had written 50 KB,
   20 KB and 30 KB to that buffer in three successive transactions.

   ULP-management of tagged buffer resources, independent of transport
   and DDP layer credits, is an additional benefit of RDMA protocols.
   Large bulk transfers cannot be blocked by limited general purpose
   buffering capacity. Applications can flow control based upon higher
   level abstractions, such as number of outstanding requests,
   independent of the amount of data that must be transferred.

   However, use of system buffering, as offered by direct use of the
   underlying transports, can be preferable under certain circumstances.
458 One example would be when the number of target ULP buffers is 459 sufficiently large, and the rate at which any writes arrive is 460 sufficiently low, that pinning all the target ULP buffers in memory 461 would be undesirable. The maximum transfer rate, and hence the 462 maximum amount of system buffering required, may be more stable and 463 predictable than the total ULP buffer exposure. 465 Another would be the Data Sink wishes to receive a stream of data at 466 a predictable rate, but does not know in advance what the size of 467 each data packet will be. This is common from streaming media that 468 has been encoded with a variable bit rate. With DDP the Data Sink 469 would either have to use untagged buffers large enough for the 470 largest packet, or advertise a circular buffer. If for security or 471 other reasons the Data Sink did not want the size of its buffer to be 472 publicly known, using the underlying SCTP transport directly may be 473 preferable because of their byte-oriented credits. 475 5. RDMA Read 477 RDMA Reads are a further service provided by RDMAP. RDMA Reads allow 478 the Data Sink to fetch exactly the portion of the peer ULP buffer 479 required on a "just in time" basis. This can be done without 480 requiring per-fetch support from the Data Source ULP. 482 Storage servers may wish to limit the maximum write buffer allocated 483 to any single session. The storage server may be a very minimal 484 layer between the client and the disk storage media, or the server 485 may merely wish to limit the total resources that would be required 486 if all clients could push the entire payload they wished written at 487 their own convenience. 489 In either case, there is little benefit in transferring data from the 490 Data Source far in advance of when it will be written to the 491 persistent storage media. RDMA Reads allow the Storage Server to 492 fetch the payload on a "just in time" basis. 
In this fashion a 493 relatively small number of block sized buffers can be used to execute 494 a single transaction that specified writing a large file, or a 495 Storage Server with numerous clients can fetch buffers from the 496 individual clients in the order that is most convenient to the 497 server. 499 This same capability can be used when the desired portion of the 500 advertised buffer is not known in advance. For example the 501 advertised buffer could contain performance statistics. The data 502 sink could request the portions of the data it required, without 503 requiring an interaction with the Data Source ULP. 505 This is applicable for many applications that publish semi-volatile 506 data that does not require transactional validity checking (i.e., 507 authorized users have read access to the entire set of data). It is 508 less applicable when there are ULP consistency checks that must be 509 performed upon the data. Such applications would be better served by 510 having the client send a request, and having the server use RDMA 511 Writes to publish the requested data. Neither RDMAP or DDP provide 512 mechanisms for bundling multiple disjoint updates into an atomic 513 operation. Therefore use of an advertised buffer as a data resource 514 is subject to the same caveats as any randomly updated data resource, 515 such as flat files, that do not enforce their own cosnsistency. 517 6. LLP Comparisons 519 Normally the choice of underlying IP transport is irrelevant to the 520 ULP. RDMAP and DDP provides the same services over either. There 521 may be performance impacts of the choice, however. It is the 522 responsibility of the ULP to determine which IP transport is best 523 suited to its needs. 525 SCTP provides for preservation of message boundaries. Each DDP 526 segment will be delivered within a single SCTP packet. The 527 equivalent services are only available with TCP through the use of 528 the MPA adaptation layer. 530 6.1. 
Multistreaming Implications 532 SCTP also provides multi-streaming. When the same pair of hosts have 533 need for multiple DDP streams this can be a major advantage. A 534 single SCTP association carries multiple DDP streams, consolidating 535 connection setup, congestion control and acknowledgements. 537 Completions are controlled by the DDP Source Sequence Number (DDP- 538 SSN) on a per stream basis. Therefore combining multiple DDP Streams 539 into a single SCTP association cannot result in a dropped packet 540 carrying data for one stream delaying completions on others. 542 6.2. Out of Order Reception Implications 544 The use of unordered Data Chunks with SCTP guarantees that the DDP 545 layer will be able to perform placements when IP datagrams are 546 received out of order. 548 Placement of out-of-order DDP Segments carried over MPA/TCP is not 549 guaranteed, but certainly allowed. The ability of the MPA receiver 550 to process out-of-order DDP Segments may be impaired when alignment 551 of TCP segments and MPA FPDUs is lost. Using SCTP, each DDP Segment 552 is encoded in a single Data Chunk and never spread over multiple IP 553 datagrams. 555 6.3. Header and Marker Overhead 557 MPA and TCP headers together are smaller than the headers used by 558 SCTP and its adaptation layer. However, this advantage can be 559 considerably reduced by the insertion of MPA markers. In any event 560 the different in ULP payload per IP Datagram is not likely to be a 561 signifigant factor. 563 6.4. Middlebox Support 565 Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP 566 will appear to all network middleboxes as a normal TCP connection. 567 In many environments there may be a requirement to use only TCP 568 connections to satisfy existing network elements and/or to facilitate 569 monitoring and control of connections. 
   While SCTP is certainly just as monitorable and controllable as TCP,
   there is no guarantee that the network management infrastructure has
   the required support for both.

6.5. Processing Overhead

   A DDP Stream delivered via MPA/TCP will require more processing
   effort than one delivered over SCTP. However, this extra work may be
   justified for many deployments where full SCTP support is
   unavailable in the endpoints of the network, or where middleboxes
   impair the usability of SCTP.

6.6. Data Integrity Implications

   Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c
   protection against data corruption, or its equivalent.

   A ULP that requires a greater degree of protection may add its own.
   However, DDP and RDMAP headers will only be guaranteed to have the
   equivalent of end-to-end CRC32c protection. A ULP that requires data
   integrity checking more thorough than an end-to-end CRC32c should
   first invalidate all STags that reference a buffer before applying
   its own integrity check.

6.6.1. MPA/TCP Specifics

   It is mandatory for MPA/TCP implementations to implement CRC32c, but
   it is NOT mandatory to use the CRC32c during an RDMA connection. The
   activating or deactivating of the CRC in MPA/TCP is an
   administrative configuration operation at the local and remote end.
   The administration of the CRC (ON/OFF) is invisible to the ULP.

   Applications SHOULD trust that this administrative option will only
   be used when the end-to-end protection is at least as effective as a
   transport layer CRC32c. Applications SHOULD NOT apply additional
   protection as a guard against this administrative option being
   turned on inadvertently.

   Administrators MUST NOT enable CRC32c suppression unless the
   end-to-end protection is truly equivalent.

   If the CRC is active for one direction or end, then the use of the
   CRC is mandatory in both directions and at both ends.
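
   The rule above can be illustrated with a minimal sketch. It is not
   taken from the MPA specification: the function names crc32c and
   crc_in_use are hypothetical, the bitwise loop is a simple reference
   form of CRC32c (Castagnoli) rather than an optimized one, and the
   two boolean arguments merely stand in for each end's administrative
   setting.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # standard final inversion


def crc_in_use(local_requests_crc: bool, remote_requests_crc: bool) -> bool:
    """Hypothetical helper: the CRC is used in both directions if either
    end is configured to use it, and suppressed only if both ends agree."""
    return local_requests_crc or remote_requests_crc


if __name__ == "__main__":
    # "123456789" is the conventional check string for CRC algorithms.
    print(hex(crc32c(b"123456789")))
    print(crc_in_use(local_requests_crc=True, remote_requests_crc=False))
```

   The logical OR makes the asymmetry explicit: a single end that wants
   the CRC is enough to force it on, while suppression requires both
   ends to be configured for it.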
   If both ends have been configured NOT to use the CRC, then this is
   allowed as long as equivalent protection (comparable to or better
   than the CRC) from undetected errors on the connection is provided.

6.6.2. SCTP Specifics

   SCTP provides CRC32c protection automatically. The adaptation to
   SCTP provides no option to suppress SCTP CRC32c protection.

6.7. Non-IP Transports

   DDP is defined to operate over ubiquitous IP transports such as SCTP
   and TCP. This enables a new DDP-enabled node to be added anywhere in
   an IP network. No DDP-specific support from middleboxes is required.

   There are non-IP transport fabrics offering RDMA capabilities.
   Because these capabilities are integrated with the transport
   protocol, they have some technical advantages when compared to RDMA
   over IP. For example, fencing of RDMA operations can be based upon
   transport level acks. Because DDP is cleanly layered over an IP
   transport, any explicit RDMA layer ack must be separate from the
   transport layer ack.

   There may be deployments where the benefits of RDMA/transport
   integration outweigh the benefits of being on an IP network.

6.7.1. No RDMA Layer Ack

   DDP does not provide for its own acknowledgements. The only form of
   ack provided at the RDMAP layer is an RDMA Read Response. DDP and
   RDMAP rely almost entirely upon other layers for flow control and
   pacing. The LLP is relied upon to guarantee delivery and avoid
   network congestion, and ULP level acking is relied upon for ULP
   pacing and to avoid ULP buffer overruns.

   Previous RDMA protocols, such as InfiniBand, have been able to use
   their integration with the transport layer to provide stronger
   ordering guarantees. It is important that application designers who
   require such guarantees provide them through ULP interaction.
   Specifically:

      There is no ability for a local interface to "fence" outbound
      messages to guarantee that prior tagged messages have been placed
      prior to sending a tagged message. The only guarantees available
      from the other side would be an RDMA Read Response (coming from
      the RDMAP layer) or a response from the ULP layer. Remember that
      the normal ordering rules only guarantee when the Data Sink ULP
      will be notified of untagged messages; they do not control when
      data is placed into receive buffers.

      Re-use of tagged buffers must be done with extreme care. The fact
      that an untagged message indicates that all prior tagged messages
      have been placed does not guarantee that no later tagged message
      has been placed. The best strategy is to only change the state of
      any given advertised buffer with untagged messages.

      As covered elsewhere in this document, flow control of untagged
      messages MUST be provided by the ULP itself.

6.8. Other IP Transports

   Both TCP and SCTP provide DDP with reliable transport with
   TCP-friendly rate control. As currently specified, DDP is defined to
   work over reliable transports and implicitly relies upon some form
   of rate control.

   DDP is fully compatible with a non-reliable protocol. Out-of-order
   placement is obviously not dependent on whether the other DDP
   Segments ever actually arrive.

   However, RDMAP requires the LLP to provide reliable service. An
   alternate completion handling protocol would be required if DDP were
   to be deployed over an unreliable IP transport.

   As noted in the prior section on tagged buffers as ULP credits,
   neither RDMAP nor DDP provides any flow control for tagged messages.
   If no transport layer flow control is provided, an RDMAP/DDP
   application would be limited only by the link layer rate, almost
   inevitably resulting in severe network congestion.
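
   Since neither RDMAP nor DDP paces the sender, any ULP-level pacing
   has to be built by the ULP itself. One possible approach, shown here
   only as a sketch, is a credit counter: the receiver grants one
   credit per posted receive buffer and the sender consumes one credit
   per untagged message. The class name UntaggedCreditSender and its
   methods are hypothetical and are not defined by RDMAP, DDP or this
   document; how credits are carried between the peers is left to the
   ULP.

```python
class UntaggedCreditSender:
    """Hypothetical ULP-side credit counter for untagged messages."""

    def __init__(self, initial_credits: int) -> None:
        if initial_credits < 0:
            raise ValueError("initial_credits must be non-negative")
        # Number of untagged messages the peer has agreed to accept.
        self.credits = initial_credits

    def try_send(self) -> bool:
        """Consume one credit; return False if the send has to wait."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def grant(self, count: int) -> None:
        """Record credits returned by the peer in a ULP-level message."""
        if count < 0:
            raise ValueError("count must be non-negative")
        self.credits += count


if __name__ == "__main__":
    sender = UntaggedCreditSender(initial_credits=2)
    print(sender.try_send(), sender.try_send(), sender.try_send())
    sender.grant(1)  # peer reposted one receive buffer
    print(sender.try_send())
```

   A sender that never exceeds its credits cannot overrun the peer's
   posted receive buffers. This bounds ULP buffer use only; it is not a
   substitute for the congestion control supplied by the LLP.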
   RDMAP encourages applications to be ignorant of the underlying
   transport PMTU. The ULP is only notified when all messages ending in
   a single untagged message have completed. The ULP is not aware of
   the granularity or ordering of the underlying messages. This
   approach assumes that the ULP is only interested in the complete set
   of messages, and has no use for a subset of them.

6.9. LLP Independent Session Establishment

   For an RDMAP/DDP application, a pair of SCTP Streams and a TCP
   connection both provide the same service (reliable delivery of DDP
   Segments between two connected RDMAP/DDP endpoints).

6.9.1. RDMA-only Session Establishment

   It is also possible to allow for transport neutral establishment of
   RDMAP/DDP sessions between endpoints. Combined, these two features
   would allow most applications to be unconcerned as to which LLP was
   actually in use.

   Specifically, the procedures for DDP Stream Session establishment
   discussed in section 3 of the SCTP mapping, and section 13.3 of the
   MPA/TCP mapping, both allow for the exchange of ULP specific data
   ("Private Data") before enabling the exchange of DDP Segments. This
   delay can allow for proper selection and/or configuration of the
   endpoints based upon the exchanged data. For example, each DDP
   Stream Session associated with a single client session might be
   assigned to the same DDP Protection Domain.

   To be transport neutral, the applications should exchange Private
   Data as part of session establishment messages to determine how the
   RDMA endpoints are to be configured. One side must be the Initiator,
   and the other the Responder.

   With SCTP, a pair of SCTP streams can be used for sequential
   sessions. With MPA/TCP, each connection can be used for at most one
   session.
   However, the same source/destination pair of ports can be re-used
   sequentially subject to normal TCP rules.

   Both SCTP and MPA limit the Private Data size to a maximum of 512
   bytes.

   MPA/TCP requires the end of the TCP connection that initiated the
   conversion to MPA mode to send the first DDP Segment. SCTP does not
   have this requirement. ULPs that wish to be transport neutral should
   require the initiating end to send the first message. A zero-length
   RDMA Write can be used for this purpose if the ULP logic itself does
   not naturally support this restriction.

6.9.2. RDMA-Conditional Session Establishment

   It is sometimes desirable for the active side of a session to
   connect with the passive side before knowing whether the passive
   side supports RDMA.

   This style of session establishment can be supported with either TCP
   or SCTP, but not as transparently as for RDMA-only sessions.
   Pre-existing non-RDMA servers are also far more likely to be using
   TCP than SCTP.

   With TCP, a normal TCP connection is established. It is then used by
   the ULP to determine whether or not to convert to MPA mode and use
   RDMA. This will typically be integral with other session
   establishment negotiations.

   With SCTP, the establishment of an association tests whether RDMA is
   supported. If not supported, the application simply requests the
   association without the RDMA adaptation indication.

   The key difference is that with SCTP the determination as to whether
   the peer can support RDMA is made before the transport layer
   association/connection is established, while with TCP the
   established connection itself is used to determine whether RDMA is
   supported.

7. Local Interface Implications

   Full utilization of DDP and RDMAP capabilities requires a local
   interface that explicitly requests these services.
   Protocols such as Sockets Direct Protocol (SDP) can allow
   applications to keep their traditional byte-stream or message-stream
   interface and still enjoy many of the benefits of the optimized wire
   level protocols.

8. IANA Considerations

   There are no IANA considerations in this document.

9. Security Considerations

9.1. Connection/Association Setup

   Both the SCTP and TCP adaptations allow for existing procedures to
   be followed for the establishment of the SCTP association or TCP
   connection. Use of DDP does not impair the use of any security
   measures to filter, validate and/or log the remote end of an
   association/connection.

9.2. Tagged Buffer Exposure

   DDP only exposes ULP memory to the extent explicitly allowed by ULP
   actions. These include posting of receive operations and enabling of
   Steering Tags.

   Neither RDMAP nor DDP places requirements on how ULPs advertise
   buffers. A ULP may use a single Steering Tag for multiple buffer
   advertisements. However, the ULP should be aware that enforcement on
   STag usage is likely limited to the overall range that is enabled.
   If the remote peer writes into the 'wrong' advertised buffer,
   neither the DDP nor the RDMAP layer will be aware of this. Nor is
   there any report to the ULP on how the remote peer specifically used
   tagged buffers.

   Unless the ULP peers have an adequate basis for mutual trust, the
   receiving ULP might be well advised to use a distinct STag for each
   interaction, and to invalidate it after each use or to require its
   peer to use the RDMAP option to invalidate the STag with its
   responding untagged message.

9.3. Impact of Encrypted Transports

   While DDP is cleanly layered over the LLP, its maximum benefit may
   be limited when the LLP Stream is secured with a streaming cipher,
   such as Transport Layer Security (TLS).
   If the LLP must decrypt in order, it cannot provide out-of-order DDP
   Segments to the DDP layer for placement purposes. IPsec tunnel mode
   encrypts entire IP datagrams. IPsec transport mode encrypts TCP
   segments or SCTP packets. In neither case should IPsec preclude
   providing out-of-order DDP Segments to the DDP layer for placement.

   Note that end-to-end use of IPsec cryptographic integrity protection
   may allow suppression of MPA CRC generation and checking under
   certain circumstances. This is one example where the LLP may be
   judged to have "or equivalent" protection to an end-to-end CRC32c.

10. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [2]  Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
        RFC 2246, January 1999.

   [3]  Kent, S. and R. Atkinson, "IP Encapsulating Security Payload
        (ESP)", RFC 2406, November 1998.

   [4]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V.
        Paxson, "Stream Control Transmission Protocol", RFC 2960,
        October 2000.

   [5]  Coene, L., "Stream Control Transmission Protocol Applicability
        Statement", RFC 3257, April 2002.

   [6]  Recio, R., "An RDMA Protocol Specification",
        draft-ietf-rddp-rdmap-05 (work in progress), July 2005.

   [7]  Shah, H., "Direct Data Placement over Reliable Transports",
        draft-ietf-rddp-ddp-05 (work in progress), July 2005.

   [8]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
        Remote Direct Memory Access (RDMA) Direct Data Placement (DDP)
        Adaptation", draft-ietf-rddp-sctp-02 (work in progress),
        August 2005.

   [9]  Culley, P., "Marker PDU Aligned Framing for TCP Specification",
        draft-ietf-rddp-mpa-02 (work in progress), February 2005.
Authors' Addresses

   Caitlin Bestler
   Broadcom
   49 Discovery
   Irvine, CA 92618
   USA

   Phone: 949-926-6383
   Email: caitlinb@broadcom.com


   Lode Coene
   Siemens
   Atealaan 26
   Herentals, 2200
   Belgium

   Phone: +32-14-252081
   Email: lode.coene@siemens.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.
Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.