Remote Direct Data Placement                                  C. Bestler
Working group                                                   L. Coene
Internet-Draft                                          October 13, 2003
Expires: April 12, 2004

   Applicability of Remote Direct Memory Access Protocol (RDMA) and
                     Direct Data Placement (DDP)
                 draft-ietf-rddp-applicability-01.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 12, 2004.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document describes the applicability of Remote Direct Memory
Access Protocol (RDMAP) and the Direct Data Placement Protocol
(DDP). It contrasts the different transport options over IP that DDP
can use, compares use of DDP with direct use of the supporting
transports, and compares DDP over IP transports with non-IP
transports that support RDMA functionality.

Table of Contents

1.     Introduction
2.     Definitions
3.     Direct Placement
3.1    Fewer Required ULP Interactions
3.2    Direct Placement using only the LLP
4.     Tagged Messages
4.1    Order Independent Reception
4.2    Reduced ULP Notifications
4.3    Simplified ULP Exchanges
4.4    Order Independent Sending
4.5    Tagged Buffers as ULP Credits
5.     RDMA Read
6.     LLP Comparisons
6.1    Multistreaming Implications
6.2    Out of Order Reception Implications
6.3    Header and Marker Overhead
6.4    Middlebox Support
6.5    Processing Overhead
6.6    Data Integrity Implications
6.6.1  MPA/TCP Specifics
6.6.2  SCTP Specifics
6.7    Non-IP Transports
6.7.1  No RDMA Layer Ack
6.8    Other IP Transports
6.9    LLP Independent Session Establishment
6.9.1  RDMA-only Session Establishment
6.9.2  RDMA-Conditional Session Establishment
7.     Local Interface Implications
8.     Security considerations
8.1    Connection/Association Setup
8.2    Tagged Buffer Exposure
8.3    Impact of Encrypted Transports
       References
       Authors' Addresses
       Full Copyright Statement

1. Introduction

Remote Direct Memory Access Protocol (RDMAP) and Direct Data
Placement (DDP) work together to provide application independent,
efficient placement of application payload directly into buffers
specified by the Upper Layer Protocol (ULP).

The DDP protocol is responsible for direct placement of received
payload into ULP specified buffers. The RDMAP protocol provides
completion notifications to the ULP and support for Data Sink
initiated fetch of advertised buffers (RDMA Reads).

DDP and RDMAP are both application independent protocols which allow
the ULP to perform remote direct data placement. DDP can use
multiple standard IP transports including SCTP and TCP.

By clarifying the situations where the functionality of these
protocols is applicable, this document can guide implementers and
application and protocol designers in selecting which protocols to
use.

The applicability of RDMAP/DDP is driven by their unique
capabilities:

o  The existence of an application independent protocol allows common
   solutions to be implemented in hardware and/or the kernel. This
   document will discuss when common data placement procedures are of
   the greatest benefit to applications as contrasted with
   application specific solutions built on top of direct use of the
   underlying transport.

o  DDP supports both untagged and tagged buffers. Tagged buffers
   allow the Data Sink ULP to be indifferent to what order (or in
   what packets) the Data Source sent the data, or what order they
   are received in. This document will discuss when Data Source
   flexibility is of benefit to applications.

o  RDMAP consolidates ULP notifications, thereby minimizing the
   number of required ULP interactions.

o  RDMAP defines RDMA Reads, which allow remote access to advertised
   buffers. This document will review the advantages of using RDMA
   Reads as contrasted to alternate solutions.

Some non-IP transports, such as InfiniBand, directly integrate RDMA
features. This document will review the applicability of providing
RDMA services over ubiquitous IP transports as opposed to the use of
customized transport protocols. Because DDP is defined cleanly as a
layer over existing IP transports, DDP has simpler ordering rules
than some prior RDMA protocols. This may have some implications for
application designers.

The capabilities of DDP and RDMAP can be fully realized only by
applications that are designed to exploit them. The co-existence of
RDMAP/DDP aware local interfaces with traditional socket interfaces
will also be explored.

Finally, DDP support is defined for at least two IP transports: SCTP
and TCP. The rationale for supporting both transports is reviewed,
as well as when each would be the appropriate selection.

2. Definitions

Advertisement - The act of informing a Remote Peer that a local RDMA
   Buffer is available to it. A Node makes available an RDMA Buffer
   for incoming RDMA Read or RDMA Write access by informing its
   RDMA/DDP peer of the Tagged Buffer identifiers (STag, base address,
   and buffer length). This advertisement of Tagged Buffer information
   is not defined by RDMA/DDP and is left to the ULP. A typical
   method would be for the Local Peer to embed the Tagged Buffer's
   Steering Tag, base address, and length in a Send Message destined
   for the Remote Peer.

Data Sink - The peer receiving a data payload. Note that the Data
   Sink can be required to both send and receive RDMA/DDP Messages to
   transfer a data payload.

Data Source - The peer sending a data payload. Note that the Data
   Source can be required to both send and receive RDMA/DDP Messages
   to transfer a data payload.

Lower Layer Protocol (LLP) - The transport protocol that provides
   services to DDP. This is an IP transport with any required
   adaptation layer. Adaptation layers are defined for SCTP and TCP.

Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
   valid as defined within a protocol specification.

Tagged Message - A DDP message that is directed to a ULP specified
   buffer based upon embedded addressing information. In the
   immediate sense, the destination buffer is specified by the
   message sender.

Untagged Message - A DDP message that is directed to a ULP specified
   buffer based upon a Message Sequence Number being matched with a
   receiver supplied buffer. The destination buffer is specified by
   the message receiver.

Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
   This may be an application, or a middleware layer such as Sockets
   Direct Protocol (SDP) or Remote Procedure Calls (RPC).

3. Direct Placement

Direct Data Placement optimizes the placement of ULP payload into the
correct destination buffers, typically eliminating intermediate
copying. Placement is enabled without regard to order of arrival or
order of transmission, and without requiring per-placement
interaction with the ULP.

RDMAP minimizes the required ULP interactions. This capability is
most valuable for applications that require multiple transport layer
packets for each required ULP interaction.

3.1 Fewer Required ULP Interactions

While reducing the number of required ULP interactions is in itself
desirable, it is critical for high speed connections. The burst
packet rate for a high speed interface could easily exceed the host
system's ability to switch ULP contexts.

Content access applications are primary examples of applications with
both high bandwidth and high ratios of content to required ULP
interactions. These applications include file access protocols (NAS),
storage access (SAN), database access and other application specific
forms of content access such as HTTP, XML and email.

3.2 Direct Placement using only the LLP

Direct data placement can be achieved without RDMA. Pre-posting of
receive buffers could allow a non-RDMA network stack to place data
directly to user buffers.

The degree to which DDP optimizes depends on which transport it is
compared with, and on the nature of the local interface. Without
RDMAP/DDP, pre-posting buffers requires the receiving side to
accurately predict the required buffers and their sizes. This is not
feasible for all ULPs. By contrast, DDP only requires the ULP to
predict the sequence and size of incoming untagged messages.

An application that could predict incoming messages and required
nothing more than direct placement into buffers might be able to do
so with a properly designed local interface to SCTP or TCP.
Doing so for TCP requires making predictions at a byte level rather
than a message level.

The main benefit of DDP for such an application would be that
pre-posting of receive buffers is a mandated local interface
capability, and that predictions can be made on a per-message basis
(not per byte).

The LLP can also be used directly if ULP specific knowledge is built
into the protocol stack to allow "parse and place" handling of
received packets. Such a solution requires either interaction with
the ULP, or that the protocol stack have knowledge of ULP specific
syntax rules.

DDP achieves the benefits of directly placing incoming payload
without requiring tight coupling between the ULP and the protocol
stack. However, "parse and place" capabilities can certainly provide
equivalent services to a limited number of ULPs.

4. Tagged Messages

This section covers the major benefits from the use of Tagged
Messages.

A more critical advantage of DDP is the ability of the Data Source to
use tagged buffers. Tagging messages allows the Data Source to
choose the ordering and packetization of its payload deliveries.
With direct data placement based solely upon pre-posted receives, the
packetization and delivery of payload must be agreed upon by the ULP
peers in advance. Even if there is an encoding of what is being
transferred, as is common with middleware solutions, this information
is not understood at the application independent layers. The
directions on where to place the incoming data cannot be accessed
without switching to the ULP first. DDP provides a standardized
'packing list' which can be interpreted without requiring ULP
interaction. Indeed, it is designed to be implementable in hardware.

4.1 Order Independent Reception

Tagged messages are directed to a buffer based on an included
Steering Tag.
Additionally, no notice is provided to the ULP for
each individual Tagged Message's arrival. Together these allow
tagged messages received out-of-order to be processed without
intermediate buffering or additional notifications to the ULP.

4.2 Reduced ULP Notifications

RDMAP further reduces required ULP interactions by consolidating
completion notifications of tagged messages with the completion
notification of a trailing untagged message. For most ULPs this
radically reduces the number of required ULP interactions even
further.

While RDMAP consolidation of notices is beneficial to most
applications, it may be detrimental to some applications that
benefit from streamed delivery to enable ULP processing of received
data as promptly as possible. A ULP that uses RDMAP cannot begin
processing any portion of an exchange until it receives notification
that the entire exchange has been placed. An "exchange" here is a
set of zero or more tagged messages and a single terminating untagged
message. An application that would prefer to begin work on the
received payload as soon as possible, no matter what order it arrived
in, might prefer to work directly with the LLP. RDMAP is optimized
for applications that are more concerned with when the entire
exchange is complete.

An application that benefits from being able to begin processing of
each received packet as quickly as possible may find RDMAP interferes
with that goal.

Such an application might be able to retain most of the benefits of
RDMAP by using the DDP layer directly. However, in addition to
taking on the responsibilities of the RDMAP layer, the application
would likely have more difficulty finding support for a DDP-only API.
Many hardware implementations may choose to tightly couple RDMAP and
DDP, and might not provide an API directly to DDP services.

These features minimize the required interactions with the ULP. This
can be extremely beneficial for applications that use multiple
transport layer packets to accomplish what is a single ULP
interaction.

4.3 Simplified ULP Exchanges

The notification rules for Tagged Messages allow ULPs to create
multi-message "exchanges" consisting of zero or more tagged messages
followed by a single untagged message, which together represent a
single step in the ULP interaction. The receiving ULP is notified
that the untagged message has arrived, and implicitly of any
associated tagged messages.

A ULP where all exchanges would naturally be only the untagged
message would derive virtually no benefit from the use of RDMAP/DDP
as opposed to SCTP. But while tagged buffers are the justification
for RDMAP/DDP, untagged buffers are still necessary. Without
untagged buffers the only method to exchange buffer advertisements
would involve out-of-band communications and/or sharing of compile
time constants. Most RDMA-aware ULPs use untagged buffers for
requests and responses. Buffer advertisements are typically done
within these untagged messages.

Limiting use of untagged buffers to requests and responses by moving
all bulk data using tagged transfers can greatly simplify the amount
of prediction that the Data Sink must perform in pre-posting receive
buffers. For example, a typical RDMA enabled interaction would
consist of the following:

   The Client sends a transaction request to the Server as an
   untagged message.

   This message includes buffer advertisements for the buffers where
   the results are to be placed.

   The Server sends multiple tagged messages to the advertised
   buffers.

   The Server sends a transaction reply as an untagged message to the
   Client.

   The Client receives a single notification, indicating completion
   of the interaction.

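The exchange listed above can be sketched in a few lines of Python.
This is an illustrative model only, not an RDMAP/DDP API: the class
DataSink, the function serve, and the STag value 0x10 are
hypothetical names chosen for the example, and the request message is
modeled as a plain function argument rather than an untagged message
on a wire.

```python
from dataclasses import dataclass, field


@dataclass
class DataSink:
    """Hypothetical client-side model: tagged buffers plus a notification queue."""
    tagged: dict = field(default_factory=dict)         # STag -> bytearray
    notifications: list = field(default_factory=list)  # untagged messages seen by the ULP

    def advertise(self, stag: int, length: int) -> tuple:
        # Register a tagged buffer and return its advertisement
        # (STag, base offset, length), which the ULP would carry in a request.
        self.tagged[stag] = bytearray(length)
        return (stag, 0, length)

    def place_tagged(self, stag: int, offset: int, payload: bytes) -> None:
        # Tagged placement: data lands in the advertised buffer in any
        # arrival order, and the ULP is not notified.
        buf = self.tagged[stag]
        if offset < 0 or offset + len(payload) > len(buf):
            raise ValueError("placement outside the advertised buffer")
        buf[offset:offset + len(payload)] = payload

    def deliver_untagged(self, message: bytes) -> None:
        # Untagged message: the only event that notifies the ULP.
        self.notifications.append(message)


def serve(sink: DataSink, advertisement: tuple, pieces: list) -> None:
    """Hypothetical server: tagged writes in any order, then one untagged reply."""
    stag, base, _length = advertisement
    for offset, payload in pieces:
        sink.place_tagged(stag, base + offset, payload)
    sink.deliver_untagged(b"reply: done")


client = DataSink()
adv = client.advertise(stag=0x10, length=12)
# The server supplies the pieces in whatever order suits it.
serve(client, adv, [(8, b"IJKL"), (0, b"ABCD"), (4, b"EFGH")])
```

In this model the client ends with a fully assembled buffer and
exactly one notification, regardless of the order in which the server
wrote the pieces.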
With this type of exchange the pacing and required size of untagged
buffers is highly predictable. The variability of response sizes is
absorbed by tagged transfers.

4.4 Order Independent Sending

Use of tagged messages is especially applicable when the Data Sink
does not know the actual size, structure or location of the content
it is requesting (or updating).

For example, suppose the Data Sink ULP needs to fetch four related
pieces of data into four separate buffers. With SCTP the Data Sink
ULP could receive four messages into four separate buffers, only
having to predict the maximum size of each. However, it would have to
dictate the order in which the Data Source supplied the separate
pieces. If the Data Source found it advantageous to fetch them in a
different order it would have to use intermediate buffering to
re-order the pieces into the expected order even though the
application only required that all four be delivered and did not
truly have an ordering requirement.

Techniques such as RAID striping and mirroring represent this same
problem, but one step further. What appears to be a single resource
to the Data Sink is actually stored in separate locations by the Data
Source. Non-RDMA protocols would either require the Data Source to
fetch the material in the desired order or force the Data Source to
use its own holding buffers to assemble an image of the destination
buffer.

While sometimes referred to as a "buffer-to-buffer" solution, RDMA
more fundamentally enables remote buffer access. The ULP is free to
work with larger remote buffers than it has locally. This reduces
buffering requirements and the number of times the data must be
copied in an end-to-end transfer.

There are numerous reasons why the Data Sink would not know the true
order or location of the requested data.
It could be different for
each client, different records selected and/or different sort orders,
RAID striping, file fragmentation, volume fragmentation, volume
mirroring and server-side dynamic compositing of content (such as
server side includes for HTTP).

In all of these cases the Data Source is free to assemble the desired
data in the Data Sink's buffer in whatever order the component data
becomes available to it. It is not constrained on ordering. It does
not have to assemble an image in its own memory before creating it in
the Data Sink's buffers.

Note that while DDP enables use of tagged messages for bulk transfer,
there are some application scenarios where untagged messages would
still be used for bulk transfer. For example, under the Direct
Access File Server (DAFS) protocol the file server does not expose
its own memory to its clients. A client wishing to write may
advertise a buffer which the server will issue RDMA Reads upon.
However, when performing a small write it may be preferable to
include the data in the untagged message rather than incurring an
additional round trip with the RDMA Read and its response.

4.5 Tagged Buffers as ULP Credits

The handling of end-to-end buffer credits with DDP differs
considerably from that when the ULP directly uses either TCP or SCTP.

With both TCP and SCTP, buffer credits are based upon the receiver
granting transmit permission based on the total number of bytes.
These credits reflect system buffering resources and/or simple flow
control. They do not represent ULP resources.

DDP defines no standard flow control, but presumes the existence of a
ULP mechanism. The presumed mechanism is that the Data Sink ULP has
issued credits to the Data Source allowing the Data Source to send a
specific number of untagged messages.

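The presumed ULP mechanism can be pictured as a message-count credit
gate at the Data Source. The following Python sketch is one possible
reading, not a mechanism defined by DDP: the class name
UntaggedCreditGate and its methods are hypothetical, and how credits
are actually conveyed between peers is left to the ULP.

```python
class UntaggedCreditGate:
    """Hypothetical Data Source view of ULP credits for untagged messages."""

    def __init__(self, initial_credits: int) -> None:
        # Each credit corresponds to one untagged receive buffer that
        # the Data Sink ULP has pre-posted.
        self.credits = initial_credits

    def grant(self, count: int) -> None:
        # The Data Sink ULP has posted more receive buffers and said so.
        self.credits += count

    def try_send(self, message: bytes) -> bool:
        # Credits count messages, not bytes: one credit per untagged message.
        if self.credits == 0:
            return False  # sending now could overrun the Data Sink's buffers
        self.credits -= 1
        return True


gate = UntaggedCreditGate(initial_credits=2)
results = [gate.try_send(b"request") for _ in range(3)]  # third send is refused
gate.grant(1)
after_grant = gate.try_send(b"request")
```

The contrast with TCP and SCTP is that the unit of credit here is a
whole untagged message rather than a number of bytes.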
The ULP peers must ensure that the sender is aware of the maximum
size that can be sent to any specific target buffer. One method of
doing so is to use a standard size for all untagged buffers within a
given connection. For example, DAFS specifies an initial size
requirement for session establishment, during which the untagged
buffer size for the remainder of the session is negotiated.

Tagged buffers are ULP resources advertised directly from ULP to ULP.
A DDP put to a known tagged buffer is constrained only by transport
level flow control, not by available system buffering.

Both tagged and untagged buffers allow bypassing of system buffer
resources. Use of tagged buffers additionally allows the Data Source
to choose what order to exercise the credits in.

To the extent allowed by the ULP, tagged buffers are also divisible
resources. The Data Sink can advertise a single 100 KB buffer, and
then receive notifications from its peer that it has written 50 KB,
20 KB and 30 KB to that buffer in three successive transactions.

ULP management of tagged buffer resources, independent of transport
and DDP layer credits, is an additional benefit of RDMA protocols.
Large bulk transfers cannot be blocked by limited general purpose
buffering capacity. Applications can flow control based upon higher
level abstractions, such as number of outstanding requests,
independent of the amount of data that must be transferred.

However, use of system buffering, as offered by direct use of the
underlying transports, can be preferable under certain circumstances.

One example would be when the number of target ULP buffers is
sufficiently large, and the rate at which any writes arrive is
sufficiently low, that pinning all the target ULP buffers in memory
would be undesirable.
The maximum transfer rate, and hence the
maximum amount of system buffering required, may be more stable and
predictable than the total ULP buffer exposure.

Another would be when the Data Sink wishes to receive a stream of
data at a predictable rate, but does not know in advance what the
size of each data packet will be. This is common for streaming media
that has been encoded with a variable bit rate. With DDP the Data
Sink would either have to use untagged buffers large enough for the
largest packet, or advertise a circular buffer. If for security or
other reasons the Data Sink did not want the size of its buffer to be
publicly known, using the underlying SCTP transport directly may be
preferable because of its byte-oriented credits.

5. RDMA Read

RDMA Reads are a further service provided by RDMAP. RDMA Reads allow
the Data Sink to fetch exactly the portion of the peer ULP buffer
required on a "just in time" basis. This can be done without
requiring per-fetch support from the Data Source ULP.

Storage servers may wish to limit the maximum write buffer allocated
to any single session. The storage server may be a very minimal
layer between the client and the disk storage media, or the server
may merely wish to limit the total resources that would be required
if all clients could push the entire payload they wished written at
their own convenience.

In either case, there is little benefit in transferring data from the
Data Source far in advance of when it will be written to the
persistent storage media. RDMA Reads allow the Storage Server to
fetch the payload on a "just in time" basis.
In this fashion a
relatively small number of block sized buffers can be used to execute
a single transaction that specified writing a large file, or a
Storage Server with numerous clients can fetch buffers from the
individual clients in the order that is most convenient to the
server.

This same capability can be used when the desired portion of the
advertised buffer is not known in advance. For example, the
advertised buffer could contain performance statistics. The Data
Sink could request the portions of the data it required, without
requiring an interaction with the Data Source ULP.

This is applicable for many applications that publish semi-volatile
data that does not require transactional validity checking (i.e.,
authorized users have read access to the entire set of data). It is
less applicable when there are ULP consistency checks that must be
performed upon the data. Such applications would be better served by
having the client send a request, and having the server use RDMA
Writes to publish the requested data. Neither RDMAP nor DDP provides
mechanisms for bundling multiple disjoint updates into an atomic
operation. Therefore use of an advertised buffer as a data resource
is subject to the same caveats as any randomly updated data resource,
such as flat files, that does not enforce its own consistency.

6. LLP Comparisons

Normally the choice of underlying IP transport is irrelevant to the
ULP. RDMAP and DDP provide the same services over either. There
may be performance impacts of the choice, however. It is the
responsibility of the ULP to determine which IP transport is best
suited to its needs.

SCTP provides for preservation of message boundaries. Each DDP
segment will be delivered within a single SCTP packet. The
equivalent services are only available with TCP through the use of
the MPA adaptation layer.

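To see why a byte-stream transport needs an adaptation layer to
recover message boundaries, consider the following Python sketch. It
uses a plain two-byte length prefix, which is an assumption made for
illustration and is NOT the MPA wire format; MPA additionally deals
with markers and a CRC, as discussed later in this section. The
function names frame and deframe are hypothetical.

```python
import struct


def frame(message: bytes) -> bytes:
    # Prefix each message with a 16-bit big-endian length so that the
    # receiver can find its end inside an undifferentiated byte stream.
    return struct.pack("!H", len(message)) + message


def deframe(stream: bytes) -> list:
    # Recover every complete message from the bytes received so far.
    messages = []
    pos = 0
    while pos + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, pos)
        if pos + 2 + length > len(stream):
            break  # the tail of this message has not arrived yet
        messages.append(stream[pos + 2:pos + 2 + length])
        pos += 2 + length
    return messages


# A byte stream may be delivered in arbitrary pieces; the boundaries
# of the pieces carry no meaning, only the length prefixes do.
stream = frame(b"seg-1") + frame(b"segment-2")
pieces = [stream[:3], stream[3:9], stream[9:]]
recovered = deframe(b"".join(pieces))
```

With a message-oriented transport such as SCTP, this kind of framing
is unnecessary because the transport itself delivers each message
whole.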
6.1 Multistreaming Implications

SCTP also provides multi-streaming. When the same pair of hosts
needs multiple DDP streams this can be a major advantage. A
single SCTP association carries multiple DDP streams, consolidating
connection setup, congestion control and acknowledgements.

Completions are controlled by the DDP Source Sequence Number
(DDP-SSN) on a per stream basis. Therefore combining multiple DDP
Streams into a single SCTP association cannot result in a dropped
packet carrying data for one stream delaying completions on others.

6.2 Out of Order Reception Implications

The use of unordered Data Chunks with SCTP guarantees that the DDP
layer will be able to perform placements when IP datagrams are
received out of order.

Placement of out-of-order DDP Segments carried over MPA/TCP is not
guaranteed, but certainly allowed. The ability of the MPA receiver
to process out-of-order DDP Segments may be impaired when TCP
alignment is lost. Using SCTP, each DDP Segment is encoded in a
single Data Chunk and never spread over multiple IP datagrams.

6.3 Header and Marker Overhead

MPA and TCP headers together are smaller than the headers used by
SCTP and its adaptation layer. However, this advantage can be
considerably reduced by the insertion of MPA markers. In any event
the difference in ULP payload per IP Datagram is not likely to be a
significant factor.

6.4 Middlebox Support

Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP
will appear to all network middleboxes as a normal TCP connection.
In many environments there may be a requirement to use only TCP
connections to satisfy existing network elements and/or to facilitate
monitoring and control of connections. While SCTP is certainly just
   While SCTP is certainly just as monitorable and controllable as TCP,
   there is no guarantee that the network management infrastructure has
   the required support for both.

6.5 Processing Overhead

   A DDP Stream delivered via MPA/TCP will require more processing
   effort than one delivered over SCTP. However, this extra work may
   be justified for many deployments where full SCTP support is
   unavailable in the intermediate network.

6.6 Data Integrity Implications

   Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c
   protection against data corruption, or its equivalent.

   A ULP that requires a greater degree of protection may add its own.
   However, DDP and RDMAP headers will only be guaranteed to have the
   equivalent of end-to-end CRC32c protection. A ULP that requires
   data integrity checking more thorough than an end-to-end CRC32c
   should first invalidate all STags that reference a buffer before
   applying its own integrity check.

6.6.1 MPA/TCP Specifics

   It is mandatory for MPA/TCP implementations to implement CRC32c, but
   it is NOT mandatory to use the CRC32c during an RDMA connection.
   Activating or deactivating the CRC in MPA/TCP is an administrative
   configuration operation at the local and remote ends. The
   administration of the CRC (ON/OFF) is invisible to the ULP.

   Applications SHOULD trust that CRC suppression will only be
   configured when the end-to-end protection is at least as effective
   as a transport layer CRC32c. Applications SHOULD NOT apply
   additional protection as a guard against CRC suppression being
   enabled inadvertently.

   Administrators MUST NOT enable CRC32c suppression unless the
   end-to-end protection is truly equivalent.

   If the CRC is active for one direction/end, then the use of the CRC
   is mandatory in both directions/ends.
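
   The rule in the preceding paragraph reduces to a small amount of
   logic. The following is a minimal sketch and is not taken from the
   MPA specification: crc32c() is a plain bitwise implementation of
   the CRC32c (Castagnoli) checksum, and crc_in_use() is a
   hypothetical helper expressing that a CRC activated at either end
   is used in both directions.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc_in_use(local_wants_crc: bool, remote_wants_crc: bool) -> bool:
    """Hypothetical helper: the CRC is used if either end activates it."""
    return local_wants_crc or remote_wants_crc
```

   A production implementation would normally use a table-driven or
   hardware-assisted CRC32c rather than this bit-at-a-time loop.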

   If both ends have been configured NOT to use the CRC, this is
   allowed as long as equivalent protection (comparable to or better
   than the CRC) from undetected errors on the connection is provided.

6.6.2 SCTP Specifics

   SCTP provides CRC32c protection automatically. The adaptation to
   SCTP provides for no option to suppress SCTP CRC32c protection.

6.7 Non-IP Transports

   DDP is defined to operate over ubiquitous IP transports such as SCTP
   and TCP. This enables a new DDP-enabled node to be added anywhere
   in an IP network. No DDP-specific support from middleboxes is
   required.

   There are non-IP transport fabrics offering RDMA capabilities.
   Because these capabilities are integrated with the transport
   protocol, they have some technical advantages when compared to RDMA
   over IP. For example, fencing of RDMA operations can be based upon
   transport level acks. Because DDP is cleanly layered over an IP
   transport, any explicit RDMA layer ack must be separate from the
   transport layer ack.

   There may be deployments where the benefits of RDMA/transport
   integration outweigh the benefits of being on an IP network.

6.7.1 No RDMA Layer Ack

   DDP does not provide for its own acknowledgements. The only form of
   ack provided at the RDMAP layer is an RDMA Read Response. DDP and
   RDMAP rely almost entirely upon other layers for flow control and
   pacing. The LLP is relied upon to guarantee delivery and avoid
   network congestion, and ULP level acking is relied upon for ULP
   pacing and to avoid ULP buffer overruns.

   Previous RDMA protocols, such as InfiniBand, have been able to use
   their integration with the transport layer to provide stronger
   ordering guarantees. It is important that application designers who
   require such guarantees provide them through ULP interaction.

   Specifically:

      There is no ability for a local interface to "fence" outbound
      messages to guarantee that prior tagged messages have been
      placed before a tagged message is sent. The only guarantees
      available from the other side would be an RDMA Read Response
      (coming from the RDMAP layer) or a response from the ULP layer.
      Remember that the normal ordering rules only guarantee when the
      Data Sink ULP will be notified of untagged messages; they do not
      control when data is placed into receive buffers.

      Re-use of tagged buffers must be done with extreme care. The
      fact that an untagged message indicates that all prior tagged
      messages have been placed does not guarantee that no later
      tagged message has been placed. The best strategy is to change
      the state of any given advertised buffer only with untagged
      messages.

      As covered elsewhere in this document, flow control of untagged
      messages MUST be provided by the ULP itself.

6.8 Other IP Transports

   Both TCP and SCTP provide DDP with reliable transport with
   TCP-friendly rate control. As currently defined, DDP works over
   reliable transports and implicitly relies upon some form of rate
   control.

   DDP is fully compatible with a non-reliable protocol. Out-of-order
   placement is obviously not dependent on whether the other DDP
   Segments ever actually arrive.

   However, RDMAP requires the LLP to provide reliable service. An
   alternate completion handling protocol would be required if DDP were
   to be deployed over an unreliable IP transport.

   As noted in the prior section on tagged buffers as ULP credits,
   neither RDMAP nor DDP provides any flow control for tagged messages.
   If no transport layer flow control were provided, an RDMAP/DDP
   application would be limited only by the link layer rate, almost
   inevitably resulting in severe network congestion.
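
   Since neither RDMAP nor DDP paces the sender, the ULP-provided flow
   control of untagged messages called for in section 6.7.1 has to be
   built by the application. One common approach, sketched here under
   stated assumptions, is credit-based: the receiver grants one credit
   per posted receive buffer, and the sender never has more untagged
   messages outstanding than it holds credits. The class and method
   names are hypothetical and are not part of any RDMAP or DDP
   interface.

```python
from collections import deque


class UntaggedSender:
    """Hypothetical ULP-side sender that paces untagged messages by credits."""

    def __init__(self, initial_credits: int) -> None:
        self.credits = initial_credits
        self.pending = deque()  # messages waiting for a credit
        self.sent = []          # stands in for handing messages to RDMAP

    def send(self, msg: bytes) -> None:
        """Queue a message and transmit it as soon as a credit allows."""
        self.pending.append(msg)
        self._drain()

    def grant(self, n: int) -> None:
        """Called when a ULP-level message from the peer returns n credits."""
        self.credits += n
        self._drain()

    def _drain(self) -> None:
        # Each transmitted untagged message consumes one receive buffer
        # at the peer, and therefore one credit.
        while self.credits > 0 and self.pending:
            self.sent.append(self.pending.popleft())
            self.credits -= 1
```

   With two initial credits, a third message stays queued until the
   peer reports that it has posted another receive buffer.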

   RDMAP encourages applications to be ignorant of the underlying
   transport PMTU. The ULP is only notified when all messages ending
   in a single untagged message have completed. The ULP is not aware
   of the granularity or ordering of the underlying messages. This
   approach assumes that the ULP is only interested in the complete
   set of messages, and has no use for a subset of them.

6.9 LLP Independent Session Establishment

   For an RDMAP/DDP application, a pair of SCTP Streams and a TCP
   connection both provide the same service (reliable delivery of DDP
   Segments between two connected RDMAP/DDP endpoints).

6.9.1 RDMA-only Session Establishment

   It is also possible to allow for transport-neutral establishment of
   RDMAP/DDP sessions between endpoints. Combined, these two features
   would allow most applications to be unconcerned as to which LLP is
   actually in use.

   Specifically, the procedures for DDP Stream Session establishment
   discussed in section 3 of the SCTP mapping, and section 13.3 of the
   MPA/TCP mapping, both allow for the exchange of ULP specific data
   ("Private Data") before enabling the exchange of DDP Segments. This
   delay can allow for proper selection and/or configuration of the
   endpoints based upon the exchanged data. For example, each DDP
   Stream Session associated with a single client session might be
   assigned to the same DDP Protection Domain.

   To be transport neutral, the applications should exchange Private
   Data as part of session establishment messages to determine how the
   RDMA endpoints are to be configured. One side must be the
   Initiator, and the other the Responder.

   With SCTP, a pair of SCTP streams can be used for sequential
   sessions. With MPA/TCP, each connection can be used for at most one
   session.
   However, the same source/destination pair of ports can be re-used
   sequentially subject to normal TCP rules.

6.9.2 RDMA-Conditional Session Establishment

   It is sometimes desirable for the active side of a session to
   connect with the passive side before knowing whether the passive
   side supports RDMA.

   This style of session establishment can be supported with either TCP
   or SCTP, but not as transparently as for RDMA-only sessions.
   Pre-existing non-RDMA servers are also far more likely to be using
   TCP than SCTP.

   With TCP, a normal TCP connection is established. It is then used
   by the ULP to determine whether or not to convert to MPA mode and
   use RDMA. This will typically be integral with other session
   establishment negotiations.

   With SCTP, the establishment of an association tests whether RDMA is
   supported. If not supported, the application simply requests the
   association without the RDMA adaptation indication.

   The key difference is that with SCTP the determination as to whether
   the peer can support RDMA is made before the transport layer
   association/connection is established, while with TCP the
   established connection itself is used to determine whether RDMA is
   supported.

7. Local Interface Implications

   Full utilization of DDP and RDMAP capabilities requires a local
   interface that explicitly requests these services. Protocols such
   as Sockets Direct Protocol (SDP) can allow applications to keep
   their traditional byte-stream or message-stream interface and still
   enjoy many of the benefits of the optimized wire level protocols.

8. Security Considerations

8.1 Connection/Association Setup

   Both the SCTP and TCP adaptations allow for existing procedures to
   be followed for the establishment of the SCTP association or TCP
   connection.
   Use of DDP does not impair the use of any security measures to
   filter, validate and/or log the remote end of an
   association/connection.

8.2 Tagged Buffer Exposure

   DDP only exposes ULP memory to the extent explicitly allowed by ULP
   actions. These include posting of receive operations and enabling
   of Steering Tags.

   Neither RDMAP nor DDP places requirements on how ULPs advertise
   buffers. A ULP may use a single Steering Tag for multiple buffer
   advertisements. However, the ULP should be aware that enforcement
   of STag usage is likely limited to the overall range that is
   enabled. If the remote peer writes into the 'wrong' advertised
   buffer, neither the DDP nor the RDMAP layer will be aware of this.
   Nor is there any report to the ULP on how the remote peer
   specifically used tagged buffers.

   Unless the ULP peers have an adequate basis for mutual trust, the
   receiving ULP might be well advised to use a distinct STag for each
   interaction, and to invalidate it after each use or to require its
   peer to use the RDMAP option to invalidate the STag with its
   responding untagged message.

8.3 Impact of Encrypted Transports

   While DDP is cleanly layered over the LLP, its maximum benefit may
   be limited when the LLP Stream is secured with a streaming cipher,
   such as Transport Layer Security (TLS). If the LLP must decrypt in
   order, it cannot provide out-of-order DDP Segments to the DDP layer
   for placement purposes. IPsec tunnel mode encrypts entire IP
   Datagrams. IPsec transport mode encrypts TCP Segments or SCTP
   packets. In neither case should IPsec preclude providing
   out-of-order DDP Segments to the DDP layer for placement.

   Note that end-to-end use of IPsec cryptographic integrity protection
   may allow suppression of MPA CRC generation and checking under
   certain circumstances.
   This is one example where the LLP may be judged to have "or
   equivalent" protection to an end-to-end CRC32c.

References

   [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

   [2] Dierks, T., Allen, C., Treese, W., Karlton, P., Freier, A. and
       P. Kocher, "The TLS Protocol Version 1.0", RFC 2246, January
       1999.

   [3] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload
       (ESP)", RFC 2406, November 1998.

   [4] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
       H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
       "Stream Control Transmission Protocol", RFC 2960, October 2000.

   [5] Coene, L., "Stream Control Transmission Protocol Applicability
       Statement", RFC 3257, April 2002.

   [6] Recio, R., "An RDMA Protocol Specification",
       draft-ietf-rddp-rdmap-00 (work in progress), February 2003.

   [7] Shah, H., "Direct Data Placement over Reliable Transports",
       draft-ietf-rddp-ddp-00 (work in progress), February 2003.

   [8] Stewart, R., "Stream Control Transmission Protocol (SCTP) Remote
       Direct Memory Access (RDMA) Direct Data Placement (DDP)
       Adaptation", draft-ietf-rddp-sctp-00 (work in progress),
       September 2003.

   [9] Culley, P., "Marker PDU Aligned Framing for TCP Specification",
       draft-ietf-rddp-mpa-00 (work in progress), October 2003.

Authors' Addresses

   Caitlin Bestler
   1241 W. North Shore
   # 2G
   Chicago, IL 60626
   USA

   Phone: +1-773-743-1594
   EMail: cait@asomi.com

   Lode Coene
   Atealaan 26
   Herentals, 2200
   Belgium

   Phone: +32-14-252081
   EMail: lode.coene@siemens.com

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.